Interleaved online testing in large-scale systems
2023
Online testing is indispensable in decision making for information retrieval systems. Interleaving has emerged as an online testing method with orders of magnitude higher sensitivity than the prevailing A/B testing. It merges the compared results into a single interleaved result to show to users, and attributes user actions back to the systems being tested. However, its pairwise design also brings practical challenges to real-world systems, namely comparing multiple (more than two) systems effectively and interpreting the magnitude of raw interleaving measurements. We present two novel methods that address these challenges and make interleaving practically applicable. The first method infers the ordering of multiple systems from pairwise interleaving results with false discovery control. The second method estimates A/B effect size from interleaving results using a weighted linear model that adjusts for the uncertainties of different measurements. We showcase the effectiveness of our methods in large-scale e-commerce experiments, reporting as many as 75 interleaving results, and provide extensive evaluations of their underlying assumptions.
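As a rough illustration of the second method's idea, the sketch below fits a weighted linear model that maps interleaving measurements to A/B effect sizes. The abstract does not specify the weighting scheme or the model form, so the inverse-variance weights, the single predictor with an intercept, and the function name `weighted_linear_fit` are all assumptions made here for illustration, not the authors' implementation.

```python
from typing import Sequence, Tuple


def weighted_linear_fit(
    x: Sequence[float],
    y: Sequence[float],
    y_var: Sequence[float],
) -> Tuple[float, float]:
    """Fit y ~ intercept + slope * x by weighted least squares.

    x     : interleaving measurements, one per experiment (assumed input)
    y     : A/B effect sizes measured for the same experiments (assumed input)
    y_var : variance of each A/B measurement; weights are 1 / variance
            (inverse-variance weighting is an assumption of this sketch)
    """
    if not (len(x) == len(y) == len(y_var)) or len(x) < 2:
        raise ValueError("need at least two points and equal-length inputs")
    if any(v <= 0 for v in y_var):
        raise ValueError("variances must be positive")

    # Noisier measurements receive smaller weights.
    w = [1.0 / v for v in y_var]
    total_w = sum(w)

    # Weighted means of predictor and response.
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total_w

    # Weighted sums of squares and cross-products around the means.
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))

    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope
```

Given a fitted `(intercept, slope)`, a new interleaving measurement `x_new` would be translated into an estimated A/B effect as `intercept + slope * x_new`. Weighting by inverse variance is one common way to keep imprecise experiments from dominating the fit.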