Measuring service-level learning effects in search via query-randomized experiments

Paul Musgrave; Cuize Han; Parth Gupta

Publication

Measuring service-level learning effects in search via query-randomized experiments

By Paul Musgrave, Cuize Han, Parth Gupta

2023

Download Copy BibTeX

Share

Download

Copy BibTeX

Share

In order to determine the relevance of a given item to a query, most modern search ranking systems make use of features which aggregate prior user behavior for that item and query (e.g. click rate). For practical reasons, when running A/B tests on ranking systems, these features are generally shared between all treatments. For the most common experiment designs, which randomize traffic by user or by session, this creates a pathway by which the behavior of units in one treatment can effect the outcomes for units in other treatments, violating the Stable Unit Treatment Value Assumption (SUTVA) and biasing measured outcomes. Moreover, for experi-ments targeting improvements to the behavior data available to such features (e.g. online exploration), this pathway is precisely the one we are trying to affect; if such changes occur identically in treatment and control, then they cannot be measured. To address this, we propose the use of experiments which instead randomize traffic based on the search query. To validate our approach, we perform a pair of A/B tests on an explore-exploit framework in the Amazon search page: one under query randomization, and one under user randomization. In line with the theoretical predictions, we find that the query-randomized iteration is able to measure a statistically significant effect (+0.66% Purchases, p=0.001) where the user-randomized iteration does not (-0.02% Purchases, p=0.851).

Measuring service-level learning effects in search via query-randomized experiments

Latest news

Work with us