Measuring service-level learning effects in search via query-randomized experiments
In order to determine the relevance of a given item to a query, most modern search ranking systems make use of features which aggregate prior user behavior for that item and query (e.g. click rate). For practical reasons, when running A/B tests on ranking systems, these features are generally shared between all treatments. For the most common experiment designs, which randomize traffic by user or by session, this creates a pathway by which the behavior of units in one treatment can effect the outcomes for units in other treatments, violating the Stable Unit Treatment Value Assumption (SUTVA) and biasing measured outcomes. Moreover, for experi-ments targeting improvements to the behavior data available to such features (e.g. online exploration), this pathway is precisely the one we are trying to affect; if such changes occur identically in treatment and control, then they cannot be measured. To address this, we propose the use of experiments which instead randomize traffic based on the search query. To validate our approach, we perform a pair of A/B tests on an explore-exploit framework in the Amazon search page: one under query randomization, and one under user randomization. In line with the theoretical predictions, we find that the query-randomized iteration is able to measure a statistically significant effect (+0.66% Purchases, p=0.001) where the user-randomized iteration does not (-0.02% Purchases, p=0.851).