How well do offline metrics predict online performance of product ranking models?
Online evaluation techniques are widely adopted by industrial search engines to determine which ranking models perform better under a certain business metric. However, online evaluation can only evaluate a small number of rankers and people resort to offline evaluation to select rankers that are likely to yield good online performance. To use offline metrics for effective model selection, a major challenge is to understand how well offline metrics predict which ranking models perform better in online experiments. This paper aims to address this challenge in product search ranking. Towards this end, we collect gold data in the form of preferences over ranker pairs under a business metric in e-commerce search engine. For the first time, we use such gold data to evaluate offline metrics in terms of directional agreement with the business metric. Furthermore, we analyze offline metrics in terms of discriminative power through paired sample t-test and rank correlations among offline metrics. Through extensive online and offline experiments, we studied 36 offline metrics and observed that: (1) Offline metrics align well with online metrics: they agree on which one of two ranking models is better up to 97% of times; (2) Offline metrics are highly discriminative on large-scale search ranking data, especially NDCG (Normalized Discounted Cumulative Gain) which has a discriminative power over 99%.