Online adaptive metrics for model evaluation on non-representative offline test data
A major challenge in the offline evaluation of machine learning models before release to production is the discrepancy between the distribution of the offline test data and that of the online data, caused by, e.g., biased sampling schemes, data aging, and regime shifts. As a result, offline evaluation metrics often fail to reflect the model's actual online performance. In this paper, we propose online adaptive metrics, a computationally efficient method that re-weights the offline metrics by comparing the joint distributions of the model hypothesis over the offline test data and over the online data. The resulting offline metrics estimate the model's production performance while accounting for biases in the test data. We demonstrate the proposed method with real-life examples on text classification and on a commercial natural language understanding system, and show that online adaptive metrics can accurately predict online recall and precision even with a small test dataset.
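To make the re-weighting idea concrete, the following is a minimal sketch, not the paper's exact formulation: each offline test example is weighted by the ratio of the online to offline frequency of the class the model predicts for it (its "hypothesis"), and precision/recall are then computed from the weighted counts. The function name, arguments, and the simple per-class weighting scheme are illustrative assumptions.

```python
from collections import Counter

def adapted_metrics(offline_preds, offline_labels, online_preds, target_class):
    """Estimate online precision/recall for target_class from a biased
    offline test set, using only unlabeled online predictions."""
    # Empirical distributions of the model's predictions (its hypothesis)
    # on the offline test set and on the online traffic.
    p_off = Counter(offline_preds)
    p_on = Counter(online_preds)
    n_off, n_on = len(offline_preds), len(online_preds)

    # Importance weight: how over/under-represented an offline example's
    # predicted class is relative to the online distribution.
    def weight(pred):
        return (p_on[pred] / n_on) / (p_off[pred] / n_off)

    tp = fp = fn = 0.0
    for pred, label in zip(offline_preds, offline_labels):
        w = weight(pred)
        if pred == target_class and label == target_class:
            tp += w
        elif pred == target_class:
            fp += w
        elif label == target_class:
            fn += w

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

For example, if class "a" is under-sampled offline relative to online traffic, examples predicted as "a" receive weights above 1, pulling the weighted recall and precision toward their online values without requiring any new labeled data.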