De-biasing training data distribution using targeted data enrichment techniques
In this paper, we introduce a targeted data enrichment framework to mitigate the problem of biased training data distributions. In real-world applications, the training data distribution often differs from live online traffic for several reasons, such as topic changes, seasonality, and the nature of the users. Our targeted data augmentation techniques generate samples that are most similar to those missing from the training data. The main idea behind the selection strategy is to fill in samples that are not yet well represented in the training data. Our framework consists of a semi-supervised learning (SSL) component and a synthetic data generation component. For SSL, we use a retrieval module guided by weights learned from a data drift model. We further address the problem of accumulated errors in SSL by introducing a low-confidence SSL data selection strategy. For synthetic data augmentation, we generate data with a masked language model, using a notion of word replaceability to produce meaningful samples. We report results on two large commercial datasets from real-world applications and show that our framework improves error rates in almost all domains, by up to 4.6% on average. We also report results for the data augmentation techniques on two public datasets, where we see improvements in both cases.
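As a rough illustration of how drift-guided retrieval and low-confidence selection might interact, the following minimal Python sketch filters unlabeled candidates by model confidence and ranks the remainder by a drift score. It is not the implementation used in this work: the `Candidate` fields, the `select_candidates` function, the `max_confidence` threshold, the `budget` parameter, and the example utterances and scores are all hypothetical, and the abstract does not specify how the two signals are actually combined.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    # Assumption: score in [0, 1] from a drift model, higher means the sample
    # looks more like live traffic that is underrepresented in training data.
    drift_score: float
    # Assumption: top-class probability of the current classifier.
    confidence: float


def select_candidates(candidates, max_confidence=0.6, budget=2):
    """Keep low-confidence candidates, then rank them by drift score.

    Both the threshold and the ranking rule are illustrative choices.
    """
    low_conf = [c for c in candidates if c.confidence <= max_confidence]
    ranked = sorted(low_conf, key=lambda c: c.drift_score, reverse=True)
    return ranked[:budget]


# Hypothetical pool of unlabeled utterances with made-up scores.
pool = [
    Candidate("play the new holiday playlist", 0.91, 0.42),
    Candidate("what is the weather today", 0.12, 0.97),
    Candidate("order a pumpkin spice latte", 0.78, 0.55),
    Candidate("set a timer for ten minutes", 0.08, 0.99),
    Candidate("track my black friday order", 0.85, 0.88),
]

selected = select_candidates(pool)
for c in selected:
    print(c.text, c.drift_score, c.confidence)
```

In this sketch, a sample with a high drift score but high model confidence (the last utterance in the pool) is skipped, on the assumption that the current model already handles it well; only samples that are both underrepresented and uncertain are passed on for enrichment.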