RoBERTaIQ: An efficient framework for automatic interaction quality estimation of dialogue systems
Automatically evaluating the response quality of large-scale dialogue systems is a challenging task in dialogue research. Existing automated turn-level approaches train supervised models on Interaction Quality (IQ) labels or annotations provided by experts, which is costly and time-consuming. Moreover, the small quantity of annotated data limits the trained model's ability to generalize to long-tail and out-of-domain cases. In this paper, we propose a learning framework that improves the model's generalizability by leveraging the various unsupervised data sources available in large-scale conversational AI systems. We mainly rely on the following three techniques to improve the performance of dialogue evaluation models. First, we propose extending the RoBERTa model to encode multi-turn dialogues, capturing the temporal differences between turns. Second, we add two additional pretraining stages on top of the enhanced multi-turn RoBERTa to take advantage of the large quantity of existing historical dialogue data through self-supervised training. Third, we perform fine-tuning on IQ labels in a multi-task learning setup, leveraging domain-specific information from other tasks. We show that the above techniques significantly reduce annotated-data requirements: we achieve the same F1 score on the IQ prediction task as our baseline with only 5% of the IQ training data, and we beat the baseline by 5.4% absolute F1 score when using all of the training data.
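One way the multi-turn encoding could be realized is by serializing a dialogue's turns, oldest first, into a single sequence with separator tokens before passing it to a RoBERTa-style encoder. The sketch below illustrates this idea only; the function name, the speaker-prefix format, and the choice of `</s>` as separator are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch of multi-turn dialogue serialization for a RoBERTa-style
# encoder. The helper name, speaker prefixes, and "</s>" separator are
# illustrative assumptions, not the paper's implementation.

def serialize_dialogue(turns, sep_token="</s>"):
    """Join (speaker, utterance) turns into one sequence, oldest first,
    so the encoder can model the temporal order of turns."""
    parts = [f"{speaker}: {utterance}" for speaker, utterance in turns]
    # RoBERTa uses "</s>" both as its end-of-sequence and separator token.
    return f" {sep_token} ".join(parts)

dialogue = [
    ("user", "play some jazz"),
    ("system", "playing jazz on your speaker"),
    ("user", "stop"),
]
print(serialize_dialogue(dialogue))
```

The serialized string would then be tokenized and encoded as a single input, letting self-attention relate utterances across turn boundaries.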