Efficient evaluation of task oriented dialogue systems
2020
Smart voice assistants have gained considerable popularity in recent years, and people now rely on them to accomplish a variety of daily tasks. To provide high-quality service and ensure a satisfactory user experience, it is crucial to continuously measure and monitor how the assistant performs. One metric for this purpose is goal success rate (GSR), which measures how often the assistant successfully fulfills a user’s goal. To generate annotations for GSR calculation, human labelers examine randomly sampled user goals, where a goal consists of consecutive utterances in which the user attempts to direct the assistant to accomplish a particular task. A key challenge is identifying all the relevant contextual utterances that make up a goal. As we will demonstrate, an existing rule-based solution wastes substantial labeling effort and introduces potential bias into the GSR metric. Inspired by related work in question answering, we propose a BERT-based span prediction model to optimize the identification of relevant contextual utterances. Our experiments show that the proposed model consistently reduces labeling waste by 5%-10%.
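As an illustration of the metric described above, GSR can be read as the fraction of labeled goals that annotators marked as successfully fulfilled. The following is a minimal sketch under that reading only; the `LabeledGoal` type, its fields, and the `goal_success_rate` function are hypothetical names introduced here, not part of the paper's pipeline.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LabeledGoal:
    """Hypothetical record for one sampled goal after human annotation."""
    utterances: List[str]  # consecutive utterances that make up the goal
    success: bool          # annotator judgment: was the user's goal fulfilled?


def goal_success_rate(goals: List[LabeledGoal]) -> float:
    """Return the fraction of labeled goals marked successful.

    Assumes every goal carries equal weight; the paper may weight or
    sample goals differently.
    """
    if not goals:
        raise ValueError("GSR is undefined for an empty set of goals")
    return sum(goal.success for goal in goals) / len(goals)


# Illustrative, made-up data: three sampled goals, two marked successful.
sample = [
    LabeledGoal(["play jazz", "louder"], success=True),
    LabeledGoal(["set a timer", "no, ten minutes", "cancel"], success=False),
    LabeledGoal(["what's the weather"], success=True),
]
gsr = goal_success_rate(sample)
```

Under this reading, a goal whose utterance list omits relevant context, or includes unrelated utterances, can be mislabeled or discarded, which is how poor goal segmentation would lead to the labeling waste and metric bias mentioned above.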