Automatically assessing conversations with Alexa
Model for estimating customer satisfaction with interactions that span multiple domains improves on predecessors by 27%.
Dialogue models, like all deployed AI models, require regular evaluation to ensure that they’re meeting customers’ needs. But evaluating a conversational interaction is a challenge; historically, it’s required human judgment, which makes evaluation slow and costly.
Last week, at the Conference on Empirical Methods in Natural Language Processing (EMNLP), we presented a new neural-network-based model that attempts to estimate how customers would rate their satisfaction with dialogue interactions.
In tests involving three different groups of users across 28 domains (such as music, weather, and movie and restaurant booking), our model estimated customer satisfaction 27% more accurately than a prior neural-network-based model.
The new model was also 7% more accurate than an earlier model from our group. The earlier model took advantage of features specific to Alexa’s previous dialogue manager. The new model does not, which means that it should generalize to new dialogue managers (such as Alexa Conversations) or alternative approaches to dialogue management.
The intuitive way to train a dialogue assessment model is with sample dialogues labeled according to how satisfying they are. This has proved challenging, however: people frequently disagree in their overall assessments of the same interaction, and customer evaluations are noisy.
Instead, researchers typically use training data in which each dialogue turn is rated individually; there tends to be more agreement on turn-by-turn assessments. This is the approach we took in our previous work.
In our new work, however, we train a model jointly on turn-by-turn data and overall user assessments. We use an attention mechanism to weight the contributions of the turn-by-turn scores to the final score. Those weights are learned from the data and can generalize across multiple skills and tasks.
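To make the weighting idea concrete, here is a minimal sketch of attention-style pooling over turn-level scores. It is a simplification rather than the model from the paper: the names `attention_pool` and `attention_logits` are hypothetical, and the logits are passed in by hand where a trained network would produce them.

```python
import math

def softmax(logits):
    """Turn arbitrary real-valued logits into weights that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(turn_scores, attention_logits):
    """Combine per-turn scores into one dialogue-level score.

    attention_logits are hypothetical stand-ins for values a trained
    attention layer would produce; here they are supplied by hand.
    """
    weights = softmax(attention_logits)
    pooled = sum(w * s for w, s in zip(weights, turn_scores))
    return pooled, weights

turn_scores = [4.0, 2.0, 5.0]

# Equal logits give equal weights, so the result is the plain average.
uniform_score, uniform_weights = attention_pool(turn_scores, [0.0, 0.0, 0.0])

# A large logit on the second turn pulls the result toward that turn's score.
focused_score, focused_weights = attention_pool(turn_scores, [0.0, 10.0, 0.0])
```

With uniform logits the pooled score is just the mean of the turn scores; as one logit grows, that turn dominates the dialogue-level estimate.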
A more-general model
In our previous work — which we presented last year in two papers (paper 1 | paper 2) — we identified 48 distinct features of the input data that a dialogue model should use to predict customer satisfaction. Some of those features were general, such as the speech recognizer’s confidence in its transcription of the input utterance. Other features, however, referred to specific dialogue acts — such as affirmation, negation, interrogation, or termination — tracked by an earlier version of Alexa’s dialogue manager.
In the new work, we keep only 12 of the most general features from the original set of 48, and we add five new ones, based on the Universal Sentence Encoder (USE). USE is a model for embedding input texts, or representing them as points in a multidimensional space, such that points representing related texts cluster together. Our new input features include the USE embeddings of customer and system utterances and measures of the similarities between them.
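As an illustration of a similarity measure between two embeddings, the sketch below computes cosine similarity and concatenates it with the embeddings to form a per-turn feature vector. Cosine similarity is an assumption here, since the text above says only "measures of the similarities". The short vectors are toy stand-ins for real USE embeddings, and `turn_features` is a hypothetical helper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)

def turn_features(customer_emb, system_emb):
    """Hypothetical feature vector: both embeddings plus their similarity."""
    similarity = cosine_similarity(customer_emb, system_emb)
    return list(customer_emb) + list(system_emb) + [similarity]

# Toy vectors standing in for sentence embeddings of one dialogue turn.
customer_emb = [0.9, 0.1, 0.0]
system_emb = [0.8, 0.2, 0.1]
features = turn_features(customer_emb, system_emb)
```

Related utterances sit close together in the embedding space, so their cosine similarity approaches 1, while unrelated ones score near 0.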
This feature set is much more general than the one we used in our earlier work, so it applies to a range of dialogue managers and domains. Yet a model trained using that feature set outperformed our earlier model — even when the test data included the specific dialogue acts on which the earlier model was trained.
In our paper, we first consider a model that predicts turn-by-turn ratings using a long short-term memory (LSTM) network. LSTMs process sequential inputs in order, so that the output corresponding to each input factors in both the inputs and outputs that preceded it.
Then we present an iteration of the model that replaces the LSTM with a bi-directional LSTM (bi-LSTM), an LSTM that processes the same data both forward and backward. The bi-LSTM jointly predicts the turn-by-turn ratings and the overall dialogue rating.
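The difference between one-directional and bidirectional processing can be sketched with a much simpler recurrence than an LSTM. The code below swaps in a plain scalar tanh recurrence, with no gates and no memory cell, purely to show how a backward pass lets each turn's representation depend on later turns. The weights `w_in` and `w_rec` are arbitrary illustrative values, not learned parameters.

```python
import math

def recurrent_pass(inputs, w_in=0.5, w_rec=0.5):
    """Simple tanh recurrence (not an LSTM): each state depends on the
    current input and the previous state, hence on all earlier inputs."""
    state = 0.0
    states = []
    for x in inputs:
        state = math.tanh(w_in * x + w_rec * state)
        states.append(state)
    return states

def bidirectional_pass(inputs):
    """Run the recurrence forward and backward and pair the states per step."""
    forward = recurrent_pass(inputs)
    # Process the reversed sequence, then flip back so positions line up.
    backward = recurrent_pass(inputs[::-1])[::-1]
    return list(zip(forward, backward))

# Two toy "dialogues" that differ only in their final turn.
states_a = bidirectional_pass([1.0, 0.0, -1.0])
states_b = bidirectional_pass([1.0, 0.0, 1.0])
```

In this example the forward state for the first turn is identical in both sequences, while the backward state for the first turn differs, because it has already seen the final turn.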
The outputs of the bi-LSTM pass through the attention layer, which gives some dialogue turns greater weight than others, and then to the final layers of the network, which perform the classification. The loss function minimized during training is a weighted combination of the loss on the turn-level ratings and the loss on the overall dialogue rating.
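A weighted joint loss of this kind can be sketched as follows. The choice of cross-entropy for both terms, the mixing parameter `turn_weight`, and the function names are assumptions made for illustration; the actual loss terms and weighting are specified in the paper, not here.

```python
import math

def cross_entropy(probs, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[label])

def joint_loss(turn_probs, turn_labels, dialogue_probs, dialogue_label,
               turn_weight=0.5):
    """Weighted sum of the mean turn-level loss and the dialogue-level loss.

    turn_weight is a hypothetical mixing parameter between 0 and 1.
    """
    turn_loss = sum(
        cross_entropy(p, y) for p, y in zip(turn_probs, turn_labels)
    ) / len(turn_labels)
    dialogue_loss = cross_entropy(dialogue_probs, dialogue_label)
    return turn_weight * turn_loss + (1.0 - turn_weight) * dialogue_loss

# Toy two-class predictions for a two-turn dialogue.
turn_probs = [[0.5, 0.5], [0.5, 0.5]]
turn_labels = [0, 1]
dialogue_probs = [0.25, 0.75]
dialogue_label = 1

loss = joint_loss(turn_probs, turn_labels, dialogue_probs, dialogue_label)
```

Setting `turn_weight` to 1.0 recovers a model trained only on turn-level ratings, and setting it to 0.0 trains only on the overall rating; intermediate values trade off the two signals.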
In ongoing work, we plan to expand the model to factor in the preferences of individual users.