Three years ago, Alexa began using an industry-leading self-learning model that learns to correct improperly phrased or misheard customer queries without human involvement.
The model detects instances where a user reformulates a query due to an unsatisfactory response and learns to map the failed utterance to a subsequent successful one. For example, speech recognition errors may lead to the erroneous transcript “play alien bridges”, when the user actually said “play Leon Bridges.” By identifying instances where customers successfully play Leon Bridges’ music after failed interactions, the model learns to map “play alien bridges” to “play leon bridges.”
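The mapping step described above can be illustrated with a minimal sketch. It assumes a hypothetical log format in which each session is a list of (utterance, succeeded) turns, and a hypothetical `min_support` threshold; the production model is a learned one, so this simple counting stands in for it only to show the idea.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical log format: (transcribed utterance, whether the interaction succeeded).
Turn = Tuple[str, bool]


def mine_rewrites(sessions: List[List[Turn]], min_support: int = 2) -> Dict[str, str]:
    """Map each failed utterance to the successful utterance that most often follows it."""
    pair_counts: Counter = Counter()
    for session in sessions:
        # Look at consecutive turns: a failure immediately followed by a success.
        for (utt, ok), (next_utt, next_ok) in zip(session, session[1:]):
            if not ok and next_ok and utt != next_utt:
                pair_counts[(utt, next_utt)] += 1

    rewrites: Dict[str, str] = {}
    # most_common() goes from most to least frequent, so the first target
    # recorded for a failed utterance is its best-supported one.
    for (failed, fixed), count in pair_counts.most_common():
        if count >= min_support and failed not in rewrites:
            rewrites[failed] = fixed
    return rewrites


sessions = [
    [("play alien bridges", False), ("play leon bridges", True)],
    [("play alien bridges", False), ("play leon bridges", True)],
    [("play alien bridges", False), ("play alien", True)],
]
rewrites = mine_rewrites(sessions)
```

Requiring a minimum number of observations before accepting a mapping is one simple way to avoid learning a rewrite from a single coincidental pair of turns.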
In the new industry track at EMNLP 2022, we presented two papers that expand on this approach. In “CGF: Constrained generation framework for query rewriting in conversational AI”, rather than mining past interactions for rewrite candidates, we use a generative model to produce them, with a resulting increase in accuracy.
In “Improving large-scale conversational assistants using model interpretation based training sample selection”, we address a limitation of the rewrite approach, which is that it does not correct errors in Alexa’s underlying AI models. In this paper, we leverage implicit positive feedback and model interpretation techniques to identify samples from live traffic to automatically augment and retrain our production NLU models.
Constrained generation
In the constrained-generation paper, the rewrite generator is an encoder-decoder model. The encoder produces an embedding of the customer query, as understood — and possibly misunderstood — by the automatic-speech-recognition (ASR) model, and the decoder converts it back into a text string.
A drawback of neural-language-generation approaches is that they sometimes hallucinate content. To mitigate this risk, we constrain the output of the decoder, limiting it to utterances that have been successfully used to elicit responses from Alexa.
To impose that constraint, we use a data structure known as a trie. A trie is a tree in which each node represents a word, and each path from the root to a leaf encodes a valid utterance.
The inputs to the encoder are the previous dialogue context and the user’s current request. The decoder is autoregressive, which means that each output token is conditioned on the inputs and outputs that precede it. Consequently, it directly captures the relationship between the contextual input and target rewrites and effectively cross-encodes both.
The size of the trie varies with the number of words in its vocabulary, not the number of distinct strings it encodes, which greatly reduces the model’s memory footprint.
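The following is a minimal sketch of trie-constrained decoding, not the implementation from the paper. The names `Trie`, `END`, `toy_score`, and `constrained_greedy_decode` are hypothetical. A string-similarity scorer stands in for the neural decoder, and greedy search is assumed for brevity; a real system would score candidates with the decoder's token probabilities and might use beam search.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Optional

END = "<end>"  # hypothetical marker for "a valid utterance ends here"


class Trie:
    """Word-level trie: each root-to-END path is one valid utterance."""

    def __init__(self) -> None:
        self.root: Dict[str, dict] = {}

    def insert(self, utterance: str) -> None:
        node = self.root
        for word in utterance.split():
            node = node.setdefault(word, {})
        node[END] = {}

    def allowed_next(self, prefix: List[str]) -> List[str]:
        """Return the words (or END) that may follow the given prefix."""
        node = self.root
        for word in prefix:
            if word not in node:
                return []
            node = node[word]
        return list(node.keys())


Scorer = Callable[[List[str], str], float]


def toy_score(noisy_query: str) -> Scorer:
    """Stand-in for the decoder: prefer words that resemble the ASR transcript."""
    noisy_words = noisy_query.split()

    def score(prefix: List[str], word: str) -> float:
        pos = len(prefix)
        if word == END:
            return 1.0 if pos >= len(noisy_words) else -1.0
        if pos >= len(noisy_words):
            return -1.0
        return SequenceMatcher(None, noisy_words[pos], word).ratio()

    return score


def constrained_greedy_decode(trie: Trie, score: Scorer, max_len: int = 10) -> Optional[str]:
    """Greedily pick the best-scoring word among those the trie allows."""
    prefix: List[str] = []
    for _ in range(max_len):
        candidates = trie.allowed_next(prefix)
        if not candidates:
            return None  # no valid continuation, so produce no rewrite
        best = max(candidates, key=lambda w: score(prefix, w))
        if best == END:
            return " ".join(prefix)
        prefix.append(best)
    return None


trie = Trie()
for utt in ["play leon bridges", "play leon bridges river", "play alicia keys", "stop"]:
    trie.insert(utt)
rewrite = constrained_greedy_decode(trie, toy_score("play alien bridges"))
```

Because every decoding step chooses only among the children of the current trie node, the output is always an utterance that was inserted into the trie, whatever the scorer prefers.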
In our implementation, we construct a global trie, which captures interactions across Alexa, and a personalized trie, which captures a given customer’s preferences. If either rewrite model fails to find a likely match to the input string, it produces no output. If both models generate rewrite candidates, we prioritize the personalized model’s candidate.
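The priority rule can be sketched in a few lines. The function name `pick_rewrite` is hypothetical, and `None` is assumed to represent a model that produced no output.

```python
from typing import Optional


def pick_rewrite(personal: Optional[str], global_: Optional[str]) -> Optional[str]:
    """Prefer the personalized candidate; fall back to the global one.

    Returns None when neither model produced a rewrite, in which case the
    original query would be left unchanged.
    """
    return personal if personal is not None else global_
```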
We conducted extensive offline experiments on both global and personalized query rewriting, using two state-of-the-art models as benchmarks. We found that our approach improved precision by 14% and 21%, respectively, relative to the benchmarks. Online A/B experiments on Alexa traffic demonstrated a 28.97% reduction in the customer-perceived defect rate (CPDR).
Data selection
In a typical voice agent, the output of the ASR model — the text of a customer’s request — passes to a natural-language-understanding (NLU) model, which decides how to handle that request. The constrained-generation framework rewrites the ASR output, but it leaves the underlying model unchanged — and no less error prone.
In “Improving large-scale conversational assistants using model interpretation based training sample selection”, we focus on improving one of Alexa’s underlying AI models: the NLU model. Our main concern is how to select data to retrain the model.
Most interactions with Alexa are successful. Although we limit ourselves to requests that are frequently repeated across customers — and thus can’t be associated with any one customer — Alexa interactions still generate far more data than could practically be used for retraining. And even if we could use it all, it could degrade model performance, by overwriting the NLU model weights learned from prior training.
In selecting examples for retraining the NLU model, we need to distill only the most informative utterances. We do this in two steps. First, we filter out instances with low ASR recognition scores and restrict ourselves to the second turns of successful reformulations.
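As a rough illustration of this first step, the sketch below filters a list of logged turns. The `Turn` fields and the `min_confidence` threshold of 0.8 are assumptions made for the example; the source does not give the actual record format or cutoff.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    text: str
    asr_confidence: float          # hypothetical ASR recognition score in [0, 1]
    turn_index: int                # 1 = original request, 2 = reformulation
    reformulation_succeeded: bool  # whether the reformulated request succeeded


def first_stage_filter(turns: List[Turn], min_confidence: float = 0.8) -> List[Turn]:
    """Keep confidently recognized second turns of successful reformulations."""
    return [
        t for t in turns
        if t.asr_confidence >= min_confidence
        and t.turn_index == 2
        and t.reformulation_succeeded
    ]


turns = [
    Turn("play alien bridges", 0.55, 1, True),
    Turn("play leon bridges", 0.93, 2, True),
    Turn("play leon fridges", 0.40, 2, True),
    Turn("play leon bridges live", 0.91, 2, False),
]
kept = first_stage_filter(turns)
```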
Second, we use the integrated-gradients (IG) model interpretability technique to score the individual words of each input sentence according to their contribution to the NLU model’s output. IG sweeps through a sequence of slightly varied inputs, determining how each variation affects the output.
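Integrated gradients can be sketched on a toy model as follows. This assumes a hypothetical three-feature logistic model in which each input dimension marks the presence of one word, with an all-zeros baseline; a production NLU model would instead attribute over word embeddings and obtain gradients by automatic differentiation. The integral along the path from baseline to input is approximated with a midpoint Riemann sum.

```python
import math
from typing import Callable, List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical toy model: one weight per word in "play alien bridges".
WEIGHTS = [0.5, -2.0, 1.0]


def model(x: List[float]) -> float:
    """Probability of the target class for input features x."""
    return sigmoid(sum(w * xi for w, xi in zip(WEIGHTS, x)))


def model_grad(x: List[float]) -> List[float]:
    """Analytic gradient of the model output with respect to each input."""
    p = model(x)
    return [p * (1.0 - p) * w for w in WEIGHTS]


def integrated_gradients(
    grad_fn: Callable[[List[float]], List[float]],
    x: List[float],
    baseline: List[float],
    steps: int = 50,
) -> List[float]:
    """Approximate IG: (x - baseline) times the average gradient along the path."""
    diffs = [xi - bi for xi, bi in zip(x, baseline)]
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [bi + alpha * d for bi, d in zip(baseline, diffs)]
        grads = grad_fn(point)
        totals = [t + g for t, g in zip(totals, grads)]
    return [d * t / steps for d, t in zip(diffs, totals)]


x = [1.0, 1.0, 1.0]         # all three words present
baseline = [0.0, 0.0, 0.0]  # no words present
attributions = integrated_gradients(model_grad, x, baseline)
```

A useful sanity check is the completeness property: the attributions should sum to approximately `model(x) - model(baseline)`. In this toy setup the word with the negative weight receives a negative attribution, which is the kind of signal the selection step looks for.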
We begin by training the base NLU model and evaluating it on a held-out validation set. For observed misclassifications, we use IG to identify the words that have either negative scores with respect to the correct class or positive scores with respect to the incorrect class. The idea is to prioritize training examples that associate these words with their proper classes.
We score utterances by summing the influence scores for all occurring words. Only a small subset with the highest importance scores is chosen to augment the original training set and retrain the model.
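The word-level and utterance-level scoring can be sketched as below. The way the two kinds of evidence are combined here (adding the magnitude of negative scores toward the correct class to positive scores toward the incorrect class) is an assumption for illustration, as are the names `word_influence` and `select_utterances`; the source states only that such words are identified and that utterance scores are sums of word influence scores.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# One record per misclassified validation utterance. For each word:
# (IG score toward the correct class, IG score toward the wrongly predicted class).
WordScores = Dict[str, Tuple[float, float]]


def word_influence(misclassified: List[WordScores]) -> Dict[str, float]:
    """Accumulate, per word, how strongly it pushed the model toward errors."""
    influence: Dict[str, float] = defaultdict(float)
    for scores in misclassified:
        for word, (toward_correct, toward_wrong) in scores.items():
            # Assumed combination rule: count evidence against the correct
            # class and evidence for the incorrect class; ignore the rest.
            influence[word] += max(0.0, -toward_correct) + max(0.0, toward_wrong)
    return dict(influence)


def utterance_score(utterance: str, influence: Dict[str, float]) -> float:
    """Sum the influence scores of all words occurring in the utterance."""
    return sum(influence.get(word, 0.0) for word in utterance.split())


def select_utterances(candidates: List[str], influence: Dict[str, float], k: int) -> List[str]:
    """Keep the k candidates with the highest summed influence."""
    ranked = sorted(candidates, key=lambda u: utterance_score(u, influence), reverse=True)
    return ranked[:k]


misclassified = [{"leon": (-0.4, 0.3), "play": (0.2, -0.1)}]
influence = word_influence(misclassified)
selected = select_utterances(["play music", "play leon bridges", "stop"], influence, k=1)
```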
For our Alexa NLU application, we added a set of utterances only 0.05% of the size of the total training set. Nevertheless, our offline experiments showed a statistically significant 0.27% reduction in semantic error rate (SEMER) on all traffic and a 0.45% reduction on infrequent tail traffic. On live traffic in two domains (General and Information), retraining an intent classification/named-entity-recognition model reduced the customer-perceived defect rate (CPDR) by 0.27% and 1.04%, respectively, overall and by 1.32% and 1.64%, respectively, on tail traffic. The improved models have been launched to the production system.
Over the long term, we plan to build on these works to enable large-scale, continuous learning across all Alexa modules, without requiring human supervision.