Conversational AI

Data-efficient continual learning in Alexa

EMNLP papers examine constrained generation of rewrite candidates and automatic selection of information-rich training data.

December 14, 2022

Three years ago, Alexa began using an industry-leading self-learning model that learns to correct improperly phrased or misheard customer queries without human involvement.

The model detects instances where a user reformulates a query due to an unsatisfactory response and learns to map the failed utterance to a subsequent successful one. For example, speech recognition errors may lead to the erroneous transcript “play alien bridges”, when the user actually said “play Leon Bridges.” By identifying instances where customers successfully play Leon Bridges’ music after failed interactions, the model learns to map “play alien bridges” to “play leon bridges.”

Constrained generation

In the constrained-generation paper, the rewrite generator is an encoder-decoder model. The encoder produces an embedding of the customer query, as understood — and possibly misunderstood — by the automatic-speech-recognition (ASR) model, and the decoder converts it back into a text string.

A drawback of neural-language-generation approaches is that they sometimes hallucinate content. To mitigate this risk, we constrain the output of the decoder, limiting it to utterances that have been successfully used to elicit responses from Alexa.

To impose that constraint, we use a data structure known as a trie: a tree in which each node represents a word and each path from root to leaf encodes a valid utterance.

An example of an utterance trie. The special tokens “BOS” and “EOS” represent the beginning of a string and the end of a string, respectively. When the rewrite model has generated the sequence “[BOS] play staring at” during the decoding process, it may generate only “the” or “it” at the next step. If it generates “the” next, it may generate only “sun”, “moon”, or “sky” in the next step.
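The trie constraint can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class and function names are hypothetical, the decoder is replaced by a toy scoring function, and greedy decoding stands in for whatever search strategy the production system uses. The utterances are the ones implied by the trie example above.

```python
from typing import Callable, Dict, List

BOS, EOS = "[BOS]", "[EOS]"


class UtteranceTrie:
    """Word-level trie over utterances that are known to be valid."""

    def __init__(self) -> None:
        self.root: Dict[str, dict] = {}

    def add(self, utterance: str) -> None:
        # Each utterance becomes one root-to-leaf path: BOS, words, EOS.
        node = self.root
        for token in [BOS, *utterance.split(), EOS]:
            node = node.setdefault(token, {})

    def allowed_next(self, prefix: List[str]) -> List[str]:
        """Tokens that may follow `prefix`; empty if the prefix is not in the trie."""
        node = self.root
        for token in prefix:
            if token not in node:
                return []
            node = node[token]
        return sorted(node)


def constrained_greedy_decode(
    trie: UtteranceTrie,
    score_fn: Callable[[List[str], str], float],
    max_len: int = 20,
) -> List[str]:
    """Greedy decoding in which only trie-approved tokens are considered."""
    prefix = [BOS]
    while prefix[-1] != EOS and len(prefix) < max_len:
        candidates = trie.allowed_next(prefix)
        if not candidates:
            break
        prefix.append(max(candidates, key=lambda tok: score_fn(prefix, tok)))
    return prefix


trie = UtteranceTrie()
for text in ["play staring at the sun", "play staring at the moon",
             "play staring at the sky", "play staring at it"]:
    trie.add(text)

# Hypothetical stand-in for the decoder's next-token scores.
toy_scores = {"the": 0.6, "it": 0.4, "moon": 0.5, "sun": 0.3, "sky": 0.2}
result = constrained_greedy_decode(trie, lambda prefix, tok: toy_scores.get(tok, 0.0))
print(" ".join(result))  # [BOS] play staring at the moon [EOS]
```

Because candidates are drawn only from the children of the current trie node, the sketch cannot emit a token sequence that is absent from the trie, whatever the scoring function prefers.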

The inputs to the encoder are the previous dialogue context and the user’s current request. The decoder is autoregressive, which means that each output token is conditioned on the inputs and outputs that precede it. Consequently, it directly captures the relationship between the contextual input and target rewrites and effectively cross-encodes both.

The constrained generation framework (CGF) for query rewriting.

The size of the trie varies with the number of words in its vocabulary, not the number of distinct strings it encodes, which greatly reduces the model’s memory footprint.

Data selection

In a typical voice agent, the output of the ASR model — the text of a customer’s request — passes to a natural-language-understanding (NLU) model, which decides how to handle that request. The constrained-generation framework rewrites the ASR output, but it leaves the underlying model unchanged — and no less error prone.

In “Improving large-scale conversational assistants using model interpretation based training sample selection”, we focus on improving one of Alexa’s underlying AI models: the NLU model. Our main concern is how to select data for retraining the model.

Most interactions with Alexa are successful. Although we limit ourselves to requests that are frequently repeated across customers (and thus can’t be associated with any one customer), Alexa interactions still generate far more data than could practically be used for retraining. And even if we could use it all, doing so could degrade model performance by overwriting the NLU model weights learned from prior training.

In selecting examples for retraining the NLU model, we need to distill only the most informative utterances. We do this in two steps. First, we filter out instances with low ASR recognition scores and restrict ourselves to the second turns of successful reformulations.

Second, we use the integrated-gradients (IG) model interpretability technique to score the individual words of each input sentence according to their contribution to the NLU model’s output. IG sweeps through a sequence of slightly varied inputs, determining how each variation affects the output.
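As a rough illustration of how IG works, the sketch below averages a model's gradients at evenly spaced points on the straight line from a baseline input to the actual input, then scales by the input difference. The three-feature logistic model, its weights, and the all-zeros baseline are assumptions chosen for illustration, not the NLU model described here. In a text setting, attributions would typically be computed over word embeddings and summed per word.

```python
import math
from typing import Callable, List


def integrated_gradients(
    grad_fn: Callable[[List[float]], List[float]],
    x: List[float],
    baseline: List[float],
    steps: int = 50,
) -> List[float]:
    """Approximate IG with a midpoint Riemann sum along the path baseline -> x."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            total[i] += g
    # Average gradient along the path, scaled by how far each feature moved.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


# Hypothetical toy model: logistic regression over three input features.
WEIGHTS = [2.0, -1.0, 0.5]


def model(x: List[float]) -> float:
    z = sum(w * xi for w, xi in zip(WEIGHTS, x))
    return 1.0 / (1.0 + math.exp(-z))


def model_grad(x: List[float]) -> List[float]:
    p = model(x)
    return [p * (1.0 - p) * w for w in WEIGHTS]


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(model_grad, x, baseline)

# Completeness check: attributions should sum to model(x) - model(baseline).
print(sum(attributions), model(x) - model(baseline))
```

In this toy case the first and third features receive positive attributions and the second a negative one, mirroring the signs of their weights.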

Example of word importance scores for the task of domain classification. The true domain of the input utterance “tell us a bedtime story” is Books, but the model wrongly predicts Information.

We begin by training the base NLU model and evaluating it on a held-out validation set. For observed misclassifications, we use IG to identify the words that have either negative scores with respect to the correct class or positive scores with respect to the incorrect class. The idea is to prioritize training examples that associate these words with their proper classes.

We score utterances by summing the influence scores for all occurring words. Only a small subset with the highest importance scores is chosen to augment the original training set and retrain the model.
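The scoring and selection step can be sketched as follows. The word influence values, the candidate utterances, and the function names are all hypothetical, and the sketch assumes that per-word influence scores have already been derived from the IG analysis of misclassified validation examples.

```python
from typing import Dict, List


def score_utterance(utterance: str, word_influence: Dict[str, float]) -> float:
    """Sum the influence scores of every word occurrence in the utterance."""
    return sum(word_influence.get(word, 0.0) for word in utterance.lower().split())


def select_top_utterances(
    candidates: List[str], word_influence: Dict[str, float], k: int
) -> List[str]:
    """Return the k candidates with the highest summed influence scores."""
    ranked = sorted(
        candidates,
        key=lambda u: score_utterance(u, word_influence),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical influence scores for words implicated in misclassifications.
word_influence = {"bedtime": 0.8, "story": 0.6, "tell": 0.1}

candidates = [
    "what time is it",
    "tell me a story",
    "read me a bedtime story",
    "play a story",
]

selected = select_top_utterances(candidates, word_influence, k=2)
print(selected)  # ['read me a bedtime story', 'tell me a story']
```

Words without an influence score contribute zero, so utterances that contain none of the implicated words sink to the bottom of the ranking.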

Overview of our method for augmenting training data using sample importance scores.

For our Alexa NLU application, we added a set of utterances that is only 0.05% of the size of the total training set. Nevertheless, our offline experiments showed a statistically significant 0.27% reduction in semantic error rate (SEMER) on all traffic and a 0.45% reduction on infrequent tail traffic. On live traffic in two domains (General and Information), retraining an intent classification/named-entity-recognition model resulted in reductions in customer-perceived defect rate (CPDR) of 0.27% and 1.04%, respectively, and of 1.32% and 1.64%, respectively, on tail traffic. The improved models have been launched to the production system.

Over the long term, we plan to build on these works to enable large-scale, continuous learning across all Alexa modules, without requiring human supervision.

About the Author

Pradeep Natarajan

Pradeep Natarajan is a senior principal scientist in the Alexa AI organization.
