How Alexa knows when you’re talking to her
Leveraging semantic content improves performance of acoustic-only model for detecting device-directed speech.
Follow-up Mode makes interacting with Alexa more natural. With Follow-up Mode enabled, a customer can ask, “Alexa, what’s the weather?”, then follow up by asking “How about tomorrow?”, without having to repeat the wake word “Alexa”.
Dispensing with the wake word means that Alexa-enabled devices must distinguish between speech that is and is not device directed. They have to distinguish, that is, between phrases like “How about tomorrow?” and children’s shouts or voices from the TV.
In the past, Alexa researchers have dramatically improved the detection of device-directed speech by leveraging components of Alexa’s speech recognition system. In a paper that we’re presenting (virtually) this week at the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), we show how to extend those improvements by adding information about semantic and syntactic features of customer utterances.
In the experiments we report in our paper, our machine learning model demonstrated a 14% improvement over the best-performing baseline in terms of equal error rate, the error rate at the operating point where the false-positive and false-negative rates are equal.
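To make that metric concrete, here's a minimal sketch (not our evaluation code) of how an equal error rate can be computed for a binary classifier: sweep the decision threshold over the scores until the false-positive and false-negative rates meet, and report the error rate there.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Sweep thresholds over the scores and return the error rate at the
    operating point where false-positive and false-negative rates are
    (approximately) equal.

    scores: higher = more likely device-directed; labels: 1 = device-directed.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n_pos = max(np.sum(labels == 1), 1)
    n_neg = max(np.sum(labels == 0), 1)
    best_gap, eer = float("inf"), 1.0
    for t in np.unique(scores):
        pred = scores >= t                       # accept everything above t
        fp = np.sum(pred & (labels == 0)) / n_neg
        fn = np.sum(~pred & (labels == 1)) / n_pos
        if abs(fp - fn) < best_gap:              # closest point to FP == FN
            best_gap, eer = abs(fp - fn), (fp + fn) / 2
    return eer
```

Because the two error rates are set equal, a single number summarizes the whole trade-off curve, which is what makes EER a convenient headline metric.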
Requests directed to Alexa are different from ordinary human conversation in terms of topic, content, conversational flow, and syntactic and semantic structure. For instance, non-device-directed speech often consists of fragments such as “break at a bigger” or “weather talking about hal”. The fractured syntax of these fragments is something that a machine learning system should be able to recognize.
Of course, follow-up remarks can also be fragmentary: for instance, a customer might follow up the question “Alexa, what’s the weather for today?” with “and for tomorrow?” But such fragments usually gain in coherence when they’re combined with their predecessor questions. So as input to our model, we use both the current utterance and the one that preceded it.
Other utterances (“thank you,” “stop,” “okay”) remain ambiguous even in conjunction with their predecessors. For this reason, our system doesn’t rely solely on high-level semantic and syntactic features. We also use acoustic features and features that represent the speech recognizer’s confidence in its transcriptions of customers’ utterances. This is a lightweight version of the approach adopted by the Alexa team in its state-of-the-art system for detecting device-directedness.
Their basic insight: if the speech recognizer’s confidence in its transcriptions is low, then it’s probably dealing with utterances that are unlike its training data. And as it was trained on device-directed utterances, utterances unlike its training data are more likely to be non-device-directed.
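As a rough, hypothetical illustration of how such a signal might reach the classifier (the function and statistics below are ours for exposition, not the system's actual inputs), per-word recognizer confidences can be summarized and appended to the utterance's semantic embedding, so that a low-confidence transcription can tip the decision toward "not device-directed":

```python
import numpy as np

def build_features(utterance_embedding, word_confidences):
    """Hypothetical feature assembly: concatenate the semantic/syntactic
    embedding of an utterance with summary statistics of the speech
    recognizer's per-word confidences."""
    conf = np.asarray(word_confidences, dtype=float)
    conf_stats = np.array([conf.mean(), conf.min()])   # collapse per-word scores
    return np.concatenate([utterance_embedding, conf_stats])
```

A downstream classifier trained on vectors like these can then learn for itself how strongly low confidence should push the prediction toward non-device-directed.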
Because the semantic features we add are intended to exploit sentence structure, word sequence matters. Consequently, our system uses a machine learning model known as a long short-term memory (LSTM) network.
LSTMs process inputs in sequence, so that each output factors in both the inputs and the outputs that preceded it. With linguistic inputs, the LSTM proceeds one word at a time, producing a new output after each new word. The final output encodes information about the sequence of the words that preceded it.
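A minimal numpy sketch of that sequential processing (toy weights, not a trained model) shows how each step folds the new word into a running state, so the final hidden vector summarizes the whole word sequence:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_encode(embeddings, W, U, b):
    """Run a single LSTM layer over word embeddings, one word at a time,
    and return the final hidden state.

    embeddings: (T, d_in); W: (4d, d_in); U: (4d, d); b: (4d,).
    The stacked weights hold the input, forget, cell, and output gates.
    """
    d = U.shape[1]
    h, c = np.zeros(d), np.zeros(d)
    for x in embeddings:                 # each step sees the new word AND h, c
        z = W @ x + U @ h + b
        i = sigmoid(z[:d])               # input gate
        f = sigmoid(z[d:2 * d])          # forget gate
        g = np.tanh(z[2 * d:3 * d])      # candidate cell update
        o = sigmoid(z[3 * d:])           # output gate
        c = f * c + i * g                # carry long-term memory forward
        h = o * np.tanh(c)               # new output, fed back at the next step
    return h
```

Because `h` is threaded through every step, feeding the same words in a different order produces a different final vector, which is exactly the order sensitivity we need.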
Centers of attention
In many natural-language-understanding settings, LSTMs work better if they also incorporate attention mechanisms. Essentially, the attention mechanism determines how much each word of the input should contribute to the final output. In many applications, for instance, the names of entities (“Blinding Lights”, “Dance Monkey”) are more important than articles (“a”, “the”) or prepositions (“to”, “of”); an attention mechanism would thus assign them greater weight. We use an attention mechanism to help the model key in on input words that are particularly useful in distinguishing device-directed from non-device-directed speech.
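In its simplest form, such a mechanism scores each word's hidden state against a learned query vector, softmaxes the scores into weights, and returns the weighted sum. A toy sketch (made-up weights, not our trained model):

```python
import numpy as np

def attention_pool(hidden_states, query):
    """Dot-product attention over a sequence of per-word hidden states.

    hidden_states: (T, d) array, one vector per word.
    query: (d,) learned vector scoring each word's relevance.
    Returns the attention-weighted summary vector of shape (d,).
    """
    scores = hidden_states @ query             # one relevance score per word
    weights = np.exp(scores - scores.max())    # stable softmax
    weights /= weights.sum()                   # attention weights sum to 1
    return weights @ hidden_states             # weighted sum of states
```

Words whose states align with the query dominate the summary, while low-scoring words (articles, prepositions) contribute almost nothing.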
Finally, we also use transfer learning to improve our model’s performance. That is, we pre-train the model on one-shot interactions before fine-tuning it on multiturn interactions. During pre-training, we use both positive and negative examples, so the network will learn features of both device-directed and non-device-directed speech.
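The pre-train-then-fine-tune recipe can be illustrated with a toy stand-in model: here, a logistic-regression classifier whose weights learned on a "pre-training" set become the starting point for further training on a second set (synthetic data, purely illustrative, not our architecture):

```python
import numpy as np

def train_logreg(X, y, w=None, epochs=200, lr=0.5):
    """Gradient-descent logistic regression. Pass `w` to continue
    training from existing weights, i.e., to fine-tune."""
    if w is None:
        w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ w, -30, 30)))
        w = w - lr * X.T @ (p - y) / len(y)
    return w

rng = np.random.default_rng(0)
# "Pre-training" set: plentiful single-interaction-style examples.
X_pre = rng.normal(size=(200, 3))
y_pre = (X_pre[:, 0] > 0).astype(float)
# "Fine-tuning" set: scarcer multiturn-style examples with a shifted boundary.
X_ft = rng.normal(size=(40, 3))
y_ft = (X_ft[:, 0] + 0.3 * X_ft[:, 1] > 0).astype(float)

w = train_logreg(X_pre, y_pre)        # pre-training from scratch
w = train_logreg(X_ft, y_ft, w=w)     # fine-tuning starts from pre-trained w
```

The point of starting from the pre-trained weights is that the scarce fine-tuning data only has to nudge an already reasonable decision boundary rather than learn one from nothing.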
In our experiments, we compared our system to both the state-of-the-art acoustic-only model for recognizing device-directed speech and a version of our model that used a deep neural network (DNN) rather than an LSTM. To make the comparison fair, the acoustic-only model was trained on both the pre-training (single-interaction) data set and the fine-tuning (multiple-interaction) data set we used for transfer learning.
The DNN represents inputs in a way that captures semantic information about all the words in an utterance but doesn’t reflect their order. Its performance was significantly worse than that of the acoustic-only baseline — an equal-error rate of 19.2%, versus a baseline of 10.6%. But our proposed LSTM model lowered the equal-error rate to 9.1%, an improvement of 14%.
In our paper, we also report promising results of some initial experiments with semi-supervised learning, in which the trained network itself labels a large body of unlabeled data, which are in turn used to re-train the network. We plan to — to coin a phrase — follow up on these experiments in our future work.
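The self-training loop behind those experiments follows a standard pattern, sketched here with a toy logistic-regression stand-in and synthetic data (illustrative only, not our model or data): train on the labeled set, let the trained model label the unlabeled pool, then retrain on the union.

```python
import numpy as np

def fit(X, y, epochs=300, lr=0.5):
    """Toy stand-in for the trained network: logistic regression."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ w, -30, 30)))
        w = w - lr * X.T @ (p - y) / len(y)
    return w

rng = np.random.default_rng(1)
X_lab = rng.normal(size=(30, 2))
y_lab = (X_lab[:, 0] > 0).astype(float)   # small labeled set
X_unl = rng.normal(size=(500, 2))         # large unlabeled pool

w = fit(X_lab, y_lab)                     # 1) train on labeled data
probs = 1.0 / (1.0 + np.exp(-np.clip(X_unl @ w, -30, 30)))
pseudo = (probs > 0.5).astype(float)      # 2) model labels the unlabeled pool
w = fit(np.vstack([X_lab, X_unl]),        # 3) retrain on labeled + pseudo-labeled
        np.concatenate([y_lab, pseudo]))
```

In practice, one would typically keep only the pseudo-labels the model is most confident about; the sketch omits that filtering step for brevity.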