Using wake word acoustics to filter out background speech improves speech recognition by 15%
At this year’s International Conference on Acoustics, Speech, and Signal Processing, my colleagues and I will present a new technique for using wake word acoustics to filter out background speech, which could complement the techniques that Alexa already uses.
We assume that the speaker who activates an Alexa-enabled device by uttering its “wake word” — usually “Alexa” — is the one Alexa should be listening to. Essentially, our technique takes an acoustic snapshot of the wake word and compares subsequent speech to it. Speech whose acoustics match those of the wake word is judged to be intended for Alexa, and all other speech is treated as background noise.
Rather than training a separate neural network to make this discrimination, we integrate our wake-word-matching mechanism into a standard automatic-speech-recognition system. We then train the system as a whole to recognize only the speech of the speaker who uttered the wake word. In tests, this approach reduced speech recognition errors by 15%.
We implemented our technique using two different neural-network architectures. Both were variations of a sequence-to-sequence encoder-decoder network with an attention mechanism. A sequence-to-sequence network is one that processes an input sequence — here, a series of “frames”, or millisecond-scale snapshots of an audio signal — in order and produces a corresponding output sequence — here, phonetic renderings of speech sounds.
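To make the notion of frames concrete, here is a minimal sketch of splitting an audio signal into overlapping frames. The frame length and hop size below are common illustrative choices for speech processing, not values taken from the paper:

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=160):
    """Split a 1-D audio signal into overlapping frames.

    With 16 kHz audio, 400 samples is about 25 ms and a 160-sample
    hop is about 10 ms -- a common framing for speech features.
    """
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])

# One second of 16 kHz audio yields 98 frames at these settings.
audio = np.zeros(16000)
frames = frame_signal(audio)
```

Each frame would then be converted to acoustic features before being fed to the encoder.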
In an encoder-decoder network, the encoder summarizes the input as a vector — a sequence of numbers — of fixed length. Typically, the vector is more compact than the original input. The decoder then converts the vector into an output. The entire network is trained together, so that the encoder learns to produce summary vectors well suited to the decoder’s task.
Finally, the attention mechanism tells the decoder which elements of the encoder’s summary vector to focus on when producing an output. In a sequence-to-sequence model, the attention mechanism’s decision is typically based on the current states of both the encoder and decoder networks.
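As a rough illustration of how these three pieces fit together, the sketch below runs dot-product attention over per-frame encoder states. In the real system the encoder and decoder are learned networks; the random vectors here are stand-ins for their states:

```python
import numpy as np

rng = np.random.default_rng(0)

T, d = 6, 4                            # 6 encoder time steps, 4-dim states
enc_states = rng.normal(size=(T, d))   # stand-ins for encoder outputs
dec_state = rng.normal(size=d)         # stand-in for the current decoder state

# Attention: score each encoder state against the decoder state,
# normalize the scores with a softmax, and form a weighted summary
# ("context") for the decoder to use when producing its next output.
scores = enc_states @ dec_state
weights = np.exp(scores - scores.max())
weights /= weights.sum()
context = weights @ enc_states         # what the decoder "focuses on"
```

The weights sum to one, so the context vector is a convex combination of encoder states, dominated by whichever frames score highest against the decoder's current state.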
Our first modification to this baseline network was simply to add an input to the attention mechanism. In addition to receiving information about the current states of the encoder and decoder networks, our modified attention mechanism also receives the raw frame data corresponding to the wake word. During training, the attention mechanism automatically learns which acoustic characteristics of the wake word to look for in subsequent speech.
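A toy version of that modification might fold a summary of the wake word into the attention scores. Everything below — the mean pooling, the additive scoring — is an illustrative simplification, not the paper's exact architecture, which learns these interactions during training:

```python
import numpy as np

rng = np.random.default_rng(1)

T, d = 6, 4
enc_states = rng.normal(size=(T, d))    # stand-ins for encoder outputs
dec_state = rng.normal(size=d)          # stand-in for the decoder state
wake_frames = rng.normal(size=(3, d))   # stand-in for wake-word frame data

# Summarize the wake word (mean pooling, purely for illustration) and
# score encoder states against the decoder state AND the wake-word
# summary, so frames acoustically similar to the wake word get boosted.
wake_vec = wake_frames.mean(axis=0)
scores = enc_states @ dec_state + enc_states @ wake_vec
weights = np.exp(scores - scores.max())
weights /= weights.sum()
context = weights @ enc_states
```

In the actual system, learned parameters decide how much the wake-word acoustics influence the scores, rather than the fixed additive combination used here.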
In another experiment, we trained the network more explicitly to emphasize input speech whose acoustic profile matches that of the wake word. First, we added a mechanism that directly compares the wake word acoustics with those of subsequent speech. Then we used the result of that comparison as an input to a mechanism that learns to suppress — or “mask” — some elements of the encoder’s summary vector before they even pass to the attention mechanism. Otherwise, the attention mechanism is the same as in the baseline model.
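The masking idea can be sketched as follows: compare each encoder state to the wake-word summary, map the similarity to a soft mask in (0, 1), and scale the encoder states before attention sees them. The cosine similarity and fixed sigmoid below are stand-ins for the paper's learned comparison and masking mechanisms:

```python
import numpy as np

rng = np.random.default_rng(2)

T, d = 6, 4
enc_states = rng.normal(size=(T, d))   # stand-ins for encoder outputs
dec_state = rng.normal(size=d)         # stand-in for the decoder state
wake_vec = rng.normal(size=d)          # stand-in for a wake-word summary

# Compare each encoder state to the wake word (cosine similarity) and
# derive a soft mask in (0, 1); in the real system this mapping is learned.
sims = (enc_states @ wake_vec) / (
    np.linalg.norm(enc_states, axis=1) * np.linalg.norm(wake_vec))
mask = 1.0 / (1.0 + np.exp(-5.0 * sims))   # sigmoid; sharpness is arbitrary

masked = enc_states * mask[:, None]        # suppress mismatched frames

# Attention then runs unchanged on the masked encoder states.
scores = masked @ dec_state
weights = np.exp(scores - scores.max())
weights /= weights.sum()
context = weights @ masked
```

Note that the mask here depends only on the encoder side, which mirrors the limitation discussed below: the masking decision never sees the decoder's state.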
We expected the masking approach to outperform the less explicitly supervised attention mechanism, but in fact it fared slightly worse, reducing the error rate of the baseline model by only 13%, rather than 15%. We suspect that this is because the decision to mask encoder outputs is based solely on the state of the encoder network, whereas the modified attention mechanism factored in the state of the decoder network, too. In future work, we plan to explore a masking mechanism that also considers the decoder state.
Acknowledgments: Yiming Wang, I-Fan Chen, Yuzong Liu, Tongfei Chen, Björn Hoffmeister