Metadata-aware end-to-end keyword spotting

Hongyi Liu; Apurva Abhyankar; Yuriy Mishchenko; Thibaud Sénéchal; Gengshen Fu; Brian Kulis; Noah Stein; Anish Shah; Shiv Naga Prasad Vitaladevuni

2020

As a crucial part of Alexa products, our on-device keyword spotting system detects the wakeword in conversation and initiates subsequent user-device interactions. Convolutional neural networks (CNNs) have been widely used to model the relationship between time and frequency in the audio spectrum. However, it is not obvious how to appropriately leverage the rich descriptive information from device state metadata (such as player state, device type, volume, etc.) in a CNN architecture. In this paper, we propose to use metadata information as an additional input feature to improve the performance of a single CNN keyword-spotting model under different conditions. We design a new network architecture for metadata-aware end-to-end keyword spotting which learns to convert the categorical metadata to a fixed-length embedding, and then uses the embedding to: 1) modulate convolutional feature maps via conditional batch normalization, and 2) contribute to the fully connected layer via feature concatenation. Experiments show that the proposed architecture is able to learn the meta-specific characteristics from combined data sets, and the best candidate achieves an average relative false reject rate (FRR) improvement of 14.63% at the same false accept rate (FAR) compared with a CNN that does not use device state metadata.
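The two uses of the metadata embedding described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the metadata vocabulary, the dimensions, the linear projections that map the embedding to per-channel scale and shift, and the `1 + delta` form of the scale are all assumptions made for illustration, and the parameters are randomly initialised rather than learned.

```python
import math
import random

random.seed(0)

# Hypothetical metadata vocabulary and sizes (assumptions, not from the paper).
DEVICE_TYPES = ["speaker", "tablet", "tv"]
EMBED_DIM = 4
NUM_CHANNELS = 3
POSITIONS = 6  # flattened time-frequency positions per channel


def make_matrix(rows, cols, scale=0.1):
    """Random matrix standing in for parameters that would be learned."""
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


embedding_table = make_matrix(len(DEVICE_TYPES), EMBED_DIM)
gamma_proj = make_matrix(EMBED_DIM, NUM_CHANNELS)  # embedding -> per-channel scale offset
beta_proj = make_matrix(EMBED_DIM, NUM_CHANNELS)   # embedding -> per-channel shift


def embed(device_type):
    """Look up the fixed-length embedding of a categorical metadata value."""
    return embedding_table[DEVICE_TYPES.index(device_type)]


def project(vec, matrix):
    """Vector-matrix product: returns one value per matrix column."""
    return [sum(v * matrix[i][j] for i, v in enumerate(vec))
            for j in range(len(matrix[0]))]


def conditional_batch_norm(batch, embeddings, eps=1e-5):
    """Normalise each channel with batch statistics, then apply a scale and
    shift predicted from each example's metadata embedding.

    batch[n][c] is the flat list of activations of example n, channel c.
    """
    out = [[None] * NUM_CHANNELS for _ in batch]
    for c in range(NUM_CHANNELS):
        values = [x for example in batch for x in example[c]]
        mean = sum(values) / len(values)
        var = sum((x - mean) ** 2 for x in values) / len(values)
        for n, example in enumerate(batch):
            scale = 1.0 + project(embeddings[n], gamma_proj)[c]
            shift = project(embeddings[n], beta_proj)[c]
            out[n][c] = [scale * (x - mean) / math.sqrt(var + eps) + shift
                         for x in example[c]]
    return out


def fc_input(feature_maps, embedding):
    """Flatten the feature maps and concatenate the metadata embedding."""
    return [x for channel in feature_maps for x in channel] + list(embedding)


# Toy batch of two examples with different device types.
batch = [[[random.gauss(0.0, 1.0) for _ in range(POSITIONS)]
          for _ in range(NUM_CHANNELS)] for _ in range(2)]
embeddings = [embed("speaker"), embed("tv")]
normalized = conditional_batch_norm(batch, embeddings)
fc_inputs = [fc_input(fm, e) for fm, e in zip(normalized, embeddings)]
```

Each entry of `fc_inputs` holds the flattened, metadata-modulated feature maps followed by the embedding itself, so the fully connected layer sees the metadata both indirectly (through the normalisation) and directly (through concatenation). In a trained model the embedding table and both projections would be learned jointly with the CNN weights.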
