A Study for Improving Device-Directedness Detection toward Frictionless Human-Machine Interaction
2019

In this paper, we extend our previous work on device-directed utterance detection, which aims to distinguish voice queries intended for a smart-home device from background speech. The task can be phrased as a binary utterance-level classification problem, which we approach with a DNN-LSTM model that takes acoustic features and features from the automatic speech recognition (ASR) decoder as input. In this work, we study the performance of the model for different dialog types and for different categories of decoder features. To address the different dialog types, we found that a model with a separate output branch for each dialog type outperforms a model with a shared output branch by a relative 12.5% reduction in equal error rate (EER). We also found the average number of arcs in a confusion network to be one of the most informative ASR decoder features. In addition, we explore different frequencies of backward propagation for training the acoustic embedding, performed once every k frames (k = 1, 3, 5, 7), as well as mean and attention pooling methods for generating an utterance representation. We found that attention pooling provides the most discriminative utterance representation, outperforming mean pooling by a relative 4.97% reduction in EER.
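To make the two main findings concrete, here is a minimal pure-NumPy sketch (all function names, dimensions, and dialog-type labels are illustrative assumptions, not the paper's implementation) contrasting mean pooling with additive attention pooling over per-frame LSTM embeddings, followed by a separate logistic output branch per dialog type on top of the shared utterance embedding:

import numpy as np

def mean_pooling(frames):
    """Average per-frame embeddings (T x D) into one utterance vector (D,)."""
    return frames.mean(axis=0)

def attention_pooling(frames, W, v):
    """Additive attention pooling: score each frame, softmax-normalize the
    scores, and return the attention-weighted sum of the frames.

    frames: (T, D) per-frame LSTM outputs
    W:      (D, A) projection for the attention energies
    v:      (A,)   attention context vector
    """
    energies = np.tanh(frames @ W) @ v           # (T,) unnormalized scores
    weights = np.exp(energies - energies.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ frames                      # (D,) utterance embedding

def dialog_type_branch(utt_vec, branch_params, dialog_type):
    """Per-dialog-type output branch: each dialog type has its own logistic
    output layer on top of the shared utterance embedding."""
    w, b = branch_params[dialog_type]
    logit = utt_vec @ w + b
    return 1.0 / (1.0 + np.exp(-logit))          # P(device-directed)

# Toy usage with illustrative dimensions and dialog-type labels.
rng = np.random.default_rng(0)
T, D, A = 50, 64, 32                             # frames, embedding dim, attention dim
frames = rng.standard_normal((T, D))             # stand-in for LSTM outputs
W = rng.standard_normal((D, A)) * 0.1
v = rng.standard_normal(A) * 0.1
branch_params = {t: (rng.standard_normal(D) * 0.1, 0.0)
                 for t in ("wake-word", "follow-up")}

u_mean = mean_pooling(frames)
u_attn = attention_pooling(frames, W, v)
p = dialog_type_branch(u_attn, branch_params, "wake-word")
print(u_mean.shape, u_attn.shape, p)             # (64,) (64,) probability

Unlike mean pooling, the learned attention weights let the model emphasize frames that carry directedness cues, which is consistent with the reported relative EER gain; likewise, the per-dialog-type branches let each dialog type learn its own decision boundary over the shared utterance representation.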