Top-down attention in end-to-end spoken language understanding
2021
Spoken language understanding (SLU) is the task of inferring the semantics of spoken utterances. Traditionally, this has been achieved with a cascade of Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU) modules that are optimized separately, which can lead to suboptimal overall performance. More recently, End-to-End SLU (E2E SLU) was proposed to perform SLU directly from speech through joint optimization of the two modules, addressing some of the shortcomings of traditional SLU. A key challenge of this approach is how to best integrate the feature learning of the ASR and NLU sub-tasks to maximize their performance. While ASR models generally focus on low-level features and NLU models need higher-level contextual information, ASR models can nonetheless leverage top-down syntactic and semantic information to improve their recognition. Based on this insight, we propose Top-Down SLU (TD-SLU), a new transformer-based E2E SLU model that uses top-down attention and an attention gate to fuse high-level NLU features with low-level ASR features, leading to better optimization of both tasks. We validated our model on the public FluentSpeech dataset and a large custom dataset. Results show that TD-SLU outperforms selected baselines on both ASR and NLU quality metrics, and suggest that the added high-level syntactic and semantic information improves the model’s performance.
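To make the fusion mechanism concrete, below is a minimal PyTorch sketch of gated top-down attention. It assumes a standard multi-head attention layer in which the low-level ASR features serve as queries over the high-level NLU features, with a sigmoid gate controlling how much top-down context is mixed into each frame; the module and parameter names (TopDownFusion, d_model, n_heads) are illustrative assumptions, not the paper's actual implementation.

# Illustrative sketch of gated top-down fusion; names and layer choices
# are assumptions, not taken from the TD-SLU paper.
import torch
import torch.nn as nn

class TopDownFusion(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # Top-down attention: queries come from the low-level (ASR) stream,
        # keys/values from the high-level (NLU) stream, so semantic context
        # flows down into the acoustic representation.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Attention gate: a learned, sigmoid-bounded mixing coefficient
        # computed from the concatenated bottom-up and top-down features.
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())

    def forward(self, asr_feats: torch.Tensor, nlu_feats: torch.Tensor):
        # asr_feats: (batch, T_asr, d_model) low-level features
        # nlu_feats: (batch, T_nlu, d_model) high-level features
        top_down, _ = self.attn(query=asr_feats, key=nlu_feats, value=nlu_feats)
        g = self.gate(torch.cat([asr_feats, top_down], dim=-1))
        # Gated residual fusion: each ASR frame absorbs as much top-down
        # semantic context as the gate admits.
        return asr_feats + g * top_down

fused = TopDownFusion()(torch.randn(2, 100, 256), torch.randn(2, 12, 256))
print(fused.shape)  # torch.Size([2, 100, 256])

One appeal of the gated residual form is that the model can fall back to purely bottom-up ASR features when the top-down context is uninformative, simply by driving the gate toward zero.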