Contextual-utterance training for automatic speech recognition
Recent studies of streaming automatic speech recognition (ASR) systems based on the recurrent neural network transducer (RNN-T) have fed the encoder with past contextual information in order to improve word error rate (WER) performance. In this paper, we first propose a contextual-utterance training technique that uses the previous and future contextual utterances to implicitly adapt to the speaker, topic, and acoustic environment. We then propose a dual-mode contextual-utterance training technique for streaming ASR systems. This approach makes better use of the available acoustic context in streaming models by distilling “in-place” the knowledge of a teacher (non-streaming mode), which can see both past and future contextual utterances, to a student (streaming mode), which can only see the current and past contextual utterances. The experimental results show that a state-of-the-art conformer-transducer system trained with the proposed techniques outperforms the same system trained with the classical RNN-T loss. Specifically, the proposed techniques reduce the WER by more than 6% relative and the average last-token emission latency by more than 40 ms.
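As a rough illustration (not the authors' implementation), the dual-mode in-place distillation objective described above can be sketched as the sum of the transducer losses of both modes plus a divergence term that pulls the streaming (student) output distribution toward the non-streaming (teacher) one. The function names, the KL-based distillation term, and the weighting factor `alpha` are assumptions for this sketch:

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q):
    """KL(p || q): teacher distribution p, student distribution q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def dual_mode_loss(loss_full, loss_stream, teacher_logits, student_logits,
                   alpha=0.5):
    """Hypothetical combined objective: the RNN-T losses of the
    non-streaming (full-context) and streaming modes, plus an
    in-place distillation term weighted by alpha. In practice the
    teacher distribution would be detached from the gradient."""
    p = softmax(teacher_logits)   # non-streaming mode, sees future context
    q = softmax(student_logits)   # streaming mode, past context only
    return loss_full + loss_stream + alpha * kl_divergence(p, q)
```

When the two modes agree (identical logits), the distillation term vanishes and the objective reduces to the sum of the two transducer losses; the further the streaming distribution drifts from the full-context one, the larger the penalty.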