Can unpaired textual data replace synthetic speech in ASR model adaptation?
2023
To boost training and adaptation of end-to-end (E2E) automatic speech recognition (ASR) models, several approaches have emerged that combine paired speech-text input with unpaired text input. They aim to improve model performance on rare words, personalisation, and the long tail. In this work, we present a systematic study of the impact of such training/adaptation and compare it to training with synthetic utterances generated by text-to-speech (TTS) engines. We experiment with in-house and CommonVoice datasets and conclude that using text data for adaptation is effective, but is outperformed by adapting with synthetic audio, even when the TTS engine is sub-optimal. This challenges recent literature on the difficulties of using TTS data, including catastrophic forgetting, feature misalignment, and pronunciation errors, which motivated the use of text-only adaptation.