Bootstrapping Conversational Speech Recognition System using Neural Machine Translation
Building a conversational speech recognition system for a new language is constrained by the availability of interaction-style utterances. Data collection is often expensive and limited by the speed of manual transcription. In this work, we advocate the use of neural machine translation as a data augmentation technique for bootstrapping language models in factored speech recognition systems. Translation offers a systematic way to incorporate live collections from mature, resource-rich languages. However, the strategy of ingesting raw translations from a general-purpose MT system is not effective, owing to the presence of named entities, intra-sentential code-switching, and the domain mismatch between the conversational data being translated and the parallel text used to train the translation system. We explore sentence-embedding-based data selection and model fine-tuning for adaptation. We derive guidance from in-domain data by rescoring beams and filtering translations. A combination of these techniques yields a relative word error rate reduction of 7.8-15.6%, depending on the bootstrapping phase. Fine-grained analysis reveals that translation aids the underrepresented interaction categories in particular. Experimental evidence establishes the efficacy of translation for supplementing transcribed collections, a strategy which could be instrumental for rapid language expansion.
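The sentence-embedding-based data selection mentioned above can be illustrated with a minimal sketch: score each candidate translation by cosine similarity to the centroid of in-domain sentence embeddings, and keep only those above a threshold. The threshold, the toy vectors, and the function name here are illustrative assumptions, not details from the paper; in practice the embeddings would come from a trained sentence encoder.

```python
import numpy as np

def select_by_similarity(candidate_vecs, in_domain_vecs, threshold=0.7):
    """Return indices of candidate translations whose cosine similarity
    to the in-domain embedding centroid meets the threshold."""
    centroid = in_domain_vecs.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    unit = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    sims = unit @ centroid
    return [i for i, s in enumerate(sims) if s >= threshold]

# Toy 3-dim "embeddings": two in-domain sentences, two candidates.
in_domain = np.array([[1.0, 0.1, 0.0],
                      [0.9, 0.2, 0.1]])
candidates = np.array([[1.0, 0.0, 0.1],   # close to in-domain
                       [0.0, 1.0, 0.9]])  # off-domain
print(select_by_similarity(candidates, in_domain))  # → [0]
```

The same similarity score can also serve as a rescoring feature over translation beams, so that hypotheses closer to the in-domain distribution are preferred before being added to the language-model training pool.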