Neural Machine Translation for Multilingual Grapheme-to-Phoneme Conversion
Grapheme-to-phoneme (G2P) models are a key component of Automatic Speech Recognition (ASR) systems, such as the ASR system in Alexa, where they generate pronunciations for out-of-vocabulary words that are missing from the pronunciation lexicons (mappings such as “e c h o” → “E k oU”). Most G2P systems are monolingual and based on traditional joint-sequence n-gram models. As an alternative, we present a single end-to-end trained neural G2P model that shares the same encoder and decoder across multiple languages. This allows the model to exploit both the universal symbol inventories of Latin-like alphabets and cross-linguistically shared feature representations. Such a model is especially useful for low-resource languages and for code-switched or foreign words, where pronunciations in one language need to be adapted to other locales or accents. We further experiment with a word language distribution vector as an additional training target, which is intended to improve system performance by helping the model decouple pronunciations across languages in the parameter space. We show a 7.2% average improvement in phoneme error rate on low-resource languages and no degradation on high-resource ones compared to monolingual baselines.
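For readers unfamiliar with the evaluation metric, the following is a minimal sketch of how phoneme error rate (PER) is commonly computed. It assumes PER is the total Levenshtein (edit) distance between predicted and reference phoneme sequences divided by the total number of reference phonemes; the abstract does not spell out the exact definition used, and the function names and the example hypothesis below are illustrative only.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme token sequences."""
    # prev[j] holds the distance between the first i-1 reference tokens
    # and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs: Sequence[Sequence[str]],
                       hyps: Sequence[Sequence[str]]) -> float:
    """Assumed definition: total edit errors / total reference phonemes."""
    total_ref = sum(len(r) for r in refs)
    if total_ref == 0:
        raise ValueError("reference set contains no phonemes")
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_errors / total_ref


# Reference pronunciation from the abstract's example ("e c h o" -> "E k oU")
# compared against a hypothetical model output with one substituted phoneme.
reference = "E k oU".split()
hypothesis = "E tS oU".split()  # hypothetical prediction, for illustration only
per = phoneme_error_rate([reference], [hypothesis])
```

Here one of the three reference phonemes is substituted, so `per` is 1/3. Pooling errors over the whole test set, rather than averaging per-word rates, is a design choice of this sketch and not something the abstract specifies.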