Multi-dialect acoustic modeling using phone mapping and online i-vectors
2019
This paper proposes a simple phone mapping approach to multi-dialect acoustic modeling. In contrast to the widely used shared hidden layer (SHL) training approach, in which hidden layers are shared across dialects while output layers are kept separate, phone mapping simplifies model training and maintenance by allowing all network parameters to be shared. It also simplifies online adaptation via HMM-based i-vectors by allowing the same T-matrix to be used for all dialects. Using the LSTM-HMM framework, we compare phone mapping with transfer learning and SHL training, and we compare the efficacy of online i-vectors with that of one-hot dialect encoding. Experiments with a 2K-hour dataset comprising four English dialects show that (1) phone mapping yields significant WER reductions over dialect-specific training (14% on average) and transfer learning (5% on average); (2) SHL training is only slightly better than phone mapping; and (3) i-vectors provide useful additional reductions (3% on average), while one-hot encoding has little effect. Even with a large 40K-hour dataset comprising the same four English dialects and fully optimized sequence-discriminative training, phone mapping provides healthy WER reductions over dialect-specific models (10% on average).
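As a rough illustration of the two ideas contrasted in the abstract, the Python sketch below maps dialect-specific phone symbols onto one shared phone set (so a single output layer can serve every dialect) and builds a one-hot dialect code. It is not the authors' implementation. The dialect tags, phone symbols, mapping table, and function names are all hypothetical and chosen only for illustration.

```python
# Hypothetical mapping tables: each dialect's phone symbols are mapped onto
# one shared phone set. Dialect tags and symbols are illustrative
# assumptions, not taken from the paper.
PHONE_MAP = {
    "en-US": {"AA": "AA", "AO": "AO", "R": "R"},
    "en-GB": {"A:": "AA", "O:": "AO", "R": "R"},
    "en-IN": {"a": "AA", "o": "AO", "r": "R"},
    "en-AU": {"A:": "AA", "O:": "AO", "r": "R"},
}

# Fixed dialect order used for the one-hot code.
DIALECTS = sorted(PHONE_MAP)


def map_phones(dialect, phones):
    """Rewrite a dialect-specific phone sequence in the shared phone set."""
    table = PHONE_MAP[dialect]
    return [table[p] for p in phones]


def shared_phone_set():
    """Collect the shared targets that every dialect maps onto."""
    return sorted({t for table in PHONE_MAP.values() for t in table.values()})


def one_hot_dialect(dialect):
    """One-hot dialect code, the simpler alternative to an i-vector input."""
    return [1.0 if d == dialect else 0.0 for d in DIALECTS]


if __name__ == "__main__":
    print(map_phones("en-GB", ["A:", "R"]))
    print(shared_phone_set())
    print(one_hot_dialect("en-GB"))
```

Because every dialect's transcriptions end up in the same symbol set, one network with one output layer can be trained on the pooled data, which is the simplification the abstract attributes to phone mapping. In SHL training, by contrast, each dialect would keep its own output layer.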