Leveraging unlabeled speech for sequence discriminative training of acoustic models
State-of-the-art acoustic modeling (AM) techniques use long short-term memory (LSTM) networks and apply multiple phases of training on large amounts of labeled acoustic data: initial cross-entropy (CE) or connectionist temporal classification (CTC) training, followed by sequence discriminative training such as state-level Minimum Bayes Risk (sMBR). Recently, there has been considerable interest in applying semi-supervised learning (SSL) methods that leverage substantial amounts of unlabeled speech to improve AM. This paper proposes a novel Teacher-Student based knowledge distillation (KD) approach for sequence discriminative training, in which the reference state sequences of unlabeled data are estimated using a strong bidirectional LSTM Teacher model and then used to guide the sMBR training of an LSTM Student model. We build a strong supervised LSTM AM baseline by using 45,000 hours of labeled multi-dialect English data for the initial CE or CTC training stage, and an 11,000-hour British English subset of that data for the sMBR training phase. To demonstrate the efficacy of the proposed approach, we leverage an additional 38,000 hours of unlabeled British English data at the sMBR stage only, which yields a relative Word Error Rate (WER) improvement of 6% to 11% over the supervised baselines in clean and noisy test conditions.
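As a rough illustration of the Teacher-Student idea described above, the following minimal Python sketch shows how a Teacher's per-frame state posteriors could be turned into a pseudo-reference state sequence, and how a Student could be scored against it. It rests on several assumptions not stated in the abstract: the Teacher's reference is taken as a simple per-frame argmax (an actual system would more likely decode with a lexicon and language model), and the Student's score is a frame-level expected state accuracy rather than the lattice-based expectation over competing hypotheses that full sMBR training uses. The function names `pseudo_reference`, `expected_frame_accuracy`, and `relative_wer_improvement` are hypothetical and chosen only for this example.

```python
from typing import List, Sequence


def pseudo_reference(teacher_posteriors: Sequence[Sequence[float]]) -> List[int]:
    """Pick the most probable state per frame from the Teacher's posteriors.

    Assumption: a per-frame argmax stands in for the Teacher's estimated
    reference state sequence on unlabeled audio.
    """
    return [
        max(range(len(frame)), key=frame.__getitem__)
        for frame in teacher_posteriors
    ]


def expected_frame_accuracy(
    student_posteriors: Sequence[Sequence[float]],
    reference: Sequence[int],
) -> float:
    """Sum the Student's posterior mass on the reference state of each frame.

    Assumption: this frame-level quantity is a simplified stand-in for the
    expected state accuracy that sMBR maximizes over a hypothesis lattice.
    """
    if len(student_posteriors) != len(reference):
        raise ValueError("posteriors and reference must cover the same frames")
    return sum(frame[state] for frame, state in zip(student_posteriors, reference))


def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Relative WER improvement over a baseline, in percent."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Toy posteriors over two states for two frames (illustrative values only).
    teacher = [[0.1, 0.9], [0.8, 0.2]]
    student = [[0.5, 0.5], [0.75, 0.25]]
    reference = pseudo_reference(teacher)
    print(reference, expected_frame_accuracy(student, reference))
```

In this simplified view, training the Student amounts to raising its expected accuracy against the Teacher-derived reference, so no human transcription is needed for the unlabeled audio.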