Using data augmentation and consistency regularization to improve semi-supervised speech recognition
State-of-the-art automatic speech recognition (ASR) networks use attention mechanisms and optimize a transducer loss on labeled acoustic data. Recently, semi-supervised learning (SSL) techniques that leverage large amounts of unlabeled data have become an active area of interest for improving the performance of ASR networks. In this paper, we approach SSL through the framework of consistency regularization, where data augmentation transforms are used to make ASR network predictions invariant to perturbations in the acoustic data. To increase data diversity, we present a combination technique that randomly fuses multiple waveform and feature transforms. For each unlabeled acoustic waveform, we generate two versions of the unaugmented input: a weakly augmented one and a strongly augmented one. During training, a semi-supervised loss enforces consistent outputs between the weak and strong augmentations of the unlabeled input. Moreover, we employ a model averaging technique to generate stable outputs over time. We compare the proposed approach against standard SSL strategies such as iterative self-labeling and demonstrate its benefits. We leverage over 100,000 hours of unlabeled data to train the ASR network using a streaming transducer loss and reach improvements in the range of 8%-12% over the self-labeling baseline.
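As a rough illustration of the training signal described above, the following minimal Python sketch builds a weak and a strong view of one unlabeled waveform, measures how much the predictions on the two views disagree, and keeps an averaged copy of the weights. Everything concrete here is an assumption made for illustration and is not taken from the paper: the three toy transforms, the linear `toy_logits` stand-in for the ASR network, the KL divergence as the consistency measure, the use of the averaged weights to produce the weak-view target, and the exponential moving average as the form of model averaging.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); zero when the two distributions match."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical transforms, chosen only for illustration.
def random_gain(wave, rng):
    gain = rng.uniform(0.9, 1.1)
    return [gain * x for x in wave]


def add_noise(wave, rng):
    return [x + rng.gauss(0.0, 0.1) for x in wave]


def time_mask(wave, rng):
    width = max(1, len(wave) // 4)
    start = rng.randrange(len(wave) - width + 1)
    return [0.0 if start <= i < start + width else x for i, x in enumerate(wave)]


STRONG_POOL = [random_gain, add_noise, time_mask]


def weak_augment(wave, rng):
    """Weak view: a single mild perturbation (assumed here to be a gain change)."""
    return random_gain(wave, rng)


def strong_augment(wave, rng, num_ops=2):
    """Strong view: fuse num_ops transforms drawn at random from the pool."""
    out = list(wave)
    for transform in rng.sample(STRONG_POOL, k=num_ops):
        out = transform(out, rng)
    return out


def toy_logits(weights, wave):
    """Stand-in for the ASR network: one linear layer, one output per row."""
    return [sum(w * x for w, x in zip(row, wave)) for row in weights]


def consistency_loss(student_w, averaged_w, wave, rng):
    """Penalize disagreement between weak-view and strong-view predictions."""
    target = softmax(toy_logits(averaged_w, weak_augment(wave, rng)))
    pred = softmax(toy_logits(student_w, strong_augment(wave, rng)))
    return kl_divergence(target, pred)


def ema_update(avg_w, new_w, decay=0.99):
    """Move the averaged weights a small step toward the current weights."""
    return [[decay * a + (1.0 - decay) * n for a, n in zip(ra, rn)]
            for ra, rn in zip(avg_w, new_w)]


rng = random.Random(0)
wave = [math.sin(0.3 * t) for t in range(16)]  # stand-in unlabeled waveform
student = [[rng.uniform(-1, 1) for _ in wave] for _ in range(3)]
averaged = [row[:] for row in student]
loss = consistency_loss(student, averaged, wave, rng)
averaged = ema_update(averaged, student)
```

In a real system the loss would be backpropagated through the prediction on the strong view only, with the weak-view target treated as a constant; the sketch leaves out the gradient step to stay dependency-free.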