Robust online i-vectors for unsupervised adaptation of DNN acoustic models: A study in the context of digital voice assistants
Supplementing log filter-bank energies with i-vectors is a popular method for adaptive training of deep neural network acoustic models. While offline i-vectors (where the target utterance or other relevant adaptation material is available for i-vector extraction before decoding) have been well studied, there is little analysis of online i-vectors and their robustness in multi-user scenarios, where speaker changes can be frequent and unpredictable. The authors of  showed that online adaptation can be achieved through segmental i-vectors computed from the hidden Markov model (HMM) state alignments of utterances decoded in the recent past. While this approach works well in general, it can be rendered ineffective by speaker changes. In this paper, we study robust extensions of the ideas proposed in  by: (a) updating i-vectors on a per-frame basis from the incoming target utterance, and (b) using lattice posteriors instead of one-best HMM state alignments. Experiments with different i-vector implementations show that: (a) when speaker changes occur, lattice-based frame-level i-vectors provide up to a 6% relative word error rate reduction over the baseline , and (b) online i-vectors are, in general, more effective when the microphone characteristics of the test utterances are not seen in training.
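The per-frame update described above can be illustrated with a minimal sketch. This is not the paper's implementation: the toy dimensions, the random model parameters, and the Gaussian-mixture responsibilities standing in for lattice (or one-best) posteriors are all assumptions made for illustration. The sketch accumulates zeroth- and first-order Baum-Welch statistics one frame at a time and computes the standard closed-form i-vector point estimate from the current statistics, so an i-vector is available after every frame of the incoming utterance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only, not from the paper)
C, D, R = 4, 3, 2  # UBM components, feature dim, i-vector dim

# Hypothetical UBM and total-variability model parameters
means = rng.normal(size=(C, D))              # UBM component means
inv_var = np.ones((C, D))                    # diagonal precisions (identity here)
T = rng.normal(scale=0.1, size=(C * D, R))   # total-variability matrix

# Running per-frame accumulators: zeroth- and first-order statistics
N = np.zeros(C)        # component occupancy counts
F = np.zeros((C, D))   # centered first-order statistics

def update_stats(frame, post):
    """Accumulate statistics for one frame.

    `post` holds per-component posteriors for this frame; in the paper's
    setting these would come from lattice posteriors (or a one-best
    HMM state alignment) rather than the toy responsibilities used below.
    """
    N[:] += post
    F[:] += post[:, None] * (frame[None, :] - means)

def extract_ivector():
    """Closed-form MAP point estimate of the i-vector from current stats:
    w = (I + T' S N T)^{-1} T' S F, with S the diagonal precisions."""
    precision = np.eye(R)
    rhs = np.zeros(R)
    for c in range(C):
        Tc = T[c * D:(c + 1) * D]        # rows of T for component c
        Sc = inv_var[c][:, None] * Tc    # precision-scaled rows
        precision += N[c] * (Tc.T @ Sc)
        rhs += Tc.T @ (inv_var[c] * F[c])
    return np.linalg.solve(precision, rhs)

# With no accumulated statistics, the estimate falls back to the prior mean (zero).
w0 = extract_ivector()

# Stream a few frames: compute toy responsibilities, update, re-extract.
for _ in range(20):
    frame = rng.normal(size=D)
    logp = -0.5 * np.sum((frame[None, :] - means) ** 2, axis=1)
    post = np.exp(logp - logp.max())
    post /= post.sum()
    update_stats(frame, post)
    w = extract_ivector()  # an i-vector is available after every frame
```

Lattice posteriors would enter through `post`: instead of hard one-best state occupancies (0/1 per component), each frame contributes soft counts spread over the competing states in the lattice, which makes the accumulated statistics less sensitive to early recognition errors.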