Residual adapters for targeted updates in RNN-transducer based speech recognition system
This paper investigates an approach for adapting an RNN-Transducer (RNN-T) based automatic speech recognition (ASR) model to improve the recognition of words unseen during training. Prior work has shown that it is possible to incrementally fine-tune the ASR model to recognize multiple sets of new words. However, this creates a dependency between the updates, which is not ideal for the hot-fixing use-case, where we want each update to be applied independently of the others. We propose to train residual adapters on the RNN-T model and combine them on-the-fly through adapter-fusion. We investigate several approaches for combining the adapters so that they maintain the ability to recognize new words with only minimal degradation on the usual user requests. Specifically, sum-fusion, which sums the outputs of adapters inserted in parallel, achieves over 90% recall on the new words with less than 1% relative WER degradation on the usual data compared to the original RNN-T model.
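The abstract does not give the exact adapter architecture; a minimal sketch of sum-fusion, assuming standard bottleneck residual adapters (down-projection, nonlinearity, up-projection, plus a residual connection), could look like the following. The names `residual_adapter`, `sum_fusion`, `w_down`, and `w_up` are illustrative, not from the paper.

```python
import numpy as np

def residual_adapter(h, w_down, w_up):
    # Bottleneck adapter: down-project the hidden state, apply a ReLU,
    # up-project back, then add the input (residual connection).
    return h + w_up @ np.maximum(w_down @ h, 0.0)

def sum_fusion(h, adapters):
    # Sum-fusion: all adapters see the same input in parallel, and their
    # bottleneck outputs are summed onto the shared residual stream.
    delta = sum(w_up @ np.maximum(w_down @ h, 0.0)
                for (w_down, w_up) in adapters)
    return h + delta
```

With a single adapter, sum-fusion reduces to the plain residual adapter; with several, each adapter's update is applied independently of the others, which matches the hot-fixing use-case described above.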