AMuSE: Attentive multilingual speech encoding for zero-prior ASR
2024
Multilingual ASR offers training, deployment, and overall performance benefits, but models trained via simple data pooling are known to suffer from cross-lingual interference. Oracle language information (exact-prior) and language-specific parameters are usually leveraged to overcome this, but such approaches cannot enable seamless, truly multilingual experiences. Existing methods try to overcome this limitation by relying on inferred language information or language-agnostic mixture-of-experts, but they incur additional runtime complexity and/or training cost and are less effective in streaming scenarios. Building on previous studies where models were trained to handle a mixed prior (knowledge that the underlying language belongs to a known group), we propose Attentive Multilingual Speech Encoding (AMuSE), a training framework designed to match exact-prior performance even in the absence of underlying language information at runtime (zero-prior), thereby making the model prior-agnostic. Leveraging AMuSE, we build a zero-prior-enabled LLM-based ASR system that outperforms several exact-prior-driven state-of-the-art benchmarks.
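To make the prior terminology concrete, the sketch below illustrates one generic way a multilingual encoder can consume an exact prior (a one-hot language distribution), a mixed prior (a distribution restricted to a language group), or no prior at all (weights inferred attentively from the audio). This is a minimal, hypothetical illustration of the prior-conditioning idea, not the AMuSE architecture itself; the module and parameter names (`AttentivePriorPooling`, `lang_proj`, `scorer`) are assumptions introduced only for this example.

```python
import torch
import torch.nn as nn


class AttentivePriorPooling(nn.Module):
    """Illustrative sketch: combine per-language encodings under an optional
    language prior. With an exact or mixed prior the weights are supplied;
    with zero prior they are inferred by attention over the encodings."""

    def __init__(self, dim: int, num_langs: int):
        super().__init__()
        # One lightweight projection per supported language (hypothetical design choice).
        self.lang_proj = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_langs))
        # Scorer used to infer language weights when no prior is given.
        self.scorer = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor, prior: torch.Tensor | None = None) -> torch.Tensor:
        # x: (batch, time, dim) shared encoder output.
        # prior: optional (batch, num_langs) distribution over languages.
        per_lang = torch.stack([proj(x) for proj in self.lang_proj], dim=1)  # (B, L, T, D)
        if prior is None:
            # Zero-prior path: score each language branch from its utterance-level summary.
            scores = self.scorer(per_lang.mean(dim=2)).squeeze(-1)           # (B, L)
            prior = torch.softmax(scores, dim=-1)
        # Weighted combination of the language-specific encodings.
        return torch.einsum("bl,bltd->btd", prior, per_lang)


if __name__ == "__main__":
    enc = AttentivePriorPooling(dim=256, num_langs=8)
    feats = torch.randn(2, 100, 256)
    exact = torch.nn.functional.one_hot(torch.tensor([3, 5]), 8).float()
    print(enc(feats, exact).shape)  # exact-prior conditioning
    print(enc(feats, None).shape)   # zero-prior inference
```

A prior-agnostic training recipe in this spirit would expose the model to all three conditioning regimes during training so that a single deployed model behaves consistently whether or not language information is available at runtime.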