Interleaved audio/audiovisual transfer learning for AV-ASR in low-resourced languages

By Zhengyang Li, Patrick Blumenberg, Jing Liu, Thomas Graave, Timo Lohrenz, Siegfried Kunzmann, Tim Fingscheidt
2024
Cross-language transfer learning from English to a target language has shown effectiveness in low-resourced audiovisual speech recognition (AV-ASR). We first investigate a 2-stage protocol, which fine-tunes the English pre-trained AV encoder on a large audio corpus in the target language (1st stage) and then carries out cross-modality transfer learning from audio to AV in the target language for AV-ASR (2nd stage). Second, we propose an alternative interleaved audio/audiovisual transfer learning to avoid catastrophic forgetting of the video modality and to overcome 2nd-stage overfitting to the small AV corpus. We use only 10 h of AV training data with either German or French as the target language. Our proposed interleaved method outperforms the 2-stage method in all low-resource conditions and both languages. It also surpasses the previous state of the art both in the noisy condition (babble noise, 0 dB: 53.9% vs. 65.9%) and in the clean condition (34.9% vs. 48.1%) on the German MuAVIC test set.
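The sketch below illustrates the interleaving idea only, not the paper's actual architecture or recipe: training steps alternate between a large audio-only target-language corpus and the small (~10 h) AV corpus, so the video pathway keeps receiving gradients (mitigating catastrophic forgetting) while the abundant audio data limits overfitting to the small AV set. The toy model, feature shapes, corpus sizes, and hyperparameters are all placeholder assumptions.

```python
import itertools
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for the real corpora: a large audio-only set and a small AV set
# in the target language (feature dimensions and sizes are illustrative only).
audio_only = TensorDataset(torch.randn(640, 80), torch.randint(0, 30, (640,)))
audiovisual = TensorDataset(torch.randn(64, 80), torch.randn(64, 512),
                            torch.randint(0, 30, (64,)))

audio_loader = DataLoader(audio_only, batch_size=16, shuffle=True)
av_loader = DataLoader(audiovisual, batch_size=16, shuffle=True)


class ToyAVModel(nn.Module):
    """Placeholder for an English-pre-trained AV encoder plus ASR head."""

    def __init__(self):
        super().__init__()
        self.audio_enc = nn.Linear(80, 256)
        self.video_enc = nn.Linear(512, 256)
        self.head = nn.Linear(256, 30)

    def forward(self, audio, video=None):
        h = self.audio_enc(audio)
        if video is not None:          # fuse video features when available
            h = h + self.video_enc(video)
        return self.head(h)


model = ToyAVModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# Interleaved fine-tuning: alternate an audio-only step with an audiovisual
# step, cycling the small AV corpus against the larger audio-only corpus.
for audio_batch, av_batch in zip(audio_loader, itertools.cycle(av_loader)):
    # Audio-only update (video input dropped).
    feats, targets = audio_batch
    loss = loss_fn(model(feats), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Audiovisual update on the small AV corpus keeps the video modality active.
    a_feats, v_feats, targets = av_batch
    loss = loss_fn(model(a_feats, v_feats), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In this toy form, the contrast with the 2-stage protocol is simply ordering: the 2-stage protocol would run all audio-only steps first and all AV steps afterwards, whereas the interleaved loop mixes them throughout training.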
