Interleaved audio/audiovisual transfer learning for AV-ASR in low-resourced languages

By Zhengyang Li, Patrick Blumenberg, Jing Liu, Thomas Graave, Timo Lohrenz, Siegfried Kunzmann, Tim Fingscheidt
2024
Cross-language transfer learning from English to a target language has shown effectiveness in low-resourced audiovisual speech recognition (AV-ASR). We first investigate a 2-stage protocol, which fine-tunes the English pre-trained AV encoder on a large audio corpus in the target language (1st stage), and then carries out cross-modality transfer learning from audio to AV in the target language for AV-ASR (2nd stage). Second, we propose an alternative interleaved audio/audiovisual transfer learning to avoid catastrophic forgetting of the video modality and to overcome 2nd-stage overfitting to the small AV corpus. We use only 10h of AV training data in either German or French as the target language. Our proposed interleaved method outperforms the 2-stage method in all low-resource conditions and in both languages. It also surpasses the former state of the art both in the noisy benchmark (babble 0dB, 53.9% vs. 65.9%) and in the clean condition (34.9% vs. 48.1%) on the German MuAVIC test set.
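The core idea of the interleaved method is to keep revisiting the small audiovisual corpus while most updates come from the large audio-only corpus, so the video modality is never forgotten. The abstract does not specify the mixing schedule, so the sketch below uses a hypothetical fixed ratio of audio-only to audiovisual batches; the function names and the `ratio` parameter are illustrative assumptions, not the paper's implementation.

```python
import itertools

def interleaved_schedule(audio_batches, av_batches, ratio=3):
    """Build a training order that mixes `ratio` audio-only batches
    per audiovisual batch (hypothetical ratio, not from the paper).

    The small AV corpus is cycled, so AV batches keep recurring
    throughout training and the video modality is revisited instead
    of being fine-tuned away on audio alone.
    """
    av_cycle = itertools.cycle(av_batches)  # reuse the small AV corpus
    schedule = []
    for i, audio_batch in enumerate(audio_batches, start=1):
        schedule.append(("audio", audio_batch))
        if i % ratio == 0:  # every `ratio` audio batches, insert one AV batch
            schedule.append(("audiovisual", next(av_cycle)))
    return schedule

# Example: 6 batches from the large audio corpus, 2 from the 10h AV corpus
sched = interleaved_schedule([f"a{i}" for i in range(6)], ["av0", "av1"], ratio=3)
```

A 2-stage protocol would instead exhaust all audio batches first and only then train on the AV batches; the interleaving above is what distinguishes the proposed method from that baseline.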
