Interleaved audio/audiovisual transfer learning for AV-ASR in low-resourced languages

By Zhengyang Li, Patrick Blumenberg, Jing Liu, Thomas Graave, Timo Lohrenz, Siegfried Kunzmann, Tim Fingscheidt
2024
Cross-language transfer learning from English to a target language has shown effectiveness in low-resourced audiovisual speech recognition (AV-ASR). We first investigate a 2-stage protocol, which fine-tunes the English pre-trained AV encoder on a large audio corpus in the target language (1st stage), and then carries out cross-modality transfer learning from audio to AV in the target language for AV-ASR (2nd stage). Second, we propose an alternative interleaved audio/audiovisual transfer learning to avoid catastrophic forgetting of the video modality and to overcome 2nd-stage overfitting to the small AV corpus. We use only 10h of AV training data in either German or French as the target language. Our proposed interleaved method outperforms the 2-stage method in all low-resource conditions and in both languages. It also surpasses the former state of the art both in the noisy benchmark (babble 0dB, 53.9% vs. 65.9%) and in the clean condition (34.9% vs. 48.1%) on the German MuAVIC test set.
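The core idea of the interleaved method is to keep revisiting the small audiovisual corpus while most updates come from the large audio-only corpus, so the video modality is never forgotten. The abstract does not specify the mixing schedule, so the sketch below uses a hypothetical fixed ratio of audio-only to audiovisual batches; the function names and the `ratio` parameter are illustrative assumptions, not the paper's implementation.

```python
import itertools

def interleaved_schedule(audio_batches, av_batches, ratio=3):
    """Build a training order that mixes `ratio` audio-only batches
    per audiovisual batch (hypothetical ratio, not from the paper).

    The small AV corpus is cycled, so AV batches keep recurring
    throughout training and the video modality is revisited instead
    of being fine-tuned away on audio alone.
    """
    av_cycle = itertools.cycle(av_batches)  # reuse the small AV corpus
    schedule = []
    for i, audio_batch in enumerate(audio_batches, start=1):
        schedule.append(("audio", audio_batch))
        if i % ratio == 0:  # every `ratio` audio batches, insert one AV batch
            schedule.append(("audiovisual", next(av_cycle)))
    return schedule

# Example: 6 batches from the large audio corpus, 2 from the 10h AV corpus
sched = interleaved_schedule([f"a{i}" for i in range(6)], ["av0", "av1"], ratio=3)
```

A 2-stage protocol would instead exhaust all audio batches first and only then train on the AV batches; the interleaving above is what distinguishes the proposed method from that baseline.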
