Joint ASR and language identification using RNN-T: An efficient approach to dynamic language switching
2021
Conventional dynamic language switching enables seamless multilingual interactions by running several monolingual ASR systems in parallel and triggering the appropriate downstream components using a standalone language identification (LID) service. Since this solution is neither scalable nor cost- and memory-efficient, especially for on-device applications, we propose end-to-end, streaming, joint ASR-LID architectures based on the recurrent neural network transducer framework. Two key formulations are explored: (1) joint training using a unified output space for ASR and LID vocabularies, and (2) joint training viewed as multi-task optimization. We also evaluate the benefit of using auxiliary language information obtained on-the-fly from an acoustic LID classifier. Experiments with the English-Hindi language pair show that: (a) multi-task architectures perform better overall, and (b) the best joint architecture surpasses monolingual ASR (6.4–9.2% word error rate reduction) and acoustic LID (53.9–56.1% error rate reduction) baselines while reducing the overall memory footprint by up to 46%.
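The two formulations can be illustrated with a minimal Python sketch. It is not the paper's implementation: the helper names (`build_unified_vocab`, `tag_transcript`, `multitask_loss`), the `<en>`/`<hi>` tag symbols, the choice to append the tag at the end of the transcript, and the interpolation weight `lid_weight` are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import Dict, List


def build_unified_vocab(asr_tokens: List[str], lid_tags: List[str]) -> Dict[str, int]:
    """Formulation (1), sketched: one output space holding ASR tokens and LID tags.

    Index 0 is reserved for the transducer blank symbol (a common convention,
    assumed here).
    """
    vocab: Dict[str, int] = {"<blank>": 0}
    for token in list(asr_tokens) + list(lid_tags):
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


def tag_transcript(tokens: List[str], lang_tag: str) -> List[str]:
    """Append a language tag so one decoder emits both words and language.

    The tag position (end of utterance) is an assumption, not taken from the paper.
    """
    return list(tokens) + [lang_tag]


def multitask_loss(asr_loss: float, lid_loss: float, lid_weight: float = 0.5) -> float:
    """Formulation (2), sketched: weighted sum of the ASR loss and a separate LID loss.

    lid_weight is a hypothetical hyperparameter.
    """
    return asr_loss + lid_weight * lid_loss


# Toy English-Hindi example with made-up tokens and loss values.
vocab = build_unified_vocab(["hello", "नमस्ते"], ["<en>", "<hi>"])
target = tag_transcript(["hello"], "<en>")
target_ids = [vocab[t] for t in target]
combined = multitask_loss(asr_loss=2.0, lid_loss=1.0, lid_weight=0.5)
```

In the unified formulation a single softmax covers both words and language tags, whereas the multi-task formulation keeps a separate LID objective and trades it off against the ASR objective through the weight.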