Multi-lingual multi-task speech emotion recognition using wav2vec 2.0

2022
Download Copy BibTeX
Copy BibTeX
Speech Emotion Recognition (SER) has several use cases for Digital Entertainment Content (DEC) in Over-the-top (OTT) services, emotive Text-to-Speech (TTS) engines and voice assistants. In this work, we present a Multi-Lingual (MLi) and Multi-Task Learning (MTL) audio only SER system based on the multi-lingual pre-trained wav2vec 2.0 model. The model is fine-tuned on 25 open source datasets in 13 locales across 7 emotion categories. We show that, a) Our wav2vec 2.0 single task based model outperforms Pre-trained Audio Neural Network (PANN) based single task pre-trained model by 7.2% (relative), b) The best MTL model outperforms the PANN based and wav2vec 2.0 based single task models by 8.6% and 1.7% (relative) respectively, c) The MTL based system outperforms pre-trained single task wav2vec 2.0 model in 9 out of 13 locales in terms of weighted F1 scores, and d) The MTL-MLi wav2vec 2.0 outperforms the state-of-the-art for the languages contained in the pre-training corpora.
Research areas

Latest news

GB, MLN, Edinburgh
We’re looking for a Machine Learning Scientist in the Personalization team for our Edinburgh office experienced in generative AI and large models. You will be responsible for developing and disseminating customer-facing personalized recommendation models. This is a hands-on role with global impact working with a team of world-class engineers and scientists across the Edinburgh offices and wider organization. You will lead the design of machine learning models that scale to very large quantities of data, and serve high-scale low-latency recommendations to all customers worldwide. You will embody scientific rigor, designing and executing experiments to demonstrate the technical efficacy and business value of your methods. You will work alongside aRead more