Confidence estimation for speech emotion recognition based on the relationship between emotion categories and primitives

2022

Confidence estimation for speech emotion recognition (SER) is instrumental in improving the reliability of downstream applications. In this work we propose (1) a novel confidence metric for SER based on the relationship between the emotion primitives (arousal, valence, and dominance; AVD) and emotion categories (ECs); (2) EmoConfidNet, a DNN trained alongside the EC recognizer to predict the proposed confidence metric; and (3) a data filtering technique that enhances the training of both EmoConfidNet and the EC recognizer. For each training sample, we compute the distances from its AVD annotation vector to the centroid of each EC in AVD space and define EC confidences as functions of these distances. EmoConfidNet is trained to predict these confidences from the same acoustic representations used to train the EC recognizer. EmoConfidNet outperforms state-of-the-art confidence estimation methods on the MSP-Podcast and IEMOCAP datasets: for a fixed EC recognizer, rejecting the same number of low-confidence predictions with EmoConfidNet yields higher F1 and unweighted average recall (UAR) than rejecting with competing methods.
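As a rough illustration of the centroid-distance construction described in the abstract, the sketch below computes per-EC centroids from AVD annotations and maps a sample's distances to those centroids into per-class confidence scores. The Euclidean distance, the softmax over negative distances, the temperature `tau`, and all function names are illustrative assumptions; the abstract does not specify the exact form of the confidence function.

```python
import numpy as np

def ec_centroids(avd: np.ndarray, labels: np.ndarray, num_classes: int) -> np.ndarray:
    """Mean AVD annotation vector per emotion category.

    avd:    (N, 3) arousal/valence/dominance annotations
    labels: (N,) integer EC labels
    Returns a (num_classes, 3) array of centroids.
    """
    return np.stack([avd[labels == c].mean(axis=0) for c in range(num_classes)])

def ec_confidences(avd_vec: np.ndarray, centroids: np.ndarray, tau: float = 1.0) -> np.ndarray:
    """Turn distances to EC centroids into per-class confidences.

    A softmax over negative distances is one plausible choice of
    'function of the evaluated distances'; it is an assumption here,
    not the paper's stated definition.
    """
    d = np.linalg.norm(centroids - avd_vec, axis=1)  # distance to each EC centroid
    z = -d / tau
    z -= z.max()                                     # numerical stability
    e = np.exp(z)
    return e / e.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    avd = rng.uniform(1, 7, size=(100, 3))   # e.g., 7-point-scale AVD ratings (assumed)
    labels = rng.integers(0, 4, size=100)    # 4 hypothetical ECs
    cents = ec_centroids(avd, labels, num_classes=4)
    print(ec_confidences(avd[0], cents))     # per-EC confidence vector summing to 1
```

Under this reading, these distance-derived confidences would serve as regression targets for EmoConfidNet, which learns to predict them directly from the acoustic representations at inference time, when no AVD annotations are available.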