Smoothing model predictions using adversarial training procedures for speech based emotion recognition
2018
Training a discriminative classifier involves learning a conditional distribution p(y_i | x_i), given a set of feature vectors x_i and the corresponding labels y_i, i = 1, …, N. For the classifier to generalize and not overfit the training data, the learned conditional distribution p(y_i | x_i) should vary smoothly over the inputs x_i. Adversarial training procedures enforce this smoothness through manifold regularization, which makes the model's output distribution more robust to local perturbations of a datapoint x_i. In this paper, we apply adversarial training procedures to improve the accuracy of a deep neural network based emotion recognition system that uses speech cues. Specifically, we investigate two training procedures: (i) adversarial training, where the adversarial direction is determined from the given labels of the training data, and (ii) virtual adversarial training, where the adversarial direction is determined using only the model's output distribution on the training data. We demonstrate the efficacy of these procedures through a k-fold cross-validation experiment on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus and a cross-corpus performance analysis on three separate corpora. Results show improvement over a purely supervised approach, as well as better generalization to cross-corpus settings.
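As a minimal sketch of the two perturbation rules described above (not the paper's implementation), consider a linear softmax classifier in NumPy: adversarial training picks the direction that most increases the label-based cross-entropy loss, while virtual adversarial training finds a label-free direction that most changes the output distribution, via one step of power iteration. The hyperparameters `eps`, `xi`, and `n_iter` are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def adversarial_direction(x, y_onehot, W, b, eps=0.1):
    """(i) Adversarial training: direction that most increases the
    cross-entropy loss, computed from the given labels (FGSM-style)."""
    p = softmax(x @ W + b)
    grad_x = (p - y_onehot) @ W.T      # d(cross-entropy)/dx for softmax
    return eps * np.sign(grad_x)

def virtual_adversarial_direction(x, W, b, eps=0.1, xi=1e-6, n_iter=1,
                                  rng=None):
    """(ii) Virtual adversarial training: label-free direction that most
    changes the output distribution (KL divergence), via power iteration
    with a finite-difference gradient."""
    rng = rng or np.random.default_rng(0)
    p = softmax(x @ W + b)
    d = rng.standard_normal(x.shape)
    d /= np.linalg.norm(d, axis=-1, keepdims=True)
    for _ in range(n_iter):
        q = softmax((x + xi * d) @ W + b)
        g = (q - p) @ W.T              # grad of KL(p || q) w.r.t. perturbation
        d = g / (np.linalg.norm(g, axis=-1, keepdims=True) + 1e-12)
    return eps * d
```

During training, the smoothness penalty would then be the divergence between the model's outputs at x and at x plus the returned perturbation, added to the supervised loss.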