Natural Language Understanding (NLU) systems such as chatbots and virtual assistants have seen a significant rise in popularity in recent years, thanks to the availability of large volumes of user data. However, the user data typically collected for training such models may suffer from sampling biases arising from a variety of factors. In this paper, we study the impact of bias in the training data for the intent classification task, a core component of NLU systems. We experiment with three kinds of data bias settings: (i) random down-sampling, (ii) class-dependent bias injection, and (iii) class-independent bias injection. For each setting, we report the loss in model performance and survey mitigation strategies from two families of methods: (i) semi-supervised learning (SSL) and (ii) synthetic data generation. Overall, we find that while both methods perform well with random down-sampling, synthetic data generation outperforms SSL when only biased training data is available.
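To make the three bias settings concrete, the sketch below shows one plausible way to inject each of them into a labeled intent dataset. The function names, keep rates, and the length-based rule for the class-independent case are illustrative assumptions for this sketch, not the paper's actual experimental protocol.

```python
import random

def random_downsample(data, keep_frac, seed=0):
    """Setting (i): drop a uniform fraction of examples at random."""
    rng = random.Random(seed)
    return [ex for ex in data if rng.random() < keep_frac]

def class_dependent_bias(data, class_keep_fracs, default_frac=1.0, seed=0):
    """Setting (ii): down-sample each intent class at its own rate,
    skewing the label distribution."""
    rng = random.Random(seed)
    return [
        (text, label) for text, label in data
        if rng.random() < class_keep_fracs.get(label, default_frac)
    ]

def class_independent_bias(data, keep_prob_fn, seed=0):
    """Setting (iii): down-sample based on a property of the utterance
    itself, independent of its intent label."""
    rng = random.Random(seed)
    return [
        (text, label) for text, label in data
        if rng.random() < keep_prob_fn(text)
    ]

# Hypothetical usage: bias the sample toward short utterances,
# regardless of intent class.
data = [
    ("book a flight", "BookFlight"),
    ("play some upbeat jazz music please", "PlayMusic"),
]
biased = class_independent_bias(
    data,
    keep_prob_fn=lambda text: 0.9 if len(text.split()) <= 4 else 0.3,
)
```

Under this framing, setting (i) preserves the data distribution in expectation, while settings (ii) and (iii) systematically distort it, which is what makes them harder to mitigate.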