Sampling bias in NLU models: Impact and mitigation

By Zefei Li, Anil Ramakrishna, Anna Rumshisky, Andy Rosenbaum, Saleh Soltan, Rahul Gupta
2023
Natural Language Understanding (NLU) systems such as chatbots and virtual assistants have seen a significant rise in popularity in recent times, thanks to the availability of large volumes of user data. However, typical user data collected for training such models may suffer from sampling biases due to a variety of factors. In this paper, we study the impact of training-data bias on intent classification, a core component of NLU systems. We experiment with three data bias settings: (i) random down-sampling, (ii) class-dependent bias injection, and (iii) class-independent bias injection. For each setting, we report the resulting loss in model performance and evaluate mitigation strategies from two families of methods: (i) semi-supervised learning (SSL) and (ii) synthetic data generation. Overall, we find that while both methods perform well under random down-sampling, synthetic data generation outperforms SSL when only biased training data is available.
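The abstract does not spell out how the three bias settings are constructed, so the following is only a minimal sketch of how such settings could be injected into an intent-classification training set. The function and parameter names (`inject_bias`, `keep_frac`, the per-intent keep probabilities, and the length-based rule for the class-independent case) are illustrative assumptions, not the authors' procedure.

```python
# Illustrative sketch (not the paper's code) of three bias-injection settings
# on a toy intent-classification dataset of (utterance, intent) pairs.
import random
from collections import defaultdict


def inject_bias(examples, setting, keep_frac=0.5, seed=0):
    """Return a biased subsample of (utterance, intent) pairs.

    setting:
      'random'            - uniform down-sampling: keep keep_frac of all examples
      'class_dependent'   - each intent gets its own keep probability, so some
                            intents end up under-represented (assumed scheme)
      'class_independent' - keep probability depends on a label-independent
                            attribute; here, utterance length (assumed scheme)
    """
    rng = random.Random(seed)
    if setting == "random":
        return [ex for ex in examples if rng.random() < keep_frac]

    if setting == "class_dependent":
        intents = sorted({intent for _, intent in examples})
        keep_prob = {intent: rng.uniform(0.1, 0.9) for intent in intents}
        return [ex for ex in examples if rng.random() < keep_prob[ex[1]]]

    if setting == "class_independent":
        # Drop long utterances more often, regardless of their intent label.
        return [
            (utt, intent)
            for utt, intent in examples
            if rng.random() < (0.9 if len(utt.split()) <= 5 else 0.3)
        ]

    raise ValueError(f"unknown setting: {setting}")


# Example usage on a toy dataset.
data = [
    ("play some jazz", "PlayMusic"),
    ("what's the weather tomorrow in Boston", "GetWeather"),
    ("set an alarm for six am", "SetAlarm"),
] * 100
biased = inject_bias(data, "class_dependent")
counts = defaultdict(int)
for _, intent in biased:
    counts[intent] += 1
print(dict(counts))  # intent distribution after biased sampling
```

Under a sketch like this, the mitigation question studied in the paper becomes: given only the biased subsample, how much of the lost performance can be recovered by semi-supervised learning over the remaining unlabeled utterances versus by generating synthetic training examples.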