Chasing the tail with domain generalization: A case study on frequency-enriched datasets
2022
Natural language understanding (NLU) tasks are typically defined by creating an annotated dataset in which each utterance is encountered once. Such data does not resemble real-world natural language interactions, in which certain utterances are encountered frequently and others rarely. For deployed NLU systems this is a critical problem, since the underlying machine learning (ML) models are often fine-tuned on typical NLU data, which never factors in utterance frequency, and are then applied to real-world data with a very different distribution. Such systems need to maintain interpretation consistency for high-frequency (head) utterances while also performing well on low-frequency (tail) utterances. We propose an alternative strategy that explicitly uses utterance frequency in the training data to learn models that are more robust to unknown distributions. We present a methodology for simulating utterance usage in two public corpora and use it to create two new corpora with head, body, and tail segments. We evaluate several methods for joint intent classification and named entity recognition (IC-NER) and propose two domain generalization (DG) approaches that we adapt to the sequence labeling task. The DG approaches demonstrate up to 7.02% relative improvement in semantic accuracy over baselines on the tail data. We provide insights into why the proposed approaches work and show that the reasons for the observed improvements do not align with those reported in previous work.
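To make the head/body/tail construction concrete, here is a minimal sketch of one way to simulate utterance usage frequencies and segment a corpus by cumulative frequency mass. The Zipf exponent, the 80%/95% mass thresholds, and all function names are illustrative assumptions, not the paper's actual simulation procedure.

```python
import numpy as np

def simulate_frequencies(num_utterances: int, exponent: float = 1.1,
                         seed: int = 0) -> np.ndarray:
    """Assign each unique utterance a usage probability drawn from a
    Zipf-like power law (assumed here; the paper's recipe may differ)."""
    rng = np.random.default_rng(seed)
    ranks = rng.permutation(num_utterances) + 1   # random rank per utterance
    weights = ranks.astype(float) ** -exponent    # power-law decay over ranks
    return weights / weights.sum()                # normalize to probabilities

def split_head_body_tail(freqs: np.ndarray,
                         head_mass: float = 0.8,
                         body_mass: float = 0.95):
    """Partition utterance indices into head/body/tail segments by
    cumulative frequency mass (thresholds are illustrative)."""
    order = np.argsort(freqs)[::-1]               # most frequent first
    cum = np.cumsum(freqs[order])
    head = order[cum <= head_mass]                # top ~80% of usage mass
    body = order[(cum > head_mass) & (cum <= body_mass)]
    tail = order[cum > body_mass]                 # rare, long-tail utterances
    return head, body, tail

freqs = simulate_frequencies(10_000)
head, body, tail = split_head_body_tail(freqs)
print(f"head={head.size} body={body.size} tail={tail.size}")
```

Under a power-law simulation like this, a small head segment absorbs most of the usage mass while the tail contains the large majority of unique utterances, which is the distribution mismatch the abstract describes between typical NLU training data and deployed traffic.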