Entity-controlled synthetic text generation using contextual question and answering with pre-trained language models

By Karan Aggarwal, Henry Jin, Aitzaz Ahmad
2022
Download Copy BibTeX
Copy BibTeX
Recent advancements in Natural Language Processing (NLP) algorithms have resulted in state-of-the-art performance on Named Entity Recognition (NER) tasks. These algorithms typically require high-quality labeled datasets for training models. However, training NLP models effectively can suffer from issues such as scarcity of labeled data, data bias and under-representation, and privacy concerns with using sensitive data for training. Generating synthetic data to train models is a promising solution to mitigate these problems. We propose a contextual question and answering approach using pre-trained language models to synthetically generate entity-controlled text. Entity-controlled text generation is then used to augment small labeled datasets for downstream NER tasks. We evaluate this proposed method on two publicly available datasets, and measure the quality of generated texts quantitatively. We find that the model is capable of producing full text samples with the desired entities appearing in a stochastically controllable way, while retaining sentence coherence closest to the real world data. Evaluations on downstream NER tasks show significant improvements in low-labeled data regime, and in using purely synthetic data for NER to alleviate privacy concerns.
Research areas

Latest news

GB, MLN, Edinburgh
We’re looking for a Machine Learning Scientist in the Personalization team for our Edinburgh office experienced in generative AI and large models. You will be responsible for developing and disseminating customer-facing personalized recommendation models. This is a hands-on role with global impact working with a team of world-class engineers and scientists across the Edinburgh offices and wider organization. You will lead the design of machine learning models that scale to very large quantities of data, and serve high-scale low-latency recommendations to all customers worldwide. You will embody scientific rigor, designing and executing experiments to demonstrate the technical efficacy and business value of your methods. You will work alongside aRead more