Self-Supervised LLM Customizer (SSLC): Customizing LLMs on unlabeled data to enhance contextual question answering
2024
While large language models (LLMs) can be customized for specific domains by fine-tuning on domain-specific labeled data, the performance of the customized models depends heavily on the quality of that data. Obtaining high-quality labeled data for custom domains often requires considerable human effort and cost, whereas unlabeled data is frequently available at little or no cost. Existing methods either rely on continued pre-training or use general-purpose models trained for synthesis. However, continued pre-training requires vast amounts of data and adversely affects instruction-tuned models, while general-purpose synthesis models may not capture the nuances of custom data. We present the Self-Supervised LLM Customizer (SSLC), a framework for customizing LLMs using unlabeled text to enhance contextual question answering on custom data. Our approach synthesizes training data via few-shot prompting of an instruction-tuned model, curates the synthesized data with an LLM response scorer, and fine-tunes the model on the curated data. We demonstrate that the approach significantly improves contextual question answering compared to the baselines, outperforming them in 75% (9/12) of the experiments on both quantitative and qualitative metrics. On average, it outperforms un-customized models by 19.3 percentage points and the state-of-the-art approach by 4.4 percentage points in human-evaluation (proxy) accuracy.
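The synthesize-score-filter loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`synthesize_qa`, `score_response`, `curate`), the overlap-based stand-in for the LLM response scorer, and the 0.5 acceptance threshold are all assumptions for illustration; a real system would call an instruction-tuned LLM for both synthesis and scoring.

```python
# Hypothetical sketch of the SSLC-style pipeline: synthesize QA pairs from
# unlabeled passages, score each synthesized answer, keep only high-scoring
# examples, and use the survivors as fine-tuning data.

def synthesize_qa(passage):
    # Stand-in for few-shot synthesis with an instruction-tuned model.
    question = f"What does the passage say about {passage.split()[0]}?"
    answer = passage  # a real system would generate a grounded answer
    return question, answer

def score_response(passage, question, answer):
    # Stand-in for the LLM response scorer; here a crude token-overlap score.
    overlap = len(set(answer.split()) & set(passage.split()))
    return overlap / max(len(answer.split()), 1)

def curate(passages, threshold=0.5):
    # Keep only synthesized examples whose score clears the threshold.
    kept = []
    for passage in passages:
        question, answer = synthesize_qa(passage)
        if score_response(passage, question, answer) >= threshold:
            kept.append({"context": passage,
                         "question": question,
                         "answer": answer})
    return kept  # this curated set would be used for fine-tuning

corpus = ["solar panels convert sunlight into electricity",
          "transformers use attention to relate tokens in a sequence"]
dataset = curate(corpus)
print(len(dataset))  # both toy examples pass the overlap filter → 2
```

The key design point, per the abstract, is that curation happens before fine-tuning, so low-quality synthetic examples never reach the training set.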