Detoxification of large language models via regularized fine-tuning

Attribute-controlled fine-tuning can produce LLMs that adhere to policy while achieving competitive performance on general benchmarks.

November 21, 2024

Large language models (LLMs) have demonstrated impressive performance across a variety of tasks, but, as multiple incidents have shown, they carry the risk of producing inappropriate, unsafe, or biased outputs. When generating responses, a successfully trained LLM should comply with a set of policies specified by its creator; for example, the developer may want to restrain the LLM from generating toxic responses. We refer to this as attribute control, as it regulates an attribute of the LLM output.

In a paper we presented at EMNLP 2024, we propose a novel method for training an LLM to adhere to a set of constraints while preserving its performance. We first define a successfully trained LLM as one that can satisfy the following constraints: (1) Attribute control — the LLM output adheres to a policy, defined by the creator in most cases; (2) Utility preservation — the LLM maintains performance comparable to that of the original LLM on utility benchmarks; and (3) Training efficiency — the cost of fine-tuning with attribute control is similar to that of typical fine-tuning.

Our work is inspired by the classic idea of constraint-driven learning and posterior regularization, in which the model output is forced to adhere to a particular distribution. Specifically, we train an auxiliary model to control a specific output attribute — in this case, toxicity. During fine-tuning, the auxiliary model estimates the closest distribution that, given the current state of the LLM, satisfies the constraints, and it penalizes the gap between that estimate and the LLM’s current distribution.
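The projection-and-penalty idea can be illustrated on a toy categorical distribution. The sketch below is not the paper's auxiliary model, which is learned; it assumes a single constraint (expected toxicity at most a hypothetical `budget`), per-output toxicity scores `tox`, and KL(q || p) as the measure of the gap. Under those assumptions, the closest feasible distribution has the standard posterior-regularization form q(y) ∝ p(y)·exp(−λ·tox(y)), and λ can be found by bisection.

```python
import math

def expected_toxicity(dist, tox):
    return sum(d * t for d, t in zip(dist, tox))

def project(p, tox, budget, iters=60):
    """Closest q to p in KL(q || p) subject to E_q[tox] <= budget."""
    if expected_toxicity(p, tox) <= budget:
        return list(p)  # already feasible, so there is nothing to penalize
    t_min = min(t for pi, t in zip(p, tox) if pi > 0)
    if budget <= t_min:
        raise ValueError("budget must exceed the lowest toxicity in p's support")

    def tilt(lam):
        # Exponential tilting; subtracting t_min keeps the weights from underflowing.
        w = [pi * math.exp(-lam * (t - t_min)) for pi, t in zip(p, tox)]
        z = sum(w)
        return [wi / z for wi in w]

    lo, hi = 0.0, 1.0
    while expected_toxicity(tilt(hi), tox) > budget:  # grow until feasible
        hi *= 2.0
    for _ in range(iters):  # bisection on the multiplier lambda
        mid = (lo + hi) / 2.0
        if expected_toxicity(tilt(mid), tox) > budget:
            lo = mid
        else:
            hi = mid
    return tilt(hi)

def kl(q, p):
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# Illustrative numbers only (not from the paper).
p = [0.5, 0.3, 0.2]      # current model distribution over three outputs
tox = [0.0, 0.5, 1.0]    # toxicity score of each output
q = project(p, tox, budget=0.2)
penalty = kl(q, p)       # regularization term added to the fine-tuning loss
```

In this toy setting the penalty is zero whenever the model already satisfies the constraint, and it grows as the model's distribution moves further from the feasible region.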

The natural way to do this is to iteratively push the LLM closer to the feasible region of generations, making the estimation progressively more accurate. However, this approach is sequential, and it causes a significant increase in run time. In our paper, we also present a parallelized algorithm that updates the base LLM and regularizer simultaneously, based on their status in the last iteration. Empirically, parallelization achieves the same level of performance as sequential fine-tuning, and the time complexity is the same as that of typical, unregularized fine-tuning.

A comparison of sequential *(left)* and parallel *(right)* fine-tuning over three iterations.
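The difference between the two schedules is a matter of data dependence, which a deliberately simplified scalar toy can show. In this sketch, `x` stands in for the LLM's state (say, its expected toxicity), `r` for the regularizer's estimate of the closest feasible state, and `budget` and `lr` are hypothetical parameters; none of this is the paper's actual update rule.

```python
def project(x, budget):
    # Closest feasible point to x when the feasible region is x <= budget.
    return min(x, budget)

def sequential(x, budget, lr, iters):
    for _ in range(iters):
        r = project(x, budget)   # regularizer must wait for the current LLM state
        x = x + lr * (r - x)     # LLM must wait for the fresh regularizer
    return x

def parallel(x, budget, lr, iters):
    r = project(x, budget)       # initial regularizer estimate
    for _ in range(iters):
        # Both right-hand sides read only the previous iteration's (x, r),
        # so the two updates could be computed at the same time.
        x, r = x + lr * (r - x), project(x, budget)
    return x

x_seq = sequential(0.8, budget=0.2, lr=0.5, iters=50)
x_par = parallel(0.8, budget=0.2, lr=0.5, iters=50)
```

In the sequential loop each iteration contains two dependent steps; in the parallel loop the two updates are independent within an iteration, which is what removes the extra wall-clock cost.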

We also explore adaptive regularization (i.e., the use of a domain-specific regularizer on related parts of the training data) to improve performance and prevent catastrophic forgetting.
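One way to picture adaptive regularization is as a per-example switch on the penalty term. The sketch below assumes each training example carries a hypothetical `domain` tag, a precomputed language-modeling loss `lm_loss`, and a precomputed regularization `penalty`; the field names and the weight `reg_weight` are illustrative, not taken from the paper.

```python
def batch_loss(examples, reg_weight=1.0, regularized_domain="toxicity"):
    """Mean loss over a batch, applying the penalty only to in-domain examples."""
    total = 0.0
    for ex in examples:
        loss = ex["lm_loss"]
        if ex["domain"] == regularized_domain:
            loss += reg_weight * ex["penalty"]  # regularize only related data
        total += loss
    return total / len(examples)

# Illustrative values only.
batch = [
    {"domain": "toxicity", "lm_loss": 2.0, "penalty": 0.5},
    {"domain": "general", "lm_loss": 1.0, "penalty": 0.5},
]
loss = batch_loss(batch, reg_weight=2.0)
```

Examples from the general corpus contribute only their ordinary fine-tuning loss, so the regularizer cannot pull the model away from behavior it should retain on unrelated data.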

Utility is preserved

In experiments, we fine-tuned Llama-7B and Falcon-7B models on a corpus that mixed ToxiGen (data containing toxic responses) and Wikitext (a general corpus) in equal proportions. With the adaptive regularizer, our approach preserved overall performance better than the standard approaches of reinforcement learning (RL) and filtering, while meeting toxicity control standards.

Benchmark performance of Llama-7B and Falcon-7B with toxicity control

Model	Method	ToxiGen (lower is better)	MMLU (5-shot, higher is better)	Commonsense reasoning (0-shot, higher is better)
Llama-7B	Baseline	23	35.1	75.6
	Filtering	21.9	34.6	75.1
	RL	15.2	33.6	73.2
	NADO decoding	15.2	31.1	71.4
	Ours w/o adaptive	15.2	30.4	71.9
	Ours w/ adaptive	14.2	33.9	73.6
Falcon-7B	Baseline	14	27.2	76.1
	Filtering	13.6	26.4	74.9
	RL	9.8	25.4	74.4
	NADO decoding	7.3	23.6	72.5
	Ours w/o adaptive	7.1	23.1	71.8
	Ours w/ adaptive	7.3	26.1	74.5

Generation quality is preserved

With OPT-30B acting as a judge, sequences produced by our model were indistinguishable in quality from those produced by the base model, which indicates that our method retains generation quality. Our model also outperformed models trained using filtering and RL approaches.

Pairwise win rates (%); each cell is the win rate of the column model against the row model

Win rate	Base	Filtering	RL	Ours
Base	N/A	44.3	45.1	51.4
Filtering	55.7	N/A	53.4	61.6
RL	54.9	46.6	N/A	61.3
Ours	48.6	38.4	38.7	N/A
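A pairwise win-rate table like the one above can be computed from individual judge decisions. The sketch below assumes each decision is recorded as a hypothetical `(model_a, model_b, winner)` tuple with no ties; the judging prompt and the way OPT-30B's preference is extracted are outside its scope.

```python
from collections import Counter

def win_rates(judgments):
    """Return {(x, y): percentage of x-vs-y comparisons won by x}."""
    wins, totals = Counter(), Counter()
    for a, b, winner in judgments:
        if a == b or winner not in (a, b):
            raise ValueError(f"invalid judgment: {(a, b, winner)}")
        loser = b if winner == a else a
        wins[(winner, loser)] += 1
        totals[frozenset((a, b))] += 1
    rates = {}
    for pair, n in totals.items():
        x, y = tuple(pair)
        rates[(x, y)] = 100.0 * wins[(x, y)] / n
        rates[(y, x)] = 100.0 * wins[(y, x)] / n
    return rates

# Made-up decisions for illustration; these are not the paper's data.
decisions = [
    ("ours", "rl", "ours"),
    ("rl", "ours", "ours"),
    ("ours", "rl", "rl"),
    ("ours", "rl", "ours"),
]
rates = win_rates(decisions)
```

Because every comparison has exactly one winner, the two rates for a pair always sum to 100, which matches the table: each pair of mirrored cells there (for example, 51.4 and 48.6) sums to 100.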

Toxicity classification and generation

One of the most interesting aspects of our method is that it allows the LLM to learn from toxic content. In experiments, we fine-tuned Llama-7B models on a toxicity classification task using the Jigsaw dataset of toxic content. With standard supervised fine-tuning (SFT), the model’s performance on the classification task improved, but the increased exposure to toxic content made it more likely to generate toxic content itself. With our method, on the other hand, classification performance improved while the toxicity of the model’s generations decreased.

Jigsaw performance using Llama-7B model with toxicity control

Model	API tox.	Classify ROC
Baseline	0.315	0.910
SFT (LLM loss)	0.344	0.966
Ours (LLM loss)	0.288	0.959
SFT (classification)	0.314	0.972
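The "Classify ROC" column presumably reports the area under the ROC curve (an assumption; the table does not spell it out). For readers unfamiliar with the metric, the sketch below computes it from its pairwise definition: the probability that a randomly chosen toxic example receives a higher score than a randomly chosen non-toxic one, with ties counted as one half. The labels and scores are invented for illustration.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count a full win when the positive outranks the negative, half for a tie.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative labels (1 = toxic) and classifier scores.
auc = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

This quadratic-time form is meant for clarity; production code would typically use a sorting-based implementation from an established library.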

Acknowledgements: I would like to acknowledge our intern, Tao Meng (UCLA), who led the work on this paper, and our coauthors, Ninareh Mehrabi, Palash Goyal, Anil Ramakrishna, Aram Galstyan, Richard Zemel, Kai-Wei Chang, and Rahul Gupta, for their contributions.

About the Author

Charith Peris

Charith Peris is a senior applied scientist in Amazon's Artificial General Intelligence (AGI) organization.
