Machines hoping to converse convincingly with humans have several natural-language-processing (NLP) skills to master, including text summarization, information extraction, and question answering. It’s these skills that enable virtual personal assistants to search for dinner recipes online or answer random questions.
Recent advances in sequence-to-sequence pretrained language models such as BART (bidirectional and autoregressive Transformers) have led to strong performance on many NLP tasks. These models’ successes, however, come at the cost of significant computation and memory resources; a typical BART model might have hundreds of millions of parameters.
For resource-constrained devices — such as cell phones or smart home appliances — this can make BART impractical to deploy. At ACL 2022, scientists from Amazon Web Services’ AI Labs presented a paper that addresses this problem by using a combination of distillation and quantization to compress a BART model to less than 1/16th its size, with little drop in performance.
A two-pronged approach
Quantization and distillation are common techniques for reducing neural networks’ memory footprints. Quantization maps high-precision values to a limited menu of lower-precision values; in distillation, a smaller, more efficient student model is trained to emulate a larger, more powerful teacher.
In the new paper, the Amazon researchers combine distillation and distillation-aware quantization to reduce the model’s footprint.
The researchers begin by fine-tuning a BART model — the teacher model — on a particular NLP task, such as question answering or text summarization. They then copy the weights from select layers of the trained teacher model over to a student model. This is the distillation process, which reduces the model footprint.
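As a rough illustration, layer copying of this kind might look like the sketch below, written against the Hugging Face transformers library. The checkpoint name and the choice of which decoder layers to keep are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: build a smaller student by copying selected teacher layers.
# The checkpoint and the kept layer indices are illustrative, not the paper's setup.
from transformers import BartConfig, BartForConditionalGeneration

# In practice the teacher would first be fine-tuned on the target task.
teacher = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Student keeps all encoder layers but only half the decoder layers, for example.
student_config = BartConfig.from_pretrained(
    "facebook/bart-base", encoder_layers=6, decoder_layers=3
)
student = BartForConditionalGeneration(student_config)

# Copy shared embeddings and the full encoder directly.
student.model.shared.load_state_dict(teacher.model.shared.state_dict())
student.model.encoder.load_state_dict(teacher.model.encoder.state_dict())

# Copy the decoder's non-layer parameters (positional embeddings, embedding layer norm).
student.model.decoder.embed_positions.load_state_dict(
    teacher.model.decoder.embed_positions.state_dict()
)
student.model.decoder.layernorm_embedding.load_state_dict(
    teacher.model.decoder.layernorm_embedding.state_dict()
)

# Copy every other decoder layer from the teacher (layers 0, 2, 4).
for student_idx, teacher_idx in enumerate([0, 2, 4]):
    student.model.decoder.layers[student_idx].load_state_dict(
        teacher.model.decoder.layers[teacher_idx].state_dict()
    )
```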
The next step is distillation-aware quantization. The student model is quantized, producing a low-precision model — although the full-precision student model is also kept on hand, as it’s necessary for the next step in the process.
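To make the quantization step concrete, here is a minimal sketch of one common scheme, symmetric per-tensor quantization of the weights. The paper's exact quantizer and bit widths may differ; the helper names are ours.

```python
import copy
import torch

def quantize_tensor(w: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Symmetric per-tensor quantization: snap weights onto 2**num_bits - 1 evenly spaced levels."""
    qmax = 2 ** (num_bits - 1) - 1                        # e.g. 127 for 8 bits
    scale = w.abs().max().clamp(min=1e-8) / qmax          # step size between adjacent levels
    w_int = torch.clamp(torch.round(w / scale), -qmax, qmax)
    return w_int * scale                                   # dequantized values used in the forward pass

def quantize_model(full_precision_model, num_bits: int = 8):
    """Return a quantized copy while leaving the full-precision student untouched."""
    quantized = copy.deepcopy(full_precision_model)
    with torch.no_grad():
        for param in quantized.parameters():
            param.copy_(quantize_tensor(param, num_bits))
    return quantized
```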
The quantized student model then processes the dataset that was used to train the teacher model, and its outputs are assessed according to two metrics: the standard task-based loss, which measures how far the outputs diverge from ground truth, and a distillation loss, which measures how far the quantized-and-distilled student model’s outputs diverge from the teacher model’s.
Both of these losses are then used to update the parameters not of the quantized student model but of the full-precision student model. That’s because the standard algorithm for updating a neural network relies on gradient descent, which requires differentiable (that is, continuously variable) model parameters; the parameters of a quantized model take discrete values, so they’re not differentiable.
Once the full-precision student model has been updated — to minimize not only its error on the training set but also its difference from the teacher model — it is quantized once again, to reduce its memory footprint.
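Putting the pieces together, a single step of this training loop might look like the sketch below, which reuses the hypothetical quantize_model helper from the previous sketch. The straight-through-style gradient copy, the loss weighting (alpha), the softmax temperature, and the batch format are common choices for distillation-aware quantization, assumed here for illustration rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_aware_training_step(
    teacher, fp_student, batch, optimizer, num_bits=8, alpha=0.5, temperature=2.0
):
    """One illustrative step: forward through the quantized student, backpropagate
    into the full-precision student, then re-quantize on the next call.
    Assumes `quantize_model` from the earlier sketch and an optimizer built over
    fp_student.parameters(); batch["inputs"] holds input_ids, attention_mask, etc."""
    # 1. Quantize the current full-precision student.
    q_student = quantize_model(fp_student, num_bits)

    # 2. Forward passes: the quantized student predicts; the teacher provides soft targets.
    student_logits = q_student(**batch["inputs"]).logits
    with torch.no_grad():
        teacher_logits = teacher(**batch["inputs"]).logits

    # 3. Task loss against ground truth plus distillation loss against the teacher.
    task_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), batch["labels"].view(-1)
    )
    distill_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    loss = alpha * task_loss + (1 - alpha) * distill_loss

    # 4. Gradients are computed on the quantized copy, then applied to the
    #    full-precision weights (a straight-through-style update).
    loss.backward()
    for fp_param, q_param in zip(fp_student.parameters(), q_student.parameters()):
        fp_param.grad = q_param.grad
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Calling this step repeatedly updates the full-precision student, and each call begins by re-quantizing it, matching the loop described above.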
Experiments
The researchers compared the efficiency of their distilled and quantized BART model against three different benchmarks, on the tasks of both text summarization and long-form question answering. They also investigated how distillation-aware quantization would work on a more complex model like mBART, a multilingual model designed to translate phrases between languages — in this case, between English and Romanian.
In their initial analysis, they found that combining distillation and quantization provided better compression than quantization alone, with no performance drop on the long-form-question-answering task and only a minimal drop-off on the summarization task. They also found that it is possible to compress the model to nearly 1/28th its original size. At that compression rate, however, performance varies, so the desired degree of compression should be assessed on a task-by-task basis.
For the mBART task, the team found that the distillation-aware approach was effective at compressing the model’s footprint when using eight-bit quantization, but its performance began to drop more significantly when the number of quantization bits was lowered to two. The researchers believe that this performance drop was caused by accumulated distillation and quantization errors, which may be more acute for the complex problem of machine translation.
In future work, the researchers hope to further explore the multilingual mBART model and to evaluate additional compression techniques, including head pruning and sequence-level distillation. As the current study focused on memory footprints, they also plan to expand their study to investigate latency effects.