Domain data trumps teacher knowledge for distilling NLU models

On natural-language-understanding tasks, student models trained only on task-specific data outperform those trained on a mix that includes generic data.

January 18, 2023

Knowledge distillation is a popular technique for compressing large machine learning models into manageable sizes, to make them suitable for low-latency applications such as voice assistants. During distillation, a lightweight model (referred to as a student) is trained to mimic a source model (referred to as the teacher) over a specific data set (the transfer set).
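As a rough illustration of what "mimicking" can mean, the sketch below computes a common form of distillation loss: the KL divergence between temperature-softened teacher and student output distributions for one example from the transfer set. The function names, the temperature value, and the choice of KL divergence are assumptions made for illustration; the article does not specify the loss used in the experiments.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits into a probability distribution (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions.

    The temperature of 2.0 is an illustrative assumption, not a value from the article.
    """
    p = softmax(teacher_logits, temperature)  # teacher distribution (the target)
    q = softmax(student_logits, temperature)  # student distribution (being trained)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits for a single transfer-set example.
teacher = [2.0, 0.5, -1.0]
student = [1.0, 1.0, 0.0]
loss = distillation_loss(teacher, student)
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge, so minimizing it over the transfer set pushes the student toward the teacher's behavior on that data.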

The choice of the transfer set is crucial to producing high-quality students, but how to make that choice is far from obvious. In natural-language-understanding (NLU) applications, teacher models are usually pretrained on generic corpora, which can differ from the task-specific corpora used for fine-tuning. This raises a natural question: Should the student be distilled over the generic corpora, so as to learn from high-quality teacher predictions, or over the task-specific corpora, which align better with fine-tuning?

Distillation diversity

In our experiments, we distilled a set of multilingual students from a large multilingual teacher model, using generic and task-specific data mixed in three different ratios:

Ratio 1: generic-only (baseline)
Ratio 2: 7:3 generic to task-specific (mimicking a low-resource setting)
Ratio 3: task-specific-only
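The three ratios can be pictured as different settings of a single sampling parameter. The sketch below is a minimal, hypothetical way to assemble a transfer set with a given share of generic data; the function name `build_transfer_set`, the sampling-with-replacement strategy, and the toy data are assumptions for illustration, not the procedure used in the experiments.

```python
import random

def build_transfer_set(generic, task_specific, generic_share, size, seed=0):
    """Sample a transfer set in which `generic_share` of the examples are generic.

    generic_share = 1.0 corresponds to Ratio 1, 0.7 to Ratio 2, and 0.0 to Ratio 3.
    Sampling is with replacement, which is an illustrative simplification.
    """
    rng = random.Random(seed)
    n_generic = round(size * generic_share)
    n_task = size - n_generic
    sample = []
    if n_generic:
        sample += rng.choices(generic, k=n_generic)
    if n_task:
        sample += rng.choices(task_specific, k=n_task)
    rng.shuffle(sample)  # interleave the two sources
    return sample

# Toy corpora; each example is tagged with its source for inspection.
generic_pool = [("generic", i) for i in range(50)]
task_pool = [("task", i) for i in range(50)]

ratio_2 = build_transfer_set(generic_pool, task_pool, generic_share=0.7, size=1000)
ratio_3 = build_transfer_set(generic_pool, task_pool, generic_share=0.0, size=1000)
```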

So what are generic and task-specific data? Generic data is usually publicly available, non-annotated data unrelated to any specific task. Model training on unannotated data typically involves self-supervised learning; in our case, that means masking out words of a text and training the model to supply them (masked language modeling).
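To make the masking step concrete, here is a simplified sketch of how masked-language-modeling inputs and targets might be prepared. The `[MASK]` token string and the 15% masking probability follow a common convention and are assumptions here; production setups usually operate on subword tokens and apply additional replacement rules that are omitted.

```python
import random

MASK = "[MASK]"  # placeholder token; the actual token depends on the tokenizer

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (inputs, labels) for masked language modeling.

    Masked positions carry the original token as the label; all other
    positions carry None, meaning they do not contribute to the loss.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(token)
        else:
            inputs.append(token)
            labels.append(None)
    return inputs, labels

sentence = "play some relaxing music in the kitchen".split()
inputs, labels = mask_tokens(sentence, mask_prob=0.3, seed=1)
```

The model sees `inputs` and is trained to predict the original token wherever `labels` is not `None`, which requires no human annotation.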

Task-specific data is data that has been annotated to indicate the proper performance of a task. In our case, we explored two downstream tasks, domain classification (DC) and joint intent classification and named-entity recognition (ICNER), and our task-specific data is annotated accordingly.

We evaluated our models on two types of test sets, test and tail_test, and on four languages of interest: German, French, Italian, and Spanish. The test set comprises the full test split, while tail_test is the subset of data points within test that have a frequency of occurrence of three or fewer. The tail_test set allows us to measure how well our models generalize to data that they have rarely seen during training.
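A tail subset of this kind can be derived with a simple frequency count. The sketch below assumes that frequency is counted over some reference collection of utterances, passed in explicitly because the article does not say which corpus the counts come from; the function name and toy utterances are hypothetical.

```python
from collections import Counter

def tail_subset(test_utterances, reference_utterances, max_count=3):
    """Keep test utterances that occur at most `max_count` times in the reference data."""
    counts = Counter(reference_utterances)
    # Counter returns 0 for unseen utterances, so those are kept as well.
    return [u for u in test_utterances if counts[u] <= max_count]

# Toy data: "play music" is frequent, the others are rare or unseen.
reference = ["play music"] * 5 + ["dim the lights"] * 3 + ["order a pizza"]
test = ["play music", "dim the lights", "order a pizza", "book a taxi"]

tail_test = tail_subset(test, reference)
```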

Figure: A schematic of the two baseline models and the four experimental models that we investigated, and of how they were evaluated.

All our experimental and baseline models had the same number of parameters. The generic-distilled baseline was created by distilling a student using only generic data (Ratio 1). The directly pretrained baseline was pretrained from scratch using the generic data and fine-tuned on the task-specific data.

We created four distilled student encoders, two of which were directly distilled using Ratio 2 and Ratio 3 datasets. The remaining two were created in the same way, but the teacher was fine-tuned with the task-specific datasets for a million steps each before distillation. This enabled benchmarking teacher adaptation to the target task.

When evaluating performance on the DC and ICNER tasks, we added either a DC or an ICNER decoder to each encoder. For DC, improvement was measured as the change in F1 score (which factors in both false negatives and false positives) relative to the baseline; for ICNER, it was measured as the change in semantic error rate (SemER) relative to the baseline. Because SemER is an error rate, lower values are better.
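For readers who want the arithmetic spelled out, the sketch below computes an F1 score from raw counts and expresses a metric as a percentage change relative to a baseline. Treating "change relative to baseline" as a relative percentage is an assumption, and all numbers are made up for illustration; they are not results from the experiments.

```python
def f1_score(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall, computed from raw counts."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

def relative_change_pct(value, baseline):
    """Percentage change of `value` relative to `baseline`."""
    return 100.0 * (value - baseline) / baseline

# Illustrative numbers only.
baseline_f1 = f1_score(true_pos=80, false_pos=20, false_neg=20)
student_f1 = f1_score(true_pos=88, false_pos=12, false_neg=12)
f1_change = relative_change_pct(student_f1, baseline_f1)  # positive means better

baseline_semer, student_semer = 0.20, 0.18
semer_change = relative_change_pct(student_semer, baseline_semer)  # negative means better
```

The opposite signs are why a positive change indicates improvement for F1 while a negative change indicates improvement for SemER.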

Figure: The percentage improvements of each distilled encoder, for each language, over the generic-distilled baseline. For change in F1 score, positive is better.

Figure: The results for the joint ICNER task. In this case, negative is better.

On the DC task, our results show improvements across the board when task-specific data is included in the transfer sets, with the greatest improvement coming from using only task-specific data. We see similar results in the case of ICNER, where improvements are greater for encoders distilled using only task-specific data.

Acknowledgements: We would like to thank our coauthors on the paper for their contributions to this work: Lizhen Tan, Turan Gojayev, Pan Wei, and Gokmen Oz.

About the Authors

Charith Peris

Charith Peris is a senior applied scientist in Amazon's Artificial General Intelligence (AGI) organization.

Thomas Gueudre

Thomas Gueudre is a senior applied scientist in the Alexa AI Natural Understanding organization.
