Multitask Distillation.png — Each task in a multitask model *(left)* typically has its own loss function, and during training, the functions converge at different rates *(right)*. During training, a new method attempts to preserve gains *(dotted lines)* on tasks whose performance has peaked.

Knowledge distillation for better convergence in multitask learning

Allowing separate tasks to converge on their own schedules and using knowledge distillation to maintain performance improves accuracy.

By Weiyi Lu

July 13, 2022

Multiple convergence.png — Validation curves in a five-task multitask learning setup, where training minimizes the sum of the task losses. The tasks corresponding to the blue, purple, and red curves show signs of overfitting, while the tasks corresponding to the orange and green curves are underfitted at the end of training.

Multitask learning (MTL) typically involves jointly optimizing the losses of a set of tasks. One naive approach is to simply minimize the sum of the losses. However, the convergence speeds of the tasks can differ according to task difficulty. This naive training approach is usually suboptimal, because the model can end up overfitting some tasks and underfitting others.

To address this issue, many existing methods aim to balance the learning speed across tasks, by facilitating or inhibiting the learning of each individual task, such that all tasks have roughly the same convergence rate. These methods include applying static loss weights, dynamically adjusting loss weights during training, and manipulating the gradients of different tasks.
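As a minimal sketch of the loss-combination schemes described above, the snippet below computes a weighted sum of per-task losses. With every weight left at 1.0 it reduces to the naive sum of losses; passing fixed weights corresponds to static loss weighting. The function name, task names, and weight values are illustrative assumptions and do not come from the paper.

```python
def combine_task_losses(task_losses, weights=None):
    """Return the weighted sum of per-task losses.

    task_losses: dict mapping task name -> scalar loss value.
    weights: optional dict mapping task name -> static weight.
             Tasks without an entry default to a weight of 1.0, so
             omitting `weights` gives the naive sum of losses.
    """
    weights = weights or {}
    return sum(weights.get(task, 1.0) * loss for task, loss in task_losses.items())


# Hypothetical per-task losses at one training step.
losses = {"task_a": 0.5, "task_b": 1.5}

naive_total = combine_task_losses(losses)
# Static weighting that emphasizes task_a and mutes task_b.
weighted_total = combine_task_losses(losses, {"task_a": 2.0, "task_b": 0.0})
```

Dynamic weighting schemes would update the `weights` dictionary during training rather than fixing it up front.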

Switch to KD.png — Illustration of the proposed approach. When the validation curve of a task reaches its peak, we switch to a knowledge distillation loss for that task from that point forward, with the aim of following the dotted lines, where the performance of each task stays at its peak level until the end of training.

In a paper we presented in the industry track at NAACL 2022, we propose a method for achieving convergence in MTL that improves on approaches that artificially enforce the same convergence rate across tasks. Instead, we let the tasks converge on their own schedules, and when a task converges, we switch to a knowledge distillation (KD) loss for that task in order to keep its performance at the best level while the model learns the remaining tasks. The accompanying figure illustrates the idea.

We evaluate the proposed method in two five-task MTL setups consisting of proprietary e-commerce datasets. The results show that our method consistently outperforms existing loss-weighting and gradient-balancing approaches, achieving average improvements of 0.9% and 1.5%, respectively, over the best-performing baseline model in the two setups.

Asynchronous convergence via knowledge distillation

Our proposed method works as follows:

1. After the model converges on a task, we use its best-performing parameter values and run inference on the task’s training set, recording the predictions.
2. For the remaining training steps, we use these predictions as soft labels to train the model on the converged task, while using real labels to train on the remaining tasks.
3. We repeat this until all tasks converge.
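The steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the paper's implementation: the patience-based convergence check, the use of cross-entropy against the recorded distribution as the distillation loss, and all function and task names are hypothetical choices made for this example.

```python
import math


def peak_step(validation_scores, patience=2):
    """Return the index of the best validation score once no improvement has
    been seen for `patience` consecutive evaluations, else None.
    (Assumed convergence criterion; the source does not specify one.)"""
    best_index, best_score = None, float("-inf")
    for index, score in enumerate(validation_scores):
        if score > best_score:
            best_index, best_score = index, score
        elif index - best_index >= patience:
            return best_index
    return None


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, logits):
    """Cross-entropy between a target distribution and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target_probs, probs) if t > 0)


def multitask_loss(logits_by_task, hard_labels, soft_labels):
    """Sum of per-task losses for one example.

    Tasks present in `soft_labels` are treated as converged and are trained
    against the recorded prediction distribution; every other task is trained
    against its real label (encoded one-hot).
    """
    total = 0.0
    for task, logits in logits_by_task.items():
        if task in soft_labels:
            target = soft_labels[task]
        else:
            target = [1.0 if i == hard_labels[task] else 0.0 for i in range(len(logits))]
        total += cross_entropy(target, logits)
    return total


# Example: task_a's validation curve has peaked, so it switches to soft labels.
converged_at = peak_step([0.60, 0.70, 0.74, 0.73, 0.72])
logits = {"task_a": [2.0, 0.5], "task_b": [0.1, 1.2, -0.3]}
hard = {"task_a": 0, "task_b": 1}
# Stand-in for predictions recorded from the best-performing parameters.
soft = {"task_a": softmax([3.0, 0.0])}
loss = multitask_loss(logits, hard, soft)
```

In a full training loop, `soft_labels` would start empty and gain an entry each time a task is judged to have converged, until every task has been switched over.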

Experiments and results

We evaluate our approach on two five-task setups built from proprietary e-commerce tasks. The tasks in the first setup are more similar to each other and are all classification tasks, while the ones in the second setup are more diverse in terms of application and task type. We evaluate on these two benchmarks to test the effectiveness and robustness of our method in different MTL scenarios.

In both setups, both the joint and sequential settings of our method substantially outperform the baseline methods. On average, our best results exceed the best-performing baseline by 0.9% and 1.5% in the two setups, respectively.

Below are the validation curves of the baseline that simply minimizes the sum of task losses and of our proposed joint and sequential settings in the first five-task setup. We can observe that none of the tasks in the joint and sequential settings shows a downward trend, suggesting that the method is indeed effective in maintaining the performance of converged tasks at the best level while the model learns the remaining tasks.

Baseline comparison.png — Validation curves of the baseline that simply minimizes the sum of task losses and of our proposed joint and sequential settings.

About the Author

Weiyi Lu

Weiyi Lu is an applied scientist at Amazon.
