Knowledge distillation method for better vision-language models

Method preserves knowledge encoded in teacher model’s attention heads even when student model has fewer of them.

Large machine learning models based on the transformer architecture have recently demonstrated extraordinary results on a range of vision and language tasks. But such large models are often too slow for real-time use, so practical systems frequently rely on knowledge distillation to transfer large models’ knowledge to leaner, faster models.

The defining characteristic of the transformer model is its reliance on attention mechanisms, which determine the influence that previously seen data should have on the model’s handling of the data at hand. The attention mechanisms are typically organized into multiple heads, each of which attends to a different aspect of the data.

Typically, large-transformer distillation involves aligning the attention heads of the large, trained model (the teacher) with the attention heads of the leaner, target model (the student) on a one-to-one basis. But limiting the number of attention heads is one of the ways in which the student model can reduce model complexity, so a one-to-one alignment leaves some of the teacher’s heads with no counterpart in the student.

At this year’s meeting of the Association for the Advancement of Artificial Intelligence (AAAI), we proposed an alternative, in which the knowledge of all the attention heads in the teacher model is distilled into all the attention heads of the student model. Since the student has fewer heads than the teacher, a single attention head in the student model may end up encoding information contained in several of the teacher’s attention heads.

We evaluated our approach on two different vision-language models, which map images and texts to the same vector space. The models had been fine-tuned on a visual-question-answering task, an image-captioning task, and a translation task based on image context, and we compared our distillation approach to two state-of-the-art baselines. Our approach outperformed the baselines across the board.

Target tasks

Typically, a vision-language model (VLM) has a separately pretrained sub-module for each of its modalities, and the whole network is then further pretrained to learn a multimodal representation. Finally, the pretrained model is fine-tuned on a specific task.

In our experiments, we distilled the student model only on the fine-tuned task. We did, however, consider the case in which the teacher model did not have any multimodal pretraining and found that our distillation method could, to a great extent, compensate for that lack.

Weighting game

For a given input or set of inputs, each attention head of a transformer constructs an attention map, a matrix that indicates the influence that each element of the input exerts on each of the other elements. In an LLM, the attention map maps the words of a text sequence against themselves; when deciding on each new output word, the LLM uses the attention weights in the matrix column corresponding to that word. In a vision model, the map might represent the influence that each region of an image exerts on the interpretation of every other region.
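
To make the idea of an attention map concrete, here is a minimal sketch of how a single head's map can be computed with standard scaled dot-product attention. The article does not spell out this formula, so the function name, the array sizes, and the random inputs are illustrative assumptions. In this sketch, row i holds the weights that element i places on every element; other write-ups transpose that convention.

```python
import numpy as np

def attention_map(queries, keys):
    """Scaled dot-product attention weights for one head, shape (n, n)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    # Subtract the row maximum before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    # Normalize each row so that its weights sum to 1 (a softmax).
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d = 4, 8  # hypothetical sizes: 4 input elements, 8-dimensional projections
q = rng.normal(size=(n, d))
k = rng.normal(size=(n, d))

amap = attention_map(q, k)
print(amap.shape)  # (4, 4)
```

Each entry of `amap` is nonnegative, and each row sums to 1, so a row can be read as a distribution of influence over the input elements.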

The rows of any matrix can be concatenated to produce a single vector, and our approach to knowledge distillation relies on the vector versions — or “flattened” versions — of attention maps.
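
As a small illustration of flattening, the sketch below concatenates the rows of a made-up 3-by-3 attention map into one vector. The numbers are invented for the example and do not come from any model.

```python
import numpy as np

# Hypothetical 3x3 attention map; each row sums to 1.
attention = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
])

# Row-major reshape concatenates the rows into a single vector.
flattened = attention.reshape(-1)
print(flattened.shape)  # (9,)
```

Once every map is a vector of the same length, maps from different heads can be compared with ordinary vector operations such as cosine similarity.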

The loss function for the distillation process has two components. One is a function that seeks to minimize the difference between the teacher and student outputs; obviously, it’s crucial that the student reproduce the functionality of the teacher model as accurately as possible. The other component of the loss function aligns attention maps.

Specifically, for a given training example and a given attention head in the teacher model, the attention-map-alignment loss seeks to minimize the distance between the teacher’s attention map and a weighted sum of the maps generated by all the student attention heads.

Schematics comparing conventional attention-head knowledge distillation (right) and our approach, attention map alignment distillation (AMAD) (left). In the conventional approach, each teacher attention head is mapped to exactly one student head; extra teacher heads are simply discarded. In our approach, each teacher head is mapped to multiple student heads in a weighted fashion. The thickness of the colored lines illustrates the weights.

The weights of that weighted sum are based on the cosine similarities between the flattened teacher map and the flattened student maps. In other words, student maps that are already similar to the teacher map count more toward the weighted sum. Over successive steps of the training process, that similarity should increase, and so should the weights assigned to the similar student maps.
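
The attention-map-alignment loss described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation from the paper: the article says only that the weights are "based on" cosine similarities, so normalizing them with a softmax is an assumption, as are the use of mean squared error as the distance, the function name, and the requirement that teacher and student maps have the same size.

```python
import numpy as np

def attention_alignment_loss(teacher_maps, student_maps, eps=1e-8):
    """teacher_maps: (T, n, n) array; student_maps: (S, n, n) array."""
    # Flatten each map into a vector.
    t = teacher_maps.reshape(teacher_maps.shape[0], -1)  # (T, n*n)
    s = student_maps.reshape(student_maps.shape[0], -1)  # (S, n*n)

    # Cosine similarity between every teacher map and every student map.
    t_unit = t / (np.linalg.norm(t, axis=1, keepdims=True) + eps)
    s_unit = s / (np.linalg.norm(s, axis=1, keepdims=True) + eps)
    cos = t_unit @ s_unit.T  # (T, S)

    # ASSUMPTION: turn similarities into weights with a softmax over student heads.
    w = np.exp(cos - cos.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)

    # Weighted sum of student maps for each teacher head.
    mixed = w @ s  # (T, n*n)

    # ASSUMPTION: mean squared error as the distance measure.
    return float(np.mean((t - mixed) ** 2)), w

rng = np.random.default_rng(0)

def random_maps(heads, n):
    """Random row-normalized maps, used here only as stand-in data."""
    e = np.exp(rng.normal(size=(heads, n, n)))
    return e / e.sum(axis=-1, keepdims=True)

teacher = random_maps(4, 3)  # hypothetical teacher with 4 heads
student = random_maps(2, 3)  # hypothetical student with 2 heads
loss, weights = attention_alignment_loss(teacher, student)
print(weights.shape)  # (4, 2)
```

Every teacher head gets its own row of weights over all student heads, so no teacher head is dropped even though the student has fewer heads. In a real training loop this would be written in an automatic-differentiation framework so that gradients flow into the student.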

If the student had exactly the same number of attention heads as the teacher, and there were no correlations whatever between the maps generated by the teacher’s attention heads, this process might result in something similar to the one-to-one mapping of the standard distillation process. But of course, the point of the approach is to preserve attention map information even when the student has fewer attention heads than the teacher.

And empirically, there’s usually some correlation between attention maps generated by different heads. Indeed, those correlations may explain the success of our method; it’s because of them that multiple attention maps generated by the teacher can be distilled into a single map generated by the student.
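
One simple way to check for such correlations is to flatten the maps from one layer and compute their pairwise correlation coefficients. The sketch below uses synthetic maps that are deliberately built from a shared pattern plus noise, so the values it prints are not experimental results; with a real model, the maps would be read from the teacher's attention layers.

```python
import numpy as np

rng = np.random.default_rng(0)
heads, n = 4, 5  # hypothetical: 4 heads, 5 input elements

# Synthetic stand-in data: a shared pattern plus small per-head noise.
base = rng.random((n, n))
maps = np.stack([base + 0.1 * rng.random((n, n)) for _ in range(heads)])
maps = maps / maps.sum(axis=-1, keepdims=True)  # make each row sum to 1

# Flatten each map and compute pairwise Pearson correlations.
flat = maps.reshape(heads, -1)
corr = np.corrcoef(flat)  # shape (heads, heads)
print(np.round(corr, 2))
```

Large off-diagonal entries in `corr` would indicate heads whose maps carry overlapping information, which is the situation in which several teacher maps can plausibly be absorbed by one student map.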

Acknowledgments: Srikar Appalaraju, Peng Tang, Vijay Mahadevan, R. Manmatha, Ying Nian Wu.
