Conversational AI

Do large language models really need all those layers?

Finding that 70% of attention heads and 20% of feed-forward networks can be excised with minimal effect on in-context learning suggests that large language models are undertrained.

July 9, 2023

3 min read

Large language models (LLMs) have been around for a while but have really captured the attention of the public this year, with the advent of ChatGPT. LLMs are typically pretrained on massive volumes of data; recent variants are additionally tuned to follow instructions and incorporate human feedback using reinforcement learning.

A fascinating ability that these LLMs demonstrate is in-context learning, where a model can learn to perform a task just by following a few (or sometimes even zero) good examples provided along with a new input. Following this paradigm of learning, larger LLMs also proved more capable of performing a wide variety of tasks than smaller ones, when the amount of pretraining data was fixed.

In a paper we’re presenting at this year’s meeting of the Association for Computational Linguistics (ACL), we investigate the importance of model scale for in-context learning, from the perspective of architectural interpretability. We specifically ask the question Are all LLM components really needed to perform in-context learning?

LLM building blocks

Modern LLMs use the Transformer architecture, which depends on an attention mechanism: the model learns to predict which prior tokens in the sequence it should attend to when predicting the current token.

Amazon’s Yang Liu, general chair of this year’s meeting of the Association for Computational Linguistics, on the road ahead for LLMs.

Specifically, LLMs use multihead attention, meaning that they apply multiple attention mechanisms, or heads, in parallel. OPT-66B has 64 layers with 72 attention heads in each layer. The output of the multihead attention passes through a separate feed-forward network (FFN) at each layer.

Our first method for analyzing OPT-66B was to assign a score to each attention head and FFN indicating how important they were to a given task. On the basis of those scores, we then pruned the model.

We found that important attention heads are primarily clustered in the model’s intermediate layers, and important FFNs are primarily in later layers. The ability to perform zero-/few-shot in-context learning on 14 different natural-language-processing (NLP) datasets/tasks stayed nearly intact when up to 70% (~15.7B parameters in OPT-66B) of the attention heads are removed.

Attention head heat map.png — A heat map representing attention heads’ aggregate importance scores for five-shot in-context learning across 14 NLP tasks, at each layer of the OPT-66B model.

The attention heads that are important (and unimportant) for in-context learning also seemed to overlap across tasks and shots. This indicates that a common task-agnostic subset of the attention heads is responsible for in-context learning. We also found that up to 20% of the FFNs (~8.5B parameters) can be removed with minimal decline in performance on zero-/few-shot in-context learning.

Our second analytic technique was to quantify the capacity of all attention heads in OPT-66B to perform a pair of task-agnostic primitive operations associated with in-context learning. Those primitives are prefix matching and copying: explicitly searching for a prior occurrence of the current token in context and copying over the token that succeeded it (its suffix).

Suffix copying.png — Prefix matching and copying.

Heads specialized for these two operations were first discovered by the machine learning research company Anthropic and termed induction heads. We found that a small set of heads in OPT-66B have nontrivial scores for both primitives. We also found that these heads overlap (to varying degrees) with the heads important for specific tasks identified earlier. This indicates that induction heads are capable of more sophisticated behaviors associated with in-context learning, such as latent concept matching, but are not the only heads with such capabilities.

Do large language models really need all those layers?

Finding that 70% of attention heads and 20% of feed-forward networks can be excised with minimal effect on in-context learning suggests that large language models are undertrained.

LLM building blocks

Related content

Work with us