Foundation models (FMs) such as large language models and vision-language models are growing in popularity, but their energy inefficiency and computational cost remain an obstacle to broader deployment.
To address those challenges, we propose a new architecture that, in our experiments, reduced an FM’s inference time by 30% while maintaining its accuracy. Our architecture overcomes challenges in prior approaches to improving efficiency by maintaining both the model’s adaptability and its structural integrity.
With the traditional architecture, when an FM is presented with a new task, data passes through all of its processing nodes, or neurons — even if they’re irrelevant to the current task. Unfortunately, this all-hands-on-deck approach leads to high computational demands and increased costs.
Our goal was to build a model that can select the appropriate subset of neurons on the fly, depending on the task; this is similar to, for instance, the way the brain relies on clumps of specialized neurons in the visual or auditory cortex to see or hear. Such an FM could adapt to multiple kinds of inputs, such as speech and text, over a range of languages, and produce multiple kinds of outputs.
In a paper we presented at this year’s International Conference on Learning Representations (ICLR), we propose a novel context-aware FM for multilingual speech recognition, translation, and language identification. Rather than activating the whole network, this model selects bundles of neurons — or modules — to activate, depending on the input context. The input context includes characteristics such as what language the input is in, speech features of particular languages, and whether the task is speech translation, speech recognition, or language identification.

Once the model identifies the context, it predicts the likelihood of activating each of the modules. We call those likelihoods gate probabilities, and the mechanism that computes them acts as a filter we call a gate predictor. If a module's gate probability exceeds a threshold, that module is activated.
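To make the gating idea concrete, here is a minimal sketch in PyTorch of a gate predictor that maps a context representation to per-module gate probabilities and thresholds them. The class name GatePredictor, the dimensions, and the 0.5 threshold are illustrative assumptions, not the exact design from our paper.

```python
# Minimal sketch of a gate predictor (illustrative; not the paper's exact design).
import torch
import torch.nn as nn

class GatePredictor(nn.Module):
    def __init__(self, context_dim: int, num_modules: int):
        super().__init__()
        # Small network mapping a context embedding to one logit per module.
        self.proj = nn.Linear(context_dim, num_modules)

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # Gate probabilities: the likelihood of activating each module.
        return torch.sigmoid(self.proj(context))

# Threshold the probabilities to decide which modules to activate.
gate = GatePredictor(context_dim=256, num_modules=8)
context = torch.randn(1, 256)   # stand-in for a learned context representation
gate_probs = gate(context)      # shape: (1, 8)
active = gate_probs > 0.5       # boolean mask of modules to run
```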
For instance, based on a few words of spoken German the model might predict, with a likelihood that crosses the gate threshold, that the context is “German audio.” That prediction opens up a subset of appropriate pathways, shutting down others.
Prior approaches to pruning have focused on fine-grained pruning of model layers and of convolutional kernels. Layer pruning, however, can impair a model’s structural integrity, while fine-grained kernel pruning can inhibit a model’s ability to adapt to different kinds of inputs.
Module-wise pruning lets us strike a balance between structural flexibility and the ability to interpret different contexts. The model is trained to dynamically prune irrelevant modules at runtime, which encourages each module to specialize in a different task. A rough illustration of this runtime pruning follows below.
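The sketch below runs only those modules of a layer whose gate probability crosses a threshold and skips the rest. The parallel-module layout, the 0.5 threshold, and the weighted sum of module outputs are assumptions made for clarity rather than the architecture described in our paper.

```python
# Sketch of runtime module-wise pruning (illustrative assumptions throughout).
import torch
import torch.nn as nn

class ModularLayer(nn.Module):
    def __init__(self, dim: int, num_modules: int):
        super().__init__()
        # A bank of parallel modules; in practice these would specialize
        # in different languages or tasks during training.
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            for _ in range(num_modules)
        )

    def forward(self, x: torch.Tensor, gate_probs: torch.Tensor,
                threshold: float = 0.5) -> torch.Tensor:
        # Only modules whose gate probability exceeds the threshold are run;
        # the rest are skipped entirely, saving compute at inference time.
        out = torch.zeros_like(x)
        for prob, block in zip(gate_probs, self.blocks):
            if prob > threshold:
                out = out + prob * block(x)
        return out

layer = ModularLayer(dim=256, num_modules=8)
x = torch.randn(1, 256)
gate_probs = torch.tensor([0.9, 0.1, 0.7, 0.05, 0.2, 0.0, 0.85, 0.3])
y = layer(x, gate_probs)   # only modules 0, 2, and 6 are executed
```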
In experiments, our model demonstrated performance comparable to that of a traditional model but with 30% fewer GPUs, reducing costs and increasing speed.
In addition to saving computational resources, our approach also lets us observe how the model processes linguistic information during training. For each component of a task, we can see the probability distributions for the use of various modules. For instance, if we ask the model to transcribe German speech to text, only the modules for German language and spoken language are activated.
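One simple way to do that kind of inspection, continuing the hypothetical GatePredictor sketch above, is to average the gate probabilities over a set of inputs that share a context and see which modules carry most of the probability mass:

```python
# Sketch: inspect which modules a given context relies on (illustrative).
import torch

def module_usage(gate_predictor, contexts: torch.Tensor) -> torch.Tensor:
    """Mean gate probability per module over a batch of context embeddings."""
    with torch.no_grad():
        probs = gate_predictor(contexts)   # shape: (batch, num_modules)
    return probs.mean(dim=0)               # shape: (num_modules,)

# Feeding contexts derived from, say, German speech would show probability
# mass concentrated on the German-language and spoken-input modules.
```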
This work focused on FMs that specialize in speech tasks. In the future, we aim to explore how this method could generalize to FMs that process even more inputs, including vision, speech, audio, and text.
Acknowledgements: We want to thank Shinji Watanabe, Masao Someki, Nathan Susanj, Jimmy Kunzmann, Ariya Rastrow, Ehry MacRostie, Markus Mueller, Yifan Peng, Siddhant Arora, Thanasis Mouchtaris, Rupak Swaminathan, Rajiv Dhawan, Xuandi Fu, Aram Galstyan, Denis Filimonov, and Sravan Bodapati for the helpful discussions.