Conversational AI

A better path to pruning large language models

A new philosophy for developing LLM architectures reduces energy requirements, speeds up runtime, and preserves pretrained-model performance.

By Kai Zhen

August 8, 2025

5 min read

In recent years, large language models (LLMs) have revolutionized the field of natural-language processing and made significant contributions to computer vision, speech recognition, and language translation. One of the keys to LLMs’ effectiveness has been the exceedingly large datasets they’re trained on. The trade-off is exceedingly large model sizes, which lead to slower runtimes and higher consumption of computational resources. AI researchers know these challenges well, and many of us are seeking ways to make large models more compact while maintaining their performance.

To this end, we’d like to present a novel philosophy, “Prune Gently, Taste Often”, which focuses on a new way to do pruning, a compression process that removes unimportant connections within the layers of an LLM’s neural network. In a paper we presented at this year’s meeting of the Association for Computational Linguistics (ACL), we describe our framework, Wanda++, which can compress a model with seven billion parameters in under 10 minutes on a single GPU.

Measured according to perplexity, or how well a probability distribution predicts a given sample, our approach improves the model’s performance by 32 percent over its leading predecessor, called Wanda.

A brief history of pruning

Pruning is challenging for a number of reasons. First, training huge LLMs is expensive, and once they’re trained, runtime is expensive too. While pruning can make runtime cheaper, if it’s done later in the build process, it hurts performance. But if it’s done too early in the build process, it further exacerbates the first problem: increasing the cost of training.

When a model is trained, it builds a map of semantic connections gleaned from the training data. These connections, called parameters, gain or lose importance, or weight, as more training data is introduced. Pruning during the training stage, called “pruning-aware training,” is baked into the training recipe and performs model-wide scans of weights, at a high computational cost. What’s worse, pruning-aware training comes with a heavy trial burden of full-scale runs. Researchers must decide when to prune, how often, and what criteria to use to keep pretraining performance viable. Tuning such “hyperparameters” requires repeated model-wide culling experiments, further driving up cost

The other approach to pruning is to do it after the LLM is trained. This tends to be cheaper, taking somewhere between a few minutes and a few hours — compared to the weeks that training can take. And post-training pruning doesn’t require a large number of GPUs.

In this approach, engineers scan the model layer by layer for unimportant weights, as measured by a combination of factors such as how big the weight is and how frequently it factors into the model’s final output. If either number is low, the weight is more likely to be pruned. The problem with this approach is that it isn’t “gentle”: it shocks the structure of the model, which loses accuracy since it doesn’t learn anything from the absence of those weights, as it would have if they had been removed during training.

Striking a balance

Here’s where our philosophy presents a third path. After a model is fully trained, we scan it piece by piece, analyzing weights neither at the whole-model level nor at the layer level but at the level of decoding blocks: smaller, repeating building blocks that make up most of an LLM.

Within each decoding block, we feed in a small amount of data and collect the output to calibrate the weights, pruning the unimportant ones and updating the surviving ones for a few iterations. Since decoder blocks are small — a fraction of the size of the entire model — this approach requires only a single GPU, which can scan a block within minutes.

We liken our approach to the way an expert chef spices a complex dish. In cooking, spices are easy to overlook and hard to add at the right moment — and even risky, if handled poorly. One simply cannot add a heap of tarragon, pepper, and salt at the beginning (pruning-aware training) or at the end (layer-wide pruning) and expect to have the same results as if spices had been added carefully throughout. Similarly, our approach finds a balance between two extremes. Pruning block by block, as we do, is more like spicing a dish throughout the process. Hence the motto of our approach: Prune Gently, Taste Often.

From a technical perspective, the key is focusing on decoding blocks, which are composed of a few neural-network layers such as attention layers, multihead attention layers, and multilayer perceptrons. Even an LLM with seven billion parameters might have just 32 decoder blocks. Each block is small enough — say, 200 million parameters — to easily be scanned by a single GPU. Pruning a model at the block level saves resources by not consuming much GPU memory.

And while all pruning processes initially diminish performance, ours actually brings it back. Every time we scan a block, we balanceg pruning with performance until they’re optimized. Then we move on to the next block. This preserves both performance at the block level and overall model quality. With Wanda++, we’re offering a practical, scalable middle path for the LLM optimization process, especially for teams that don’t control the full training pipeline or budget.

Wanda++.png — Pruning at the level of the decoder block is "gentle" because the effects of the pruning are localized; they exert less influence on the overall behavior of the model. Repeating the pruning process for each block is like the practice of a chef who "tastes often" to ensure that the spices in the meal under preparation remain in balance.

What’s more, we believe our philosophy also helps address a pain point of LLM development at large companies. Before the era of LLMs, each team built its own models, with the services that a single LLM now provides achieved via orchestration of those models. Since none of the models was huge, each model development team received its own allocation of GPUs. Nowadays, however, computational resources tend to get soaked up by the teams actually training LLMs. With our philosophy, teams working on runtime performance optimization, for instance, could reclaim more GPUs, effectively expanding what they can explore.

Further implementations of Prune Gently, Taste Often could apply to other architectural optimizations. For instance, calibrating a model at the decoder-block level could convert a neural network with a dense structure, called a dense multilayer perceptron, to a less computationally intensive neural network known as a mixture of experts (MoE). In essence, per-decoder-block calibration can enable a surgical redesign of the model by replacing generic components with more efficient and better-performing alternatives such as Kolmogorov-Arnold Networks (KAN). While the Wanda++ philosophy isn’t a cure-all, we believe it opens up an exciting new path for re-thinking model compression and exploring future LLM architectures.

About the Author

Kai Zhen

Kai Zhen is a senior applied scientist in Amazon's Artificial General Intelligence organizations.

A better path to pruning large language models

A new philosophy for developing LLM architectures reduces energy requirements, speeds up runtime, and preserves pretrained-model performance.

A brief history of pruning

Striking a balance

Related content

Work with us