Large language models (LLMs) keep getting bigger and better. But the cost of running them — generating text, answering questions, powering real-time applications — is scaling up, too. Obviously, model accuracy is important, but for real-time AI-based web applications, it can’t come at the expense of efficiency.
In a paper we presented at the International Conference on Learning Representations (ICLR), we provide a framework for navigating this accuracy-versus-efficiency tradeoff by connecting scaling laws directly to architectural design decisions.
The gap in current scaling laws
In 2022, Google DeepMind announced the results of a study involving an experimental LLM called Chinchilla. The DeepMind researchers demonstrated a scaling law that enabled joint optimization of model size and training data to achieve a desired loss level, given a particular computational budget.
More precisely, the law relates the model loss (L) to the number of model parameters (N) and the number of tokens in the training dataset (D):

L(N, D) = E + A/N^α + B/D^β
The other variables in this equation — E, A, B, α, and β — are all learnable coefficients. The DeepMind researchers did extensive experimentation to tune those coefficients.
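For illustration, here is a minimal sketch of how such coefficients could be fitted to a handful of training runs. The data points, the log-space least-squares objective, and the use of SciPy's Nelder-Mead optimizer are all assumptions for this example, not the procedure used by DeepMind or in our paper.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical training runs: (parameters N, training tokens D, observed loss L).
# All numbers are illustrative only.
N = np.array([80e6, 160e6, 297e6, 297e6, 1e9, 3e9])
D = np.array([8e9, 16e9, 16e9, 30e9, 60e9, 100e9])
L = np.array([3.38, 3.12, 3.05, 2.98, 2.80, 2.66])

def chinchilla_loss(params, N, D):
    # L(N, D) = E + A / N^alpha + B / D^beta; A and B parameterized in log
    # space to keep them positive during optimization.
    E, logA, logB, alpha, beta = params
    return E + np.exp(logA) / N**alpha + np.exp(logB) / D**beta

def objective(params):
    # Fit in log space, a common choice for scaling-law fits.
    pred = chinchilla_loss(params, N, D)
    return np.sum((np.log(pred) - np.log(L)) ** 2)

init = np.array([1.7, np.log(400.0), np.log(400.0), 0.34, 0.28])
result = minimize(objective, init, method="Nelder-Mead",
                  options={"maxiter": 20000, "xatol": 1e-8, "fatol": 1e-10})
E, logA, logB, alpha, beta = result.x
print(f"E={E:.3f}, A={np.exp(logA):.1f}, B={np.exp(logB):.1f}, "
      f"alpha={alpha:.3f}, beta={beta:.3f}")
```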
This "Chinchilla law" doesn't specify architectural choices, such as the size of the model's internal representations — the "hidden size" — or the relative number of parameters allocated to attention layers and multilayer perceptron (MLP) layers. However, two models, each with the same billion-parameter count, trained on the same data, with the same accuracy, can differ by up to 40% in inference-time throughput, depending on additional architectural choices. We set out to deduce scaling laws that can help predict those choices.
The Transformer architecture
The Transformer architecture — which lies at the heart of all LLMs — consists largely of stacked attention and MLP blocks. Attention blocks determine how much weight to give each prior token (word or word part) when updating the current token's representation; MLP blocks transform that representation further and are where much of the model's learned knowledge is stored. A separate output layer at the end of the stack converts the final representation into a probability distribution over the next token.
Architecture is not an afterthought. The right configurations can unlock large efficiency gains with no accuracy cost.
The attention mechanism uses three matrices, with names borrowed from information retrieval: the query matrix encodes what each token is looking for in the rest of the sequence; the key matrix encodes what each token has to offer; and the value matrix holds the content each token can contribute when it's attended to. Comparing queries against keys tells the model how relevant each token is to each other token.
Most LLMs use multihead attention: several attention computations run in parallel, each with its own query, key, and value projections. Different heads tend to specialize in different aspects of the input, letting the model capture a richer set of relationships than a single head would.
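As a concrete, if simplified, picture of the mechanism described above, here is a minimal multihead self-attention block in PyTorch; the dimensions and the causal masking are illustrative defaults, not the configuration of any particular production LLM.

```python
import torch
import torch.nn.functional as F
from torch import nn

class MultiHeadAttention(nn.Module):
    """Minimal multihead self-attention: each head has its own query, key,
    and value projections, realized as slices of three shared linear layers."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, d = x.shape
        # Project and split into heads: (B, n_heads, T, head_dim).
        q = self.q_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        # Compare queries against keys to weight each prior token.
        scores = q @ k.transpose(-2, -1) / self.head_dim**0.5
        causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(causal_mask, float("-inf"))
        weights = F.softmax(scores, dim=-1)
        # Mix value vectors and merge the heads back to d_model.
        out = (weights @ v).transpose(1, 2).reshape(B, T, d)
        return self.out_proj(out)

x = torch.randn(2, 16, 512)                 # (batch, tokens, hidden size)
attn = MultiHeadAttention(d_model=512, n_heads=8)
print(attn(x).shape)                        # torch.Size([2, 16, 512])
```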
Our approach: Architecture as a first-class variable
In our ICLR paper, we introduce a scaling law that augments the Chinchilla framework with three architectural factors: the hidden size (the dimension of the vectors that flow through the embedding, attention, and MLP blocks); the ratio of the number of MLP parameters to the number of attention parameters; and grouped-query attention (GQA), in which groups of attention heads, while preserving distinct query matrices, share key and value matrices.
Each factor has a direct impact on inference throughput (a rough sizing sketch follows the list):
- Hidden size (d_model): Under a fixed parameter budget, larger hidden sizes reduce total inference FLOPs and shrink the key-value cache, improving throughput.
- MLP-to-attention ratio (r_mlp/attn): A higher ratio allocates more parameters to the MLP and fewer to attention, shrinking the key-value cache and reducing memory-bandwidth bottlenecks.
- Grouped-query attention (GQA): Compressing key-value heads further cuts input/output costs during generation.
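To make those levers concrete, here is a rough sizing sketch. The parameter and cache formulas are simplified (biases, normalization layers, and gated MLPs are ignored), and the configuration numbers are illustrative, not those of LLaMA-3.2 or our models.

```python
def kv_cache_bytes_per_token(n_layers: int, n_heads: int, head_dim: int,
                             gqa_groups: int, bytes_per_elem: int = 2) -> int:
    """Simplified KV-cache footprint per generated token (keys + values, fp16).
    With grouped-query attention, only n_heads / gqa_groups key-value heads
    are cached, so larger GQA group counts shrink the cache."""
    n_kv_heads = n_heads // gqa_groups
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

def attn_mlp_params_per_layer(d_model: int, d_ff: int, gqa_groups: int,
                              n_heads: int) -> tuple[int, int]:
    """Rough per-layer parameter counts. Attention: query and output
    projections are d_model x d_model; key and value projections shrink
    under GQA. MLP: two d_model x d_ff matrices."""
    head_dim = d_model // n_heads
    kv_dim = (n_heads // gqa_groups) * head_dim
    attn = 2 * d_model * d_model + 2 * d_model * kv_dim
    mlp = 2 * d_model * d_ff
    return attn, mlp

# Illustrative numbers only (not the exact LLaMA-3.2, Panda, or Surefire configs).
attn, mlp = attn_mlp_params_per_layer(d_model=2048, d_ff=8192,
                                      gqa_groups=4, n_heads=32)
print(f"r_mlp/attn ~ {mlp / attn:.2f}")
print(f"KV cache per token ~ "
      f"{kv_cache_bytes_per_token(16, 32, 64, 4) / 1024:.1f} KiB")
```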
Adjusting these factors purely for higher throughput, however, comes at a cost in accuracy. Both hidden size and MLP-to-attention ratio exhibit U-shaped loss curves: there is an optimal point for each, and pushing too far in either direction degrades model accuracy. GQA has a more erratic effect on loss, so we treat it as a discrete hyperparameter tuned through local search.
We deduce our scaling law in two stages. First, we fit the standard Chinchilla law to the model under investigation, calculating values for the coefficients E, A, B, α, and β. This establishes an optimal reference loss. Then we calibrate how each architectural choice — differences in the three factors we consider — affects that loss. Effectively, we learn a correction surface over the design space. Because the effects of hidden size and MLP-to-attention ratio on loss turn out to be separable, each factor can be optimized independently.
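Below is a minimal sketch of what that second stage could look like, assuming quadratic (U-shaped) corrections in the logs of the two factors. The functional form and the coefficient values are illustrative stand-ins, not the corrections fitted in the paper.

```python
import numpy as np

def corrected_loss(ref_loss, d_model, r_mlp_attn, d_coef, d_opt, r_coef, r_opt):
    """Loss predicted by the fitted Chinchilla law (ref_loss) plus two
    separable, U-shaped correction terms: one in log hidden size and one
    in log MLP-to-attention ratio (quadratic in log space, for illustration)."""
    return (ref_loss
            + d_coef * (np.log(d_model) - d_opt) ** 2
            + r_coef * (np.log(r_mlp_attn) - r_opt) ** 2)

# Illustrative calibrated values (not the coefficients fitted in the paper).
ref_loss = 2.80
d_coef, d_opt = 0.015, np.log(2560)
r_coef, r_opt = 0.008, np.log(1.1)

# Because the corrections are separable, each factor can be swept
# independently while holding the other fixed.
d_grid = np.array([1536, 2048, 2560, 3072, 4096])
r_grid = np.array([0.5, 1.0, 2.0, 4.0, 4.8])
best_d = d_grid[np.argmin([corrected_loss(ref_loss, d, 1.0, d_coef, d_opt,
                                          r_coef, r_opt) for d in d_grid])]
best_r = r_grid[np.argmin([corrected_loss(ref_loss, 2048, r, d_coef, d_opt,
                                          r_coef, r_opt) for r in r_grid])]
print(f"best d_model = {best_d}, best r_mlp/attn = {best_r}")
```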
Two model families: Panda and Surefire
This scaling law enabled us to develop a search framework that identifies Pareto-optimal architectures for any given accuracy target. The result of that search was two model families: Panda (which maximizes accuracy) and Surefire (which is Pareto optimal on the accuracy–efficiency frontier).
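As a generic illustration of the selection step, the sketch below filters a pool of candidate architectures down to the Pareto front over predicted loss and measured throughput. The candidates and the simple dominance check are hypothetical, not the paper's full search procedure.

```python
def pareto_front(candidates):
    """Keep architectures for which no other candidate is at least as good
    on both axes and strictly better on one (lower loss, higher throughput)."""
    front = []
    for name, loss, throughput in candidates:
        dominated = any(
            other_loss <= loss and other_tp >= throughput
            and (other_loss < loss or other_tp > throughput)
            for _, other_loss, other_tp in candidates
        )
        if not dominated:
            front.append((name, loss, throughput))
    return front

# Hypothetical candidates: (name, predicted loss, relative throughput).
candidates = [
    ("A", 2.80, 1.00), ("B", 2.78, 0.70), ("C", 2.80, 1.20),
    ("D", 2.85, 1.10), ("E", 2.79, 1.05),
]
print(pareto_front(candidates))
# -> [('B', 2.78, 0.7), ('C', 2.8, 1.2), ('E', 2.79, 1.05)]
```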
To validate the framework and identify our families of optimal models, we trained more than 200 models with varying architectures (80 million to three billion parameters, eight billion to 100 billion tokens). The results of our experiments are below (throughput measured on an H200 GPU with a batch size of 128, 4,096 input tokens, and 1,024 output tokens):
| Model | d_model | GQA | r_mlp/attn | Loss | Avg. accuracy | Throughput vs. LLaMA-3.2 (vLLM) | Throughput vs. LLaMA-3.2 (SGLang) |
|---|---|---|---|---|---|---|---|
| LLaMA-3.2-1B | 2048 | 4 | 4.80 | 2.803 | 54.9% | baseline | baseline |
| Panda-1B | 2560 | 4 | 1.07 | 2.782 | 57.0% | -33% | - |
| Surefire-1B | 2560 | 9 | 3.60 | 2.804 | 55.4% | +21% | +47% |
| LLaMA-3.2-3B | 3072 | 3 | 4.80 | 2.625 | 61.9% | baseline | baseline |
| Panda-3B | 4096 | 3 | 1.00 | 2.619 | 62.5% | -23% | - |
| Surefire-3B | 4096 | 7 | 1.00 | 2.620 | 62.6% | +12% | +17% |
- The billion-parameter Panda model gains 2.1% in average accuracy over LLaMA-3.2-1B, and the three-billion-parameter model gains 0.6% over LLaMA-3.2-3B — at the cost of lower throughput.
- Surefire models match or exceed LLaMA-3.2 accuracy while improving throughput by 12-47%, with gains reaching up to 42% on A100 (vLLM) and 47% on H200 (SGLang) under different model size and batch size configurations.
Key takeaways
- Architecture is not an afterthought. The optimal MLP-to-attention ratio of LLaMA-3.2-style models is around 1.0, far lower than that of existing open-weight versions (e.g., 4.8 for LLaMA-3.2-1B). Current models overallocate to MLP layers. The right configuration of hidden size, MLP-to-attention ratio, and GQA can unlock large efficiency gains with no accuracy cost.
- Small-scale experiments predict large-scale outcomes. The conditional scaling law, calibrated on models with as few as 80 million to 297 million parameters, reliably predicts the best architecture at one billion and three billion parameters, enabling low-cost exploration before expensive full-scale training.
- The framework generalizes across hardware and serving systems. Efficiency gains are consistent across A100/H200 GPUs and vLLM/SGLang, making the results directly actionable.