Fine-tuning a large language model (LLM) on a specific task requires updates to billions of parameters across trillions of tokens, with the attendant costs in GPU resources and time.
Low-rank adaptation (LoRA) is a more efficient alternative that freezes the original model weights but introduces lightweight matrices into specific model sublayers, or “modules”. These matrices (commonly referred to as “adapters”) modify the modules’ weights, enabling not only efficient fine tuning but also on-demand model serving, which dramatically lowers inference costs; base-model sharing across GPUs, which cuts memory requirements; lower download overhead; and parallel inference across multiple adapters.
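In code, a LoRA adapter can be sketched as a pair of small trainable matrices riding alongside a frozen linear layer. The sketch below is illustrative, not Nova's implementation; the class name and the rank and alpha defaults are our own assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer (module) plus a trainable low-rank adapter."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # the original weights stay frozen
        # The adapter is the product B @ A of two small matrices.
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # B starts at zero, so at initialization the module behaves
        # exactly like the frozen base layer.
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)
```

Because only lora_a and lora_b receive gradients, an adapter can be trained, stored, and swapped independently of the base weights, which is what makes serving many adapters over one shared base model practical.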
The question is where to insert these adapters across the model. Empirically, targeting more and larger modules tends to boost performance, because it allows more flexibility in customization; but it also increases training and inference costs. Using a smaller, well-chosen subset preserves most gains with significantly better efficiency.
Using Amazon’s Nova 2.0 Lite multimodal reasoning LLM as our base model, we set ourselves the goal of identifying a subset of standardized target-module configurations that works effectively across the vast majority of customer use cases. Through an ablation study, we identified o_proj as the single module where adding an adapter achieves the best trade-off between efficiency and accuracy. (o_proj is a linear transformation that mixes representations across attention heads into a single, cohesive form for the rest of the model to process.)
The Transformer architecture
Transformer models — the models responsible for all of AI’s remarkable recent gains — consist largely of blocks that are repeated multiple times. Each block in turn has two main components: an attention mechanism, which determines the relevance of previously seen tokens to the token currently being processed, and a feed-forward network, a conventional neural network that does additional processing on the outputs of the attention mechanism.
The attention mechanism involves three different matrices, which take their names from database design: the query matrix represents what the current token is looking for in the other tokens of the input sequence; the key matrix represents what each of those tokens has to offer; and the value matrix represents the raw content of those tokens. Multiplying queries against keys yields relevance scores, which are used to weight the values: essentially, a recipe for the Transformer's next output.
To reduce computational complexity, these multiplications take place in parallel attention "heads", each operating in a space with reduced dimensions. The heads' outputs then have to be recombined and projected back to the original dimensions of the input; that recombination is the job of the o_proj layer.
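The flow above can be traced with toy tensor shapes. All sizes here are hypothetical, chosen small for illustration; the final multiplication by w_o is the o_proj step.

```python
import torch

# Toy multi-head attention with hypothetical sizes (not Nova's).
batch, seq, d_model, n_heads = 2, 5, 64, 4
d_head = d_model // n_heads          # each head works in a reduced 16-dim space

x = torch.randn(batch, seq, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
w_o = torch.randn(d_model, d_model)  # the output projection (o_proj)

def split_heads(t: torch.Tensor) -> torch.Tensor:
    # (batch, seq, d_model) -> (batch, n_heads, seq, d_head)
    return t.view(batch, seq, n_heads, d_head).transpose(1, 2)

q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
scores = (q @ k.transpose(-2, -1)) / d_head**0.5   # query-key relevance
mixed = scores.softmax(dim=-1) @ v                 # weighted mix of values
concat = mixed.transpose(1, 2).reshape(batch, seq, d_model)
out = concat @ w_o   # o_proj blends the heads back into one representation
```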
LoRA approximates weight updates using a product of two smaller matrices, drastically reducing the number of trainable parameters. The technique is typically applied to attention projection layers and feed-forward network layers. These modules are ideal candidates because they constitute the bulk of Transformer parameters, directly govern representation learning, and exhibit natural alignment with low-rank approximations. Empirical evidence shows weight changes in these layers often lie within a low-dimensional subspace during fine tuning.
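The parameter savings are easy to quantify: a dense update to a d × d weight matrix has d² entries, while a rank-r LoRA update needs only r(d_in + d_out). A back-of-the-envelope comparison, using a hypothetical hidden size rather than Nova's actual dimensions:

```python
d = 4096                    # hypothetical hidden size
r = 16                      # LoRA rank
full_update = d * d         # dense weight update
lora_update = r * (d + d)   # A is r x d, B is d x r
# The low-rank update is well under 1% of the dense update's parameters.
print(f"{lora_update / full_update:.4%} of the dense update's parameters")
```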
Target module selection
Selecting the right target modules directly affects accuracy, latency, and computational efficiency. The optimal choice of target modules is primarily a function of (a) the base model being fine-tuned (i.e., its architecture, pre- and post-training data distributions, etc.) and (b) customization domain/modality.
When fine-tuning Nova 2.0 Lite, we balanced two competing objectives:
- Maximizing accuracy across diverse tasks and modalities and
- Minimizing latency to preserve LoRA's efficiency benefits.
We investigated the application of LoRA to four different modules in each Transformer block: the query, key, and value projection layers (qkv); the o_proj layer; and two different fully connected layers in the feed-forward network, gate_up_proj and gate_down_proj (referred to as fc1 and fc2). Below are the trade-offs for these modules, both singly and in combination, based on results published in the literature and on our own empirical studies.
| Combination | Expected accuracy | Expected latency |
|---|---|---|
| qkv only | Good (baseline) | Lowest |
| o_proj only | Moderate | Lowest |
| qkv + o_proj | High | Low to moderate (+5–10%) |
| qkv + fc1/fc2 | Very high (close to full fine tuning) | Moderate (+10–15%) |
| o_proj + fc1/fc2 | Good to high | Moderate (+5–10%) |
| qkv + o_proj + fc1/fc2 | Highest (near-full fine tuning) | High (+15–20%) |
| All modules | Maximum | Highest (+20–25%) |
Trade-offs of accuracy and latency across target modules, based on literature review and empirical evidence.
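The latency ordering in the table tracks the number of adapter parameters each combination adds per Transformer block. A rough count, using hypothetical layer sizes (not Nova's) and assuming a fused qkv projection:

```python
d_model, d_ff, rank = 1024, 4096, 8   # hypothetical sizes

# LoRA adds rank * (in_features + out_features) parameters per adapted module.
adapter_cost = {
    "qkv": rank * (d_model + 3 * d_model),   # fused q/k/v projection
    "o_proj": rank * (d_model + d_model),
    "fc1": rank * (d_model + d_ff),
    "fc2": rank * (d_ff + d_model),
}

def combo_params(targets):
    """Adapter parameters added per block for a target-module combination."""
    return sum(adapter_cost[m] for m in targets)

# o_proj alone is the cheapest target; feed-forward modules cost far more.
for combo in (["o_proj"], ["qkv"], ["o_proj", "fc2"], list(adapter_cost)):
    print(" + ".join(combo), combo_params(combo))
```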
Experimental methodology
We conducted a comprehensive ablation study, training multiple supervised-fine-tuning (SFT) LoRA variants on seven datasets spanning both text and visual data, across reasoning (i.e., the training datasets themselves include reasoning content) and non-reasoning tasks. The datasets covered diverse challenges from simple question answering to long-context summarization and structured JSON extraction.
| Dataset | Modality | Reasoning traces | Domain | Tasks | Training size | Eval size | Eval metric |
|---|---|---|---|---|---|---|---|
| FinCOT | Text | Yes | Finance | Financial-reasoning dataset. Samples consist of complex financial queries, along with reasoning traces obtained from GPT-4o. Predictions are typically complex tables or calculations based on the input. | 7,436 | 1,147 | Accuracy |
| GovReport | Text | No | Government documents | Long-context (30–40K tokens) summarization | 17,457 | 837 | RougeLsum |
| MedMCQA | Text | No | Medical | Dataset for multiple-choice QA, also used in Nova 1.0 | 20,000 | 3,683 | Accuracy |
| MedReason | Text | Yes | Medical | Medical-reasoning dataset that consists of questions and answers compiled from various medical benchmarks (MedQA, MedMCQA, etc.), along with synthetic, high-quality reasoning traces. (Uses the same eval set as MedMCQA.) | 31,682 | 3,683 | Accuracy |
| CoCoHD | Text | No | Political documents | A complex benchmark consisting of long-context (>20K tokens) transcripts of congressional hearings. The expected output is a summary in a specific JSON format, listing the members present, topics discussed, outcomes, etc. | 732 | 1,053 | Averaged key and value match rate |
| Llava-COT | Image | Yes | Image understanding, general/science | Multimodal image benchmark consisting of Q&A reasoning questions. The dataset includes high-quality reasoning traces. | 10,000 | 270 | Exact match rate |
| Invoice OCR | Image | No | Image understanding | OCR benchmark that takes an input image and produces a JSON file with fields from the image. | 1,400 | 447 | Accuracy |
Summary of the experiment datasets
All experiments used the Nova 2.0 Lite general-availability checkpoint with consistent hyperparameters across target modules, including learning-rate ratio and alpha values.
| Target dataset | Setting | SFT LoRA target performance | Nova 2.0 Lite performance |
|---|---|---|---|
| Fin-COT | qkv | 67.09% | 72.12% |
| | o_proj | 68.30% | |
| | fc1 | 75.35% | |
| | fc2 | 60.24% | |
| | o_proj + fc1 | 61.38% | |
| | qkv + fc2 | 60.31% | |
| | o_proj + fc2 | 62.79% | |
| | qkv + fc1 | 68.37% | |
| | All target modules | 66.15% | |
| CoCoHD | qkv | 19.64% | 45.14% |
| | o_proj | 65.88% | |
| | fc1 | 41.96% | |
| | fc2 | 17.62% | |
| | o_proj + fc1 | 76.83% | |
| | qkv + fc2 | 66.47% | |
| | o_proj + fc2 | 79.14% | |
| | qkv + fc1 | 45.45% | |
| | All target modules | 82.75% | |
| GovReport | o_proj | 41.25% | 38.90% |
| | fc1 | 39.69% | |
| | o_proj + fc1 | 41.74% | |
| | o_proj + fc2 | 42.16% | |
| | qkv + fc1 | 41.66% | |
| | qkv + fc2 | 39.02% | |
| | All target modules | 41.95% | |
| Llava-COT | qkv | 64.26% | 16.22% |
| | o_proj | 64.26% | |
| | fc1 | 65.92% | |
| | fc2 | 65.02% | |
| | o_proj + fc1 | 63.21% | |
| | qkv + fc2 | 62.76% | |
| | o_proj + fc2 | 66.37% | |
| | qkv + fc1 | 66.52% | |
| | All target modules | 63.96% | |
| Invoice OCR | o_proj | 89.07% | 14.10% |
| | o_proj + fc1 | 90.03% | |
| | qkv + fc2 | 87.84% | |
| | o_proj + fc2 | 89.47% | |
| | qkv + fc1 | 88.55% | |
| | All target modules | 90.11% | |
| MedReason | o_proj | 24.55% | 1.68% |
| | o_proj + fc1 | 20.88% | |
| | qkv + fc2 | 8.39% | |
| | o_proj + fc2 | 20.36% | |
| | qkv + fc1 | 4.32% | |
| | All target modules | 26.72% | |
| MedMCQA | qkv | 62.18% | 1.68% |
| | o_proj | 63.10% | |
| | fc1 | 12.90% | |
| | fc2 | 59.98% | |
| | o_proj + fc1 | 61.39% | |
| | qkv + fc2 | 65.63% | |
| | o_proj + fc2 | 64.95% | |
| | qkv + fc1 | 57.21% | |
| | All target modules | 66.11% | |
Ablation study for target-module selection. Some benchmarks have fewer variations, to save computation and time. MedMCQA and MedReason use the MedMCQA test set for evaluation. On this task, Nova 2.0 Lite often produces the right answer but fails under strict parsing due to formatting inconsistencies; for consistency's sake, we apply the same strict parser to the SFT models.
Key findings
1. o_proj is the most robust single target
The o_proj-only configuration demonstrated remarkable consistency, never failing outright on any task and typically performing within a few percentage points of the best configuration (i.e., using all target modules). On MedMCQA, GovReport, LLaVA-CoT, and Invoice OCR, o_proj-only either matched or came very close to optimal performance, making it an attractive default choice that balances performance and simplicity. There is emerging evidence that this module plays a key role in reasoning, which may explain its effectiveness here.
2. qkv-only shows instability
While qkv-only performed well on MedMCQA, it exhibited extreme variability: it fell far below baseline on CoCoHD and showed unremarkable results elsewhere. This aligns with the hypothesis that attention-only LoRA can underfit on tasks that require richer feature transformations in the feed-forward network, rather than just modified token routing in attention.
3. Module combinations provide modest gains
Combinations like o_proj + fc2 or "all target modules" often achieved the highest per-dataset scores (particularly on CoCoHD, MedReason, and Invoice OCR). However, improvements over the best single module were typically modest, usually 1-3 percentage points.
4. Task difficulty amplifies configuration impact
On challenging benchmarks where the base model performed poorly, the choice of target modules had greater impact. For example, on CoCoHD (long-context, complex JSON generation), o_proj + fc2 improved on the base model by 34 percentage points (45.14% → 79.14%), versus 21 points for o_proj alone.
5. LoRA consistently outperforms base models
Across nearly all datasets, any reasonable LoRA configuration dramatically outperformed the base model. For instance, MedMCQA, LLaVA-CoT, and Invoice OCR improved from baseline accuracies of roughly 1-16% to 60-90%+ with LoRA, and MedReason from 1.68% to 26.72%. The notable exception was Fin-COT, where only one configuration (fc1) exceeded baseline performance, suggesting task-specific sensitivity to adaptation strategy.
Recommendations
For accuracy-prioritized scenarios, we recommend o_proj + fc2 as the optimal configuration for both text and multimodal tasks; it improves on o_proj alone on most benchmarks, by as much as 13 percentage points on CoCoHD.
For balanced efficiency and performance, o_proj-only provides an excellent default, offering robust performance with minimal latency overhead — particularly valuable when serving multiple adapters or operating under resource constraints.
For challenging tasks, such as benchmarks with long context or complex generation requirements or other tasks where base models struggle, the additional accuracy from o_proj + fc2 justifies the modest latency increase.
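In practice, a target-module choice like the ones above is a one-line setting in common LoRA tooling. As a sketch, assuming the open-source Hugging Face PEFT library (not Nova's internal stack), the o_proj-only default might look like this; the rank and alpha values are placeholders, not tuned recommendations:

```python
from peft import LoraConfig

# Hypothetical settings; r and lora_alpha here are placeholders.
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["o_proj"],   # the robust, low-latency default
    # Accuracy-prioritized runs would add the fc2 module as well,
    # e.g., target_modules=["o_proj", "gate_down_proj"].
    task_type="CAUSAL_LM",
)
```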
Future directions
Our research opens several promising avenues for further optimization:
- Modality and task-specific configurations: Segmenting target module selection by modality and task difficulty (e.g., long-context scenarios) could yield specialized configurations with better accuracy-latency trade-offs.
- Per-module hyperparameter optimization: Extensive hyperparameter optimization for each target module configuration could unlock additional performance gains, though computational costs remain a consideration.
- Two-stage LoRA for early candidate identification: Leveraging two-stage LoRA approaches that use training dynamics, gradients, etc., to determine the importance of different modules/layers could help identify promising configurations early in training, reducing the cost of comprehensive hyperparameter searches.
- Layer pruning for latency reduction: Using two-stage training to identify and prune unused layers could further reduce inference latency while maintaining accuracy.
Conclusion
Our comprehensive study demonstrates that thoughtful target module selection in LoRA fine tuning can improve accuracy while preserving the efficiency advantages that make LoRA attractive for production deployments. The o_proj layer emerges as a remarkably robust single target, while o_proj + fc2 combinations offer the best accuracy for challenging tasks. On average, o_proj LoRA is within 2% of o_proj + fc2 in terms of accuracy but has 22.6% lower latency (TPOT p95 decreases from 10.085ms → 7.803ms). These findings provide a principled foundation for standardizing LoRA configurations across diverse customer use cases, balancing the competing demands of model performance and computational efficiency.
Acknowledgements: Kevin Rondinone, Kevin Chen, Nicole Ding, Sebastian Massella, Andy Li