Fine-tuning a large language model (LLM) on a specific task requires updates to billions of parameters across trillions of tokens, with the attendant costs in GPU resources and time.
Low-rank adaptation (LoRA) is a more efficient alternative that freezes the original model weights but introduces lightweight matrices into specific model sublayers, or “modules”. These matrices (commonly referred to as “adapters”) modify the modules’ weights, enabling not only efficient fine tuning but also on-demand model serving, which dramatically lowers inference costs; base-model sharing across GPUs, which cuts memory requirements; lower download overhead; and parallel inference across multiple adapters.
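In code, a LoRA adapter can be sketched as a pair of small trainable matrices riding alongside a frozen linear layer. The sketch below is illustrative, not Nova's implementation; the class name and the rank and alpha defaults are our own assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer (module) plus a trainable low-rank adapter."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # the original weights stay frozen
        # The adapter is the product B @ A of two small matrices.
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # B starts at zero, so at initialization the module behaves
        # exactly like the frozen base layer.
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)
```

Because only lora_a and lora_b receive gradients, an adapter can be trained, stored, and swapped independently of the base weights, which is what makes serving many adapters over one shared base model practical.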
The question is where to insert these adapters across the model. Empirically, targeting more and larger modules tends to boost performance, because it allows more flexibility in customization; but it also increases training and inference costs. Using a smaller, well-chosen subset preserves most gains with significantly better efficiency.
Using Amazon’s Nova 2.0 Lite multimodal reasoning LLM as our base model, we set ourselves the goal of identifying a subset of standardized target-module configurations that works effectively across the vast majority of customer use cases. Through an ablation study, we identified o_proj as the single module where adding an adapter achieves the best trade-off between efficiency and accuracy. (o_proj is a linear transformation that mixes representations across attention heads into a single, cohesive form for the rest of the model to process.)
The Transformer architecture
Transformer models — the models responsible for all of AI’s remarkable recent gains — consist largely of blocks that are repeated multiple times. Each block in turn has two main components: an attention mechanism, which determines the relevance of previously seen tokens to the token currently being processed, and a feed-forward network, a conventional neural network that does additional processing on the outputs of the attention mechanism.
The attention mechanism involves three different matrices, which take their names from database design: the query matrix represents what the current token is looking for in the other tokens of the input sequence; the key matrix represents what each of those tokens has to offer; and the value matrix represents the raw content of those tokens. Multiplying queries against keys yields relevance scores, which are used to weight the values: essentially, a recipe for the Transformer's next output.
To reduce computational complexity, these multiplications take place in parallel attention "heads", each operating in a space with reduced dimensions. The heads' outputs then have to be recombined and projected back to the original dimensions of the input; that recombination is the job of the o_proj layer.
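The flow above can be traced with toy tensor shapes. All sizes here are hypothetical, chosen small for illustration; the final multiplication by w_o is the o_proj step.

```python
import torch

# Toy multi-head attention with hypothetical sizes (not Nova's).
batch, seq, d_model, n_heads = 2, 5, 64, 4
d_head = d_model // n_heads          # each head works in a reduced 16-dim space

x = torch.randn(batch, seq, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
w_o = torch.randn(d_model, d_model)  # the output projection (o_proj)

def split_heads(t: torch.Tensor) -> torch.Tensor:
    # (batch, seq, d_model) -> (batch, n_heads, seq, d_head)
    return t.view(batch, seq, n_heads, d_head).transpose(1, 2)

q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
scores = (q @ k.transpose(-2, -1)) / d_head**0.5   # query-key relevance
mixed = scores.softmax(dim=-1) @ v                 # weighted mix of values
concat = mixed.transpose(1, 2).reshape(batch, seq, d_model)
out = concat @ w_o   # o_proj blends the heads back into one representation
```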
LoRA approximates weight updates using a product of two smaller matrices, drastically reducing the number of trainable parameters. The technique is typically applied to attention projection layers and feed-forward network layers. These modules are ideal candidates because they constitute the bulk of Transformer parameters, directly govern representation learning, and exhibit natural alignment with low-rank approximations. Empirical evidence shows weight changes in these layers often lie within a low-dimensional subspace during fine tuning.
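The parameter savings are easy to quantify: a dense update to a d × d weight matrix has d² entries, while a rank-r LoRA update needs only r(d_in + d_out). A back-of-the-envelope comparison, using a hypothetical hidden size rather than Nova's actual dimensions:

```python
d = 4096                    # hypothetical hidden size
r = 16                      # LoRA rank
full_update = d * d         # dense weight update
lora_update = r * (d + d)   # A is r x d, B is d x r
# The low-rank update is well under 1% of the dense update's parameters.
print(f"{lora_update / full_update:.4%} of the dense update's parameters")
```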
Target module selection
Selecting the right target modules directly affects accuracy, latency, and computational efficiency. The optimal choice of target modules is primarily a function of (a) the base model being fine-tuned (i.e., its architecture, pre- and post-training data distributions, etc.) and (b) customization domain/modality.
When fine-tuning Nova 2.0 Lite, we balanced two competing objectives:
- Maximizing accuracy across diverse tasks and modalities and
- Minimizing latency to preserve LoRA's efficiency benefits.
We investigated the application of LoRA to four different modules in each Transformer block: the query, key, and value projection layers (qkv); the o_proj layer; and two different fully connected layers in the feed-forward network, gate_up_proj and gate_down_proj (referred to as fc1 and fc2). Below are the trade-offs for these modules, both singly and in combination, based on results published in the literature and on our own empirical studies.
| Combination | Expected accuracy | Expected latency |
|---|---|---|
| qkv only | Good (baseline) | Lowest |
| o_proj only | Moderate | Lowest |
| qkv + o_proj | High | Low to moderate (+5–10%) |
| qkv + fc1/fc2 | Very high (close to full fine tuning) | Moderate (+10–15%) |
| o_proj + fc1/fc2 | Good to high | Moderate (+5–10%) |
| qkv + o_proj + fc1/fc2 | Highest (near-full fine tuning) | High (+15–20%) |
| All modules | Maximum | Highest (+20–25%) |
Trade-offs of accuracy and latency across target modules, based on literature review and empirical evidence.
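The latency ordering in the table tracks the number of adapter parameters each combination adds per Transformer block. A rough count, using hypothetical layer sizes (not Nova's) and assuming a fused qkv projection:

```python
d_model, d_ff, rank = 1024, 4096, 8   # hypothetical sizes

# LoRA adds rank * (in_features + out_features) parameters per adapted module.
adapter_cost = {
    "qkv": rank * (d_model + 3 * d_model),   # fused q/k/v projection
    "o_proj": rank * (d_model + d_model),
    "fc1": rank * (d_model + d_ff),
    "fc2": rank * (d_ff + d_model),
}

def combo_params(targets):
    """Adapter parameters added per block for a target-module combination."""
    return sum(adapter_cost[m] for m in targets)

# o_proj alone is the cheapest target; feed-forward modules cost far more.
for combo in (["o_proj"], ["qkv"], ["o_proj", "fc2"], list(adapter_cost)):
    print(" + ".join(combo), combo_params(combo))
```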
Experimental methodology
We conducted a comprehensive ablation study, training multiple supervised-fine-tuning (SFT) LoRA variants on seven datasets spanning both text and visual data, across reasoning (i.e., the training datasets themselves include reasoning content) and non-reasoning tasks. The datasets covered diverse challenges from simple question answering to long-context summarization and structured JSON extraction.
| Dataset | Modality | Reasoning traces | Domain | Tasks | Training size | Eval size | Eval metric |
|---|---|---|---|---|---|---|---|
| FinCOT | Text | Yes | Finance | Financial-reasoning dataset. Samples consist of complex financial queries, along with reasoning traces obtained from GPT-4o. Predictions are typically complex tables or calculations based on the input. | 7,436 | 1,147 | Accuracy |
| GovReport | Text | No | Government documents | Long-context (30–40K tokens) summarization | 17,457 | 837 | RougeLsum |
| MedMCQA | Text | No | Medical | Dataset for multiple-choice QA, also used in Nova 1.0 | 20,000 | 3,683 | Accuracy |
| MedReason | Text | Yes | Medical | Medical-reasoning dataset that consists of questions and answers compiled from various medical benchmarks (MedQA, MedMCQA, etc.), along with synthetic, high-quality reasoning traces. (Uses the same eval set as MedMCQA.) | 31,682 | 3,683 | Accuracy |
| CoCoHD | Text | No | Political documents | A complex benchmark consisting of long-context (>20K tokens) transcripts of congressional hearings. The expected output is a summary in a specific JSON format, listing the members present, topics discussed, outcomes, etc. | 732 | 1,053 | Averaged key and value match rate |
| Llava-COT | Image | Yes | Image understanding, general/science | Multimodal image benchmark consisting of Q&A reasoning questions. The dataset includes high-quality reasoning traces. | 10,000 | 270 | Exact match rate |
| Invoice OCR | Image | No | Image understanding | OCR benchmark that takes an input image and produces a JSON file with fields from the image. | 1,400 | 447 | Accuracy |
Summary of the experiment datasets
All experiments used the Nova 2.0 Lite general-availability checkpoint with consistent hyperparameters across target modules, including learning-rate ratio and alpha values.
| Target dataset | Setting | SFT LoRA target performance | Nova 2.0 Lite performance |
|---|---|---|---|
| Fin-COT | qkv | 67.09% | 72.12% |
| | o_proj | 68.30% | |
| | fc1 | 75.35% | |
| | fc2 | 60.24% | |
| | o_proj + fc1 | 61.38% | |
| | qkv + fc2 | 60.31% | |
| | o_proj + fc2 | 62.79% | |
| | qkv + fc1 | 68.37% | |
| | All target modules | 66.15% | |
| CoCoHD | qkv | 19.64% | 45.14% |
| | o_proj | 65.88% | |
| | fc1 | 41.96% | |
| | fc2 | 17.62% | |
| | o_proj + fc1 | 76.83% | |
| | qkv + fc2 | 66.47% | |
| | o_proj + fc2 | 79.14% | |
| | qkv + fc1 | 45.45% | |
| | All target modules | 82.75% | |
| GovReport | o_proj | 41.25% | 38.90% |
| | fc1 | 39.69% | |
| | o_proj + fc1 | 41.74% | |
| | o_proj + fc2 | 42.16% | |
| | qkv + fc1 | 41.66% | |
| | qkv + fc2 | 39.02% | |
| | All target modules | 41.95% | |
| Llava-COT | qkv | 64.26% | 16.22% |
| | o_proj | 64.26% | |
| | fc1 | 65.92% | |
| | fc2 | 65.02% | |
| | o_proj + fc1 | 63.21% | |
| | qkv + fc2 | 62.76% | |
| | o_proj + fc2 | 66.37% | |
| | qkv + fc1 | 66.52% | |
| | All target modules | 63.96% | |
| Invoice OCR | o_proj | 89.07% | 14.10% |
| | o_proj + fc1 | 90.03% | |
| | qkv + fc2 | 87.84% | |
| | o_proj + fc2 | 89.47% | |
| | qkv + fc1 | 88.55% | |
| | All target modules | 90.11% | |
| MedReason | o_proj | 24.55% | 1.68% |
| | o_proj + fc1 | 20.88% | |
| | qkv + fc2 | 8.39% | |
| | o_proj + fc2 | 20.36% | |
| | qkv + fc1 | 4.32% | |
| | All target modules | 26.72% | |
| MedMCQA | qkv | 62.18% | 1.68% |
| | o_proj | 63.10% | |
| | fc1 | 12.90% | |
| | fc2 | 59.98% | |
| | o_proj + fc1 | 61.39% | |
| | qkv + fc2 | 65.63% | |
| | o_proj + fc2 | 64.95% | |
| | qkv + fc1 | 57.21% | |
| | All target modules | 66.11% | |
Ablation study for target-module selection. Some benchmarks have fewer variations, to save computation and time. MedMCQA and MedReason use the MedMCQA test set for evaluation. On this task, Nova 2.0 Lite often produces the right answer but fails under strict parsing due to formatting inconsistencies; for consistency's sake, we apply the same strict parser to the SFT models.
Key findings
1. o_proj is the most robust single target
The o_proj-only configuration demonstrated remarkable consistency, never failing outright on any task and typically performing within a few percentage points of the best configuration (i.e., using all target modules). On MedMCQA, GovReport, LLaVA-CoT, and Invoice OCR, o_proj-only either matched or came very close to optimal performance, making it an attractive default choice that balances performance and simplicity. There is emerging evidence that this module plays a key role in reasoning, which may explain its effectiveness here.
2. qkv-only shows instability
While qkv-only performed well on MedMCQA, it exhibited extreme variability: it fell far below baseline on CoCoHD and showed unremarkable results elsewhere. This aligns with the hypothesis that attention-only LoRA can underfit on tasks that require richer feature transformations in the feed-forward network, rather than just modified token routing in attention.
3. Module combinations provide modest gains
Combinations like o_proj + fc2 or "all target modules" often achieved the highest per-dataset scores (particularly on CoCoHD, MedReason, and Invoice OCR). However, improvements over the best single module were typically modest, usually 1-3 percentage points.
4. Task difficulty amplifies configuration impact
On challenging benchmarks where the base model performed poorly, the choice of target modules had greater impact. For example, on CoCoHD (long-context, complex JSON generation), o_proj + fc2 improved on the base model by 34 percentage points (45.14% → 79.14%), versus 21 points for o_proj alone.
5. LoRA consistently outperforms base models
Across nearly all datasets, any reasonable LoRA configuration dramatically outperformed the base model. For instance, MedMCQA, LLaVA-CoT, and Invoice OCR improved from baseline accuracies of roughly 1-16% to 60-90%+ with LoRA, and MedReason from 1.68% to 26.72%. The notable exception was Fin-COT, where only one configuration (fc1) exceeded baseline performance, suggesting task-specific sensitivity to adaptation strategy.
Recommendations
For accuracy-prioritized scenarios, we recommend o_proj + fc2 as the optimal configuration for both text and multimodal tasks; it improves on o_proj alone on most benchmarks, by as much as 13 percentage points on CoCoHD.
For balanced efficiency and performance, o_proj-only provides an excellent default, offering robust performance with minimal latency overhead — particularly valuable when serving multiple adapters or operating under resource constraints.
For challenging tasks, such as benchmarks with long context or complex generation requirements or other tasks where base models struggle, the additional accuracy from o_proj + fc2 justifies the modest latency increase.
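In practice, a target-module choice like the ones above is a one-line setting in common LoRA tooling. As a sketch, assuming the open-source Hugging Face PEFT library (not Nova's internal stack), the o_proj-only default might look like this; the rank and alpha values are placeholders, not tuned recommendations:

```python
from peft import LoraConfig

# Hypothetical settings; r and lora_alpha here are placeholders.
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["o_proj"],   # the robust, low-latency default
    # Accuracy-prioritized runs would add the fc2 module as well,
    # e.g., target_modules=["o_proj", "gate_down_proj"].
    task_type="CAUSAL_LM",
)
```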
Future directions
Our research opens several promising avenues for further optimization:
- Modality and task-specific configurations: Segmenting target module selection by modality and task difficulty (e.g., long-context scenarios) could yield specialized configurations with better accuracy-latency trade-offs.
- Per-module hyperparameter optimization: Extensive hyperparameter optimization for each target module configuration could unlock additional performance gains, though computational costs remain a consideration.
- Two-stage LoRA for early candidate identification: Leveraging two-stage LoRA approaches that use training dynamics, gradients, etc., to determine the importance of different modules/layers could help identify promising configurations early in training, reducing the cost of comprehensive hyperparameter searches.
- Layer pruning for latency reduction: Using two-stage training to identify and prune unused layers could further reduce inference latency while maintaining accuracy.
Conclusion
Our comprehensive study demonstrates that thoughtful target module selection in LoRA fine tuning can improve accuracy while preserving the efficiency advantages that make LoRA attractive for production deployments. The o_proj layer emerges as a remarkably robust single target, while o_proj + fc2 combinations offer the best accuracy for challenging tasks. On average, o_proj LoRA is within 2% of o_proj + fc2 in terms of accuracy but has 22.6% lower latency (TPOT p95 decreases from 10.085ms → 7.803ms). These findings provide a principled foundation for standardizing LoRA configurations across diverse customer use cases, balancing the competing demands of model performance and computational efficiency.
Acknowledgements: Kevin Rondinone, Kevin Chen, Nicole Ding, Sebastian Massella, Andy Li