Diverse reasoning traces teach LLMs to make better decisions

How to train language models to generate diverse, accurate reasoning paths using tokens that control distinct reasoning strategies.

By Sheng Jia, Xiao Wang, Shiva Kasiviswanathan

May 26, 2026

5 min read

Key takeaways

Amazon researchers introduce set-supervised fine-tuning (SSFT) and global forking policy optimization (GFPO) to train language models that generate diverse reasoning paths.
SSFT and GFPO improve single-shot accuracy on the AIME 2025 and LiveCodeBench benchmarks without mode collapse.
Global forking tokens are used to elicit distinct reasoning modes, enabling the model to produce diverse, high-quality reasoning paths.
SSFT models reasoning as a set of complete solution paths, while GFPO selects the most effective reasoning mode for each input.
The combined approach of SSFT and GFPO results in gains of 5% to 7% in single-shot accuracy on standard benchmarks.

Large language models (LLMs) are pretrained on huge volumes of unlabeled data and are then typically post-trained on specific tasks such as following instructions, avoiding harmful outputs, and reasoning, that is, justifying the outputs they generate.

Parallel reasoning, in which multiple, diverse reasoning paths are generated and compared for the same problem, is emerging as a key tool for understanding the limits of LLMs’ reasoning capability. It also underpins inference-time techniques such as self-consistency, in which multiple reasoning paths are aggregated to improve accuracy.
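As a rough illustration of self-consistency, the sketch below takes the final answers extracted from several sampled reasoning paths and returns the most frequent one. The function name and the sample answers are hypothetical, and majority voting is only one common way to aggregate paths.

```python
from collections import Counter

def self_consistency(answers):
    """Return the most common final answer among sampled reasoning paths."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    # most_common(1) returns [(answer, count)]; ties go to the first-seen answer.
    answer, _ = counts.most_common(1)[0]
    return answer

# Final answers extracted from five hypothetical reasoning paths.
sampled = ["42", "41", "42", "42", "17"]
print(self_consistency(sampled))
```

The vote only helps when the sampled paths differ in useful ways, which is why diversity across paths matters.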

LLMs are generally optimized for reasoning through supervised fine-tuning (SFT), in which each training example is labeled with a single, human-verified reasoning trace. Given the usefulness of parallel reasoning for evaluation, the question naturally arises: Can we expand the limits of LLMs’ reasoning capacities by training them on diverse reasoning traces for each question? In a paper we presented at this year’s International Conference on Learning Representations (ICLR), we propose a method for doing just that, one that avoids some previously identified pitfalls of parallel reasoning.

generate reasoning traces.png — For each question, we gather multiple reasoning traces from different models and sources, capturing diverse solution strategies that serve as supervision for parallel reasoning.

To prompt a single LLM to adopt different reasoning strategies, we introduce a set of global forking tokens (such as <think1> through <think6> in the figure below) in the post-training phase, each intended to elicit a distinct reasoning mode. These tokens enable the model to generate diverse, high-quality reasoning paths for the same problem.

naive sft.png — Under naïve SFT, different tokens fail to specialize: they achieve similar accuracy (top) and exhibit comparable reasoning effort (bottom), indicating mode collapse.

However, naïve post-training strategies such as SFT can lead to mode collapse, where different reasoning tokens produce nearly identical behaviors. To address this, we propose set-supervised fine-tuning (SSFT), a simple and principled training approach that enables models to learn multiple distinct reasoning strategies from diverse supervision. Instead of representing reasoning with a single trace, SSFT models it as a set of complete solution paths that arrive at the same answer through different strategies.

To further teach the model which reasoning strategy to adopt in which contexts, we introduce a reinforcement learning paradigm we call global forking policy optimization (GFPO). Between these two techniques, we observe gains of 5% to 7% in single-shot accuracy on standard benchmarks, indicating that improved reasoning-mode selection directly translates to better end-to-end performance.

Supervised fine-tuning

In practice, multiple reasoning traces for the same question can be obtained by prompting multiple teacher models, sampling alternative reasoning paths from a single model, or aggregating solutions from heterogeneous sources.

SSFT pairs each such trace with a dedicated forking token (e.g., <think1> through <think6>), where each token indicates a different reasoning mode. During training, a bipartite matching step assigns traces to tokens for each question, encouraging the model to learn distinct behaviors rather than collapsing to a single pattern. The training objective sums the next-token prediction (NTP) losses across all matched pairs, evaluating each reasoning trace conditioned on its assigned control token.

sft and ssft.png — Standard SFT uses fixed or random matching *(left)* and is order dependent, while SSFT *(right)* uses min-cost bipartite matching to achieve order-invariant training.
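The matching step described above can be sketched in a few lines. The example below assumes a precomputed cost matrix in which cost[t][k] is the NTP loss of trace t conditioned on forking token k; the function name and the numbers are hypothetical, and the exhaustive search stands in for whatever min-cost matching routine the actual training pipeline uses.

```python
from itertools import permutations

def min_cost_matching(cost):
    """Assign each trace to a distinct forking token at minimum total loss.

    cost[t][k] is the NTP loss of trace t conditioned on forking token k.
    Returns (assignment, total), where assignment[t] is the token for trace t.
    """
    num_traces = len(cost)
    num_tokens = len(cost[0])
    if num_traces > num_tokens:
        raise ValueError("need at least as many forking tokens as traces")
    best_assignment, best_total = None, float("inf")
    # Exhaustive search is fine for a handful of tokens (6! = 720 candidates).
    for perm in permutations(range(num_tokens), num_traces):
        total = sum(cost[t][k] for t, k in enumerate(perm))
        if total < best_total:
            best_assignment, best_total = perm, total
    return list(best_assignment), best_total

# Hypothetical losses for three traces and three forking tokens.
cost = [
    [2.0, 0.5, 1.5],
    [0.4, 1.9, 1.2],
    [1.1, 1.3, 0.3],
]
assignment, set_loss = min_cost_matching(cost)
print(assignment, round(set_loss, 3))
```

Because the search considers every pairing, reordering the traces changes only the order of the returned assignment, not the summed loss. For larger token sets, a Hungarian-style solver such as SciPy's linear_sum_assignment would be a more practical choice than brute force.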

As a result, each forking token is specialized to a distinct reasoning strategy, and the model produces more diverse solutions, as measured by pass@k (the probability that at least one of k generated answers is correct), while maintaining strong single-shot accuracy (pass@1).
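For readers who want to compute pass@k themselves, the sketch below uses the combinatorial estimator that is widely used in code-generation evaluations: given n samples of which c are correct, it estimates the chance that a random subset of k samples contains at least one correct answer. Whether the reported results use this exact estimator is an assumption, and the function name is illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate pass@k from n samples, c of which are correct."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) counts the size-k subsets with no correct sample;
    # it is 0 when fewer than k samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 3 correct answers out of 10 samples:
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

With k = 1 the estimate reduces to the fraction of correct samples, which matches the single-shot accuracy reported as pass@1.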

Reinforcement learning

While supervised training encourages the model to learn diverse reasoning strategies, it does not explicitly teach the model which strategy to use for a given question. Choosing the right reasoning mode is inherently a decision problem, making it a natural fit for reinforcement learning.

We address this with global forking policy optimization (GFPO), a lightweight reinforcement learning approach that learns to select the most effective reasoning mode for each input. For a given question x, the model samples a global forking token from a distribution over the control tokens (the <think i> tokens).

The model then produces an answer conditioned on the sampled token, and the output is verified to obtain a reward signal (e.g., correct or incorrect). These rewards are converted into advantages, which are used to update the policy over forking tokens. Importantly, the generated reasoning traces are treated as rollouts: they are detached from the gradient computation and used only for computing rewards, not for direct optimization.

By focusing optimization on the forking-token distribution, GFPO avoids the complexity of token-level reinforcement learning while still capturing the key decision — selecting the right reasoning mode upfront. This makes training both efficient and stable, while directly improving end-to-end performance.

gfpo pipeline.png — GFPO pipeline. The model samples a reasoning mode from the distribution π(*<think i>* | x), generates answers, and receives rewards from verification. Advantages derived from these rewards update only the forking-token distribution, while the generated reasoning traces are used solely for evaluation.
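The update loop described above can be approximated with a toy example. The sketch below makes several simplifying assumptions that are not taken from the paper: the policy is a plain list of logits for a single question rather than an LLM's output distribution, advantages are rewards minus the mean reward of the sampled group, and the verifier is a stub in which only one reasoning mode succeeds. All function names are hypothetical.

```python
import math
import random

def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gfpo_step(logits, reward_fn, rng, num_samples=8, lr=0.5):
    """One policy-gradient update of the forking-token logits for one question."""
    probs = softmax(logits)
    # Sample forking-token indices from the current policy pi(<think i> | x).
    tokens = rng.choices(range(len(logits)), weights=probs, k=num_samples)
    # Rollout and verification happen inside reward_fn; no gradient flows through it.
    rewards = [reward_fn(t) for t in tokens]
    baseline = sum(rewards) / num_samples  # assumed group-mean baseline
    grad = [0.0] * len(logits)
    for token, reward in zip(tokens, rewards):
        advantage = reward - baseline
        for j in range(len(logits)):
            # d log pi(token) / d logit_j = 1[j == token] - pi(j)
            grad[j] += advantage * ((1.0 if j == token else 0.0) - probs[j])
    return [z + lr * g / num_samples for z, g in zip(logits, grad)]

def toy_verifier(token):
    """Hypothetical verifier: only reasoning mode 2 solves this question."""
    return 1.0 if token == 2 else 0.0

rng = random.Random(0)
logits = [0.0, 0.0, 0.0, 0.0]  # four forking tokens, uniform at the start
for _ in range(200):
    logits = gfpo_step(logits, toy_verifier, rng)
print([round(p, 3) for p in softmax(logits)])
```

In this toy setting, probability mass shifts toward the one mode that earns a reward, while the rollouts themselves contribute nothing but the reward values. When every sampled token receives the same reward, all advantages are zero and the logits stay unchanged.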

Together, SSFT and GFPO enable models to both learn diverse reasoning strategies and select the right one at inference time.

Evaluation

We evaluate SSFT+GFPO on both reasoning and coding benchmarks along two axes: (i) accuracy and (ii) diversity of reasoning. Across all settings, SSFT+GFPO consistently outperforms standard pipelines, such as SFT+GRPO.

Benchmark (pass@1)	AIME 2025	AIME 2024	LiveCodeBench-v5
Accuracy	58.80%	64.22%	52.07%
Gain over baseline	+6.84 vs. SFT+GRPO	+5.37 vs. SFT+GRPO	+4.94 vs. SFT

Beyond accuracy, a key goal of SSFT is to address mode collapse. SSFT explicitly encourages specialization, allowing different tokens to represent distinct reasoning strategies. This leads to two important effects. First, each global forking token consistently triggers a distinct reasoning pattern. Second, this diversity improves pass@k without compromising pass@1. This contrasts with temperature-based sampling, where increasing diversity typically comes at the cost of accuracy.
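One intuition for the temperature trade-off mentioned above is that raising the temperature flattens the next-token distribution, moving probability away from the model's top-ranked choice. The sketch below shows this with hypothetical logits; it illustrates temperature scaling in general and is not a reproduction of the experiments reported here.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by a positive temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate next tokens.
logits = [4.0, 2.0, 1.0]
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

As the temperature rises, the top token's share shrinks and the lower-ranked tokens gain, so samples become more varied but also more likely to leave the model's preferred path.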

diverse reasoning effort.png — Different global forking tokens produce distinct reasoning behaviors, demonstrating specialization across modes.

SSFT improves pass@k while preserving pass@1 accuracy on AIME 2025, a challenging math reasoning benchmark.png — SSFT improves pass@k while preserving pass@1 accuracy on AIME 2025, a challenging math reasoning benchmark.

Below, we present a qualitative example illustrating our approach on a representative problem from the AIME 2025 benchmark, a challenging math reasoning dataset. The same question is solved using multiple qualitatively distinct strategies — such as algebraic manipulation, geometric reasoning, and case-based analysis — depending on the selected global forking token.

Multiple distinct solution strategies for the same problem, each induced by a different global forking token.png — Multiple distinct solution strategies for the same problem, each induced by a different global forking token

Code and models

Ready to try SSFT and GFPO on your own tasks? We've open-sourced the training pipeline, evaluation harnesses, and model weights on GitHub and Hugging Face.

About the authors

Sheng Jia

Sheng Jia is an applied scientist at Amazon Web Services (AWS).

Xiao Wang

Xiao Wang is an applied scientist at Amazon Web Services (AWS).

Shiva Kasiviswanathan

Shiva Prasad Kasiviswanathan is an applied scientist in Amazon’s Computer Vision-Machine Learning organization.
