Chain-of-thought (CoT) reasoning, in which a large language model (LLM) is asked not only to carry out a multistep task but also to explain the reasoning behind each step it takes, has been shown to improve LLMs' reasoning capability. One promising application of CoT reasoning is ensuring that LLMs adhere to responsible-AI policies.
Using CoT to optimize an LLM for policy adherence requires high-quality training data annotated with chains of thought. But hiring human annotators to generate such training data is expensive and time-consuming.
Inspired by current work on incorporating artificial experts into the standard LLM training pipeline, researchers in Amazon’s Artificial General Intelligence organization have begun exploring the possibility of using ensembles of AI agents to generate high-quality CoT data. We report the results of our initial experiments in a paper we presented at this year’s meeting of the Association for Computational Linguistics (ACL).
Using two different LLMs and five different datasets, we compared models fine-tuned on data created through our multiagent-deliberation approach with both baseline pretrained models and models fine-tuned through conventional supervised fine-tuning on the original response data.
Our approach achieved an increase in average safety (across in-domain, out-of-domain, and jailbreak settings) of 96% relative to the baseline and 73% relative to the conventionally fine-tuned model when using a non-safety-trained model (Mixtral). The increases were 12% and 44%, respectively, for a safety-trained model (Qwen).
Multiagent deliberation
Our approach divides the task of generating policy-compliant chains of thought into three stages, each of which uses LLMs: intent decomposition, deliberation, and refinement.
During intent decomposition, an LLM receives the user query and identifies explicit and implicit user intents. These, together with the query, are then passed to another LLM, which generates an initial CoT.
Deliberation is an iterative process in which multiple LLMs (agents) expand the CoT in sequential fashion, factoring in a defined set of policies. Each agent is prompted to review and correct the version of the CoT it receives — or to confirm that it’s good as is. This stage ends when an agent judges the CoT complete or when a predefined deliberation budget is exhausted.
Finally, in the refinement stage, an LLM takes the outputs of the deliberation stage and post-processes them to filter out redundant, deceptive, and policy-inconsistent thoughts.
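To make the pipeline concrete, here is a minimal Python sketch of the three stages. The prompt wording, the `call_llm` placeholder, and the deliberation-budget value are illustrative assumptions rather than the exact prompts and implementation used in our experiments.

```python
# A minimal sketch of the three-stage CoT-generation pipeline. The prompts,
# the call_llm() placeholder, and the deliberation budget are illustrative
# assumptions, not the exact implementation from the paper.

DELIBERATION_BUDGET = 5  # assumed maximum number of deliberation passes


def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call to an LLM agent."""
    raise NotImplementedError


def intent_decomposition(query: str) -> str:
    return call_llm(
        "Identify the explicit and implicit intents behind this user query:\n"
        + query
    )


def initial_cot(query: str, intents: str) -> str:
    return call_llm(
        f"Query: {query}\nIdentified intents: {intents}\n"
        "Write an initial chain of thought for responding to the query."
    )


def deliberate(cot: str, query: str, policies: list[str]) -> str:
    """Agents sequentially review and correct the CoT until one judges it
    complete or the deliberation budget is exhausted."""
    for _ in range(DELIBERATION_BUDGET):
        verdict = call_llm(
            "Policies:\n" + "\n".join(policies)
            + f"\nQuery: {query}\nCurrent chain of thought:\n{cot}\n"
            "Review the chain of thought for policy compliance. If it is "
            "complete and compliant, reply COMPLETE; otherwise return a "
            "corrected chain of thought."
        )
        if verdict.strip() == "COMPLETE":
            break
        cot = verdict
    return cot


def refine(cot: str, policies: list[str]) -> str:
    return call_llm(
        "Remove redundant, deceptive, or policy-inconsistent thoughts from "
        "this chain of thought, given these policies:\n"
        + "\n".join(policies) + "\n\n" + cot
    )


def generate_policy_compliant_cot(query: str, policies: list[str]) -> str:
    intents = intent_decomposition(query)
    cot = initial_cot(query, intents)
    cot = deliberate(cot, query, policies)
    return refine(cot, policies)
```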

Evaluation
Following prior work, we analyze the quality of the generated CoTs by measuring three fine-grained attributes: (1) relevance, (2) coherence, and (3) completeness. Each attribute is evaluated on a scale from 1 to 5, where 1 represents the lowest quality and 5 represents the highest. As test data, we use examples from several standard CoT benchmark datasets.
We also assess faithfulness along three dimensions: (1) faithfulness between policy and the generated CoT; (2) faithfulness between policy and the generated response; and (3) faithfulness between the generated CoT and the final response. We use an LLM fine-tuned as an auto-grader to evaluate faithfulness on a scale from 1 to 5, where 1 indicates minimal faithfulness, and 5 indicates complete adherence.
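As a rough illustration of this kind of LLM-based grading, the sketch below prompts a judge model for a 1-to-5 faithfulness score and parses the result; the prompt wording and the `call_llm` placeholder are assumptions, not the fine-tuned auto-grader itself.

```python
import re


def call_llm(prompt: str) -> str:
    """Placeholder for a call to the auto-grader LLM."""
    raise NotImplementedError


def grade_faithfulness(policy: str, cot: str) -> int:
    """Ask the grader for a 1-5 faithfulness score and parse it."""
    reply = call_llm(
        "On a scale from 1 (minimal faithfulness) to 5 (complete adherence), "
        "rate how faithfully the following chain of thought follows the policy. "
        f"Answer with a single integer.\n\nPolicy:\n{policy}\n\n"
        f"Chain of thought:\n{cot}"
    )
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"Could not parse a 1-5 score from: {reply!r}")
    return int(match.group())
```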
As can be seen in the table below, using our framework provides quality improvements across all metrics, with an improvement of more than 10% in CoTs’ policy faithfulness.
Average auto-grader scores on the generated CoTs (1-5 scale), including general-reasoning metrics that evaluate CoT quality and faithfulness metrics that evaluate policy adherence. AIDSAFE is our multiagent-deliberation framework, LLM_ZS is the baseline, and delta is the relative improvement of AIDSAFE over the baseline.
| Metric | LLM_ZS | AIDSAFE | Delta |
| --- | --- | --- | --- |
| Relevance | 4.66 | 4.68 | 0.43% |
| Coherence | 4.93 | 4.96 | 0.61% |
| Completeness | 4.86 | 4.92 | 1.23% |
| CoTs' faithfulness (policy) | 3.85 | 4.27 | 10.91% |
| Response faithfulness (policy) | 4.85 | 4.91 | 1.24% |
| Response faithfulness (CoT) | 4.99 | 5.00 | 0.20% |
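The delta column is the relative improvement of AIDSAFE over LLM_ZS; for CoTs' policy faithfulness, for example, (4.27 - 3.85) / 3.85 ≈ 10.91%. In code:

```python
def relative_delta(baseline: float, ours: float) -> float:
    """Relative improvement of our score over the baseline, in percent."""
    return (ours - baseline) / baseline * 100


print(f"{relative_delta(3.85, 4.27):.2f}%")  # 10.91%, CoTs' policy faithfulness
```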
Fine-tuning
We use several benchmarks to measure the performance improvements provided by our generated CoT data: Beavertails (for safety), WildChat, XSTest (for overrefusal, or erroneously flagging safe generations as unsafe), MMLU (for utility), and StrongREJECT (for jailbreak robustness).
We used two different LLMs in our tests, the widely used open-source models Qwen and Mixtral. The base versions of these models provide one baseline, and we add another by fine-tuning the models with only the prompts and responses from the original dataset, not the generated CoTs. Our method shows significant improvements over both baselines, particularly on safety and jailbreak robustness, with some trade-offs on utility and overrefusal.
Below are the results of evaluation of the supervised fine-tuned (SFT) model. "Base" denotes the LLM without SFT, SFT_OG denotes the model SFT’d on the original response data without any CoTs, and SFT_DB denotes the model SFT’d on our generated CoTs and responses. (If the full table doesn't fit on your browser, try scrolling right.)
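For reference, a safe-response-rate metric of the kind reported below can be computed by generating a response for each benchmark prompt and counting those a safety judge accepts. In the sketch below, `model_respond` and `is_safe` are placeholders (assumptions), not our actual evaluation harness.

```python
from typing import Callable, Iterable


def safe_response_rate(
    prompts: Iterable[str],
    model_respond: Callable[[str], str],   # model under test (placeholder)
    is_safe: Callable[[str, str], bool],   # safety judge (placeholder)
) -> float:
    """Percentage of prompts whose responses the judge deems safe."""
    prompts = list(prompts)
    safe_count = sum(is_safe(p, model_respond(p)) for p in prompts)
    return 100.0 * safe_count / len(prompts)
```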
LLM: Mixtral
| Eval dimension | Metric | Dataset | Base | SFT_OG | SFT_DB (ours) |
| --- | --- | --- | --- | --- | --- |
| Safety | Safe response rate (%) | Beavertails | 76 | 79.57 | 96 |
| Safety | Safe response rate (%) | WildChat | 31 | 33.5 | 85.95 |
| Overrefusal | 1-Overrefuse rate (%) | XSTest | 98.8 | 87.6 | 91.84 |
| Utility | Answer accuracy (%) | MMLU | 35.42 | 31.38 | 34.51 |
| Jailbreak Robustness | Safe response rate (%) | StrongREJECT | 51.09 | 67.01 | 94.04 |
LLM: Qwen
| Eval dimension | Metric | Dataset | Base | SFT_OG | SFT_DB (ours) |
| --- | --- | --- | --- | --- | --- |
| Safety | Safe response rate (%) | Beavertails | 94.14 | 87.95 | 97 |
| Safety | Safe response rate (%) | WildChat | 95.5 | 59.42 | 96.5 |
| Overrefusal | 1-Overrefuse rate (%) | XSTest | 99.2 | 98 | 93.6 |
| Utility | Answer accuracy (%) | MMLU | 75.78 | 55.73 | 60.52 |
| Jailbreak Robustness | Safe response rate (%) | StrongREJECT | 72.84 | 59.48 | 95.39 |
Acknowledgements: We would like to acknowledge our coauthors and collaborators, Kai-Wei Chang, Ninareh Mehrabi, Anil Ramakrishna, Xinyan Zhao, Aram Galstyan, Richard Zemel, and Rahul Gupta, for their contributions.