In today's rapidly evolving AI landscape, organizations increasingly need AI agents that excel in specific domains and business environments. While general-purpose AI systems demonstrate impressive capabilities across broad tasks, they often fall short when deployed in specialized contexts that require deep understanding of particular workflows, tools, and organizational needs.
In recent work, we at Amazon Web Services' AI Labs have been investigating how to efficiently adapt general-purpose agents to specific domains without requiring extensive machine-learning expertise or prohibitive computational resources. Through systematic experimentation across two distinct use cases — personal-assistant agents and agentic retrieval-augmented generation (RAG) — we've demonstrated that reinforcement-learning-based customization can significantly boost task success rates, even with relatively small amounts of training data.
Experimental framework and assumptions
Consider a customer service agent that needs to navigate complex internal systems, understand company-specific policies, and maintain consistent brand voice across thousands of interactions. Or imagine a coding assistant that must adapt to a particular organization's coding standards, architectural patterns, and development workflows. These scenarios demand more than off-the-shelf AI solutions: they require agents that can be systematically customized and optimized for their intended environments. Our work explores the use of reinforcement learning (RL) to customize such agents.
To establish a practical foundation for our experiments, we made several simplifying assumptions. We focused primarily on asynchronous multiturn agents that can autonomously complete tasks using tools, with results verifiable against ground truth. This approach reduces our dependency on simulated users while maintaining a framework applicable to many scenarios.
Additionally, we leveraged existing environment and tool simulators from public benchmark datasets and agents, allowing us to focus on the core RL methodology rather than building simulation infrastructure from scratch. For reward signals, we rely on verifiable feedback available directly from the environment, such as task completion rates, code execution success, or information retrieval accuracy. These constraints provide the minimal conditions needed to begin our experimentation while keeping our scenarios realistic.
Experimental design
For our experiments involving a personal-assistant agent, we used the AppWorld benchmark, in which agents complete day-to-day activities through interactions with phone apps. For the agentic-RAG experiments, we implemented a DeepSearch Agent for intelligent information retrieval and synthesis, using two different datasets. For the reward functions, we relied on verifiable environment-based feedback for AppWorld and on exact-match and semantic-accuracy scores for the RAG tasks.
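To make the RAG reward signal concrete, here is a minimal sketch of an exact-match reward of the kind described above. The normalization rules follow standard open-domain QA practice and are an illustrative assumption rather than our exact implementation; the semantic-accuracy scorer is omitted.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace (standard EM normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_reward(prediction: str, ground_truths: list[str]) -> float:
    """Return 1.0 if the normalized prediction matches any normalized gold answer, else 0.0."""
    pred = normalize(prediction)
    return float(any(pred == normalize(gt) for gt in ground_truths))


# Example: the agent's final answer is checked against the dataset's gold answers.
print(exact_match_reward("The Eiffel Tower", ["Eiffel Tower"]))  # 1.0
```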
Our RL training framework has two main components: an online simulator and an online RL trainer. The online simulator takes a batch of tasks and produces a batch of rollout trajectories — sequences of interactions between the agent and its environment, often involving dozens of API calls. It also produces a reward for each trajectory by running checks against ground truth.
The online RL trainer takes the rollout trajectories and rewards from the online simulator and uses them to update the actor policy. Internally, the online RL trainer has components such as an actor, a critic (which, in proximal policy optimization, estimates how much weight each training example should be given during policy updates), and a reference model. After the actor policy is updated in the online RL trainer, the weights of the actor model are synced to the agent in the online simulator.
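Schematically, the data flow between these two components can be sketched as below. The `OnlineSimulator` and `OnlineRLTrainer` classes are hypothetical stand-ins for whatever rollout engine and PPO-style trainer are actually used; only the loop structure (tasks in, trajectories and rewards out, updated actor weights synced back to the simulator) reflects the description above.

```python
import random
from dataclasses import dataclass


@dataclass
class Trajectory:
    """A rollout: the sequence of agent/environment turns plus one scalar reward."""
    turns: list
    reward: float


class OnlineSimulator:
    """Hypothetical stand-in for the environment simulator (e.g., AppWorld rollouts)."""

    def __init__(self, actor_weights):
        self.actor_weights = actor_weights  # agent policy used to generate rollouts

    def rollout(self, task_ids):
        # In practice this runs multi-turn tool-use episodes and checks the results
        # against ground truth; here we fabricate trajectories for illustration.
        return [Trajectory(turns=[f"steps for {t}"], reward=random.random()) for t in task_ids]

    def sync_weights(self, new_weights):
        self.actor_weights = new_weights


class OnlineRLTrainer:
    """Hypothetical stand-in for a trainer with actor, critic, and reference model."""

    def __init__(self):
        self.actor_weights = {"version": 0}

    def update(self, trajectories):
        # A real trainer computes advantages with the critic, applies the PPO
        # clipped objective, and regularizes against the reference model.
        mean_reward = sum(t.reward for t in trajectories) / len(trajectories)
        self.actor_weights = {"version": self.actor_weights["version"] + 1}
        return mean_reward


trainer = OnlineRLTrainer()
simulator = OnlineSimulator(trainer.actor_weights)

for step in range(3):  # each step: simulate a batch, update the policy, sync weights
    batch = simulator.rollout(task_ids=[f"task_{i}" for i in range(8)])
    mean_reward = trainer.update(batch)
    simulator.sync_weights(trainer.actor_weights)
    print(f"step {step}: mean reward {mean_reward:.3f}")
```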
RL-based training pipeline
Let’s take a closer look at the RL pipeline, using the AppWorld experiments as an example. First, the simulator runs parallel simulations of interactions between agents and the AppWorld environment, based on the provided task IDs, and produces a batch of rollout trajectories. We’ll consider one such trajectory, which demonstrates how an agent systematically decomposes a high-level instruction — "add date prefixes to files and move non-current year files to recycle bin" — into a sequence of 32 discrete API calls across multiple applications and reasoning steps.
The agent begins by authenticating with the file system using supervisor-provided credentials, then methodically explores available APIs through introspection calls. Each step involves explicit reasoning about the next action, error handling when APIs don't conform to expectations (as when the agent finds no "rename_file" function and adapts, using "move_file" instead), and maintaining state across multiple file operations.
The trajectory showcases the agent's ability to handle complex parsing of dates and times, iterate through file collections, and coordinate operations across different directory structures while maintaining data integrity. Critically, the environment provides verifiable information about whether the task execution is successful, enabling the reinforcement learning framework to learn through concrete, measurable outcomes, rather than requiring human evaluation at every step. Moreover, rewards are collected only at the last turn, and this sparse reward collection provides a significant performance advantage over similar methods.
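To illustrate what sparse reward collection means in practice, the sketch below assigns the single environment-verified outcome to the final turn of a trajectory and zero to every intermediate turn; the helper name and turn count are hypothetical.

```python
def turn_level_rewards(num_turns: int, final_reward: float) -> list[float]:
    """Sparse reward: every intermediate turn gets 0; only the last turn carries
    the environment's verdict (e.g., 1.0 if AppWorld marks the task as completed)."""
    return [0.0] * (num_turns - 1) + [final_reward]


# A 32-step trajectory whose final environment check succeeded:
rewards = turn_level_rewards(num_turns=32, final_reward=1.0)
assert rewards[-1] == 1.0 and sum(rewards) == 1.0
```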
Results and insights
The consolidated table below shows that reinforcement learning can significantly boost agent performance across diverse use cases, even when relatively small training datasets are applied to relatively small models.
| Use case | Dataset | Base model | Base model performance | RL-trained performance | Metric |
|---|---|---|---|---|---|
| Personal-assistant agent | AppWorld | Qwen2.5-32B-Instruct | 39.20% | 72% (vs. Sonnet 3.7/4.0 ~69%) | Task goal completion |
| Agentic RAG | NQ | Qwen2.5-3B-Base | 0.106 | 0.406 | Exact match |
| Agentic RAG | Musique | Llama-3.2-3B-Instruct | 0.04 | 0.1 | Exact match |
Here are a few of our experimental findings:
- Larger base models demonstrate greater gains from RL training in absolute performance. This likely stems from their generating higher-quality rollouts during training, creating a positive feedback loop that enhances the RL process.
- Applying online RL customization to increasingly capable base models may unlock performance exceeding the benchmarks established by current proprietary models, which are often several times as large or complex as the base models.
- Achieving near-proprietary-model performance with small-scale RL training (72 examples in AppWorld) at 1% to 2% of the cost demonstrates a fundamental shift in the economics of model customization. In some cases, online RL shows immediate effectiveness from the first training step, with rapid progression to competitive performance within 30 steps.
- RL training also induces specific behavioral improvements that may be useful, such as always checking API documentation before writing code, which leads to reduced code errors. Models also maintain robust semantic understanding across prompt variations even when exact-match scores decline, indicating genuine comprehension rather than pattern matching.
- In our experiments, smaller models face fundamental reasoning limitations (inability to recognize unanswerable questions or extract answers from relevant context) that RL alone cannot overcome. For constrained models, targeted distillation from more capable models may be more effective than scaling RL training.
Based on these findings, we recommend investing in online RL as a method for agent customization across assistant agents and other use cases such as coding agents. However, several critical factors emerged that warrant careful attention in deployment: data quality and format correctness proved essential at every stage of the pipeline; larger base models demonstrated disproportionate benefits from RL training; and strategic task selection — prioritizing harder problems during training — enabled more efficient learning through asymmetric transfer to simpler tasks.
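One possible reading of "prioritizing harder problems" is to filter the training pool by the base model's measured success rate, as in the hypothetical sketch below; the threshold and data layout are illustrative assumptions, not our exact selection procedure.

```python
def select_hard_tasks(task_success_rates: dict[str, float], max_rate: float = 0.3) -> list[str]:
    """Keep tasks the base model rarely solves; easier tasks are assumed to benefit
    from asymmetric transfer once the harder ones are learned."""
    return [task for task, rate in task_success_rates.items() if rate <= max_rate]


pool = {"task_a": 0.9, "task_b": 0.2, "task_c": 0.05}
print(select_hard_tasks(pool))  # ['task_b', 'task_c']
```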
Looking ahead, our research roadmap focuses on two primary directions. The first is expanding the applicability of our approach through synthetic-data generation and adaptive data filtering to improve training efficiency. The second is deepening our understanding of RL algorithms through more thorough comparisons across model families, reward signal exploration beyond outcome-based metrics, and pipeline optimizations. These investigations aim to make RL-based agent customization more accessible, efficient, and effective for organizations seeking to deploy AI agents that truly excel in their specific operational contexts.
Our latest research papers — “SALT: Step-level advantage assignment for long-horizon agents via trajectory graph” and “Reinforcement learning for self-improving agent with skill library” — demonstrate further advances in agent RL algorithms, via fine-grained advantage assignment and reward shaping for agent skill learning, and point to the huge potential in this area.
Acknowledgments: Lin Lee Cheong