Ask an AI developer what an agent might do for you, and the answer often sounds like a travel brochure: book your flights, find you a hotel, plan your summer vacation. It's a charming image — an invisible concierge effortlessly stitching together an itinerary while you sip a coffee.
But inside Amazon, researchers know that a million small things must work before big things can happen. One example: before an AI can plan a vacation, it must learn to scroll.
Literally.
It must learn how to scroll … and click … and tab … and select a date that's hidden behind a pop-up … and recover when a form silently resets … and distinguish a calendar widget from a drop-down … and re-enter a field exactly once without overwriting another … and navigate a loyalty portal that hasn't been redesigned since 2004.
A single "book my summer vacation" command sets off hundreds of micro-interactions across travel services: airline reservation systems still running decades-old interfaces; hotel inventory tools with inconsistent usage patterns; credit card verification layers; loyalty programs; payment rails; mobile confirmations; and compliance checks buried behind browser-based forms. Every tiny action has to succeed — reliably, deterministically, every time — before the magical consumer moment is possible. This is the gap between the narrative of AI agents and the reality of building one.
At Amazon, the mundane details aren't an afterthought; they're the foundation. To work successfully in the real world, an agent must first master a set of atomic behaviors. Internally, we sometimes describe this as building "normcore agents": systems trained to be exceptionally good at the very simple, very boring interactions that underpin the reliable operation of real software.
Mastering those atomic behaviors requires a lot of practice, which is why Amazon's Artificial General Intelligence (AGI) Lab is building an ecosystem of high-fidelity reinforcement learning (RL) "gyms" where agents can hone their skills. Just as an athlete builds core stability by repeating fundamental movements under controlled conditions, an agent develops reliability by practicing the smallest units of interaction in repeatable, instrumented scenarios.
Designed to reflect the messiness of real web systems, a gym isolates a skill, varies it, stresses it, and measures it. The end result is an agentic substrate — a shared foundation of competence from which a fleet of agents can build domain-specific efficiencies in real-world applications: form completions that make an address usable for a delivery or reservation; drop-down selections that indicate whether a fare, benefit, or option applies; and multistep workflows that guarantee that a transaction reaches a valid, verifiable end state.
Today, the Amazon AGI Lab has built and trained agents in gyms spanning dozens of application domains and thousands of individual tasks, with more in development. These gyms don't just teach an agent how to book a vacation; they teach it how to survive the unpredictable terrain beneath the task. How to reason about web interfaces. How to detect and recover from errors. How to interact with legacy systems that humans tolerate but machines often misinterpret. To build an agent that can do anything humans do on a computer, our team has to teach it to handle the ambiguity humans navigate instinctively.
Reliability
If an agent's path to booking a summer vacation runs through hundreds of tiny, failure-prone steps, the autonomous cars that get us to the airport face an environment that's even less forgiving. So it's no accident that some of the engineers and researchers inside Amazon's AGI Lab come from the world of self-driving cars. They spent years in environments where "almost right" is indistinguishable from "unsafe," where a system that performs flawlessly one moment and fails silently the next is unfit for deployment. In autonomous vehicles, correctness isn't probabilistic; the system must be right every single time.
That mindset now shapes how our lab approaches agentic AI. Agents don't just produce outputs; they take actions inside live systems. They touch databases, initiate transactions, and modify system states. And when the output of a model is a real change in the world, reliability becomes non-negotiable.
To meet that standard, an agent must do something language models cannot: determine whether the system responded correctly to its action. That doesn't mean the agent inherently knows correctness; it means the training environment exposes enough ground truth — document object model (DOM) structure, UI timing, network behavior, backend state transitions — for the agent to compare what it attempted with what actually happened and escalate or defer to a human when the outcome is ambiguous or requires approval.
This is where formal verifiers come in. Each task inside a gym is anchored by a specification that defines exactly what successful completion looks like. It describes the required end state, the backend changes that are allowed to produce it, and the changes that must never occur. A workflow like "send an e-mail," for example, isn't declared successful just because a button appears to have been clicked; it's declared successful because exactly one new e-mail record exists in the database, and no unrelated records have been created, modified, or deleted.
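To make the idea concrete, here is a minimal sketch of what such a verifier could look like for the "send an e-mail" task. The snapshot format and record fields are illustrative assumptions, not an actual gym API; the point is that success is defined by comparing backend state before and after the episode.

```python
# Minimal sketch of a task verifier for "send an e-mail".
# Each snapshot is assumed to map record id -> row dict; the field names
# below are illustrative, not a real gym schema.

def verify_send_email(before: dict, after: dict, expected: dict) -> bool:
    """Pass only if exactly one new e-mail record exists, it matches the
    expected recipient and subject, and no other record changed."""
    new_ids = set(after) - set(before)
    if len(new_ids) != 1:
        return False                      # zero sends and duplicate sends both fail
    new_row = after[next(iter(new_ids))]
    if (new_row.get("recipient"), new_row.get("subject")) != (
        expected["recipient"], expected["subject"]
    ):
        return False
    # Forbidden changes: no pre-existing record may be modified or deleted.
    return all(after.get(rid) == row for rid, row in before.items())
```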
In our RL gyms, these verifiers are the basis of a scoring function. The agent receives a reward only when the environment reflects the precise changes permitted and none of the forbidden ones, providing a signal about what "right" means.
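In an RL loop, a verifier like the one above becomes a sparse, binary reward at the end of each episode. The sketch below assumes a hypothetical environment that exposes backend snapshots and a step interface; it is not a real gym API.

```python
# Sketch of turning a verifier into a reward signal. env.snapshot(),
# env.step(), and agent.act() are assumed interfaces for illustration.

def run_episode(env, agent, expected, max_steps=200):
    obs = env.reset()
    before = env.snapshot("emails")       # ground-truth backend state at the start
    for _ in range(max_steps):
        action = agent.act(obs)
        obs, done = env.step(action)
        if done:
            break
    after = env.snapshot("emails")
    # Reward only if the permitted change happened and no forbidden one did.
    return 1.0 if verify_send_email(before, after, expected) else 0.0
```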
Agents must satisfy these verifiers not once but thousands of times, under shifting timing, network, and UI conditions. This repeated exposure — within precisely engineered RL gyms that isolate skills, vary conditions, and enforce verifiable outcomes — converts isolated successes into durable competence. Only when an agent meets that standard of near-perfect reliability can it be trusted to run real workflows. And only then can it operate safely in production, where every action has consequences.
Normcore workouts
Look closely at any real-world workflow and you'll find a scattering of tiny tasks that have to be executed perfectly. These are the normcore workouts inside our RL gyms: concentrated practice routines where agents learn the small things that make the big things happen. Here are a few examples:
Workout 1: The calendar stability test
Building robustness against inconsistent UI components
In calendar applications, even selecting a date requires surprising coordination. Across the web, calendars behave in subtly different ways: elements shift under zoom, and widgets hide behind other UI layers or re-render mid-click. In RL gyms, these variations appear intentionally, teaching the agent to recognize a widget's current state, recover when it drifts, and commit the correct date exactly once — then verify that the resulting backend state is correct. This foundational skill applies to workflows everywhere, from travel bookings to scheduling tools to compliance applications.
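One way to picture the behavior this workout rewards is a select-and-verify loop: read the widget, recover if it has drifted, act exactly once, and let the backend have the final word. The helpers below (read_calendar, recover, click_day, booking_date) are hypothetical, named only to sketch the pattern.

```python
# Sketch of "commit the correct date exactly once, then verify".
# read_calendar(), recover(), click_day(), and booking_date() are
# hypothetical helpers, not a real gym API.

def select_date(page, backend, target_date, max_attempts=3):
    for _ in range(max_attempts):
        widget = read_calendar(page)            # current month, visible days, overlays
        if widget.obscured or widget.month != target_date.month:
            recover(page, widget, target_date)  # dismiss pop-ups, navigate months
            continue
        click_day(page, target_date.day)        # act once
        if read_calendar(page).selected == target_date:   # UI reflects the click
            break
    # Success is defined by the backend state, not the pixels on screen.
    return booking_date(backend) == target_date
```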
Workout 2: The dropdown discipline drill
Learning to distinguish UI appearance from system state
A dropdown menu might appear to have been updated before the backend has actually processed the change. This mismatch appears in enterprise applications, consumer portals, and government systems alike. Agents must confirm that the system — not just the UI — has registered the action. The drill builds discipline: trust the system state, not the surface.
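In practice, the discipline amounts to polling the backend after the UI action instead of trusting the rendered label. A rough sketch, with select_option and backend_value as assumed helpers:

```python
import time

# Sketch of "trust the system state, not the surface".
# select_option() and backend_value() are hypothetical helpers.

def set_dropdown(page, backend, field, option, timeout_s=5.0, poll_s=0.25):
    select_option(page, field, option)        # the UI may update immediately...
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if backend_value(backend, field) == option:   # ...but only this counts
            return True
        time.sleep(poll_s)
    return False   # backend never registered the change; retry or escalate
```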
Workout 3: The async endurance run
Maintaining coherence across long, timing-sensitive flows
Many workflows involve long chains of asynchronous steps — searching, filtering, validating, refreshing — each with different timing and failure modes. RL gyms break these flows into atomic segments: text fields that compete with autosuggest lists, modal windows that load out of order, backends that intermittently return errors, and pages that scaffold before they populate. The agent learns endurance — staying aligned with the true state of the system across dozens or hundreds of steps.
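A simplified way to express that endurance is a step runner that waits for the system to settle before moving on and retries transient failures, rather than firing actions on a fixed clock. The apply and is_settled callables here are assumptions for illustration, not part of any real gym.

```python
import time

# Sketch of a timing-tolerant runner for a long asynchronous workflow.
# Each step carries hypothetical apply() and is_settled() callables.

def wait_until(condition, timeout_s, poll_s=0.25):
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poll_s)
    return False

def run_workflow(steps, env, settle_timeout_s=10.0, max_retries=2):
    for step in steps:
        for attempt in range(max_retries + 1):
            step.apply(env)                       # e.g., search, filter, refresh
            if wait_until(lambda: step.is_settled(env), settle_timeout_s):
                break                             # true state caught up; next step
            if attempt == max_retries:
                raise RuntimeError(f"step {step.name} never settled")
        # proceed only from a verified, stable state
```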
Acknowledgments: Thanks to Deniz Birlikci, Gary Lim, and Annika Huston for their contributions.