Many of the world’s most important systems — the ones that move money, book flights, issue licenses, and process claims — are slow, brittle, and deeply outdated. Built decades ago and extended repeatedly, they now sit at the center of workflows too vital to pause, take offline, rebuild, or replace.
Inside Amazon’s Artificial General Intelligence (AGI) Lab, teams train agents not on idealized interfaces but on high-fidelity simulations of such legacy systems. Learning the real behaviors of these systems — the quirks, delays, error states, and invisible dependencies — makes possible a different kind of innovation, one that grows from the systems we have instead of requiring their replacement. And by managing the idiosyncrasies of legacy systems behind the scenes, the agent effectively becomes a universal API — a single interface that the customer can use to perform a wide range of special-purpose tasks.
The legacy systems that power everyday life
Step behind the scenes of any large institution — a bank, an insurer, a hospital, a state agency — and you’ll find the same thing: an invisible layer of human labor holding software together. People know which buttons must be clicked in which order, which warnings can be ignored, which fields must be entered twice, and which screens must never be refreshed. The institutional knowledge required to navigate these eccentricities is passed down from operator to operator, an oral tradition of legacy systems.
Much of the infrastructure beneath these workflows is older than the people managing it. The software backbone of modern finance, insurance, travel, scientific research, and public services took shape in the 1960s and ’70s, built on mainframe architectures and written in languages like COBOL and FORTRAN — designed for stability rather than adaptability.
When the web arrived, institutions didn’t rebuild. They wrapped. Web forms fed mainframe jobs, middleware translated modern inputs into decades-old formats, and enterprise portals accumulated atop business rules that were never rewritten. Over time, modernization settled into layers: a mainframe instruction set at the bottom; a 1990s database above it; a 2000s portal above that; and a modern web interface masking everything beneath. A single transaction today might pass through all these layers — scripts, connectors, and integrations holding them together in ways no one fully understands.
Attempts to replace these systems routinely stall. Dependencies surface no one knew existed, migrations fail, budgets spiral, and public-sector modernization efforts collapse under their own complexity. These systems cannot be taken offline, which means institutions must keep operating them no matter how brittle they become. For Amazon, this is one of the most compelling applications of agentic AI — navigating not the polished surfaces of web-era consumer apps but the deep, slow-moving architectures that keep institutions running.
Learning the bad to heal the bad
The hardest part of training an AI agent is not teaching it what a successful workflow looks like; it’s teaching it why workflows fail. The logic behind legacy systems reveals itself most clearly through friction: the modal (blocking) window that appears late because it encodes a sequencing rule; the field that refuses input until another value is saved; the form that resets because a backend job restarted midflow. These behaviors aren’t glitches. They are the real semantics of the system.
Researchers at Amazon’s AGI Lab seek this friction out. To surface failure modes safely and repeatedly, Amazon trains agents inside reinforcement learning (RL) gyms — synthetic environments designed to reproduce the quirks, delays, and ordering rules embedded in real workflows. These include synthetic web environments that simulate systems like state agencies, airline bookings, and specialized tax- and benefits-processing systems, among hundreds of others.
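To make the idea concrete, here is a minimal, hypothetical sketch of what such a gym environment could look like. Nothing here reflects Amazon’s actual training code; the class name, actions, and quirks are invented for illustration. It encodes the three frictions described above as an RL task: an ordering rule (the amount field rejects input until the account is saved), a late-appearing modal that blocks everything else, and a backend restart that wipes the form midflow.

```python
import random

class LegacyFormEnv:
    """Toy RL gym reproducing legacy-UI quirks (hypothetical example):
    an ordering rule, a late blocking modal, and random backend resets."""

    ACTIONS = ("save_account", "enter_amount", "dismiss_modal", "submit")

    def __init__(self, reset_prob=0.05, seed=0):
        self.reset_prob = reset_prob  # chance a backend job restarts midflow
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        self.account_saved = False
        self.amount_entered = False
        self.modal_open = False
        return self.observe()

    def observe(self):
        return {"account_saved": self.account_saved,
                "amount_entered": self.amount_entered,
                "modal_open": self.modal_open}

    def step(self, action):
        # Quirk 3: a backend job may restart and wipe unsaved progress.
        if self.rng.random() < self.reset_prob:
            return self.reset(), -1.0, False

        if self.modal_open:
            # Quirk 2: the modal blocks every action except dismissing it.
            if action == "dismiss_modal":
                self.modal_open = False
            return self.observe(), -0.1, False

        if action == "save_account":
            self.account_saved = True
            self.modal_open = True  # the modal appears only *after* the save
        elif action == "enter_amount":
            # Quirk 1: the field refuses input until the account is saved.
            if self.account_saved:
                self.amount_entered = True
        elif action == "submit" and self.account_saved and self.amount_entered:
            return self.observe(), 1.0, True  # workflow completed

        return self.observe(), -0.1, False
```

An agent trained in an environment like this cannot succeed by memorizing a click sequence alone; it has to infer the sequencing rules, because a restart can force it to recover and replay the flow.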
Jason Laster, an AGI software engineer who works on agent training and replay systems, puts it plainly: “I want to push our RL training gyms to have all of the warts, all of the issues.”
This is what “learning the bad to heal the bad” means: training an agent on the full spectrum of a system’s true behavior, including flaws, inconsistencies, delays, and all the embedded histories humans have quietly adapted to. By exposing agents to the same brokenness people navigate every day, Amazon trains them to move beyond surface correctness and understand the deeper logic beneath the interface.
Agents as a new interface layer
Once an agent can reliably navigate the idiosyncrasies of legacy interfaces, something more interesting begins to happen. Researchers have observed agents inferring not just what to click next but why — the latent workflow the interface expresses. In one simulated benefits application environment, an agent that realized it had added only one dependent was able to navigate back, correct the omission, and resume the flow without restarting — an early sign of understanding the nature of the system.
For lab members, this marks an architectural turning point. Many institutional systems simply don’t expose APIs that reflect how real workflows behave; the only faithful expression of the logic is the interface itself. As Laster puts it, “the UI was designed to be discoverable, learnable — even if it’s bad.” When agents learn that layer deeply enough to predict outcomes and recover from failures, they begin to function as a kind of synthetic API — a stable, programmatic surface over infrastructure that can’t be changed. That shift enables new architectural possibilities:
- Stable semantics over unstable UIs: Agents turn inconsistent behaviors — delays, re-renders, partial saves — into predictable patterns.
- Cross-system abstraction: Because the agent reasons about the workflow rather than the application, it can bridge systems never designed to interoperate.
- Incremental modernization: Institutions can update components gradually without breaking workflows; the agent absorbs transitional fragility.
- Preservation of institutional logic: Agents retain operational knowledge otherwise stored only in human memory — rules, sequences, dependencies no one has documented.
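A rough sketch can show what the “synthetic API” shift looks like in code. Both classes below are invented for illustration — the flaky UI is a stand-in for a brittle legacy interface, not any real system — but the structure is the point: the caller sees one stable method, while the agent layer absorbs the ordering rule, the spurious re-render, and the retry internally.

```python
class FlakyLegacyUI:
    """Stand-in for a brittle legacy interface (hypothetical): fields must
    be set in order, and the first save fails with a spurious re-render."""

    def __init__(self):
        self.saved = False
        self._save_attempts = 0
        self.fields = {}

    def set_field(self, name, value):
        # Ordering rule: the amount field is locked until account is set.
        if name == "amount" and "account" not in self.fields:
            raise RuntimeError("field locked")
        self.fields[name] = value

    def save(self):
        self._save_attempts += 1
        if self._save_attempts == 1:
            self.fields.clear()  # partial save: a re-render drops the input
            raise RuntimeError("stale form token")
        self.saved = True


class SyntheticAPI:
    """Agent-as-API: a stable programmatic surface over the unstable UI."""

    def __init__(self, ui):
        self.ui = ui

    def submit(self, account, amount, max_retries=3):
        for attempt in range(max_retries):
            try:
                self.ui.set_field("account", account)  # honor the ordering rule
                self.ui.set_field("amount", amount)
                self.ui.save()
                return {"status": "ok", "attempts": attempt + 1}
            except RuntimeError:
                continue  # recover and replay; never surface the quirk
        return {"status": "failed", "attempts": max_retries}
```

The caller writes `api.submit("A-123", 250)` and gets a clean result; the fact that the first attempt failed and was replayed stays inside the agent layer, which is exactly the stability property the bullet list above describes.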
This is not workflow automation. It is a new interface layer for the world’s oldest systems — an upgrade path that doesn’t require turning anything off.
The work ahead
Agentic AI will not eliminate the administrative tasks that structure daily life — booking vacations, renewing licenses, scheduling medical appointments — but it can make them more efficient by letting systems once too fragile to change finally evolve.
That fragility is becoming more acute. The programmers who built the institutional backbone of the 1960s and ’70s — COBOL batch jobs, FORTRAN routines, mainframe integrations — are retiring. Few younger developers learn these languages, and the knowledge embedded in those systems grows harder to access each year. Critical workflows now run atop software whose inner workings fewer and fewer people understand.
Agents offer a different form of continuity. By learning how these systems behave — not from lost documentation but from the systems themselves — they can preserve operational logic that would otherwise disappear. They can stabilize workflows sitting atop code no one can safely rewrite and carry forward institutional knowledge that would otherwise age out of the workforce.
In that sense, “the work ahead” is twofold. There is the technical work of building agents that can meet the reliability these environments demand. And there is the human work that becomes newly possible when people are no longer trapped inside brittle interfaces — work grounded in judgment, coordination, empathy, and design rather than memorizing which field must be entered twice.
Agents will not rebuild the foundations of our digital world. But they may rebuild something else: our notion that innovation comes only from replacement. By turning brittle systems into stable platforms, agents offer a new model of progress — one that grows from what already works.