The year 2026 marks a definitive shift in the AI landscape: we have moved from models that simply know to agents that do. Foundation models (FMs) — large Transformer models pretrained with massive datasets and fine-tuned for diverse downstream tasks — have moved far beyond chatbots, coding, and other digital applications. They are now used as the cognitive engines for AI agents in the physical world, where they plan, use tools, and execute multistep tasks across complex, digitally integrated environments, from warehouses and factories to transportation systems and hospitals.
At Amazon, you can see the transition to this new era of "physical AI" in the debut of Project Eluna, an agentic AI model designed to transform how Amazon fulfillment centers operate. To be useful in a high-stakes physical environment, however, an agent needs to be more than fluent in natural language; it needs to be grounded in physical laws and operational constraints.
In particular, we must overcome the challenge of hallucination, which, in virtual environments, takes the form of fabricated information — made-up citations, factual inaccuracies, and logical fallacies, all output with high levels of certainty. In a physical system, such hallucinations can lead to violations of reality, with detrimental consequences. For example, if an agent suggests a robotic path that ignores the momentum and mass of the items being moved, its output could be potentially dangerous to people or result in damage to products or equipment.
In this article, I propose four approaches to grounding AI agents in the physical world, where "grounding" is defined as the integration of external information, including domain-specific datasets, physical principles, and numerical simulations, to contextualize a model's reasoning.
All four approaches can be used separately or in combination, depending on the specific application. Practical implementation of these approaches will not only accelerate the safe and productive use of AI agents but could also allow for their expansion into new domains.
Four pillars of grounding
Project Eluna is an agentic AI model that lives in the cloud and assists operators who manage operations within fulfillment centers via digital dashboards. It’s designed to act with a degree of autonomy, reasoning through complex operational situations and recommending actions to operation managers. It pulls in historical and real-time data — such as the states of conveyor belts or robots — to anticipate bottlenecks and keep operations running smoothly. The four approaches to grounding AI agents that I describe here grew out of my research at the University of California, San Diego, and with the Amazon Fulfillment Technology (AFT) team, and they help ensure that agents like Eluna are physically consistent and operationally reliable.
1. Physics-guided deep learning.
Traditional foundation models can learn to mimic statistical patterns in data but often fail to respect the hard constraints of the physical universe, such as the conservation of mass, energy, or momentum. In physics-guided deep learning (PGDL), we integrate first-principles physical knowledge into the foundation model during pretraining. First principles include symmetries, such as invariance under rotations and other transformations, which act as inductive biases, and differential equations, such as those governing a robot’s motion and control. Not only does this ensure that predictions obey governing physical laws, but grounding a model in physics also allows it to learn from significantly smaller datasets. If the model already "knows" the fundamental principles of dynamics, it requires less data to achieve satisfactory accuracy.
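One common way to inject a differential equation into training is to add a penalty for violating it to the usual data-fitting loss. The following is a minimal sketch of that idea, not the PGDL implementation itself: the function name `physics_guided_loss`, the free-fall equation dv/dt = -g, and the `weight` parameter are all assumptions chosen for illustration.

```python
def physics_guided_loss(t, v_pred, v_obs, g=9.81, weight=1.0):
    """Data misfit plus a penalty for violating dv/dt = -g.

    Hypothetical example: t are time stamps (s), v_pred are predicted
    velocities (m/s), v_obs are observed velocities (m/s).
    """
    # Data term: mean squared error against the observations.
    data = sum((p - o) ** 2 for p, o in zip(v_pred, v_obs)) / len(v_pred)

    # Physics term: finite-difference residual of dv/dt + g = 0.
    residuals = [
        (v_pred[i + 1] - v_pred[i]) / (t[i + 1] - t[i]) + g
        for i in range(len(t) - 1)
    ]
    physics = sum(r ** 2 for r in residuals) / len(residuals)

    return data + weight * physics


t = [0.0, 0.1, 0.2, 0.3]
consistent = [-9.81 * ti for ti in t]   # obeys the governing equation
violating = [0.0, 0.0, 0.0, 0.0]        # ignores gravity entirely

loss_ok = physics_guided_loss(t, consistent, consistent)
loss_bad = physics_guided_loss(t, violating, violating)
```

Both predictions match their observations exactly, so the data term is zero in each case; only the physics term separates them. That is the sense in which the constraint supplies information the data alone does not.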
2. Uncertainty-aware reasoning.
LLMs often exhibit overconfidence in uncertain predictions, which can lead to the assertion of misinformation with high certainty. For an AI agent to be trustworthy in a mission-critical setting, it must know when it does not know. Using our framework (UQ4CT), we produce calibrated uncertainty over the space of functions that map input prompts to outputs. The framework uses an approach called mixture of experts, in which the model is divided into smaller “subnetworks”, each with specific expertise.
Our UQ4CT framework allows the model to dynamically align its confidence estimates with predictive correctness. Practically speaking, an agent grounded using calibrated uncertainty can halt or request human intervention when its internal uncertainty exceeds a safety threshold. This ensures reliability even when a model has been fine-tuned with relatively small datasets, as is common in domains such as epidemiological forecasting or the prediction of rare weather events.
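The halt-or-escalate behavior can be sketched as a simple gate on predictive uncertainty. This is an illustration under stated assumptions, not part of UQ4CT: the use of Shannon entropy as the uncertainty measure, the function names, and the `max_entropy` threshold are all hypothetical.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def act_or_escalate(probs, actions, max_entropy=0.5):
    """Return the most likely action, or escalate if uncertainty is too high.

    max_entropy is a hypothetical safety threshold that would be tuned
    per application.
    """
    if predictive_entropy(probs) > max_entropy:
        return "escalate_to_human"
    best = max(range(len(probs)), key=probs.__getitem__)
    return actions[best]


actions = ["reroute_totes", "slow_conveyor", "do_nothing"]
confident = act_or_escalate([0.97, 0.02, 0.01], actions)
uncertain = act_or_escalate([0.40, 0.35, 0.25], actions)
```

A gate like this is only as trustworthy as the probabilities fed into it, which is why calibration matters: an overconfident model would pass the threshold even when it is wrong.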
UQ4CT preserves high accuracy across five benchmarks while reducing expected calibration error (ECE) by more than 25 percent. ECE is a measure of how well a model's estimated "probabilities" match the true, observed probabilities. Even under distribution shift, UQ4CT maintains superior ECE performance with high accuracy, showcasing improved generalizability.
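For readers unfamiliar with ECE, the standard definition groups predictions into confidence bins and averages the gap between each bin's mean confidence and its observed accuracy, weighted by bin size. The sketch below assumes equal-width bins and a default of ten bins; the benchmarks mentioned above may use a different binning scheme.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins.

    confidences: predicted probability of the chosen answer, in [0, 1].
    correct: booleans, True where the chosen answer was right.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Ten predictions at 90 percent confidence, nine of them right.
calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
# Ten predictions at 90 percent confidence, only five of them right.
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
```

In the first case stated confidence matches observed accuracy, so ECE is essentially zero; in the second the model claims 90 percent but delivers 50 percent, giving an ECE of about 0.4.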
3. Bridging the text-to-numerical gap.
While foundation models are masters of natural language, the laws of the physical world are written in the language of mathematics and high-dimensional data, the kind used in fields like robotics, supply chain management, and finance. A trustworthy agent must translate human intent, expressed through language, into precise numerical execution without losing accuracy.
Our group developed the adapting-while-learning (AWL) framework, which relies on two key mechanisms. The first is called world-knowledge distillation, where AI agents interact with simulators of the physical world to gather a range of information about what’s physically possible. This knowledge is internalized through supervised fine-tuning, effectively grounding the agents’ future outputs.
The second mechanism is dynamic tool adaptation, in which a foundation model calls a specialized numerical simulator when it recognizes that its original training is insufficient for the complexity of the current task. This approach is particularly useful in climate science or epidemiology. For instance, if scientists need to plan for vaccine distribution, their original model would call on outside datasets representing disease dissemination.
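The routing decision behind dynamic tool adaptation can be sketched as follows. Everything here is a hypothetical stand-in: `simulate_sir` is a toy epidemic simulator, and the `difficulty` score and `threshold` represent the model's own judgment of whether it can answer without help, which in AWL is learned rather than hand-coded.

```python
def simulate_sir(beta, gamma, s0, i0, days, dt=0.1):
    """Toy SIR epidemic simulator (forward Euler), standing in for a
    specialized numerical tool."""
    s, i, r = float(s0), float(i0), 0.0
    n = s + i
    peak = i
    for _ in range(int(days / dt)):
        new_infections = beta * s * i / n * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        peak = max(peak, i)
    return {"peak_infected": peak, "final_recovered": r}


def answer(question, difficulty, threshold=0.5):
    """Answer directly when the task is judged easy; otherwise call the tool.

    difficulty is a hypothetical self-assessed score in [0, 1].
    """
    if difficulty < threshold:
        return {"source": "model", "result": question["prior_estimate"]}
    return {"source": "simulator", "result": simulate_sir(**question["sim_args"])}


question = {
    "prior_estimate": "cases will rise before they fall",
    "sim_args": {"beta": 0.3, "gamma": 0.1, "s0": 990, "i0": 10, "days": 100},
}
easy = answer(question, difficulty=0.2)
hard = answer(question, difficulty=0.9)
```

The design point is that tool calls are not free: invoking the simulator for every question wastes compute, while never invoking it forces the model to guess at quantities it cannot reliably compute.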
Compared to original models without AWL, those post-trained with AWL achieved 29 percent higher answer accuracy and 12 percent better usage of simulator tools, even surpassing state-of-the-art models including GPT-4o and Claude-3.5 on physical-science datasets.
4. Verifier-augmented grounding.
Verifiers are software external to LLMs that can be used to ensure that the models work within the bounds of logic and reality. Our weather AI agent, Zephyrus, uses verifiers to refine the reasoning of foundation models in weather science. Zephyrus works in a “reflective” interactive loop, where the agent writes code to query outside weather datasets, observes physical results, and revises its reasoning if the output is flagged by a verifier as scientifically implausible.
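The reflective loop can be sketched in a few lines. This is not Zephyrus's code: the verifier `plausible_temperature_c`, its bounds, and the `stub_agent` that stands in for the LLM are hypothetical, chosen to show a unit error being caught and corrected.

```python
def plausible_temperature_c(value):
    """Hypothetical verifier: reject surface air temperatures outside a
    deliberately broad physical range (degrees Celsius)."""
    return -100.0 <= value <= 70.0


def reflective_loop(propose, verify, max_rounds=3):
    """Ask for an answer, check it, and feed verifier objections back.

    Returns (answer, rounds_used), or (None, max_rounds) if no proposal
    passed verification.
    """
    feedback = None
    for round_no in range(1, max_rounds + 1):
        candidate = propose(feedback)
        if verify(candidate):
            return candidate, round_no
        feedback = f"value {candidate} flagged as implausible"
    return None, max_rounds


def stub_agent(feedback):
    """Stand-in for the LLM: first reports kelvin as if it were Celsius,
    then converts once the verifier objects."""
    raw_kelvin = 295.0
    return raw_kelvin if feedback is None else raw_kelvin - 273.15


result, rounds = reflective_loop(stub_agent, plausible_temperature_c)
```

Here the first proposal (295 degrees "Celsius") is rejected and the second (about 21.85) is accepted. Capping the number of rounds keeps the agent from looping indefinitely when it cannot satisfy the verifier.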
Another verifier, Hilbert, is used specifically for mathematical reasoning. LLMs, in general, can already generate mathematical proofs, but they need humans to verify whether these proofs are correct. However, there exist so-called proving systems, such as Lean 4, that can offer automatic verification.
This has prompted efforts to build specialized prover LLMs that can generate proofs in formal mathematical language. So far, however, these provers solve substantially fewer problems than general-purpose LLMs operating in natural language. Hilbert bridges this gap by breaking complex mathematical problems into subgoals and using feedback from a separate formal verifier to validate them recursively. This process ensures that the agent’s outputs are provably correct. We’ve shown an impressive 422 percent performance improvement over the best publicly available prover LLM.
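To give a flavor of what subgoal decomposition looks like in a formal language, here is a deliberately tiny Lean 4 example, written for illustration and not taken from Hilbert. Each `have` states an intermediate subgoal that the Lean kernel checks independently before the pieces are combined; the theorem name is hypothetical, while `Nat.add_zero` and `Nat.add_comm` are lemmas from Lean's core library.

```lean
theorem add_zero_comm_example (a b : Nat) : a + 0 + b = b + a := by
  -- Subgoal 1: adding zero changes nothing.
  have h1 : a + 0 = a := Nat.add_zero a
  -- Subgoal 2: addition of natural numbers commutes.
  have h2 : a + b = b + a := Nat.add_comm a b
  -- Combine: rewrite the goal with both verified facts.
  rw [h1, h2]
```

If any step were wrong, Lean would reject the proof at that step, which is the kind of feedback a decomposition-based agent can use to repair a single subgoal rather than restart the whole proof.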
Looking ahead
We believe these four pillars lay a solid foundation for grounding LLMs in reality. Meanwhile, several research directions stand to deepen the connection between AI agents and the physical world. First, foundation models can be fine-tuned to interact with more complex, multifidelity numerical simulations, moving beyond function calls to agentic tools and toward an internalized sense of when, and at what fidelity, to invoke a simulator during reasoning.
Second, uncertainty can serve not only as a hallucination detector but also as an intrinsic reward signal, training agents to explore areas of the environment where they have low confidence, high surprise, or incomplete knowledge.
Third, physical laws and domain constraints can be embedded as formal verifiers during process planning. They can check every proposed action against conservation principles, kinematic limits, and safety envelopes before execution.
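A pre-execution check of this kind might look like the sketch below. The `Move` fields, the limit values, and the function name are hypothetical; the only physics used is the standard stopping-distance relation d = v^2 / (2a) for constant deceleration a.

```python
from dataclasses import dataclass


@dataclass
class Move:
    """A proposed robot motion (hypothetical schema)."""
    payload_kg: float
    speed_m_s: float
    clearance_m: float  # free distance ahead of the robot


def check_move(move, max_payload_kg=25.0, max_speed_m_s=2.0, max_decel_m_s2=1.5):
    """Return a list of violated constraints; an empty list means the move passes."""
    violations = []
    if move.payload_kg > max_payload_kg:
        violations.append("payload exceeds limit")
    if move.speed_m_s > max_speed_m_s:
        violations.append("speed exceeds limit")
    # Safety envelope: the robot must be able to stop within the free space.
    stopping_distance = move.speed_m_s ** 2 / (2 * max_decel_m_s2)
    if stopping_distance > move.clearance_m:
        violations.append("cannot stop within clearance")
    return violations


safe = check_move(Move(payload_kg=10.0, speed_m_s=1.5, clearance_m=1.0))
unsafe = check_move(Move(payload_kg=30.0, speed_m_s=3.0, clearance_m=1.0))
```

Returning the full list of violations, rather than a single pass/fail flag, gives the planning agent something concrete to revise against.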
As these techniques mature, they will increasingly work in concert: an agent that couples physics-guided learning with calibrated uncertainty and formal verification will be far more robust than one relying on any single pillar alone. Ultimately, as AI agents expand into increasingly complex physical domains, faithful reasoning and effective grounding will be the guiding principles to ensure that agentic AI operates safely, reliably, and at scale across the physical world.