Today, the key challenge in AI isn’t only how to build better models; it’s how to build evaluation systems that can keep up. Search-augmented AI systems can now produce deep research reports — long, polished syntheses of many sources that increasingly resemble expert analysis. But those reports are useful only if their claims are supported by the underlying literature.
Most existing fact-checking tools work best when a claim can be matched to a short quote or a single document. But in AI-generated research reports, a single sentence may combine evidence from several sources. It can depend on the surrounding report for context, and it might compare assertions in a way that no single source does on its own.
When Amazon’s Artificial General Intelligence (AGI) group started working on the problem of evaluating AI-generated research reports, we thought that the main technical challenge would be building a stronger AI fact checker. But before you can evaluate an AI fact checker, you need a benchmark, a standardized test set used to measure performance. And in this setting, building the benchmark turned out to be at least as hard as building the model.
Traditionally, we view the ground truth for a problem as a fixed dataset. But we discovered that to evaluate complex AI properly, ground truth has to become a process. We call that process audit-then-score, and we present it, together with two accompanying datasets, in a paper we recently published to arXiv.
When static datasets break down
In the standard method for measuring AI performance, human experts label examples, those labels become the “ground truth” (the undisputed correct answers), and models are scored against them. To test this approach with AI-generated research reports, we recruited PhD-level specialists from fields such as computer science, control theory, education, public health, and environmental engineering. We asked them to verify claims from reports in their own specialties, mixing in a hidden set of claims whose answers we already knew.
The result was sobering. In a controlled study, unassisted experts achieved only 60.8% accuracy on the hidden set of known answers.
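The hidden-set check itself is simple to express. The sketch below is only an illustration under stated assumptions: the claim IDs, the label names, and the function `hidden_set_accuracy` are hypothetical, and the toy data is made up rather than taken from the study.

```python
from typing import Dict


def hidden_set_accuracy(
    expert_labels: Dict[str, str],
    hidden_answers: Dict[str, str],
) -> float:
    """Score an expert only on the hidden claims whose answers are known."""
    scored = [cid for cid in hidden_answers if cid in expert_labels]
    if not scored:
        raise ValueError("the expert labeled none of the hidden claims")
    correct = sum(expert_labels[cid] == hidden_answers[cid] for cid in scored)
    return correct / len(scored)


# Toy data (hypothetical): the expert labels five claims and does not know
# that three of them (c2, c4, c5) are hidden controls.
expert_labels = {
    "c1": "supported",
    "c2": "unsupported",
    "c3": "supported",
    "c4": "supported",
    "c5": "unsupported",
}
hidden_answers = {"c2": "unsupported", "c4": "unsupported", "c5": "unsupported"}

accuracy = hidden_set_accuracy(expert_labels, hidden_answers)
print(f"hidden-set accuracy: {accuracy:.3f}")
```

In this toy case the expert gets two of the three hidden claims right. Claims outside the hidden set are ignored, because no trusted answer exists for them.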
The issue was not a lack of expertise. It was that assessing deep-research factuality is an unusually demanding task. Verifying a single claim can require long-context reading, cross-document synthesis, and sustained attention.
Normally, in machine learning, when a model disagrees with a benchmark, we assume the model made a mistake. But we realized that, in cognitively demanding tasks like deep research, disagreement should not automatically be treated as a model failure. Sometimes, a model’s “error” is actually a signal that the benchmark itself is ambiguous, incomplete, or wrong.
Audit, then score
Instead of treating the initial expert labels as unquestionable ground truth, we decided to use the models to actively scrutinize the benchmark. This is the core idea behind the audit-then-score protocol. Our paper introduces the protocol alongside DeepFact-Bench, a shared test set for comparing systems, and DeepFact-Eval, a system that checks whether literature supports report claims.
Here is how the protocol works: When our AI fact checker disagrees with the current benchmark answer, it is not simply penalized. Instead, it acts as a challenger and must submit concrete evidence and a written rationale for why it thinks the original human answer is wrong. An auditor, who can be a human expert, then steps in. Crucially, auditors do not start from scratch; they compare the challenger's new evidence directly against the benchmark's original rationale. If the challenger makes the stronger case, we revise the benchmark before we score the model.
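One way to picture the protocol is as a two-phase loop: audit every disagreement first, then score against the possibly revised answers. The following is a minimal sketch under our own assumptions, not the implementation from the paper. The names `BenchmarkEntry`, `Challenge`, `audit_then_score`, and `toy_auditor` are hypothetical, and the toy auditor is a crude stand-in for a human expert's judgment.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class BenchmarkEntry:
    label: str      # current benchmark answer, e.g. "supported"
    rationale: str  # why the benchmark holds that answer


@dataclass
class Challenge:
    label: str      # the fact checker's verdict
    evidence: str   # concrete evidence submitted by the challenger
    rationale: str  # written argument for the verdict


# An auditor returns True when the challenger makes the stronger case.
Auditor = Callable[[BenchmarkEntry, Challenge], bool]


def audit_then_score(
    benchmark: Dict[str, BenchmarkEntry],
    predictions: Dict[str, Challenge],
    auditor: Auditor,
) -> Tuple[float, Dict[str, BenchmarkEntry]]:
    if not predictions:
        raise ValueError("no predictions to score")
    revised = dict(benchmark)  # leave the original benchmark untouched
    # Audit phase: each disagreement is reviewed before any scoring happens.
    for claim_id, pred in predictions.items():
        entry = revised[claim_id]
        if pred.label != entry.label and auditor(entry, pred):
            revised[claim_id] = BenchmarkEntry(pred.label, pred.rationale)
    # Score phase: accuracy is measured against the revised benchmark.
    correct = sum(p.label == revised[cid].label for cid, p in predictions.items())
    return correct / len(predictions), revised


def toy_auditor(entry: BenchmarkEntry, challenge: Challenge) -> bool:
    # Stand-in only: accept a challenge whenever it cites some evidence.
    return bool(challenge.evidence.strip())


benchmark = {
    "c1": BenchmarkEntry("supported", "matches source A"),
    "c2": BenchmarkEntry("supported", "matches source B"),
    "c3": BenchmarkEntry("unsupported", "no source found"),
}
predictions = {
    "c1": Challenge("supported", "source A", "agrees with the benchmark"),
    "c2": Challenge("unsupported", "source B, later section", "B qualifies the claim"),
    "c3": Challenge("supported", "", "asserted without evidence"),
}

accuracy, revised = audit_then_score(benchmark, predictions, toy_auditor)
print(f"accuracy after audit: {accuracy:.3f}")
```

In this toy run the challenge on c2 is upheld, so the benchmark is revised and the model is credited, while the challenge on c3 fails for lack of evidence and still counts as an error. The ordering is the point: revisions happen before scoring, so a model is never penalized for disagreeing with an answer that the audit overturns.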
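DeepFact-Eval reads the full report context, plans searches to cover the relevant literature, summarizes retrieved documents, and asks follow-up questions when key details are missing. It then produces both a verdict and a written explanation. Taken together, the audit-then-score protocol fundamentally changes what a benchmark is: no longer a fixed answer key, but a record that is revised whenever a challenger presents the stronger case.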
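The verification steps attributed to DeepFact-Eval above can be sketched as a small loop. This is a rough illustration under heavy assumptions, not the system's actual architecture: every name here (`Tools`, `verify_claim`, `max_rounds`, and the stub functions) is hypothetical, the stubs replace what would be model and search calls, and the corpus is invented toy text.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Verdict:
    label: str
    explanation: str


@dataclass
class Tools:
    # Hypothetical hooks; in a real system each would call a model or search API.
    plan_searches: Callable[[str, str], List[str]]     # (claim, report) -> queries
    search: Callable[[str], List[str]]                 # query -> documents
    summarize: Callable[[str, str], str]               # (claim, document) -> note
    follow_ups: Callable[[str, List[str]], List[str]]  # (claim, notes) -> new queries
    judge: Callable[[str, str, List[str]], Verdict]    # (claim, report, notes) -> verdict


def verify_claim(claim: str, report: str, tools: Tools, max_rounds: int = 2) -> Verdict:
    notes: List[str] = []
    queries = tools.plan_searches(claim, report)  # planning sees the full report
    for _ in range(max_rounds):
        if not queries:
            break
        for query in queries:
            for document in tools.search(query):
                notes.append(tools.summarize(claim, document))
        # Ask follow-up questions only if key details are still missing.
        queries = tools.follow_ups(claim, notes)
    return tools.judge(claim, report, notes)


# Toy stubs over an invented two-document corpus.
CORPUS = {
    "method effect": ["Toy doc 1: the method reduced the error."],
    "method sample size": ["Toy doc 2: the study ran thirty trials."],
}


def plan(claim: str, report: str) -> List[str]:
    return ["method effect"]


def search(query: str) -> List[str]:
    return CORPUS.get(query, [])


def summarize(claim: str, document: str) -> str:
    return document  # identity stand-in for a real summarizer


def follow_ups(claim: str, notes: List[str]) -> List[str]:
    return [] if any("trials" in n for n in notes) else ["method sample size"]


def judge(claim: str, report: str, notes: List[str]) -> Verdict:
    label = "supported" if any("reduced the error" in n for n in notes) else "unsupported"
    return Verdict(label, f"Based on {len(notes)} summarized documents.")


tools = Tools(plan, search, summarize, follow_ups, judge)
verdict = verify_claim("The method reduces error.", "full report text", tools)
print(verdict.label, "|", verdict.explanation)
```

Passing the stages in as callables keeps the control flow visible: plan, retrieve and summarize, follow up if details are missing, then judge. The cap on rounds is our own assumption to keep the loop bounded.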
A new role for human expertise
One of the most striking things we found is that the same experts who were unreliable as one-shot labelers became far more reliable when placed in the role of auditor. Across four rounds of audit-then-score, accuracy on our hidden test set rose from 60.8% to 90.9%. When experts start from a blank page, they have to find the evidence, interpret it, and make a judgment on their own; when they audit a disputed claim, they can focus on comparing two concrete cases.
The evaluation results were equally clear. On DeepFact-Bench, DeepFact-Eval reached 83.4% accuracy when we used GPT-4.1 as the underlying model. That was higher than both the 58.5% of the best traditional fact-checking system we tested and the 69.1% of a strong prior deep-research system.
Evaluation as an evolving infrastructure
This shift has implications beyond one paper or one task. If AI systems continue to improve to the point that they exhibit humanlike expertise, the community will increasingly run into settings where evaluation based on one-time human answers is not enough. In those settings, sustaining benchmark quality may require auditing, revision, calibration, and periodic revalidation. Evaluation will become an ongoing collaboration among humans, models, and the evidence they surface together.