AI agent performance is not just a modeling problem; it is fundamentally a systems problem. A modern agent combines an LLM with a harness, software that mediates the LLM’s interaction with tools and manages the cycle of reasoning and feedback: you can think of the harness as the operating system around the model. As models improve, the performance bottleneck shifts from the model’s ability to reason to the harness’s ability to translate model intent into actions and reflect execution outcomes back to the model.
We formalize this bottleneck as the intent-execution gap: the mismatch between what the model intends and what the harness executes and, in the reverse direction, between what the harness actually did and what the model believes it did. For example, in trying to revise code, a model may intend to edit a single instance of a function, while the harness accidentally modifies multiple instances.
We show that minimizing this bidirectional gap, without any task-specific tuning, is sufficient to achieve state-of-the-art performance across diverse agentic benchmarks, including datasets that test real-world repository patching (SWE-Bench-Pro, SWE-Bench-Verified) and interactive terminal environments (Terminal-Bench-2).
While the most visible components of the harness (such as the execution graph, which controls iterations over the thought-action-observation process, and tools) are natural candidates for improvement, we highlight that seemingly trivial implementation details lead to nontrivial fluctuations in performance. Factors such as environment interaction timeouts, infrastructure stability, and resource constraints also materially affect performance. Thus, benchmaxing (chasing higher reported numbers on benchmarks) does not necessarily measure underlying model or harness capability, because scores are also influenced by the basic infrastructure parameters used during evaluation.
We also introduce Simple Strands Agent (SSA), a lightweight and customizable single-agent harness designed to close the gap between the performance reported in agent documentation and the performance seen in open-source implementations. SSA achieves consistent gains in performance across multiple models and benchmarks.
Finally, we show that effective agent design is not entirely model agnostic. While many principles generalize, different model families exhibit distinct preferences in tool usage, feedback interpretation, and context sensitivity, making model-harness codesign a critical factor in achieving optimal performance.
Motivations
It is well established that problem-specific customizations such as tuned prompts, tailored tools, and specialized execution graphs can improve AI models’ performance in a controlled setting (fixing all other factors, such as evaluation infrastructure). However, we observed that many such optimizations fail to transfer between models. Improvements that work for one model or version often degrade, disappear, or even regress with newer models.
This lack of transferability exposes a deeper issue: many optimizations implicitly overfit the behavior of a specific model. As models improve, these behaviors change, making such gains brittle and noncompounding.
In the context of agents, this suggests a shift in focus: rather than optimizing for current model behavior, we should identify invariant components — design principles that remain effective across model upgrades, benchmarks, and environments. To identify such invariants, we focus on the model-harness interface — the boundary where model outputs are interpreted and executed and where execution outcomes are communicated back to the model. This interface is the primary locus of failure when agent performance degrades across settings. From this perspective, two fundamental questions emerge:
- Does the harness understand what the model intends to do?
- Is the model clear about how the harness interpreted its actions?
These questions define the core alignment problem between model and harness and characterize the failure modes we analyze in the following sections.
Tool-interface failures
We consider the case in which the agent’s goal is code generation. Our agent primarily uses a bash tool, which provides access to the computer terminal (for example, to execute code), and a file editor to revise code.
The bash tool is extremely powerful and can subsume all the atomic operations of reading, searching, and editing. We make a simple enhancement to manage its outputs when they get too long. Naïvely truncating the output does not work well because the end of a command’s output carries useful information such as job status and command success or failure. Instead, we contain the response length by condensing content in the middle and keeping only a limited number of lines at the beginning and the end.
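The head-and-tail condensing described above can be sketched as follows. This is a minimal illustration, not SSA's actual implementation: the function name `condense_output`, the default line counts, and the wording of the omission marker are all assumptions.

```python
def condense_output(text: str, head: int = 20, tail: int = 20) -> str:
    """Keep the first `head` and last `tail` lines and summarize the middle.

    Hypothetical helper: name, defaults, and marker text are illustrative.
    """
    if head < 0 or tail < 0:
        raise ValueError("head and tail must be non-negative")
    lines = text.splitlines()
    if len(lines) <= head + tail:
        return text  # short enough: return the output untouched
    omitted = len(lines) - head - tail
    marker = f"[... {omitted} lines omitted ...]"
    # Slice from an explicit index so that tail=0 keeps no trailing lines.
    return "\n".join(lines[:head] + [marker] + lines[len(lines) - tail:])
```

Telling the model how many lines were dropped lets it decide whether to rerun the command with a narrower filter.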
For reasons of efficiency and better corner-case handling in editing, we use file-editing tools in addition to bash. Our file editor is based on a string-replace mechanism that replaces existing file content with new (model-provided) content to produce edits. While string-replace works well in many cases, we repeatedly observed failure modes that expose the intent-execution gap: the model may have a clear intention, but the harness may not have enough information to execute that intention safely. In these cases, a naïve editor does not merely underperform; it can actively damage the working state by applying the wrong edit with high confidence.
The first failure mode arises when the context of the model’s proposed edit appears at multiple locations in the codebase. From the model’s perspective, the requested edit may be unambiguous, because it is reasoning about a specific function, block, or error location. But if the harness receives only a raw “replace old text with new text” request, and the old text occurs several times, it cannot reliably infer which occurrence was intended.
Naïvely replacing all matches is dangerous. In practice, the safer behavior is for the harness to alert the model of the ambiguity and request clarification — for example, by asking it to expand the current context such that the text to be replaced is unique. This is a small implementation detail, but it sharply improves faithfulness between intended and executed edits.
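A minimal sketch of this behavior, assuming a plain string-replace editor: the edit is refused unless the old text occurs exactly once, and the error message tells the model how to disambiguate. The names `replace_unique` and `AmbiguousEditError` and the message wording are hypothetical.

```python
class AmbiguousEditError(Exception):
    """Raised when the text to replace matches more than one location."""


def replace_unique(content: str, old: str, new: str) -> str:
    """Replace `old` with `new` only if `old` occurs exactly once in `content`."""
    if not old:
        raise ValueError("old text must be non-empty")
    count = content.count(old)  # counts non-overlapping occurrences
    if count == 0:
        raise ValueError("old text not found; re-read the file and retry")
    if count > 1:
        raise AmbiguousEditError(
            f"old text matches {count} locations; include more surrounding "
            "context so that the match is unique"
        )
    return content.replace(old, new, 1)
```

In a harness, the exception text would be returned to the model as the tool result instead of being raised to the caller.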
A second failure mode appears when the model proposes only partial lines or short fragments for replacement. Partial-text matching is attractive because it is flexible, but it is also brittle: the same fragment may appear inside comments, string literals, neighboring expressions, or unrelated code paths. Even when the fragment is unique, replacing text that does not constitute a full logical unit — a complete line or well-bounded span — can produce malformed edits. These may be syntactically correct from the editor’s point of view but semantically unintended from the model’s point of view.
We found that requiring stronger text anchors — such as exact line spans, richer surrounding context, or line-aware matching — substantially reduces these accidental edits. Put differently, the harness should not execute underspecified edit requests by guessing.
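One way to realize line-aware matching is to accept an edit only when the old text matches a contiguous run of complete lines. The sketch below is one possible reading of that idea, with hypothetical function names, and is not necessarily how SSA anchors edits.

```python
def find_line_spans(content: str, old_block: str) -> list[tuple[int, int]]:
    """Return 1-indexed (start, end) spans where old_block matches whole lines."""
    lines = content.splitlines()
    target = old_block.splitlines()
    if not target:
        return []
    n = len(target)
    return [
        (i + 1, i + n)
        for i in range(len(lines) - n + 1)
        if lines[i:i + n] == target
    ]


def replace_whole_lines(content: str, old_block: str, new_block: str) -> str:
    """Replace old_block only if it matches exactly one run of complete lines."""
    spans = find_line_spans(content, old_block)
    if len(spans) != 1:
        raise ValueError(
            f"expected exactly one whole-line match, found {len(spans)}; "
            "provide complete lines with enough context"
        )
    start, end = spans[0]
    lines = content.splitlines()
    lines[start - 1:end] = new_block.splitlines()
    result = "\n".join(lines)
    # Preserve the trailing newline of the original file, if any.
    return result + "\n" if content.endswith("\n") else result
```

With this rule, a fragment such as `count` is rejected even when it occurs inside only one line, because it is not itself a complete line.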
Third, even when an edit is applied successfully, simply returning “edit succeeded” leaves the model underinformed about what the harness changed. This weakens the reverse side of the interaction loop: not only should the model express intent clearly, but it should also be able to verify how that intent was interpreted.
To close this loop, we found it useful, after every successful edit, to supply the model with a diff file — a text file indicating what additions and deletions had been made and what text stayed the same. A diff serves as an immediate confirmation channel: the model can inspect whether the replacement landed in the correct location, whether collateral lines changed, and whether follow-up edits are needed. This seemingly minor feedback mechanism improves reliability because it converts editing from a fire-and-forget action into an observable state transition.
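As a minimal sketch of this feedback channel, Python's standard `difflib` module can produce a unified diff from the file contents before and after an edit. The function name `edit_feedback` and the `a/` and `b/` path prefixes are illustrative assumptions.

```python
import difflib


def edit_feedback(before: str, after: str, path: str, context: int = 3) -> str:
    """Return a unified diff describing what an edit changed in `path`."""
    diff = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
        n=context,  # number of unchanged context lines around each change
    )
    return "".join(diff)
```

An empty return value means the edit changed nothing, which is itself useful information for the model.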
A natural question arises: if the diff is provided after a successful edit, why do the first two failure modes require special handling? While the diff does expose unintended changes, it does so after the mistake has already been applied. At that point, the model must decide whether to roll back, repair the unintended edits, or continue execution with a potentially corrupted state. This introduces additional branching in the agent’s trajectory and forces it to spend tokens and reasoning effort correcting avoidable errors, rather than progressing toward the solution.
In other words, every correction step injects additional information into the model’s context window. Note that every piece of information competes for the agent’s attention for next-action generation. Unrelated or unintended edits do not just waste tokens; they actively degrade performance by introducing spurious patterns and relationships, increasing the likelihood that the model forms incorrect associations and drifts away from the original goal.
In contrast, addressing ambiguity and weak anchoring before execution ensures that edits are applied correctly in the first place. This reduces unnecessary exploration, prevents cascading errors, and keeps the context focused on task-relevant signals. In effect, the first two failure modes improve correctness at the point of action, while diff feedback improves observability after action. Both are necessary, but they operate at fundamentally different stages of the interaction loop.
Reasoning
A less obvious but equally important design consideration is how agents balance internal reasoning with external interactions. Chain-of-thought reasoning is clearly valuable. It allows the model to decompose a problem, plan next steps, and decide which tool to invoke. Without sufficient reasoning, tool usage becomes reactive, leading to shallow exploration, redundant calls, or poor sequencing of actions.
However, excessive thinking introduces its own failure mode. When the model spends too long reasoning internally, it begins to form assumptions about the environment rather than verifying them. These assumptions may appear coherent within the model’s internal state, but they are often misaligned with the actual system state. As a result, the agent may issue poorly grounded tool calls or skip necessary validation steps altogether. This creates a fundamental tension: too little reasoning makes tool use reactive, while too much reasoning leaves it ungrounded.
Effective agents must continuously reconcile these two demands, and we refer to this balance as tool calling with a reasoning nudge. The idea is to encourage the model to perform just enough reasoning to decide the next action and then prioritize evidence-gathering interactions with the environment over further reasoning. Rather than extending internal chains of thought, the agent is nudged toward validating its hypotheses through tool outputs.
In practice, we did not find a single “golden prompt” that reliably balances reasoning and tool interaction across all model families. For the Claude variants, we found that introducing quantitative guidance — e.g., “make 50+ tool calls” or “ideal tool call count is 100” — helps break long reasoning chains and pushes the model toward interacting with the environment. While the exact number of target tool calls is not important, it serves as a useful north star that biases the model toward action.
However, in our experiments, this strong nudge was ineffective for other families, such as Gemini and Grok, which often interpret such instructions literally and make empty tool calls in order to meet the target. Such behavior reduces agent quality. For these families, we find that a flexible nudge like “You should use tools as much as possible” works well. The principle remains the same: we need to nudge the model to proactively use tools along with the right amount of reasoning.
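One simple way to keep such family-specific nudges out of the core harness is a small lookup that appends the appropriate sentence to a shared base prompt. The sketch below is illustrative only: the family key, the function name, and the prompt layout are assumptions, and only the nudge wording is taken from the text.

```python
# Hypothetical family key; the nudge sentences are quoted from the article.
REASONING_NUDGES = {
    "claude": "Ideal tool call count is 100.",
}
DEFAULT_NUDGE = "You should use tools as much as possible."


def build_system_prompt(base_prompt: str, model_family: str) -> str:
    """Append the family-specific reasoning nudge to a shared base prompt."""
    nudge = REASONING_NUDGES.get(model_family.lower(), DEFAULT_NUDGE)
    return f"{base_prompt.rstrip()}\n\n{nudge}"
```

Families without an explicit entry fall back to the flexible nudge, so adding a new model family requires no code change.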
Tool use preferences
Across agents, tools function in exactly the same way, but models tend to exhibit distinct preferences in how they invoke them. For example, GPT models prefer to update code by using an apply_patch command to splice in text from a separate file, formatted in a particular way; denying them their formatting preferences hurts performance.
Similarly, for Grok-4.20, a single monolithic tool for editing and viewing creates confusion, which leads to incorrect tool calls. Splitting functionality into atomic operations yields better results, even when the functionality remains unchanged. Additionally, showing line numbers when viewing a file helps most models, but Grok’s tokenizer and attention mechanism appeared less robust at separating line-number prefixes from file content, so disabling this feature improves the view tool for Grok. These preferences are a by-product of training.
This reinforces a broader design principle: agent performance is a function of not only what tools are available but how naturally those tools align with the model’s learned behaviors. A well-designed harness meets the model where it is, adapting interfaces, feedback, and interaction patterns to its strengths while still enforcing the invariants needed for reliable execution.
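As an illustration of adapting an interface without changing its functionality, the sketch below shows an atomic view operation whose line-number prefixes can be switched off per model family. The function name, the prefix format, and the `SHOW_LINE_NUMBERS` table are hypothetical and not SSA's actual tool specification.

```python
from typing import Optional

# Hypothetical per-family setting; the text reports that Grok did better
# without line-number prefixes, while most models benefit from them.
SHOW_LINE_NUMBERS = {"grok": False}


def view_file(text: str, start: int = 1, end: Optional[int] = None,
              line_numbers: bool = True) -> str:
    """Return lines start..end (1-indexed, inclusive), optionally numbered."""
    if start < 1:
        raise ValueError("start must be >= 1")
    lines = text.splitlines()
    end = len(lines) if end is None else min(end, len(lines))
    selected = lines[start - 1:end]
    if not line_numbers:
        return "\n".join(selected)
    width = len(str(end))  # right-align numbers to the widest one shown
    return "\n".join(
        f"{n:>{width}}  {line}" for n, line in enumerate(selected, start=start)
    )
```

A harness could call `view_file(text, line_numbers=SHOW_LINE_NUMBERS.get(family, True))` so that the underlying operation stays identical across model families.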
Benchmarking study
SSA is a simple harness that implements many of the principles we describe above. We evaluated it on three agentic benchmarks — SWE-Bench-Verified (n = 500), SWE-Bench-Pro (public set, n = 731) and Terminal-Bench-2 (n = 89). Each example in SWE-Bench-Verified and SWE-Bench-Pro is an open-source code repository and an “issue” to be fixed by making a code change. Terminal-Bench-2 tackles a range of programming tasks (software engineering, machine learning, security, etc.) but is not tied to a code repository.
All three benchmarks have individual, static, prewritten tests for evaluating generated code. In SWE-Bench-Verified and SWE-Bench-Pro, the runs and evaluations occur in separate container images, meaning changes must be transferred into a different evaluation environment; in Terminal-Bench-2, the evaluation happens in the same container. Therefore, in the SWE benchmarks, it may be necessary to exclude irrelevant artifacts so that the diff patch is not bloated. Additionally, Terminal-Bench-2 imposes computational and agent-runtime limits that the SWE benchmarks do not. We evaluate our SSA agents using metrics standard in the field.
Note that the mini-swe-agent results reported above in the SWE-Bench-Verified graph and the Terminus results reported in the Terminal-Bench-2 graph correspond to a fixed agent configuration per benchmark — the exact same prompts, tool specifications, and structural output instructions. As we discuss above, however, different model families require different reasoning nudges and exhibit distinct preferences for tool use. As a result, while SSA’s core harness remains identical, there are minimal but nonzero differences in prompts and tool specifications across model families (e.g., Claude, Gemini, GPT, Grok).
Our goal in building SSA was not to optimize separate agents per model but to identify minimal, orthogonal adaptations that allow different model families to express their strongest capabilities within a shared harness framework.
Terminal-Bench-2
Unlike SWE-Bench-Verified and SWE-Bench-Pro, the Terminal-Bench-2 dataset restricts the agent’s environment by limiting computational capacity (memory, storage, number of CPUs) and time (both agent and verifier run times) per project. While this is effective in limiting disproportionate use of computational resources to boost benchmark scores, it does have the unintended side effect of making the benchmark more sensitive to infrastructure choices.
We observed that, given those restrictions, the following system characteristics have the most impact:
- Reliability of the inference backend. The inference backend’s capacity (tokens per minute and requests per minute) should be able to support all concurrently run projects for the full duration of the evaluation. High variance in invoker latency, frequent API timeouts, and retries eat into the allowed time budget, leading to more timeouts and a lower resolution rate.
- The number of concurrent projects run on a single node. This affects the network bandwidth available to each project. One of the first steps for an agent in Terminal-Bench-2 is to install dependencies (popular packages such as pip, torch, and transformers). If the evaluation infrastructure is set up in such a way that multiple projects are run on a single node (e.g., Harbor with n_concurrent > 1), the node’s available network bandwidth is shared across all the concurrent projects. This increases the download times for dependencies, leaving the agent with less time for problem solving and a higher risk of getting interrupted before it’s done.
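As a rough illustration of the bandwidth effect in the second item above, assume (purely for illustration) that a node's bandwidth $B$ is split evenly among $n$ concurrent projects, each of which downloads dependencies of total size $S$ within an agent time budget $T$. These symbols and the even-split assumption are hypothetical and do not come from the benchmark.

```latex
t_{\text{download}} \approx \frac{S}{B/n} = \frac{nS}{B},
\qquad
t_{\text{solve}} \approx T - \frac{nS}{B}
```

Under this simplified model, download time grows linearly with the number of concurrent projects, and the time left for problem solving shrinks by the same amount.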
Since the majority of tool calls involve command-line instructions, a natural way to address timeouts is to introduce a batch interface, allowing the agent to execute multiple commands in a single turn, rather than executing them sequentially. In our experiments, however, the results of this approach were mixed and correspond to one of the failure modes we describe above — the balance between reasoning and tool interaction.
While batching reduces interaction overhead, it also requires the model to maintain a coherent terminal state across multiple steps, which increases reasoning complexity. For Claude models, the time taken by additional autoregressive reasoning tends to offset the gains from batching. In contrast, for other model families (such as Gemini and Grok), batch execution was beneficial, as it did not trigger additional reasoning. Overall, under constrained settings, batching commands does not consistently improve performance across all models.
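For concreteness, a batch interface of the kind discussed above might look like the following sketch. It is an assumption-laden illustration, not SSA's tool: the function name, the result fields, and the stop-on-first-failure policy are hypothetical, and each command runs in a fresh shell, so state such as `cd` does not persist between commands as it would in a persistent terminal session.

```python
import subprocess


def run_batch(commands, timeout_s: float = 30.0, stop_on_error: bool = True):
    """Run several shell commands in one agent turn and collect their results."""
    results = []
    for cmd in commands:
        try:
            proc = subprocess.run(
                cmd, shell=True, capture_output=True, text=True, timeout=timeout_s
            )
            results.append({
                "command": cmd,
                "exit_code": proc.returncode,
                "output": proc.stdout + proc.stderr,
            })
        except subprocess.TimeoutExpired:
            # exit_code None marks a command that did not finish in time.
            results.append({
                "command": cmd,
                "exit_code": None,
                "output": f"timed out after {timeout_s}s",
            })
        # Stop early so later commands do not run against a broken state.
        if stop_on_error and results[-1]["exit_code"] != 0:
            break
    return results
```

Returning one result per command keeps the feedback observable, at the cost of a longer tool response for the model to read.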
Given that evaluations are sensitive to such confounding factors, we next assess the upper-bound potential of the agent-model combination by relaxing resource and time constraints. Specifically, we compare SSA’s performance on Terminal-Bench-2 under constrained settings (as shown above) and unconstrained settings, where memory limits and agent timeouts are removed. The unconstrained setup serves as an estimate of the achievable performance ceiling.
The gap in accuracy between the constrained and unconstrained evaluations is typically 5-10%. We note that in our experiments, out of the 89 total projects in Terminal-Bench-2, a few consistently have a high timeout rate in the constrained evaluation but a high solve rate in the unconstrained setting. Those projects are make-doom-for-mips, torch-pipeline-parallelism, gpt2-codegolf, caffe-cifar-10, and train-fasttext.
Experimental methodology
We evaluate SSA across multiple agent benchmarks under a controlled and reproducible setup. All experiments were conducted on an AWS PCS cluster using c7.48xlarge instances, with maximum concurrency set to 10 to balance throughput and system stability. For model access, Claude models were served via Amazon Bedrock (production capacity), while OpenAI, Gemini, and Grok models were accessed through their respective commercial APIs.
We enforced strict evaluation hygiene. Internet access was disabled for SWE-Bench-Verified and SWE-Bench-Pro runs, while it was enabled for Terminal-Bench-2 due to its benchmark design. For SWE-Bench-Verified and SWE-Bench-Pro, we used the standard benchmarking Docker environments, which include repository state up to the point of the current code revision. This allows agents access to the relevant history of the codebase while ensuring no access to future revisions.
Evaluation-specific issues
In SWE-Bench-Verified, instances such as astropy-8872 and astropy-8707 fail even with flawless code patches due to setup inconsistencies and require fixes in the evaluation environment. Additionally, some psf_requests instances can fail intermittently due to external test dependencies (e.g., nonresponsive URLs), requiring manual patching for reliable evaluation.
For SWE-Bench-Pro, evaluations were executed on Amazon ECS. Due to environment-specific assumptions, a small subset of tests (3 out of 731 instances) consistently fail when run on AWS infrastructure, resulting in a ceiling loss of approximately 0.41% across all SSA evaluations. Finally, to minimize information leakage during agent runs in Terminal-Bench-2, hidden tests are introduced into the Docker environment only after the agent has completed its execution, ensuring that the agent has no direct access to them during problem solving. Note that internet access in Terminal-Bench-2 does introduce a possibility of solution leakage, but a manual review of trajectories didn’t reveal any instances of the model trying to copy solutions.
Model configs
To ensure reproducibility, we used publicly documented configurations from release/model cards wherever available. Specifically, Claude Opus 4.6 and Claude Sonnet 4.6 were used with adaptive thinking and max effort across all benchmarks (except when Sonnet 4.6 was tested on Terminal-Bench-2 with thinking disabled). Opus 4.5 used high effort and no thinking across all benchmark runs (except in Terminal-Bench-2, where Opus 4.5 has thinking enabled with 128k budget tokens). Sonnet 4.5 was used with an interleaved-thinking budget of 200k, Haiku 4.5 with a 128k budget, and Sonnet 4.0 with a 200k budget across all runs. Both Gemini 3.0 Flash and Gemini 3.1 Pro used thinking_level high and temperature 1.0 across all runs. Every GPT model used reasoning effort xhigh for all benchmarking runs. With Grok, we used the grok-4.20 reasoning variant for all runs with default configs.
Detailed config files for every experiment are included in the SSA package.
Conclusion
We show that bridging the intent-execution gap in agent harnesses is critical to extracting state-of-the-art performance out of frontier models. Well-chosen editing tools, feedback from tool application, and management of tool-output lengths improve performance across all model families. On the other hand, models exhibit distinct preferences for different tool interfaces, and an effective harness should leverage them instead of trying to uniformly impose the same interfaces across all model families. For easy reproducibility, we open-source all elements of our harness (the agent logic, tools, and prompts, as well as model configs) in the SSA package.
Acknowledgments: Luke Huan and Anoop Deoras