Bridging intent and execution in agentic systems

The harnesses that mediate between models and tools in agentic systems are becoming their own performance bottleneck, but a few simple design principles can fix what ails them.


AI agent performance is not just a modeling problem; it is fundamentally a systems problem. A modern agent combines an LLM with a harness, software that mediates the LLM’s interaction with tools and manages the cycle of reasoning and feedback: you can think of the harness as the operating system around the model. As models improve, the performance bottleneck shifts from the model’s ability to reason to the harness’s ability to translate model intent into actions and reflect execution outcomes back to the model.

We formalize this bottleneck as the intent-execution gap: the mismatch between what the model intends and what the harness executes, and vice versa. For example, in trying to revise code, a model may intend to edit a single instance of a function, while the harness accidentally modifies multiple instances.

We show that minimizing this bidirectional gap, without any task-specific tuning, is sufficient to achieve state-of-the-art performance across diverse agentic benchmarks, including datasets that test real-world repository patching (SWE-Bench-Pro, SWE-Bench-Verified) and interactive terminal environments (Terminal-Bench-2).

While the most visible components of the harness, such as tools and the execution graph (which controls iterations over the thought-action-observation process), are natural candidates for improvement, we highlight that seemingly trivial implementation details lead to nontrivial fluctuations in performance. Factors such as environment interaction timeouts, infrastructure stability, and resource constraints also materially affect performance. Thus, benchmaxing, or chasing higher reported numbers on benchmarks, does not necessarily measure underlying model or harness capability, because scores are also influenced by the basic infrastructure parameters used during evaluation.

We also introduce Simple Strands Agent (SSA), a lightweight and customizable single-agent harness designed to close the gap between the performance reported in agent documentation and the performance seen in open-source implementations. SSA achieves consistent gains in performance across multiple models and benchmarks.

Finally, we show that effective agent design is not entirely model agnostic. While many principles generalize, different model families exhibit distinct preferences in tool usage, feedback interpretation, and context sensitivity, making model-harness codesign a critical factor in achieving optimal performance.

Motivations

It is well established that problem-specific customizations such as tuned prompts, tailored tools, and specialized execution graphs can improve AI models’ performance in a controlled setting (fixing all other factors, such as evaluation infrastructure). However, we observed that many such optimizations fail to transfer between models. Improvements that work for one model or version often degrade, disappear, or even regress with newer models.

This lack of transferability exposes a deeper issue: many optimizations implicitly overfit the behavior of a specific model. As models improve, these behaviors change, making such gains brittle and noncompounding.

In the context of agents, this suggests a shift in focus: rather than optimizing for current model behavior, we should identify invariant components — design principles that remain effective across model upgrades, benchmarks, and environments. To identify such invariants, we focus on the model-harness interface — the boundary where model outputs are interpreted and executed and where execution outcomes are communicated back to the model. This interface is the primary locus of failure when agent performance degrades across settings. From this perspective, two fundamental questions emerge:

  1. Does the harness understand what the model intends to do?
  2. Is the model clear about how the harness interpreted its actions?

These questions define the core alignment problem between model and harness and characterize the failure modes we analyze in the following sections.

Tool-interface failures

We consider the case in which the agent’s goal is code generation. Our agent primarily uses a bash tool, which provides access to the computer terminal (for example, to execute code), and a file editor to revise code.

Original vs. condensed bash log output.

The bash tool is extremely powerful and can consume all the atomic operations of reading, searching, and editing. We make a simple enhancement to manage its outputs when they get too long. Naïvely truncating the output does not work well because the end of a command execution confirmation carries useful information such as job status and command success/failure. Instead, we contain the response length by condensing content in the middle and keeping only a limited number of lines at the beginning and the end.
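
The head-and-tail condensing described above can be sketched as follows. This is a minimal illustration, not the SSA implementation: the function name `condense_output`, the default line budgets, and the wording of the omission marker are all assumptions.

```python
def condense_output(text: str, head: int = 20, tail: int = 20) -> str:
    """Keep the first `head` and last `tail` lines and summarize the middle.

    Hypothetical helper: the budgets and the marker text are illustrative.
    """
    lines = text.splitlines()
    if len(lines) <= head + tail:
        return text  # short outputs pass through unchanged
    omitted = len(lines) - head - tail
    marker = f"... [{omitted} lines omitted] ..."
    # Slice from an explicit index so that tail=0 keeps no trailing lines.
    return "\n".join(lines[:head] + [marker] + lines[len(lines) - tail:])
```

Keeping the tail preserves the exit status or error message that usually closes a command's output, while the marker tells the model that content was dropped and how much.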

For reasons of efficiency and better corner-case handling in editing, we use file-editing tools in addition to bash. Our file editor is based on a string-replace mechanism that replaces existing file content with new (model-provided) content to produce edits. While string-replace works well in many cases, we repeatedly observed failure modes that expose the intent-execution gap: the model may have a clear intention, but the harness may not have enough information to execute that intention safely. In these cases, a naïve editor does not merely underperform; it can actively damage the working state by applying the wrong edit with high confidence.

Overly broad search-and-replace edits (left) vs. properly scoped replacement (right).

The first failure mode arises when the context of the model’s proposed edit appears at multiple locations in the codebase. From the model’s perspective, the requested edit may be unambiguous, because it is reasoning about a specific function, block, or error location. But if the harness receives only a raw “replace old text with new text” request, and the old text occurs several times, it cannot reliably infer which occurrence was intended.

Naïvely replacing all matches is dangerous. In practice, the safer behavior is for the harness to alert the model of the ambiguity and request clarification — for example, by asking it to expand the current context such that the text to be replaced is unique. This is a small implementation detail, but it sharply improves faithfulness between intended and executed edits.
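
A minimal sketch of this ambiguity check is shown below, assuming a plain string-replace editor. The names `EditResult` and `replace_unique` and the exact error messages are hypothetical and are not taken from the SSA package.

```python
from dataclasses import dataclass


@dataclass
class EditResult:
    ok: bool       # whether the edit was applied
    content: str   # file content after the call (unchanged on failure)
    message: str   # feedback returned to the model


def replace_unique(content: str, old: str, new: str) -> EditResult:
    """Apply the edit only if `old` occurs exactly once; otherwise ask for more context."""
    if not old:
        return EditResult(False, content, "Error: the search text is empty.")
    count = content.count(old)  # counts non-overlapping occurrences
    if count == 0:
        return EditResult(False, content, "Error: the search text was not found.")
    if count > 1:
        return EditResult(
            False,
            content,
            f"Error: the search text matches {count} locations. "
            "Include more surrounding context so that it matches exactly one.",
        )
    return EditResult(True, content.replace(old, new, 1), "Edit applied.")
```

The important property is that a failed call leaves the file untouched and returns an actionable message, so the model can retry with a more specific anchor instead of repairing collateral damage.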

A second failure mode appears when the model proposes only partial lines or short fragments for replacement. Partial-text matching is attractive because it is flexible, but it is also brittle: the same fragment may appear inside comments, string literals, neighboring expressions, or unrelated code paths. Even when the fragment is unique, replacing text that does not constitute a full logical unit — a complete line or well-bounded span — can produce malformed edits. These may be syntactically correct from the editor’s point of view but semantically unintended from the model’s point of view.

We found that requiring stronger text anchors — such as exact line spans, richer surrounding context, or line-aware matching — substantially reduces these accidental edits. Put differently, the harness should not execute underspecified edit requests by guessing.
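
One way to realize line-aware matching is to accept an edit only when the search text matches a run of complete lines exactly once. The sketch below is an assumption about how such a check could look; `replace_whole_lines` is a hypothetical name, and the source does not specify which anchoring strategy SSA uses.

```python
from typing import Tuple


def replace_whole_lines(content: str, old: str, new: str) -> Tuple[bool, str]:
    """Replace `old` with `new` only if `old` matches complete lines exactly once.

    Returns (True, new_content) on success or (False, feedback) on rejection.
    """
    lines = content.split("\n")
    old_lines = old.split("\n")
    n = len(old_lines)
    # Candidate positions where every line of `old` equals a full line of the file.
    starts = [i for i in range(len(lines) - n + 1) if lines[i:i + n] == old_lines]
    if len(starts) != 1:
        return False, (
            f"Rejected: expected exactly one whole-line match, found {len(starts)}. "
            "Provide complete lines with enough context to be unique."
        )
    i = starts[0]
    lines[i:i + n] = new.split("\n")
    return True, "\n".join(lines)
```

With this rule, a fragment such as `count` is rejected even though it appears in the file, because it is not a complete line; the model has to resend the full line it means to change.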

Overly broad search-and-replace edit (left) and an edit made by a harness that knows to avoid partial-line replacements (right).

Third, even when an edit is applied successfully, simply returning “edit succeeded” leaves the model underinformed about what the harness changed. This weakens the reverse side of the interaction loop: not only should the model express intent clearly, but it should also be able to verify how that intent was interpreted.

To close this loop, we found it useful, after every successful edit, to supply the model with a diff file — a text file indicating what additions and deletions had been made and what text stayed the same. A diff serves as an immediate confirmation channel: the model can inspect whether the replacement landed in the correct location, whether collateral lines changed, and whether follow-up edits are needed. This seemingly minor feedback mechanism improves reliability because it converts editing from a fire-and-forget action into an observable state transition.
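
As an illustration, diff feedback of this kind can be produced with Python's standard `difflib` module. The function name `edit_feedback`, the message wording, and the `a/` and `b/` path prefixes are assumptions for the sketch, not the SSA format.

```python
import difflib


def edit_feedback(before: str, after: str, path: str = "file.py", context: int = 3) -> str:
    """Return a success message followed by a unified diff of the edit."""
    diff = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
        n=context,  # number of unchanged context lines around each change
    )
    return "Edit applied. Resulting diff:\n" + "".join(diff)


# Example: the model changed one line; the diff shows that line and its neighbors.
print(edit_feedback("a = 1\nb = 2\n", "a = 1\nb = 3\n"))
```

Limiting the context lines keeps the confirmation short, so the feedback confirms where the edit landed without flooding the context window.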

A vanilla successful-edit notification (top right) and one accompanied by a diff file (bottom right).

A natural question arises: if the diff is provided after a successful edit, why do the first two failure modes require special handling? While the diff does expose unintended changes, it does so after the mistake has already been applied. At that point, the model must decide whether to roll back, repair the unintended edits, or continue execution with a potentially corrupted state. This introduces additional branching in the agent’s trajectory and forces it to spend tokens and reasoning effort correcting avoidable errors, rather than progressing toward the solution.

In other words, every correction step injects additional information into the model’s context window. Note that every piece of information competes for the agent’s attention for next-action generation. Unrelated or unintended edits do not just waste tokens; they actively degrade performance by introducing spurious patterns and relationships, increasing the likelihood that the model forms incorrect associations and drifts away from the original goal.

In contrast, addressing ambiguity and weak anchoring before execution ensures that edits are applied correctly in the first place. This reduces unnecessary exploration, prevents cascading errors, and keeps the context focused on task-relevant signals. In effect, guarding against the first two failure modes improves correctness at the point of action, while diff feedback improves observability after action. Both are necessary, but they operate at fundamentally different stages of the interaction loop.

Reasoning

A less obvious but equally important design consideration is how agents balance internal reasoning with external interactions. Chain-of-thought reasoning is clearly valuable. It allows the model to decompose a problem, plan next steps, and decide which tool to invoke. Without sufficient reasoning, tool usage becomes reactive, leading to shallow exploration, redundant calls, or poor sequencing of actions.

However, excessive thinking introduces its own failure mode. When the model spends too long reasoning internally, it begins to form assumptions about the environment rather than verifying them. These assumptions may appear coherent within the model’s internal state, but they are often misaligned with the actual system state. As a result, the agent may issue poorly grounded tool calls or skip necessary validation steps altogether. This creates a fundamental tension: too little reasoning makes tool use reactive, while too much reasoning lets unverified assumptions accumulate.

Effective agents must continuously reconcile these two demands, and we refer to this balance as tool calling with a reasoning nudge. The idea is to encourage the model to perform just enough reasoning to decide the next action and then prioritize evidence-gathering interactions with the environment over further reasoning. Rather than extending internal chains of thought, the agent is nudged toward validating its hypotheses through tool outputs.

An effective agent must balance the competing demands of thinking (left) and acting (right). The harness should nudge the model toward validating its hypotheses through tool outputs (center).

In practice, we did not find a single “golden prompt” that reliably balances reasoning and tool interaction across all model families. For the Claude variants, we found that introducing quantitative guidance — e.g., “make 50+ tool calls” or “ideal tool call count is 100” — helps break long reasoning chains and pushes the model toward interacting with the environment. While the exact number of target tool calls is not important, it serves as a useful north star that biases the model toward action.

However, in our experiments, this strong nudge was ineffective for other families, such as Gemini and Grok, which often interpret such instructions literally and make empty tool calls in order to meet the target. Such behavior reduces agent quality. For these families, we find that a flexible nudge like “You should use tools as much as possible” works well. The principle remains the same: we need to nudge the model to proactively use tools along with the right amount of reasoning.
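
A per-family nudge can be kept as a small lookup next to an otherwise shared prompt. The sketch below uses the two nudge phrasings quoted above; the dictionary, the function name `build_system_prompt`, and the choice to fall back to the flexible nudge for families not listed are assumptions, not SSA's actual configuration.

```python
# Nudge wording quoted from the text; the mapping itself is illustrative.
QUANTITATIVE_NUDGE = "Make 50+ tool calls."
FLEXIBLE_NUDGE = "You should use tools as much as possible."

NUDGE_BY_FAMILY = {
    "claude": QUANTITATIVE_NUDGE,  # numeric targets helped break long reasoning chains
    "gemini": FLEXIBLE_NUDGE,      # numeric targets were taken literally
    "grok": FLEXIBLE_NUDGE,
}


def build_system_prompt(base_prompt: str, family: str) -> str:
    """Append the family-specific reasoning nudge to a shared base prompt."""
    # Assumption: unknown families get the flexible nudge as the safer default.
    nudge = NUDGE_BY_FAMILY.get(family.lower(), FLEXIBLE_NUDGE)
    return f"{base_prompt.rstrip()}\n\n{nudge}"
```

Keeping the nudge separate from the base prompt makes the model-specific difference explicit and small, which matches the goal of minimal adaptations on top of a shared harness.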

Tool use preferences

Across agents, tools function in exactly the same way, but models tend to exhibit distinct preferences in how they invoke them. For example, GPT models prefer to update code by using an apply_patch command to splice in text from a separate file, formatted in a particular way; denying them their formatting preferences hurts performance.

Similarly, for Grok-4.20, a single monolithic tool for editing and viewing creates confusion, which leads to incorrect tool calls. Splitting functionality into atomic operations yields better results, even when the functionality remains unchanged. Additionally, viewing line numbers in a file helps most models, but Grok’s tokenizer and attention mechanism appeared less robust at separating line-number prefixes from file content, so disabling line numbers improves its use of the view tool. These preferences are a by-product of training.
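
The line-number preference can be exposed as a switch on the view tool, as in the hypothetical sketch below. The function name `view_file` and the `N | text` prefix format are assumptions made for illustration.

```python
def view_file(text: str, show_line_numbers: bool = True) -> str:
    """Render file content for the model, optionally prefixing line numbers."""
    lines = text.splitlines()
    if not show_line_numbers:
        return "\n".join(lines)
    width = len(str(len(lines)))  # pad so that the separators line up
    return "\n".join(f"{i:>{width}} | {line}" for i, line in enumerate(lines, 1))
```

The tool's behavior is otherwise identical for every model; only the presentation changes, which is the kind of small, orthogonal adaptation the text argues for.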

This reinforces a broader design principle: agent performance is a function of not only what tools are available but how naturally those tools align with the model’s learned behaviors. A well-designed harness meets the model where it is, adapting interfaces, feedback, and interaction patterns to its strengths while still enforcing the invariants needed for reliable execution.

Benchmarking study

SSA is a simple harness that implements many of the principles we describe above. We evaluated it on three agentic benchmarks — SWE-Bench-Verified (n = 500), SWE-Bench-Pro (public set, n = 731) and Terminal-Bench-2 (n = 89). Each example in SWE-Bench-Verified and SWE-Bench-Pro is an open-source code repository and an “issue” to be fixed by making a code change. Terminal-Bench-2 tackles a range of programming tasks (software engineering, machine learning, security, etc.) but is not tied to a code repository.

All three benchmarks have individual, static, prewritten tests for evaluating generated code. In SWE-Bench-Verified and SWE-Bench-Pro, the runs and evaluations occur in separate container images, meaning changes must be transferred into a different evaluation environment; in Terminal-Bench-2, the evaluation happens in the same container. Therefore, in the SWE benchmarks, it may be necessary to exclude irrelevant artifacts so as not to bloat the diff patch. Additionally, Terminal-Bench-2 imposes computational and agent-runtime limits that the SWE benchmarks do not. We evaluate our SSA agents using metrics standard in the field.

Results on SWE-Bench-Pro. Each model is run five times on the full benchmark (731 instances). The solid bar represents the percentage of code samples that, on average, pass the benchmark tests in a single attempt (pass@1). Whiskers are the 95% confidence intervals calculated over a total of 3,655 trials. All available official model release numbers are either within or below SSA’s confidence intervals, except for one model (GPT 5.2 Codex).

Results on SWE-Bench-Verified. Each model is run five times on the full benchmark (500 instances). The solid bar represents average pass@1 across runs, and whiskers are the 95% confidence intervals calculated over a total of 2,500 trials. All available official model release numbers are within SSA’s confidence intervals. SSA consistently outperforms mini-swe-agent, a popular open-source harness for agentic SWE tasks.

Results on Terminal-Bench-2. Each model is run five times on the full benchmark (89 instances). The solid bar represents average pass@1 across runs, and whiskers are the 95% confidence intervals calculated over a total of 445 trials. All available official model release numbers are either within or below SSA’s confidence intervals. SSA consistently outperforms Terminus-2, the default agent in Harbor.
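
For readers who want to reproduce intervals of this kind, the sketch below pools all trials (five runs times the number of instances) and computes a 95% interval with the normal approximation to the binomial. The captions do not state which interval method was used, so the method, the function name `pass_at_1_with_ci`, and the example counts are assumptions.

```python
import math
from typing import Tuple


def pass_at_1_with_ci(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float, float]:
    """Return (pass@1, lower, upper) using a normal-approximation binomial interval."""
    p = successes / trials
    half_width = z * math.sqrt(p * (1 - p) / trials)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical counts: 5 runs x 731 instances = 3,655 pooled trials.
trials = 5 * 731
rate, low, high = pass_at_1_with_ci(successes=2000, trials=trials)
print(f"pass@1 = {rate:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Because the interval width shrinks with the square root of the trial count, the 445-trial Terminal-Bench-2 intervals are necessarily much wider than the 3,655-trial SWE-Bench-Pro ones at the same pass rate.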

Note that the mini-swe-agent results reported above in the SWE-Bench-Verified graph and the Terminus-2 results reported in the Terminal-Bench-2 graph correspond to a fixed agent configuration per benchmark: the exact same prompts, tool specifications, and structural output instructions. As we discuss above, however, different model families require different reasoning nudges and exhibit distinct preferences for tool use. As a result, while SSA’s core harness remains identical, there are minimal but nonzero differences in prompts and tool specifications across model families (e.g., Claude, Gemini, GPT, Grok).

Our goal in building SSA was not to optimize separate agents per model but to identify minimal, orthogonal adaptations that allow different model families to express their strongest capabilities within a shared harness framework.

Terminal-Bench-2

Unlike SWE-Bench-Verified and SWE-Bench-Pro, the Terminal-Bench-2 dataset restricts the agent’s environment by limiting computational capacity (memory, storage, number of CPUs) and time (both agent and verifier run times) per project. While this is effective in limiting disproportionate use of computational resources to boost benchmark scores, it does have the unintended side effect of making the benchmark more sensitive to infrastructure choices.

We observed that, given those restrictions, the following system characteristics have the most impact:

  1. Reliability of the inference backend. The inference backend’s capacity (tokens per minute and requests per minute) should be able to support all concurrently run projects for the full duration of the evaluation. High variance in invoker latency, frequent API timeouts, and retries eat into the allowed time budget, leading to more timeouts and a lower resolution rate.
  2. The number of concurrent projects run on a single node. This affects the network bandwidth available to each project. One of the first steps for an agent in Terminal-Bench-2 is to install dependencies (pip and popular libraries such as torch, transformers, etc.). If the evaluation infrastructure is set up in such a way that multiple projects are run on a single node (e.g., Harbor with n_concurrent > 1), the node’s network bandwidth is shared across all the concurrent projects. This increases the download times for dependencies, leaving the agent with less time for problem solving and a higher risk of getting interrupted before it’s done.

Since the majority of tool calls involve command-line instructions, a natural way to address timeouts is to introduce a batch interface, allowing the agent to execute multiple commands in a single turn, rather than executing them sequentially. In our experiments, however, the results of this approach were mixed and correspond to one of the failure modes we describe above — the balance between reasoning and tool interaction.

While batching reduces interaction overhead, it also requires the model to maintain a coherent terminal state across multiple steps, which increases reasoning complexity. For Claude models, the time taken by additional autoregressive reasoning tends to offset the gains from batching. In contrast, for other model families (such as Gemini and Grok), batch execution was beneficial, as it did not trigger additional reasoning. Overall, under constrained settings, batching commands does not consistently improve performance across all models.
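
A batch interface of the kind discussed above could look like the following sketch, which runs several commands in one shell process so that state such as variables and the working directory carries across them. The function name `run_batch`, the use of `sh` with `set -e` to stop at the first failure, and the output format are assumptions; the source does not describe SSA's batch tool in this detail.

```python
import subprocess
from typing import List


def run_batch(commands: List[str], timeout_s: float = 60.0) -> str:
    """Run several commands in a single shell invocation and report one result.

    `set -e` aborts the batch at the first failing command, so later commands
    never run against a state the model did not anticipate.
    A timeout raises subprocess.TimeoutExpired, which the caller must handle.
    """
    script = "set -e\n" + "\n".join(commands)
    proc = subprocess.run(
        ["sh", "-c", script],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return f"exit_code={proc.returncode}\n{proc.stdout}{proc.stderr}"
```

Even with such an interface, the model still has to predict the terminal state after each command in the batch, which is the extra reasoning cost noted above.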

Given that evaluations are sensitive to such confounding factors, we next assess the upper-bound potential of the agent-model combination by relaxing resource and time constraints. Specifically, we compare SSA’s performance on Terminal-Bench-2 under constrained settings (as shown above) and unconstrained settings, where memory limits and agent timeouts are removed. The unconstrained setup serves as an estimate of the achievable performance ceiling.

Constrained vs. unconstrained evaluation of Terminal-Bench-2.

The gap in accuracy between the constrained and unconstrained evaluations is typically 5-10%. We note that in our experiments, out of the 89 total projects in Terminal-Bench-2, a few consistently have a high timeout rate in the constrained evaluation but a high solve rate in the unconstrained setting. Those projects are make-doom-for-mips, torch-pipeline-parallelism, gpt2-codegolf, caffe-cifar-10, and train-fasttext.

Experimental methodology

We evaluate SSA across multiple agent benchmarks under a controlled and reproducible setup. All experiments were conducted on an AWS PCS cluster using c7.48xlarge instances, with maximum concurrency set to 10 to balance throughput and system stability. For model access, Claude models were served via Amazon Bedrock (production capacity), while OpenAI, Gemini, and Grok models were accessed through their respective commercial APIs.

We enforced strict evaluation hygiene. Internet access was disabled for SWE-Bench-Verified and SWE-Bench-Pro runs, while it was enabled for Terminal-Bench-2 due to its benchmark design. For SWE-Bench-Verified and SWE-Bench-Pro, we used the standard benchmarking Docker environments, which include repository state up to the point of the current code revision. This allows agents access to the relevant history of the codebase while ensuring no access to future revisions.

Evaluation-specific issues

In SWE-Bench-Verified, instances such as astropy-8872 and astropy-8707 fail even with flawless code patches due to setup inconsistencies and require fixes in the evaluation environment. Additionally, some psf_requests instances can fail intermittently due to external test dependencies (e.g., nonresponsive URLs), requiring manual patching for reliable evaluation.

For SWE-Bench-Pro, evaluations were executed on Amazon ECS. Due to environment-specific assumptions, a small subset of tests (3 out of 731 instances) consistently fails when run on AWS infrastructure, resulting in a ceiling loss of approximately 0.41% across all SSA evaluations. Finally, to minimize information leakage during agent runs in Terminal-Bench-2, hidden tests are introduced into the Docker environment only after the agent has completed its execution, ensuring that the agent has no direct access to them during problem solving. Note that internet access in Terminal-Bench-2 does introduce a possibility of solution leakage, but a manual review of trajectories didn’t reveal any instances of the model trying to copy solutions.

Model configs

To ensure reproducibility, we used publicly documented configurations from release/model cards wherever available. Specifically, Claude Opus 4.6 and Claude Sonnet 4.6 were used with adaptive thinking and max effort across all benchmarks (except when Sonnet 4.6 was tested on Terminal-Bench-2 with thinking disabled). Opus 4.5 used high effort and no thinking across all benchmark runs (except in Terminal-Bench-2, where Opus 4.5 had thinking enabled with a 128k-token budget). Sonnet 4.5 was used with an interleaved-thinking budget of 200k, Haiku 4.5 with a 128k budget, and Sonnet 4.0 with a 200k budget across all runs. Both Gemini 3.0 Flash and Gemini 3.1 Pro used thinking_level high and temperature 1.0 across all runs. Every GPT model used reasoning effort xhigh for all benchmarking runs. With Grok, we used the grok-4.20 reasoning variant for all runs with default configs.

Detailed config files for every experiment are included in the SSA package.

Conclusion

We show that bridging the intent-execution gap in agent harnesses is critical to extracting state-of-the-art performance from frontier models. Well-chosen editing tools, feedback from tool application, and management of tool-output lengths improve performance across all model families. On the other hand, models exhibit distinct preferences for different tool interfaces, and an effective harness should leverage them instead of imposing the same interfaces uniformly across all model families. For easy reproducibility, we open-source all elements of our harness (the agent logic, tools, prompts, and model configs) in the SSA package.

Acknowledgments: Luke Huan and Anoop Deoras

Related content

US, CA, Palo Alto
Amazon Advertising is one of Amazon's fastest growing and most profitable businesses. Amazon's advertising portfolio helps merchants, retail vendors, and brand owners succeed via native advertising, which grows incremental sales of their products sold through Amazon. The primary goals are to help shoppers discover new products they love, be the most efficient way for advertisers to meet their business objectives, and build a sustainable business that continuously innovates on behalf of customers. Our products and solutions are strategically important to enable our Retail and Marketplace businesses to drive long-term growth. We deliver billions of ad impressions and millions of clicks and break fresh ground in product and technical innovations every day! Amazon continues to develop its advertising program. Ads run in our Stores (including Consumer Stores, Books, Amazon Business, Whole Foods Market, and Fresh) and Media and Entertainment publishers (including Fire TV, Fire Tablets, Kindle, Alexa, Twitch, Prime Video, Freevee, Amazon Music, MiniTV, Audible, IMDb, and others). In addition to these first-party (1P) publishers, we also deliver ads on third-party (3P) publishers. We have a number of ad products, including Sponsored Products and Sponsored Brands, display and video products for smaller brands, including Sponsored Display and Sponsored TV. We also operate ad tech products, including Amazon Marketing Cloud (a clean-room for advertisers), Amazon Publisher Cloud (a clean-room for publishers), and Amazon DSP (an enterprise-level buying tool that brings together our ad tech for buying video, audio, and display ads). Key job responsibilities This role is focused on diving deep into Amazon Ads data, especially full funnel ads campaigns, a new AI-driven workflow provided to advertisers. Rolling out this workflow at scale is critical for Amazon in 2026.
US, NY, New York
We are seeking a Robotics/AI Motor Control Scientist to develop cutting-edge machine learning algorithms for motor control systems in robots. In this role, you will focus on creating and optimizing intelligent motor control strategies to enable robots to perform complex, whole-body tasks. Your contributions will be essential in advancing robotics by enabling fluid, reliable, and safe interactions between robots and their environments. Key job responsibilities - Develop controllers that leverage reinforcement learning, imitation learning, or other advanced AI techniques to achieve natural, robust, and adaptive motor behaviors - Collaborate with multi-disciplinary teams to integrate motor control systems with robotic hardware, ensuring alignment with real-world constraints such as actuator dynamics and energy efficiency - Use simulation and real-world testing to refine and validate control algorithms - Stay updated on advancements in robotics, AI, and control systems to apply advanced techniques to robotic motion challenges - Lead technical projects from conception through production deployment - Mentor junior scientists and engineers - Bridge research initiatives with practical engineering implementation About the team Fauna Robotics, an Amazon company, is building capable, safe, and genuinely delightful robots for everyday life. Our goal is simple: make robots people actually want to live and interact with in everyday human spaces. We believe that future won’t arrive until building for robotics becomes far more accessible. Today, too much effort is spent reinventing the fundamentals. We’re changing that by developing tightly integrated hardware and software systems that make it faster, safer, and more intuitive to create real-world robotic products. Our work spans the full stack: mechanical design, control systems, dynamic modeling, and intelligent software. The focus is not just functionality, but experience. 
We’re building robots that feel responsive, expressive, and genuinely useful. At Fauna, you’ll work at the frontier of this space, helping define how robots move, manipulate, and interact with people in natural environments. It’s an opportunity to solve hard problems across hardware and software with a team focused on making robotics accessible and joyful to build. If you care about making robotics real for everyone and building systems that are as delightful as they are capable, we’re interested in hearing from you. an opportunity to solve hard problems across hardware and software with a team focused on making robotics accessible and joyful to build. If you care about making robotics real for everyone and building systems that are as delightful as they are capable, we’re interested in hearing from you.