SABER: Small actions, big errors — Safeguarding mutating steps in LLM agents
2025
Despite rapid progress in LLM agents, performance on long-horizon, tool-using tasks remains fragile. To better understand this fragility, we ask a simple question: do all actions contribute equally to failure? Analyzing execution traces on τ-Bench (Airline/Retail) and SWE-Bench Verified, we decompose trajectories into mutating (environment-changing) and non-mutating steps and formalize decisive deviations: the earliest action-level divergences that flip success to failure. A logistic regression reveals that each additional deviation in a mutating action reduces the odds of success by up to 92% on Airline and up to 96% on Retail for SoTA models. In contrast, deviations in non-mutating actions have little to no effect. Errors also grow with context length as agents drift from their role and act on stale constraints. Motivated by these observations, we introduce SABER, a model-agnostic, gradient-free, test-time safeguard that (i) adds mutation-gated verification, (ii) injects Targeted Reflection before mutating steps, and (iii) performs block-based context cleaning. SABER delivers consistent gains: for example, Qwen3-Thinking improves by +28% relative on Airline, +11% on Retail, and +7% on SWE-Bench Verified, while Claude improves by +9%/+7%. We further identify ceiling effects in τ-Bench, where annotation errors and underspecified tasks artificially cap model performance. To address this, we release τ-Bench Verified, which restores benchmark headroom through targeted revisions. Our results argue for action-level analysis, targeted safeguards, and reliable evaluations as prerequisites for robust multi-turn agents.
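For intuition on the reported effect sizes, the finding can be read through a standard logistic model; the form below is only a sketch under the assumption that success is regressed on the counts of mutating and non-mutating deviations (notation ours, not necessarily the paper's exact specification):

\log \frac{p(\text{success})}{1 - p(\text{success})} = \beta_0 + \beta_{\text{mut}}\, n_{\text{mut}} + \beta_{\text{non}}\, n_{\text{non}}

Each additional mutating deviation multiplies the odds of success by e^{\beta_{\text{mut}}}, so a 92% reduction in odds corresponds to e^{\beta_{\text{mut}}} \approx 0.08, i.e. \beta_{\text{mut}} \approx \ln 0.08 \approx -2.5, while \beta_{\text{non}} \approx 0 is consistent with non-mutating deviations having little to no effect.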
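The safeguard described in the abstract gates only environment-changing tool calls. The following is a minimal sketch of that control flow; the Agent/Task interfaces, the is_mutating classifier, the tool names, and helpers such as clean_context, reflect, revise, and verify are hypothetical placeholders for illustration, not the paper's actual API.

```python
# Minimal sketch of a mutation-gated, test-time safeguard loop.
# All interfaces below (agent, task, tool names) are hypothetical placeholders.

MUTATING_TOOLS = {"update_reservation", "process_refund", "apply_patch"}  # assumed example names


def is_mutating(action) -> bool:
    """Classify an action as environment-changing (mutating) vs. read-only."""
    return action.tool_name in MUTATING_TOOLS


def run_with_safeguard(agent, task, max_steps=50, block_size=10):
    context = [task.instruction]
    for step in range(max_steps):
        # (iii) Block-based context cleaning: periodically prune stale blocks
        # so the agent does not act on outdated constraints.
        if step > 0 and step % block_size == 0:
            context = agent.clean_context(context)

        action = agent.propose_action(context)

        if is_mutating(action):
            # (ii) Targeted Reflection: re-check the action against the task
            # constraints before it changes the environment.
            reflection = agent.reflect(context, action)
            action = agent.revise(action, reflection)

            # (i) Mutation-gated verification: execute only if the verifier
            # accepts the (possibly revised) action; otherwise retry with feedback.
            if not agent.verify(context, action):
                context.append(("verifier_feedback", action))
                continue

        observation = task.environment.execute(action)
        context.append((action, observation))
        if task.is_done(observation):
            break
    return context
```

Non-mutating (read-only) steps pass through unmodified, which keeps the added verification cost proportional to the number of environment-changing actions rather than the full trajectory length.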
Research areas