SABER: Small actions, big errors — Safeguarding mutating steps in LLM agents
2025
Despite rapid progress in LLM agents, performance on long-horizon, tool-using tasks remains fragile. To better understand this fragility, we ask a simple question: do all actions contribute equally to failure? Analyzing execution traces on τ-Bench (Airline/Retail) and SWE-Bench Verified, we decompose trajectories into mutating (environment-changing) and non-mutating steps and formalize decisive deviations: the earliest action-level divergences that flip success to failure. A logistic regression reveals that each additional deviation in a mutating action reduces the odds of success by up to 92% on Airline and up to 96% on Retail for SoTA models. In contrast, deviations in non-mutating actions have little to no effect. Errors also grow with context length as agents drift from their role and act on stale constraints. Motivated by these observations, we introduce SABER, a model-agnostic, gradient-free, test-time safeguard that (i) adds mutation-gated verification, (ii) injects Targeted Reflection before mutating steps, and (iii) performs block-based context cleaning. SABER delivers consistent gains: for example, Qwen3-Thinking improves by +28% relative on Airline, +11% on Retail, and +7% on SWE-Bench Verified, while Claude improves by +9%/+7%. We further identify ceiling effects in τ-Bench, where annotation errors and underspecified tasks artificially cap model performance. To address this, we release τ-Bench Verified, which restores benchmark headroom through targeted revisions. Our results argue for action-level analysis, targeted safeguards, and reliable evaluations as prerequisites for robust multi-turn agents.
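For intuition on the reported effect sizes, the finding can be read through a standard logistic model; the form below is only a sketch under the assumption that success is regressed on the counts of mutating and non-mutating deviations (notation ours, not necessarily the paper's exact specification):

\log \frac{p(\text{success})}{1 - p(\text{success})} = \beta_0 + \beta_{\text{mut}}\, n_{\text{mut}} + \beta_{\text{non}}\, n_{\text{non}}

Each additional mutating deviation multiplies the odds of success by e^{\beta_{\text{mut}}}, so a 92% reduction in odds corresponds to e^{\beta_{\text{mut}}} \approx 0.08, i.e. \beta_{\text{mut}} \approx \ln 0.08 \approx -2.5, while \beta_{\text{non}} \approx 0 is consistent with non-mutating deviations having little to no effect.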
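The safeguard described in the abstract gates only environment-changing tool calls. The following is a minimal sketch of that control flow; the Agent/Task interfaces, the is_mutating classifier, the tool names, and helpers such as clean_context, reflect, revise, and verify are hypothetical placeholders for illustration, not the paper's actual API.

```python
# Minimal sketch of a mutation-gated, test-time safeguard loop.
# All interfaces below (agent, task, tool names) are hypothetical placeholders.

MUTATING_TOOLS = {"update_reservation", "process_refund", "apply_patch"}  # assumed example names


def is_mutating(action) -> bool:
    """Classify an action as environment-changing (mutating) vs. read-only."""
    return action.tool_name in MUTATING_TOOLS


def run_with_safeguard(agent, task, max_steps=50, block_size=10):
    context = [task.instruction]
    for step in range(max_steps):
        # (iii) Block-based context cleaning: periodically prune stale blocks
        # so the agent does not act on outdated constraints.
        if step > 0 and step % block_size == 0:
            context = agent.clean_context(context)

        action = agent.propose_action(context)

        if is_mutating(action):
            # (ii) Targeted Reflection: re-check the action against the task
            # constraints before it changes the environment.
            reflection = agent.reflect(context, action)
            action = agent.revise(action, reflection)

            # (i) Mutation-gated verification: execute only if the verifier
            # accepts the (possibly revised) action; otherwise retry with feedback.
            if not agent.verify(context, action):
                context.append(("verifier_feedback", action))
                continue

        observation = task.environment.execute(action)
        context.append((action, observation))
        if task.is_done(observation):
            break
    return context
```

Non-mutating (read-only) steps pass through unmodified, which keeps the added verification cost proportional to the number of environment-changing actions rather than the full trajectory length.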
Research areas