Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses

arXiv cs.CL / 4/29/2026

Key Points

  • The paper proposes Agentic Harness Engineering (AHE) to automate the evolution of coding-agent “harnesses,” which strongly influence how models run tasks against repositories and tools.
  • AHE adds matched observability to three stages—component editing, trajectory inspection, and decision making—by making the action space explicit (component observability), building a drill-down evidence corpus from long trajectories (experience observability), and linking each edit to a prediction later validated by task outcomes (decision observability).
  • By turning each harness edit into a falsifiable contract, AHE aims to avoid naive trial-and-error during harness optimization.
  • In the reported experiments, ten AHE iterations raise pass@1 on Terminal-Bench 2 from 69.7% to 77.0% (+7.3pp), surpassing the human-designed harness Codex-CLI (71.9%) and the self-evolving baselines ACE and TF-GRPO.
  • The evolved harness, once frozen, transfers without re-evolution: on SWE-bench-verified it reaches the top aggregate success rate with 12% fewer tokens than the seed, and on Terminal-Bench 2 it yields +5.1 to +10.1pp gains across three alternate model families, suggesting the learned components generalize beyond a single benchmark.
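
To make the "falsifiable contract" idea in the points above concrete, here is a minimal Python sketch of how an edit might be paired with a prediction and then checked against the next round's task-level outcomes. It is an illustration only, not the paper's implementation: the names `EditContract` and `verify`, the field layout, and the keep/revert rule are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class EditContract:
    """Hypothetical record pairing one harness edit with its self-declared prediction."""
    component: str                      # file-level component that was edited (assumed naming)
    description: str                    # what the edit changes
    predicted_fixed: set = field(default_factory=set)   # task IDs expected to start passing
    predicted_broken: set = field(default_factory=set)  # task IDs the edit is allowed to break


def verify(contract, before, after):
    """Check a prediction against task-level pass/fail outcomes of two rounds.

    `before` and `after` map task ID -> bool (passed). The keep rule below
    (at least one confirmed fix, no unexpected regression) is an assumption.
    """
    fixed = {t for t, ok in after.items() if ok and not before.get(t, False)}
    broken = {t for t, ok in after.items() if not ok and before.get(t, False)}
    confirmed = contract.predicted_fixed & fixed
    missed = contract.predicted_fixed - fixed
    unexpected = broken - contract.predicted_broken
    return {
        "confirmed": confirmed,
        "missed": missed,
        "unexpected_regressions": unexpected,
        "keep": bool(confirmed) and not unexpected,
    }


# Usage with made-up task IDs and outcomes.
contract = EditContract(
    component="system_prompt.md",
    description="Tell the agent to run the test suite before finishing",
    predicted_fixed={"task-1", "task-3"},
)
before = {"task-1": False, "task-2": True, "task-3": False}
after = {"task-1": True, "task-2": True, "task-3": False}
report = verify(contract, before, after)
```

In this toy run the prediction is only partly confirmed (`task-1` flips to passing, `task-3` does not), which is exactly the kind of evidence a later round could use to refine or revert the edit.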

Abstract

Harnesses have become a central determinant of coding-agent performance, shaping how models interact with repositories, tools, and execution environments. Yet automating harness engineering is hard: a heterogeneous action space, sparse and noisy evaluation signal, multi-million-token trajectories, and edits whose effect is hard to attribute to the next round's outcomes. We introduce Agentic Harness Engineering (AHE), a framework that automates harness-level evolution by instrumenting the three stages of any engineering loop (component editing, trajectory inspection, and decision making) with matched observability pillars: (1) component observability gives every editable harness component a file-level representation so the action space is explicit and revertible; (2) experience observability distills millions of raw trajectory tokens into a layered, drill-down evidence corpus that an evolving agent can actually consume; and (3) decision observability pairs every edit with a self-declared prediction, later verified against the next round's task-level outcomes. Together, these pillars turn every edit into a falsifiable contract, so harness evolution proceeds autonomously without collapsing into trial-and-error. Empirically, ten AHE iterations lift pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, surpassing the human-designed harness Codex-CLI (71.9%) and the self-evolving baselines ACE and TF-GRPO. The frozen harness transfers without re-evolution: on SWE-bench-verified it tops aggregate success at 12% fewer tokens than the seed, and on Terminal-Bench 2 it yields +5.1 to +10.1pp cross-family gains across three alternate model families, indicating the evolved components encode general engineering experience rather than benchmark-specific tuning. These results position observability-driven evolution as a practical pathway to keep coding-agent harnesses continually improving.
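
The abstract describes component observability as giving every editable harness component a file-level representation so that the action space is explicit and revertible. A rough Python sketch of that property follows. The `ComponentStore` class, its `.history` snapshot directory, and the file name `system_prompt.md` are hypothetical choices for this example; the abstract does not say how AHE actually stores or reverts components.

```python
import tempfile
from pathlib import Path


class ComponentStore:
    """Hypothetical store: one file per harness component, with snapshots for revert."""

    def __init__(self, root: Path):
        self.root = root
        self.history = root / ".history"   # assumed location for snapshots
        self.history.mkdir(parents=True, exist_ok=True)

    def list_components(self):
        # The explicit action space: every regular file in the root is editable.
        return sorted(p.name for p in self.root.iterdir() if p.is_file())

    def _snapshots(self, name):
        # Snapshots are named "<component>.<n>"; order them by their numeric suffix.
        return sorted(self.history.glob(f"{name}.*"), key=lambda p: int(p.suffix[1:]))

    def edit(self, name, new_text):
        path = self.root / name
        if path.exists():
            # Save the current version before overwriting so the edit can be undone.
            version = len(self._snapshots(name))
            (self.history / f"{name}.{version}").write_text(path.read_text())
        path.write_text(new_text)

    def revert(self, name):
        snapshots = self._snapshots(name)
        if not snapshots:
            raise ValueError(f"no snapshot to revert for {name}")
        # Move the latest snapshot back over the live file.
        snapshots[-1].replace(self.root / name)


# Usage: edit a component, then undo the edit.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "system_prompt.md").write_text("v0")
    store = ComponentStore(root)
    store.edit("system_prompt.md", "v1")
    after_edit = (root / "system_prompt.md").read_text()
    store.revert("system_prompt.md")
    after_revert = (root / "system_prompt.md").read_text()
    components = store.list_components()
```

A real system would more plausibly rely on version control for this; the snapshot directory here only keeps the sketch self-contained.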