Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses

arXiv cs.CL / 4/29/2026



Key Points

The paper proposes Agentic Harness Engineering (AHE) to automate the evolution of coding-agent “harnesses,” which strongly influence how models run tasks against repositories and tools.
AHE adds matched observability to three stages—component editing, trajectory inspection, and decision making—by making the action space explicit (component observability), building a drill-down evidence corpus from long trajectories (experience observability), and linking each edit to a prediction later validated by task outcomes (decision observability).
By turning each harness edit into a falsifiable contract, AHE aims to avoid naive trial-and-error during harness optimization.
Experiments show that ten AHE iterations raise pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, surpassing the human-designed Codex-CLI harness (71.9%) and the self-evolving baselines ACE and TF-GRPO.
The evolved harness, once frozen, transfers to other settings without re-evolution: on SWE-bench-verified it reaches the top aggregate success while using 12% fewer tokens than the seed harness, and on Terminal-Bench 2 it yields gains of +5.1 to +10.1 percentage points across three alternate model families, suggesting the learned components generalize beyond a specific benchmark.
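
The idea of treating each harness edit as a falsifiable contract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `EditContract` class, the `verify_contract` function, the task IDs, the file path, and the keep/revert rule are all hypothetical, chosen only to show how a self-declared prediction might be checked against the next round's task-level outcomes.

```python
from dataclasses import dataclass


@dataclass
class EditContract:
    """A harness edit paired with a self-declared prediction (hypothetical schema)."""
    component: str          # assumed file path of the edited harness component
    description: str        # what the edit changes
    predicted_fixes: set    # task IDs the edit is predicted to flip from fail to pass


def verify_contract(contract, before, after):
    """Check the prediction against task-level outcomes of the next round.

    `before` and `after` map task IDs to True (passed) or False (failed).
    """
    # Predicted tasks that really went from failing to passing.
    confirmed = {
        t for t in contract.predicted_fixes
        if not before.get(t, False) and after.get(t, False)
    }
    # Predicted tasks that did not flip: the prediction was falsified for these.
    missed = contract.predicted_fixes - confirmed
    # Tasks that passed before the edit and fail after it.
    regressed = {t for t, ok in before.items() if ok and not after.get(t, False)}
    # Assumed decision rule: keep the edit only if something was confirmed
    # and nothing regressed. The paper may use a different criterion.
    keep = bool(confirmed) and not regressed
    return {"confirmed": confirmed, "missed": missed,
            "regressed": regressed, "keep": keep}


contract = EditContract(
    component="prompts/system.md",
    description="Remind the agent to run the test suite before finishing",
    predicted_fixes={"task-1", "task-2"},
)
before = {"task-1": False, "task-2": False, "task-3": True}
after = {"task-1": True, "task-2": False, "task-3": True}
result = verify_contract(contract, before, after)
print(result["confirmed"], result["missed"], result["keep"])
```

In this toy run, one of the two predicted fixes is confirmed, one is missed, and nothing regresses, so the assumed rule keeps the edit while recording that the prediction was only partly right.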

Abstract

Harnesses have become a central determinant of coding-agent performance, shaping how models interact with repositories, tools, and execution environments. Yet automating harness engineering is hard: a heterogeneous action space, sparse and noisy evaluation signal, multi-million-token trajectories, and edits whose effect is hard to attribute to the next round's outcomes. We introduce Agentic Harness Engineering (AHE), a framework that automates harness-level evolution by instrumenting the three stages of any engineering loop (component editing, trajectory inspection, and decision making) with matched observability pillars: (1) component observability gives every editable harness component a file-level representation so the action space is explicit and revertible; (2) experience observability distills millions of raw trajectory tokens into a layered, drill-down evidence corpus that an evolving agent can actually consume; and (3) decision observability pairs every edit with a self-declared prediction, later verified against the next round's task-level outcomes. Together, these pillars turn every edit into a falsifiable contract, so harness evolution proceeds autonomously without collapsing into trial-and-error. Empirically, ten AHE iterations lift pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, surpassing the human-designed harness Codex-CLI (71.9%) and the self-evolving baselines ACE and TF-GRPO. The frozen harness transfers without re-evolution: on SWE-bench-verified it tops aggregate success at 12% fewer tokens than the seed, and on Terminal-Bench 2 it yields +5.1 to +10.1pp cross-family gains across three alternate model families, indicating the evolved components encode general engineering experience rather than benchmark-specific tuning. These results position observability-driven evolution as a practical pathway to keep coding-agent harnesses continually improving.
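
Component observability, as described in the abstract, gives each editable component a file-level representation so that the action space is explicit and revertible. The following Python sketch shows one way such a store could look. It is an assumption-laden illustration, not the authors' code: the `ComponentStore` class, its method names, and the example path `prompts/system.md` are all hypothetical.

```python
import tempfile
from pathlib import Path


class ComponentStore:
    """Hypothetical file-backed store of harness components with undo."""

    def __init__(self, root):
        self.root = Path(root)
        self.history = []  # stack of (relative path, previous text or None)

    def list_components(self):
        # The explicit action space: every file under the root is editable.
        return sorted(
            p.relative_to(self.root).as_posix()
            for p in self.root.rglob("*") if p.is_file()
        )

    def edit(self, rel_path, new_text):
        path = self.root / rel_path
        # Remember the previous content so the edit can be undone later.
        previous = path.read_text() if path.exists() else None
        self.history.append((rel_path, previous))
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(new_text)

    def revert_last(self):
        # Undo the most recent edit: restore old text or delete a new file.
        rel_path, previous = self.history.pop()
        path = self.root / rel_path
        if previous is None:
            path.unlink()
        else:
            path.write_text(previous)


with tempfile.TemporaryDirectory() as tmp:
    store = ComponentStore(tmp)
    store.edit("prompts/system.md", "v1: baseline instructions")
    store.edit("prompts/system.md", "v2: add a testing reminder")
    store.revert_last()  # back to v1
    restored = (Path(tmp) / "prompts/system.md").read_text()
    components = store.list_components()
    print(components, restored)
```

A real system would more plausibly rely on a version-control tool for history. The in-memory stack here only keeps the sketch self-contained.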
