A protocol for auditing AI agent harnesses

Dev.to / 5/9/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • A Tsinghua study on Natural-Language Agent Harnesses shows that adding a same-model verifier and multi-candidate sampling can significantly regress agent task success on OSWorld, with both approaches failing for a shared structural reason.
  • The core issue is “verifier endorsement”: when the doer model is confidently wrong, a verifier trained/shaped by the same priors and failure modes tends to approve incorrect outputs rather than detect them.
  • The article argues that the broader ablation patterns can be explained by a single rule: harness modules that introduce new signals improve performance, while modules that recycle the doer’s signals degrade it.
  • Two additional papers (Fudan’s edit-level audit and Stanford’s Meta-Harness) propose ways to operationalize this rule—by verifying predicted fixes/regressions at edit time and by using raw failure traces rather than compressed summaries.
  • The proposed three-layer protocol composes these ideas in dependency order: first address trace utility, then perform module-level ablation, and finally run manifest-based verification on each subsequent edit.

I have been building coding agents for the last several months, watching every component I added fail to move the resolve rate I cared about. A verifier first. Multi-candidate sampling next. A structured-output sub-agent after that. Each was justified by a specific observed failure mode, and each looked cheap at the margin. None of them helped. The Tsinghua paper Natural-Language Agent Harnesses, run on SWE-bench Verified with GPT-5.4 at high reasoning, explains the loss directly: a same-model verifier on top of a baseline coding agent regresses task success on OSWorld by 8.4 percentage points, and multi-candidate sampling regresses it by 5.6 points. Both lose for the same structural reason. The verifier and the proposer are the same model as the doer. They share its training distribution, its priors, its failure modes. When the doer is confidently wrong, the verifier endorses the wrong output with the same confidence. The check does not catch errors. It approves them.
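
To make the structure concrete, here is a minimal sketch of that harness shape, assuming a single shared model endpoint behind a hypothetical `call_model` function; nothing below comes from the paper's implementation. The verify step consults the same weights that produced the candidate, so approval carries no information the doer did not already have:

```python
# A minimal sketch of the same-model-verifier shape, not the paper's
# implementation. `call_model` is a hypothetical stand-in for whatever
# client the harness uses; the structural point is that doer and verifier
# sit behind the same weights.

def call_model(prompt: str) -> str:
    """Placeholder for the single shared model endpoint (hypothetical)."""
    raise NotImplementedError("wire up your model client here")

def solve_task(task: str) -> str:
    return call_model(f"Solve this task:\n{task}")

def verify(task: str, candidate: str) -> bool:
    # Same weights, same priors, same blind spots as solve_task. When the
    # doer is confidently wrong, this check inherits that confidence and
    # approves the wrong output instead of catching it.
    verdict = call_model(
        f"Task:\n{task}\n\nCandidate solution:\n{candidate}\n\n"
        "Answer APPROVE or REJECT."
    )
    return "APPROVE" in verdict.upper()

def harness(task: str, max_retries: int = 2) -> str:
    candidate = solve_task(task)
    for _ in range(max_retries):
        if verify(task, candidate):   # recycled signal: no new information
            return candidate
        candidate = solve_task(task)  # retried against the same blind spots
    return candidate
```

By the rule in the tl;dr below, the fix is not a better verification prompt but a different signal source: run the test suite, execute the candidate, diff observable state, anything that does not route back through `call_model`.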

The pattern generalises. Three papers from late March 2026 explain the failure mode, the rule that follows from it, and the audit you can run on a harness. Read in the right order, they form a three-layer protocol that catches the verifier failure mode first, then predicts the rest of the ablation table without reference to the numbers.

tl;dr

  • Tsinghua's NLAH ablation, controlled at the module level: verifiers regress accuracy by 8.4pp on OSWorld; multi-candidate search by 5.6pp. Both lose for the same structural reason: they recycle the doer's blind spots.
  • The whole table follows from one rule. Harness modules that introduce a new signal win; modules that recycle the doer's signal lose. The rule predicts every row without reference to the numbers.
  • Fudan's AHE turns ablation into an edit-level audit. Each edit ships a manifest of predicted fixes and predicted regressions; the next iteration verifies it against task-level deltas; misses revert in git (see the first sketch after this list). Fix-precision is 33.7% (5x random). Regression-precision is 11.8% (2x random), and that asymmetry is the methodology's open problem.
  • Stanford's Meta-Harness is upstream of both. A proposer fed raw failure traces hits 50.0% search-set accuracy; fed LLM summaries, it hits 34.9%, statistically indistinguishable from scores alone. Trace compression destroys roughly 15pp of optimisation signal.
  • The protocol composes them by dependency: L0 trace utility first, then L1 module ablation, then L2 manifest verification on every subsequent edit (second sketch below). Get L0 wrong and L1/L2 collapse on degraded ground truth.
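
First, the edit-level audit: a minimal sketch in the spirit of Fudan's AHE, not its actual implementation. The manifest schema, the `run_tasks` evaluator, and the revert policy are all assumptions here. Each harness edit declares which tasks it expects to flip; the next iteration checks those predictions against real pass/fail deltas, and a total miss reverts the commit:

```python
# Sketch of manifest-based edit verification (illustrative; names and the
# revert policy are assumptions, not AHE's actual interface).
from dataclasses import dataclass, field
import subprocess

@dataclass
class EditManifest:
    commit: str                    # git SHA of the harness edit
    predicted_fixes: set[str]      # task IDs claimed to flip fail -> pass
    predicted_regressions: set[str] = field(default_factory=set)  # pass -> fail

def run_tasks(task_ids: set[str]) -> dict[str, bool]:
    """Placeholder evaluator: run each task, return pass/fail (hypothetical)."""
    raise NotImplementedError("wire up your benchmark runner here")

def verify_manifest(m: EditManifest,
                    before: dict[str, bool]) -> tuple[bool, float, float]:
    """Check an edit's predictions against task-level deltas; revert on a miss."""
    after = run_tasks(m.predicted_fixes | m.predicted_regressions)
    fix_hits = sum(not before.get(t, False) and after[t]
                   for t in m.predicted_fixes)
    reg_hits = sum(before.get(t, False) and not after[t]
                   for t in m.predicted_regressions)
    fix_precision = fix_hits / max(1, len(m.predicted_fixes))
    reg_precision = reg_hits / max(1, len(m.predicted_regressions))
    keep = fix_hits > 0
    if not keep:
        # None of the predicted fixes materialised: the edit was noise.
        subprocess.run(["git", "revert", "--no-edit", m.commit], check=True)
    return keep, fix_precision, reg_precision
```

The asymmetry in the tl;dr above would surface here as `reg_precision` persistently lagging `fix_precision`: edits are far better at predicting what they fix than what they break.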
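
Second, the dependency order itself, sketched under assumed interfaces: a `Harness` with toggleable modules and a `score` callable, none of which come from the papers. L1's deltas are only trustworthy once L0 guarantees the optimiser sees raw traces, and L2 is the per-edit manifest check sketched above, reusing the same task-level ground truth:

```python
# Toy three-layer audit order (illustrative interfaces, not from the papers).
from contextlib import contextmanager
from typing import Callable, Iterator

class Harness:
    def __init__(self, modules: list[str], feedback_source: str = "raw_traces"):
        self.modules = modules
        self.disabled_set: set[str] = set()
        self.feedback_source = feedback_source

    @contextmanager
    def disabled(self, module: str) -> Iterator[None]:
        self.disabled_set.add(module)
        try:
            yield
        finally:
            self.disabled_set.discard(module)

def audit(harness: Harness, score: Callable[[Harness], float]) -> dict[str, float]:
    # L0: trace utility. If the optimiser is fed compressed summaries, every
    # delta below is measured against degraded ground truth.
    if harness.feedback_source != "raw_traces":
        raise ValueError("fix trace utility before ablating modules")

    # L1: module-level ablation. A positive delta means removing the module
    # helped, i.e. it was recycling the doer's signal rather than adding one.
    baseline = score(harness)
    deltas: dict[str, float] = {}
    for module in harness.modules:
        with harness.disabled(module):
            deltas[module] = score(harness) - baseline
    return deltas

# L2, manifest verification on every subsequent edit, is verify_manifest
# above; it keeps later edits from reintroducing what L1 removed.
```

On the Tsinghua numbers, a same-model verifier module would show up in this audit as roughly +8.4pp when disabled.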
