AsymmetryZero: A Framework for Operationalizing Human Expert Preferences as Semantic Evals

arXiv cs.AI / 5/7/2026

Key Points

  • The paper introduces AsymmetryZero, a framework that turns human expert preferences into “semantic evals” by defining an explicit, stable evaluation contract for each task.
  • AsymmetryZero makes grading criteria and aggregation logic explicit, improving faithful encoding of subjective, procedural, domain-specific requirements that are hard to represent with exact-match targets or raw preference labels.
  • The framework can be executed in both Inspect (model-only evals) and Harbor (agentic evals), producing comparable scores and shared audit artifacts across settings.
  • Using Harbor with fixed task contracts, the study compares five-model “frontier” juries vs five-model “compact” juries across four frontier-class solvers and finds high but not perfect criterion-level agreement (about 75.9%–89.6%).
  • Compact juries show higher internal dissent but substantially lower judging cost and latency (per-criterion cost roughly 4.2%–5.6% of the frontier jury's; latency roughly 21.7%–27.1% of the frontier jury's), while aggregated task-level outcomes often remain comparatively stable.
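
To make the idea of an explicit evaluation contract concrete, here is a minimal Python sketch of a contract that states what is graded, how each criterion is judged, and how criterion-level decisions are aggregated. The summary does not show AsymmetryZero's actual schema, so every name here (`Criterion`, `EvalContract`, `required`, `pass_threshold`) is a hypothetical stand-in, and the string checks stand in for what would presumably be model-based judges.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Criterion:
    """One explicitly stated grading criterion (hypothetical schema)."""
    name: str
    description: str
    judge: Callable[[str], bool]  # how this criterion is judged
    required: bool = True         # hypothetical: failing it fails the task


@dataclass(frozen=True)
class EvalContract:
    """What is graded, how each criterion is judged, how results aggregate."""
    task_id: str
    criteria: List[Criterion]
    pass_threshold: float = 1.0   # hypothetical: fraction of criteria to pass

    def __post_init__(self) -> None:
        if not self.criteria:
            raise ValueError("a contract needs at least one criterion")

    def grade(self, output: str) -> Dict[str, object]:
        # Criterion-level decisions are kept so they can serve as an audit record.
        decisions = {c.name: c.judge(output) for c in self.criteria}
        required_ok = all(decisions[c.name] for c in self.criteria if c.required)
        score = sum(decisions.values()) / len(decisions)
        return {
            "task_id": self.task_id,
            "decisions": decisions,
            "score": score,
            "passed": required_ok and score >= self.pass_threshold,
        }


# Toy contract: simple string checks stand in for real judges.
contract = EvalContract(
    task_id="demo-summary",
    criteria=[
        Criterion("cites_source", "Mentions the source document",
                  lambda out: "source:" in out.lower()),
        Criterion("under_limit", "Stays under 50 words",
                  lambda out: len(out.split()) <= 50, required=False),
    ],
    pass_threshold=0.5,
)

good = contract.grade("Source: report A. The findings are summarized briefly.")
bad = contract.grade("The findings are summarized briefly.")
```

Because the contract is a plain, immutable data object, the same definition could in principle be handed to different runners, which is the property the framework relies on when it executes one contract in both Inspect and Harbor.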

Abstract

Much of the focus in RL today is on evaluation design: building meaningful evals that serve simultaneously as benchmarks and as well-defined reward signals for post-training. Yet, many real-world tasks are governed by subjective, procedural, and domain-specific requirements that are difficult to encode as exact-match targets or open-ended preference judgments frequently used in RL pipelines today. In this work, we present AsymmetryZero, a framework for operationalizing human expert preferences as semantic evals. AsymmetryZero represents each task as a stable evaluation contract that makes grading criteria explicit: what is being graded, how each criterion is judged, and how criterion-level decisions are aggregated into a task outcome. The same contract can be executed using Inspect for model-only evaluations, as well as the Harbor Framework for agentic evaluations, enabling comparable scores and shared audit artifacts across both settings. We argue that the central challenge in post-training today is the faithful encoding of expert requirements into the evaluation itself. To that end, we present a study using Harbor that holds task contracts fixed and compares a five-model frontier jury against a five-model compact jury across four frontier-class solvers (Claude Opus 4.6, GPT-5.4, Grok-4.20, Gemini-3.1-Pro). We find that criterion-level frontier-vs-compact agreement ranges from 75.9% to 89.6% (strict common-subset agreement: 77.8% to 92.1%), while compact juries exhibit substantially higher internal dissent (3–2 split rate 28.7%–32.4%) than frontier juries (6.1%–11.5%). Verifier traces further show that compact juries reduce per-criterion judging cost to roughly 4.2%–5.6% of frontier and latency to roughly 21.7%–27.1%, even as aggregated task-level outcomes often remain comparatively stable.
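
The jury statistics quoted in the abstract can be illustrated with a small Python sketch. It assumes each juror casts a binary pass/fail vote per criterion, that a jury's verdict is the simple majority, and that agreement is the share of shared criteria on which the two majority verdicts match. The abstract does not spell out these definitions (nor what "strict common-subset agreement" means), so this is one plausible reading rather than the paper's method, and the votes below are invented toy data.

```python
from typing import Dict, List

# criterion id -> one pass/fail vote per juror (hypothetical layout)
Votes = Dict[str, List[bool]]


def majority(votes: List[bool]) -> bool:
    """Majority verdict of an odd-sized jury."""
    if len(votes) % 2 == 0:
        raise ValueError("jury size must be odd to avoid ties")
    return sum(votes) > len(votes) // 2


def split_rate(jury: Votes) -> float:
    """Fraction of criteria decided by a one-vote margin (3-2 for five jurors)."""
    if not jury:
        raise ValueError("no criteria to score")
    narrow = sum(1 for v in jury.values() if abs(2 * sum(v) - len(v)) == 1)
    return narrow / len(jury)


def agreement(jury_a: Votes, jury_b: Votes) -> float:
    """Fraction of shared criteria on which both majority verdicts match."""
    shared = jury_a.keys() & jury_b.keys()
    if not shared:
        raise ValueError("the juries share no criteria")
    same = sum(1 for c in shared if majority(jury_a[c]) == majority(jury_b[c]))
    return same / len(shared)


T, F = True, False
# Toy votes, not data from the paper.
frontier: Votes = {
    "c1": [T, T, T, T, T],
    "c2": [T, T, T, T, F],
    "c3": [F, F, F, F, F],
    "c4": [T, T, T, F, F],
}
compact: Votes = {
    "c1": [T, T, T, F, F],
    "c2": [T, T, F, F, F],
    "c3": [F, F, F, T, T],
    "c4": [T, T, T, T, F],
}

print(agreement(frontier, compact))  # 0.75
print(split_rate(frontier))          # 0.25
print(split_rate(compact))           # 0.75
```

In this toy data the compact jury splits 3–2 on three of four criteria yet disagrees with the frontier jury's verdict on only one, which mirrors the qualitative pattern the abstract reports: higher internal dissent does not necessarily translate into proportionally different outcomes.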