AsymmetryZero: A Framework for Operationalizing Human Expert Preferences as Semantic Evals

arXiv cs.AI / 5/7/2026


Key Points

The paper introduces AsymmetryZero, a framework that turns human expert preferences into “semantic evals” by defining an explicit, stable evaluation contract for each task.
AsymmetryZero makes grading criteria and aggregation logic explicit, so that subjective, procedural, and domain-specific requirements, which are hard to represent with exact-match targets or raw preference labels, can be encoded faithfully.
The framework can be executed in both Inspect (model-only evals) and Harbor (agentic evals), producing comparable scores and shared audit artifacts across settings.
Using Harbor with fixed task contracts, the study compares five-model “frontier” juries vs five-model “compact” juries across four frontier-class solvers and finds high but not perfect criterion-level agreement (about 75.9%–89.6%).
Compact juries show higher internal dissent but substantially lower per-criterion judging cost and latency (cost roughly 4.2%–5.6% of the frontier jury's; latency roughly 21.7%–27.1%), while aggregated task-level outcomes often remain comparatively stable.
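
To make the idea of an explicit evaluation contract concrete, here is a minimal Python sketch. It is not the paper's schema or API: the names `Criterion`, `EvalContract`, `optional_pass_fraction`, and the aggregation rule itself (all required criteria must pass, plus a minimum fraction of optional ones) are assumptions chosen for illustration, as is the example task.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Criterion:
    """One explicitly stated grading criterion (hypothetical schema)."""
    name: str              # what is being graded
    instruction: str       # how a judge should decide pass/fail
    required: bool = True  # assumed flag: failing a required criterion fails the task


@dataclass(frozen=True)
class EvalContract:
    """A fixed task contract: the criteria plus an explicit aggregation rule."""
    task_id: str
    criteria: Tuple[Criterion, ...]
    optional_pass_fraction: float = 0.5  # assumed threshold for optional criteria

    def aggregate(self, decisions: Dict[str, bool]) -> bool:
        """Fold criterion-level pass/fail decisions into one task outcome."""
        missing = [c.name for c in self.criteria if c.name not in decisions]
        if missing:
            raise KeyError(f"no decision for criteria: {missing}")
        required_ok = all(decisions[c.name] for c in self.criteria if c.required)
        optional = [decisions[c.name] for c in self.criteria if not c.required]
        optional_ok = (not optional) or (
            sum(optional) / len(optional) >= self.optional_pass_fraction
        )
        return required_ok and optional_ok


# Hypothetical task used only to exercise the sketch.
contract = EvalContract(
    task_id="discharge-summary-001",
    criteria=(
        Criterion("cites_source", "Every factual claim cites a source document."),
        Criterion("follows_template", "Sections appear in the mandated order."),
        Criterion("plain_language", "Avoids unexplained jargon.", required=False),
        Criterion("concise", "Stays under the length limit.", required=False),
    ),
)

passing = contract.aggregate(
    {"cites_source": True, "follows_template": True,
     "plain_language": True, "concise": False}
)
failing = contract.aggregate(
    {"cites_source": False, "follows_template": True,
     "plain_language": True, "concise": True}
)
print(passing, failing)
```

Because the contract is immutable and the aggregation rule lives beside the criteria, the same object could in principle be handed to different runners or juries without the grading logic drifting between them.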

Abstract

Much of the focus in RL today is on evaluation design: building meaningful evals that serve simultaneously as benchmarks and as well-defined reward signals for post-training. Yet, many real-world tasks are governed by subjective, procedural, and domain-specific requirements that are difficult to encode as exact-match targets or open-ended preference judgments frequently used in RL pipelines today. In this work, we present AsymmetryZero, a framework for operationalizing human expert preferences as semantic evals. AsymmetryZero represents each task as a stable evaluation contract that makes grading criteria explicit: what is being graded, how each criterion is judged, and how criterion-level decisions are aggregated into a task outcome. The same contract can be executed using Inspect for model-only evaluations, as well as the Harbor Framework for agentic evaluations, enabling comparable scores and shared audit artifacts across both settings. We argue that the central challenge in post-training today is the faithful encoding of expert requirements into the evaluation itself. To that end, we present a study using Harbor that holds task contracts fixed and compares a five-model frontier jury against a five-model compact jury across four frontier-class solvers (Claude Opus 4.6, GPT-5.4, Grok-4.20, Gemini-3.1-Pro). We find that criterion-level frontier-vs-compact agreement ranges from 75.9% to 89.6% (strict common-subset agreement: 77.8% to 92.1%), while compact juries exhibit substantially higher internal dissent (3-2 split rate 28.7% to 32.4%) than frontier juries (6.1% to 11.5%). Verifier traces further show that compact juries reduce per-criterion judging cost to roughly 4.2% to 5.6% of frontier and latency to roughly 21.7% to 27.1%, even as aggregated task-level outcomes often remain comparatively stable.
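
The jury statistics quoted in the abstract can be illustrated with a small Python sketch. This summary does not give the paper's exact definitions, so the code assumes the simplest reading: each of five judges casts a pass/fail vote per criterion, the jury decision is the majority, a "3-2 split" is a one-vote margin, and agreement is the share of criteria scored by both juries on which the two majorities match. The function names and the vote data are made up for illustration.

```python
from typing import Dict, List

# Criterion name -> one pass/fail vote per judge (hypothetical layout).
Votes = Dict[str, List[bool]]


def majority(votes: List[bool]) -> bool:
    """Majority decision of an odd-sized jury."""
    if len(votes) % 2 == 0:
        raise ValueError("jury size must be odd")
    return sum(votes) > len(votes) / 2


def is_narrow_split(votes: List[bool]) -> bool:
    """True when the majority wins by one vote, e.g. 3-2 on a five-model jury."""
    passes = sum(votes)
    return abs(passes - (len(votes) - passes)) == 1


def split_rate(jury: Votes) -> float:
    """Fraction of criteria decided by a one-vote margin."""
    return sum(is_narrow_split(v) for v in jury.values()) / len(jury)


def agreement(jury_a: Votes, jury_b: Votes) -> float:
    """Fraction of shared criteria on which both jury majorities agree."""
    shared = jury_a.keys() & jury_b.keys()
    if not shared:
        raise ValueError("juries share no criteria")
    matches = sum(majority(jury_a[c]) == majority(jury_b[c]) for c in shared)
    return matches / len(shared)


# Made-up votes for four criteria; not data from the paper.
frontier: Votes = {
    "c1": [True, True, True, True, True],
    "c2": [True, True, True, False, False],
    "c3": [False, False, False, False, True],
    "c4": [True, True, True, True, False],
}
compact: Votes = {
    "c1": [True, True, True, False, False],
    "c2": [False, False, False, True, True],
    "c3": [False, False, False, True, True],
    "c4": [True, True, True, True, True],
}

print(agreement(frontier, compact))  # 0.75
print(split_rate(frontier))          # 0.25
print(split_rate(compact))           # 0.75
```

In this toy data the compact jury splits 3-2 far more often than the frontier jury, yet the two majorities still match on three of four criteria, which mirrors the qualitative pattern the abstract reports: higher internal dissent need not translate into proportionally lower agreement.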
