Beyond Binary Correctness: Scaling Evaluation of Long-Horizon Agents on Subjective Enterprise Tasks

arXiv cs.AI / 3/25/2026


Key Points

  • The paper argues that conventional LLM evaluation (binary correctness) is inadequate for enterprise tasks that are subjective, context-dependent, and executed via long, multi-step tool workflows.
  • It introduces LH-Bench, a three-pillar evaluation framework combining expert-grounded rubrics for LLM judging, curated ground-truth artifacts that produce stepwise reward signals, and human pairwise preferences for validation (see the rubric-scoring sketch after this list).
  • The study finds that domain-authored (expert) rubrics produce more reliable evaluation signals than LLM-authored rubrics (kappa 0.60 vs. 0.46), indicating closer agreement with human standards (see the agreement sketch after this list).
  • Human preference evaluations corroborate the same ranking outcomes statistically (p < 0.05), supporting the claim that expert-grounded evaluation can scale while maintaining reliability.
  • The authors release public datasets and report results on two long-horizon environments: Figma-to-code (33 tasks using the Figma API via MCP) and Programmatic content (41 courses with 183 evaluatable chapters).
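
The first pillar pairs expert-authored rubrics with an LLM judge so that subjective work can be scored with domain context. Below is a minimal sketch of how such scoring might be wired up; the rubric criteria, the `judge_artifact` helper, and the judge model are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of pillar (i): an LLM judge scoring a subjective artifact against an
# expert-authored rubric. All names and criteria here are illustrative; the
# paper's actual rubrics, prompts, and judge model are not given in this summary.
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical expert-authored rubric for a course-chapter artifact.
EXPERT_RUBRIC = [
    {"criterion": "Learning objectives are explicit and measurable", "scale": "1-5"},
    {"criterion": "Content depth matches the stated audience", "scale": "1-5"},
    {"criterion": "Examples are accurate and domain-appropriate", "scale": "1-5"},
]

def judge_artifact(artifact_text: str) -> dict:
    """Ask an LLM judge to score one artifact against each rubric criterion."""
    prompt = (
        "You are grading enterprise work output. Score the artifact on each "
        "criterion (1-5) and return JSON mapping criterion to score.\n\n"
        f"Rubric:\n{json.dumps(EXPERT_RUBRIC, indent=2)}\n\n"
        f"Artifact:\n{artifact_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model, not the paper's choice
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```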
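The kappa figures above are inter-rater agreement statistics. Assuming the paper reports Cohen's kappa over categorical judge scores versus expert labels (the summary does not say which kappa variant is used), the comparison could be computed as follows; the score arrays are fabricated placeholders.

```python
# Sketch: comparing judge-vs-expert agreement for two rubric sources.
# Assumes the reported kappa is Cohen's kappa over categorical scores;
# the arrays below are fabricated for illustration only.
from sklearn.metrics import cohen_kappa_score

expert_labels       = [3, 4, 2, 5, 4, 3, 2, 4]  # human expert scores
judge_expert_rubric = [3, 4, 2, 5, 3, 3, 2, 4]  # LLM judge w/ expert-authored rubric
judge_llm_rubric    = [4, 4, 3, 5, 2, 3, 4, 4]  # LLM judge w/ LLM-authored rubric

print(cohen_kappa_score(expert_labels, judge_expert_rubric))  # higher agreement
print(cohen_kappa_score(expert_labels, judge_llm_rubric))     # lower agreement
```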

Abstract

Large language models excel on objectively verifiable tasks such as math and programming, where evaluation reduces to unit tests or a single correct answer. In contrast, real-world enterprise work is often subjective and context-dependent: success hinges on organizational goals, user intent, and the quality of intermediate artifacts produced across long, multi-tool workflows. We introduce LH-Bench, a three-pillar evaluation design that moves beyond binary correctness to score autonomous, long-horizon execution on subjective enterprise tasks. The pillars are: (i) expert-grounded rubrics that give LLM judges the domain context needed to score subjective work, (ii) curated ground-truth artifacts that enable stepwise reward signals (e.g., chapter-level annotation for content tasks), and (iii) pairwise human preference evaluation for convergent validation. We show that domain-authored rubrics provide substantially more reliable evaluation signals than LLM-authored rubrics (kappa = 0.60 vs. 0.46), and that human preference judgments confirm the same top-tier separation (p < 0.05), evidence that expert-grounded evaluation can scale without sacrificing reliability. We release public datasets and report results on two environments: Figma-to-code (33 real .fig tasks against the Figma API via MCP) and Programmatic content (41 courses comprising 183 individually evaluated chapters on a course platform serving 30+ daily users).
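
The abstract's second pillar turns curated ground-truth artifacts into per-step reward signals rather than a single end-of-run verdict, e.g., chapter-level annotations for the content environment. The sketch below illustrates the idea under an assumed data shape (topic-coverage per chapter); the actual annotation schema and scoring rule are not specified in this summary.

```python
# Sketch of pillar (ii): turning chapter-level ground-truth annotations into a
# stepwise reward trace for a long-horizon content-generation run. The data
# model and coverage-based scoring rule are assumptions, not the paper's.
from dataclasses import dataclass

@dataclass
class ChapterAnnotation:
    chapter_id: str
    required_topics: set[str]  # curated ground truth for this chapter

def stepwise_rewards(generated: dict[str, set[str]],
                     ground_truth: list[ChapterAnnotation]) -> list[float]:
    """One reward per chapter (step), not a single end-of-trajectory score."""
    rewards = []
    for ann in ground_truth:
        covered = generated.get(ann.chapter_id, set())
        # Fraction of curated topics the agent actually covered at this step.
        rewards.append(len(covered & ann.required_topics) / len(ann.required_topics))
    return rewards

truth = [ChapterAnnotation("ch1", {"loops", "recursion"}),
         ChapterAnnotation("ch2", {"classes", "inheritance", "dunder"})]
run = {"ch1": {"loops", "recursion"}, "ch2": {"classes"}}
print(stepwise_rewards(run, truth))  # [1.0, 0.333...] -> per-step signal
```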
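The third pillar reports that pairwise human preferences confirm the top-tier separation at p < 0.05. One standard way to obtain such a p-value from pairwise votes is an exact binomial (sign) test over non-tied comparisons, sketched here with fabricated counts; the paper's actual statistical procedure may differ.

```python
# Sketch of pillar (iii): testing whether human raters prefer system A over
# system B more often than chance. A two-sided exact binomial (sign) test is
# one standard choice; the vote counts are fabricated for illustration.
from scipy.stats import binomtest

wins_a, total = 41, 55             # A preferred in 41 of 55 non-tied pairs
result = binomtest(wins_a, total, p=0.5, alternative="two-sided")
print(f"p = {result.pvalue:.4f}")  # p < 0.05 -> preference is significant
```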