Beyond Binary Correctness: Scaling Evaluation of Long-Horizon Agents on Subjective Enterprise Tasks
arXiv cs.AI / 3/25/2026
Key Points
- The paper argues that conventional LLM evaluation (binary correctness) is inadequate for enterprise tasks that are subjective, context-dependent, and executed via long, multi-step tool workflows.
- It introduces LH-Bench, a three-part evaluation framework combining expert-grounded rubrics for LLM judging, curated ground-truth artifacts that yield stepwise reward signals, and human pairwise preferences for validation (an illustrative sketch of these pieces follows this list).
- The study finds that domain-authored (expert) rubrics produce more reliable evaluation signals than LLM-authored rubrics (kappa 0.60 vs. 0.46), indicating better agreement with human standards; a kappa computation sketch appears below.
- Human preference evaluations statistically corroborate the same rankings (p < 0.05), supporting the claim that expert-grounded evaluation can scale while maintaining reliability; see the significance-test sketch after the list.
- The authors release public datasets and report results on two long-horizon environments: Figma-to-code (33 tasks using the Figma API via MCP) and Programmatic content (41 courses with 183 evaluatable chapters).
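To make the three evaluation channels concrete, here is a minimal, purely illustrative sketch of how an expert rubric and curated ground-truth artifacts might be combined into rubric scores and stepwise rewards. The data structures, field names, and matching rule are hypothetical and not taken from LH-Bench.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str        # what the judge checks, e.g. "layout fidelity"
    weight: float    # expert-assigned importance
    score: float     # judge's score in [0, 1] for this criterion

def rubric_score(criteria):
    """Weighted average of per-criterion judge scores (hypothetical aggregation)."""
    total = sum(c.weight for c in criteria)
    return sum(c.weight * c.score for c in criteria) / total

def stepwise_reward(agent_steps, ground_truth_steps, match):
    """Reward per workflow step: 1 if the agent's intermediate artifact matches
    the curated ground-truth artifact for that step, else 0 (hypothetical rule)."""
    return [float(match(a, g)) for a, g in zip(agent_steps, ground_truth_steps)]

# Hypothetical Figma-to-code trajectory: per-step artifacts vs. curated references.
rewards = stepwise_reward(
    agent_steps=["frame.json", "layout.html", "styles.css"],
    ground_truth_steps=["frame.json", "layout.html", "style.css"],
    match=lambda a, g: a == g,
)
criteria = [
    Criterion("layout fidelity", weight=2.0, score=0.8),
    Criterion("code quality", weight=1.0, score=0.6),
]
print(rewards)                          # [1.0, 1.0, 0.0]
print(f"{rubric_score(criteria):.2f}")  # 0.73
```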
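The rubric-agreement comparison hinges on an inter-rater agreement statistic. As a minimal sketch (not the paper's evaluation code, and assuming the reported kappa is Cohen's kappa over per-criterion pass/fail labels), agreement between a rubric-based LLM judge and a human reference can be computed like this:

```python
from collections import Counter

def cohens_kappa(judge_labels, human_labels):
    """Cohen's kappa between two raters over the same items.

    Both inputs are equal-length sequences of categorical labels
    (e.g. per-criterion "pass"/"fail" decisions).
    """
    assert len(judge_labels) == len(human_labels)
    n = len(judge_labels)

    # Observed agreement: fraction of items where the raters match.
    p_observed = sum(a == b for a, b in zip(judge_labels, human_labels)) / n

    # Expected agreement if the raters were independent but kept their marginals.
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    p_expected = sum(
        (judge_counts[label] / n) * (human_counts[label] / n)
        for label in set(judge_labels) | set(human_labels)
    )

    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical per-criterion verdicts from an expert-rubric judge vs. a human grader.
judge = ["pass", "pass", "fail", "pass", "fail", "pass"]
human = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(f"kappa = {cohens_kappa(judge, human):.2f}")  # kappa = 0.67
```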
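Similarly, the human pairwise-preference check reduces to a significance test on win counts. A minimal sketch, assuming a two-sided exact sign test over independent pairwise comparisons with ties dropped (the paper may use a different test):

```python
from math import comb

def sign_test_p_value(wins_a, wins_b):
    """Two-sided exact sign test: probability of a split at least this lopsided
    if both systems were equally preferred (ties excluded beforehand)."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # P(X >= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical annotator preferences between two agent runs.
p = sign_test_p_value(wins_a=31, wins_b=12)
print(f"p = {p:.4f}")  # p < 0.05 would support the ranking at the usual threshold
```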