ACE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments

arXiv cs.AI / 4/8/2026


Key Points

  • The paper introduces ACE-Bench, an agent evaluation benchmark designed to reduce environment interaction overhead (reported as up to ~41% of total evaluation time in prior benchmarks) by using a lightweight, static tooling setup.
  • ACE-Bench standardizes evaluation around a unified grid-based planning task where agents fill hidden slots in a partially completed schedule while satisfying local and global constraints.
  • It enables fine-grained, interpretable control over task horizon and difficulty using two orthogonal parameters: Scalable Horizons via the number of hidden slots (H) and Controllable Difficulty via a decoy budget (B).
  • All tool calls are resolved through static JSON files in a Lightweight Environment, improving setup speed and enabling fast, reproducible evaluation suitable for training-time validation.
  • Experiments across 13 models of diverse families and sizes over 6 domains show reliable control over horizon/difficulty, strong domain consistency, and meaningful cross-model performance variation, indicating improved model discriminability.
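
The "Lightweight Environment" idea (every tool call answered from pre-built static JSON rather than a live service) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the file layout, class name, and argument-keying scheme are all assumptions.

```python
import json
from pathlib import Path

class StaticToolEnvironment:
    """Hypothetical sketch: each tool's responses live in a static
    JSON file, keyed by a canonical serialization of the call's
    arguments, so no sandboxes or services need to be started."""

    def __init__(self, tool_dir: str):
        self.tool_dir = Path(tool_dir)
        self._cache: dict[str, dict] = {}

    def call(self, tool_name: str, **args) -> dict:
        # Lazily load <tool_dir>/<tool_name>.json and look up the
        # response for this exact argument combination.
        if tool_name not in self._cache:
            with open(self.tool_dir / f"{tool_name}.json") as f:
                self._cache[tool_name] = json.load(f)
        key = json.dumps(args, sort_keys=True)
        return self._cache[tool_name][key]
```

Because every response is a deterministic file lookup, an evaluation run is fully reproducible and fast enough to serve as a training-time validation signal, which is the property the benchmark claims.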

Abstract

Existing agent benchmarks suffer from two critical limitations: high environment interaction overhead (up to 41% of total evaluation time) and imbalanced task horizon and difficulty distributions that make aggregate scores unreliable. To address these issues, we propose ACE-Bench, built around a unified grid-based planning task in which agents must fill hidden slots in a partially completed schedule subject to both local slot constraints and global constraints. Our benchmark offers fine-grained control through two orthogonal axes: Scalable Horizons, controlled by the number of hidden slots H, and Controllable Difficulty, governed by a decoy budget B that determines the number of globally misleading decoy candidates. Crucially, all tool calls are resolved via static JSON files under a Lightweight Environment design, eliminating setup overhead and enabling fast, reproducible evaluation suitable for training-time validation. We first validate that H and B provide reliable control over task horizon and difficulty, and that ACE-Bench exhibits strong domain consistency and model discriminability. We then conduct comprehensive experiments across 13 models of diverse sizes and families over 6 domains, revealing significant cross-model performance variation and confirming that ACE-Bench provides interpretable and controllable evaluation of agent reasoning.
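
The two control axes can be made concrete with a small sketch: hiding H slots of a completed schedule sets the horizon, while adding B decoy candidates sets the difficulty. This is an illustrative construction under assumed data shapes; the function name, decoy representation, and task format are not from the paper.

```python
import random

def make_task(grid: list, candidates: list, h: int, b: int, seed: int = 0) -> dict:
    """Hypothetical task generator: hide h slots of a solved grid
    (Scalable Horizons) and mix in b misleading decoy candidates
    (Controllable Difficulty)."""
    rng = random.Random(seed)
    hidden = set(rng.sample(range(len(grid)), h))
    # Blank out the h hidden slots; the rest stay visible as context.
    task_grid = [None if i in hidden else v for i, v in enumerate(grid)]
    # In the benchmark, decoys satisfy local slot constraints but
    # violate a global constraint; here they are just extra labels.
    pool = candidates + [f"decoy_{j}" for j in range(b)]
    rng.shuffle(pool)
    return {
        "grid": task_grid,
        "candidates": pool,
        "answer": {i: grid[i] for i in hidden},
    }
```

Because H and B are independent knobs on the same generator, task suites can be laid out on an interpretable horizon-by-difficulty lattice rather than an ad hoc mix of scenarios, which is what makes the aggregate scores controllable.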