FineState-Bench: Benchmarking State-Conditioned Grounding for Fine-grained GUI State Setting

arXiv cs.CV / 5/1/2026


Key Points

  • The paper introduces FineState-Bench, a new benchmark focused on fine-grained, state-conditioned GUI interaction, addressing gaps in prior evaluations such as limited coverage and vague target-state definitions.
  • FineState-Bench contains 2,209 explicitly defined instances across desktop, web, and mobile, covering four interaction families and 23 UI component types, with exact target states for each task.
  • The authors propose FineState-Metrics, a four-stage diagnostic framework (SR@Loc, SR@Int, ES-SR@Loc, ES-SR@Int) to pinpoint where agents fail during localization and interaction.
  • Results show low exact goal-state success (ES-SR@Int peaks at 32.8% on web and 22.8% on average across platforms), and using the Visual Diagnostic Assistant (VDA) gives Gemini-2.5-Flash a +14.9 point boost in ES-SR@Int.
  • Overall, the study suggests there is significant room for improving visual grounding, but current models still lack accuracy for reliable fine-grained state-conditioned GUI control.
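The four stage-wise FineState-Metrics described above can be sketched as simple per-instance aggregates. This is a minimal illustration assuming a hypothetical per-instance flag schema (`located`, `interacted`, etc.); the paper's actual scoring pipeline may differ.

```python
from dataclasses import dataclass

@dataclass
class Outcome:
    # Hypothetical per-instance flags: did the agent localize the right
    # control, interact with it, and reach the exact target state at the
    # localization and interaction stages?
    located: bool
    interacted: bool
    exact_state_at_locate: bool
    exact_state_at_interact: bool

def rate(flags):
    """Success rate as a percentage over a collection of booleans."""
    flags = list(flags)
    return 100.0 * sum(flags) / len(flags) if flags else 0.0

def finestate_metrics(outcomes):
    """Compute the four stage-wise success rates (percentages)."""
    return {
        "SR@Loc": rate(o.located for o in outcomes),
        "SR@Int": rate(o.interacted for o in outcomes),
        "ES-SR@Loc": rate(o.located and o.exact_state_at_locate for o in outcomes),
        "ES-SR@Int": rate(o.interacted and o.exact_state_at_interact for o in outcomes),
    }
```

Because each later stage is conditioned on the earlier ones, comparing the four rates shows where agents drop off: a large gap between SR@Loc and ES-SR@Int indicates agents find the control but fail to set its exact state.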

Abstract

Despite the rapid progress of large vision-language models (LVLMs), fine-grained, state-conditioned GUI interaction remains challenging. Current evaluations offer limited coverage, imprecise target-state definitions, and an overreliance on final-task success, obscuring where and why agents fail. To address this gap, we introduce FineState-Bench, a benchmark that evaluates whether an agent can correctly ground an instruction to the intended UI control and reach the exact target state. FineState-Bench comprises 2,209 instances across desktop, web, and mobile platforms, spanning four interaction families and 23 UI component types, with each instance explicitly specifying an exact target state for fine-grained state setting. We further propose FineState-Metrics, a four-stage diagnostic pipeline with stage-wise success rates: Localization Success Rate (SR@Loc), Interaction Success Rate (SR@Int), Exact State Success Rate at Locate (ES-SR@Loc), and Exact State Success Rate at Interact (ES-SR@Int), together with a plug-and-play Visual Diagnostic Assistant (VDA) that generates a Description and a bounding-box Localization Hint to diagnose visual-grounding failures via controlled with/without comparisons. On FineState-Bench, exact goal-state success remains low: ES-SR@Int peaks at 32.8% on web and averages 22.8% across platforms. With VDA localization hints, Gemini-2.5-Flash gains +14.9 ES-SR@Int points, suggesting substantial headroom from improved visual grounding; yet overall accuracy is still insufficient for reliable fine-grained state-conditioned interaction. Code: https://github.com/FengxianJi/FineState-Bench
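The VDA's controlled with/without comparison can be sketched as prompt augmentation: the same instruction is sent to the agent once as-is and once augmented with the VDA's Description and bounding-box Localization Hint. The function below is a hypothetical sketch (the `hint` dictionary fields are illustrative names, not the paper's interface).

```python
def build_prompt(instruction, hint=None):
    """Compose the agent prompt, optionally augmented with VDA output.

    `hint` is a hypothetical dict with a natural-language 'description'
    of the target control and a 'bbox' tuple (x1, y1, x2, y2).
    """
    if hint is None:
        return instruction  # w/o condition: instruction only
    # w/ condition: append the VDA description and localization hint
    return (
        f"{instruction}\n"
        f"Target description: {hint['description']}\n"
        f"Likely location (x1, y1, x2, y2): {hint['bbox']}"
    )
```

Running the agent under both conditions and differencing the stage-wise metrics isolates how much of the failure is attributable to visual grounding, which is how the +14.9-point ES-SR@Int gain for Gemini-2.5-Flash is attributed to the localization hint.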