FineState-Bench: Benchmarking State-Conditioned Grounding for Fine-grained GUI State Setting
arXiv cs.CV / 5/1/2026
Key Points
- The paper introduces FineState-Bench, a new benchmark focused on fine-grained, state-conditioned GUI interaction, addressing gaps in prior evaluations such as limited coverage and vague target-state definitions.
- FineState-Bench contains 2,209 explicitly defined instances across desktop, web, and mobile, covering four interaction families and 23 UI component types, with exact target states for each task.
- The authors propose FineState-Metrics, a four-stage diagnostic framework (SR@Loc, SR@Int, ES-SR@Loc, ES-SR@Int) to pinpoint where agents fail during localization and interaction.
- Results show low exact goal-state success: ES-SR@Int peaks at 32.8% on web and 22.8% on average across platforms. Adding the Visual Diagnostic Assistant (VDA) raises Gemini-2.5-Flash's ES-SR@Int by 14.9 points.
- Overall, the study suggests that visual grounding has significant room for improvement and that current models still lack the accuracy needed for reliable fine-grained, state-conditioned GUI control.
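
As a rough illustration of how a staged diagnostic like FineState-Metrics could be tallied, the sketch below computes a pass rate per stage and counts where each instance first fails. The four stage names come from the summary above. Everything else is an assumption made for illustration: the stage ordering, the per-instance boolean records, and the helper names `stage_rates` and `first_failures` are hypothetical and are not taken from the paper.

```python
from collections import Counter

# Stage names come from the summary; their ordering and the per-instance
# boolean encoding are assumptions made for illustration only.
STAGES = ["SR@Loc", "SR@Int", "ES-SR@Loc", "ES-SR@Int"]


def stage_rates(records):
    """Return the percentage of instances that pass each stage."""
    total = len(records)
    if total == 0:
        return {stage: 0.0 for stage in STAGES}
    return {
        stage: 100.0 * sum(bool(r[stage]) for r in records) / total
        for stage in STAGES
    }


def first_failures(records):
    """Count, per stage, how many instances fail there first (in STAGES order)."""
    counts = Counter()
    for r in records:
        failed = next((s for s in STAGES if not r[s]), None)
        counts[failed or "passed all"] += 1
    return counts


# Hypothetical per-instance outcomes, not data from the benchmark.
records = [
    {"SR@Loc": True, "SR@Int": True, "ES-SR@Loc": True, "ES-SR@Int": True},
    {"SR@Loc": True, "SR@Int": True, "ES-SR@Loc": False, "ES-SR@Int": False},
    {"SR@Loc": True, "SR@Int": False, "ES-SR@Loc": False, "ES-SR@Int": False},
    {"SR@Loc": False, "SR@Int": False, "ES-SR@Loc": False, "ES-SR@Int": False},
]

print(stage_rates(records))
print(first_failures(records))
```

Reporting the first failing stage alongside the raw rates is one way to separate localization errors from interaction errors, which is the kind of diagnosis the summary attributes to the four-stage framework.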