HalluClear: Diagnosing, Evaluating and Mitigating Hallucinations in GUI Agents

arXiv cs.AI / 4/21/2026


Key Points

  • The paper introduces HalluClear, a suite for diagnosing, evaluating, and mitigating hallucinations in GUI agents, where ungrounded hallucinations often trigger cascading failures in real-world deployments.
  • HalluClear includes a GUI-focused hallucination taxonomy, a three-stage evaluation workflow to improve the reliability of VLM-as-a-judge using expert-annotated benchmarks and ensemble credibility estimation, and an intervention strategy using closed-loop structured reasoning.
  • The mitigation approach supports lightweight continual post-training with cold-start initialization, targeting both generalist and GUI-specialist agents rather than relying solely on large-scale retraining.
  • Experiments on representative agents and public benchmarks suggest that post-training with only about 9K samples can substantially reduce hallucinations and improve grounding and action fidelity.
  • The work positions hallucination-focused tooling as a compute-efficient complement to industrial-scale scaling for building more robust GUI automation.

Abstract

While progress in GUI agents has been largely driven by industrial-scale training, ungrounded hallucinations often trigger cascading failures in real-world deployments. Unlike general VLM domains, the GUI agent field lacks a hallucination-focused suite for fine-grained diagnosis, reliable evaluation, and targeted mitigation. To bridge this gap, we introduce HalluClear, a comprehensive suite for hallucination mitigation in GUI agents as a complement to computation-intensive scaling. HalluClear comprises: (1) a GUI-specific hallucination taxonomy derived from empirical failure analysis; (2) a calibrated three-stage evaluation workflow which enhances VLM-as-a-judge reliability via expert-annotated benchmarking and ensemble credibility estimation; and (3) a mitigation scheme based on closed-loop structured reasoning, enabling lightweight continual post-training with cold-start initialization for both generalist and GUI-specialist agents. Experiments across representative agents and public benchmarks demonstrate that post-training on only 9K samples within our suite can significantly reduce hallucinations, thereby improving grounding and action fidelity, offering a compute-efficient pathway to robust GUI automation.
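
The abstract does not spell out how "ensemble credibility estimation" is computed, so the following is only an illustrative sketch of one plausible reading, not the paper's method. It assumes each VLM judge emits a boolean "hallucinated" verdict, that a judge's credibility is simply its accuracy on an expert-annotated calibration set, and that the ensemble verdict is a credibility-weighted vote. The function names (`judge_credibility`, `ensemble_verdict`) and the judge and item identifiers are hypothetical.

```python
def judge_credibility(judge_votes, expert_labels):
    """Assumed credibility measure: each judge's accuracy on expert-labelled items.

    judge_votes:   {judge_name: {item_id: bool}}  (True = judged as hallucination)
    expert_labels: {item_id: bool}                (expert ground truth)
    """
    credibility = {}
    for judge, votes in judge_votes.items():
        # Only items that experts have annotated can be scored.
        scored = [item for item in votes if item in expert_labels]
        if not scored:
            credibility[judge] = 0.0
            continue
        correct = sum(votes[item] == expert_labels[item] for item in scored)
        credibility[judge] = correct / len(scored)
    return credibility


def ensemble_verdict(votes_for_item, credibility):
    """Credibility-weighted vote over judges for a single item.

    Returns (verdict, confidence); verdict is None when no judge carries weight.
    """
    total = sum(credibility.get(judge, 0.0) for judge in votes_for_item)
    if total == 0:
        return None, 0.0
    weight_true = sum(
        credibility.get(judge, 0.0)
        for judge, vote in votes_for_item.items()
        if vote
    )
    score = weight_true / total          # weighted share voting "hallucination"
    verdict = score >= 0.5
    confidence = max(score, 1.0 - score)  # weighted share behind the verdict
    return verdict, confidence


if __name__ == "__main__":
    # Hypothetical calibration data: three judges, four expert-labelled items.
    experts = {"a": True, "b": False, "c": True, "d": False}
    votes = {
        "judge_1": {"a": True, "b": False, "c": True, "d": False},
        "judge_2": {"a": True, "b": True, "c": True, "d": True},
        "judge_3": {"a": False, "b": True, "c": False, "d": True},
    }
    cred = judge_credibility(votes, experts)
    print(cred)
    print(ensemble_verdict({"judge_1": False, "judge_2": True, "judge_3": True}, cred))
```

Under these assumptions, a judge that disagrees with the experts contributes little or nothing to the final verdict, so a single reliable judge can outvote several unreliable ones. A real pipeline would likely need a more careful credibility estimate than raw accuracy, for example one that accounts for class imbalance between hallucinated and non-hallucinated samples.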