Multi-modal user interface control detection using cross-attention

arXiv cs.CV / 4/9/2026


Key Points

  • The paper addresses the challenge of detecting UI controls from screenshots by introducing a multi-modal YOLOv5 extension that leverages GPT-generated text descriptions alongside visual inputs.
  • It uses cross-attention modules to align visual features with semantic information from text embeddings, improving context-awareness beyond pixel-only approaches.
  • Evaluations on a dataset of 16,000+ annotated UI screenshots covering 23 control classes show consistent gains over baseline YOLOv5 using multiple text-visual fusion strategies.
  • Convolutional fusion delivers the best results, especially for semantically complex or visually ambiguous UI control classes where vision alone is often insufficient.
  • The authors suggest the approach can enable more reliable automated testing, accessibility support, and UI analytics, and they motivate future work on efficient, robust, and generalizable multi-modal detection systems.
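The cross-attention alignment described above can be illustrated with a minimal sketch: visual features act as queries, while GPT-derived text embeddings supply keys and values, so each visual location attends to the textual description. This is a hypothetical NumPy illustration of the general mechanism, not the paper's actual YOLOv5 implementation; all names and shapes here are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(visual, text, d_k):
    # visual: (n_patches, d_k) query features from the image backbone
    # text:   (n_tokens, d_k) key/value features from text embeddings
    # Scaled dot-product attention: each visual patch aggregates
    # text information weighted by query-key similarity.
    scores = visual @ text.T / np.sqrt(d_k)    # (n_patches, n_tokens)
    weights = softmax(scores, axis=-1)         # rows sum to 1
    return weights @ text                      # (n_patches, d_k)
```

With a single text token, every patch's attention weight is 1, so each output row reproduces that token's embedding, which makes the aggregation behavior easy to verify by hand.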

Abstract

Detecting user interface (UI) controls from software screenshots is a critical task for automated testing, accessibility, and software analytics, yet it remains challenging due to visual ambiguities, design variability, and the lack of contextual cues in pixel-only approaches. In this paper, we introduce a novel multi-modal extension of YOLOv5 that integrates GPT-generated textual descriptions of UI images into the detection pipeline through cross-attention modules. By aligning visual features with semantic information derived from text embeddings, our model enables more robust and context-aware UI control detection. We evaluate the proposed framework on a large dataset of over 16,000 annotated UI screenshots spanning 23 control classes. Extensive experiments compare three fusion strategies (element-wise addition, weighted sum, and convolutional fusion), demonstrating consistent improvements over the baseline YOLOv5 model. Among these, convolutional fusion achieved the strongest performance, with significant gains in detecting semantically complex or visually ambiguous classes. These results establish that combining visual and textual modalities can substantially enhance UI element detection, particularly in edge cases where visual information alone is insufficient. Our findings open promising opportunities for more reliable and intelligent tools in software testing, accessibility support, and UI analytics, setting the stage for future research on efficient, robust, and generalizable multi-modal detection systems.
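The three fusion strategies named in the abstract can be sketched on toy feature maps. This is a hypothetical NumPy illustration under assumed shapes: a (C, H, W) visual map and a text feature map already broadcast to the same shape; "convolutional fusion" is modeled as a 1x1 convolution over the concatenated channels, which reduces to a linear map across the stacked 2C channels. None of these weights or shapes come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
vis = rng.standard_normal((C, H, W))   # visual feature map (assumed shape)
txt = rng.standard_normal((C, H, W))   # text features broadcast to the same grid

# 1. Element-wise addition: cheapest, no extra parameters.
fused_add = vis + txt

# 2. Weighted sum: a scalar alpha balances the modalities
#    (learnable in practice; fixed here for illustration).
alpha = 0.7
fused_ws = alpha * vis + (1 - alpha) * txt

# 3. Convolutional fusion: 1x1 conv over concatenated channels,
#    i.e. a learned linear mix of all 2C stacked channels per location.
concat = np.concatenate([vis, txt], axis=0)       # (2C, H, W)
kernel = rng.standard_normal((C, 2 * C))          # 1x1 conv weights
fused_conv = np.einsum('oc,chw->ohw', kernel, concat)  # (C, H, W)
```

Convolutional fusion subsumes the other two: with an identity-plus-identity kernel it reproduces addition, and with scaled identities it reproduces the weighted sum, which is one plausible reason it performs best on ambiguous classes.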