Benchmarking Vision-Language Models under Contradictory Virtual Content Attacks in Augmented Reality

arXiv cs.CV / 4/8/2026


Key Points

  • The paper introduces a systematic threat model for contradictory virtual content attacks in augmented reality (AR), where malicious or inconsistent virtual elements can mislead users or cause semantic confusion.
  • It presents ContrAR, a new benchmark consisting of 312 real-world, human-validated AR videos, designed to evaluate how well vision-language models (VLMs) handle AR virtual content manipulation and contradictions.
  • The authors benchmark 11 VLMs (commercial and open-source) and find that while most can recognize contradictory virtual content to some extent, significant room for improvement remains in detecting and reasoning about adversarial content in AR settings.
  • A key reported challenge is balancing detection accuracy with latency, which is important for real-time AR systems.
  • Overall, the work highlights security and reliability gaps for current VLMs when deployed in AR environments under adversarial virtual content conditions.
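The evaluation setup described above — scoring a contradiction detector on labeled AR videos while tracking per-sample latency — can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's actual harness: the `ARVideoSample` type, the `evaluate` function, and the toy model are all invented for this example.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class ARVideoSample:
    # Hypothetical sample: a video identifier plus a ground-truth label for
    # whether the overlaid virtual content contradicts the real scene.
    video_id: str
    contradictory: bool

def evaluate(model: Callable[[str], bool],
             samples: List[ARVideoSample]) -> Tuple[float, float]:
    """Return (accuracy, mean per-sample latency in seconds) for a detector."""
    correct = 0
    total_latency = 0.0
    for s in samples:
        start = time.perf_counter()
        pred = model(s.video_id)  # model flags contradictory virtual content
        total_latency += time.perf_counter() - start
        correct += int(pred == s.contradictory)
    n = len(samples)
    return correct / n, total_latency / n

# Toy stand-in detector: flags any video whose id mentions "attack".
toy_model = lambda vid: "attack" in vid

samples = [
    ARVideoSample("attack_stop_sign", True),
    ARVideoSample("benign_navigation", False),
    ARVideoSample("attack_price_tag", True),
    ARVideoSample("benign_overlay", False),
]
accuracy, mean_latency = evaluate(toy_model, samples)
```

Tracking latency alongside accuracy in the same loop mirrors the paper's point that real-time AR systems must trade the two off: a slower but more accurate model may still be unusable if its per-frame latency breaks the AR experience.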

Abstract

Augmented reality (AR) has rapidly expanded over the past decade. As AR becomes increasingly integrated into daily life, its security and reliability emerge as critical challenges. Among various threats, contradictory virtual content attacks, where malicious or inconsistent virtual elements are introduced into the user's view, pose a unique risk by misleading users, creating semantic confusion, or delivering harmful information. In this work, we systematically model such attacks and present ContrAR, a novel benchmark for evaluating the robustness of vision-language models (VLMs) against virtual content manipulation and contradiction in AR. ContrAR contains 312 real-world AR videos validated by 10 human participants. We further benchmark 11 VLMs, including both commercial and open-source models. Experimental results reveal that while current VLMs exhibit reasonable understanding of contradictory virtual content, there remains room for improvement in detecting and reasoning about adversarial content manipulations in AR environments. Moreover, balancing detection accuracy and latency remains challenging.