Benchmarking Vision-Language Models under Contradictory Virtual Content Attacks in Augmented Reality

arXiv cs.CV / 4/8/2026


Key Points

  • The paper introduces a systematic threat model for contradictory virtual content attacks in augmented reality (AR), where malicious or inconsistent virtual elements can mislead users or cause semantic confusion.
  • It presents ContrAR, a new benchmark consisting of 312 real-world, human-validated AR videos, designed to evaluate how well vision-language models (VLMs) handle AR virtual content manipulation and contradictions.
  • The authors benchmark 11 VLMs (commercial and open-source) and find that while most can recognize contradictory virtual content to some extent, significant room for improvement remains in detecting and reasoning about adversarial content in AR settings.
  • A key reported challenge is balancing detection accuracy with latency, which is important for real-time AR systems.
  • Overall, the work highlights security and reliability gaps for current VLMs when deployed in AR environments under adversarial virtual content conditions.
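The evaluation setup described above — scoring a contradiction detector on labeled AR videos while tracking per-sample latency — can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's actual harness: the `ARVideoSample` type, the `evaluate` function, and the toy model are all invented for this example.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class ARVideoSample:
    # Hypothetical sample: a video identifier plus a ground-truth label for
    # whether the overlaid virtual content contradicts the real scene.
    video_id: str
    contradictory: bool

def evaluate(model: Callable[[str], bool],
             samples: List[ARVideoSample]) -> Tuple[float, float]:
    """Return (accuracy, mean per-sample latency in seconds) for a detector."""
    correct = 0
    total_latency = 0.0
    for s in samples:
        start = time.perf_counter()
        pred = model(s.video_id)  # model flags contradictory virtual content
        total_latency += time.perf_counter() - start
        correct += int(pred == s.contradictory)
    n = len(samples)
    return correct / n, total_latency / n

# Toy stand-in detector: flags any video whose id mentions "attack".
toy_model = lambda vid: "attack" in vid

samples = [
    ARVideoSample("attack_stop_sign", True),
    ARVideoSample("benign_navigation", False),
    ARVideoSample("attack_price_tag", True),
    ARVideoSample("benign_overlay", False),
]
accuracy, mean_latency = evaluate(toy_model, samples)
```

Tracking latency alongside accuracy in the same loop mirrors the paper's point that real-time AR systems must trade the two off: a slower but more accurate model may still be unusable if its per-frame latency breaks the AR experience.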

Abstract

Augmented reality (AR) has rapidly expanded over the past decade. As AR becomes increasingly integrated into daily life, its security and reliability emerge as critical challenges. Among various threats, contradictory virtual content attacks, where malicious or inconsistent virtual elements are introduced into the user's view, pose a unique risk by misleading users, creating semantic confusion, or delivering harmful information. In this work, we systematically model such attacks and present ContrAR, a novel benchmark for evaluating the robustness of vision-language models (VLMs) against virtual content manipulation and contradiction in AR. ContrAR contains 312 real-world AR videos validated by 10 human participants. We further benchmark 11 VLMs, including both commercial and open-source models. Experimental results reveal that while current VLMs exhibit reasonable understanding of contradictory virtual content, there remains room for improvement in detecting and reasoning about adversarial content manipulations in AR environments. Moreover, balancing detection accuracy and latency remains challenging.