Benchmarking and Mechanistic Analysis of Vision-Language Models for Cross-Depiction Assembly Instruction Alignment
arXiv cs.CV / 4/2/2026
Key Points
- The paper introduces IKEA-Bench, a benchmark with 1,623 questions across six task types and 29 IKEA products, designed to evaluate vision-language model alignment between 2D assembly instructions and video/camera depictions.
- Experiments with 19 VLMs (2B–38B parameters) show that supplying instruction text can recover assembly-instruction understanding while harming diagram-to-video alignment, indicating a trade-off between text-driven reasoning and cross-depiction visual grounding.
- Model architecture family is found to predict alignment accuracy more reliably than sheer parameter count, suggesting structural design choices matter more than scaling alone.
- A mechanistic analysis finds that diagram and video representations lie in largely disjoint ViT subspaces, and that adding text shifts attention toward text-mediated reasoning rather than improving visual correspondence.
- Video understanding is identified as the dominant bottleneck that remains difficult regardless of the alignment strategy, implying that improving visual encoding for cross-depiction robustness is the primary research target.
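The "largely disjoint ViT subspaces" finding above can be probed with a standard subspace-overlap measure: compute the principal angles between the top-k PCA directions of diagram embeddings and video embeddings. The sketch below is a generic illustration of that technique, not the paper's actual analysis code; the function name and the choice of k are assumptions.

```python
import numpy as np

def principal_angles(A, B, k=8):
    """Principal angles (radians) between the top-k PCA subspaces of
    two embedding matrices A, B of shape (n_samples, dim).
    Angles near 0 mean the subspaces overlap; near pi/2 means disjoint."""
    # Center each set and take the top-k right singular vectors (PCA axes).
    Ua = np.linalg.svd(A - A.mean(0), full_matrices=False)[2][:k].T  # (dim, k)
    Ub = np.linalg.svd(B - B.mean(0), full_matrices=False)[2][:k].T  # (dim, k)
    # Singular values of Ua^T Ub are the cosines of the principal angles.
    cosines = np.linalg.svd(Ua.T @ Ub, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))

# Toy check with synthetic "diagram" vs "video" embeddings occupying
# orthogonal coordinate blocks: all principal angles come out near pi/2.
rng = np.random.default_rng(0)
diag_emb = np.zeros((200, 64)); diag_emb[:, :8] = rng.normal(size=(200, 8))
vid_emb = np.zeros((200, 64)); vid_emb[:, 8:16] = rng.normal(size=(200, 8))
print(principal_angles(diag_emb, vid_emb, k=4).round(3))
```

In practice one would feed ViT patch or CLS embeddings of instruction diagrams and video frames into such a function; uniformly large angles across layers would support the disjoint-subspace claim.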