Benchmarking and Mechanistic Analysis of Vision-Language Models for Cross-Depiction Assembly Instruction Alignment

arXiv cs.CV / 4/2/2026


Key Points

  • The paper introduces IKEA-Bench, a benchmark with 1,623 questions across six task types and 29 IKEA products, designed to evaluate vision-language model alignment between 2D assembly instructions and video/camera depictions.
  • Experiments with 19 VLMs (2B–38B) show that adding text can recover assembly-instruction understanding but simultaneously degrades diagram-to-video alignment, indicating a trade-off between text-driven reasoning and cross-depiction visual grounding.
  • Model architecture family is found to predict alignment accuracy more reliably than sheer parameter count, suggesting structural design choices matter more than scaling alone.
  • A mechanistic analysis finds diagrams and video representations lie in largely disjoint ViT subspaces, and that adding text shifts attention toward text-mediated reasoning rather than improving visual correspondence.
  • Video understanding is identified as the dominant bottleneck that remains difficult regardless of the alignment strategy, implying that improving visual encoding for cross-depiction robustness is the primary research target.
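The disjoint-subspace finding can be quantified with standard linear algebra. The sketch below is not the paper's analysis pipeline; it is a minimal illustration, assuming hypothetical embedding matrices `emb_a` and `emb_b` (rows = ViT patch or frame embeddings for diagrams and video, respectively), of how one could measure overlap between the top-k variance subspaces of two representation sets via principal angles:

```python
import numpy as np

def subspace_overlap(emb_a, emb_b, k=16):
    """Cosines of the principal angles between the top-k PCA subspaces
    of two embedding matrices (rows = samples, cols = features).
    Values near 1 indicate shared directions; values near 0 indicate
    largely disjoint subspaces, as reported for diagram vs. video."""
    def top_k_basis(x, k):
        x = x - x.mean(axis=0)          # center before PCA
        # right singular vectors span the top-k variance directions
        _, _, vt = np.linalg.svd(x, full_matrices=False)
        return vt[:k].T                 # (features, k), orthonormal columns
    qa, qb = top_k_basis(emb_a, k), top_k_basis(emb_b, k)
    # singular values of Q_a^T Q_b are the principal-angle cosines, in [0, 1]
    return np.linalg.svd(qa.T @ qb, compute_uv=False)

# toy check: a subspace fully overlaps itself; two independent random
# embedding sets share little structure
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 64))
same = subspace_overlap(x, x, k=8)                          # all ~1.0
diff = subspace_overlap(x, rng.normal(size=(200, 64)), k=8)  # much lower
```

Applied to real diagram and video embeddings from a shared ViT encoder, uniformly small cosines would indicate the kind of representational disjointness the paper describes.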

Abstract

2D assembly diagrams are often abstract and hard to follow, creating a need for intelligent assistants that can monitor progress, detect errors, and provide step-by-step guidance. In mixed reality settings, such systems must recognize completed and ongoing steps from the camera feed and align them with the diagram instructions. Vision Language Models (VLMs) show promise for this task, but face a depiction gap because assembly diagrams and video frames share few visual features. To systematically assess this gap, we construct IKEA-Bench, a benchmark of 1,623 questions across 6 task types on 29 IKEA furniture products, and evaluate 19 VLMs (2B-38B) under three alignment strategies. Our key findings: (1) assembly instruction understanding is recoverable via text, but text simultaneously degrades diagram-to-video alignment; (2) architecture family predicts alignment accuracy more strongly than parameter count; (3) video understanding remains a hard bottleneck unaffected by strategy. A three-level mechanistic analysis further reveals that diagrams and video occupy disjoint ViT subspaces, and that adding text shifts models from visual to text-driven reasoning. These results identify visual encoding as the primary target for improving cross-depiction robustness. Project page: https://ryenhails.github.io/IKEA-Bench/