How Do LLMs and VLMs Understand Viewpoint Rotation Without Vision? An Interpretability Study
arXiv cs.AI / 4/17/2026
Key Points
- The study investigates whether LLMs and VLMs can perform viewpoint rotation understanding (VRU) using text-only inputs, without any visual information.
- Experiments show that both LLMs and VLMs perform poorly on the authors’ VRU dataset, while humans achieve 100% accuracy, highlighting a significant capability gap for spatial intelligence.
- Layer-wise probing and head-wise causal interventions suggest the models can encode viewpoint information but struggle to bind the viewpoint position to the corresponding observation, which leads to hallucinations in later layers (a toy intervention sketch follows this list).
- Selective fine-tuning of the key attention heads identified by causal intervention improves VRU performance while largely avoiding catastrophic forgetting of general abilities (see the fine-tuning sketch further below); the dataset and code are planned for release.
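
For readers unfamiliar with head-wise causal interventions, here is a minimal sketch of the general technique: zero-ablate a single attention head's contribution and measure the shift in the answer-token logit. The model name, layer/head indices, prompt, answer token, and the GPT-2-specific module path (`transformer.h[layer].attn.c_proj`) are all illustrative assumptions, not the paper's actual setup or results.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper's models are not specified here
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# Hypothetical text-only VRU-style prompt and candidate answer token.
prompt = "Standing on the south side of the table and facing it, the cup is on your"
answer_id = tok(" right", add_special_tokens=False).input_ids[0]
inputs = tok(prompt, return_tensors="pt")

layer, head = 8, 3                                    # hypothetical head under test
head_dim = model.config.n_embd // model.config.n_head

def ablate_head(module, args):
    # c_proj receives the concatenated per-head outputs; zero out one head's slice
    # so its contribution never reaches the residual stream.
    x = args[0].clone()
    x[..., head * head_dim:(head + 1) * head_dim] = 0.0
    return (x,)

with torch.no_grad():
    base = model(**inputs).logits[0, -1, answer_id].item()
    hook = model.transformer.h[layer].attn.c_proj.register_forward_pre_hook(ablate_head)
    ablated = model(**inputs).logits[0, -1, answer_id].item()
    hook.remove()

print(f"answer logit {base:.3f} -> {ablated:.3f} (causal effect = {base - ablated:+.3f})")
```

Heads whose ablation causes a large drop in the correct-answer logit are candidates for carrying viewpoint information, which is what motivates the selective fine-tuning step below.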
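
And a minimal sketch of selective fine-tuning, assuming the intervention has already singled out a few (layer, head) pairs: everything is frozen except the attention output-projection rows belonging to those heads, and gradients outside those rows are masked to zero. The head list, learning rate, and masking scheme are hypothetical, not the authors' recipe.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")   # placeholder model
key_heads = [(8, 3), (10, 7)]                          # hypothetical heads from the intervention
head_dim = model.config.n_embd // model.config.n_head

for p in model.parameters():
    p.requires_grad = False                            # freeze everything by default

# Unfreeze only the attention output projections containing the key heads, and build
# masks so updates touch just those heads' rows (GPT-2's Conv1D weight is
# (in_features, out_features); rows are indexed by the concatenated head dimension).
masks = {}
for layer, head in key_heads:
    w = model.transformer.h[layer].attn.c_proj.weight
    w.requires_grad = True
    m = masks.setdefault(id(w), torch.zeros_like(w))
    m[head * head_dim:(head + 1) * head_dim, :] = 1.0

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4, weight_decay=0.0)

def masked_step(loss):
    # Ordinary optimizer step, except gradients outside the selected heads' rows are
    # zeroed, so only those sub-matrices move and the rest of the model is untouched.
    optimizer.zero_grad()
    loss.backward()
    for p in model.parameters():
        if p.requires_grad and p.grad is not None:
            p.grad.mul_(masks[id(p)])
    optimizer.step()
```

Restricting updates to a handful of head sub-matrices is one plausible way to get the reported behavior of improving VRU while limiting catastrophic forgetting, since the vast majority of weights never change.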

