Why MLLMs Struggle to Determine Object Orientations

arXiv cs.CV / 4/16/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper investigates why multimodal large language models (MLLMs) struggle with reasoning about 2D object orientations in images, building on prior hypotheses about visual encoder limitations.
  • Using a controlled empirical protocol, the authors test whether orientation information is preserved in encoder embeddings by training linear regressors on features from SigLIP/ViT and CLIP-based setups in LLaVA and Qwen2.5-VL.
  • Contrary to the null hypothesis, the study finds that object orientation can be recovered accurately from encoder representations using simple linear models.
  • The results contradict the idea that orientation failures primarily stem from the visual encoder being unable to represent geometric orientation.
  • The authors also observe that while orientation information exists, it is distributed diffusely across very large numbers of features, suggesting the issue may lie in how the MLLM's language model exploits or attends to that information, rather than in whether it is encoded.

Abstract

Multimodal Large Language Models (MLLMs) struggle with tasks that require reasoning about 2D object orientation in images, as documented in prior work. Tong et al. and Nichols et al. hypothesize that these failures originate in the visual encoder, since commonly used encoders such as CLIP and SigLIP are trained for image-text semantic alignment rather than geometric reasoning. We design a controlled empirical protocol to test this claim by measuring whether rotations can be recovered from encoder representations. In particular, we examine SigLIP and ViT features from LLaVA OneVision and Qwen2.5-VL-7B-Instruct models, respectively, using full images, and examine CLIP representations in LLaVA 1.5 and 1.6 using rotated foreground patches against natural background images. Our null hypothesis is that orientation information is not preserved in the encoder embeddings, and we test this by training linear regressors to predict object orientation from encoded features. Contrary to the hypothesis, we find that orientation information is recoverable from encoder representations: simple linear models accurately predict object orientations from embeddings. This contradicts the assumption that MLLM orientation failures originate in the visual encoder. Having rejected the accepted hypothesis that MLLMs struggle with 2D orientation tasks because of visual encoder limitations, we still don't know why they fail. Although a full explanation is beyond the scope of this paper, we show that although present, orientation information is spread diffusely across tens of thousands of features. This may or may not be why MLLMs fail to exploit the available orientation information.
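The linear-probe protocol described above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the embeddings here are synthetic stand-ins for real SigLIP/CLIP features, and all dimensions, noise levels, and the ridge regularizer are made-up assumptions. The one domain-specific detail worth showing is that orientation is a circular quantity, so the probe regresses onto (sin θ, cos θ) targets and recovers the angle with `arctan2`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for encoder embeddings: n images, d-dim features.
# Orientation is mixed diffusely into all features (hypothetical setup).
n, d = 2000, 256
theta = rng.uniform(0, 2 * np.pi, n)          # ground-truth orientations

W_true = rng.normal(size=(2, d))               # hidden mixing of sin/cos
X = np.stack([np.sin(theta), np.cos(theta)], axis=1) @ W_true
X += 0.1 * rng.normal(size=X.shape)            # feature noise

# Circular targets: regress onto (sin, cos) instead of raw angles,
# so the probe is not confused by the 0/2*pi wraparound.
Y = np.stack([np.sin(theta), np.cos(theta)], axis=1)

# Closed-form ridge regression: B = (X^T X + lam*I)^-1 X^T Y
lam = 1e-3
B = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Predict and map (sin, cos) back to an angle in [0, 2*pi).
pred = X @ B
theta_hat = np.arctan2(pred[:, 0], pred[:, 1]) % (2 * np.pi)

# Circular absolute error, folded into [0, pi].
err = np.abs((theta_hat - theta + np.pi) % (2 * np.pi) - np.pi)
print(f"mean angular error: {np.degrees(err.mean()):.2f} deg")
```

On real encoder features, a small probe error like the one this toy setup produces is what would support the paper's conclusion that orientation is linearly decodable from the embeddings.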