GeoAlign: Geometric Feature Realignment for MLLM Spatial Reasoning
arXiv cs.CL / 4/15/2026
Key Points
- The paper argues that simply injecting static geometric features taken from a single layer of a 3D foundation model into an MLLM can introduce a “task misalignment bias”: those features remain skewed toward the 3D foundation model’s pretraining objectives rather than the MLLM’s diverse spatial-reasoning needs.
- It introduces GeoAlign, which builds a hierarchical geometric feature bank and uses the MLLM’s own visual tokens as content-aware queries. These queries drive layer-wise sparse routing that dynamically fetches the appropriate geometric features for each image patch.
- Experiments on VSI-Bench, ScanQA, and SQA3D show that the approach improves multimodal spatial reasoning, with a compact 4B-parameter model reaching state-of-the-art results.
- The method can outperform larger existing MLLMs, suggesting that better geometric alignment (via dynamic multi-layer aggregation) may be more important than model size for spatial reasoning tasks.
- Overall, GeoAlign reframes geometric-feature injection as an adaptive alignment problem rather than a one-time feature extraction step, aiming to better match heterogeneous spatial demands during inference.
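The summary does not spell out GeoAlign's actual router, so the following is only a minimal sketch of the general idea of per-patch, layer-wise sparse routing over a hierarchical feature bank. It assumes a bank indexed as `bank[layer][patch]`, assumes the MLLM's visual tokens are already projected to the same dimension as the geometric features, and substitutes a generic top-k softmax gate (the kind used in mixture-of-experts routing) for whatever scoring the paper uses. The names `route_patch` and `route_all` are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_patch(query: Vector, bank: List[List[Vector]], patch: int, k: int = 2) -> Vector:
    """Fuse geometric features for one patch from its top-k layers.

    query: the MLLM visual token for this patch (assumed already projected to dim d).
    bank:  bank[layer][patch] -> geometric feature of dim d (hypothetical layout).
    """
    d = len(query)
    # Content-aware scores: scaled dot product between the visual token
    # and each layer's geometric feature at this patch.
    scores = [dot(query, layer[patch]) / math.sqrt(d) for layer in bank]
    # Sparse routing (assumed top-k gate): keep only the k best-scoring layers.
    top = sorted(range(len(bank)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in top])
    # Weighted sum of the selected layers' features for this patch.
    fused = [0.0] * d
    for w, i in zip(weights, top):
        for j, v in enumerate(bank[i][patch]):
            fused[j] += w * v
    return fused


def route_all(queries: List[Vector], bank: List[List[Vector]], k: int = 2) -> List[Vector]:
    # Each patch picks its own layers, unlike static single-layer injection.
    return [route_patch(q, bank, p, k) for p, q in enumerate(queries)]


# Toy bank: 3 layers x 2 patches x 2-dim features (illustrative values only).
bank = [
    [[1.0, 0.0], [0.0, 1.0]],  # layer 0
    [[0.0, 1.0], [1.0, 0.0]],  # layer 1
    [[0.5, 0.5], [0.5, 0.5]],  # layer 2
]
queries = [[1.0, 0.0], [1.0, 0.0]]

# With k=1, patch 0 selects layer 0 and patch 1 selects layer 1.
print(route_all(queries, bank, k=1))  # [[1.0, 0.0], [1.0, 0.0]]
```

Even with identical queries, the two patches end up routed to different layers, because the scores depend on the features stored at each patch. A static baseline would instead read one fixed layer for every patch. Raising `k` blends several layers per patch, which loosely mirrors the "dynamic multi-layer aggregation" described above.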