GeoAlign: Geometric Feature Realignment for MLLM Spatial Reasoning

arXiv cs.CL · April 15, 2026


Key Points

  • The paper argues that simply injecting static, single-layer geometric features into MLLMs can create a “task misalignment bias” because those features drift toward the 3D foundation model’s pretraining objectives rather than the MLLM’s diverse spatial-reasoning needs.
  • It introduces GeoAlign, which builds a hierarchical geometric feature bank and uses the MLLM's own visual tokens as content-aware queries to perform layer-wise sparse routing, dynamically fetching the appropriate geometric features for each image patch.
  • Experiments on VSI-Bench, ScanQA, and SQA3D show that the proposed approach improves multimodal spatial reasoning performance, with a compact 4B model reaching state-of-the-art results.
  • The method can outperform larger existing MLLMs, suggesting that better geometric alignment (via dynamic multi-layer aggregation) may be more important than model size for spatial reasoning tasks.
  • Overall, GeoAlign reframes geometric-feature injection as an adaptive alignment problem rather than a one-time feature extraction step, aiming to better match heterogeneous spatial demands during inference.
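To make the routing idea concrete, here is a minimal NumPy sketch of what per-patch, layer-wise sparse routing over a multi-layer feature bank might look like. The function name `sparse_route`, the per-layer routing keys, and the top-k selection are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def sparse_route(visual_tokens, feature_bank, layer_keys, top_k=2):
    """Route each image patch to a sparse mix of geometric-feature layers.

    visual_tokens: (P, D) MLLM visual tokens, used as content-aware queries
    feature_bank:  (L, P, D) geometric features from L layers of a 3D model
    layer_keys:    (L, D) per-layer routing keys (hypothetical learnable params)
    """
    logits = visual_tokens @ layer_keys.T              # (P, L) routing scores
    # keep only the top_k layers per patch; mask the rest to -inf
    top = np.argsort(logits, axis=1)[:, -top_k:]       # (P, top_k) layer ids
    masked = np.full_like(logits, -np.inf)
    np.put_along_axis(masked, top,
                      np.take_along_axis(logits, top, axis=1), axis=1)
    # softmax over the surviving layers -> sparse mixing weights
    w = np.exp(masked - masked.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                  # (P, L)
    # per-patch weighted aggregation of the bank
    return np.einsum('pl,lpd->pd', w, feature_bank)

rng = np.random.default_rng(0)
P, L, D = 6, 4, 8  # patches, bank layers, feature dim (toy sizes)
out = sparse_route(rng.normal(size=(P, D)),
                   rng.normal(size=(L, P, D)),
                   rng.normal(size=(L, D)))
print(out.shape)  # (6, 8): one aggregated geometric feature per patch
```

The key design point this sketch captures is that the query comes from the MLLM side, so which bank layers a patch draws from depends on the image content rather than on a fixed extraction layer.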

Abstract

Multimodal large language models (MLLMs) have exhibited remarkable performance in various visual tasks, yet still struggle with spatial reasoning. Recent efforts mitigate this by injecting geometric features from 3D foundation models, but rely on static single-layer extractions. We identify that such an approach induces a task misalignment bias: the geometric features naturally evolve towards 3D pretraining objectives, which may contradict the heterogeneous spatial demands of MLLMs, rendering any single layer fundamentally insufficient. To resolve this, we propose GeoAlign, a novel framework that dynamically aggregates multi-layer geometric features to realign with the actual demands. GeoAlign constructs a hierarchical geometric feature bank and leverages the MLLM's original visual tokens as content-aware queries to perform layer-wise sparse routing, adaptively fetching the suitable geometric features for each patch. Extensive experiments on VSI-Bench, ScanQA, and SQA3D demonstrate that our compact 4B model effectively achieves state-of-the-art performance, even outperforming larger existing MLLMs.