LAST: Leveraging Tools as Hints to Enhance Spatial Reasoning for Multimodal Large Language Models
arXiv cs.CV / 4/14/2026
Key Points
- The paper argues that multimodal LLMs often mishandle complex geometric layouts due to hallucination and imprecision, and proposes using specialized vision tools to supply structured spatial priors.
- It introduces LAST, a tool-augmented spatial reasoning framework that wraps heterogeneous, parameter-rich tool calls into atomic instructions and reusable “spatial skills.”
- LAST uses an extensible interactive sandbox (LAST-Box) that converts low-level tool outputs (e.g., segmentation masks, depth maps) into LLM-consumable multimodal hints such as annotated images and textual descriptions (see the sketch after this list).
- A three-stage progressive training strategy is proposed to help models learn to interpret tool outputs and then become proficient at invoking tools adaptively.
- Experiments across four datasets report that LAST-7B delivers ~20% gains over its backbone model and performs competitively against strong closed-source LLMs on complex spatial reasoning tasks.
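
The paper only summarizes how LAST-Box renders raw tool outputs into hints, but the general pattern (overlay structured output on the image, pair it with a caption) is easy to illustrate. Below is a minimal, hypothetical sketch using OpenCV and NumPy; the function name `mask_to_hint`, the green-contour overlay, and the text template are illustration-only assumptions, not the paper's actual interface:

```python
import numpy as np
import cv2


def mask_to_hint(image: np.ndarray, mask: np.ndarray, depth: np.ndarray,
                 label: str) -> tuple[np.ndarray, str]:
    """Hypothetical LAST-Box-style wrapper: turn a raw segmentation mask
    and depth map into an LLM-consumable hint, i.e., an annotated image
    plus a short textual description."""
    annotated = image.copy()
    if not mask.any():
        return annotated, f"No region found for '{label}'."

    # Outline the masked object in green so the model can visually ground it.
    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    cv2.drawContours(annotated, contours, -1, (0, 255, 0), 2)

    # Compute the object's centroid and median depth for the text hint.
    ys, xs = np.nonzero(mask)
    cx, cy = int(xs.mean()), int(ys.mean())
    median_depth = float(np.median(depth[mask > 0]))
    cv2.putText(annotated, label, (cx, cy),
                cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)

    # Summarize coarse position and depth as a textual description.
    h, w = mask.shape
    horiz = "left" if cx < w / 3 else "right" if cx > 2 * w / 3 else "center"
    text = (f"'{label}' is outlined in green near the {horiz} of the image "
            f"(centroid ({cx}, {cy})), at median depth ~{median_depth:.2f} m.")
    return annotated, text
```

Pairing the overlay with a caption exposes the same spatial fact through both modalities, matching the paper's description of hints as annotated images plus textual descriptions.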