Perceptio: Perception Enhanced Vision Language Models via Spatial Token Generation
arXiv cs.CV · March 20, 2026
📰 News · Models & Research
Key Points
- Perceptio introduces a perception-enhanced LVLM that enables explicit 2D/3D spatial reasoning by emitting spatial tokens (semantic segmentation tokens and depth tokens) during autoregressive generation.
- It tokenizes dense depth with a VQVAE codebook distilled from a monocular teacher and integrates SAM2 semantic segmentation tokens inside the LLM to ground spatial reasoning before answering.
- The approach uses composite depth-token objectives (marker, token, and count losses) and a soft-merging technique to stabilize depth token generation and differentiable reconstruction.
- A multi-task co-training regime across diverse datasets lets the model learn perception tokens for multiple downstream tasks, building on InternVL.
- On benchmarks, Perceptio achieves state-of-the-art results: it boosts segmentation metrics on the RefCOCO series, improves spatial-understanding accuracy by 10.3%, and raises MMBench accuracy by 1.0%, demonstrating that an explicit spatial chain of thought strengthens LVLM grounding.
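The summary does not spell out how the soft-merging step works, but a common way to make a VQVAE-style codebook lookup differentiable is to replace the hard argmax selection with a softmax-weighted average of codebook entries. The sketch below illustrates that idea under stated assumptions; the shapes, temperature parameter, and `soft_merge` helper are illustrative, not Perceptio's actual implementation.

```python
import numpy as np

def soft_merge(logits, codebook, temperature=1.0):
    """Soft-merge sketch: instead of a hard argmax lookup into the depth
    codebook, take a softmax-weighted average of all codebook entries,
    so gradients can flow from the reconstructed depth map back into
    the token logits.
    logits:   (T, K) scores over K depth-codebook entries per token
    codebook: (K, D) codebook embedding vectors
    returns:  (T, D) soft embeddings
    """
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    w = np.exp(z)
    w = w / w.sum(axis=-1, keepdims=True)   # softmax weights over codebook entries
    return w @ codebook                     # weighted mix of embeddings, shape (T, D)

# Toy check: as temperature -> 0, soft-merge approaches the hard lookup.
codebook = np.eye(4)                        # 4 entries, embedding dim D = 4
logits = np.array([[5.0, 0.0, 0.0, 0.0]])   # token strongly prefers entry 0
soft = soft_merge(logits, codebook, temperature=0.1)
hard = codebook[logits.argmax(axis=-1)]
```

At a low temperature the soft embedding collapses onto the single selected codebook entry, while at training temperatures it stays a smooth mixture, which is the usual rationale for this kind of relaxation.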