Perceptio: Perception Enhanced Vision Language Models via Spatial Token Generation
arXiv cs.CV / 3/20/2026
📰 News · Models & Research
Key Points
- Perceptio introduces a perception-enhanced LVLM that enables explicit 2D/3D spatial reasoning by emitting spatial tokens (semantic segmentation tokens and depth tokens) during autoregressive generation (a toy parsing sketch of this interleaving appears after this list).
- It tokenizes dense depth maps with a VQVAE codebook distilled from a monocular depth-estimation teacher and integrates SAM2 semantic segmentation tokens inside the LLM, grounding spatial reasoning before the model answers (see the quantizer sketch below).
- The approach uses composite depth-token objectives (marker, token, and count losses) together with a soft-merging technique that stabilizes depth-token generation and keeps depth reconstruction differentiable (a loss sketch also follows the list).
- A multi-task co-training regime across diverse datasets lets the model learn perception tokens for multiple downstream tasks, building on InternVL.
- On benchmarks, Perceptio achieves state-of-the-art results: it boosts RefCOCO-series segmentation metrics, improves spatial understanding accuracy by 10.3%, and raises MMBench accuracy by 1.0%, indicating that explicit spatial chain-of-thought strengthens LVLM grounding.
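
The interleaving of spatial tokens in the ordinary output stream (first key point) can be pictured with a small parsing sketch. Everything here is illustrative: the marker names `<DEPTH_START>`/`<DEPTH_END>`, the `<DEPTH_k>` code format, and the helper function are hypothetical stand-ins, not Perceptio's actual vocabulary or interface.

```python
# Minimal sketch: separating answer tokens from an embedded depth-token span
# in an LVLM's generated sequence. Token names are hypothetical placeholders.

def split_answer_and_spatial_tokens(generated):
    """Split a generated token list into (answer_tokens, depth_codebook_indices).

    The model is assumed to emit a span such as
        <DEPTH_START> <DEPTH_17> <DEPTH_902> ... <DEPTH_END>
    whose indices select VQVAE codebook entries that are later decoded back
    into a dense depth map.
    """
    answer, depth_codes, inside = [], [], False
    for tok in generated:
        if tok == "<DEPTH_START>":
            inside = True
        elif tok == "<DEPTH_END>":
            inside = False
        elif inside and tok.startswith("<DEPTH_"):
            depth_codes.append(int(tok[len("<DEPTH_"):-1]))  # codebook index
        else:
            answer.append(tok)
    return answer, depth_codes

# Toy generation that grounds spatial reasoning (depth span) before answering.
toks = ["<DEPTH_START>", "<DEPTH_17>", "<DEPTH_902>", "<DEPTH_3>", "<DEPTH_END>",
        "The", "mug", "is", "closer", "."]
print(split_answer_and_spatial_tokens(toks))
# (['The', 'mug', 'is', 'closer', '.'], [17, 902, 3])
```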
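
The depth-token bottleneck and soft merging (second and third key points) can be sketched as a VQVAE-style quantizer: a hard nearest-neighbour lookup yields the discrete depth tokens the LLM emits, while a softmax-weighted mixture of codebook entries keeps the reconstruction path differentiable during training. The class name, shapes, and temperature below are assumptions; the summary does not give Perceptio's exact formulation.

```python
import torch
import torch.nn.functional as F

class SoftDepthQuantizer(torch.nn.Module):
    """Toy VQVAE-style depth tokenizer with soft merging (illustrative only)."""

    def __init__(self, num_codes=1024, dim=256, tau=1.0):
        super().__init__()
        # Codebook assumed to be distilled from a monocular depth teacher.
        self.codebook = torch.nn.Embedding(num_codes, dim)
        self.tau = tau

    def forward(self, z, soft=True):
        # z: (B, N, dim) depth features to be discretized into depth tokens.
        # Squared distances to every codebook entry: (B, N, num_codes).
        sq_dist = ((z.unsqueeze(-2) - self.codebook.weight) ** 2).sum(-1)
        indices = sq_dist.argmin(dim=-1)          # discrete depth tokens
        if soft:
            # "Soft merging": differentiable mixture of codebook entries.
            probs = F.softmax(-sq_dist / self.tau, dim=-1)
            z_q = probs @ self.codebook.weight
        else:
            z_q = self.codebook(indices)          # hard lookup at inference
        return z_q, indices

# Toy usage
quantizer = SoftDepthQuantizer()
feats = torch.randn(2, 16, 256)
z_q, ids = quantizer(feats)
print(z_q.shape, ids.shape)  # torch.Size([2, 16, 256]) torch.Size([2, 16])
```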
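
Similarly, the composite depth-token objective (marker, token, and count losses) might be organized as below. The masks, the soft count term, and the 0.1 weight are illustrative guesses: the summary names the three terms but not their exact definitions or weighting.

```python
import torch
import torch.nn.functional as F

def depth_token_losses(logits, targets, marker_mask, depth_mask, depth_vocab_ids):
    """Illustrative composite loss over one generated sequence.

    logits: (T, V) per-position vocabulary logits; targets: (T,) token ids;
    marker_mask / depth_mask: (T,) {0,1} masks for marker and depth-code positions;
    depth_vocab_ids: vocabulary ids reserved for depth tokens.
    """
    ce = F.cross_entropy(logits, targets, reduction="none")                   # (T,)
    marker_loss = (ce * marker_mask).sum() / marker_mask.sum().clamp(min=1)   # span markers
    token_loss = (ce * depth_mask).sum() / depth_mask.sum().clamp(min=1)      # depth codes
    # Soft count term: expected number of depth tokens vs. the target span length.
    probs = logits.softmax(dim=-1)
    expected_count = probs[:, depth_vocab_ids].sum()
    count_loss = (expected_count - depth_mask.sum()).abs()
    return marker_loss + token_loss + 0.1 * count_loss

# Toy usage with random logits and a 6-token depth span.
T, V = 12, 2048
logits = torch.randn(T, V, requires_grad=True)
targets = torch.randint(0, V, (T,))
marker_mask = torch.zeros(T); marker_mask[[2, 9]] = 1
depth_mask = torch.zeros(T); depth_mask[3:9] = 1
loss = depth_token_losses(logits, targets, marker_mask, depth_mask,
                          torch.arange(1024, 1536))
loss.backward()
print(float(loss))
```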