DeepSeek released 'Thinking-with-Visual-Primitives' framework
Reddit r/LocalLLaMA / 4/30/2026
📰 News · Developer Stack & Infrastructure · Models & Research

DeepSeek, in collaboration with Peking University and Tsinghua University, has released the paper "Thinking with Visual Primitives" along with an open-source repository, introducing a new multimodal reasoning framework. The framework elevates spatial tokens, specifically coordinate points and bounding boxes, into the "minimal units of thought" within the model's chain-of-thought. These tokens are interleaved directly into the reasoning process, so the model can "point" to specific locations in an image while it "thinks." Repository: https://github.com/deepseek-ai/Thinking-with-Visual-Primitives
Key Points
- DeepSeek has released the paper “Thinking with Visual Primitives” and an open-source repository in collaboration with Peking University and Tsinghua University.
- The proposed multimodal reasoning framework treats spatial tokens—such as coordinate points and bounding boxes—as “minimal units of thought” in the model’s chain-of-thought.
- These visual/spatial tokens are interleaved directly during reasoning, allowing the model to point to specific locations in an image while generating its internal reasoning (see the sketch after this list).
- The release provides a new mechanism for grounding visual understanding and reasoning over image content using explicit spatial representations.
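To make the interleaving idea concrete, here is a minimal sketch of what a reasoning trace with embedded spatial primitives might look like, and how the point and box tokens could be parsed back out of it. The `<point ...>` / `<box ...>` token syntax and the `extract_primitives` helper are hypothetical illustrations, not the actual format used by the paper or repository; the real token scheme is defined in the linked repo.

```python
import re
from dataclasses import dataclass

# Hypothetical spatial-primitive syntax; the actual token format used by
# Thinking-with-Visual-Primitives may differ (see the linked repository).
POINT_RE = re.compile(r"<point x=(\d+) y=(\d+)>")
BOX_RE = re.compile(r"<box x1=(\d+) y1=(\d+) x2=(\d+) y2=(\d+)>")

@dataclass
class Point:
    x: int
    y: int

@dataclass
class Box:
    x1: int
    y1: int
    x2: int
    y2: int

def extract_primitives(cot: str):
    """Pull coordinate points and bounding boxes out of a reasoning trace."""
    points = [Point(int(x), int(y)) for x, y in POINT_RE.findall(cot)]
    boxes = [Box(*map(int, m)) for m in BOX_RE.findall(cot)]
    return points, boxes

# Example trace: spatial tokens interleaved with ordinary text tokens,
# letting the model "point" at image regions mid-thought.
trace = (
    "The mug is on the table <box x1=120 y1=80 x2=210 y2=160>, "
    "and its handle faces left <point x=130 y=115>, so the person "
    "nearest the handle can grab it."
)
points, boxes = extract_primitives(trace)
print(points)  # [Point(x=130, y=115)]
print(boxes)   # [Box(x1=120, y1=80, x2=210, y2=160)]
```

The design point this illustrates is that the spatial tokens live inline in the chain-of-thought itself, rather than being produced in a separate grounding pass, so each reasoning step can be tied to an explicit image region.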