DeepSeek released 'Thinking-with-Visual-Primitives' framework
Reddit r/LocalLLaMA / 4/30/2026
📰 News · Developer Stack & Infrastructure · Models & Research

DeepSeek, in collaboration with Peking University and Tsinghua University, has released the paper "Thinking with Visual Primitives" along with an open-source repository, introducing a new multimodal reasoning framework. The framework elevates spatial tokens, specifically coordinate points and bounding boxes, into the "minimal units of thought" within the model's chain-of-thought. These tokens are interleaved directly into the reasoning process, so the model can "point" to specific locations in an image while it "thinks." Repository: https://github.com/deepseek-ai/Thinking-with-Visual-Primitives
Key Points
- DeepSeek has released the paper “Thinking with Visual Primitives” and an open-source repository in collaboration with Peking University and Tsinghua University.
- The proposed multimodal reasoning framework treats spatial tokens—such as coordinate points and bounding boxes—as “minimal units of thought” in the model’s chain-of-thought.
- These visual/spatial tokens are interleaved directly during reasoning, allowing the model to point to specific locations in an image while generating its internal reasoning (see the sketch after this list).
- The release provides a new mechanism for grounding visual understanding and reasoning over image content using explicit spatial representations.
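To make the interleaving idea concrete, here is a minimal sketch of what a reasoning trace with embedded spatial primitives might look like, and how the point and box tokens could be parsed back out of it. The `<point ...>` / `<box ...>` token syntax and the `extract_primitives` helper are hypothetical illustrations, not the actual format used by the paper or repository; the real token scheme is defined in the linked repo.

```python
import re
from dataclasses import dataclass

# Hypothetical spatial-primitive syntax; the actual token format used by
# Thinking-with-Visual-Primitives may differ (see the linked repository).
POINT_RE = re.compile(r"<point x=(\d+) y=(\d+)>")
BOX_RE = re.compile(r"<box x1=(\d+) y1=(\d+) x2=(\d+) y2=(\d+)>")

@dataclass
class Point:
    x: int
    y: int

@dataclass
class Box:
    x1: int
    y1: int
    x2: int
    y2: int

def extract_primitives(cot: str):
    """Pull coordinate points and bounding boxes out of a reasoning trace."""
    points = [Point(int(x), int(y)) for x, y in POINT_RE.findall(cot)]
    boxes = [Box(*map(int, m)) for m in BOX_RE.findall(cot)]
    return points, boxes

# Example trace: spatial tokens interleaved with ordinary text tokens,
# letting the model "point" at image regions mid-thought.
trace = (
    "The mug is on the table <box x1=120 y1=80 x2=210 y2=160>, "
    "and its handle faces left <point x=130 y=115>, so the person "
    "nearest the handle can grab it."
)
points, boxes = extract_primitives(trace)
print(points)  # [Point(x=130, y=115)]
print(boxes)   # [Box(x1=120, y1=80, x2=210, y2=160)]
```

The design point this illustrates is that the spatial tokens live inline in the chain-of-thought itself, rather than being produced in a separate grounding pass, so each reasoning step can be tied to an explicit image region.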