DeepSeek released 'Thinking-with-Visual-Primitives' framework

Reddit r/LocalLLaMA / 4/30/2026

📰 News · Developer Stack & Infrastructure · Models & Research

Key Points

  • DeepSeek has released the paper “Thinking with Visual Primitives” and an open-source repository in collaboration with Peking University and Tsinghua University.
  • The proposed multimodal reasoning framework treats spatial tokens—such as coordinate points and bounding boxes—as “minimal units of thought” in the model’s chain-of-thought.
  • These visual/spatial tokens are interleaved directly during reasoning, allowing the model to point to specific locations in an image while generating its internal reasoning.
  • The release provides a new mechanism for grounding visual understanding and reasoning over image content using explicit spatial representations.

https://preview.redd.it/47r9qee44cyg1.png?width=1450&format=png&auto=webp&s=0d6f9687115be6ff96d0a194d95232ac0413a7e9

DeepSeek, in collaboration with Peking University and Tsinghua University, has released the paper "Thinking with Visual Primitives" along with its open-source repository, introducing a new multimodal reasoning framework. The core approach of this framework is to elevate spatial tokens—specifically coordinate points and bounding boxes—into the "minimal units of thought" within the model's chain-of-thought. These are directly interleaved during the reasoning process, enabling the model to "point" to specific locations within an image while it "thinks."
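To make the interleaving idea concrete, here is a minimal sketch of what a chain-of-thought stream mixing text spans with spatial primitive tokens might look like. All names (`Point`, `Box`, the `<point …>`/`<box …>` token syntax) are illustrative assumptions, not taken from the actual repository.

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical spatial primitives; names and token syntax are
# illustrative, not from the DeepSeek repo.
@dataclass
class Point:
    x: float  # normalized [0, 1] image coordinates
    y: float

@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float

Primitive = Union[Point, Box]

def to_token(p: Primitive) -> str:
    """Serialize a spatial primitive into an inline token for the CoT stream."""
    if isinstance(p, Point):
        return f"<point {p.x:.3f} {p.y:.3f}>"
    return f"<box {p.x1:.3f} {p.y1:.3f} {p.x2:.3f} {p.y2:.3f}>"

# A reasoning trace that interleaves text with spatial tokens,
# letting the model "point" at image regions mid-thought.
trace = [
    "The cup is on the table",
    Box(0.12, 0.40, 0.30, 0.62),
    "and its handle faces left",
    Point(0.14, 0.51),
]

def render(trace) -> str:
    return " ".join(t if isinstance(t, str) else to_token(t) for t in trace)

print(render(trace))
```

The point of the sketch is only the data shape: the spatial tokens are first-class elements of the reasoning sequence rather than a separate grounding output produced after the answer.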

https://github.com/deepseek-ai/Thinking-with-Visual-Primitives

https://preview.redd.it/lt5qu53g0cyg1.png?width=1844&format=png&auto=webp&s=5d6f0a8de6481035faa22c9d57873c51ca97b1fb

submitted by /u/External_Mood4719