https://huggingface.co/dranger003/Phi-4-reasoning-vision-15B-GGUF

You may remember this model: https://huggingface.co/microsoft/Phi-4-reasoning-vision-15B

Phi-4-Reasoning-Vision-15B is a compact open-weight multimodal reasoning model built on the Phi-4-Reasoning language model backbone and the SigLIP-2 vision encoder, using a mid-fusion architecture. The vision encoder first converts images into visual tokens, which are then projected into the language model's embedding space and injected into the pretrained language model. This approach leverages the strengths of both pretrained components while keeping training and inference costs manageable.

The model employs a dynamic-resolution vision encoder producing up to 3,600 visual tokens, enabling the high-resolution image understanding critical for tasks such as GUI grounding and fine-grained document analysis. Bidirectional attention is applied within images (intra-image) to improve spatial reasoning without the overfitting risks observed with broader bidirectional schemes.

Phi-4-Reasoning-Vision-15B is trained with Supervised Fine-Tuning (SFT) on a carefully curated mixture of reasoning and non-reasoning data. Rather than training separate models for each mode, it operates as a single system that can invoke extended chain-of-thought reasoning (using <think> blocks) or answer directly for perception-focused tasks.
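The mid-fusion flow described above can be sketched in a few lines. The dimensions and the single linear projector here are illustrative assumptions (the real model's projector and widths differ); what the sketch shows is the idea: vision tokens are mapped into the LM's embedding space and spliced into its input sequence alongside text embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions, not the real model's.
d_vision, d_model = 1152, 5120          # vision-encoder width -> LM embedding width
n_image_tokens, n_text_tokens = 3600, 16

# Vision encoder output: one embedding per visual token (up to 3,600).
vision_tokens = rng.standard_normal((n_image_tokens, d_vision))

# Projector: maps visual tokens into the language model's embedding space.
W_proj = rng.standard_normal((d_vision, d_model)) * 0.02
projected = vision_tokens @ W_proj      # shape (3600, 5120)

# Text embeddings from the pretrained LM's embedding table (stubbed here).
text_embeds = rng.standard_normal((n_text_tokens, d_model))

# "Injection": the projected visual tokens become part of the LM's input
# sequence, interleaved with (here, simply prepended to) the text embeddings.
lm_input = np.concatenate([projected, text_embeds], axis=0)
print(lm_input.shape)                   # (3616, 5120)
```

The pretrained language model then processes this combined sequence exactly as it would a text-only one, which is why the approach keeps both components' pretraining intact.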
Support for microsoft/Phi-4-reasoning-vision-15B has been merged into llama.cpp
Reddit r/LocalLLaMA / 3/12/2026
📰 News · Developer Stack & Infrastructure · Models & Research
Key Points
- Support for Microsoft Phi-4-Reasoning-Vision-15B has been merged into llama.cpp, enabling use of the model via the library.
- The architecture uses a mid-fusion approach with a SigLIP-2 vision encoder; vision tokens are projected into the language model's embedding space and injected into the pretrained model for multimodal processing.
- It supports high-resolution image understanding with up to 3,600 visual tokens and bidirectional intra-image attention to improve spatial reasoning for tasks like GUI grounding and fine-grained document analysis.
- The model is trained with supervised fine-tuning on a mix of reasoning and non-reasoning data, and operates as a single system: extended chain-of-thought via <think> blocks, or direct inference via <nothink> for perception tasks.
- Training relied on open datasets plus internal Microsoft data, and used around 240 NVIDIA B200 GPUs for 4 days.
- The change is documented via llama.cpp pull request #20168, reflecting a data-centric approach with moderate compute requirements rather than extremely large training scales.
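The intra-image bidirectional attention mentioned in the points above can be illustrated with a toy attention mask: text positions attend causally, while positions inside an image-token span also attend to every other position in the same span. The sequence layout and helper below are a minimal sketch, not the model's actual masking code.

```python
import numpy as np

def build_mask(seq_len: int, image_spans: list[tuple[int, int]]) -> np.ndarray:
    """Boolean mask where mask[i, j] = True means position i may attend to j.

    Base rule: causal (lower-triangular). Within each image span
    [start, end), attention is made bidirectional.
    """
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))  # causal base
    for start, end in image_spans:
        mask[start:end, start:end] = True  # full attention inside the image
    return mask

# Toy layout: 2 text tokens, a 4-token image (positions 2..5), 2 more text tokens.
m = build_mask(8, image_spans=[(2, 6)])
print(m[2, 5])  # True: an early image token sees a later token of the same image
print(m[2, 7])  # False: image tokens still cannot attend to future text
```

Keeping causality everywhere else is what avoids the overfitting risks the post attributes to broader bidirectional schemes: only spatial relations within a single image get the symmetric treatment.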
Related Articles
- Why Regex is Not Enough: Building a Deterministic "Sudo" Layer for AI Agents (Dev.to)
- I Built a Full-Stack App in 5 Minutes with 8080.ai — Here's How (Dev.to)
- Jeff Bezos reportedly wants $100 billion to buy and transform old manufacturing firms with AI (TechCrunch)
- I Shipped 6 Developer Tools in One Day Using an AI Agent Fleet (Dev.to)
- Workflow Builders vs AI Agents: 5 Automation Tools Compared (2026) (Dev.to)