AI Navigate

support for microsoft/Phi-4-reasoning-vision-15B has been merged into llama.cpp

Reddit r/LocalLLaMA / 3/12/2026

📰 News · Developer Stack & Infrastructure · Models & Research

Key Points

  • Support for Microsoft Phi-4-Reasoning-Vision-15B has been merged into llama.cpp, enabling use of the model via the library.
  • The architecture uses a mid-fusion approach with a SigLIP-2 vision encoder; vision tokens are projected into the language model's embedding space and injected into the pretrained model for multimodal processing.
  • It supports high-resolution image understanding with up to 3,600 visual tokens and bidirectional intra-image attention to improve spatial reasoning for tasks like GUI grounding and fine-grained document analysis.
  • The model is trained with supervised fine-tuning on a mix of reasoning and non-reasoning data. It operates as a single system: extended chain-of-thought reasoning via <think> blocks, or direct inference via <nothink> for perception tasks. Training relied on open datasets plus internal Microsoft data, using around 240 NVIDIA B200 GPUs for 4 days.
  • The change is documented via llama.cpp pull request #20168, reflecting a data-centric approach with moderate compute requirements rather than extremely large training scales.

https://huggingface.co/dranger003/Phi-4-reasoning-vision-15B-GGUF
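With GGUF conversions like those above, the model should be runnable through llama.cpp's standard multimodal tooling (`llama-mtmd-cli` with a separate mmproj file). A sketch of the workflow; the quant choice and exact file names here are illustrative, not confirmed from the repo:

```shell
# Fetch a quantized model plus its multimodal projector (file names illustrative)
huggingface-cli download dranger003/Phi-4-reasoning-vision-15B-GGUF \
  --include "*Q4_K_M*.gguf" "*mmproj*.gguf" --local-dir .

# Run an image and a prompt through llama.cpp's multimodal CLI
llama-mtmd-cli -m Phi-4-reasoning-vision-15B-Q4_K_M.gguf \
  --mmproj mmproj-Phi-4-reasoning-vision-15B.gguf \
  --image screenshot.png \
  -p "Locate the Submit button and give its bounding box."
```

`llama-server` accepts the same `-m`/`--mmproj` pair if you prefer an OpenAI-compatible endpoint over the CLI.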

You may remember this model https://huggingface.co/microsoft/Phi-4-reasoning-vision-15B

Phi-4-Reasoning-Vision-15B is a compact open-weight multimodal reasoning model built on the Phi-4-Reasoning language model backbone and the SigLIP-2 vision encoder, using a mid-fusion architecture. In this architecture, the vision encoder first converts images into visual tokens, which are then projected into the language model's embedding space and injected into the pretrained language model. This approach leverages the strengths of both pretrained components while keeping training and inference costs manageable. The model employs a dynamic resolution vision encoder with up to 3,600 visual tokens, enabling high-resolution image understanding critical for tasks such as GUI grounding and fine-grained document analysis. Bidirectional attention is applied within images (intra-image) to improve spatial reasoning without the overfitting risks observed with broader bidirectional schemes.
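As a concrete sketch of that mid-fusion wiring, the two moving parts are (1) a learned projection from the vision encoder's output space into the LM's embedding space, and (2) an attention mask that is bidirectional within the image span but causal over text. The dimensions and token counts below are illustrative placeholders, not the model's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

d_vision, d_model = 1152, 5120        # placeholder sizes, not the real config
n_img, n_txt = 16, 8                  # vision tokens, text tokens

# SigLIP-style encoder output: one embedding per image patch/token
vision_tokens = rng.standard_normal((n_img, d_vision))

# Learned projection into the language model's embedding space
W_proj = rng.standard_normal((d_vision, d_model)) * 0.02
projected = vision_tokens @ W_proj    # (n_img, d_model)

# Text embeddings from the pretrained LM, injected after the image span
text_embeds = rng.standard_normal((n_txt, d_model))
sequence = np.concatenate([projected, text_embeds], axis=0)

# Attention mask: causal by default, bidirectional inside the image block
n = n_img + n_txt
mask = np.tril(np.ones((n, n), dtype=bool))   # True = may attend
mask[:n_img, :n_img] = True                   # intra-image tokens see each other

print(sequence.shape)  # (24, 5120)
```

Restricting the bidirectional region to the intra-image block is what the summary above credits with better spatial reasoning without the overfitting seen in broader bidirectional schemes.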

Phi-4-Reasoning-Vision-15B is trained with Supervised Fine-Tuning (SFT) on a carefully curated mixture of reasoning and non-reasoning data. Rather than training separate models for each mode, the model operates as a single system that can invoke extended chain-of-thought reasoning (using <think>...</think> blocks) for tasks like mathematical and scientific reasoning, or default to direct inference (tagged with <nothink>) for perception-focused tasks such as captioning, object detection, and grounding. The training data consists primarily of meticulously filtered and improved open-source vision-language datasets, supplemented by high-quality domain-specific data from internal Microsoft teams and targeted data acquisitions. This data-centric approach, combined with moderate training compute requirements (240 NVIDIA B200 GPUs for 4 days), distinguishes Phi-4-Reasoning-Vision-15B from models that rely on substantially more training data and compute.
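Because both modes come from a single model, a client only needs to branch on the output tags. A minimal parsing sketch, assuming the <think>/<nothink> conventions described above (the helper name is ours, not part of the model's tooling):

```python
import re

def split_reasoning(output: str):
    """Separate an extended <think> trace from the final answer.

    Returns (reasoning, answer); reasoning is None for direct-inference
    outputs tagged <nothink> (captioning, detection, grounding, etc.).
    """
    m = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    if m:
        return m.group(1).strip(), output[m.end():].strip()
    return None, output.removeprefix("<nothink>").strip()

r, a = split_reasoning("<think>Count the chairs: 3.</think>There are 3 chairs.")
print(a)  # There are 3 chairs.
```

The same function handles a perception-style response such as `"<nothink>A red bus."`, returning `(None, "A red bus.")`.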

submitted by /u/jacek2023