ibm-granite/granite-4.0-3b-vision · Hugging Face

Reddit r/LocalLLaMA / 3/29/2026

📰 News · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • IBM’s Granite-4.0-3B-Vision is a vision-language model tailored for enterprise document extraction, emphasizing chart, table, and semantic key-value pair (KVP) extraction from document images.
  • It is released on Hugging Face as a LoRA adapter built on top of the Granite 4.0 Micro base model, allowing the same deployment to handle both multimodal document understanding (with the adapter) and text-only workloads (without loading the adapter).
  • The model supports structured outputs for charts (e.g., Chart2CSV/Chart2Summary/Chart2Code) and table extraction into formats such as JSON, HTML, or OTSL.
  • It aims to preserve and extend capabilities from Granite-Vision-3.3 2B for seamless adoption, while also supporting general vision-language tasks like image-to-text.
  • The model can be used standalone and integrates with the Docling pipeline to enhance document processing with deeper visual understanding.

Model Summary: Granite-4.0-3B-Vision is a vision-language model (VLM) designed for enterprise-grade document data extraction. It focuses on specialized, complex extraction tasks that ultracompact models often struggle with:

  • Chart extraction: Converting charts into structured, machine-readable formats (Chart2CSV, Chart2Summary, and Chart2Code)
  • Table extraction: Accurately extracting tables with complex layouts from document images to JSON, HTML, or OTSL
  • Semantic Key-Value Pair (KVP) extraction: Extracting values based on key names and descriptions across diverse document layouts
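The three extraction tasks above are all prompt-driven: the same model is steered toward CSV, JSON/HTML/OTSL, or key-value output by the instruction paired with the image. A minimal sketch of how such prompts might be assembled as chat-style messages — the helper name and instruction strings are illustrative assumptions, not the model card's official prompt formats:

```python
# Illustrative only: the task wording and message schema below are assumptions
# modeled on the common chat-template format for vision-language models.

TASK_INSTRUCTIONS = {
    "chart2csv": "Convert the chart in this image to CSV.",
    "table2json": "Extract the table in this image as JSON.",
    "kvp": "Extract the value for the key '{key}' ({description}).",
}

def build_extraction_message(task: str, image_path: str, **kwargs) -> list[dict]:
    """Build a single-turn chat message pairing an image with a task prompt."""
    instruction = TASK_INSTRUCTIONS[task].format(**kwargs)
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": instruction},
            ],
        }
    ]

# Example: semantic KVP extraction driven by a key name plus a description
msg = build_extraction_message(
    "kvp",
    "invoice.png",
    key="invoice_number",
    description="the unique identifier printed near the top of the invoice",
)
```

A message list in this shape would typically be passed through a processor's chat template before generation; consult the model card for the exact prompt conventions each task expects.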

The model is delivered as a LoRA adapter on top of Granite 4.0 Micro, enabling a single deployment to support both multimodal document understanding and text-only workloads — the base model handles text-only requests without loading the adapter. See Model Architecture for details.
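One way to picture the single-deployment pattern described above is a router that attaches the LoRA adapter only when a request carries an image, and serves text-only traffic from the bare base model. This is a hedged sketch under assumed names (the request shape, function, and exact repository IDs are illustrative, not IBM's serving code):

```python
from dataclasses import dataclass
from typing import Optional

# Assumed identifiers: the adapter ID matches this model card; the base-model
# ID is an assumption about the corresponding Granite 4.0 Micro repository.
BASE_MODEL = "ibm-granite/granite-4.0-micro"
VISION_ADAPTER = "ibm-granite/granite-4.0-3b-vision"

@dataclass
class Request:
    text: str
    image: Optional[bytes] = None  # present only for multimodal requests

def select_model_path(req: Request) -> str:
    """Route image-bearing requests to base model + vision LoRA adapter,
    and text-only requests to the base model alone (no adapter loaded)."""
    return VISION_ADAPTER if req.image is not None else BASE_MODEL
```

In a real server the adapter would be applied to the loaded base weights (e.g. via a PEFT-style adapter mechanism or an inference engine's LoRA support) rather than loaded as a separate model; the routing decision itself is the point of the sketch.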

While our focus is on specialized document extraction tasks, the current model preserves and extends the capabilities of Granite-Vision-3.3 2B, ensuring that existing users can adopt it with no changes to their workflow. It continues to support general vision-language tasks such as producing detailed natural-language descriptions from images (image-to-text). The model can be used standalone and also integrates with Docling to enhance document processing pipelines with deep visual understanding capabilities.

submitted by /u/jacek2023