Qianfan-OCR: A Unified End-to-End Model for Document Intelligence
arXiv cs.CV · March 17, 2026
📰 News · Models & Research
Key Points
- Qianfan-OCR is a 4B-parameter end-to-end vision-language model that unifies document parsing, layout analysis, and document understanding in a single architecture.
- It supports direct image-to-Markdown conversion and prompt-driven tasks such as table extraction, chart understanding, document QA, and key information extraction (a hedged usage sketch follows this list).
- It introduces Layout-as-Thought, an optional thinking phase, triggered by think tokens, that generates a structured layout representation before producing the final output, restoring layout grounding.
- It ranks first among end-to-end models on OmniDocBench v1.5 and OlmOCR Bench, shows competitive results on OCRBench, CCOCR, DocVQA, and ChartQA, and posts top averages on public key information extraction benchmarks.
- The model is publicly accessible via the Baidu AI Cloud Qianfan platform.
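The announcement does not include API details, so purely as a rough illustration, here is a minimal sketch of what a prompt-driven image-to-Markdown call might look like against an OpenAI-compatible endpoint. The base URL, the `qianfan-ocr` model identifier, and the `<think>` token convention are all assumptions made for this example, not details confirmed by the paper or the Qianfan platform documentation; only the OpenAI Python client usage itself is standard.

```python
import base64
from openai import OpenAI

# Endpoint and model name are hypothetical -- consult the Qianfan
# platform docs for the actual values before using this.
client = OpenAI(
    base_url="https://qianfan.baidubce.com/v2",  # assumed OpenAI-compatible endpoint
    api_key="YOUR_API_KEY",
)

def image_to_markdown(image_path: str, think: bool = False) -> str:
    """Send a document image and request Markdown output; optionally
    trigger the Layout-as-Thought phase via a think token (the exact
    token syntax is an assumption here)."""
    with open(image_path, "rb") as f:
        data_url = "data:image/png;base64," + base64.b64encode(f.read()).decode()

    prompt = "Convert this document image to Markdown."
    if think:
        # Assumed convention for requesting the optional thinking phase.
        prompt = "<think> " + prompt

    resp = client.chat.completions.create(
        model="qianfan-ocr",  # assumed model identifier
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": data_url}},
                {"type": "text", "text": prompt},
            ],
        }],
    )
    return resp.choices[0].message.content

print(image_to_markdown("invoice.png", think=True))
```

The same chat-style call shape would cover the other prompt-driven tasks (table extraction, document QA, key information extraction) by swapping the text prompt, since the model unifies them in one architecture.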