Qianfan-OCR: A Unified End-to-End Model for Document Intelligence
arXiv cs.CV / 3/17/2026
Key Points
- Qianfan-OCR is a 4B-parameter end-to-end vision-language model that unifies document parsing, layout analysis, and document understanding in a single architecture.
- It supports direct image-to-Markdown output and prompt-driven tasks such as table extraction, chart understanding, document QA, and key information extraction.
- It introduces Layout-as-Thought, an optional reasoning phase triggered by think tokens that generates a structured layout representation before producing the final output, restoring layout grounding.
- It ranks first among end-to-end models on OmniDocBench v1.5 and OlmOCR Bench and shows competitive results on OCRBench, CCOCR, DocVQA, and ChartQA, with top averages on public key information extraction benchmarks.
- The model is publicly accessible via the Baidu AI Cloud Qianfan platform.
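The prompt-driven interface described above can be sketched as a multimodal request builder. This is a minimal illustration only: the model id `qianfan-ocr`, the `<think>` trigger string, and the OpenAI-style message schema are all assumptions for illustration, not the documented Qianfan platform API.

```python
import base64

def build_ocr_request(image_bytes: bytes, task_prompt: str,
                      layout_thinking: bool = False) -> dict:
    """Build a hypothetical multimodal chat payload for a document task.

    layout_thinking=True prepends an assumed think token to trigger the
    optional Layout-as-Thought phase before the final answer.
    """
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    # "<think>" is an assumed trigger token, not confirmed by the article.
    prompt = ("<think>" + task_prompt) if layout_thinking else task_prompt
    return {
        "model": "qianfan-ocr",  # hypothetical model identifier
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text", "text": prompt},
            ],
        }],
    }

# Example: request image-to-Markdown conversion with layout thinking enabled.
req = build_ocr_request(b"\x89PNG...", "Convert this page to Markdown.",
                        layout_thinking=True)
```

The same builder would cover the other prompt-driven tasks (table extraction, document QA, key information extraction) by swapping the task prompt.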