We just open-sourced Qianfan-OCR, a 4B-parameter end-to-end vision-language model for document understanding.
Instead of the typical detect → recognize → LLM pipeline, this model handles OCR, layout analysis, table extraction, formula recognition, chart understanding, and key information extraction — all in one forward pass.
Core idea: Layout-as-Thought
The model can optionally enter a <think> reasoning phase before generating output, in which it reasons about bounding boxes, element types, and reading order. Think of it as Chain-of-Thought, but for document layout. You can toggle it on or off depending on whether you need the extra accuracy or want lower latency.
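When thinking mode is on, the layout reasoning arrives inline in the generated text, so client code usually needs to separate it from the final answer. A minimal sketch, assuming the reasoning is wrapped in literal `<think>…</think>` tags (the exact delimiters and chat-template details may differ in the released model — check the repo):

```python
import re

def split_thinking(output: str) -> tuple[str, str]:
    """Split model output into (reasoning, answer).

    Assumes an optional <think>...</think> block; if absent,
    the whole output is treated as the answer.
    """
    m = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    if not m:
        return "", output.strip()
    reasoning = m.group(1).strip()
    # Everything outside the think block is the final answer
    answer = (output[: m.start()] + output[m.end():]).strip()
    return reasoning, answer
```

With thinking disabled the function simply passes the output through, so the same client path works in both modes.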
Benchmarks:
| Benchmark | Qianfan-OCR (4B) | Notes |
|---|---|---|
| OmniDocBench v1.5 | 93.12 | #1 among end-to-end models |
| OCRBench | 880 | |
| KIE (avg) | 87.9 | Beats Gemini-3.1-Pro & Qwen3-VL-235B |
Practical stuff:
- Single A100 inference: 1.024 pages/sec (W8A8 quantization)
- 192 languages (Latin, Cyrillic, Arabic, South/Southeast Asian, CJK)
- Works with vLLM out of the box
- Trained on 2.85T tokens across 4 stages on 1,024 Kunlun P800 chips
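Since the model targets vLLM out of the box, serving should look roughly like the standard OpenAI-compatible flow. A sketch (flags, quantization options, and the image URL are illustrative; see the repo for the exact launch command):

```shell
# Launch vLLM's OpenAI-compatible server with the released checkpoint
vllm serve baidu/Qianfan-OCR --trust-remote-code

# Send a page image plus an instruction to the chat endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "baidu/Qianfan-OCR",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/page.png"}},
        {"type": "text", "text": "Extract the tables on this page as Markdown."}
      ]
    }]
  }'
```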
Links:
- 🤗 Model: https://huggingface.co/baidu/Qianfan-OCR
- 📄 Tech report: https://arxiv.org/abs/2603.13398
- 💻 Code: https://github.com/baidubce/Qianfan-VL
- 📰 HF Daily Paper: https://huggingface.co/papers/2603.13398
Happy to answer questions about architecture, training, or deployment.