AI Navigate

Qianfan-OCR — 4B end-to-end document AI model: 93.12 on OmniDocBench v1.5, 192 languages, runs on a single A100 with vLLM

Reddit r/LocalLLaMA / 3/19/2026

📰 News · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • Qianfan-OCR is a 4B-parameter end-to-end vision-language model for document understanding that handles OCR, layout analysis, table extraction, formula recognition, chart understanding, and key information extraction in a single forward pass.
  • It introduces Layout-as-Thought with an optional <think> reasoning phase that can be turned on to improve bounding boxes, element types, and reading order, trading off speed for accuracy.
  • Benchmark results include an OmniDocBench v1.5 score of 93.12 (top among end-to-end models), OCRBench 880, and a KIE average of 87.9, beating several larger models.
  • Practical deployment details: inference at 1.024 pages/sec on a single A100 with W8A8 quantization, support for 192 languages, out-of-the-box vLLM compatibility, and training on 2.85T tokens across 4 stages on 1,024 Kunlun P800 chips.

We just open-sourced Qianfan-OCR, a 4B-parameter end-to-end vision-language model for document understanding.

Instead of the typical detect → recognize → LLM pipeline, this model handles OCR, layout analysis, table extraction, formula recognition, chart understanding, and key information extraction — all in one forward pass.

Core idea: Layout-as-Thought

The model can optionally enter a <think> reasoning phase before generating output, where it reasons about bounding boxes, element types, and reading order. Think of it as Chain-of-Thought, but for document layout. You can turn it on/off depending on whether you need the extra accuracy or prefer speed.
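As a rough sketch of what the on/off switch could look like from the client side, here is an OpenAI-style request builder with a per-request thinking toggle. The flag name `enable_thinking` (passed via `chat_template_kwargs`) is an assumption modeled on how other open VLMs expose their reasoning switch, and the prompt text is illustrative; check the model card for the actual mechanism.

```python
# Sketch: toggling the optional <think> layout-reasoning phase per request.
# ASSUMPTION: the "enable_thinking" chat-template flag is hypothetical here;
# the real switch may differ -- see the Qianfan-OCR model card.
import json

def build_request(image_url: str, think: bool) -> str:
    """Build an OpenAI-style chat request for one document page."""
    payload = {
        "model": "Qianfan-OCR",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": "Parse this page into structured text."},
            ],
        }],
        # Accuracy mode: let the model reason about bounding boxes,
        # element types, and reading order before emitting output.
        "chat_template_kwargs": {"enable_thinking": think},
    }
    return json.dumps(payload)

fast = build_request("file:///tmp/page1.png", think=False)      # speed
accurate = build_request("file:///tmp/page1.png", think=True)   # accuracy
```

The trade-off is the same as with chain-of-thought text models: the thinking tokens cost latency per page, so batch jobs on clean documents can leave it off, while messy multi-column scans benefit from turning it on.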

Benchmarks:

| Benchmark | Qianfan-OCR (4B) | Notes |
| --- | --- | --- |
| OmniDocBench v1.5 | 93.12 | #1 among end-to-end models |
| OCRBench | 880 | |
| KIE (avg) | 87.9 | Beats Gemini-3.1-Pro & Qwen3-VL-235B |

Practical stuff:

  • Single A100 inference: 1.024 pages/sec (W8A8 quantization)
  • 192 languages (Latin, Cyrillic, Arabic, South/Southeast Asian, CJK)
  • Works with vLLM out of the box
  • Trained on 2.85T tokens across 4 stages on 1,024 Kunlun P800 chips
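Since the post mentions out-of-the-box vLLM support, serving would presumably look like a standard `vllm serve` invocation. The HF repo id below and the flag values are assumptions for illustration; check the release page for the exact model id and the recommended quantization setting.

```shell
pip install vllm

# Launch vLLM's OpenAI-compatible server.
# ASSUMPTION: "baidu/Qianfan-OCR-4B" is a placeholder repo id, and
# "compressed-tensors" is one common way W8A8 checkpoints are served;
# the actual values are on the model card.
vllm serve baidu/Qianfan-OCR-4B \
    --quantization compressed-tensors \
    --max-model-len 32768 \
    --port 8000
```

Once up, clients talk to it through the usual OpenAI-compatible `/v1/chat/completions` endpoint with the page image attached as an `image_url` content part.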

Links:

Happy to answer questions about architecture, training, or deployment.

submitted by /u/Dear-Cow3657