We just open-sourced Qianfan-OCR, a 4B-parameter end-to-end vision-language model for document understanding.
Instead of the typical detect → recognize → LLM pipeline, this model handles OCR, layout analysis, table extraction, formula recognition, chart understanding, and key information extraction — all in one forward pass.
Core idea: Layout-as-Thought
The model can optionally enter a <think> reasoning phase before generating output, in which it reasons about bounding boxes, element types, and reading order. Think of it as Chain-of-Thought, but for document layout. You can toggle it on or off depending on whether you need the extra accuracy or want lower latency.
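When thinking mode is on, the layout reasoning arrives inline in the generated text, so client code usually needs to separate it from the final answer. A minimal sketch, assuming the reasoning is wrapped in literal `<think>…</think>` tags (the exact delimiters and chat-template details may differ in the released model — check the repo):

```python
import re

def split_thinking(output: str) -> tuple[str, str]:
    """Split model output into (reasoning, answer).

    Assumes an optional <think>...</think> block; if absent,
    the whole output is treated as the answer.
    """
    m = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    if not m:
        return "", output.strip()
    reasoning = m.group(1).strip()
    # Everything outside the think block is the final answer
    answer = (output[: m.start()] + output[m.end():]).strip()
    return reasoning, answer
```

With thinking disabled the function simply passes the output through, so the same client path works in both modes.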
Benchmarks:
| Benchmark | Qianfan-OCR (4B) | Notes |
|---|---|---|
| OmniDocBench v1.5 | 93.12 | #1 among end-to-end models |
| OCRBench | 880 | |
| KIE (avg) | 87.9 | Beats Gemini-3.1-Pro & Qwen3-VL-235B |
Practical stuff:
- Single A100 inference: 1.024 pages/sec (W8A8 quantization)
- 192 languages (Latin, Cyrillic, Arabic, South/Southeast Asian, CJK)
- Works with vLLM out of the box
- Trained on 2.85T tokens across 4 stages on 1,024 Kunlun P800 chips
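Since the model targets vLLM out of the box, serving should look roughly like the standard OpenAI-compatible flow. A sketch (flags, quantization options, and the image URL are illustrative; see the repo for the exact launch command):

```shell
# Launch vLLM's OpenAI-compatible server with the released checkpoint
vllm serve baidu/Qianfan-OCR --trust-remote-code

# Send a page image plus an instruction to the chat endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "baidu/Qianfan-OCR",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/page.png"}},
        {"type": "text", "text": "Extract the tables on this page as Markdown."}
      ]
    }]
  }'
```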
Links:
- 🤗 Model: https://huggingface.co/baidu/Qianfan-OCR
- 📄 Tech report: https://arxiv.org/abs/2603.13398
- 💻 Code: https://github.com/baidubce/Qianfan-VL
- 📰 HF Daily Paper: https://huggingface.co/papers/2603.13398
Happy to answer questions about architecture, training, or deployment.