Qianfan-OCR: A Unified End-to-End Model for Document Intelligence
arXiv cs.CV · March 17, 2026
📰 News · Models & Research
Key Points
- Qianfan-OCR is a 4B-parameter end-to-end vision-language model that unifies document parsing, layout analysis, and document understanding in a single architecture.
- It supports direct image-to-Markdown conversion and prompt-driven tasks such as table extraction, chart understanding, document QA, and key information extraction (a hedged usage sketch follows this list).
- It introduces Layout-as-Thought, an optional thinking phase, triggered by think tokens, that generates a structured layout representation before producing the final output, restoring layout grounding.
- It ranks first among end-to-end models on OmniDocBench v1.5 and OlmOCR Bench, shows competitive results on OCRBench, CCOCR, DocVQA, and ChartQA, and posts top averages on public key information extraction benchmarks.
- The model is publicly accessible via the Baidu AI Cloud Qianfan platform.
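The announcement does not include API details, so purely as a rough illustration, here is a minimal sketch of what a prompt-driven image-to-Markdown call might look like against an OpenAI-compatible endpoint. The base URL, the `qianfan-ocr` model identifier, and the `<think>` token convention are all assumptions made for this example, not details confirmed by the paper or the Qianfan platform documentation; only the OpenAI Python client usage itself is standard.

```python
import base64
from openai import OpenAI

# Endpoint and model name are hypothetical -- consult the Qianfan
# platform docs for the actual values before using this.
client = OpenAI(
    base_url="https://qianfan.baidubce.com/v2",  # assumed OpenAI-compatible endpoint
    api_key="YOUR_API_KEY",
)

def image_to_markdown(image_path: str, think: bool = False) -> str:
    """Send a document image and request Markdown output; optionally
    trigger the Layout-as-Thought phase via a think token (the exact
    token syntax is an assumption here)."""
    with open(image_path, "rb") as f:
        data_url = "data:image/png;base64," + base64.b64encode(f.read()).decode()

    prompt = "Convert this document image to Markdown."
    if think:
        # Assumed convention for requesting the optional thinking phase.
        prompt = "<think> " + prompt

    resp = client.chat.completions.create(
        model="qianfan-ocr",  # assumed model identifier
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": data_url}},
                {"type": "text", "text": prompt},
            ],
        }],
    )
    return resp.choices[0].message.content

print(image_to_markdown("invoice.png", think=True))
```

The same chat-style call shape would cover the other prompt-driven tasks (table extraction, document QA, key information extraction) by swapping the text prompt, since the model unifies them in one architecture.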