We run an open document AI benchmark: 20 models, 9,000+ real documents. We just added all four Qwen3.5 sizes (0.8B to 9B), and we now have per-task breakdowns for every model. You can see the results here: idp-leaderboard.org

Where Qwen wins or matches:

OlmOCR (text extraction from messy scans, dense PDFs, multi-column layouts):
Qwen3.5-9B: 78.1
The 9B and 4B are ahead of every frontier model on raw text extraction. The 2B matches GPT-5.4.

VQA (answering questions about document content, charts, tables):
Gemini 3.1 Pro: 85.0
This one surprised us the most. The 9B is second only to Gemini 3.1 Pro on VQA. It edges past GPT-5.4. It is 14 points ahead of Claude Sonnet and 16 points ahead of Gemini Flash. For a 9B open model, that VQA score is hard to explain.

KIE (extracting invoice numbers, dates, amounts):
Gemini 3 Flash: 91.1
Qwen-9B matches Gemini 3.1 Pro. Qwen-4B matches GPT-5.4. Both are ahead of GPT-5-Mini (85.7), Claude Haiku (85.6), and Ministral-8B (85.7). A 4B model doing production-grade field extraction.

Where frontier models are clearly better:

Table extraction (GrITS):
Gemini 3.1 Pro: 96.4
Frontier models score 85 to 96 on tables. Qwen is stuck at 76 to 77 regardless of size. The 4B and 9B are essentially identical. This looks like an architecture limit, not a scale limit.

Handwriting OCR:
Gemini 3.1 Pro: 82.8
Gemini dominates handwriting. Qwen is behind GPT-5.4, but not drastically (69.1 vs 65.5).

Scaling within the Qwen family:
Overall: 0.8B 58.0, 2B 63.2, 4B 73.1, 9B 77.0

Summary:
OCR extraction: Qwen 4B/9B ahead of all frontier models

Every prediction is visible. Compare Qwen outputs against any model on the same documents.
Qwen3.5-9B on document benchmarks: where it beats frontier models and where it doesn't.
Reddit r/LocalLLaMA / 3/16/2026
📰 News | Models & Research
Key Points
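- An open document AI benchmark evaluated 20 models on more than 9,000 real documents; all Qwen3.5 sizes were added and per-task breakdowns were published.
- On raw text extraction, Qwen3.5-9B and Qwen3.5-4B outperform every frontier model, and the 2B is roughly on par with GPT-5.4.
- On VQA, Qwen3.5-9B ranks second only to Gemini 3.1 Pro, edging past GPT-5.4 and finishing well ahead of Claude Sonnet 4.6 and Gemini Flash.
- On KIE (extracting invoice numbers, dates, and amounts), Qwen3.5-9B scores 86.5 and Qwen3.5-4B scores 86.0, roughly level with Gemini 3.1 Pro and GPT-5.4 respectively, but behind Gemini 3 Flash (91.1).
- On table extraction (GrITS), frontier models score high while Qwen stays at 76.6 to 76.7 regardless of size, which is presumed to be an architectural limit rather than a matter of scale.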
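
The overall scores quoted for the Qwen3.5 family (0.8B 58.0, 2B 63.2, 4B 73.1, 9B 77.0) can be turned into per-step gains to see where scaling pays off. The following is a minimal Python sketch that uses only those four quoted numbers; the names `overall` and `step_gains` are illustrative and not part of the benchmark's tooling.

```python
# Overall scores as quoted in the post (model size -> overall score).
overall = {"0.8B": 58.0, "2B": 63.2, "4B": 73.1, "9B": 77.0}


def step_gains(scores):
    """Return (smaller, larger, gain) for each adjacent pair of model sizes.

    Relies on the dict preserving insertion order (smallest to largest).
    """
    sizes = list(scores)
    return [
        (small, large, round(scores[large] - scores[small], 1))
        for small, large in zip(sizes, sizes[1:])
    ]


gains = step_gains(overall)
for small, large, gain in gains:
    print(f"{small} -> {large}: +{gain}")
```

On these figures the largest jump is from 2B to 4B (+9.9), while going from 4B to 9B adds 3.9 points overall.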