AI Navigate

Mistral Small 4 vs Qwen3.5-9B on document understanding benchmarks, but it does better than GPT-4.1

Reddit r/LocalLLaMA / 3/20/2026


Key Points

  • On the document tasks leaderboard, Qwen3.5-9B wins 10 of 14 sub-benchmarks while Mistral Small 4 wins 2 and two are ties; overall Qwen ranks #9 with 77.0 and Mistral #11 with 71.5.
  • In the OlmOCR Bench, Qwen outperforms Mistral across all sub-categories (78.1 vs 69.6), with the largest gap in math OCR (85.5 vs 66) and both models showing weak absent detection (57.2 vs 44.7).
  • OmniDocBench results are very close (76.7 vs 76.4); Mistral leads on table-structure metrics (TEDS 75.1 vs 73.9; TEDS-S 82.7 vs 77.6) while Qwen takes the CDM and read order tasks.
  • In IDP Core Bench, Qwen dominates (76.2 vs 68.5; KIE 86.5 vs 78.3; OCR 65.5 vs 57.4), indicating broader strength for Qwen across metrics.
  • A notable takeaway: a 9B dense model can outperform a 119B MoE on these document tasks, underscoring that parameter count isn’t everything. The post also discusses NVFP4 4-bit quantization as a practical path to running the model locally, with caveats about vision quality under aggressive compression.

Ran Mistral Small 4 through some document tasks via the Mistral API and wanted to see where it actually lands.

This leaderboard does head-to-head comparisons on document tasks:
https://www.idp-leaderboard.org/compare/?models=mistral-small-4,qwen3-5-9b
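For context, "running document tasks via the Mistral API" boils down to pairing an OCR-style prompt with a base64-encoded page image in a chat request. A minimal sketch, assuming the `mistralai` Python client; the model name, prompt text, and helper name are placeholders, not anything from the post:

```python
import base64
import os

def page_to_message(image_bytes: bytes,
                    prompt: str = "Extract all text from this page as markdown."):
    """Build a chat message pairing an OCR prompt with a base64 page image."""
    data_url = "data:image/png;base64," + base64.b64encode(image_bytes).decode()
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": data_url},
        ],
    }

# Sending it (requires the `mistralai` package and an API key):
# from mistralai import Mistral
# client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])
# resp = client.chat.complete(model="mistral-small-latest",  # placeholder model id
#                             messages=[page_to_message(png_bytes)])
# print(resp.choices[0].message.content)
```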

The short version: Qwen3.5-9B wins 10 out of 14 sub-benchmarks. Mistral wins 2. Two ties. Qwen is rank #9 with 77.0, Mistral is rank #11 with 71.5.

OlmOCR Bench: Qwen 78.1, Mistral 69.6. Qwen wins every sub-category. The math OCR gap is the biggest, 85.5 vs 66. Absent detection is bad on both (57.2 vs 44.7) but Mistral is worse.

OmniDocBench: closest of the three, 76.7 vs 76.4. Mistral actually wins on table structure metrics, TEDS at 75.1 vs 73.9 and TEDS-S at 82.7 vs 77.6. Qwen takes CDM and read order.
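For anyone unfamiliar with the table metrics above: TEDS scores a predicted table against the ground truth by tree edit distance over their HTML trees, normalized by tree size (TEDS-S is the structure-only variant that ignores cell content). A minimal sketch of the normalization step, assuming the edit distance and node counts are already computed:

```python
def teds(edit_distance: int, n_pred: int, n_gt: int) -> float:
    """Tree-Edit-Distance-based Similarity: 1 - dist / max(tree sizes).

    `edit_distance` is the tree edit distance between the predicted and
    ground-truth table trees; `n_pred`/`n_gt` are their node counts.
    """
    return 1.0 - edit_distance / max(n_pred, n_gt)

# e.g. 10 edits between a 40-node prediction and a 38-node ground truth:
# teds(10, 40, 38) -> 0.75
```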

IDP Core Bench: Qwen 76.2, Mistral 68.5. KIE is 86.5 vs 78.3, OCR is 65.5 vs 57.4. Qwen across the board.

The radar charts tell the story visually. Qwen's is larger and spikier, peaking at 84.7 on text extraction. Mistral's is a smaller, tighter hexagon: everything sits between 75.5 and 78.3, less than 3 points of spread. High floor, low ceiling.

Worth noting this is a 9B dense model beating a 119B MoE (6B active). Parameter count obviously isn't everything for document tasks.

One thing I'm curious about is the NVFP4 quant. The model is 242GB at full precision, and Mistral released a 4-bit quantized checkpoint. For anyone who wants to run this locally, quantization is the only realistic path unless you have 4xH100s. But I don't know whether the vision capabilities survive that compression. The benchmarks above were run at full precision via the API.
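To make the 4-bit idea concrete, here's a toy NumPy sketch of NVFP4-style quantization: weights are split into small blocks, each block gets a shared scale, and scaled values snap to the E2M1 grid of representable 4-bit floats. The real format also stores the block scales in FP8 plus a tensor-level scale, which this simplification ignores, and the sizes in the comments are back-of-envelope numbers matching the post, not measured figures:

```python
import numpy as np

# Magnitudes representable in E2M1 (2 exponent bits, 1 mantissa bit);
# the sign bit is handled separately.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_nvfp4(w: np.ndarray, block_size: int = 16) -> np.ndarray:
    """Fake-quantize a flat weight vector: per-block scale + E2M1 rounding.

    Returns the dequantized values, i.e. what the model would actually see.
    Assumes len(w) is a multiple of block_size.
    """
    out = np.empty_like(w, dtype=np.float64)
    for start in range(0, len(w), block_size):
        block = w[start:start + block_size].astype(np.float64)
        amax = np.abs(block).max()
        scale = amax / 6.0 if amax > 0 else 1.0  # map the largest value to 6.0
        scaled = block / scale
        # snap each magnitude to the nearest representable E2M1 value
        idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
        out[start:start + block_size] = np.sign(scaled) * E2M1_GRID[idx] * scale
    return out

# Rough memory math behind the post's numbers (hypothetical, order-of-magnitude):
#   119e9 params * 2 bytes (bf16)  ~ 238 GB  -> the ~242 GB full checkpoint
#   119e9 params * 0.5 bytes (fp4) + block scales ~ 60-65 GB
```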

Anyone running the NVFP4 quant for doc tasks? Curious whether the vision quality holds up.

submitted by /u/shhdwi