Multimodal OCR: Parse Anything from Documents
arXiv cs.CV · March 16, 2026
Tags: News · Tools & Practical Usage · Models & Research
Key Points
- The paper introduces dots.mocr, a multimodal OCR system that jointly parses text and graphics into unified textual representations, treating charts, diagrams, tables, and icons as first-class parsing targets.
- The approach enables end-to-end training over heterogeneous document elements and converts graphical regions into reusable code-level supervision for multimodal learning.
- The authors build a data engine from PDFs, rendered webpages, and native SVG assets and train a compact 3B-parameter model through staged pretraining and supervised fine-tuning.
- In evaluations, dots.mocr ranks near the top on document parsing benchmarks (second only to Gemini 3 Pro on OCR Arena Elo) and sets a new state of the art of 83.9 on olmOCR Bench, while also outperforming prior systems on structured graphics parsing in image-to-SVG tasks.
- The work demonstrates a scalable path toward large-scale image-to-code corpora for multimodal pretraining, with code and models publicly available.
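The key points describe turning graphical regions into "code-level supervision", with SVG as the textual target for structured graphics. The paper's exact serialization is not given here, so the following is only a minimal Python sketch, under the assumption that a parsed chart region is rendered as a compact SVG string that can serve as a text target for a multimodal model; the function name and layout parameters are illustrative.

```python
# Hypothetical sketch: serializing a simple bar chart into an SVG string,
# the kind of image-to-code target the key points describe. The actual
# dots.mocr output format is an assumption here, not taken from the paper.

def bars_to_svg(values, width=200, height=100, bar_gap=4):
    """Turn a list of bar heights into a minimal SVG string (a code-level target)."""
    n = len(values)
    bar_w = (width - bar_gap * (n - 1)) / n  # equal-width bars with fixed gaps
    vmax = max(values)
    rects = []
    for i, v in enumerate(values):
        h = height * v / vmax          # scale tallest bar to full height
        x = i * (bar_w + bar_gap)      # left edge of bar i
        y = height - h                 # SVG y-axis grows downward
        rects.append(
            f'<rect x="{x:.1f}" y="{y:.1f}" width="{bar_w:.1f}" height="{h:.1f}"/>'
        )
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
        + "".join(rects)
        + "</svg>"
    )

print(bars_to_svg([3, 7, 5]))
```

A target like this is attractive for pretraining because it is both renderable (the supervision can be verified by rasterizing it back to pixels) and plain text, so it fits the same next-token objective as OCR text.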