Multimodal OCR: Parse Anything from Documents
arXiv cs.CV / 3/16/2026
Key Points
- The paper introduces dots.mocr, a multimodal OCR system that jointly parses text and graphics into unified textual representations, treating charts, diagrams, tables, and icons as first-class parsing targets.
- The approach enables end-to-end training over heterogeneous document elements and converts graphical regions into reusable code-level supervision for multimodal learning.
- The authors build a data engine from PDFs, rendered webpages, and native SVG assets and train a compact 3B-parameter model through staged pretraining and supervised fine-tuning.
- In evaluations, dots.mocr ranks near the top on document parsing benchmarks (second only to Gemini 3 Pro on OCR Arena Elo) and sets a new state of the art of 83.9 on olmOCR Bench, while also outperforming prior systems on structured graphics parsing in image-to-SVG tasks.
- The work demonstrates a scalable path toward large-scale image-to-code corpora for multimodal pretraining, with code and models publicly available.
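To make the "unified textual representation" idea concrete, here is a minimal illustrative sketch (not the authors' code; all function names and the output format are assumptions) of how a parsed page might interleave ordinary OCR text with code-level targets such as SVG for a chart region:

```python
# Hypothetical sketch: build one training target that mixes OCR'd text
# with SVG code standing in for a parsed chart region, in the spirit of
# the image-to-code supervision described for dots.mocr.

def svg_bar_chart(values, width=200, height=100):
    """Render bar values as a minimal SVG string (assumed target format)."""
    n = len(values)
    bar_w = width / n
    peak = max(values)
    bars = []
    for i, v in enumerate(values):
        h = v / peak * height  # scale tallest bar to full height
        bars.append(
            f'<rect x="{i * bar_w:.0f}" y="{height - h:.0f}" '
            f'width="{bar_w * 0.8:.0f}" height="{h:.0f}"/>'
        )
    return f'<svg width="{width}" height="{height}">' + "".join(bars) + "</svg>"

def unified_page(text_blocks, chart_values):
    """Join OCR text blocks and a chart's SVG code into one textual target."""
    parts = list(text_blocks)
    parts.append(svg_bar_chart(chart_values))
    return "\n".join(parts)

page = unified_page(["# Quarterly Report", "Revenue grew 12% YoY."], [3, 7, 5])
```

In this sketch, a downstream model trained on such targets would learn to emit both prose and renderable graphics code from a single page image, which is what makes the corpus reusable as "code-level supervision."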