AtlasOCR: Building the First Open-Source Darija OCR Model with Vision Language Models

arXiv cs.CV / 4/10/2026


Key Points

  • The paper presents AtlasOCR, described as the first open-source OCR model tailored specifically for Darija (Moroccan Arabic), built by fine-tuning a 3B-parameter Vision Language Model.
  • It details a data pipeline combining Darija-specific dataset curation with synthetic text generation (via the authors’ OCRSmith library) plus carefully sourced real-world samples.
  • The authors use parameter-efficient fine-tuning (QLoRA) with Unsloth to train Qwen2.5-VL 3B, and run ablation studies to tune key training hyperparameters.
  • AtlasOCR is evaluated on a new benchmark (AtlasOCRBench) and the established KITAB-Bench, where it reportedly achieves state-of-the-art results and demonstrates strong generalization across Darija and standard Arabic OCR tasks.
  • The work positions the model as competitive with larger OCR systems, emphasizing robustness and transferability rather than relying solely on scale.
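
To make the parameter-efficiency point above concrete, here is a small arithmetic sketch of why LoRA adapters on top of a 4-bit base model are cheap to train. The layer shape (2048 x 2048), the rank (16), and all helper names are illustrative assumptions, not values reported in the paper. The memory estimate counts weights only and ignores quantization constants, activations, and optimizer state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearShape:
    """Shape of one weight matrix that receives an adapter (hypothetical values)."""
    d_in: int
    d_out: int


def full_params(shape: LinearShape) -> int:
    """Parameters updated if the whole matrix were fine-tuned."""
    return shape.d_in * shape.d_out


def lora_params(shape: LinearShape, rank: int) -> int:
    """Trainable parameters of a LoRA adapter for one matrix.

    LoRA learns two low-rank factors, A (rank x d_in) and B (d_out x rank),
    while the base matrix stays frozen.
    """
    return rank * (shape.d_in + shape.d_out)


def weight_memory_gb(n_params: int, bits_per_param: int) -> float:
    """Approximate storage for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    proj = LinearShape(d_in=2048, d_out=2048)  # assumed projection size
    rank = 16                                  # assumed LoRA rank

    print(full_params(proj), lora_params(proj, rank))      # 4194304 65536
    print(lora_params(proj, rank) / full_params(proj))     # 0.015625

    # A 3B-parameter base model stored in 16-bit versus 4-bit precision.
    print(weight_memory_gb(3_000_000_000, 16))             # 6.0
    print(weight_memory_gb(3_000_000_000, 4))              # 1.5
```

Under these assumptions the adapter trains about 1.6% of the parameters of each adapted matrix, and 4-bit storage cuts the frozen base weights to a quarter of their 16-bit size. This is the general mechanism behind QLoRA; the specific ranks and target layers used for AtlasOCR are not given in this summary.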

Abstract

Darija, the Moroccan Arabic dialect, is rich in visual content yet lacks specialized Optical Character Recognition (OCR) tools. This paper introduces AtlasOCR, the first open-source Darija OCR model built by fine-tuning a 3B parameter Vision Language Model (VLM). We detail our comprehensive approach, from curating a unique Darija-specific dataset leveraging both synthetic generation with our OCRSmith library and carefully sourced real-world data, to implementing efficient fine-tuning strategies. We utilize QLoRA and Unsloth for parameter-efficient training of Qwen2.5-VL 3B and present comprehensive ablation studies optimizing key hyperparameters. Our evaluation on the newly curated AtlasOCRBench and the established KITAB-Bench demonstrates state-of-the-art performance, challenging larger models and highlighting AtlasOCR's robustness and generalization capabilities for both Darija and standard Arabic OCR tasks.
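
The abstract does not say which metrics the benchmarks report. OCR evaluations commonly use character error rate (CER) and word error rate (WER), so the sketch below shows how those two would be computed for a prediction against a reference transcription. The function names are hypothetical, and NFC normalization is an assumed preprocessing choice, not something the paper specifies.

```python
import unicodedata
from typing import Sequence


def levenshtein(ref: Sequence, hyp: Sequence) -> int:
    """Minimum number of insertions, deletions, and substitutions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def _normalize(text: str) -> str:
    # Assumption: compare NFC-normalized text so equivalent Unicode
    # encodings of the same Arabic string are not counted as errors.
    return unicodedata.normalize("NFC", text)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    ref, hyp = _normalize(reference), _normalize(hypothesis)
    if not ref:
        raise ValueError("reference must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference word count."""
    ref_words = _normalize(reference).split()
    hyp_words = _normalize(hypothesis).split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return levenshtein(ref_words, hyp_words) / len(ref_words)


if __name__ == "__main__":
    print(cer("kitab", "kitap"))   # 0.2 (one substitution over five characters)
    print(wer("a b c", "a x c"))   # one wrong word out of three
```

Both metrics are lower-is-better, and CER can exceed 1.0 when the prediction is much longer than the reference. A benchmark score would typically average these values over all test images.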