DharmaOCR: Specialized Small Language Models for Structured OCR that outperform Open-Source and Commercial Baselines

arXiv cs.CL / April 17, 2026

Key Points

  • The paper introduces DharmaOCR Full (7B) and DharmaOCR Lite (3B), specialized small language models for structured OCR that jointly target transcription quality, stable generation, and low inference cost.
  • It proposes DharmaOCR-Benchmark and a unified evaluation protocol that measures not only fidelity/structure but also text degeneration as a first-class metric, alongside unit cost.
  • Using Direct Preference Optimization (DPO) for OCR with degenerate generations as rejected examples helps reduce degeneration rates (up to 87.6% relative) while maintaining or improving extraction quality.
  • The models achieve new state-of-the-art results on DharmaOCR-Benchmark, scoring 0.925 (Full) and 0.911 (Lite) with very low degeneration rates (0.40% and 0.20%), and AWQ quantization cuts per-page cost by up to 22% with negligible quality loss.
  • The study argues that degeneration is not merely an accuracy issue: abnormally long generations have real production impact, increasing response time, reducing throughput, and inflating compute cost.
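The paper's training code and exact degeneration detector are not shown here, but the core idea of using degenerate generations as DPO rejected examples can be sketched as follows. The repeated-n-gram heuristic and all function names are assumptions for illustration, not the authors' implementation:

```python
from collections import Counter

def is_degenerate(text: str, ngram: int = 4, threshold: float = 0.3) -> bool:
    """Flag looping output: if a large share of word n-grams repeat,
    the generation is likely stuck in a repetition loop.
    (Heuristic assumed for illustration; the paper's detector may differ.)"""
    words = text.split()
    if len(words) < ngram * 2:
        return False
    grams = [tuple(words[i:i + ngram]) for i in range(len(words) - ngram + 1)]
    counts = Counter(grams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(grams) > threshold

def build_dpo_pairs(samples):
    """Turn (prompt, candidate transcriptions, reference) triples into
    DPO preference pairs: the reference transcription is 'chosen' and
    each degenerate candidate becomes a 'rejected' example, so DPO
    explicitly penalizes looping behavior."""
    pairs = []
    for prompt, candidates, reference in samples:
        for cand in candidates:
            if is_degenerate(cand):
                pairs.append({"prompt": prompt,
                              "chosen": reference,
                              "rejected": cand})
    return pairs
```

Pairs in this (prompt, chosen, rejected) shape are the standard input format for off-the-shelf DPO trainers, which is presumably why mining degenerate outputs as rejections slots in without changing the training loop.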

Abstract

This manuscript introduces DharmaOCR Full and Lite, a pair of specialized small language models (SSLMs) for structured OCR that jointly optimize transcription quality, generation stability, and inference cost. It also presents DharmaOCR-Benchmark, which covers printed, handwritten, and legal/administrative documents, together with a unified evaluation protocol that measures fidelity and structure while explicitly tracking text degeneration as a first-class metric, alongside unit cost. Beyond reporting degeneration rates, the manuscript shows empirically that degeneration is not merely a quality failure: it materially worsens production performance by increasing response time, reducing throughput, and inflating computational cost through abnormally long generations. As a methodological contribution, and to the best of the author's knowledge for the first time, Direct Preference Optimization (DPO) is applied to OCR, explicitly using degenerate generations as rejected examples to penalize looping behavior. Combined with Supervised Fine-Tuning (SFT) that enforces a strict JSON schema (header, margin, footer, and text), DPO consistently reduces the degeneration rate across model families (by up to 87.6% relative) while preserving or improving extraction quality. The resulting models, DharmaOCR Full (7B) and DharmaOCR Lite (3B), set a new state of the art on DharmaOCR-Benchmark, outperforming every open-source and commercial baseline evaluated in extraction quality, reaching scores of 0.925 and 0.911 with degeneration rates of 0.40% and 0.20%, respectively. AWQ quantization further reduces per-page cost by up to 22% with negligible quality loss, yielding a strong quality-cost trade-off compared with proprietary OCR APIs and open-source alternatives.
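The abstract names the strict per-page JSON schema the SFT stage enforces (keys: header, margin, footer, text). A minimal sketch of what schema-checking a generation could look like, with illustrative field contents and a hypothetical `validate_page` helper (the paper does not publish its validator):

```python
import json

# The four top-level keys stated in the paper's abstract.
REQUIRED_KEYS = {"header", "margin", "footer", "text"}

def validate_page(raw: str) -> dict:
    """Parse a model generation and check it against the strict schema:
    exactly the keys header, margin, footer, text, no more, no fewer.
    Raises ValueError (or json.JSONDecodeError) on malformed output."""
    page = json.loads(raw)
    missing = REQUIRED_KEYS - page.keys()
    extra = page.keys() - REQUIRED_KEYS
    if missing or extra:
        raise ValueError(f"schema mismatch: missing={missing}, extra={extra}")
    return page

# Illustrative well-formed page (contents are invented, not from the paper).
example = json.dumps({
    "header": "Official letterhead text",
    "margin": "",
    "footer": "Page 1 of 3",
    "text": "Transcribed body of the document.",
})
```

Fixing the output to a closed key set like this is also what makes degeneration cheap to measure: any generation that fails to parse into this schema, or runs abnormally long, can be flagged automatically.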