Adapting TrOCR for Printed Tigrinya Text Recognition: Word-Aware Loss Weighting for Cross-Script Transfer Learning
arXiv cs.CV / 4/23/2026
Key Points
- The paper presents the first adaptation of TrOCR, a Transformer-based OCR model, to printed Tigrinya text written in the Ge'ez (Ethiopic) script.
- It extends the pre-trained TrOCR tokenizer with a byte-level BPE vocabulary covering 230 Ge'ez characters, and notes that the unmodified model yields unusable results on Ge'ez text.
- To fix systematic word-boundary errors caused by Latin-centric tokenization conventions, the authors introduce a Word-Aware Loss Weighting method.
- After adaptation, the TrOCR-Printed model reaches 0.22% Character Error Rate (CER) and 97.20% exact match accuracy on 5,000 synthetic test images from the GLOCR dataset.
- An ablation study shows Word-Aware Loss Weighting is the key improvement, cutting CER by two orders of magnitude compared with vocabulary extension alone.
- The full training pipeline runs in under three hours on a single 8 GB consumer GPU, and the work is publicly released.
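
For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how Character Error Rate and exact match accuracy are commonly computed. It is not the authors' evaluation script: the function names `edit_distance` and `evaluate` are hypothetical, the sample strings are invented for illustration, and the corpus-level normalization (total edits divided by total reference characters) is an assumption, since the summary does not say how the paper aggregates CER.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in Unicode code points."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def evaluate(references, hypotheses):
    """Return (CER, exact match accuracy) as fractions in [0, 1].

    Assumption: CER is aggregated at corpus level, i.e. total edits
    divided by total reference characters.
    """
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    exact = sum(r == h for r, h in zip(references, hypotheses))
    return total_edits / total_chars, exact / len(references)


# Invented example: the second prediction has one substituted character.
references = ["ሰላም ዓለም", "ትግርኛ"]
hypotheses = ["ሰላም ዓለም", "ትግርና"]
cer, exact_match = evaluate(references, hypotheses)
print(f"CER: {cer:.2%}, exact match: {exact_match:.2%}")
```

Under this definition, the reported 97.20% exact match accuracy on 5,000 test images corresponds to 4,860 images transcribed with no errors at all.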