HalalBench: A Multilingual OCR Benchmark for Food Packaging Ingredient Extraction

arXiv cs.CV / 4/28/2026

Key Points

  • HalalBench is a new open multilingual OCR benchmark focused specifically on food packaging ingredient label extraction, addressing the lack of standardized evaluation for this use case.
  • The benchmark includes 1,043 images (50 real and 993 synthetic) with 36,438 COCO-format annotations across 14 languages, reflecting real-world challenges like curved packaging surfaces and dense multilingual text.
  • Four OCR engines were evaluated. The three with reported overall scores are docTR (F1=0.193), ML Kit (0.180), and EasyOCR (0.167), and all engines fail completely on Japanese (F1=0.000).
  • An ablation shows that the authors' post-processing clustering algorithm improves F1 by 36%. The findings are validated with HalalLens, a production halal scanner serving 20+ countries.
  • The dataset and code are released under open licenses to enable further research and benchmarking in food packaging OCR.
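
For readers unfamiliar with COCO-format annotations, the sketch below shows roughly what such a file looks like and how it might be summarized per image and per language. It is illustrative only: the file names, box values, and the per-annotation "language" key are assumptions made for this example and are not taken from the HalalBench release, whose actual schema may differ.

```python
from collections import Counter, defaultdict

# Tiny in-memory stand-in for a COCO-format file (normally read with json.load).
# ASSUMPTION: the per-annotation "language" key is hypothetical. Core COCO
# detection annotations define id, image_id, category_id, bbox, area,
# iscrowd, and segmentation only.
coco = {
    "images": [
        {"id": 1, "file_name": "real_0001.jpg", "width": 640, "height": 480},
        {"id": 2, "file_name": "synth_0001.jpg", "width": 640, "height": 480},
    ],
    "annotations": [
        # bbox follows the COCO convention: [x, y, width, height] in pixels.
        {"id": 10, "image_id": 1, "category_id": 1, "bbox": [12, 20, 100, 18], "language": "en"},
        {"id": 11, "image_id": 1, "category_id": 1, "bbox": [12, 42, 90, 18], "language": "ja"},
        {"id": 12, "image_id": 2, "category_id": 1, "bbox": [30, 15, 120, 20], "language": "en"},
    ],
    "categories": [{"id": 1, "name": "text"}],
}


def annotations_by_image(dataset):
    """Group annotation dicts by the id of the image they belong to."""
    grouped = defaultdict(list)
    for ann in dataset["annotations"]:
        grouped[ann["image_id"]].append(ann)
    return dict(grouped)


def language_counts(dataset):
    """Count annotations per (hypothetical) language tag."""
    return Counter(ann.get("language", "unknown") for ann in dataset["annotations"])


if __name__ == "__main__":
    for image_id, anns in annotations_by_image(coco).items():
        print(image_id, len(anns))
    print(dict(language_counts(coco)))
```

Grouping by image first keeps per-image evaluation simple, since predictions from an OCR engine are normally produced one image at a time.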

Abstract

No standardized benchmark exists for evaluating OCR on food packaging, despite its critical role in automated halal food verification. Existing benchmarks target documents or scene text, missing the unique challenges of ingredient labels: curved surfaces, dense multilingual text, and sub-8pt fonts. We present HalalBench, the first open multilingual benchmark for food packaging OCR, comprising 1,043 images (50 real, 993 synthetic) with 36,438 annotations in COCO format spanning 14 languages. We evaluate four engines: docTR achieves F1=0.193, ML Kit 0.180, EasyOCR 0.167, while all fail on Japanese (F1=0.000). A clustering ablation shows 36% F1 improvement from our post-processing algorithm. We validate findings through HalalLens (https://halallens.no), a production halal scanner serving 20+ countries. Dataset and code are released under open licenses.
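
The abstract reports F1 per engine, but this summary does not state how predictions are matched to ground truth. As a rough illustration of how a detection-level F1 is commonly computed, the sketch below uses greedy one-to-one matching at an IoU threshold of 0.5. Both the matching strategy and the threshold are assumptions for this example, not the protocol confirmed by the paper, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two COCO-style boxes [x, y, width, height]."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def detection_f1(pred_boxes, gt_boxes, iou_threshold=0.5):
    """F1 under greedy one-to-one matching (ASSUMED protocol, see lead-in)."""
    matched_gt = set()
    true_pos = 0
    for pred in pred_boxes:
        best_index, best_iou = None, iou_threshold
        for index, gt in enumerate(gt_boxes):
            if index in matched_gt:
                continue  # each ground-truth box may be matched only once
            overlap = iou(pred, gt)
            if overlap >= best_iou:
                best_index, best_iou = index, overlap
        if best_index is not None:
            matched_gt.add(best_index)
            true_pos += 1
    precision = true_pos / len(pred_boxes) if pred_boxes else 0.0
    recall = true_pos / len(gt_boxes) if gt_boxes else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gt = [[0, 0, 10, 10], [20, 0, 10, 10]]
    pred = [[0, 0, 10, 10], [100, 100, 5, 5]]  # one hit, one false positive
    print(detection_f1(pred, gt))
    print(detection_f1([], gt))  # an engine that detects nothing
```

Under this kind of metric, an F1 of exactly 0.000 means that no prediction matched any ground-truth box, which is consistent with the complete failure on Japanese described above.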