A Robust Deep Learning Framework for Bangla License Plate Recognition Using YOLO and Vision-Language OCR

arXiv cs.CV / 3/12/2026

Key Points

  • The paper introduces a robust Bangla license plate recognition system that combines a deep learning–based localization model with OCR to extract text; the proposed localization approach achieves 97.83% accuracy and an Intersection over Union (IoU) of 91.3%.
  • It evaluates multiple object detection architectures, including U-Net and YOLO variants, and proposes a two-stage adaptive training strategy based on YOLOv8 to enhance localization performance.
  • Text recognition is formulated as a sequence generation problem using a VisionEncoderDecoder framework, with ViT + BanglaBERT achieving a character error rate of 0.1323 and a word error rate of 0.1068.
  • The framework demonstrates robustness across diverse real-world conditions and is positioned for deployment in intelligent transportation applications such as automated law enforcement and access control.
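
The IoU figure in the key points measures how well a predicted plate box overlaps the annotated one. A minimal sketch of that calculation follows, assuming axis-aligned boxes given as (x_min, y_min, x_max, y_max); the function name and box convention are illustrative and are not taken from the paper.

```python
from typing import Tuple

# Assumed box convention (not from the paper): (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Return the Intersection over Union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle, if any.
    ix_min = max(a[0], b[0])
    iy_min = max(a[1], b[1])
    ix_max = min(a[2], b[2])
    iy_max = min(a[3], b[3])

    # Clamp to zero when the boxes do not overlap.
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Degenerate (zero-area) inputs yield 0.0 rather than a division error.
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = (10.0, 10.0, 110.0, 60.0)     # hypothetical predicted plate box
    ground_truth = (20.0, 15.0, 120.0, 65.0)  # hypothetical annotated plate box
    print(f"IoU = {box_iou(predicted, ground_truth):.3f}")
```

A dataset-level score such as the reported 91.3% would typically be an average of per-image values like this one, although the summary does not state how the paper aggregates them.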

Abstract

An Automatic License Plate Recognition (ALPR) system is a crucial element of an intelligent traffic management system. However, detecting Bangla license plates remains challenging because of the complex character scheme and uneven layouts. This paper presents a robust Bangla License Plate Recognition system that integrates a deep learning-based object detection model for license plate localization with Optical Character Recognition (OCR) for text extraction. Multiple object detection architectures, including U-Net and several YOLO (You Only Look Once) variants, are compared for license plate localization. This study proposes a novel two-stage adaptive training strategy built upon the YOLOv8 architecture to improve localization performance. The proposed approach outperforms the established models, achieving an accuracy of 97.83% and an Intersection over Union (IoU) of 91.3%. Text recognition is framed as a sequence generation problem using a VisionEncoderDecoder architecture, and several encoder-decoder combinations are evaluated. The ViT + BanglaBERT model gives better results at the character level, with a Character Error Rate of 0.1323 and a Word Error Rate of 0.1068. The proposed system also performs consistently on an external dataset curated for this study. That dataset has completely different environmental and lighting conditions from the training samples, indicating the robustness of the proposed framework. Overall, the proposed system provides a robust and reliable solution for Bangla license plate recognition and performs effectively across diverse real-world scenarios, including variations in lighting, noise, and plate styles. These strengths make it well suited for deployment in intelligent transportation applications such as automated law enforcement and access control.
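
Character Error Rate and Word Error Rate are commonly defined as the Levenshtein edit distance between the predicted and reference text, divided by the reference length in characters or words. The sketch below follows that common definition. It is an assumption that the paper computes the metrics the same way, and the code counts Unicode code points as characters, which may differ from a grapheme-based count for Bangla script.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate over Unicode code points (assumed unit)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate over whitespace-separated tokens."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    # Hypothetical plate strings, used only to exercise the functions.
    ref = "DHAKA METRO GA 1234"
    hyp = "DHAKA METRO GA 1284"
    print(f"CER = {cer(ref, hyp):.4f}, WER = {wer(ref, hyp):.4f}")
```

Whether scores are averaged per plate or pooled over the whole test set changes the final number, and the summary does not say which convention the authors used.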