Multi-Head Attention based interaction-aware architecture for Bangla Handwritten Character Recognition: Introducing a Primary Dataset

arXiv cs.CV / 4/14/2026


Key Points

  • A new balanced "Primary Dataset" for handwritten Bangla character recognition is reported, comprising 78 classes with roughly 650 samples each, collected from writers diverse in age, occupation, and handedness (both right- and left-handed).
  • The proposed method is an "interaction-aware hybrid deep learning architecture" that integrates EfficientNetB3, a Vision Transformer, and a Conformer in parallel, using multi-head cross-attention fusion to strengthen feature interaction across the branches.
  • The model achieves 98.84% accuracy on the in-house dataset and 96.49% on the external CHBCR benchmark, demonstrating good generalization in the face of class imbalance and high inter-class visual similarity.
  • Grad-CAM visualizations are included to make the regions that drive classification interpretable.
  • The dataset and source code are publicly released on Hugging Face to encourage research and reuse.

Abstract

Character recognition is the fundamental component of an optical character recognition (OCR) system. Word recognition, sentence transcription, document digitization, and language processing are higher-order tasks that depend on accurate character recognition. Nonetheless, recognizing handwritten Bangla characters is not easy: they are written in varied styles with inconsistent stroke patterns and a high degree of visual resemblance between characters. Existing datasets are usually limited in intra-class variation and inequitable in class distribution. To overcome these problems, we have constructed a new balanced dataset of handwritten Bangla characters. It consists of 78 classes, each with approximately 650 samples, covering basic characters, compound (Juktobarno) characters, and numerals. The contributors form a diverse group spanning a wide age range and multiple socioeconomic backgrounds, including elementary and high school students, university students, and professionals, and the sample contains both right- and left-handed writers. We further propose an interaction-aware hybrid deep learning architecture that integrates EfficientNetB3, Vision Transformer, and Conformer modules in parallel. A multi-head cross-attention fusion mechanism enables effective feature interaction across these components. The proposed model achieves 98.84% accuracy on the constructed dataset and 96.49% on the external CHBCR benchmark, demonstrating strong generalization capability. Grad-CAM visualizations further provide interpretability by highlighting discriminative regions. The dataset and source code of this research are publicly available at: https://huggingface.co/MIRZARAQUIB/Bangla_Handwritten_Character_Recognition.
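The paper does not spell out the fusion mechanism beyond naming it, but the idea of multi-head cross-attention fusion over parallel branches can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the projection weights are random stand-ins for learned parameters, one branch (here the CNN features) acts as the query, and the other branches (ViT and Conformer features) supply keys and values.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_cross_attention(q_feat, kv_feats, num_heads=4, rng=None):
    """Fuse one branch's features with tokens from the other branches.

    q_feat:   (d,)   query-branch feature vector
    kv_feats: (n, d) stacked feature vectors of the other branches
    Assumes d is divisible by num_heads; weights are random stand-ins.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    d = q_feat.shape[-1]
    dh = d // num_heads
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    Q = (q_feat   @ Wq).reshape(num_heads, dh)       # (h, dh)
    K = (kv_feats @ Wk).reshape(-1, num_heads, dh)   # (n, h, dh)
    V = (kv_feats @ Wv).reshape(-1, num_heads, dh)   # (n, h, dh)
    # Scaled dot-product attention, computed per head over the n branch tokens.
    scores = np.einsum('hd,nhd->hn', Q, K) / np.sqrt(dh)
    attn = softmax(scores, axis=-1)                  # (h, n)
    # Weighted sum of values, then concatenate the heads back to dimension d.
    return np.einsum('hn,nhd->hd', attn, V).reshape(d)

# Toy features from three parallel branches (e.g. CNN, ViT, Conformer), d=16.
d = 16
rng = np.random.default_rng(1)
cnn_feat, vit_feat, conf_feat = (rng.standard_normal(d) for _ in range(3))
fused = multi_head_cross_attention(cnn_feat,
                                   np.stack([vit_feat, conf_feat]),
                                   num_heads=4, rng=rng)
print(fused.shape)  # (16,)
```

In a trained model the fused vector would feed a classification head over the 78 classes; a symmetric variant could let each branch attend to the others in turn and concatenate the results.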