A High-Accuracy Optical Music Recognition Method Based on Bottleneck Residual Convolutions

arXiv cs.CV / 4/21/2026


Key Points

  • The paper proposes an end-to-end Optical Music Recognition (OMR) framework that combines residual bottleneck convolutions with BiGRU-based sequence modeling to convert score images into symbolic representations.
  • A CNN using ResNet-v2-style residual bottleneck blocks and multi-scale dilated convolutions is employed to capture both fine symbol details and global staff-line structures.
  • The approach is trained with the Connectionist Temporal Classification (CTC) loss, enabling prediction without explicit alignment annotations between image regions and output sequences.
  • Experiments on the Camera-PrIMuS and PrIMuS datasets show strong performance, including low sequence error rates (7.52% on Camera-PrIMuS, 8.11% on PrIMuS) and symbol error rates (0.45% and 0.49% respectively).
  • The model also reports high pitch/type/note accuracies (around 99% for all metrics) while maintaining computational efficiency (about 1.74 seconds per training epoch on average).
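The paper itself does not release code, but the architecture described above (ResNet-v2-style pre-activation bottlenecks with dilated 3×3 convolutions, followed by a BiGRU over the width axis) can be sketched roughly as follows. All channel widths, dilation rates, and input sizes here are illustrative assumptions, not the paper's actual hyperparameters:

```python
import torch
import torch.nn as nn

class BottleneckV2(nn.Module):
    """ResNet-v2-style pre-activation bottleneck: BN-ReLU-Conv (1x1, 3x3, 1x1).

    The 3x3 convolution optionally uses dilation to enlarge the receptive
    field, approximating the paper's multi-scale dilated convolutions.
    (Channel widths and dilation rates are illustrative guesses.)
    """
    def __init__(self, channels, bottleneck, dilation=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, bottleneck, kernel_size=1, bias=False),
            nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, bottleneck, kernel_size=3,
                      padding=dilation, dilation=dilation, bias=False),
            nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, channels, kernel_size=1, bias=False),
        )

    def forward(self, x):
        return x + self.body(x)  # identity shortcut

class OMRBackbone(nn.Module):
    """Toy CNN + BiGRU encoder: score image -> per-timestep symbol logits."""
    def __init__(self, num_classes, channels=64, hidden=128):
        super().__init__()
        self.stem = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.blocks = nn.Sequential(
            BottleneckV2(channels, channels // 4, dilation=1),
            BottleneckV2(channels, channels // 4, dilation=2),  # multi-scale
            nn.MaxPool2d((2, 2)),
        )
        # After pooling a height-32 input, the feature map is 16 rows tall;
        # each width position is flattened into one BiGRU timestep.
        self.gru = nn.GRU(input_size=channels * 16, hidden_size=hidden,
                          bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                      # x: (B, 1, 32, W)
        f = self.blocks(self.stem(x))          # (B, C, 16, W/2)
        b, c, h, w = f.shape
        seq = f.permute(0, 3, 1, 2).reshape(b, w, c * h)  # width = time axis
        out, _ = self.gru(seq)                 # BiGRU models symbol order
        return self.head(out)                  # (B, W/2, num_classes)

model = OMRBackbone(num_classes=100)
logits = model(torch.randn(2, 1, 32, 256))
print(logits.shape)  # torch.Size([2, 128, 100])
```

Treating the width dimension as the time axis is the standard trick for CTC-style sequence recognition of staff-line images, since musical symbols are ordered left to right.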

Abstract

Optical Music Recognition (OMR) aims to convert printed or handwritten music score images into editable symbolic representations. This paper presents an end-to-end OMR framework that combines residual bottleneck convolutions with bidirectional gated recurrent unit (BiGRU)-based sequence modeling. A convolutional neural network with ResNet-v2-style residual bottleneck blocks and multi-scale dilated convolutions is used to extract features that encode both fine-grained symbol details and global staff-line structures. The extracted feature sequences are then fed into a BiGRU network to model temporal dependencies among musical symbols. The model is trained using the Connectionist Temporal Classification loss, enabling end-to-end prediction without explicit alignment annotations. Experimental results on the Camera-PrIMuS and PrIMuS datasets demonstrate the effectiveness of the proposed framework. On Camera-PrIMuS, the proposed method achieves a sequence error rate (SeER) of 7.52% and a symbol error rate (SyER) of 0.45%, with pitch, type, and note accuracies of 99.33%, 99.60%, and 99.28%, respectively. The average training time is 1.74 s per epoch, demonstrating high computational efficiency while maintaining strong recognition performance. On PrIMuS, the method achieves a SeER of 8.11% and a SyER of 0.49%, with pitch, type, and note accuracies of 99.27%, 99.58%, and 99.21%, respectively. A fine-grained error analysis further confirms the effectiveness of the proposed model.
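The alignment-free training the abstract describes is the defining property of CTC: the loss marginalizes over every monotonic alignment between the model's per-timestep predictions and the (shorter) target symbol string, so no per-symbol position annotations are needed. A minimal sketch using PyTorch's `nn.CTCLoss`, with hypothetical sizes (128 timesteps, 100 symbol classes plus a blank, 20-symbol targets):

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 128 timesteps (feature-sequence width), 100 symbol
# classes plus one CTC blank at index 0, batch of 2 score lines.
T, B, C = 128, 2, 101

log_probs = torch.randn(T, B, C).log_softmax(dim=2)  # (time, batch, classes)
targets = torch.randint(1, C, (B, 20))               # unaligned label strings
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 20, dtype=torch.long)

# CTC sums the probability of all monotonic alignments between the
# 128-step prediction sequence and each 20-symbol target, so training
# needs only the symbol string, not its positions in the image.
ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())  # a positive scalar (negative log-likelihood)
```

At inference time, the same model typically uses greedy or beam-search decoding that collapses repeated symbols and removes blanks to recover the final symbol sequence.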