A High-Accuracy Optical Music Recognition Method Based on Bottleneck Residual Convolutions

arXiv cs.CV / 4/21/2026


Key Points

  • The paper proposes an end-to-end Optical Music Recognition (OMR) framework that combines residual bottleneck convolutions with BiGRU-based sequence modeling to convert score images into symbolic representations.
  • A CNN using ResNet-v2-style residual bottleneck blocks and multi-scale dilated convolutions is employed to capture both fine symbol details and global staff-line structures.
  • The approach is trained with the Connectionist Temporal Classification (CTC) loss, enabling prediction without explicit alignment annotations between image regions and output sequences.
  • Experiments on the Camera-PrIMuS and PrIMuS datasets show strong performance, including low sequence error rates (7.52% on Camera-PrIMuS, 8.11% on PrIMuS) and symbol error rates (0.45% and 0.49% respectively).
  • The model also reports high pitch/type/note accuracies (around 99% for all metrics) while maintaining computational efficiency (about 1.74 seconds per training epoch on average).
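The paper itself does not release code, but the architecture described above (ResNet-v2-style pre-activation bottlenecks with dilated 3×3 convolutions, followed by a BiGRU over the width axis) can be sketched roughly as follows. All channel widths, dilation rates, and input sizes here are illustrative assumptions, not the paper's actual hyperparameters:

```python
import torch
import torch.nn as nn

class BottleneckV2(nn.Module):
    """ResNet-v2-style pre-activation bottleneck: BN-ReLU-Conv (1x1, 3x3, 1x1).

    The 3x3 convolution optionally uses dilation to enlarge the receptive
    field, approximating the paper's multi-scale dilated convolutions.
    (Channel widths and dilation rates are illustrative guesses.)
    """
    def __init__(self, channels, bottleneck, dilation=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, bottleneck, kernel_size=1, bias=False),
            nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, bottleneck, kernel_size=3,
                      padding=dilation, dilation=dilation, bias=False),
            nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, channels, kernel_size=1, bias=False),
        )

    def forward(self, x):
        return x + self.body(x)  # identity shortcut

class OMRBackbone(nn.Module):
    """Toy CNN + BiGRU encoder: score image -> per-timestep symbol logits."""
    def __init__(self, num_classes, channels=64, hidden=128):
        super().__init__()
        self.stem = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.blocks = nn.Sequential(
            BottleneckV2(channels, channels // 4, dilation=1),
            BottleneckV2(channels, channels // 4, dilation=2),  # multi-scale
            nn.MaxPool2d((2, 2)),
        )
        # After pooling a height-32 input, the feature map is 16 rows tall;
        # each width position is flattened into one BiGRU timestep.
        self.gru = nn.GRU(input_size=channels * 16, hidden_size=hidden,
                          bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                      # x: (B, 1, 32, W)
        f = self.blocks(self.stem(x))          # (B, C, 16, W/2)
        b, c, h, w = f.shape
        seq = f.permute(0, 3, 1, 2).reshape(b, w, c * h)  # width = time axis
        out, _ = self.gru(seq)                 # BiGRU models symbol order
        return self.head(out)                  # (B, W/2, num_classes)

model = OMRBackbone(num_classes=100)
logits = model(torch.randn(2, 1, 32, 256))
print(logits.shape)  # torch.Size([2, 128, 100])
```

Treating the width dimension as the time axis is the standard trick for CTC-style sequence recognition of staff-line images, since musical symbols are ordered left to right.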

Abstract

Optical Music Recognition (OMR) aims to convert printed or handwritten music score images into editable symbolic representations. This paper presents an end-to-end OMR framework that combines residual bottleneck convolutions with bidirectional gated recurrent unit (BiGRU)-based sequence modeling. A convolutional neural network with ResNet-v2-style residual bottleneck blocks and multi-scale dilated convolutions is used to extract features that encode both fine-grained symbol details and global staff-line structures. The extracted feature sequences are then fed into a BiGRU network to model temporal dependencies among musical symbols. The model is trained using the Connectionist Temporal Classification loss, enabling end-to-end prediction without explicit alignment annotations. Experimental results on the Camera-PrIMuS and PrIMuS datasets demonstrate the effectiveness of the proposed framework. On Camera-PrIMuS, the proposed method achieves a sequence error rate (SeER) of 7.52% and a symbol error rate (SyER) of 0.45%, with pitch, type, and note accuracies of 99.33%, 99.60%, and 99.28%, respectively. The average training time is 1.74 s per epoch, demonstrating high computational efficiency while maintaining strong recognition performance. On PrIMuS, the method achieves a SeER of 8.11% and a SyER of 0.49%, with pitch, type, and note accuracies of 99.27%, 99.58%, and 99.21%, respectively. A fine-grained error analysis further confirms the effectiveness of the proposed model.
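The alignment-free training the abstract describes is the defining property of CTC: the loss marginalizes over every monotonic alignment between the model's per-timestep predictions and the (shorter) target symbol string, so no per-symbol position annotations are needed. A minimal sketch using PyTorch's `nn.CTCLoss`, with hypothetical sizes (128 timesteps, 100 symbol classes plus a blank, 20-symbol targets):

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 128 timesteps (feature-sequence width), 100 symbol
# classes plus one CTC blank at index 0, batch of 2 score lines.
T, B, C = 128, 2, 101

log_probs = torch.randn(T, B, C).log_softmax(dim=2)  # (time, batch, classes)
targets = torch.randint(1, C, (B, 20))               # unaligned label strings
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 20, dtype=torch.long)

# CTC sums the probability of all monotonic alignments between the
# 128-step prediction sequence and each 20-symbol target, so training
# needs only the symbol string, not its positions in the image.
ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())  # a positive scalar (negative log-likelihood)
```

At inference time, the same model typically uses greedy or beam-search decoding that collapses repeated symbols and removes blanks to recover the final symbol sequence.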