
[P] I've trained my own OMR model (Optical Music Recognition)

Reddit r/MachineLearning / 3/15/2026


Key Points

  • Clarity-OMR processes sheet music PDFs into MusicXML via a four-stage pipeline: YOLO-based staff detection, a DaViT+RoPE encoder/decoder for recognition, a grammar FSA for constrained decoding, and MusicXML export.
  • The model uses a DaViT-Base encoder with a custom Transformer decoder that outputs a 487-token music vocabulary, and performs staff-level recognition at 192px height to preserve fine detail.
  • Structural validity is enforced during decoding with a grammar FSA and DoRA rank-64 applied to all linear layers to improve stability and accuracy.
  • In benchmarking against Audiveris on 10 classical piano pieces, Clarity-OMR is roughly competitive (42.8 vs 44.0) and excels on cleaner, more rhythmic scores but struggles when notes are off-stave.
  • The author calls for improvements such as better polyphonic training data, smarter grammar constraints, and more diverse synthetic rendering, and notes that combining the model with classical vision approaches could help; all code and weights are open source.

Hi, I trained an optical music recognition model and wanted to share it here because I think my approach could use improvements and feedback.

Clarity-OMR takes sheet music PDFs and converts them to MusicXML files. The core is a DaViT-Base encoder paired with a custom Transformer decoder that outputs a 487-token music vocabulary. The whole thing runs as a 4-stage pipeline: YOLO for staff detection → DaViT+RoPE decoder for recognition → grammar FSA for constrained beam search → MusicXML export.
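To make the four-stage flow concrete, here is a toy sketch of that structure. The stubs stand in for the real models, and names like `detect_staves` are illustrative, not the actual Clarity-OMR API:

```python
# Toy sketch of the 4-stage pipeline; stubs replace the real models.

def detect_staves(page_image):
    # Stage 1: YOLO staff detection would return staff crops here.
    return [page_image]  # stub: treat the whole page as one staff

def recognize_staff(staff_image):
    # Stages 2-3: DaViT+RoPE encoder/decoder with FSA-constrained beam
    # search would emit music tokens; stub returns a fixed sequence.
    return ["note-C4_quarter", "note-E4_quarter", "barline"]

def tokens_to_musicxml(token_seqs):
    # Stage 4: serialize recognized tokens as (minimal, fake) MusicXML.
    notes = "".join(f"<note>{t}</note>" for seq in token_seqs for t in seq)
    return f"<score-partwise>{notes}</score-partwise>"

def run_pipeline(pages):
    staves = [s for page in pages for s in detect_staves(page)]
    token_seqs = [recognize_staff(s) for s in staves]
    return tokens_to_musicxml(token_seqs)
```

The point is just the shape of the data flow: page images → staff crops → token sequences → one MusicXML document.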

Some key design choices:

- Staff-level recognition at 192px height instead of full-page end-to-end (preserves fine detail)

- DoRA rank-64 on all linear layers

- Grammar FSA enforces structural validity during decoding (beat consistency, chord well-formedness)
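To illustrate the grammar-FSA idea, here is a minimal hand-rolled version of constrained decoding: at each step, tokens the grammar forbids are masked out before picking the best candidate. The token names and grammar rules are made up for the example, not the real 487-token vocabulary or the actual Clarity-OMR grammar:

```python
# Illustrative FSA: state -> {allowed token: next state}.
# Rule encoded here: a chord must be closed before anything else resumes.
FSA = {
    "measure": {"note": "measure", "chord_open": "chord", "barline": "measure"},
    "chord":   {"note": "chord", "chord_close": "measure"},
}

def constrained_decode(step_scores, state="measure"):
    """Greedy decode: at each step pick the highest-scoring *legal* token.

    step_scores: list of {token: model score} dicts, one per decode step.
    """
    out = []
    for scores in step_scores:
        legal = {t: s for t, s in scores.items() if t in FSA[state]}
        tok = max(legal, key=legal.get)  # grammar-masked argmax
        out.append(tok)
        state = FSA[state][tok]
    return out
```

For example, even if the model scores `barline` highest while a chord is still open, the FSA masks it and forces `chord_close` first, which is the kind of structural validity (beat consistency, chord well-formedness) the post describes.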

I benchmarked against Audiveris on 10 classical piano pieces using mir_eval. It's roughly competitive overall (42.8 vs 44.0 avg quality score), with clear wins on cleaner/more rhythmic scores (69.5 vs 25.9 on Bartók, 66.2 vs 33.9 on The Entertainer) and weaknesses when notes are not properly on the stave; with cherry-picked scores it should outperform Audiveris. Details on the benchmark can be found at the Hugging Face link.
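For intuition on this kind of scoring: mir_eval matches predicted notes against reference notes (with onset and pitch tolerances) and reports precision/recall-style metrics. The toy version below skips the tolerances and matches exact (onset, MIDI pitch) pairs just to show the F1 idea; it is not the actual metric used in the benchmark:

```python
# Simplified stand-in for note-level transcription scoring: exact-match F1
# over (onset_seconds, midi_pitch) pairs. Real mir_eval scoring uses
# onset/pitch tolerances and greedy matching instead of exact set overlap.

def note_f1(ref_notes, est_notes):
    ref, est = set(ref_notes), set(est_notes)
    if not ref or not est:
        return 0.0
    tp = len(ref & est)                 # correctly transcribed notes
    precision = tp / len(est)
    recall = tp / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```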

I think there's a ton of room to push this further: better polyphonic training data, smarter grammar constraints, and more diverse synthetic rendering could all help significantly. Other directions include trying an approach other than stave-by-stave recognition, or combining the model with classical vision techniques to get the best score possible.

Everything is open-source:

- Inference: https://github.com/clquwu/Clarity-OMR

- Training: https://github.com/clquwu/Clarity-OMR-Train

- Weights: https://huggingface.co/clquwu/Clarity-OMR

There are many more details about the model itself in Clarity-OMR-Train. The code is a bit messy because it's literally all the code I've produced for it.

submitted by /u/Clarity___