Hi, I trained an optical music recognition model and wanted to share it here because I think my approach has room for improvement and could use feedback.
Clarity-OMR takes sheet music PDFs and converts them to MusicXML files. The core is a DaViT-Base encoder paired with a custom Transformer decoder that outputs a 487-token music vocabulary. The whole thing runs as a 4-stage pipeline: YOLO for staff detection → DaViT+RoPE decoder for recognition → grammar FSA for constrained beam search → MusicXML export.
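To make the flow concrete, here's a skeleton of how the four stages compose. Every name below is a placeholder I made up for illustration, not the repo's actual API:

```python
"""Hedged skeleton of the 4-stage pipeline. All names are hypothetical
placeholders for illustration, not Clarity-OMR's actual API."""
import numpy as np

def detect_staves(page: np.ndarray) -> list[tuple[int, int, int, int]]:
    # Stage 1: YOLO staff detector -> per-staff bounding boxes (stubbed).
    return []

def recognize_staff(crop: np.ndarray) -> list[str]:
    # Stages 2-3: DaViT encoder + RoPE Transformer decoder, with beam
    # search constrained by the grammar FSA over the 487-token
    # vocabulary (stubbed).
    return []

def to_musicxml(staves_tokens: list[list[str]]) -> str:
    # Stage 4: token sequences -> MusicXML document (stubbed).
    return "<score-partwise/>"

def transcribe_page(page: np.ndarray) -> str:
    tokens = []
    for (x1, y1, x2, y2) in detect_staves(page):
        crop = page[y1:y2, x1:x2]   # each staff crop is resized to 192px height
        tokens.append(recognize_staff(crop))
    return to_musicxml(tokens)
```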
Some key design choices:
- Staff-level recognition at 192px height instead of full-page end-to-end (preserves fine detail)
- DoRA rank-64 on all linear layers (config sketch after this list)
- Grammar FSA enforces structural validity during decoding (beat consistency, chord well-formedness); see the masking sketch after this list
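On the DoRA point, here's roughly what that looks like with Hugging Face PEFT, which exposes DoRA through `use_dora=True` on `LoraConfig`. The tiny stand-in model, the target module names, and the alpha value are assumptions, not the repo's actual training config:

```python
import torch.nn as nn
from peft import LoraConfig, get_peft_model

# Tiny stand-in for the real DaViT encoder + Transformer decoder.
class TinyDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(256, 256)
        self.head = nn.Linear(256, 487)  # 487-token music vocabulary

# Assumed DoRA setup: rank 64 on every linear layer; lora_alpha is a guess.
dora_config = LoraConfig(
    r=64,
    lora_alpha=128,          # assumed scaling factor
    use_dora=True,           # PEFT's flag that turns LoRA into DoRA
    target_modules=["proj", "head"],  # the real model would list all linears (qkv, mlp, ...)
)
peft_model = get_peft_model(TinyDecoder(), dora_config)
peft_model.print_trainable_parameters()
```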
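The grammar constraint itself is easy to sketch: the FSA tracks a decoding state, and at each step only tokens that are legal from that state keep their logits; everything else is masked to -inf before the beam expands. Here's a toy version for chord well-formedness (the token ids and grammar are invented; the real FSA also tracks things like beat counts):

```python
import torch

# Toy grammar FSA for chord well-formedness. Token ids, states, and the
# grammar are invented for illustration; the real 487-token grammar is
# much richer (beat consistency, voices, etc.).
NOTE, CHORD_OPEN, CHORD_CLOSE, BAR = 0, 1, 2, 3
ALLOWED = {
    "top":   {NOTE, CHORD_OPEN, BAR},  # a chord that was never opened can't close
    "chord": {NOTE, CHORD_CLOSE},      # inside a chord: only notes or a close
}

def step(state: str, token: int) -> str:
    # FSA transition function.
    if token == CHORD_OPEN:
        return "chord"
    if token == CHORD_CLOSE:
        return "top"
    return state

def mask_logits(logits: torch.Tensor, state: str) -> torch.Tensor:
    # Set grammar-invalid tokens to -inf so beam search can never
    # expand a structurally broken hypothesis.
    masked = torch.full_like(logits, float("-inf"))
    allowed = torch.tensor(sorted(ALLOWED[state]))
    masked[allowed] = logits[allowed]
    return masked

# Greedy demo; a real beam search keeps one FSA state per hypothesis.
state = "top"
for _ in range(6):
    logits = torch.randn(4)  # stand-in for the decoder's output
    token = int(mask_logits(logits, state).argmax())
    state = step(state, token)
    print(token, state)
```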
I benchmarked against Audiveris on 10 classical piano pieces using mir_eval. It's roughly competitive overall (42.8 vs 44.0 average quality score), with clear wins on cleaner, more rhythmic scores (69.5 vs 25.9 on Bartók, 66.2 vs 33.9 on The Entertainer) and weaknesses when notes don't sit properly on the stave; on cherry-picked scores it should outperform Audiveris. Details on the benchmark can be found at the Hugging Face link.
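For context on how mir_eval scores this kind of thing, here's a minimal example of its note-transcription metrics on made-up toy data; this is just a guess at the flavor of comparison, since the actual scoring script is on the benchmark page and may aggregate differently:

```python
import numpy as np
import mir_eval

# Toy reference vs. predicted transcriptions: (onset, offset) intervals in
# seconds plus pitches in Hz. All values here are made up.
ref_intervals = np.array([[0.0, 0.5], [0.5, 1.0], [1.0, 2.0]])
ref_pitches   = np.array([440.0, 493.88, 523.25])   # A4, B4, C5
est_intervals = np.array([[0.02, 0.48], [0.5, 1.0], [1.0, 1.9]])
est_pitches   = np.array([440.0, 493.88, 523.25])

precision, recall, f1, avg_overlap = mir_eval.transcription.precision_recall_f1_overlap(
    ref_intervals, ref_pitches, est_intervals, est_pitches,
    onset_tolerance=0.05,   # 50 ms onset window (the library default)
)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```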
I think there's a ton of room to push this further: better polyphonic training data, smarter grammar constraints, and more diverse synthetic rendering could all help significantly, as could trying an approach other than stave-by-stave, or mixing the model with classical vision techniques to get the best possible score.
Everything is open-source:
- Inference: https://github.com/clquwu/Clarity-OMR
- Training: https://github.com/clquwu/Clarity-OMR-Train
- Weights: https://huggingface.co/clquwu/Clarity-OMR
There are many more details about the model itself in Clarity-OMR-Train. The code is a bit messy because it's literally all the code I've produced for it.