easyaligner: Forced alignment with GPU acceleration and flexible text normalization (compatible with all w2v2 models on HF Hub) [P]

Reddit r/MachineLearning / 4/18/2026


Key Points

  • easyaligner is a new forced-alignment library aimed at making speech-to-text preprocessing faster and more convenient, especially for large-scale audio/text workflows.
  • It adds workflow features such as automatically detecting which audio region matches an incomplete transcript, trimming irrelevant speech at segment boundaries, and aligning long audio/text without mandatory chunking.
  • The tool provides flexible ground-truth text normalization to improve alignment quality while preserving a mapping back to the original text so formatting can be recovered after alignment.
  • Under the hood, easyaligner uses PyTorch’s forced alignment API with a GPU-accelerated Viterbi-based implementation and is adapted to support emission extraction from wav2vec2 models on the Hugging Face Hub.


I have built easyaligner, a forced alignment library designed to be performant and easy to use.

Having preprocessed hundreds of thousands of hours of audio and text for training speech-to-text models, I found that existing open-source forced-alignment libraries often lacked convenience features. For our purposes, it was particularly important for the tooling to be able to:

  • Handle cases where the transcript does not cover all of the spoken content in the audio (by automatically detecting the relevant audio region).
  • Tolerate some irrelevant speech at the start and end of the audio segments being aligned.
  • Ideally handle long segments of audio and text without the need for chunking.
  • Normalize ground-truth texts for better alignment quality, while maintaining a mapping between the normalized text and the original text, so that the original text's formatting can be recovered after alignment.
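The last point, normalization with a recoverable mapping, can be sketched in a few lines of plain Python. This is not easyaligner's actual API; `normalize_with_map` is a hypothetical helper that lowercases and strips punctuation while recording, for every normalized character, the index of the original character it came from, so that an aligned span in the normalized text can be mapped back to the original formatting.

```python
# Illustrative sketch (not easyaligner's API): normalize a transcript for
# alignment while keeping an index map back to the original text.

def normalize_with_map(text):
    norm_chars = []
    index_map = []  # index_map[i] = position in `text` of norm_chars[i]
    for i, ch in enumerate(text):
        if ch.isalnum():
            norm_chars.append(ch.lower())
            index_map.append(i)
        elif ch.isspace() and norm_chars and norm_chars[-1] != " ":
            norm_chars.append(" ")
            index_map.append(i)
    # Drop a trailing space so the map and string stay in sync
    if norm_chars and norm_chars[-1] == " ":
        norm_chars.pop()
        index_map.pop()
    return "".join(norm_chars), index_map

original = 'He said: "Hello, World!"'
normalized, idx = normalize_with_map(original)
print(normalized)  # he said hello world

# An aligned span over the normalized text maps back to the original:
s = normalized.index("world")
e = s + len("world")
print(original[idx[s]:idx[e - 1] + 1])  # World
```

The alignment itself then runs on the normalized string, and timestamps can be reattached to the original text through the index map.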

easyaligner is an attempt to package all of these workflow improvements into a forced alignment library.

The documentation has tutorials for different alignment scenarios, and for custom text processing. The aligned outputs can be segmented at any level of granularity (sentence, paragraph, etc.), while preserving the original text’s formatting.

The forced alignment backend uses PyTorch's forced alignment API with a GPU-based implementation of the Viterbi algorithm. It's both fast and memory-efficient, handling hours of audio/text in one pass without the need to chunk the audio. I've adapted the API to support emission extraction from all wav2vec2 models on the Hugging Face Hub. You can force-align audio and text in any language, as long as there's a w2v2 model on the HF Hub that can transcribe that language.
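To make the backend concrete, here is a minimal NumPy sketch of the dynamic program that CTC forced alignment performs: given per-frame log-probabilities from an acoustic model and a target token sequence, interleave blanks and find the most likely monotonic frame-to-token path. This is a toy CPU illustration of the Viterbi recurrence (akin to what `torchaudio.functional.forced_align` computes on GPU), not easyaligner's actual implementation; the emission matrix below is hand-made.

```python
# Toy CTC forced alignment via Viterbi: dp[t, s] is the best log-probability
# of any path that emits the first t+1 frames and ends in extended state s.
import numpy as np

def ctc_forced_align(log_probs, targets, blank=0):
    """Return the most likely frame-level label path (blanks included).

    log_probs: (T, C) array of per-frame log-probabilities.
    targets: non-empty list of token ids (without blanks).
    """
    T = log_probs.shape[0]
    # Extended sequence with blanks interleaved: [_, t1, _, t2, ..., _]
    ext = [blank]
    for t in targets:
        ext += [t, blank]
    S = len(ext)
    dp = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)  # offset (0, 1, or 2) taken at frame t
    dp[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        dp[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [dp[t - 1, s]]                  # stay in the same state
            if s >= 1:
                cands.append(dp[t - 1, s - 1])      # advance one state
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(dp[t - 1, s - 2])      # skip over a blank
            best = int(np.argmax(cands))
            dp[t, s] = cands[best] + log_probs[t, ext[s]]
            back[t, s] = best
    # The path must end in the final blank or the final target token
    s = S - 1 if dp[T - 1, S - 1] >= dp[T - 1, S - 2] else S - 2
    path = [ext[s]]
    for t in range(T - 1, 0, -1):
        s -= back[t, s]
        path.append(ext[s])
    path.reverse()
    return path

# Toy emissions: 5 frames, 3 classes (0 = blank, 1 = 'a', 2 = 'b');
# frames clearly favor a, a, blank, b, b.
toy = np.log(np.array([
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.8, 0.1, 0.1],
    [0.1, 0.1, 0.8],
    [0.1, 0.1, 0.8],
]))
path = ctc_forced_align(toy, targets=[1, 2])
print(path)  # [1, 1, 0, 2, 2]
```

In a real pipeline the emission matrix would come from a wav2vec2 CTC head, and the frame indices in the path would be converted to seconds using the model's frame stride; the GPU implementation vectorizes this same recurrence over states and batches.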

easyaligner supports aligning both from ground-truth transcripts and from ASR model outputs. Check out its companion library easytranscriber for an example where easyaligner is used as a backend to align ASR outputs. It works the same way as WhisperX but transcribes 35% to 102% faster, depending on the hardware.

The documentation: https://kb-labb.github.io/easyaligner/
Source code on Github (MIT licensed): https://github.com/kb-labb/easyaligner

submitted by /u/mLalush