ShobdoSetu: A Data-Centric Framework for Bengali Long-Form Speech Recognition and Speaker Diarization

arXiv cs.CL / 3/23/2026


Key Points

  • ShobdoSetu presents a data-centric framework for Bengali long-form automatic speech recognition and speaker diarization, addressing the language's under-resourced status.
  • The approach builds a high-quality training corpus from Bengali YouTube audiobooks and dramas, incorporating LLM-assisted language normalization, fuzzy-matching-based chunk boundary validation, and muffled-zone augmentation.
  • The authors fine-tune the tugstugi/whisper-medium model on ~21,000 data points with beam size 5, achieving a WER of 16.751 on the public leaderboard and 15.551 on the private test set.
  • For diarization, they fine-tune the pyannote.audio segmentation model in an extreme low-resource setting (10 training files), achieving a DER of 0.19974 on the public leaderboard and 0.26723 on the private test set.
  • The results suggest that careful data engineering and domain-adaptive fine-tuning can yield competitive Bengali speech processing performance without relying on large annotated corpora.
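The fuzzy-matching-based chunk boundary validation mentioned above can be illustrated with a minimal stdlib sketch. The paper does not specify its matcher or threshold, so the function name, the use of `difflib.SequenceMatcher`, and the `0.85` cutoff below are all assumptions for illustration; the idea is simply to keep an audio chunk only when its transcript closely agrees with the aligned span of the reference text:

```python
from difflib import SequenceMatcher

def validate_chunk_boundary(chunk_transcript: str,
                            reference_span: str,
                            threshold: float = 0.85) -> bool:
    """Hypothetical boundary check: accept a training chunk only if its
    transcript fuzzily matches the reference text aligned to that chunk.
    A low similarity ratio suggests the chunk was cut mid-word or the
    alignment drifted, so the chunk is discarded."""
    ratio = SequenceMatcher(None, chunk_transcript, reference_span).ratio()
    return ratio >= threshold

# A well-aligned chunk passes; a badly misaligned one is rejected.
good = validate_chunk_boundary("se din bikele brishti", "se din bikele brishti")
bad = validate_chunk_boundary("se din bikele brishti", "onek din por dekha holo")
```

Filtering of this kind is what makes the pipeline "data-centric": quality is enforced per chunk before any model training happens.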

Abstract

Bengali is spoken by over 230 million people yet remains severely under-served in automatic speech recognition (ASR) and speaker diarization research. In this paper, we present our system for the DL Sprint 4.0 Bengali Long-Form Speech Recognition (Task 1) and Bengali Speaker Diarization Challenge (Task 2). For Task 1, we propose a data-centric pipeline that constructs a high-quality training corpus from Bengali YouTube audiobooks and dramas [tabib2026bengaliloop], incorporating LLM-assisted language normalization, fuzzy-matching-based chunk boundary validation, and muffled-zone augmentation. Fine-tuning the tugstugi/whisper-medium model on approximately 21,000 data points with beam size 5, we achieve a Word Error Rate (WER) of 16.751 on the public leaderboard and 15.551 on the private test set. For Task 2, we fine-tune the pyannote.audio community-1 segmentation model with targeted hyperparameter optimization under an extreme low-resource setting (10 training files), achieving a Diarization Error Rate (DER) of 0.19974 on the public leaderboard and 0.26723 on the private test set. Our results demonstrate that careful data engineering and domain-adaptive fine-tuning can yield competitive performance for Bengali speech processing even without large annotated corpora.
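The WER figures above are on a 0–100 scale: word error rate is the word-level edit distance between reference and hypothesis, divided by the reference length. A minimal stdlib sketch of the metric (not the challenge's official scorer, which may tokenize and normalize Bengali text differently):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate as a percentage: (S + D + I) / N * 100, computed
    via word-level Levenshtein distance against the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution/match
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

# One substituted word out of four -> 25.0
wer("ami bari jabo kal", "ami bari jabo aj")
```

DER is analogous but time-based: the fraction of audio attributed to the wrong speaker, missed, or falsely detected, which is why it is reported as a fraction (0.19974) rather than a percentage here.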