Tagarela - A Portuguese speech dataset from podcasts
arXiv cs.CL / 3/17/2026
Key Points
- TAGARELA is a public dataset for Portuguese ASR and TTS, comprising over 8,972 hours of podcast audio.
- Its scale is comparable to that of GigaSpeech for English, which supports training state-of-the-art Portuguese models and addresses the scarcity of speech resources for the language.
- Data quality was ensured by two measures: an audio pre-processing pipeline, and a mixed transcription strategy that uses ASR models built on high-fidelity transcriptions from proprietary APIs.
- The dataset is publicly released at https://freds0.github.io/TAGARELA/ to accelerate development of robust Portuguese speech technologies.