Tagarela - A Portuguese speech dataset from podcasts
arXiv cs.CL / 3/17/2026
Key Points
- TAGARELA is a public dataset for Portuguese ASR and TTS, comprising over 8,972 hours of podcast audio.
- Its scale rivals that of the English GigaSpeech corpus, enabling state-of-the-art Portuguese models and addressing the scarcity of speech resources for the language.
- Data quality was ensured via an audio pre-processing pipeline and a mixed transcription strategy using ASR models built on high-fidelity transcriptions from proprietary APIs.
- The dataset is publicly released at https://freds0.github.io/TAGARELA/ to accelerate development of robust Portuguese speech technologies.