AI Navigate

Tagarela - A Portuguese speech dataset from podcasts

arXiv cs.CL / 3/17/2026

Key Points

  • TAGARELA is a public dataset for Portuguese ASR and TTS, comprising over 8,972 hours of podcast audio.
  • Its scale rivals that of the English GigaSpeech corpus (10,000 hours), enabling state-of-the-art Portuguese models and addressing resource scarcity for the language.
  • Data quality was ensured via an audio pre-processing pipeline and a mixed transcription strategy: the audio was transcribed by ASR models that had been trained on high-fidelity transcriptions generated by proprietary APIs.
  • The dataset is publicly released at https://freds0.github.io/TAGARELA/ to accelerate development of robust Portuguese speech technologies.
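
To make the scale comparison concrete, the short Python sketch below computes how TAGARELA's stated size compares with GigaSpeech's. The two hour counts come from the abstract (8,972 hours is given as a lower bound, and "10kh" is read as 10,000 hours); the variable names are illustrative only.

```python
# Hour counts as quoted in the abstract; TAGARELA's figure is a lower bound ("over 8,972 hours").
TAGARELA_HOURS = 8_972
GIGASPEECH_HOURS = 10_000  # "10kh" in the abstract

# Fraction of GigaSpeech's size that TAGARELA reaches, and the gap in hours.
ratio = TAGARELA_HOURS / GIGASPEECH_HOURS
gap_hours = GIGASPEECH_HOURS - TAGARELA_HOURS

print(f"TAGARELA is at least {ratio:.1%} of GigaSpeech's size ({gap_hours} hours smaller at most).")
```

On these figures, TAGARELA reaches roughly nine tenths of GigaSpeech's size.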

Abstract

Despite significant advances in speech processing, Portuguese remains under-resourced because public, large-scale, high-quality datasets are scarce. To address this gap, we present TAGARELA, a new dataset of over 8,972 hours of podcast audio curated for training automatic speech recognition (ASR) and text-to-speech (TTS) models. Its scale rivals that of the English GigaSpeech corpus (10,000 hours), enabling state-of-the-art Portuguese models. To ensure data quality, the corpus was passed through an audio pre-processing pipeline and then transcribed using a mixed strategy: we applied ASR models previously trained on high-fidelity transcriptions generated by proprietary APIs, ensuring a high level of initial accuracy. Finally, to validate this new resource, we present ASR and TTS models trained exclusively on our dataset and evaluate their performance, demonstrating its potential to drive the development of more robust and natural speech technologies for Portuguese. The dataset is publicly available at https://freds0.github.io/TAGARELA/.
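
The mixed transcription strategy described in the abstract can be read as a form of pseudo-labeling. The Python sketch below illustrates that reading only; it is not the authors' pipeline. It assumes (the abstract does not say) that the API-transcribed seed clips are a subset of the corpus and that their transcripts are kept in the final labels. All names here (`pseudo_label`, `api_transcribe`, `train_asr`, the toy stubs) are hypothetical.

```python
from __future__ import annotations

from typing import Callable, Iterable


def pseudo_label(
    clips: Iterable[str],
    seed_ids: set[str],
    api_transcribe: Callable[[str], str],
    train_asr: Callable[[dict[str, str]], Callable[[str], str]],
) -> dict[str, str]:
    """Label a corpus in three stages: API seed labels, ASR training, ASR labeling."""
    clips = list(clips)

    # Stage 1: obtain high-fidelity transcripts for the seed subset only (assumed to be small).
    seed = {clip: api_transcribe(clip) for clip in clips if clip in seed_ids}

    # Stage 2: train an ASR model on the seed transcripts; returns a transcribe function.
    asr = train_asr(seed)

    # Stage 3: transcribe every remaining clip with the trained model.
    labels = dict(seed)
    for clip in clips:
        if clip not in seed_ids:
            labels[clip] = asr(clip)
    return labels


# Toy stand-ins so the sketch runs end to end; real systems would replace both.
def fake_api(clip: str) -> str:
    return f"api:{clip}"


def fake_train(seed: dict[str, str]) -> Callable[[str], str]:
    return lambda clip: f"asr[{len(seed)}]:{clip}"


result = pseudo_label(["a.wav", "b.wav", "c.wav"], {"a.wav"}, fake_api, fake_train)
print(result)
```

Passing the transcriber and the trainer in as callables keeps the sketch independent of any particular API or ASR toolkit, neither of which the abstract names.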