[D] Offering licensed Indian language speech datasets (with explicit contributor consent)

Reddit r/MachineLearning / 4/5/2026


Key Points

  • A small data initiative, DataCatalyst, offers licensed speech datasets covering multiple Indian languages sourced directly from contributors who provide explicit consent for their recordings’ use.
  • The datasets can be provided with either exclusive or non-exclusive licensing terms depending on the intended use case.
  • The initiative positions the data as ethically collected and aims to support teams building or researching ASR, TTS, and other voice AI applications.
  • Interested parties are invited to contact the founder for details on the datasets and the collection process.

Hi everyone,

I run a small data initiative where we collect speech datasets in multiple Indian languages directly from contributors who provide explicit consent for their recordings to be used and licensed.

We can provide datasets with either exclusive or non-exclusive rights depending on the use case. The goal is to make ethically sourced speech data available for teams working on ASR, TTS, voice AI, or related research.

If anyone here is working on speech models and might be looking for Indian language audio data, feel free to reach out. Happy to share more details about the datasets and collection process.

— Divyam
Founder, DataCatalyst
datacatalyst.in

submitted by /u/Trick-Praline6688