| I started working on a small coffee coaching app recently, something that could answer questions about brew methods, grind size, extraction, and so on. I was looking for good data and realized most written sources are either shallow or scattered. YouTube, on the other hand, has insanely high-quality content (James Hoffmann, Lance Hedrick, etc.), but it’s not usable out of the box for RAG: transcripts are messy, chunking is inconsistent, and getting everything into a usable format took way more effort than expected. So I made a small CLI tool that pulls videos from a channel, extracts their transcripts, and cleans and chunks them for embeddings.
It basically became the data layer for my app and, funnily enough, ended up getting way more traction than my actual coffee coaching app! Repo: youtube-rag-scraper [link] |
[P] Using YouTube as a data source (lessons from building a coffee domain dataset)
Reddit r/MachineLearning / 3/30/2026
💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage
Key Points
- The author built a coffee coaching app and found that YouTube’s expert brew-method content is high-quality but not directly usable for RAG because transcripts require heavy preprocessing.
- They report that transcript quality issues, inconsistent chunking, and cleaning steps made turning YouTube content into an embedding-ready dataset more time-consuming than expected.
- To address this, they created a CLI tool that pulls videos from a channel, extracts transcripts, and cleans and chunks them for use with embeddings.
- The project’s GitHub repo, youtube-rag-scraper, is positioned as the data layer for their RAG workflow and reportedly gained more traction than the original app.
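The clean-and-chunk step described above can be sketched in a few lines. This is a minimal illustration, not the repo's actual code: it assumes the transcript has already been fetched as plain text (e.g. via a caption-download library), strips common auto-caption artifacts, and produces overlapping fixed-size chunks ready for embedding. The function names and size/overlap parameters are hypothetical.

```python
import re

def clean_transcript(text: str) -> str:
    """Strip common auto-caption artifacts: bracketed cues like [Music],
    and the stray whitespace left by caption line breaks."""
    text = re.sub(r"\[(?:Music|Applause|Laughter)\]", " ", text, flags=re.I)
    return re.sub(r"\s+", " ", text).strip()

def chunk(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Split cleaned text into ~size-character chunks on word boundaries,
    carrying an ~overlap-character tail into the next chunk so sentences
    that straddle a boundary stay retrievable."""
    words = text.split()
    chunks, cur, cur_len = [], [], 0
    for w in words:
        cur.append(w)
        cur_len += len(w) + 1
        if cur_len >= size:
            chunks.append(" ".join(cur))
            # keep a tail of trailing words as the overlap for the next chunk
            tail, tail_len = [], 0
            for t in reversed(cur):
                if tail_len + len(t) + 1 > overlap:
                    break
                tail.insert(0, t)
                tail_len += len(t) + 1
            cur, cur_len = tail, tail_len
    if cur:
        chunks.append(" ".join(cur))
    return chunks
```

Character-based chunking with overlap is one of several reasonable choices here; the post does not say which strategy the tool uses, and a token-based or sentence-aware splitter would slot into the same pipeline.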
Related Articles

- Black Hat Asia (AI Business)
- Mr. Chatterbox is a (weak) Victorian-era ethically trained model you can run on your own computer (Simon Willison's Blog)
- Beyond the Chatbot: Engineering Multi-Agent Ecosystems in 2026 (Dev.to)
- I missed the "fun" part in software development (Dev.to)
- The Billion Dollar Tax on AI Agents (Dev.to)