Every fine-tuning project I've worked on has followed the same pattern: the model code is done in an hour, but data prep takes two days. Renaming columns, fixing encoding issues, filtering out garbage examples, converting to the right format. Not hard work, just slow work.
So I spent the last few months building Neurvance — a platform where every dataset is already cleaned, formatted, and structured for training. You can browse and download manually for free (everything's CC0-licensed).
What it does:
- Datasets are cleaned, deduplicated, and formatted for common training frameworks
- Manual downloads are free, no signup required
- API gives you bulk access and incremental pulls synced with your pipeline
- All data is CC0 — use it however you want
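Roughly, an incremental pull from a pipeline looks like this (the endpoint path, parameter names, and dataset id below are illustrative, not the final API):

```python
# Sketch of an incremental pull: fetch only records added or changed since
# the last successful sync, instead of re-downloading the whole dataset.
# NOTE: the host, path, parameter names, and dataset id are placeholders.
from urllib.parse import urlencode

BASE = "https://api.example.com/v1/datasets"  # placeholder host


def incremental_pull_url(dataset_id: str, since: str, page_size: int = 500) -> str:
    """Build a URL requesting only examples updated after `since`
    (an ISO-8601 timestamp), paged for bulk access."""
    params = urlencode({"updated_after": since, "limit": page_size})
    return f"{BASE}/{dataset_id}/records?{params}"


# A pipeline stores the timestamp of its last successful sync and
# passes it on the next run:
url = incremental_pull_url("example-instruct-set", "2024-06-01T00:00:00Z")
```

The idea is that your training pipeline only ever transfers the delta, so a nightly sync stays cheap even for large datasets.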
It's early and definitely rough in places. If anyone here is doing fine-tuning work and wants to try it, I'd genuinely appreciate honest feedback on what's missing or broken.
Happy to answer any questions about the data pipeline, how the cleaning works, or what datasets are available.