Every fine-tuning project I've worked on has followed the same pattern: the model code is done in an hour, but data prep takes two days. Renaming columns, fixing encoding issues, filtering out garbage examples, converting to the right format. Not hard work, just slow work.
So I spent the last few months building Neurvance — a platform where every dataset is already cleaned, formatted, and structured for training. You can browse and download manually for free (everything's CC0-licensed).
What it does:
- Datasets are cleaned, deduplicated, and formatted for common training frameworks
- Manual downloads are free, no signup required
- API gives you bulk access and incremental pulls synced with your pipeline
- All data is CC0 — use it however you want
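Roughly, an incremental pull from a pipeline looks like this (the endpoint path, parameter names, and dataset id below are illustrative, not the final API):

```python
# Sketch of an incremental pull: fetch only records added or changed since
# the last successful sync, instead of re-downloading the whole dataset.
# NOTE: the host, path, parameter names, and dataset id are placeholders.
from urllib.parse import urlencode

BASE = "https://api.example.com/v1/datasets"  # placeholder host


def incremental_pull_url(dataset_id: str, since: str, page_size: int = 500) -> str:
    """Build a URL requesting only examples updated after `since`
    (an ISO-8601 timestamp), paged for bulk access."""
    params = urlencode({"updated_after": since, "limit": page_size})
    return f"{BASE}/{dataset_id}/records?{params}"


# A pipeline stores the timestamp of its last successful sync and
# passes it on the next run:
url = incremental_pull_url("example-instruct-set", "2024-06-01T00:00:00Z")
```

The idea is that your training pipeline only ever transfers the delta, so a nightly sync stays cheap even for large datasets.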
It's early and definitely rough in places. If anyone here is doing fine-tuning work and wants to try it, I'd genuinely appreciate honest feedback on what's missing or broken.
Happy to answer any questions about the data pipeline, how the cleaning works, or what datasets are available.