We’ve released a 100,000-sample Chain-of-Thought (CoT) dataset for fine-tuning local reasoning models.
Each sample includes explicit intermediate reasoning traces, rather than answer-only supervision. The goal is to improve reasoning consistency during supervised fine-tuning, especially for smaller local models.
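To make the distinction from answer-only supervision concrete, here is a minimal sketch of how a sample with an explicit reasoning trace might be assembled into a single SFT target. The field names (`question`, `reasoning`, `answer`) are assumptions for illustration, not the dataset's actual schema; check the dataset card on Hugging Face for the real column names.

```python
# Hypothetical sample layout -- field names are assumed, not taken
# from the actual dataset; consult the dataset card for the schema.
sample = {
    "question": "If a train travels 60 km in 45 minutes, what is its speed in km/h?",
    "reasoning": "45 minutes is 0.75 hours. Speed = 60 km / 0.75 h = 80 km/h.",
    "answer": "80 km/h",
}

def to_training_text(s: dict) -> str:
    """Concatenate question, CoT trace, and final answer into one SFT target.

    With answer-only supervision, the "Reasoning:" segment would be omitted
    and the model would be trained to emit the answer directly.
    """
    return (
        f"Question: {s['question']}\n"
        f"Reasoning: {s['reasoning']}\n"
        f"Answer: {s['answer']}"
    )

print(to_training_text(sample))
```

Whether to train on the full trace or only the answer span is exactly the kind of trade-off (especially for smaller models) we're hoping to get feedback on.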
We’re sharing it here to gather feedback from people working on local LLM fine-tuning and reasoning distillation.
I’d especially love feedback on:
- CoT length
- consistency of reasoning style
- whether full reasoning traces help or hurt smaller local models
Hugging Face:
https://huggingface.co/datasets/Kamisori-daijin/email-datasets-v2-100k