Finetuning Dataset: Claude Opus 4.6/4.7 - 8.7k Chats

Reddit r/LocalLLaMA / 5/1/2026

💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • A Hugging Face dataset (angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k) provides 8,706 synthetic fine-tuning chat examples generated from Claude 4.6/4.7 with reasoning included in every example.
  • The dataset includes multiple splits—Full, Instruct (7,217 examples across 24 instructional categories), Roleplay (1,489 across four creative roleplay categories), and Code (1,840 limited to coding + math).
  • It contains an estimated 17,013,533 tokens overall (about 1,954 per example), with most samples single-turn (60.3%) rather than multi-turn (39.7%), and category sizes that vary widely (coding, humanities, and science dominate the counts).
  • The dataset card notes that basic cleaning was applied and that safety/refusal behavior should be "repressed"; the submitter generated the data using leftover plan usage before it expired.
  • Examples are split across two source models, claude-opus-4.6 (4,675; 53.7%) and claude-opus-4.7 (4,031; 46.3%), with claude-opus-4.7 accounting for the majority of estimated tokens (10,709,363 of roughly 17 million).
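The headline figures above are internally consistent and can be cross-checked with a few lines of arithmetic. All values are taken from the post; note the per-model token estimates sum to one less than the stated overall total, presumably a rounding artifact of the token estimator:

```python
# Figures as reported in the dataset card / post.
total_examples = 8_706
tokens_total = 17_013_533
multi_turn, single_turn = 3_454, 5_252
by_model = {"claude-opus-4-6": (4_675, 6_304_169),   # (examples, est. tokens)
            "claude-opus-4-7": (4_031, 10_709_363)}

assert multi_turn + single_turn == total_examples
assert sum(n for n, _ in by_model.values()) == total_examples
# Per-model token estimates land within 1 of the stated total (rounding).
assert abs(sum(t for _, t in by_model.values()) - tokens_total) <= 1

print(tokens_total // total_examples)               # avg tokens/example: 1954
print(round(100 * multi_turn / total_examples, 1))  # multi-turn share: 39.7
```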

https://huggingface.co/datasets/angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k

A synthetic fine-tuning dataset created from Claude Opus 4.6/4.7: 8,706 total examples, all with reasoning. I haven't reviewed the data, but some basic cleaning was applied. Refusals and safety behavior should be repressed. I ended up with extra usage on a plan before it expired.

## Splits

| Split | File | Examples | Contents |
|-------|------|---------:|----------|
| **Full** | `full_train.jsonl` | 8,706 | All examples across all 28 categories. |
| **Instruct** | `instruct_train.jsonl` | 7,217 | All 24 instructional categories: coding, math, sciences, humanities, arts, finance, medicine, law, business, linguistics, creative writing, general. |
| **Roleplay** | `roleplay_train.jsonl` | 1,489 | The four creative categories: `roleplay_hero`, `roleplay_villain`, `roleplay_crossover`, `narrative_prose`. |
| **Code** | `code_train.jsonl` | 1,840 | `coding` + `math` only, for coding/math-focused fine-tunes. |

## Overall

| Metric | Value |
|---|---:|
| Examples | 8,706 |
| Tokens (estimated) | 17,013,533 |
| Avg tokens / example | 1,954 |
| Multi-turn | 3,454 (39.7%) |
| Single-turn | 5,252 (60.3%) |

## Category Counts

| Category | Examples | Tokens | Multi-turn % |
|----------|---------:|-------:|-------------:|
| coding | 1,628 | 2,545,221 | 30.4% |
| humanities | 862 | 1,849,708 | 32.5% |
| science | 737 | 1,681,346 | 37.4% |
| roleplay_hero | 419 | 640,084 | 63.5% |
| roleplay_villain | 378 | 635,984 | 60.8% |
| narrative_prose | 377 | 710,807 | 43.0% |
| roleplay_crossover | 315 | 581,188 | 56.8% |
| creative_writing | 281 | 532,504 | 30.6% |
| medicine | 280 | 519,662 | 22.1% |
| biology | 277 | 541,013 | 21.3% |
| general | 276 | 284,696 | 37.0% |
| arts | 245 | 576,170 | 41.2% |
| chemistry | 221 | 508,546 | 52.9% |
| physics | 220 | 512,196 | 56.8% |
| math | 212 | 394,907 | 54.2% |
| geography | 155 | 358,321 | 42.6% |
| history | 155 | 348,822 | 41.3% |
| economics | 155 | 380,372 | 42.6% |
| political_science | 154 | 374,901 | 38.3% |
| sociology | 154 | 378,261 | 42.2% |
| business | 152 | 315,065 | 38.2% |
| earth_science | 152 | 358,209 | 41.4% |
| finance | 151 | 328,607 | 38.4% |
| philosophy | 150 | 335,514 | 41.3% |
| linguistics | 150 | 306,889 | 39.3% |
| literature | 150 | 299,606 | 38.7% |
| psychology | 150 | 339,565 | 39.3% |
| law | 150 | 375,360 | 41.3% |

## By Model

| Model | Count | Share | Tokens |
|---|---:|---:|---:|
| claude-opus-4-6 | 4,675 | 53.7% | 6,304,169 |
| claude-opus-4-7 | 4,031 | 46.3% | 10,709,363 |
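Since each split is a JSONL file, a minimal reader is just line-by-line `json.loads`. The post does not document the per-line schema, so the sketch below assumes the common chat format (`{"messages": [{"role": ..., "content": ...}, ...]}`); the field names and the `is_multi_turn` heuristic are assumptions to adjust against the actual files. An in-memory stand-in replaces the real `code_train.jsonl` so the sketch runs on its own:

```python
import io
import json

def load_split(fp):
    """Yield one chat dict per non-empty JSONL line."""
    for line in fp:
        line = line.strip()
        if line:
            yield json.loads(line)

def is_multi_turn(chat):
    """Heuristic: more than one user message means a multi-turn example."""
    return sum(1 for m in chat["messages"] if m["role"] == "user") > 1

# Stand-in for open("code_train.jsonl") with one single-turn example.
sample = io.StringIO(
    '{"messages": [{"role": "user", "content": "hi"}, '
    '{"role": "assistant", "content": "hello"}]}\n'
)
chats = list(load_split(sample))
print(len(chats), is_multi_turn(chats[0]))  # 1 False
```

With the real files, the same loop would let you recompute the per-category multi-turn percentages in the table above, or filter the Full split down to a custom category mix.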
submitted by /u/AldebaranBefore