AI Navigate

Trained a GPT transformer from scratch on a $300 CPU — 39 minutes, 0.82M params, no GPU needed

Reddit r/LocalLLaMA / 3/21/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • A character-level GPT transformer with 0.82M parameters was trained from scratch in PyTorch on CPU only (no GPU) in 39 minutes on a $300 machine.
  • The model learns character patterns rather than words, with a vocab size of 28, and demonstrates end-to-end training without fine-tuning or pre-trained weights.
  • The training run shows continuous improvement, with train and val losses decreasing together and no overfitting observed across 3000 steps.
  • Generated output reveals learned story structure and character names but shortfalls in spelling and global coherence, reflecting a character-level rather than word-level understanding.
  • The article outlines next steps: scale data to 1M+ characters, extend training to 5000-10000 iterations, and consider larger models after increasing data/steps.

Character-level GPT transformer built in PyTorch from scratch — pure architecture and training from zero. No fine-tuning, no pre-trained weights, no cloud compute.

Can be trained on $300 machine

GitHub repo: https://github.com/Eamon2009/Transformer-language-model

What I trained:

Parameters: 0.82M
Dataset: 201K characters of children's stories
Vocab size: 28 unique characters
Hardware: CPU only (AMD Ryzen 5)
Train time: 39 minutes
Best val loss: 1.3145, still improving at step 3000

Full training log:

[    0/3000] train=3.2961 val=3.2981 << best!
[  200/3000] train=2.3038 val=2.2490 << best!
[  400/3000] train=2.2469 val=2.1950 << best!
[  800/3000] train=1.9742 val=1.9103 << best!
[ 1400/3000] train=1.5889 val=1.5360 << best!
[ 2000/3000] train=1.4604 val=1.4081 << best!
[ 2600/3000] train=1.3501 val=1.3446 << best!
[ 2999/3000] train=1.3191 val=1.3145 << best!

Every single checkpoint improved. No overfitting at all — train and val loss decreased together the entire run.
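A log like the one above comes from periodically estimating loss on held-out data and tracking the best validation score. Here is a minimal sketch of that pattern; names like `get_batch` and `estimate_loss` are illustrative, and a tiny bigram stand-in replaces the real GPT so the snippet runs anywhere, so don't expect its losses to match the post's.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
block_size, batch_size, vocab_size = 32, 8, 28

# Stand-in data: random "character" ids; a real run would encode the corpus.
data = torch.randint(0, vocab_size, (2000,))
n = int(0.9 * len(data))
splits = {"train": data[:n], "val": data[n:]}

def get_batch(split):
    # Sample random windows of block_size characters; targets are shifted by one.
    d = splits[split]
    ix = torch.randint(len(d) - block_size - 1, (batch_size,))
    x = torch.stack([d[i:i + block_size] for i in ix])
    y = torch.stack([d[i + 1:i + 1 + block_size] for i in ix])
    return x, y

# Stand-in model: a bigram lookup table keeps this runnable; the post trains a GPT.
model = nn.Sequential(nn.Embedding(vocab_size, vocab_size))

@torch.no_grad()
def estimate_loss(eval_iters=10):
    # Average loss over several batches per split for a stable estimate.
    out = {}
    model.eval()
    for split in splits:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            x, y = get_batch(split)
            logits = model(x)
            losses[k] = nn.functional.cross_entropy(
                logits.view(-1, vocab_size), y.view(-1))
        out[split] = losses.mean().item()
    model.train()
    return out

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
best_val = float("inf")
for step in range(201):
    if step % 100 == 0:
        losses = estimate_loss()
        if losses["val"] < best_val:      # the "<< best!" marker in the log
            best_val = losses["val"]      # a checkpoint would be saved here
        print(f"[{step:>4}/200] train={losses['train']:.4f} val={losses['val']:.4f}")
    x, y = get_batch("train")
    logits = model(x)
    loss = nn.functional.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
```

The key property the post highlights ("no overfitting") shows up here as train and val estimates falling together; divergence between them would be the signal to stop or regularize.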

Actual output the model generated:

one day and was arroom him that she rabbing animals the dreezed at neard had to there man owl them one smiled the mushrought boy he rabbit to havin after the but help 

Story structure learned. Character names learned. Narrative flow learned. Spelling breaks because the model works character by character: it has learned that after "fr" characters like i, e, n, d tend to follow, but it sometimes gets the sequence slightly wrong. It has no concept of words, only character patterns.
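The "no concept of words" point follows from how character-level tokenization works: the entire vocabulary is just the set of distinct characters in the corpus. A minimal sketch (toy corpus for illustration; the post's dataset yields 28 characters):

```python
# Build a character-level tokenizer from a corpus.
text = "one day the friendly rabbit smiled"  # toy stand-in corpus
chars = sorted(set(text))                    # the whole "vocabulary"
stoi = {ch: i for i, ch in enumerate(chars)} # char -> id
itos = {i: ch for ch, i in stoi.items()}     # id -> char

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

print(len(chars))                  # vocab size
print(decode(encode("friend")))    # round-trips to "friend"
```

Because every prediction is a single next character, a misstep anywhere in a sequence produces a misspelled word like "driendly" even when the local character statistics are mostly right.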

What it got right vs wrong:

✓ Story structure → "one day...", paragraphs, narrative flow
✓ Character names → jack, tim, lucy, mary
✓ Sentence patterns → "he said", "she was", "they went"
✗ Spelling → "driendly", "mushrought", "surpring"
✗ Logic → sentences don't connect coherently

The architecture runs on any hardware:

batch_size = 16
block_size = 128
n_embd = 128
n_head = 4
n_layer = 4
dropout = 0.2
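The quoted 0.82M figure is consistent with a standard GPT block (q/k/v/output projections, 4x MLP, two layernorms) plus embeddings and an untied LM head. A back-of-envelope check, assuming that standard layout rather than the repo's exact code:

```python
# Rough parameter count for the config above (standard GPT block assumed).
vocab_size, block_size = 28, 128
n_embd, n_layer = 128, 4

tok_emb = vocab_size * n_embd                  # token embedding table
pos_emb = block_size * n_embd                  # learned positional embeddings
attn = 4 * n_embd * n_embd + 4 * n_embd        # q,k,v,proj weights + biases
mlp = 2 * n_embd * (4 * n_embd) + 5 * n_embd   # up/down projections + biases
lns = 2 * (2 * n_embd)                         # two layernorms per block
per_layer = attn + mlp + lns
lm_head = n_embd * vocab_size                  # untied output head
total = tok_emb + pos_emb + n_layer * per_layer + 2 * n_embd + lm_head
print(f"{total / 1e6:.2f}M parameters")        # lands at ~0.82M
```

Note that `n_head` doesn't change the count: it only splits the same q/k/v matrices into more, smaller heads.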

If you have a GPU, scale to 10.8M parameters by changing 4 lines in the config. The model hasn't hit its ceiling — val loss was still falling at step 3000. More data and more steps would directly improve output.
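The post doesn't say which four lines; one plausible scale-up that lands near 10.8M under the same parameter-count assumptions as above (these values are a guess, the repo's GPU config is authoritative):

```python
# Hypothetical 4-line change: bigger context, width, and depth.
vocab_size = 28
block_size = 256   # was 128
n_embd     = 384   # was 128
n_head     = 6     # was 4
n_layer    = 6     # was 4

per_layer = (4 * n_embd**2 + 4 * n_embd        # attention
             + 8 * n_embd**2 + 5 * n_embd      # 4x MLP
             + 4 * n_embd)                     # layernorms
total = (vocab_size * n_embd + block_size * n_embd   # embeddings
         + n_layer * per_layer + 2 * n_embd          # final layernorm
         + n_embd * vocab_size)                      # LM head
print(f"{total / 1e6:.1f}M parameters")
```

Most of the growth comes from `n_embd`: per-layer parameters scale roughly with its square, which is why a 3x width increase dominates the jump from 0.82M to ~10.8M.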

Highest impact next steps for anyone wanting to extend this:

1. Scale data to 1M+ characters (the TinyStories dataset is perfect)
2. Increase max_iters to 5000-10000
3. Larger model only after steps 1 and 2
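Step 1 has one catch: new text must be normalized to fit the model's 28-character vocabulary, or the embedding table no longer matches. A sketch of that cleaning pass, assuming (our guess) the 28 characters are roughly lowercase letters plus whitespace; stand-in data replaces an actual TinyStories download:

```python
import re

def clean(text: str) -> str:
    """Lowercase and strip characters outside a small a-z + whitespace set."""
    text = text.lower()
    text = re.sub(r"[^a-z \n]", " ", text)   # replace out-of-vocab chars
    return re.sub(r"[ ]{2,}", " ", text)     # collapse runs of spaces

# Stand-in stories; in practice, stream TinyStories here until 1M+ characters.
stories = ["Once upon a time, Lucy met a friendly owl!"] * 50
corpus = "\n".join(clean(s) for s in stories)

vocab = sorted(set(corpus))
print(len(corpus), len(vocab))  # target: 1M+ characters, vocab size <= 28
```

Training the same 0.82M model on 5x the data with 2-3x the iterations is the cheapest way to push val loss below 1.31, since the log shows it was still falling at step 3000.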

Full training logs, output analysis, overfitting breakdown, and GPU config are in the repo.

submitted by /u/Suspicious_Gap1121