Built a character-level GPT from scratch in PyTorch — no pre-trained weights, no HuggingFace, no shortcuts. Trained the same architecture twice under very different compute conditions to measure exactly what scaling does to loss and output quality.
Repo: https://github.com/Eamon2009/Transformer-language-model
---
**Architecture (both runs)**
Standard GPT decoder stack — multi-head causal self-attention, learned positional embeddings, LayerNorm + residuals, AdamW (lr=3e-4), dropout=0.2. Only the scale differs between runs.
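The core of that stack is the masked attention. Here's a minimal sketch of one causal self-attention module in PyTorch — illustrative only, not the repo's exact code, though the 0.2 dropout matches the description above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model, n_heads, block_size, dropout=0.2):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # fused Q/K/V projection
        self.proj = nn.Linear(d_model, d_model)
        self.drop = nn.Dropout(dropout)
        # lower-triangular mask blocks attention to future positions
        self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # reshape each to (B, n_heads, T, head_dim)
        q = q.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
        k = k.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
        v = v.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / (k.size(-1) ** 0.5)
        att = att.masked_fill(self.mask[:T, :T] == 0, float("-inf"))
        att = self.drop(F.softmax(att, dim=-1))
        out = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out)
```

Each decoder block wraps this in a pre-norm residual (`x + attn(ln(x))`), followed by a 4× MLP with its own residual.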
---
**Run 1 — CPU (AMD Ryzen 5 PRO 3500U)**
- 0.82M params | 4 layers × 4 heads × 128d
- 201,570 chars | vocab=28 | block=128 | batch=16
- 3,000 iters | 39.4 minutes
- Best val loss: **1.3145** | no overfitting
**Run 2 — CUDA (Google Colab GPU)**
- 10.82M params | 6 layers × 6 heads × 384d
- 88,406,739 chars | vocab=110 | block=256 | batch=64
- 5,000 iters | 61.3 minutes
- Best val loss: **0.7176** | no overfitting
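Both parameter counts check out against a back-of-the-envelope formula. This sketch assumes biased linears, a 4× MLP, learned positional embeddings, and an untied output head — the exact layout in the repo may differ slightly:

```python
def gpt_params(vocab, d, n_layers, block):
    """Approximate parameter count for a standard GPT decoder stack."""
    emb = vocab * d + block * d          # token + positional embeddings
    attn = 4 * d * d + 4 * d             # q/k/v + output proj, with biases
    mlp = 8 * d * d + 5 * d              # two linears with 4x expansion
    ln = 2 * 2 * d                       # two LayerNorms per block
    head = 2 * d + d * vocab + vocab     # final LayerNorm + lm head
    return emb + n_layers * (attn + mlp + ln) + head

print(gpt_params(28, 128, 4, 128))      # ~0.82M (Run 1)
print(gpt_params(110, 384, 6, 256))     # ~10.83M (Run 2)
```

Note how little the embeddings matter at this scale: almost all of the 13× growth comes from the `d²` terms inside the blocks.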
---
**The numbers that matter**
- Parameters: 0.82M → 10.82M **(13.2× more)**
- Dataset: 201K → 88.4M chars **(438× more)**
- Training time: 39.4 → 61.3 min **(only 1.56× longer)**
- Val loss: 1.3145 → 0.7176 **(45% drop)**
- Overfitting: none in either run — val loss set a new best at every checkpoint
- Ceiling hit: no — loss still falling in both runs at final iter
438× more data and 13.2× more parameters, for only 1.56× the time. That's what CUDA gives you.
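The ratios above fall straight out of the run stats — a quick sanity check in plain Python:

```python
params = (0.82e6, 10.82e6)        # Run 1, Run 2
chars = (201_570, 88_406_739)
minutes = (39.4, 61.3)
val_loss = (1.3145, 0.7176)

print(f"params: {params[1] / params[0]:.1f}x")     # 13.2x
print(f"data:   {chars[1] / chars[0]:.1f}x")       # 438.6x
print(f"time:   {minutes[1] / minutes[0]:.2f}x")   # 1.56x
print(f"loss:   {(1 - val_loss[1] / val_loss[0]) * 100:.1f}% drop")  # 45.4%
```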
---
**Run 2 full loss log**
| Iter | Train | Val |
|------|--------|--------|
| 0 | 4.9244 | 4.9262 |
| 250 | 2.1218 | 2.1169 |
| 500 | 1.3606 | 1.3500 |
| 1000 | 1.0332 | 1.0296 |
| 1500 | 0.9305 | 0.9189 |
| 2000 | 0.8673 | 0.8602 |
| 2500 | 0.8162 | 0.8141 |
| 3000 | 0.7888 | 0.7803 |
| 3500 | 0.7634 | 0.7551 |
| 4000 | 0.7480 | 0.7434 |
| 4500 | 0.7371 | 0.7314 |
| 4999 | 0.7259 | **0.7176** ← best! |
Train/val gap at end: 0.0083. Loss was still falling at the final checkpoint — this model has not plateaued.
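Both claims — monotone improvement and the 0.0083 gap — are easy to verify mechanically from the log:

```python
# (iter, train, val) triples copied from the Run 2 log above
log = [
    (0, 4.9244, 4.9262), (250, 2.1218, 2.1169), (500, 1.3606, 1.3500),
    (1000, 1.0332, 1.0296), (1500, 0.9305, 0.9189), (2000, 0.8673, 0.8602),
    (2500, 0.8162, 0.8141), (3000, 0.7888, 0.7803), (3500, 0.7634, 0.7551),
    (4000, 0.7480, 0.7434), (4500, 0.7371, 0.7314), (4999, 0.7259, 0.7176),
]
val = [v for _, _, v in log]
# every checkpoint improves on the last -> no overfitting signal
assert all(b < a for a, b in zip(val, val[1:]))
gap = log[-1][1] - log[-1][2]
print(f"final train/val gap: {gap:.4f}")   # 0.0083
```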
---
**Chinchilla position (20× rule)**
- Run 1: 0.82M params → needs ~16.4M tokens → had 200K → **1.2% of optimal**
- Run 2: 10.82M params → needs ~216M tokens → had 79.6M → **36.8% of optimal**
Run 2 is 30× closer to compute-optimal. The output quality gap is a direct consequence.
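The percentages come from the ~20-tokens-per-parameter heuristic from the Chinchilla paper, applied to the token counts above:

```python
def chinchilla_fraction(params, tokens, tokens_per_param=20):
    """% of the ~20 tokens/param compute-optimal data budget actually used."""
    return tokens / (params * tokens_per_param) * 100

run1 = chinchilla_fraction(0.82e6, 200_000)   # ~1.2%
run2 = chinchilla_fraction(10.82e6, 79.6e6)   # ~36.8%
print(f"Run 1: {run1:.1f}% of optimal")
print(f"Run 2: {run2:.1f}% of optimal")
print(f"Run 2 is {run2 / run1:.0f}x closer")  # ~30x
```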
---
**Actual output — same architecture, only scale differs**
Run 2 (10.82M, val loss 0.7176):
> Upon a time, there were two friends, Jack and Tom. They had a cold doll in the sunshine.
>
> One day, Jack saw that he was universed. He used the sky at past it to march around the garden. He felt dizzy and wanted to share his happy with them.
Run 1 (0.82M, val loss 1.3145):
> when years me told be found a big ea reak abig driendly they named not she rabbit smiled by aded he what in again one smiled the mushrought boy
Run 2: coherent paragraphs, consistent character names, proper sentence boundaries. Run 1: character-pattern noise. Same architecture — only scale differs.
---
**What's next**
- Push to 10,000 iters — loss still falling, ceiling not reached
- Expand dataset toward compute-optimal (~216M tokens for this model size)
- Hold off on growing the model until data catches up
Full logs, architecture code, and README with detailed comparisons at the repo. Happy to answer questions in the comments.
