Trained the same GPT architecture twice, on CPU and on GPU — 0.82M vs 10.82M parameters, with full logs

Reddit r/LocalLLaMA / 2026/3/22

Key points

  • Built a character-level GPT from scratch in PyTorch, with no pre-trained weights and no third-party shortcuts. Trained it twice under different compute conditions (CPU vs GPU) to measure what scaling does to loss and output quality.
  • Run 1 used 0.82M parameters and ~201K characters; Run 2 used 10.82M parameters and ~88.4M characters, a 13.2× increase in parameters and a 438× increase in data.
  • Validation loss improved from 1.3145 (Run 1) to 0.7176 (Run 2), with no overfitting observed in either run.
  • Despite the much larger model and dataset, training time grew only 1.55× (39.4 min → 61.3 min), underscoring CUDA's efficiency gains.
  • The results are consistent with Chinchilla scaling theory: Run 2 sits far closer to compute-optimal, and the gap in output quality between the runs follows directly from that.

Built a character-level GPT from scratch in PyTorch — no pre-trained weights, no HuggingFace, no shortcuts. Trained the same architecture twice under very different compute conditions to measure exactly what scaling does to loss and output quality.

Repo: https://github.com/Eamon2009/Transformer-language-model

---

**Architecture (both runs)**

Standard GPT decoder stack — multi-head causal self-attention, learned positional embeddings, LayerNorm + residuals, AdamW (lr=3e-4), dropout=0.2. Only the scale differs between runs.
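
For reference, a minimal PyTorch sketch of that stack, written in the common nanoGPT style. The class and argument names are mine, the pre-LN block order is an assumption, and the repo's actual code may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head self-attention with a causal mask."""
    def __init__(self, d_model, n_heads, block_size, dropout=0.2):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)
        # lower-triangular mask: position i attends only to positions <= i
        mask = torch.tril(torch.ones(block_size, block_size))
        self.register_buffer("mask", mask.view(1, 1, block_size, block_size))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
        k = k.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
        v = v.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / (k.size(-1) ** 0.5)
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = self.dropout(F.softmax(att, dim=-1))
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.dropout(self.proj(y))

class Block(nn.Module):
    """Transformer block: LayerNorm + attention/MLP, each with a residual."""
    def __init__(self, d_model, n_heads, block_size, dropout=0.2):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_heads, block_size, dropout)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model), nn.Dropout(dropout),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x

class GPT(nn.Module):
    """Decoder-only GPT with learned positional embeddings."""
    def __init__(self, vocab_size, d_model, n_heads, n_layers, block_size, dropout=0.2):
        super().__init__()
        self.block_size = block_size
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(block_size, d_model)   # learned, not sinusoidal
        self.blocks = nn.Sequential(
            *[Block(d_model, n_heads, block_size, dropout) for _ in range(n_layers)])
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        logits = self.head(self.ln_f(self.blocks(x)))
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss

# Run 1 shape: GPT(vocab_size=28, d_model=128, n_heads=4, n_layers=4, block_size=128)
# Optimizer as stated above: torch.optim.AdamW(model.parameters(), lr=3e-4)
```

With the Run 1 shape this sketch lands at roughly 0.8M parameters, and the Run 2 shape at roughly 10.8M, consistent with the figures below.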

---

**Run 1 — CPU (AMD Ryzen 5 PRO 3500U)**

- 0.82M params | 4 layers × 4 heads × 128d

- 201,570 chars | vocab=28 | block=128 | batch=16

- 3,000 iters | 39.4 minutes

- Best val loss: **1.3145** | no overfitting

**Run 2 — CUDA (Google Colab GPU)**

- 10.82M params | 6 layers × 6 heads × 384d

- 88,406,739 chars | vocab=110 | block=256 | batch=64

- 5,000 iters | 61.3 minutes

- Best val loss: **0.7176** | no overfitting
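
Both configs imply the same char-level data pipeline, just with different constants. A hedged sketch of how the vocab and batches would be built (function names and the 90/10 split are my assumptions, though a ~90% train split is consistent with 88.4M chars yielding the ~79.6M training tokens cited below):

```python
import torch

def load_char_dataset(path):
    """Character-level tokenization: every distinct character is one token."""
    text = open(path, encoding="utf-8").read()
    chars = sorted(set(text))                      # vocab=28 (Run 1) / 110 (Run 2)
    stoi = {ch: i for i, ch in enumerate(chars)}
    data = torch.tensor([stoi[c] for c in text], dtype=torch.long)
    n = int(0.9 * len(data))                       # assumed 90/10 train/val split
    return data[:n], data[n:], chars

def get_batch(data, block_size, batch_size, device):
    """Random contiguous windows; targets are the inputs shifted by one char."""
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + 1 + block_size] for i in ix])
    return x.to(device), y.to(device)
```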

---

**The numbers that matter**

- Parameters: 0.82M → 10.82M **(13.2× more)**

- Dataset: 201K → 88.4M chars **(438× more)**

- Training time: 39.4 → 61.3 min **(only 1.55× longer)**

- Val loss: 1.3145 → 0.7176 **(45% drop)**

- Overfitting: none in either run — val loss set a new best at every single checkpoint

- Ceiling hit: no — loss still falling in both runs at final iter

438× more data and 13× more parameters, for only 1.55× the time. That's what CUDA gives you.
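
Back-of-the-envelope check of that claim (my arithmetic from the numbers above; it ignores that each Run 2 step also pushes a 13.2× larger model, so the effective FLOP speedup is higher still):

```python
chars_run1 = 3_000 * 16 * 128          # iters × batch × block ≈ 6.1M chars seen
chars_run2 = 5_000 * 64 * 256          # ≈ 81.9M chars seen
char_ratio = chars_run2 / chars_run1   # ≈ 13.3× more characters processed
time_ratio = 61.3 / 39.4               # ≈ 1.56× more wall-clock time
print(char_ratio / time_ratio)         # ≈ 8.6× raw character throughput on CUDA
```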

---

**Run 2 full loss log**

| Iter | Train  | Val    |
|-----:|-------:|-------:|
|    0 | 4.9244 | 4.9262 |
|  250 | 2.1218 | 2.1169 |
|  500 | 1.3606 | 1.3500 |
| 1000 | 1.0332 | 1.0296 |
| 1500 | 0.9305 | 0.9189 |
| 2000 | 0.8673 | 0.8602 |
| 2500 | 0.8162 | 0.8141 |
| 3000 | 0.7888 | 0.7803 |
| 3500 | 0.7634 | 0.7551 |
| 4000 | 0.7480 | 0.7434 |
| 4500 | 0.7371 | 0.7314 |
| 4999 | 0.7259 | 0.7176 ← best |

Train/val gap at end: 0.0083. Loss was still falling at the final checkpoint — this model has not plateaued.
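
Checkpoint pairs like these typically come from a periodic eval loop that averages loss over a handful of random batches per split. A sketch of that common pattern, reusing the hypothetical get_batch above (the repo's eval code may differ):

```python
import torch

@torch.no_grad()
def estimate_loss(model, splits, block_size, batch_size, device, eval_iters=200):
    """Average loss over eval_iters random batches per split."""
    model.eval()
    out = {}
    for name, data in splits.items():   # e.g. {"train": train_data, "val": val_data}
        losses = torch.zeros(eval_iters)
        for i in range(eval_iters):
            xb, yb = get_batch(data, block_size, batch_size, device)
            _, loss = model(xb, yb)
            losses[i] = loss
        out[name] = losses.mean().item()
    model.train()
    return out
```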

---

**Chinchilla position (20× rule)**

- Run 1: 0.82M params → needs ~16.4M tokens → had 200K → **1.2% of optimal**

- Run 2: 10.82M params → needs ~216M tokens → had 79.6M → **36.8% of optimal**

Run 2 is 30× closer to compute-optimal. The output quality gap is a direct consequence.
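
The percentages fall straight out of the 20-tokens-per-parameter heuristic (my arithmetic, reproducing the post's figures):

```python
def chinchilla_fraction(params, tokens):
    """Fraction of the compute-optimal token budget (≈ 20 × params) actually used."""
    return tokens / (20 * params)

print(f"Run 1: {chinchilla_fraction(0.82e6, 0.2e6):.1%}")    # 1.2% of optimal
print(f"Run 2: {chinchilla_fraction(10.82e6, 79.6e6):.1%}")  # 36.8% of optimal
```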

---

**Actual output — same architecture, only scale differs**

Run 2 (10.82M, val loss 0.7176):

> Upon a time, there were two friends, Jack and Tom. They had a cold doll in the sunshine.

>

> One day, Jack saw that he was universed. He used the sky at past it to march around the garden. He felt dizzy and wanted to share his happy with them.

Run 1 (0.82M, val loss 1.3145):

> when years me told be found a big ea reak abig driendly they named not she rabbit smiled by aded he what in again one smiled the mushrought boy

Run 2: coherent paragraphs, consistent character names, proper sentence boundaries. Run 1: character-pattern noise. Same architecture — only scale differs.
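
For completeness, samples like these come from plain autoregressive decoding. A minimal sketch against the GPT class above; the temperature and sampling settings behind the posted outputs aren't stated, so the defaults here are assumptions:

```python
import torch

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size, temperature=1.0):
    """Sample one character at a time, feeding each choice back in."""
    for _ in range(max_new_tokens):
        logits, _ = model(idx[:, -block_size:])       # crop to the context window
        probs = torch.softmax(logits[:, -1, :] / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)
    return idx
```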

---

**What's next**

- Push to 10,000 iters — loss still falling, ceiling not reached

- Expand dataset toward compute-optimal (~216M tokens for this model size)

- Hold off on growing the model until data catches up

Full logs, architecture code, and README with detailed comparisons at the repo. Happy to answer questions in the comments.

https://github.com/Eamon2009/Transformer-language-model

submitted by /u/Suspicious_Gap1121