I built a transformer in C++17 from scratch — no PyTorch, no BLAS, no dependencies. Trains on CPU. 0.83M params, full analytical backprop, 76 min to val loss 1.64.

Reddit r/LocalLLaMA / 5/3/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • A developer built Quadtrix.cpp, a GPT-style transformer language model in C++17 implemented from scratch with only the C++ standard library and POSIX sockets, avoiding PyTorch, LibTorch, BLAS, and any autograd library.
  • The project includes a hand-written tensor library and a full analytical backward pass with explicit gradient derivations for each operator, including tricky components like LayerNorm and attention with correct dropout-mask tracking.
  • The model is a decoder-only transformer (4 layers, 4 heads, 200d) with 0.83M parameters and a 128-character context window, trained on 31.4M characters of children’s stories.
  • In one reported training run, it reached a best validation loss of 1.6371 nats after 76.2 minutes on a single CPU core, with OpenMP parallelization delivering about a 5–7x speedup on an 8-core machine.
  • Example generations are largely gibberish, but the author emphasizes that the output comes from the fully implemented gradients and training loop running in a dependency-free C++ binary.

For the past few months I've been working on Quadtrix.cpp — a complete GPT-style language model implemented in C++17. No PyTorch. No LibTorch. No BLAS. No auto-differentiation library of any kind. The only dependency is the C++17 standard library and POSIX sockets.

Repo: https://github.com/Eamon2009/Quadtrix.cpp

Everything is hand-written: the tensor library, all forward pass operations, and the full analytical backward pass with explicit gradient derivations for every operator.
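
To give a sense of what "hand-written tensor library" means in practice, here is a minimal sketch of a row-major 2D/3D float tensor. This is an illustration of the idea only; the struct and member names are made up, not taken from the repo:

```

// Minimal sketch of a row-major float tensor (2D or 3D), for illustration only;
// the names here are invented for this example, not copied from the repo.
#include <cstddef>
#include <vector>

struct Tensor {
    std::vector<float> data;             // contiguous row-major storage
    std::size_t d0 = 0, d1 = 0, d2 = 1;  // a 2D tensor is (d0, d1) with d2 == 1

    Tensor(std::size_t rows, std::size_t cols)
        : data(rows * cols, 0.f), d0(rows), d1(cols) {}
    Tensor(std::size_t batch, std::size_t rows, std::size_t cols)
        : data(batch * rows * cols, 0.f), d0(batch), d1(rows), d2(cols) {}

    // 2D access: (row, col)
    float& operator()(std::size_t i, std::size_t j) { return data[i * d1 + j]; }
    // 3D access: (batch, row, col)
    float& operator()(std::size_t b, std::size_t i, std::size_t j) {
        return data[(b * d1 + i) * d2 + j];
    }
};

```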

Training run v1.0

  • Architecture: 4 layers x 4 heads x 200d decoder-only transformer
  • Parameters: 826,985 (0.83 M)
  • Context window: 128 characters
  • Corpus: 31.4 M characters of children's stories
  • Best val loss: 1.6371 nats
  • Training time: 76.2 minutes on a single CPU core
  • External dependencies: zero

What is actually implemented

  • Lightweight CPU float tensor library (2D/3D, row-major storage)
  • Token and position embeddings, LayerNorm, Linear, Dropout
  • Multi-head causal self-attention with causal mask
  • Feed-forward blocks: Linear -> ReLU -> Linear
  • Complete backward pass: cross-entropy, softmax, layer normalisation (Ba et al. 3-term formula), scaled dot-product attention, Q/K/V gradients, ReLU, dropout, embedding scatter-add
  • AdamW optimiser with bias correction (a sketch of the update step follows this list)
  • Character-level tokeniser and batch sampler
  • OpenMP parallelisation across all CPU cores — matmul, bmm, softmax, and layernorm all parallelised. Around 5-7x speedup on an 8-core machine (see the matmul sketch below)
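
Since the OpenMP point is the one that actually moves the training time, here is roughly what a parallelised row-major matmul looks like. Function and argument names are illustrative, not copied from the repo; compile with -fopenmp:

```

// Illustrative OpenMP matmul over row-major buffers: C = A * B.
// A is M x K, B is K x N, C is M x N. Not the repo's exact implementation.
#include <cstddef>

void matmul(const float* A, const float* B, float* C,
            std::size_t M, std::size_t K, std::size_t N) {
    // one row of C per iteration; rows are independent, so this parallelises cleanly
    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(M); ++i) {
        for (std::size_t j = 0; j < N; ++j) C[i * N + j] = 0.f;
        for (std::size_t k = 0; k < K; ++k) {
            const float a = A[i * K + k];
            // i-k-j loop order streams through a whole row of B at a time
            for (std::size_t j = 0; j < N; ++j)
                C[i * N + j] += a * B[k * N + j];
        }
    }
}

```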
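
And the AdamW bias-corrected update, again as a sketch over a flat parameter buffer rather than the repo's exact optimiser code:

```

// Illustrative AdamW step with bias correction over a flat parameter vector.
// The names (adamw_step, AdamWState) are made up for this sketch.
#include <cmath>
#include <cstddef>
#include <vector>

struct AdamWState {
    std::vector<float> m, v;  // first and second moment estimates
};

void adamw_step(std::vector<float>& w, const std::vector<float>& grad,
                AdamWState& s, std::size_t t,  // t is the 1-based step count
                float lr = 1e-3f, float beta1 = 0.9f, float beta2 = 0.999f,
                float eps = 1e-8f, float weight_decay = 0.01f) {
    if (s.m.empty()) { s.m.assign(w.size(), 0.f); s.v.assign(w.size(), 0.f); }
    const float bc1 = 1.f - std::pow(beta1, static_cast<float>(t));  // bias corrections
    const float bc2 = 1.f - std::pow(beta2, static_cast<float>(t));
    for (std::size_t i = 0; i < w.size(); ++i) {
        s.m[i] = beta1 * s.m[i] + (1.f - beta1) * grad[i];
        s.v[i] = beta2 * s.v[i] + (1.f - beta2) * grad[i] * grad[i];
        const float m_hat = s.m[i] / bc1;
        const float v_hat = s.v[i] / bc2;
        // decoupled weight decay is applied directly to the weights (the "W" in AdamW)
        w[i] -= lr * (m_hat / (std::sqrt(v_hat) + eps) + weight_decay * w[i]);
    }
}

```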

The gradient derivations alone took about a week.

The layernorm backward is the part that trips everyone up. You need to save mu, inverse-std, and x-hat per row during the forward pass and apply the full 3-term formula in the backward. The attention backward requires careful tracking of which dropout mask was applied to the attention weights versus the projection output.
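
Concretely, once x-hat and the inverse std are cached per row, the three-term backward for one row boils down to something like this. It is a sketch of the formula, not the exact code from backward.h:

```

// Illustrative per-row LayerNorm backward using the three-term formula:
//   dx = inv_std * (g - mean(g) - x_hat * mean(g * x_hat)),  with g = dy * gamma.
// Assumes x_hat and inv_std were cached during the forward pass.
#include <cstddef>

void layernorm_backward_row(const float* dy, const float* x_hat, const float* gamma,
                            float inv_std, std::size_t D,
                            float* dx, float* dgamma, float* dbeta) {
    float sum_g = 0.f, sum_gx = 0.f;
    for (std::size_t i = 0; i < D; ++i) {
        const float g = dy[i] * gamma[i];   // gradient w.r.t. x_hat
        sum_g  += g;
        sum_gx += g * x_hat[i];
        dgamma[i] += dy[i] * x_hat[i];      // parameter gradients accumulate across rows
        dbeta[i]  += dy[i];
    }
    const float mean_g  = sum_g  / static_cast<float>(D);
    const float mean_gx = sum_gx / static_cast<float>(D);
    for (std::size_t i = 0; i < D; ++i) {
        const float g = dy[i] * gamma[i];
        // term 1: direct path, term 2: mean correction, term 3: variance path
        dx[i] = inv_std * (g - mean_g - x_hat[i] * mean_gx);
    }
}

```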

Sample output after training

```

You > Once upon a time

Quadtrix > , and said askiced and so owas said sri. The his brickerys and stew hhat and saw and stark a din't. She stingry and asked day. Timmy watch and played to cones.

You > Timmy is a

Quadtrix > bog the scated justo prove the bret you. Timmy nevery some the gecid. Her neplay to bet starked a way, that litked cliend.

You > what is life

Quadtrix > st happe. It happ a liked back abp happy thing flongs way. Lily lood take maked a fiside apie? Tom and abed Timm.

```

Yes, it is gibberish. It is a 0.83M-parameter model trained for 76 minutes on a CPU. But it is my gibberish, produced by gradients I derived and implemented myself, running in a binary that links to absolutely nothing outside the standard library.

The LibTorch GPU port also exists as a separate branch: same architecture, same hyperparameters, same training loop. The only differences are a model->to(torch::kCUDA) call and that the entire 600-line backward.h gets deleted, because autograd handles it. Roughly 75x faster on an RTX 3080.
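
For anyone curious what the autograd side looks like, the pattern is essentially the one below. The module is a toy stand-in, not the actual transformer in that branch:

```

// Toy illustration of the LibTorch pattern: move a module to CUDA, run a forward pass,
// and let autograd produce the gradients that the CPU version computes by hand.
#include <torch/torch.h>

int main() {
    torch::Device device(torch::cuda::is_available() ? torch::kCUDA : torch::kCPU);

    torch::nn::Linear model(200, 200);   // stand-in module, not the 4x4x200d transformer
    model->to(device);

    torch::optim::AdamW opt(model->parameters(), torch::optim::AdamWOptions(3e-4));

    auto x = torch::randn({32, 200}, device);
    auto target = torch::randn({32, 200}, device);

    auto loss = torch::mse_loss(model->forward(x), target);
    opt.zero_grad();
    loss.backward();   // autograd replaces the entire hand-written backward pass
    opt.step();
}

```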

submitted by /u/Suspicious_Gap1121