For the past few months I've been working on Quadtrix.cpp, a complete GPT-style language model implemented in C++17. No PyTorch. No LibTorch. No BLAS. No auto-differentiation library of any kind. The only dependencies are the C++17 standard library and POSIX sockets.
Repo: https://github.com/Eamon2009/Quadtrix.cpp
Everything is hand-written: the tensor library, all forward pass operations, and the full analytical backward pass with explicit gradient derivations for every operator.
Training run v1.0
- Architecture: 4 layers x 4 heads x 200d decoder-only transformer
- Parameters: 826,985 (0.83 M)
- Context window: 128 characters
- Corpus: 31.4 M characters of children's stories
- Best val loss: 1.6371 nats
- Training time: 76.2 minutes on a single CPU core
- External dependencies: zero
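For reference, those numbers map onto a model configuration along these lines (a sketch; the struct and field names are placeholders, not the repo's actual identifiers):
```
// Illustrative hyperparameters for the v1.0 run. Field names are assumptions;
// vocab_size isn't a published number and would come from the tokeniser.
struct GPTConfig {
    int n_layers   = 4;    // decoder blocks
    int n_heads    = 4;    // attention heads per block
    int d_model    = 200;  // embedding / hidden width
    int block_size = 128;  // context window, in characters
    int vocab_size = 0;    // set at runtime from the character-level tokeniser
};
```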
What is actually implemented
- Lightweight CPU float tensor library (2D/3D, row-major storage)
- Token and position embeddings, LayerNorm, Linear, Dropout
- Multi-head self-attention with a causal mask
- Feed-forward blocks: Linear -> ReLU -> Linear
- Complete backward pass: cross-entropy, softmax, layer normalisation (Ba et al. 3-term formula), scaled dot-product attention, Q/K/V gradients, ReLU, dropout, embedding scatter-add
- AdamW optimiser with bias correction (see the sketch after this list)
- Character-level tokeniser and batch sampler
- OpenMP parallelisation across all CPU cores: matmul, bmm, softmax, and layernorm are all parallelised (matmul sketched below), for roughly a 5-7x speedup on an 8-core machine
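Since there's no BLAS behind it, the matmul in that last item is a hand-rolled loop nest. A minimal sketch of what an OpenMP-parallelised, row-major matmul can look like (illustrative only, not the repo's actual tensor code):
```
#include <cstddef>

// C = A * B with A (M x K), B (K x N), C (M x N), all row-major.
// The outer loop is split across threads; the k-then-j ordering keeps the
// inner loop streaming through contiguous memory. Compile with -fopenmp.
void matmul(const float* A, const float* B, float* C,
            std::size_t M, std::size_t K, std::size_t N) {
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(M); ++i) {
        float* Ci = C + i * N;
        for (std::size_t j = 0; j < N; ++j) Ci[j] = 0.0f;
        for (std::size_t k = 0; k < K; ++k) {
            const float a   = A[i * K + k];
            const float* Bk = B + k * N;
            for (std::size_t j = 0; j < N; ++j)
                Ci[j] += a * Bk[j];   // rank-1 update of row i
        }
    }
}
```
Without -fopenmp the pragma is simply ignored and the same code runs single-threaded.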
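The AdamW item in the list above, with bias correction and decoupled weight decay, comes down to roughly this per-parameter update (again a sketch; the default hyperparameter values here are assumptions, not the run's actual settings):
```
#include <cmath>
#include <cstddef>

// One AdamW step over a flat parameter buffer p with gradient g.
// m and v are the first/second moment estimates persisted across steps;
// `step` is 1-based so the bias correction is well defined.
void adamw_step(float* p, const float* g, float* m, float* v, std::size_t n, long step,
                float lr = 3e-4f, float beta1 = 0.9f, float beta2 = 0.999f,
                float eps = 1e-8f, float weight_decay = 0.01f) {
    const float bc1 = 1.0f - std::pow(beta1, static_cast<float>(step));
    const float bc2 = 1.0f - std::pow(beta2, static_cast<float>(step));
    for (std::size_t i = 0; i < n; ++i) {
        m[i] = beta1 * m[i] + (1.0f - beta1) * g[i];
        v[i] = beta2 * v[i] + (1.0f - beta2) * g[i] * g[i];
        const float m_hat = m[i] / bc1;   // bias-corrected first moment
        const float v_hat = v[i] / bc2;   // bias-corrected second moment
        p[i] -= lr * (m_hat / (std::sqrt(v_hat) + eps) + weight_decay * p[i]);
    }
}
```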
The gradient derivations alone took about a week.
The layernorm backward is the part that trips everyone up. You need to save mu, inverse-std, and x-hat per row during the forward pass and apply the full 3-term formula in the backward. The attention backward requires careful tracking of which dropout mask was applied to the attention weights versus the projection output.
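For anyone who wants the shape of that 3-term formula, here is a per-row sketch (illustrative code, not the repo's; it assumes x_hat and inv_std were cached in the forward pass):
```
#include <cstddef>

// Backward for one LayerNorm row of width D, given upstream gradient dy.
// Forward cached x_hat = (x - mu) * inv_std, with inv_std = 1 / sqrt(var + eps).
void layernorm_backward_row(const float* dy, const float* x_hat, float inv_std,
                            const float* gamma, float* dgamma, float* dbeta,
                            float* dx, std::size_t D) {
    float sum_dxhat = 0.0f;         // sum_j dL/dx_hat_j
    float sum_dxhat_xhat = 0.0f;    // sum_j dL/dx_hat_j * x_hat_j
    for (std::size_t i = 0; i < D; ++i) {
        const float dxhat = dy[i] * gamma[i];
        sum_dxhat      += dxhat;
        sum_dxhat_xhat += dxhat * x_hat[i];
        dgamma[i] += dy[i] * x_hat[i];  // parameter grads accumulate over rows
        dbeta[i]  += dy[i];
    }
    // dx_i = inv_std/D * (D * dxhat_i - sum_dxhat - x_hat_i * sum_dxhat_xhat)
    const float invD = 1.0f / static_cast<float>(D);
    for (std::size_t i = 0; i < D; ++i) {
        const float dxhat = dy[i] * gamma[i];
        dx[i] = inv_std * invD * (static_cast<float>(D) * dxhat
                                  - sum_dxhat
                                  - x_hat[i] * sum_dxhat_xhat);
    }
}
```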
Sample output after training
```
You > Once upon a time
Quadtrix > , and said askiced and so owas said sri. The his brickerys and stew hhat and saw and stark a din't. She stingry and asked day. Timmy watch and played to cones.
You > Timmy is a
Quadtrix > bog the scated justo prove the bret you. Timmy nevery some the gecid. Her neplay to bet starked a way, that litked cliend.
You > what is life
Quadtrix > st happe. It happ a liked back abp happy thing flongs way. Lily lood take maked a fiside apie? Tom and abed Timm.
```
Yes, it is gibberish. It is a 0.83M-parameter model trained for 76 minutes on a CPU. But it is my gibberish, produced by gradients I derived and implemented myself, running in a binary that links to absolutely nothing outside the standard library.
The LibTorch GPU port lives in a separate branch. Same architecture, same hyperparameters, same training loop. The only differences are that the model is moved with model->to(torch::kCUDA) and the entire 600-line backward.h is deleted, because autograd handles it. It runs roughly 75x faster on an RTX 3080.
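For context, the device handling in a LibTorch setup is roughly this much code (a stand-in sketch with a trivial module and an arbitrary learning rate; the actual branch uses the full transformer):
```
#include <torch/torch.h>

int main() {
    // Pick CUDA when available, otherwise fall back to CPU.
    torch::Device device(torch::cuda::is_available() ? torch::kCUDA : torch::kCPU);

    // Stand-in module; in the branch this is the full GPT model.
    torch::nn::Linear model(200, 200);
    model->to(device);

    torch::optim::AdamW opt(model->parameters(), torch::optim::AdamWOptions(3e-4));

    auto x = torch::randn({8, 200}, device);   // batch created on the target device
    auto loss = model->forward(x).pow(2).mean();
    loss.backward();                           // autograd replaces the hand-written backward.h
    opt.step();
    opt.zero_grad();
}
```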