Built GPT-2, Llama 3, and DeepSeek from scratch in PyTorch - open source code + book [p]

Reddit r/MachineLearning / 4/15/2026

💬 Opinion · Ideas & Deep Analysis · Tools & Practical Usage · Models & Research

Key Points

  • The post shares an open-source PyTorch implementation and accompanying book, written over the course of a year, that demonstrate how to build five LLM architectures from scratch.
  • It covers a vanilla transformer for translation plus a detailed GPT-2 implementation with support for loading OpenAI pretrained weights.
  • It extends the GPT-2 baseline to implement Llama 3.2-3B by swapping key components (RMSNorm, RoPE, SwiGLU, and GQA) and loading Meta pretrained weights.
  • The material explains performance-critical inference mechanisms such as KV cache, as well as related attention variants like MQA and GQA.
  • It further documents DeepSeek features including Multi-Head Latent Attention with an “absorption” trick, DeepSeekMoE with shared experts and segmentation, Multi-Token Prediction, and FP8 quantization, with all code released on GitHub.
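The MQA/GQA relationship mentioned above is compact enough to sketch. This is not the book's code, just a minimal illustration, assuming standard tensor shapes: grouped-query attention lets many query heads share a smaller set of key/value heads, with multi-query attention (one KV head) and full multi-head attention (equal counts) as the two extremes.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """GQA sketch: n_q_heads query heads share n_kv_heads key/value heads.
    q: (batch, n_q_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim).
    MQA is the special case n_kv_heads == 1; standard MHA is n_kv_heads == n_q_heads."""
    n_q_heads, n_kv_heads = q.shape[1], k.shape[1]
    group = n_q_heads // n_kv_heads
    # Expand each KV head so every query head in its group attends to the same K/V.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```

The KV-cache payoff is that only the (smaller) K/V tensors need to be stored per token, so shrinking `n_kv_heads` shrinks cache memory proportionally.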

I spent the past year implementing five LLM architectures from scratch in PyTorch and wrote a book documenting the process.

What's covered:

  • Vanilla encoder-decoder transformer (English to Hindi translation)
  • GPT-2 (124M), loading real OpenAI pretrained weights
  • Llama 3.2-3B, showing the exact 4 component swaps from GPT-2 (RMSNorm, RoPE, SwiGLU, GQA), loading Meta's pretrained weights
  • KV cache mechanics, MQA, GQA
  • DeepSeek: Multi-Head Latent Attention with absorption trick and decoupled RoPE, DeepSeekMoE with shared experts and fine-grained segmentation, Multi-Token Prediction, FP8 quantisation
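Two of the four GPT-2 → Llama component swaps listed above fit in a few lines each. The following is a minimal sketch under textbook definitions, not the repository's actual code: RMSNorm normalizes by the root-mean-square alone (no mean-centering, no bias), and SwiGLU replaces GPT-2's GELU MLP with a SiLU-gated feed-forward network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """RMSNorm: scale by 1/RMS(x) with a learned gain; no mean subtraction, no bias."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLU(nn.Module):
    """SwiGLU feed-forward: SiLU(gate(x)) * up(x), projected back down."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```

The other two swaps (RoPE and GQA) modify the attention block rather than the norm/MLP sublayers, which is why the post can describe Llama 3.2 as "GPT-2 plus exactly four component swaps."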

All code is open source: https://github.com/S1LV3RJ1NX/mal-code

The book (explanations, derivations, diagrams) is on Leanpub with a free sample: https://leanpub.com/adventures-with-llms

I'm a Senior Forward Deployed Engineer at TrueFoundry, where I work with enterprises on LLM systems. I wrote this because I wanted a resource that went past GPT-2 and into the architectures actually running in production. Happy to discuss any of the implementations.

submitted by /u/s1lv3rj1nx