Educational PyTorch repo for distributed training from scratch: DP, FSDP, TP, FSDP+TP, and PP [P]

Reddit r/MachineLearning / 4/12/2026

💬 Opinion · Developer Stack & Infrastructure · Ideas & Deep Analysis · Tools & Practical Usage

Key Points

  • The author shares an educational PyTorch GitHub repo that implements distributed training parallelism from scratch, explicitly coding forward/backward logic and collective communications.
  • The repository demonstrates multiple strategies, including Data Parallel (DP), Fully Sharded Data Parallel (FSDP), Tensor Parallel (TP), and combinations like FSDP+TP and Pipeline Parallel (PP).
  • The model and task are intentionally simple (repeated 2-matmul MLP blocks on a synthetic dataset) so that the focus stays on communication patterns and algorithmic behavior.
  • The repo is designed to help learners map the math of distributed training to runnable code without relying on large framework abstractions, while drawing inspiration from the JAX ML Scaling book training section.
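To make the tensor-parallel idea above concrete: a 2-matmul MLP block can be sharded Megatron-style, with the first weight split column-wise and the second row-wise, so each rank computes a partial output that is summed by an all-reduce. This is not code from the repo, just a single-process sketch where summing shard partials stands in for the collective:

```python
import torch

def tp_mlp(x, w1, w2, shards=2):
    # Megatron-style sharding of a 2-matmul MLP block, simulated in one
    # process: W1 is split column-wise, W2 row-wise. ReLU is elementwise,
    # so each "rank" can apply it to its own column block independently.
    w1_parts = w1.chunk(shards, dim=1)   # each rank holds one column block of W1
    w2_parts = w2.chunk(shards, dim=0)   # and the matching row block of W2
    partials = [torch.relu(x @ a) @ b for a, b in zip(w1_parts, w2_parts)]
    return sum(partials)                 # stands in for all_reduce(SUM) of rank outputs
```

Because `relu(x @ W1) @ W2` decomposes exactly into the sum of per-shard partials, the sharded result matches the unsharded block up to floating-point error.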

I put together a small educational repo that implements distributed training parallelism from scratch in PyTorch:

https://github.com/shreyansh26/pytorch-distributed-training-from-scratch

Instead of using high-level abstractions, the code writes the forward/backward logic and collectives explicitly so you can see the algorithm directly.

The model is intentionally just repeated 2-matmul MLP blocks on a synthetic task, so the communication patterns are the main thing being studied.
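Pipeline parallelism is the one strategy where the "communication pattern" is really a schedule: the batch is split into microbatches that flow through the stages, leaving idle bubbles at the start and end. A GPipe-style forward schedule (again my own sketch, not the repo's implementation) can be written as a tiny pure function:

```python
def gpipe_forward_schedule(n_stages, n_micro):
    # schedule[t][s] is the microbatch index stage s processes at tick t
    # (None = pipeline bubble). Microbatch m enters stage s at tick s + m,
    # so the forward pass takes n_stages + n_micro - 1 ticks in total.
    ticks = n_stages + n_micro - 1
    return [[t - s if 0 <= t - s < n_micro else None
             for s in range(n_stages)]
            for t in range(ticks)]
```

For 2 stages and 3 microbatches this yields `[[0, None], [1, 0], [2, 1], [None, 2]]`: the first and last ticks are the bubbles that 1F1B-style schedules try to shrink.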

Built this mainly for people who want to map the math of distributed training to runnable code without digging through a large framework.

Based on Part 5 (Training) of the JAX ML Scaling book.

submitted by /u/shreyansh26