Writing an LLM compiler from scratch: PyTorch to CUDA in 5,000 lines of Python

Reddit r/LocalLLaMA / 4/30/2026

💬 Opinion · Developer Stack & Infrastructure · Models & Research

Key Points

  • The author proposes building a hackable LLM compiler from scratch in pure Python and raw CUDA to make the modern ML compiler stack easier to understand and modify.
  • The project compiles small LLMs (e.g., Qwen2.5-7B and TinyLlama) into a sequence of CUDA kernels, targeting performance around 50–90% of the production stack (vs PyTorch Eager/torch.compile).
  • The compiler is designed as a layered pipeline with clearly separated IR stages: Torch IR (FX graph), Tensor IR (decomposed ops), Loop IR (fused loop nests), Tile IR (GPU scheduling), Kernel IR (hardware schedule), and CUDA (nvcc-ready code generation).
  • It adds Tensor IR for future frontends (ONNX/JAX) and uses loop fusion plus lowering optimizations like tiled matmul, shared-memory staging, and double-buffering.
  • The repository (deplodock) supports independent inspection/debugging of each IR stage and provides commands for compiling, running benchmarks/profiling, and end-to-end compilation to CUDA kernels.

Hey r/LocalLLaMA,

I wanted to put together a simple overview of the modern ML compiler stack: essentially, what happens between model.generate() and the GPU executing a kernel. The problem is that the stack is brutal to read. TVM is 500K+ lines of C++. PyTorch piles Dynamo, Inductor, and Triton on top of each other. Then there's XLA, MLIR, Halide, and Mojo.

Instead, I took a different approach and built one from scratch: pure Python and raw CUDA. Take a small model (Qwen2.5-7B, TinyLlama) and compile it into a sequence of CUDA kernels. The goal isn't to beat Triton today; it's a compiler you can actually follow and hack on without a PhD in compilers.

The final performance is about 50-90% of the production stack (as compared to PyTorch Eager and torch.compile).

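The baselines are plain PyTorch Eager and torch.compile. A minimal sketch of that kind of timing (placeholder op and shape, not the repo's actual benchmark harness):

import time
import torch

mod = torch.nn.RMSNorm(2048).cuda()           # placeholder op; swap in whatever you want to measure
x = torch.randn(1, 32, 2048, device="cuda")

def bench(fn, iters=100):
    for _ in range(10):                       # warmup (also triggers compilation for torch.compile)
        fn(x)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters * 1e6   # microseconds per call

print(f"eager:         {bench(mod):8.1f} us")
print(f"torch.compile: {bench(torch.compile(mod)):8.1f} us")
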
I built it in a principled way, with a layered pipeline and concerns clearly separated:

  1. Torch IR — captured FX graph (rmsnorm, linear, softmax, ...); see the capture sketch below this list
  2. Tensor IR — every op decomposed into Elementwise / Reduction / IndexMap
  3. Loop IR — a kernel written as a loop nest fused with other kernels
  4. Tile IR — a kernel scheduled onto the GPU (threads, blocks, shared memory)
  5. Kernel IR — schedule materialized into hardware primitives
  6. CUDA — emitted source ready for nvcc

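To make stage 1 concrete: Torch IR is just a captured FX graph, and here's one way to produce one with stock torch.fx (symbolic_trace; the repo may capture graphs differently, and the toy block is only an illustration):

import torch
from torch import nn

# Toy block standing in for one transformer sub-layer.
class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.norm = nn.RMSNorm(2048)
        self.proj = nn.Linear(2048, 2048, bias=False)

    def forward(self, x):
        return torch.softmax(self.proj(self.norm(x)), dim=-1)

# symbolic_trace records each call as a graph node (placeholder, call_module, call_function, output).
gm = torch.fx.symbolic_trace(Block())
print(gm.graph)   # an FX Graph, i.e. the Torch IR stage above
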
Tensor IR is there to support future frontends such as ONNX and JAX. Loop fusion merges long pointwise and reduction chains into single loop nests. The lowering stages add optimizations such as tiled matmul, shared-memory (smem) staging, and double-buffering.

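To make the Elementwise / Reduction decomposition concrete, here's a rough sketch of how rmsnorm could break down at the Tensor IR level. The dataclasses are illustrative stand-ins, not the repo's actual IR classes, and IndexMap is left out:

from dataclasses import dataclass, field

# Hypothetical node types mirroring the Elementwise / Reduction split.
@dataclass
class Elementwise:
    fn: str                        # e.g. "mul", "add", "div", "rsqrt"
    inputs: list = field(default_factory=list)

@dataclass
class Reduction:
    fn: str                        # e.g. "sum"
    axis: int = -1
    inputs: list = field(default_factory=list)

# rms_norm(x, w) = x * rsqrt(mean(x * x, axis=-1) + eps) * w
sq      = Elementwise("mul",   ["x", "x"])
total   = Reduction("sum", -1, [sq])
mean    = Elementwise("div",   [total, "hidden_size"])
inv_rms = Elementwise("rsqrt", [Elementwise("add", [mean, "eps"])])
out     = Elementwise("mul",   [Elementwise("mul", ["x", inv_rms]), "w"])
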
Each stage can be inspected and debugged independently (repository link). No GPU needed:

deplodock compile -c "nn.RMSNorm(2048)(torch.randn(1,32,2048))" --ir tensor|loop|tile|kernel|cuda 

Benchmarking:

deplodock run --bench --profile -c "torch.nn.Softmax(dim=-1)(torch.randn(1,28,2048,2048))" 

End-to-end compilation:

deplodock compile Qwen/Qwen2.5-7B 

The generated CUDA kernel for RMSNorm looks like this:

extern "C" __global__ __launch_bounds__(256) void k_rms_norm_reduce(const float* x, const float* p_weight, float* rms_norm) { float in0 = 2048.0f; float in1 = 1e-06f; { int a1 = blockIdx.x; int a0 = threadIdx.x; float acc0 = 0.0f; __syncthreads(); __shared__ float x_smem[2048]; for (int x_smem_flat = a0; x_smem_flat < 2048; x_smem_flat += 256) { { unsigned int _smem_addr = __cvta_generic_to_shared(&x_smem[x_smem_flat]); asm volatile("cp.async.ca.shared.global [%0], [%1], 4;
" :: "r"(_smem_addr), "l"(&x[a1 * 2048 + x_smem_flat]) : "memory"); } } asm volatile("cp.async.commit_group;
" ::: "memory"); asm volatile("cp.async.wait_group 0;
" ::: "memory"); __syncthreads(); __shared__ float p_weight_smem[2048]; for (int p_weight_smem_flat = a0; p_weight_smem_flat < 2048; p_weight_smem_flat += 256) { { unsigned int _smem_addr = __cvta_generic_to_shared(&p_weight_smem[p_weight_smem_flat]); asm volatile("cp.async.ca.shared.global [%0], [%1], 4;
" :: "r"(_smem_addr), "l"(&p_weight[p_weight_smem_flat]) : "memory"); } } asm volatile("cp.async.commit_group;
" ::: "memory"); asm volatile("cp.async.wait_group 0;
" ::: "memory"); __syncthreads(); for (int a2 = a0; a2 < 2048; a2 += 256) { float in2 = x_smem[a2]; float v0 = in2 * in2; acc0 += v0; } __shared__ float acc0_smem[256]; acc0_smem[a0] = acc0; __syncthreads(); for (int s = 128; s > 0; s >>= 1) { if (a0 < s) { acc0_smem[a0] = acc0_smem[a0] + acc0_smem[a0 + s]; } __syncthreads(); } __syncthreads(); float acc0_b = acc0_smem[0]; float v1 = acc0_b / in0; float v2 = v1 + in1; float v3 = rsqrtf(v2); for (int a3 = a0; a3 < 2048; a3 += 256) { float in3 = x_smem[a3]; float in4 = p_weight_smem[a3]; float v4 = in3 * v3; float v5 = v4 * in4; rms_norm[a1 * 2048 + a3] = v5; } } } 
submitted by /u/NoVibeCoding