ZINC — LLM inference engine written in Zig, running 35B models on $550 AMD GPUs

Reddit r/LocalLLaMA / 3/30/2026


Key Points

  • The article introduces “ZINC,” an LLM inference engine built in Zig that targets AMD consumer GPUs using Vulkan for performance-critical GPU calls and memory management.
  • The developer argues Zig is a good fit due to direct Vulkan C ABI access via `@cImport`, `comptime`-driven per-quantization dispatch, and build-time automation that compiles GLSL shaders into SPIR-V.
  • ZINC loads GGUF models by memory-mapping ~21GB of weights into GPU VRAM and runs a 40-layer hybrid transformer with MoE expert routing and SSM layers, producing correct output on Qwen3.5-35B.
  • Current throughput is about 7.1 tokens/second on RDNA4, limited by per-layer GPU synchronization overhead (about 120 round-trips per token), with planned optimization to record the full decode graph into a single command buffer.
  • The codebase is relatively compact (~5k lines Zig and ~2k lines GLSL) and is shared via a public GitHub repository for review and use.

Hey reddit fam! If you have an AMD GPU and have ever tried to run a local LLM on it, you know the pain. ROCm doesn't officially support most consumer cards. vLLM won't work. llama.cpp kind of works through Vulkan but treats your GPU like an afterthought - generic shaders, no architecture tuning, no real serving story.

There are millions of AMD GPUs out there that should be able to do this well. The hardware has the bandwidth and the compute. The software just isn't there. So I'm building it in Zig.

Why Zig, why now:

Zig turned out to be a surprisingly good fit for this. The hot path is Vulkan API calls, GPU memory management, and command buffer recording — pure systems work. `@cImport` gives direct Vulkan C ABI access without binding generators. `comptime` handles per-quantization dispatch tables. `errdefer` keeps GPU resource cleanup sane. And `zig build` wires in the GLSL shader compilation as build steps, so `glslc` runs automatically and SPIR-V lands in `zig-out/`. One command, one binary.
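To make those three Zig features concrete, here's an illustrative sketch (not code from the ZINC repo — the enum tags, function names, and error names are all hypothetical) of how `@cImport`, comptime dispatch, and `errdefer` tend to combine in this kind of engine:

```zig
const std = @import("std");

// @cImport: direct Vulkan C ABI access, no binding generator.
const c = @cImport({
    @cInclude("vulkan/vulkan.h");
});

// Hypothetical quantization tags; ZINC's actual set may differ.
const Quant = enum { q4_k, q8_0, f16 };

// comptime string concatenation: each quantization resolves to its
// own SPIR-V kernel name at compile time, so per-quant dispatch
// costs nothing at runtime.
fn shaderFor(comptime q: Quant) [:0]const u8 {
    return "dequant_" ++ @tagName(q) ++ ".spv";
}

// errdefer keeps GPU cleanup paired with creation: if any later
// step in this function fails, the buffer is destroyed on the
// error path automatically.
fn createWeightBuffer(dev: c.VkDevice, info: *const c.VkBufferCreateInfo) !c.VkBuffer {
    var buf: c.VkBuffer = undefined;
    if (c.vkCreateBuffer(dev, info, null, &buf) != c.VK_SUCCESS)
        return error.VkCreateBufferFailed;
    errdefer c.vkDestroyBuffer(dev, buf, null);
    // ... allocate and bind device memory; an error here would
    // trigger the errdefer above ...
    return buf;
}
```

The nice property of `errdefer` over plain `defer` is that cleanup runs only on the failure path, which is exactly what you want for resources you intend to hand back to the caller on success.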

The timing is right because open models are finally good enough to serve locally, AMD's RDNA4 hardware is genuinely capable (640 GB/s bandwidth, cooperative matrix cores), and nobody else is building specifically for these GPUs from the Zig/Vulkan angle. u/tinygrad proved AMD consumer GPUs can compete on training. This is the inference side of that same bet.

Where it's at:

The engine loads GGUF models (parsed natively in Zig), memory-maps ~21 GB of weights and uploads them to GPU VRAM, and runs a full 40-layer hybrid transformer with MoE expert routing and SSM layers. It produces recognizably correct output on Qwen3.5-35B. Currently 7.1 tok/s on RDNA4, limited by per-layer GPU sync overhead (~120 CPU-GPU round-trips per token), not by the GPU itself. The fix is recording the full decode graph as a single command buffer, which is the main problem I'm working on.
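The sync-overhead fix can be sketched like this (again illustrative, not repo code — `Ctx`, `beginCommandBuffer`, `recordLayer`, and `submitAndWait` are hypothetical helpers standing in for the usual Vulkan calls):

```zig
// Today (slow): one submit + fence wait per layer operation,
// so every token pays ~120 CPU-GPU round-trips.
fn decodeTokenNaive(ctx: *Ctx) !void {
    for (ctx.layers) |*layer| {
        const cmd = try beginCommandBuffer(ctx);
        recordLayer(cmd, layer);
        try submitAndWait(ctx, cmd); // fence wait = one round-trip
    }
}

// Planned: record the whole 40-layer decode graph once into a
// single command buffer, then replay it every token with one
// submit and one fence wait.
fn decodeTokenRecorded(ctx: *Ctx) !void {
    if (!ctx.graph_recorded) {
        const cmd = try beginCommandBuffer(ctx);
        for (ctx.layers) |*layer| recordLayer(cmd, layer);
        try endCommandBuffer(cmd);
        ctx.decode_cmd = cmd;
        ctx.graph_recorded = true;
    }
    try submitAndWait(ctx, ctx.decode_cmd); // one round-trip per token
}
```

The catch, and presumably the hard part, is that a pre-recorded buffer can't branch on CPU-side results mid-token, so anything data-dependent (like MoE expert selection) has to move onto the GPU or use indirect dispatch.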

The codebase is about 5k lines of Zig and 2k lines of GLSL. Small enough to actually read through.

Repo: https://github.com/zolotukhin/zinc

submitted by /u/Mammoth_Radish2