torch-nvenc-compress: GPU NVENC silicon as a PCIe bandwidth multiplier — PCA + pure-ctypes Video Codec SDK wrapper. Parallel-path overlap measured at 67% of theoretical max on a real GEMM + encode workload. [P]

Reddit r/MachineLearning / 5/4/2026


Key Points

  • The author presents torch-nvenc-compress, a Python library that repurposes the GPU’s NVENC/NVDEC hardware to compress LLM activations and KV cache to reduce PCIe peer-to-peer bottlenecks across consumer multi-GPU setups.
  • The approach uses PCA with rank truncation as a preprocessing step, claiming that while activations/KV look noise-like in the standard basis, a PCA basis exposes a heavy-tailed covariance structure that video-codec-style compression can exploit.
  • It reframes the system as a parallel-path “dual-lane” design where CUDA-stream pipelining lets NVENC codec work be overlapped with computation and transfers of other tensors, turning compression into an effective bandwidth multiplier.
  • Early measurements on a real GEMM + encode workload report parallel-path overlap of about 67% of the theoretical maximum, indicating substantial hiding of codec and transfer costs.
  • The project provides a “pure-ctypes” wrapper around the Video Codec SDK (DirectBackend) to reduce overhead (e.g., avoiding FFmpeg subprocess costs) and emphasizes efficient, zero-copy handling from CUDA tensors.

I've been working on the consumer multi-GPU PCIe bottleneck: NVIDIA removed NVLink from the 4090/5090, and splitting a 70B model across two consumer cards drops you to ~30 GB/s over PCIe peer-to-peer.

Spent the last few months building a Python library that uses the GPU's otherwise-idle NVENC/NVDEC silicon to compress activations and KV cache on the fly, then ships the small bitstream across the same wire.

Repo: https://github.com/shootthesound/torch-nvenc-compress (Apache 2.0)

Prior art (this isn't novel as an idea)

  • LLM.265 — "Video Codecs are Secretly Tensor Codecs" (late 2025). The closest direct precedent: same insight applied to LLM weights, activations, KV cache.
  • KVFetcher (April 2026). KV compression for remote prefix fetching.
  • CodecFlow (April 2026). Codec motion-vector metadata for KV refresh during prefill.

The "video codec on tensors" idea was already in the literature when I started. What's added in this work:

  1. PCA + rank-truncation as preprocessing. Activations and KV in their standard basis are noise-like (~4× compression floor, basically the Gaussian-noise limit). The PCA basis reveals a heavy-tailed channel covariance that the codec can actually exploit. The basis is per-layer, computed offline, and ships with the model LoRA-style (~32 MB for FLUX.2 Klein 9B's 8 double-blocks at K=500). A minimal sketch follows this list.
  2. Parallel-path / dual-lane architectural reframe. NVENC and NVDEC are physically separate hardware units from the SM cluster and the PCIe controller. With CUDA-stream pipelining, the codec time hides behind the compute and transfers of other tensors, so the compression ratio becomes an effective-bandwidth multiplier rather than just a smaller payload.
  3. Pure-ctypes Direct Video Codec SDK wrapper (DirectBackend) that kills the FFmpeg subprocess overhead. Zero-copy from torch CUDA tensors, an 8-deep async output ring per NVENC engine, optional CUDA stream binding via nvEncSetIOCudaStreams, and MultiEngineDirectBackend across all 3 NVENC engines on the 5090. A second sketch after this list shows the ctypes flavor.
  4. Three documented null findings — sparse residual, AV1 NVENC on Blackwell, channel reordering. So nobody else has to rerun the dead ends.
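
To make point 1 concrete, here's a minimal sketch of the offline basis build plus projection. All names and signatures are illustrative, not the repo's actual API:

```python
import torch

# Illustrative offline PCA basis build + rank truncation (my sketch, not
# the repo's API). X: captured activations for one layer, flattened to
# (num_samples, channels).
def build_pca_basis(X: torch.Tensor, k: int = 500):
    mean = X.mean(dim=0)
    # Top-k principal directions of the channel covariance; V: (channels, k).
    _, _, V = torch.pca_lowrank(X, q=k, center=True)
    return mean, V  # per-layer, computed offline, ships LoRA-style

def project(x: torch.Tensor, mean, V):
    # Rotate into the PCA basis and truncate to k components; in this basis
    # the spectrum is heavy-tailed, which the codec can exploit.
    return (x - mean) @ V

def unproject(y: torch.Tensor, mean, V):
    # Inverse is exact up to the energy discarded beyond rank k.
    return y @ V.T + mean
```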
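And for point 3, a minimal ctypes probe of the NVENC driver library, just to show the no-subprocess flavor. NvEncodeAPIGetMaxSupportedVersion is a real SDK export; the repo's DirectBackend wraps the full encode-session API, which this sketch doesn't attempt:

```python
import ctypes

# Probe the NVENC driver library directly; no FFmpeg subprocess involved.
lib = ctypes.CDLL("libnvidia-encode.so.1")  # "nvEncodeAPI64.dll" on Windows

# The driver reports its max supported NVENC API version packed as
# (major << 4) | minor.
version = ctypes.c_uint32(0)
lib.NvEncodeAPIGetMaxSupportedVersion.argtypes = [ctypes.POINTER(ctypes.c_uint32)]
status = lib.NvEncodeAPIGetMaxSupportedVersion(ctypes.byref(version))
assert status == 0, "NV_ENC_SUCCESS expected"
print(f"NVENC API {version.value >> 4}.{version.value & 0xF}")
```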

Measured results (RTX 5090, real workloads)

  • Compression ratios: 6.1× lossless on diffusion (FLUX.2 Klein 9B mid-block), 2.7× lossless on LLM KV cache (Mistral 7B v0.3). LOO-validated across 1,735 diffusion captures and 6 LLM prompts. (FLUX.2 Klein 9B was the internal research target; the public PoC repo uses FLUX.1-schnell since it's Apache 2.0 and freely downloadable. Numbers reproduce qualitatively on schnell — heavy-tailed PCA spectrum, similar Pareto.)
  • Codec speed: DirectBackend 0.243 ms/frame encode, 0.435 ms/frame decode at 256×256 YUV444 QP=18 on real PCA-rotated FLUX activations. MultiEngineDirectBackend across the 5090's 3 NVENC engines: 0.180 ms/frame encode, 0.262 ms/frame decode. ~7.9× over an FFmpeg subprocess baseline.
  • Parallel-path overlap empirically measured: 30×4096² fp16 GEMM on CUDA stream A + 64-frame DirectBackend encode on stream B (encoder bound to stream B via nvEncSetIOCudaStreams). Serialized wall-clock 40.1 ms; parallel wall-clock 26.0 ms; theoretical max overlap floor 20.9 ms. 1.34× speedup over serialized = 67% of theoretical max overlap realized. This is the load-bearing measurement for the architectural claim that NVENC silicon runs concurrently with SM compute. (A harness sketch follows this list.)
  • Slow-wire wins, end-to-end: measured 3.13× wall-clock speedup at 100 Mbps residential broadband, 5.29× at 50 Mbps (real codec round-trip + simulated wire). 1.69× dual-lane on simulated 1 Gbit ethernet.
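
If you want to sanity-check the overlap methodology without the codec, here's a rough reconstruction of the serialized-vs-parallel timing harness. Lane B uses a stand-in GEMM because the real encode needs DirectBackend, and two GEMMs contend for SMs, so expect weaker overlap than compute-vs-NVENC; the shape of the harness is the point, not the numbers:

```python
import time
import torch

dev = torch.device("cuda")
a = torch.randn(4096, 4096, dtype=torch.float16, device=dev)
b = torch.randn(4096, 4096, dtype=torch.float16, device=dev)
lane_a, lane_b = torch.cuda.Stream(), torch.cuda.Stream()

def lane_a_work():  # stands in for the 30x 4096^2 fp16 GEMM lane
    for _ in range(30):
        torch.mm(a, b)

def lane_b_work():  # stands in for the 64-frame encode lane
    for _ in range(8):
        torch.mm(a, b)

def timed_ms(fn):
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) * 1e3

serialized = timed_ms(lambda: (lane_a_work(), lane_b_work()))

def parallel():
    with torch.cuda.stream(lane_a):
        lane_a_work()
    with torch.cuda.stream(lane_b):
        lane_b_work()

overlapped = timed_ms(parallel)
print(f"serialized {serialized:.1f} ms, parallel {overlapped:.1f} ms, "
      f"speedup {serialized / overlapped:.2f}x")
```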

What is not measured end-to-end (projections from the above)

Multi-GPU PCIe peer-to-peer activation transfer recovering ~180 GB/s effective bandwidth: the codec primitive is ready and benchmarked, but the cross-GPU peer-to-peer wiring is pending. (This is where I need community help; my validation rig has only one desktop GPU, and you need two on the same motherboard to test this.)
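
For what it's worth, the ~180 GB/s figure follows directly from the measured inputs above, assuming the codec time fully hides behind compute and transfer per the dual-lane claim:

```python
# Back-of-envelope behind the ~180 GB/s projection (measured inputs from the
# post; assumes codec time fully hides behind compute/transfer).
pcie_p2p_gb_s = 30.0   # consumer-card PCIe peer-to-peer
lossless_ratio = 6.1   # FLUX.2 Klein 9B mid-block, lossless
print(f"{pcie_p2p_gb_s * lossless_ratio:.0f} GB/s effective")  # -> 183 GB/s
```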

Real two-machine ethernet split-model inference: the wire-simulation PoC measures real codec time plus a simulated wire, but it isn't a true two-machine deployment yet. (I have a 4090 laptop incoming next week to physically validate this networked leg.)

Long-context KV-spill end-to-end tok/s on a real model decode loop: the compression ratio is measured, but the actual N tok/s → 3N tok/s benchmark on, e.g., a 32B model with 64K context isn't in the repo yet. The math implies it; the benchmark hasn't been written.
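
The implied math, spelled out (my paraphrase of the projection, not a measurement): if each decode step is bound by shipping spilled KV over the slow link, tok/s scales with effective link bandwidth, i.e. with the compression ratio:

```python
# Projection only, not a measurement: for a KV-spill-bound decode loop,
# tok/s scales with the effective link bandwidth.
kv_ratio = 2.7           # measured lossless KV ratio (Mistral 7B v0.3)
baseline_tok_s = 10.0    # hypothetical spill-bound baseline, "N"
print(baseline_tok_s * kv_ratio)  # -> 27.0, i.e. ~2.7N (roughly the 3N claim)
```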

Where I'd value help

  • Anyone with a dual-4090 / dual-5090 / two-machine-with-PCIe-P2P rig who'd want to run the cross-GPU peer-to-peer benchmark when I write it. Would shrink the "75%" gap meaningfully.
  • Anyone running long-context KV-spill workloads who'd want to wire DirectBackend into their decode loop for the end-to-end tok/s measurement. I'd write the integration with you.
  • Cross-vendor coverage — AMD VCN and Intel QSV/Arc paths are completely open. Same architectural claim, different SDK surface.

What's in the repo

19 numbered runnable PoCs, every measured number reproducible. Honest status table at the top of the README. PCA basis builder + per-channel quantize + YUV pack/unpack + codec wrappers all separable so you can swap pieces.
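
As an example of how the pieces separate, a per-channel quantize stage might look roughly like this (function names and signatures are mine, not the repo's):

```python
import torch

# Per-channel affine quantization to uint8, the kind of step that sits
# between the PCA projection and YUV packing (illustrative, not the repo API).
def quantize_per_channel(x: torch.Tensor):
    # x: (channels, H, W) float tensor of PCA-rotated activations
    lo = x.amin(dim=(1, 2), keepdim=True)
    hi = x.amax(dim=(1, 2), keepdim=True)
    scale = (hi - lo).clamp_min(1e-8) / 255.0
    q = ((x - lo) / scale).round().clamp(0, 255).to(torch.uint8)
    return q, lo, scale  # lo/scale travel with the bitstream for dequant

def dequantize_per_channel(q, lo, scale):
    return q.to(torch.float32) * scale + lo
```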

Built solo around full-time caregiving — technical feedback, criticism, or pointers to related work I missed are genuinely appreciated.

submitted by /u/shootthesound