I've been working on the consumer-multi-GPU PCIe bottleneck — Nvidia removed NVLink from the 4090/5090, and splitting a 70B model across two consumer cards drops you to ~30 GB/s over PCIe peer-to-peer. Spent the last few months building a Python library that uses the GPU's otherwise-idle NVENC/NVDEC silicon to compress activations and KV cache on the fly, then ships the small bitstream across the same wire.

Repo: https://github.com/shootthesound/torch-nvenc-compress (Apache 2.0)
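A quick sanity check on the headline numbers (the ~180 GB/s projection appears under "not measured" below): the 6:1 codec ratio in this snippet is an inference from dividing the two figures in the post, not a stated number, and the multiplier only holds when encode/decode time is hidden behind other work.

```python
# Back-of-envelope: what compression does to effective PCIe bandwidth.
# The 30 GB/s and ~180 GB/s figures come from the post; the 6:1 ratio
# is the quotient of the two (an inference, not a stated number), and
# it only holds if encode/decode time is hidden behind other work.

pcie_p2p = 30.0          # GB/s, measured PCIe peer-to-peer (from the post)
codec_ratio = 6.0        # implied: compressed bitstream is ~1/6 the bytes

effective = pcie_p2p * codec_ratio
print(f"effective activation bandwidth ≈ {effective:.0f} GB/s")  # ≈ 180
```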
The "video codec on tensors" idea was already in the literature when I started. What's added in this work:
Measured results (RTX 5090, real workloads)
Headline number: parallel-path overlap at 67% of the theoretical max on a real GEMM + encode workload.
What is not measured end-to-end (projections from the above)
- Multi-GPU PCIe peer-to-peer activation transfer recovering ~180 GB/s effective bandwidth — the codec primitive is ready and benchmarked, but the cross-GPU PCIe peer-to-peer wiring is pending. (This is where I need community help, as my validation rig only has one desktop GPU and you need two on the same motherboard to test this.)
- Real two-machine Ethernet split-model inference — a wire-simulation PoC measures real codec time + a simulated wire, but it isn't a true two-machine deployment yet. (I have a 4090 laptop incoming next week to physically validate this networked leg.)
- Long-context KV-spill end-to-end tok/s on a real model decode loop — the compression ratio is measured, but the actual N tok/s → 3N tok/s benchmark on e.g. 32B + 64K context isn't in the repo yet. The math implies it (sketched after this list); the benchmark hasn't been written.
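For the third item, a minimal model of the "the math implies it" step, assuming the decode loop is bound by streaming spilled KV over PCIe. Every number here is an illustrative assumption (the post states only the N → 3N projection); the byte count per step is a stand-in for a 32B-class model at 64K context, not a measurement.

```python
# Hypothetical KV-spill throughput model. All inputs are assumptions:
# the post only claims "N tok/s -> 3N tok/s" as a projection.

kv_bytes_per_step = 16e9   # illustrative: spilled KV read per decode step
pcie_bytes_per_s = 30e9    # PCIe peer-to-peer figure from the post
ratio = 3.0                # KV compression ratio implied by the 3N claim

t_plain = kv_bytes_per_step / pcie_bytes_per_s            # ~0.53 s/token
t_comp = kv_bytes_per_step / (pcie_bytes_per_s * ratio)   # ~0.18 s/token

# If transfer dominates each step, tokens/s improves by exactly `ratio`:
print(f"speedup ≈ {t_plain / t_comp:.1f}x")  # 3.0x, i.e. N -> 3N tok/s
```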
Where I'd value help

What's in the repo
- 19 numbered runnable PoCs; every measured number is reproducible.
- Honest status table at the top of the README.
- PCA basis builder + per-channel quantize + YUV pack/unpack + codec wrappers, all separable so you can swap pieces (a pack/unpack sketch follows at the end of this post).

Built solo around full-time caregiving — technical feedback, criticism, or pointers to related work I missed are genuinely appreciated.
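To make the separable-pieces bullet concrete, here is a minimal stand-in for the YUV pack/unpack stage: a quantized uint8 tensor is laid out as the luma plane of a frame so a video encoder can treat it as an image, then recovered losslessly. The function names, frame width, and zero-padding policy are illustrative assumptions, not the library's actual API.

```python
import torch

def pack_to_luma(q: torch.Tensor, width: int = 1920):
    """Flatten a uint8 tensor row-major into an (H, W) luma plane, zero-padded."""
    flat = q.flatten()
    height = (flat.numel() + width - 1) // width       # round rows up
    plane = torch.zeros(height * width, dtype=torch.uint8, device=q.device)
    plane[: flat.numel()] = flat
    return plane.view(height, width), flat.numel()

def unpack_from_luma(plane: torch.Tensor, numel: int, shape):
    """Inverse of pack_to_luma: strip padding, restore the original shape."""
    return plane.flatten()[:numel].view(shape)

device = "cuda" if torch.cuda.is_available() else "cpu"
q = torch.randint(0, 256, (4096, 512), dtype=torch.uint8, device=device)
plane, n = pack_to_luma(q)
assert torch.equal(unpack_from_luma(plane, n, q.shape), q)  # lossless roundtrip
```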
torch-nvenc-compress: GPU NVENC silicon as a PCIe bandwidth multiplier — PCA + pure-ctypes Video Codec SDK wrapper. Parallel-path overlap measured at 67% of theoretical max on a real GEMM + encode workload. [P]
Reddit r/MachineLearning / 5/4/2026
💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage · Models & Research
Key Points
- The author presents torch-nvenc-compress, a Python library that repurposes the GPU’s NVENC/NVDEC hardware to compress LLM activations and KV cache, reducing the PCIe peer-to-peer bottleneck in consumer multi-GPU setups.
- The approach uses PCA with rank truncation as a preprocessing step, claiming that while activations/KV look noise-like in the standard basis, a PCA basis exposes a heavy-tailed covariance structure that video-codec-style compression can exploit (see the first sketch after this list).
- It reframes the system as a parallel-path “dual-lane” design where CUDA-stream pipelining lets NVENC codec work overlap with computation and transfers of other tensors, turning compression into an effective bandwidth multiplier (see the second sketch after this list).
- Early measurements on a real GEMM + encode workload report parallel-path overlap of about 67% of the theoretical maximum, indicating substantial hiding of codec and transfer costs.
- The project provides a “pure-ctypes” wrapper around the Video Codec SDK (DirectBackend) to reduce overhead (e.g., avoiding FFmpeg subprocess costs) and emphasizes efficient, zero-copy handling from CUDA tensors (a minimal ctypes load is the third sketch below).
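First sketch, for the PCA point: project activations into a data-driven basis, keep only the high-variance ranks, and per-channel quantize the coefficients to uint8 before any codec sees them. This is a generic PCA-plus-quantization sketch with assumed shapes and rank, not the repo's basis builder.

```python
import torch

def build_pca_basis(samples: torch.Tensor, rank: int) -> torch.Tensor:
    """samples: (N, D) calibration activations -> (D, rank) top directions."""
    # torch.pca_lowrank centers the data internally; columns of V span the
    # high-variance directions of the heavy-tailed covariance spectrum.
    _, _, v = torch.pca_lowrank(samples, q=rank)
    return v

def quantize_per_channel(x: torch.Tensor):
    """Affine per-channel uint8 quantization of (N, rank) PCA coefficients."""
    lo = x.amin(dim=0, keepdim=True)
    hi = x.amax(dim=0, keepdim=True)
    scale = (hi - lo).clamp_min(1e-8) / 255.0
    q = ((x - lo) / scale).round().clamp(0, 255).to(torch.uint8)
    return q, scale, lo

acts = torch.randn(8192, 4096)                 # stand-in calibration activations
mean = acts.mean(dim=0, keepdim=True)
basis = build_pca_basis(acts, rank=1024)       # rank truncation: keep 1/4 of D
coeffs = (acts - mean) @ basis                 # coefficients in the PCA basis
q, scale, lo = quantize_per_channel(coeffs)    # uint8 planes, ready to pack
recon = (q.float() * scale + lo) @ basis.T + mean  # dequantize + back-project
```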
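Second sketch, for the dual-lane point: the CUDA stream/event choreography that hides transfer behind compute. The NVENC encode itself is stubbed out as an async device-to-host copy of a pre-made buffer; only the overlap pattern is shown, and it assumes a CUDA build of PyTorch.

```python
import torch

assert torch.cuda.is_available()  # stream overlap needs a real GPU

compute = torch.cuda.default_stream()
lane2 = torch.cuda.Stream()       # the "second lane" for codec/transfer work

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
# Stand-in for an NVENC output: a bitstream already sitting in device memory.
bitstream = torch.randint(0, 256, (8 << 20,), dtype=torch.uint8, device="cuda")
host_buf = torch.empty_like(bitstream, device="cpu").pin_memory()

done = torch.cuda.Event()
lane2.wait_stream(compute)        # don't read bitstream before its producer ran
with torch.cuda.stream(lane2):
    # Async D2H copy proceeds on lane2 while the GEMM below occupies the SMs.
    host_buf.copy_(bitstream, non_blocking=True)
    done.record()

c = a @ b                         # compute lane keeps the SMs busy meanwhile
compute.wait_event(done)          # later compute-lane work waits for the copy
torch.cuda.synchronize()
```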
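Third sketch, for the pure-ctypes point: avoiding an FFmpeg subprocess means loading the NVENC driver library and calling its C entry points directly. This probes only the API version, the one call that needs no session setup; the library name is the Linux one, and this is an illustration of the approach, not the repo's DirectBackend.

```python
import ctypes

# Linux driver library name; Windows would be "nvEncodeAPI64.dll".
nvenc = ctypes.CDLL("libnvidia-encode.so.1")

# NvEncodeAPIGetMaxSupportedVersion is part of the public NVENC C API.
# The driver reports the version packed as (major << 4) | minor.
version = ctypes.c_uint32(0)
status = nvenc.NvEncodeAPIGetMaxSupportedVersion(ctypes.byref(version))
assert status == 0  # NV_ENC_SUCCESS
print(f"max NVENC API version: {version.value >> 4}.{version.value & 0xF}")
```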