AI Navigate

The guy who won the DGX Spark GB10 at the NVIDIA and Cartesia hackathon won an NVIDIA 5080 at PyTorch's hackathon doing GPU kernel optimization!

Reddit r/LocalLLaMA / 3/16/2026

📰 News · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The author topped the leaderboard for causal depthwise 1D convolution in a kernel-level optimization hackathon for B200 GPUs, achieving about 10 microseconds per operation.
  • PyTorch Helion's autotuner helped reduce the search space and compile down to Triton, allowing dozens of permutations to be tested and delivering roughly 90–95% of the optimization, with manual tuning handling the rest.
  • The setup included a Dell Pro Max T2 Tower with an NVIDIA Pro 6000 to run local inference for a private agent harness, enabling fast in-home inference via Lemonade hosting the local model.
  • The post notes the difficulty of optimizing hardware across different LLM architectures, citing patterns like Gated DeltaNet and Mixture of Experts, inter- and intra-chunk state handling, KV caching, padding, and fusion, and mentions sharing slides about the learnings.

I just wanted to give you all another update. Eventually I will stop competing in hackathons, BUT NOT TODAY!

I made some slides of my learnings if anyone is interested! I am doing some interesting stuff in neurotech and brain health, trying to detect neurological disorders, but that is a longer journey. So you'll have to settle for this.

https://medium.com/p/f995a53f14b4?postPublishedType=initial

At the last minute, I decided to get way outside my comfort zone and jump into a hackathon focused on kernel-level optimization for B200 GPUs.

I wanted to share some of my learnings here, so I made some slides!

This gave me a whole new level of respect for inference providers. The optimization problem is brutal: the number of configuration combinations explodes fast, and tiny changes can have a huge impact on performance.

Before this, I did not fully appreciate how difficult it is to optimize hardware across different LLM architectures. Every model can require a different strategy, and you have to think through things like Gated DeltaNet patterns, Mixture of Experts, inter-chunk state handling, intra-chunk attention, KV caching, padding, and fusion.

My best result: I topped the leaderboard for causal depthwise 1D convolution, getting the benchmark down to around 10 microseconds.
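For anyone unfamiliar with the operation being benchmarked: causal depthwise 1D convolution filters each channel independently with its own short filter, and each output position may only depend on current and past inputs (implicit left zero-padding). A minimal, unoptimized pure-Python reference (the function name and loop structure are illustrative, not the competition kernel):

```python
def causal_depthwise_conv1d(x, weights):
    """Reference causal depthwise 1D convolution.

    x:       list of channels, each a list of T samples
    weights: per-channel filter taps; weights[c][k] multiplies x[c][t - k]
    """
    out = []
    for signal, taps in zip(x, weights):
        row = []
        for t in range(len(signal)):
            acc = 0.0
            for k, w in enumerate(taps):
                if t - k >= 0:  # positions before the start are zero-padded
                    acc += w * signal[t - k]
            row.append(acc)
        out.append(row)
    return out
```

A winning GPU kernel computes exactly this contraction, but fused, tiled, and vectorized; the reference version is mainly useful for checking correctness against the fast one.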

At that level, even shaving off fractions of a microsecond matters. That is where performance wins happen.
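At the ~10 microsecond scale, a single timed call is swamped by timer overhead and jitter, so measurements have to average many repetitions. A hedged sketch of that kind of timing harness (simplified; real GPU benchmarks also need device synchronization and warmup, which a host-side timer like this does not capture):

```python
import time

def time_op_us(fn, iters=10_000):
    """Mean wall-clock microseconds per call, averaged over many iterations.

    Averaging amortizes timer overhead, which would otherwise dominate
    any single measurement of a microsecond-scale operation.
    """
    start = time.perf_counter_ns()
    for _ in range(iters):
        fn()
    end = time.perf_counter_ns()
    return (end - start) / iters / 1_000.0  # ns -> us per call
```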

A big part of this was using PyTorch Helion, which made it much easier to reduce the search space and find the needle in the haystack. Its autotuner compiles down to Triton, and I was able to automatically test dozens of permutations to get roughly 90–95% of the optimization. The rest came from manual tuning and grinding out the last bits of performance.
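Helion's actual API isn't reproduced here, but conceptually an autotuner does something like the following hypothetical sweep: enumerate every permutation of the configuration search space, time a kernel built from each one, and keep the fastest. The function and parameter names below are illustrative only:

```python
import itertools
import timeit

def sweep_configs(kernel_factory, search_space, args, reps=100):
    """Hypothetical autotune-style sweep (not Helion's real API).

    kernel_factory: callable that builds a kernel from config kwargs
    search_space:   dict mapping each config knob to its candidate values
    Returns the fastest config and its mean runtime in seconds.
    """
    keys = list(search_space)
    best_cfg, best_time = None, float("inf")
    for values in itertools.product(*(search_space[k] for k in keys)):
        cfg = dict(zip(keys, values))
        fn = kernel_factory(**cfg)
        t = timeit.timeit(lambda: fn(*args), number=reps) / reps
        if t < best_time:
            best_cfg, best_time = cfg, t
    return best_cfg, best_time
```

In the hackathon workflow described above, this role was played by Helion compiling candidate configurations down to Triton, with the final few percent recovered by hand.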

One of the coolest parts was using the Dell Pro Max T2 Tower with an NVIDIA Pro 6000 to run local inference for my agent harness. It reinforced something I keep seeing over and over: local LLM workflows can be incredibly fast when you have the right setup. I was able to beam inference requests from my machine at home all the way to my Dell Pro Max GB10 for private, fast, and reliable inference, with Lemonade hosting my local model!

Here are the past articles I wrote about my wins trying to leave the world a better place:

Creating personalized Learning for People using Computer Adaptive Learning

Finding the Social Determinants of Health to improve the lives of everyone

UPDATE: here is the repository if anyone is interested in GPU kernel optimization.

UPDATE #2: I almost forgot to mention, I also won another DGX Spark GB10 from NVIDIA and a Golden Ticket to GTC. Now I have 3 GB10s FOR THE ULTIMATE LocalLLaMA!

submitted by /u/brandon-i