C++ CuTe / CUTLASS vs CuTeDSL (Python) in 2026 — what should new GPU kernel / LLM inference engineers actually learn? [D]

Reddit r/MachineLearning / 4/20/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Ideas & Deep Analysis · Tools & Practical Usage

Key Points

  • The post discusses a hiring and learning dilemma for 2026 GPU kernel and LLM inference engineers, contrasting “C++17, CuTe, CUTLASS” job requirements with NVIDIA’s push for the new CuTeDSL workflow.
  • NVIDIA is presented as recommending CuTeDSL (Python DSL in CUTLASS 4.x) to reduce template metaprogramming complexity, enable faster iteration via JIT, and integrate more directly with TorchInductor.
  • The author suggests the shift may be reflected in projects and roadmaps such as FlashAttention-4, FlashInfer, and SGLang’s NVIDIA collaboration efforts.
  • The core question is whether newcomers should deeply learn legacy C++ CuTe/CUTLASS templates or prioritize CuTeDSL → Triton → (possibly) Mojo/Rust for serving, keeping C++ only for reading old code.
  • The post also asks whether the “new stack” is already production-viable, or whether strong C++ CUTLASS skills are still necessary for getting hired and shipping real GPU kernels.

For people just starting out in GPU kernel engineering or LLM inference (FlashAttention / FlashInfer / SGLang / vLLM style work), most job postings still list “C++17, CuTe, CUTLASS” as hard requirements.

At the same time, NVIDIA has been pushing CuTeDSL (the Python DSL in CUTLASS 4.x) hard since late 2025 as the recommended path for new kernels — same performance, no template metaprogramming, JIT compilation, much faster iteration, and direct TorchInductor integration.

The shift feels real in FlashAttention-4, FlashInfer, and SGLang’s NVIDIA collab roadmap.

Question for those already working in this space:

For someone starting fresh in 2026, is it still worth going deep on legacy C++ CuTe/CUTLASS templates, or should they prioritize CuTeDSL → Triton → Mojo (and keep only light C++ for reading old code)?

Is the “new stack” (CuTeDSL + Triton + Rust/Mojo for serving) actually production-viable right now, or are the job postings correct that you still need strong C++ CUTLASS skills to get hired and ship real kernels?

Any war stories or advice on the right learning order for new kernel engineers who want to contribute to FlashInfer / SGLang / FlashAttention?

Looking for honest takes — thanks!

submitted by /u/Daemontatox