Hey all,
If you're running GPTQ models on a Jetson Orin (AGX, NX, or Nano), you've probably noticed that stock vLLM doesn't ship Marlin kernels for SM 8.7. It covers SM 8.0, 8.6, 8.9, and 9.0, but not the Orin family, which means your tensor cores sit idle during GPTQ inference.
I ran into this while trying to serve Qwen3.5-35B-A3B-GPTQ-Int4 on an AGX Orin 64GB. The performance without Marlin was underwhelming, so I compiled vLLM 0.17.0 with the SM 8.7 target included and packaged it as a wheel.
The difference was significant:
- Prefill went from 523 tok/s (llama.cpp) to 2,001 tok/s — about 3.8x
- Decode improved from ~22.5 to ~31 tok/s at short context (within vLLM)
- End-to-end at 20K context: 17s vs 47s with llama.cpp (2.8x faster)
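As a quick sanity check, the headline multipliers follow directly from the raw numbers above:

```python
# Derive the quoted speedups from the raw benchmark figures above.
prefill_llamacpp = 523    # tok/s, llama.cpp prefill
prefill_vllm = 2001       # tok/s, vLLM + Marlin prefill
print(round(prefill_vllm / prefill_llamacpp, 1))  # -> 3.8

decode_before = 22.5      # tok/s, vLLM decode without Marlin
decode_after = 31.0       # tok/s, vLLM decode with Marlin
print(round(decode_after / decode_before, 1))     # -> 1.4

e2e_llamacpp = 47         # seconds, 20K-context run with llama.cpp
e2e_vllm = 17             # seconds, same run with vLLM + Marlin
print(round(e2e_llamacpp / e2e_vllm, 1))          # -> 2.8
```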
The wheel is on HuggingFace so you can install it with one line:
pip install https://huggingface.co/thehighnotes/vllm-jetson-orin/resolve/main/vllm-0.17.0+cu126-cp310-cp310-linux_aarch64.whl

Built for JetPack 6.x / CUDA 12.6 / Python 3.10 (the standard Jetson stack).
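After installing, one quick way to confirm your GPU really is SM 8.7 (so the Marlin path applies) is to ask PyTorch for the device's compute capability. A minimal sketch, assuming torch is present from the JetPack stack; the `is_sm87` helper is just for illustration, not part of the wheel:

```python
# Quick post-install check: Orin GPUs report compute capability (8, 7),
# which is the SM 8.7 target this wheel was built for.
import importlib.util

def is_sm87(capability):
    """True if a (major, minor) compute-capability tuple is SM 8.7 (Orin)."""
    return tuple(capability) == (8, 7)

# Only attempt the GPU query if torch is actually installed.
if importlib.util.find_spec("torch") is not None:
    import torch
    if torch.cuda.is_available():
        # torch.cuda.get_device_capability(0) returns e.g. (8, 7) on Orin.
        print(is_sm87(torch.cuda.get_device_capability(0)))
```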
Full benchmarks and setup notes in the repo: https://github.com/thehighnotes/vllm-jetson-orin
Hope this helps. Happy to answer questions if you're working with a similar setup.
~Mark