Hey all,
If you're running GPTQ models on a Jetson Orin (AGX, NX, or Nano), you've probably noticed that stock vLLM doesn't ship Marlin kernels for SM 8.7. It covers SM 8.0, 8.6, 8.9, and 9.0, but not the Orin family, which means your tensor cores sit idle during GPTQ inference.
I ran into this while trying to serve Qwen3.5-35B-A3B-GPTQ-Int4 on an AGX Orin 64GB. The performance without Marlin was underwhelming, so I compiled vLLM 0.17.0 with the SM 8.7 target included and packaged it as a wheel.
The difference was significant:
- Prefill went from 523 tok/s (llama.cpp) to 2,001 tok/s — about 3.8x
- Decode improved from ~22.5 to ~31 tok/s at short context (within vLLM)
- End-to-end at 20K context: 17s vs 47s with llama.cpp (2.8x faster)
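As a quick sanity check, the headline multipliers follow directly from the raw numbers above:

```python
# Derive the quoted speedups from the raw benchmark figures above.
prefill_llamacpp = 523    # tok/s, llama.cpp prefill
prefill_vllm = 2001       # tok/s, vLLM + Marlin prefill
print(round(prefill_vllm / prefill_llamacpp, 1))  # -> 3.8

decode_before = 22.5      # tok/s, vLLM decode without Marlin
decode_after = 31.0       # tok/s, vLLM decode with Marlin
print(round(decode_after / decode_before, 1))     # -> 1.4

e2e_llamacpp = 47         # seconds, 20K-context run with llama.cpp
e2e_vllm = 17             # seconds, same run with vLLM + Marlin
print(round(e2e_llamacpp / e2e_vllm, 1))          # -> 2.8
```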
The wheel is on HuggingFace so you can install it with one line:
pip install https://huggingface.co/thehighnotes/vllm-jetson-orin/resolve/main/vllm-0.17.0+cu126-cp310-cp310-linux_aarch64.whl

Built for JetPack 6.x / CUDA 12.6 / Python 3.10 (the standard Jetson stack).
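After installing, one quick way to confirm your GPU really is SM 8.7 (so the Marlin path applies) is to ask PyTorch for the device's compute capability. A minimal sketch, assuming torch is present from the JetPack stack; the `is_sm87` helper is just for illustration, not part of the wheel:

```python
# Quick post-install check: Orin GPUs report compute capability (8, 7),
# which is the SM 8.7 target this wheel was built for.
import importlib.util

def is_sm87(capability):
    """True if a (major, minor) compute-capability tuple is SM 8.7 (Orin)."""
    return tuple(capability) == (8, 7)

# Only attempt the GPU query if torch is actually installed.
if importlib.util.find_spec("torch") is not None:
    import torch
    if torch.cuda.is_available():
        # torch.cuda.get_device_capability(0) returns e.g. (8, 7) on Orin.
        print(is_sm87(torch.cuda.get_device_capability(0)))
```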
Full benchmarks and setup notes in the repo: https://github.com/thehighnotes/vllm-jetson-orin
Hope this helps. Happy to answer questions if you're working with a similar setup.
~Mark