DGX Spark just arrived — planning to run vLLM + local models, looking for advice

Reddit r/LocalLLaMA / 4/15/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage

Key Points

  • A user recently set up a DGX Spark for on-prem local LLM inference and plans to use vLLM with PyTorch and Hugging Face models as a private API backend.
  • They are seeking community guidance on which models run efficiently on this specific unified-memory hardware configuration.
  • The user asks for vLLM tuning advice tailored to unified memory systems, including practical configuration considerations.
  • They want realistic expectations for throughput and performance compared with what they previously saw on cloud GPUs.
  • The post is effectively a request for field-tested best practices for deploying and scaling local model inference on DGX Spark.
DGX Spark just arrived — planning to run vLLM + local models, looking for advice

Just got a DGX Spark set up today and starting to configure it for local LLM inference.

Plan is to run:

• vLLM
• PyTorch
• Hugging Face models

as a local API backend for an application I’m building (education / analytics use case, trying to keep everything local/private).
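Rough sketch of how I'm planning to wire the backend up — the model name is just a placeholder and nothing here is tuned or validated yet, so treat it as a starting point rather than a working config:

    # assumes the vLLM OpenAI-compatible server is already running, e.g.:
    #   vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
    from openai import OpenAI

    # point the standard OpenAI client at the local vLLM endpoint
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
        messages=[{"role": "user", "content": "Summarize this lesson in two sentences."}],
        max_tokens=128,
    )
    print(resp.choices[0].message.content)

The idea is that the app just talks to an OpenAI-style endpoint, so swapping models later should only be a config change.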

I’ve mostly been working with cloud GPUs up to now, so this is my first time running something like this fully on-prem.

A few things I’m curious about:

• Best models people are running efficiently on this hardware?
• Any tuning tips for vLLM on unified memory systems like this? (rough starting config below)
• Real-world throughput vs expectations?
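For reference, this is the rough starting config I had in mind for tuning — the model and all values are guesses pulled from the vLLM docs, not anything I've benchmarked on unified memory:

    # quick offline sanity check with the knobs I expect to tune
    # (model name and values are placeholders, not benchmarked on this box)
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
        gpu_memory_utilization=0.80,  # leave headroom since CPU and GPU share the unified pool
        max_model_len=8192,           # cap context length to bound KV-cache size
        max_num_seqs=64,              # limit concurrent sequences per batch
        dtype="bfloat16",
    )

    out = llm.generate(
        ["Explain KV caching in one sentence."],
        SamplingParams(temperature=0.2, max_tokens=64),
    )
    print(out[0].outputs[0].text)

Mainly wondering whether gpu_memory_utilization behaves the same way when there's no separate VRAM pool, or if there's a smarter way to budget memory on this hardware.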

Would appreciate any insights from people running similar setups.

submitted by /u/dalemusser