Running the new Qwen3.6-35B-A3B at full context on both a 4090 and GB10 Spark with vLLM and Llama.cpp

Reddit r/LocalLLaMA / 4/17/2026

💬 OpinionDeveloper Stack & InfrastructureSignals & Early TrendsTools & Practical Usage

Read original →

共有:

Key Points

Qwen3.6-35B-A3Bを「フルコンテキスト」で動かす具体例として、4090ではllama.cpp（IQ4_XS gguf）を用いる構成が示されています。
同じくフルコンテキスト対応として、Spark（推定GB10）ではvLLMでFP8モデルを提供する構成と起動パラメータ（GPUメモリ利用率など）が提示されています。
llama.cpp側のdocker-compose例では、--ctx-size=262144や--n-gpu-layers=999、flash-attnやキャッシュ設定（K/V）を含む詳細な実行オプションが記載されています。
vLLM側では、pandasを追加したcu130-nightly系のDockerfileが必要になる可能性に言及し、--reasoning-parserやツール呼び出し関連のオプションも含めた起動例が示されています。
両方の構成とも、コンテナ化（docker compose）とモデルのマウント、ポート公開、NVIDIA環境変数・IPC設定など運用に直結する手順がまとまっています。

Running the new Qwen3.6-35B-A3B at full context on both a 4090 and GB10 Spark with vLLM and Llama.cpp

Here is how to run the new Qwen3.6-35B-A3B

> At full context on a 4090 - IQ4_XS gguf with llama cpp

> At full context on a Spark - FP8 with a tweaked vLLM

Here is the docker compose with llama cpp

services: llamacpp: container_name: llamacpp-qwen3-6-35b-a3b-iq4xs image: ghcr.io/ggml-org/llama.cpp:server-cuda restart: unless-stopped gpus: all shm_size: "8gb" ipc: host environment: - NVIDIA_VISIBLE_DEVICES=all - NVIDIA_DRIVER_CAPABILITIES=compute,utility command: - -m - /models/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf - --host - 0.0.0.0 - --port - "8000" - --alias - qwen3.6-35b-a3b-iq4xs - --ctx-size - "262144" - --n-gpu-layers - "999" - --parallel - "1" - --threads - "8" - --flash-attn - on - --batch-size - "256" - --ubatch-size - "256" - --cache-type-k - f16 - --cache-type-v - f16 - --temp - "0.6" - --top-p - "0.95" - --top-k - "20" - --min-p - "0.0" - --presence-penalty - "0.0" - --repeat-penalty - "1.0" volumes: - /root/tank/models:/models:ro ports: - 9998:8000

Here is the docker compose with vllm
You need a dockerfile that paches vllm/vllm-openai:cu130-nightly with pandas for some reason

services: vllm: build: context: . dockerfile: Dockerfile image: vllm-qwen3.6-35b-a3b-fp8:local container_name: vllm-qwen3.6-35b-a3b-fp8 runtime: nvidia ports: - "8000:8000" volumes: - /home/etoprak/Documents/models/Qwen-Qwen3.6-35B-A3B-FP8:/models/Qwen3.6-35B-A3B-FP8:ro environment: - NVIDIA_VISIBLE_DEVICES=all - VLLM_LOGGING_LEVEL=INFO ipc: host command: - --model - /models/Qwen3.6-35B-A3B-FP8 - --served-model-name - Qwen3.6-35B-A3B-FP8 - --gpu-memory-utilization - "0.70" - --reasoning-parser - qwen3 - --enable-auto-tool-choice - --tool-call-parser - hermes deploy: resources: reservations: devices: - driver: nvidia count: all capabilities: [gpu] restart: unless-stopped

submitted by /u/erdaltoprak
[link] [comments]

Black Hat USA

AI Business

Black Hat Asia

AI Business

The AI Hype Cycle Is Lying to You About What to Learn

Dev.to

Big Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption.

Dev.to

OpenAI Codex April 2026 Update Review: Computer Use, Memory & 90+ Plugins — Is the Hype Real?

Dev.to

Running the new Qwen3.6-35B-A3B at full context on both a 4090 and GB10 Spark with vLLM and Llama.cpp

Key Points

Related Articles

Black Hat USA

Black Hat Asia

The AI Hype Cycle Is Lying to You About What to Learn

Big Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption.

OpenAI Codex April 2026 Update Review: Computer Use, Memory & 90+ Plugins — Is the Hype Real?

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer