So maybe this is a no-brainer to many experienced local LLM users, but it was not obvious to me.
I am running a 3070 (8 GB) + 64 GB DDR4. Pretty lightweight setup, so I chose the smallest Q4 unsloth model, Qwen3.6-35B-A3B-UD-IQ4_XS.gguf, which is ~18 GB. It does run OK, and with some optimizations in llama.cpp I got about 25-30 tokens/s with a 32k context window.
I did have some problems with looping during thinking, so I tried a bigger Q4 model, Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf (~23 GB). To my surprise, this is much faster! With a 128k context window, I am seeing 32 tokens/s.
I ended up using Q5_K_S for the best quality/speed balance, about 30 tokens/s, again with a 128k context window. Speed does go down with long context, but it's still over 25 tokens/s at 50k context! (Haven't tested higher yet.)
Bottom line - for MoE models like this, experiment with bigger quants than you'd expect to be able to use!
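Some back-of-envelope math on why this works: on a MoE model, each decoded token only touches the *active* experts (~3B params for an A3B model), so decode speed is roughly bounded by RAM bandwidth over that working set, not the full model size. All three numbers below are assumptions for illustration, not measurements:

```python
# Rough bandwidth math (all numbers are assumptions, not measurements):
# each decoded token reads only the active experts, so the working set
# is ~3B params regardless of whether the full file is 18 GB or 23 GB.
ACTIVE_PARAMS = 3e9      # "A3B" = ~3B active params per token
BITS_PER_WEIGHT = 4.5    # rough effective size of a Q4_K-style quant
RAM_BANDWIDTH = 50e9     # ~50 GB/s usable dual-channel DDR4 (assumption)

bytes_per_token = ACTIVE_PARAMS * BITS_PER_WEIGHT / 8
ceiling_tps = RAM_BANDWIDTH / bytes_per_token

print(f"weights read per token: {bytes_per_token / 1e9:.2f} GB")  # ~1.69 GB
print(f"bandwidth-bound ceiling: {ceiling_tps:.0f} tokens/s")     # ~30 tokens/s
```

That ceiling is in the same ballpark as the observed 25-32 tokens/s, and it barely moves when you step up a quant size, since the active-weight read per token grows only slightly. That's why a bigger quant on a MoE doesn't cost nearly as much speed as it would on a dense model.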