RTX 5070 Ti 16GB + 32GB RAM: Running Qwen3.6-35B-A3B Q8_0 @ 44 t/s (128K context)

Reddit r/LocalLLaMA / 4/24/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • The post shares a practical setup for running the unsloth Qwen3.6-35B-A3B GGUF model (Q8_0) using LM Studio on an RTX 5070 Ti with 32GB DDR5 RAM.
  • It reports observed performance of about 44 tokens per second with a large 128K context window.
  • The configuration uses specific LM Studio settings: GPU offload set to 40 layers and the MoE experts of 26 layers offloaded to the CPU.
  • It enables memory-mapping (mmap) and sets both the K cache and V cache to Q8_0 quantization to fit the workload within the available hardware.
  • The author suggests that using llama.cpp may yield better results than LM Studio for this use case.

32GB DDR5 RAM.

unsloth/Qwen3.6-35B-A3B-GGUF Q8_0 : 36.9 GB

LM Studio settings:

- GPU Offload: 40
- Offload MoE Experts to CPU: 26
- Try mmap: on
- K cache: Q8_0
- V cache: Q8_0

llama.cpp will be better.
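
For anyone who wants to try the same configuration directly in llama.cpp, a rough equivalent invocation might look like the sketch below. This is not the author's command; flag names are per recent llama.cpp builds, and the model path is a placeholder:

```shell
# Sketch of the LM Studio settings above as llama.cpp server flags.
# mmap is already the default in llama.cpp, so no extra flag is needed.
llama-server \
  -m Qwen3.6-35B-A3B-Q8_0.gguf \
  -ngl 40 \
  --n-cpu-moe 26 \
  -c 131072 \
  -ctk q8_0 -ctv q8_0
```

Here `-ngl 40` maps to "GPU Offload: 40", `--n-cpu-moe 26` to "Offload MoE Experts to CPU: 26", `-c 131072` to the 128K context window, and `-ctk`/`-ctv q8_0` to the Q8_0 K/V cache settings.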

submitted by /u/moahmo88