Qwen 27b and Other Dense Models Optimization

Reddit r/LocalLLaMA / 4/6/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • A user reports that switching from Qwen 3.5 35B to a dense 27B model on a 64GB Mac M2 Max Studio greatly improves output quality but still yields very low generation speed (~3 tokens/second).
  • They list the performance-related settings they are already using (KV cache quantization at Q8, GPU offload, flash attention, mmap, max concurrency 4, eval batch 2048, CPU threads set to 8) while running via LM Studio and Openclaw.
  • The post asks for additional tips to increase throughput and reduce latency, especially to avoid issues with scheduled jobs and timing conflicts.
  • The user highlights that model speed affects downstream workflow reliability even when scheduler parameters are adjusted.
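The slowdown the user describes is expected from the model switch: decode speed on a Mac is roughly memory-bandwidth bound, so moving from a ~3B-active-parameter MoE (the "a3b" model) to a dense 27B means roughly 9× more weight bytes read per token. A back-of-envelope sketch, assuming ~400 GB/s usable M2 Max bandwidth and a Q4-class quant (~4.5 bits/weight) — both assumptions, not figures from the post:

```python
def est_tok_per_s(active_params_b: float, bandwidth_gb_s: float = 400.0,
                  bits_per_weight: float = 4.5) -> float:
    """Rough decode-speed ceiling: tok/s ~ bandwidth / bytes read per token."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# MoE with ~3B active params vs. a dense 27B that touches every weight per token
moe_ceiling = est_tok_per_s(3.0)     # roughly an order of magnitude faster
dense_ceiling = est_tok_per_s(27.0)  # ~26 tok/s theoretical upper bound
```

The dense ceiling (~26 tok/s under these assumptions) is still well above the observed 3 tok/s, which suggests overhead beyond raw weight streaming (KV cache traffic, concurrency settings, thermal limits) is also in play.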

Hi All,

I hadn't realized the KV cache quant made such a big difference, so I took my 64 GB Mac Studio (M2 Max) and switched from Qwen 3.5 35b a3b to the dense 27b. I love it, it's a huge difference, but I get maybe 3 tokens a second. I have KV cache at Q8, offload to GPU, flash attention, mmap, max concurrent 4, eval batch 2048, CPU threads set to 8, GPU offload full (64). I'm on LM Studio and run everything through Openclaw.
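For anyone wanting to reproduce or tweak these settings outside LM Studio, here is a hypothetical llama.cpp `llama-server` invocation mirroring them. The model filename is a placeholder, and exact flag names vary across llama.cpp versions, so treat this as a sketch rather than a drop-in command:

```shell
# Full GPU offload, Q8 KV cache (keys and values), flash attention,
# eval batch 2048, 8 CPU threads, 4 parallel request slots --
# matching the LM Studio settings described in the post.
llama-server -m ./qwen-27b-q4_k_m.gguf \
  -ngl 99 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --flash-attn \
  -b 2048 -t 8 -np 4
```

One design note: `-np 4` splits the context across four slots; if the workload is mostly one request at a time, dropping to `-np 1` frees KV cache memory and can help throughput on a 64 GB machine.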

Just wondering if there's anything I can do to speed it up. The output is wonderful, but man, the slow speed causes some issues, especially for my scheduled jobs, even when I adjust them. If a heartbeat runs up against a regular message, I'm f'd. Any tips would be greatly appreciated.

submitted by /u/Jordanthecomeback