
Speed Benchmark: GLM 4.7 Flash vs Qwen 3.5 27B vs Qwen 3.5 35B A3B (Q4 Quants)

Reddit r/LocalLLaMA / 3/11/2026

Tools & Practical Usage · Models & Research

Key Points

  • The author benchmarked three large language models—GLM 4.7 Flash, Qwen 3.5 27B, and Qwen 3.5 35B A3B—focusing purely on speed performance without evaluating output quality.
  • GLM 4.7 Flash significantly outperforms the two Qwen models in tokens per second (t/s) and thinking time, running roughly three times faster at short context and reaching its first token much sooner at 32K context.
  • The Qwen 3.5 35B A3B model benefits from a Mixture of Experts architecture, allowing it to maintain better throughput than the 27B dense Qwen model at large context sizes (32K tokens), although it experiences longer Time to First Token (TTFT).
  • All models show increased latency with very large context windows (32K tokens), with GLM handling this better than Qwen models but still facing notable TTFT delays.
  • The test setup included AMD Ryzen 5 3600X CPU, RTX 3090 GPU, and 64GB DDR4 memory using LM Studio with maximum GPU offload and context length settings.

Tested how fast these three thinking models run on my setup. Didn't check output quality at all, just raw speed. I used LM Studio with max context set to 64K and GPU offload at max for each model.

Hardware:

  • AMD Ryzen 5 3600X
  • RTX 3090
  • 64GB DDR4 @ 3600 MT/s

Plots:

https://preview.redd.it/evr1gbqiobog1.png?width=3573&format=png&auto=webp&s=b2f56092db4137d00de29d683bee89bdbe1b413d

Table:

| Model | t/s (short) | t/s (32K) | TTFT (short) | TTFT (32K) | Thinking (short) | Thinking (32K) |
|---|---|---|---|---|---|---|
| GLM 4.7 Flash | 96.68 | 65.08 | 0.44s | 31.39s | 11.62s | 22.95s |
| Qwen 3.5 27B | 30.26 | 27.61 | 0.35s | 40.76s | 118s | 78s |
| Qwen 3.5 35B A3B | 32.00 | 32.12 | 0.52s | 55.23s | 113s | 59.46s |
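For anyone wanting to reproduce numbers like these, here's a minimal timing sketch against an OpenAI-compatible streaming endpoint (LM Studio serves one at `http://localhost:1234/v1` by default). The model name and prompt are placeholders, not the author's exact setup; the post doesn't say how the timings were taken, so treat this as one plausible way to measure TTFT and decode t/s, not the method used.

```python
# Sketch: measure TTFT and decode tokens/s from a streamed completion.
# Assumes an OpenAI-compatible server (e.g. LM Studio's default at
# http://localhost:1234/v1); model name and URL are placeholders.
import json
import time
import urllib.request

def speed_stats(first_token_time, last_token_time, start_time, n_tokens):
    """Derive TTFT and decode tokens/s from stream timestamps."""
    ttft = first_token_time - start_time
    decode_window = last_token_time - first_token_time
    # n_tokens - 1 intervals occur between the first and last token.
    tps = (n_tokens - 1) / decode_window if decode_window > 0 else float("inf")
    return ttft, tps

def benchmark(prompt, model="glm-4.7-flash",
              url="http://localhost:1234/v1/chat/completions"):
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    first = last = None
    n = 0
    with urllib.request.urlopen(req) as resp:
        for raw in resp:  # server-sent events: lines like b'data: {...}\n'
            line = raw.decode().strip()
            if not line.startswith("data: ") or line == "data: [DONE]":
                continue
            chunk = json.loads(line[len("data: "):])
            if chunk["choices"][0]["delta"].get("content"):
                last = time.perf_counter()
                if first is None:
                    first = last
                n += 1
    return speed_stats(first, last, start, n)
```

Counting streamed chunks approximates token count; for exact numbers you'd read the server's own stats, which LM Studio also shows in its UI.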

What stood out to me:

GLM 4.7 Flash is ridiculously fast compared to the other two. Almost 3x the tokens/s and thinking times that are a fraction of what the Qwen models need. On short context, GLM thinks for about 12 seconds while both Qwen models sit there for almost 2 minutes.

The two Qwen models are pretty close to each other speed-wise, which makes sense given the MoE architecture on the 35B variant. The 35B A3B actually holds up better at 32K context than the 27B dense model on tokens/s, but takes longer on TTFT (Time to First Token).
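The MoE result has a simple back-of-envelope explanation: decode is roughly memory-bandwidth bound, so tokens/s scales with how many parameter bytes are read per token, and the A3B variant only activates about 3B of its 35B parameters per token. A hedged sketch of that estimate (the bytes-per-param and bandwidth figures are illustrative assumptions, not measurements):

```python
# Back-of-envelope decode estimate: assumes decode speed is limited by
# memory bandwidth, so t/s ≈ bandwidth / bytes read per token.
# All inputs here are rough assumptions for illustration.
def est_tps(active_params_b, bytes_per_param, bandwidth_gbs):
    """Upper-bound tokens/s for a bandwidth-bound decode."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gbs * 1e9 / bytes_per_token

# Q4 quants are very roughly ~0.55 bytes/param including overhead;
# 936 GB/s is the RTX 3090's spec bandwidth. Real throughput lands
# well below these ceilings, but the dense-vs-MoE ratio is the point:
dense_27b_ceiling = est_tps(27, 0.55, 936)  # reads all 27B params/token
moe_a3b_ceiling = est_tps(3, 0.55, 936)     # reads only ~3B active params
```

The roughly 9x gap in active parameters is why the 35B A3B can match or beat the 27B dense model on t/s despite being a bigger download.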

32K context TTFT is painful on all of them honestly, but manageable on GLM at around 31 seconds. The Qwen models go up to 40-55 seconds.
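Those TTFT numbers also imply each model's prefill throughput, assuming TTFT at 32K is dominated by prompt processing (a reasonable assumption at that size):

```python
# Implied prefill tokens/s from the table's 32K TTFT column,
# assuming TTFT ≈ prompt_tokens / prefill_speed at large context.
ttft_32k = {
    "GLM 4.7 Flash": 31.39,
    "Qwen 3.5 27B": 40.76,
    "Qwen 3.5 35B A3B": 55.23,
}
prefill_tps = {model: round(32_000 / t) for model, t in ttft_32k.items()}
# GLM: 32000 / 31.39 ≈ 1019 t/s; the Qwen models sit in the 500-800 range.
```

So even the "painful" GLM number corresponds to chewing through about a thousand prompt tokens per second.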

I'm currently trying each model out with OpenCode, but I don't have a conclusion yet on which works best; my first impression is that the Qwen models do a better job.

submitted by /u/aiko929