
Speed Benchmark: GLM 4.7 Flash vs Qwen 3.5 27B vs Qwen 3.5 35B A3B (Q4 Quants)

Reddit r/LocalLLaMA / 3/11/2026

Tools & Practical Usage · Models & Research

Key Points

  • The author benchmarked three large language models—GLM 4.7 Flash, Qwen 3.5 27B, and Qwen 3.5 35B A3B—focusing purely on speed performance without evaluating output quality.
  • GLM 4.7 Flash significantly outperforms the two Qwen models in tokens per second (t/s) and thinking time, running roughly three times faster at short context and reaching its first token much sooner at 32K context.
  • The Qwen 3.5 35B A3B model benefits from a Mixture of Experts architecture, allowing it to maintain better throughput than the 27B dense Qwen model at large context sizes (32K tokens), although it experiences longer Time to First Token (TTFT).
  • All models show increased latency with very large context windows (32K tokens), with GLM handling this better than Qwen models but still facing notable TTFT delays.
  • The test setup included AMD Ryzen 5 3600X CPU, RTX 3090 GPU, and 64GB DDR4 memory using LM Studio with maximum GPU offload and context length settings.

Tested how fast these three thinking models run on my setup. Didn't check output quality at all, just raw speed. I used LM Studio with max context set to 64K and GPU offload at max for each model.

Hardware:

  • AMD Ryzen 5 3600X
  • RTX 3090
  • 64GB DDR4 @ 3600 MT/s

Plots:

https://preview.redd.it/evr1gbqiobog1.png?width=3573&format=png&auto=webp&s=b2f56092db4137d00de29d683bee89bdbe1b413d

Table:

| Model | t/s (short) | t/s (32K) | TTFT (short) | TTFT (32K) | Thinking (short) | Thinking (32K) |
|---|---|---|---|---|---|---|
| GLM 4.7 Flash | 96.68 | 65.08 | 0.44s | 31.39s | 11.62s | 22.95s |
| Qwen 3.5 27B | 30.26 | 27.61 | 0.35s | 40.76s | 118s | 78s |
| Qwen 3.5 35B A3B | 32.00 | 32.12 | 0.52s | 55.23s | 113s | 59.46s |
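For anyone wanting to reproduce numbers like these, here's a minimal timing sketch against an OpenAI-compatible streaming endpoint (LM Studio serves one at `http://localhost:1234/v1` by default). The model name and prompt are placeholders, not the author's exact setup; the post doesn't say how the timings were taken, so treat this as one plausible way to measure TTFT and decode t/s, not the method used.

```python
# Sketch: measure TTFT and decode tokens/s from a streamed completion.
# Assumes an OpenAI-compatible server (e.g. LM Studio's default at
# http://localhost:1234/v1); model name and URL are placeholders.
import json
import time
import urllib.request

def speed_stats(first_token_time, last_token_time, start_time, n_tokens):
    """Derive TTFT and decode tokens/s from stream timestamps."""
    ttft = first_token_time - start_time
    decode_window = last_token_time - first_token_time
    # n_tokens - 1 intervals occur between the first and last token.
    tps = (n_tokens - 1) / decode_window if decode_window > 0 else float("inf")
    return ttft, tps

def benchmark(prompt, model="glm-4.7-flash",
              url="http://localhost:1234/v1/chat/completions"):
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    first = last = None
    n = 0
    with urllib.request.urlopen(req) as resp:
        for raw in resp:  # server-sent events: lines like b'data: {...}\n'
            line = raw.decode().strip()
            if not line.startswith("data: ") or line == "data: [DONE]":
                continue
            chunk = json.loads(line[len("data: "):])
            if chunk["choices"][0]["delta"].get("content"):
                last = time.perf_counter()
                if first is None:
                    first = last
                n += 1
    return speed_stats(first, last, start, n)
```

Counting streamed chunks approximates token count; for exact numbers you'd read the server's own stats, which LM Studio also shows in its UI.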

What stood out to me:

GLM 4.7 Flash is ridiculously fast compared to the other two. Almost 3x the tokens/s and thinking times that are a fraction of what the Qwen models need. On short context, GLM thinks for about 12 seconds while both Qwen models sit there for almost 2 minutes.

The two Qwen models are pretty close to each other speed-wise, which makes sense given the MoE architecture on the 35B variant. The 35B A3B actually holds up better at 32K context than the 27B dense model on tokens/s, but takes longer on TTFT (Time to First Token).
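The MoE result has a simple back-of-envelope explanation: decode is roughly memory-bandwidth bound, so tokens/s scales with how many parameter bytes are read per token, and the A3B variant only activates about 3B of its 35B parameters per token. A hedged sketch of that estimate (the bytes-per-param and bandwidth figures are illustrative assumptions, not measurements):

```python
# Back-of-envelope decode estimate: assumes decode speed is limited by
# memory bandwidth, so t/s ≈ bandwidth / bytes read per token.
# All inputs here are rough assumptions for illustration.
def est_tps(active_params_b, bytes_per_param, bandwidth_gbs):
    """Upper-bound tokens/s for a bandwidth-bound decode."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gbs * 1e9 / bytes_per_token

# Q4 quants are very roughly ~0.55 bytes/param including overhead;
# 936 GB/s is the RTX 3090's spec bandwidth. Real throughput lands
# well below these ceilings, but the dense-vs-MoE ratio is the point:
dense_27b_ceiling = est_tps(27, 0.55, 936)  # reads all 27B params/token
moe_a3b_ceiling = est_tps(3, 0.55, 936)     # reads only ~3B active params
```

The roughly 9x gap in active parameters is why the 35B A3B can match or beat the 27B dense model on t/s despite being a bigger download.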

32K context TTFT is painful on all of them honestly, but manageable on GLM at around 31 seconds. The Qwen models go up to 40-55 seconds.
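Those TTFT numbers also imply each model's prefill throughput, assuming TTFT at 32K is dominated by prompt processing (a reasonable assumption at that size):

```python
# Implied prefill tokens/s from the table's 32K TTFT column,
# assuming TTFT ≈ prompt_tokens / prefill_speed at large context.
ttft_32k = {
    "GLM 4.7 Flash": 31.39,
    "Qwen 3.5 27B": 40.76,
    "Qwen 3.5 35B A3B": 55.23,
}
prefill_tps = {model: round(32_000 / t) for model, t in ttft_32k.items()}
# GLM: 32000 / 31.39 ≈ 1019 t/s; the Qwen models sit in the 500-800 range.
```

So even the "painful" GLM number corresponds to chewing through about a thousand prompt tokens per second.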

I'm currently trying each model out with OpenCode, but I don't have a conclusion yet on which works best; my first impression is that the Qwen models do a better job.

submitted by /u/aiko929