Qwen 3.6 vs 6 other models across 5 agent frameworks on M3 Ultra

Reddit r/LocalLLaMA / 4/18/2026

💬 OpinionDeveloper Stack & InfrastructureSignals & Early TrendsTools & Practical UsageModels & Research

共有:

Key Points

The article presents an “agent compatibility matrix” benchmark comparing Qwen 3.6 plus six other models across five agent frameworks on an Apple M3 Ultra (256GB unified memory).
Qwen models consistently achieve very high structured tool-calling pass rates, with Qwen 3.6 35B hitting 100% across most frameworks and overall showing the best combination of correctness and speed.
Non-Qwen models show highly variable compatibility (often much lower or framework-dependent), indicating that tool-calling reliability may require framework-specific tuning rather than assuming general performance.
The benchmarks also include speed measurements (tokens per second), showing Qwen 3.6 35B at about 100 tok/s and lower throughput for several competing models.
The findings emphasize that selecting a model like Qwen for agentic tool use can dramatically reduce integration friction, while alternatives may require more engineering effort to reach comparable reliability.

I benchmarked Qwen 3.6, Qwen 3.5, and 5 other models across 5 agent frameworks on Apple Silicon — here's the full compatibility matrix

Hardware: Apple M3 Ultra, 256GB unified memory

Frameworks tested: Hermes Agent (64K stars), PydanticAI, LangChain, smolagents (HuggingFace), OpenClaude/Anthropic SDK

Models tested: Qwen 3.6 35B (brand new), Qwen 3.5 35B, Qwopus 27B, Qwen 3.5 27B, Llama 3.3 70B, DeepSeek-R1 32B, Gemma 4 26B

The Agent Compatibility Matrix

This is the part I wish existed before I started. Each cell = pass rate across structured tool calling tests (single tool, multi-tool selection, multi-turn, streaming, stress test, many-tools injection, no-leak check).

Model	Hermes	PydanticAI	LangChain	smolagents	OpenClaude	Speed
Qwen 3.6 35B (4bit)	100%	100%	93%	100%	100%	100 tok/s
Qwen 3.5 35B (8bit)	100%	100%	100%	100%	100%	83 tok/s
Qwopus 27B (4bit)	100%	100%	100%	100%	100%	38 tok/s
Qwen 3.5 27B (4bit)	100%	100%	100%	—	—	38 tok/s
Gemma 4 26B (4bit)	100%	67%	—	100%	80%	~40 tok/s
DeepSeek-R1 32B (4bit)	55%	50%	—	100%	40%	~30 tok/s
Llama 3.3 70B (4bit)	45%	67%	67%	100%	—	~20 tok/s

Key takeaway: The Qwen family completely dominates tool calling — every Qwen model hits 100% (or near-100%) across all frameworks. Non-Qwen models are a coin flip depending on which framework you use.

Speed Benchmarks (decode tok/s, same hardware)

Model	RAM	Speed	Tool Calling	Best For
Qwen3.5-4B (4bit)	2.4 GB	168 tok/s	100%	16GB MacBook, fast iteration
GPT-OSS 20B (mxfp4)	12 GB	127 tok/s	80%	Speed + decent quality
Qwen3.5-9B (4bit)	5.1 GB	108 tok/s	100%	Sweet spot for most Macs
Qwen 3.6 35B (4bit)	~20 GB	100 tok/s	100%	NEW — 256 experts, 262K ctx
Qwen3.5-35B (8bit)	37 GB	83 tok/s	100%	Best quality-per-token
Qwen3.5-122B (mxfp4)	65 GB	57 tok/s	100%	Frontier-level, 96GB+ Mac

For reference, Ollama gets ~41 tok/s on Qwen3.5-9B on the same machine. So these numbers are 2-3x faster.

Model Quality Baselines (HumanEval + tinyMMLU)

Speed isn't everything — here's how the models do on code generation and knowledge:

Model	HumanEval (10)	MMLU (10)	Tool Calling	MHI Score
Qwopus 27B	80%	90%	100%	92
Qwen 3.5 27B	40%	100%	100%	82
Qwen 3.5 35B (8bit)	60%	40%	100%	76
Qwen 3.6 35B (4bit)	20%	30%	100%	56
Llama 3.3 70B	50%	90%	varies	56-83
DeepSeek-R1 32B	30%	100%	varies	49-79

MHI = Model-Harness Index: 50% tool calling + 30% HumanEval + 20% MMLU. Measures "how well does this model work as an agent backend."

Qwen 3.6 note: The low HumanEval/MMLU is likely a 4-bit quantization artifact on a day-0 model. It was released days ago. Tool calling is flawless though — if you just need an agent backend, it's the fastest option at 100 tok/s with 100% compatibility.

Interesting Findings

Qwen 3.6 is blazing fast — 100 tok/s on a 35B MoE with 256 experts and 262K context. Only 3B active params means it fits in ~20GB.
smolagents is the most forgiving framework — even DeepSeek-R1 and Llama 3.3 hit 100% with smolagents because it uses text-based code generation instead of structured function calling. If your model sucks at FC, try smolagents.
Hermes Agent is the hardest test — 62 tools injected, multi-turn chains, streaming. Models that pass Hermes pass everything.
8-bit > 4-bit for quality — Qwen 3.5 35B at 8-bit scores 60% HumanEval vs the 4-bit version's lower scores. If you have the RAM, 8-bit is worth it.
Don't use DeepSeek-R1 for tool calling — it's a reasoning model, not an agent model. 40-55% tool calling rate across frameworks. Great for math though.

How I Tested

All tests use the same methodology:

Tool calling: 7-11 API tests per harness — single tool, tool choice, multi-turn with tool results, streaming tool calls, many-tools injection (62 tools for Hermes), stress test (5 rapid calls checking for tag leaks), no-tool-needed (model should answer directly)
Framework-specific: Each framework's own test suite (PydanticAI structured output, LangChain with_structured_output, smolagents CodeAgent + ToolCallingAgent)
HumanEval: 10 tasks via completions endpoint, temp=0
MMLU: 10 tinyMMLU questions via completions endpoint
Speed: Measured at steady-state decode, not first-token

The server is Rapid-MLX — an OpenAI-compatible inference server built on Apple's MLX framework. All test code is open source in the repo under vllm_mlx/agents/testing.py and scripts/mhi_eval.py if you want to reproduce.

TL;DR

If you're running agents on Apple Silicon:

Best overall: Qwopus 27B (MHI 92, works with everything)
Fastest with perfect compatibility: Qwen 3.6 35B at 100 tok/s
Best quality-per-token: Qwen 3.5 35B 8-bit (60% HumanEval, 100% tools)
Budget pick: Qwen3.5-4B at 168 tok/s on a 16GB MacBook Air
Avoid for agents: DeepSeek-R1, Llama 3.3 (unless you use smolagents)

Happy to answer questions or run additional models if there's interest.

submitted by /u/Striking-Swim6702
[link] [comments]

Black Hat USA

AI Business

Black Hat Asia

AI Business

As China’s biotech firms shift gears, can AI floor the accelerator?

SCMP Tech

Why AI Teams Are Standardizing on a Multi-Model Gateway

Dev.to

From Chaos to Cadence: Automating Your Post-Show Follow-Up with AI