I benchmarked Qwen 3.6, Qwen 3.5, and 5 other models across 5 agent frameworks on Apple Silicon — here's the full compatibility matrix
Hardware: Apple M3 Ultra, 256GB unified memory
Frameworks tested: Hermes Agent (64K stars), PydanticAI, LangChain, smolagents (HuggingFace), OpenClaude/Anthropic SDK
Models tested: Qwen 3.6 35B (brand new), Qwen 3.5 35B, Qwopus 27B, Qwen 3.5 27B, Llama 3.3 70B, DeepSeek-R1 32B, Gemma 4 26B
The Agent Compatibility Matrix
This is the part I wish existed before I started. Each cell = pass rate across structured tool calling tests (single tool, multi-tool selection, multi-turn, streaming, stress test, many-tools injection, no-leak check).
| Model | Hermes | PydanticAI | LangChain | smolagents | OpenClaude | Speed |
|---|---|---|---|---|---|---|
| Qwen 3.6 35B (4bit) | 100% | 100% | 93% | 100% | 100% | 100 tok/s |
| Qwen 3.5 35B (8bit) | 100% | 100% | 100% | 100% | 100% | 83 tok/s |
| Qwopus 27B (4bit) | 100% | 100% | 100% | 100% | 100% | 38 tok/s |
| Qwen 3.5 27B (4bit) | 100% | 100% | 100% | — | — | 38 tok/s |
| Gemma 4 26B (4bit) | 100% | 67% | — | 100% | 80% | ~40 tok/s |
| DeepSeek-R1 32B (4bit) | 55% | 50% | — | 100% | 40% | ~30 tok/s |
| Llama 3.3 70B (4bit) | 45% | 67% | 67% | 100% | — | ~20 tok/s |
Key takeaway: The Qwen family completely dominates tool calling — every Qwen model hits 100% (or near-100%) across all frameworks. Non-Qwen models are a coin flip depending on which framework you use.
Speed Benchmarks (decode tok/s, same hardware)
| Model | RAM | Speed | Tool Calling | Best For |
|---|---|---|---|---|
| Qwen3.5-4B (4bit) | 2.4 GB | 168 tok/s | 100% | 16GB MacBook, fast iteration |
| GPT-OSS 20B (mxfp4) | 12 GB | 127 tok/s | 80% | Speed + decent quality |
| Qwen3.5-9B (4bit) | 5.1 GB | 108 tok/s | 100% | Sweet spot for most Macs |
| Qwen 3.6 35B (4bit) | ~20 GB | 100 tok/s | 100% | NEW — 256 experts, 262K ctx |
| Qwen3.5-35B (8bit) | 37 GB | 83 tok/s | 100% | Best quality-per-token |
| Qwen3.5-122B (mxfp4) | 65 GB | 57 tok/s | 100% | Frontier-level, 96GB+ Mac |
For reference, Ollama gets ~41 tok/s on Qwen3.5-9B on the same machine. So these numbers are 2-3x faster.
Model Quality Baselines (HumanEval + tinyMMLU)
Speed isn't everything — here's how the models do on code generation and knowledge:
| Model | HumanEval (10) | MMLU (10) | Tool Calling | MHI Score |
|---|---|---|---|---|
| Qwopus 27B | 80% | 90% | 100% | 92 |
| Qwen 3.5 27B | 40% | 100% | 100% | 82 |
| Qwen 3.5 35B (8bit) | 60% | 40% | 100% | 76 |
| Qwen 3.6 35B (4bit) | 20% | 30% | 100% | 56 |
| Llama 3.3 70B | 50% | 90% | varies | 56-83 |
| DeepSeek-R1 32B | 30% | 100% | varies | 49-79 |
MHI = Model-Harness Index: 50% tool calling + 30% HumanEval + 20% MMLU. Measures "how well does this model work as an agent backend."
Qwen 3.6 note: The low HumanEval/MMLU is likely a 4-bit quantization artifact on a day-0 model. It was released days ago. Tool calling is flawless though — if you just need an agent backend, it's the fastest option at 100 tok/s with 100% compatibility.
Interesting Findings
- Qwen 3.6 is blazing fast — 100 tok/s on a 35B MoE with 256 experts and 262K context. Only 3B active params means it fits in ~20GB.
- smolagents is the most forgiving framework — even DeepSeek-R1 and Llama 3.3 hit 100% with smolagents because it uses text-based code generation instead of structured function calling. If your model sucks at FC, try smolagents.
- Hermes Agent is the hardest test — 62 tools injected, multi-turn chains, streaming. Models that pass Hermes pass everything.
- 8-bit > 4-bit for quality — Qwen 3.5 35B at 8-bit scores 60% HumanEval vs the 4-bit version's lower scores. If you have the RAM, 8-bit is worth it.
- Don't use DeepSeek-R1 for tool calling — it's a reasoning model, not an agent model. 40-55% tool calling rate across frameworks. Great for math though.
How I Tested
All tests use the same methodology:
- Tool calling: 7-11 API tests per harness — single tool, tool choice, multi-turn with tool results, streaming tool calls, many-tools injection (62 tools for Hermes), stress test (5 rapid calls checking for tag leaks), no-tool-needed (model should answer directly)
- Framework-specific: Each framework's own test suite (PydanticAI structured output, LangChain with_structured_output, smolagents CodeAgent + ToolCallingAgent)
- HumanEval: 10 tasks via completions endpoint, temp=0
- MMLU: 10 tinyMMLU questions via completions endpoint
- Speed: Measured at steady-state decode, not first-token
The server is Rapid-MLX — an OpenAI-compatible inference server built on Apple's MLX framework. All test code is open source in the repo under vllm_mlx/agents/testing.py and scripts/mhi_eval.py if you want to reproduce.
TL;DR
If you're running agents on Apple Silicon:
- Best overall: Qwopus 27B (MHI 92, works with everything)
- Fastest with perfect compatibility: Qwen 3.6 35B at 100 tok/s
- Best quality-per-token: Qwen 3.5 35B 8-bit (60% HumanEval, 100% tools)
- Budget pick: Qwen3.5-4B at 168 tok/s on a 16GB MacBook Air
- Avoid for agents: DeepSeek-R1, Llama 3.3 (unless you use smolagents)
Happy to answer questions or run additional models if there's interest.
[link] [comments]




