Qwen 3.6 vs 6 other models across 5 agent frameworks on M3 Ultra

Reddit r/LocalLLaMA / 4/18/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • The article presents an “agent compatibility matrix” benchmark comparing Qwen 3.6 plus six other models across five agent frameworks on an Apple M3 Ultra (256GB unified memory).
  • Qwen models consistently achieve very high structured tool-calling pass rates, with Qwen 3.6 35B hitting 100% across most frameworks and overall showing the best combination of correctness and speed.
  • Non-Qwen models show highly variable compatibility (often much lower, and strongly framework-dependent), suggesting tool-calling reliability has to be verified per framework rather than assumed from general benchmarks.
  • The benchmarks also include speed measurements (tokens per second), showing Qwen 3.6 35B at about 100 tok/s and lower throughput for several competing models.
  • The findings emphasize that selecting a model like Qwen for agentic tool use can dramatically reduce integration friction, while alternatives may require more engineering effort to reach comparable reliability.

I benchmarked Qwen 3.6, Qwen 3.5, and 5 other models across 5 agent frameworks on Apple Silicon — here's the full compatibility matrix

Hardware: Apple M3 Ultra, 256GB unified memory

Frameworks tested: Hermes Agent (64K stars), PydanticAI, LangChain, smolagents (HuggingFace), OpenClaude/Anthropic SDK

Models tested: Qwen 3.6 35B (brand new), Qwen 3.5 35B, Qwopus 27B, Qwen 3.5 27B, Llama 3.3 70B, DeepSeek-R1 32B, Gemma 4 26B

The Agent Compatibility Matrix

This is the part I wish existed before I started. Each cell = pass rate across structured tool calling tests (single tool, multi-tool selection, multi-turn, streaming, stress test, many-tools injection, no-leak check).

| Model | Hermes | PydanticAI | LangChain | smolagents | OpenClaude | Speed |
|---|---|---|---|---|---|---|
| Qwen 3.6 35B (4bit) | 100% | 100% | 93% | 100% | 100% | 100 tok/s |
| Qwen 3.5 35B (8bit) | 100% | 100% | 100% | 100% | 100% | 83 tok/s |
| Qwopus 27B (4bit) | 100% | 100% | 100% | 100% | 100% | 38 tok/s |
| Qwen 3.5 27B (4bit) | 100% | 100% | 100% | — | — | 38 tok/s |
| Gemma 4 26B (4bit) | 100% | 67% | 100% | 80% | — | ~40 tok/s |
| DeepSeek-R1 32B (4bit) | 55% | 50% | — | 100% | 40% | ~30 tok/s |
| Llama 3.3 70B (4bit) | 45% | 67% | 67% | 100% | — | ~20 tok/s |

(— = not reported.)
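To make the matrix concrete: each cell is simply passed tests over total tests for that model/framework pair. A minimal sketch of the scoring, with test names taken from the post but results as hypothetical placeholders:

```python
# Each matrix cell is the fraction of structured tool-calling tests a
# model passes in a given framework, as a percentage. Test names follow
# the post; the example results below are hypothetical.
TESTS = [
    "single_tool", "multi_tool_selection", "multi_turn",
    "streaming", "stress", "many_tools_injection", "no_leak",
]

def pass_rate(results: dict) -> float:
    """Fraction of the test suite that passed, on a 0-100 scale."""
    return 100.0 * sum(bool(results[t]) for t in TESTS) / len(TESTS)

cell = pass_rate({t: True for t in TESTS})  # a clean 100.0% run
```

One failed test out of seven drops a cell to ~86%, which is why a single flaky streaming run visibly dents a score.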

Key takeaway: The Qwen family completely dominates tool calling — every Qwen model hits 100% (or near-100%) across all frameworks. Non-Qwen models are a coin flip depending on which framework you use.

Speed Benchmarks (decode tok/s, same hardware)

| Model | RAM | Speed | Tool Calling | Best For |
|---|---|---|---|---|
| Qwen3.5-4B (4bit) | 2.4 GB | 168 tok/s | 100% | 16GB MacBook, fast iteration |
| GPT-OSS 20B (mxfp4) | 12 GB | 127 tok/s | 80% | Speed + decent quality |
| Qwen3.5-9B (4bit) | 5.1 GB | 108 tok/s | 100% | Sweet spot for most Macs |
| Qwen 3.6 35B (4bit) | ~20 GB | 100 tok/s | 100% | NEW: 256 experts, 262K ctx |
| Qwen3.5-35B (8bit) | 37 GB | 83 tok/s | 100% | Best quality-per-token |
| Qwen3.5-122B (mxfp4) | 65 GB | 57 tok/s | 100% | Frontier-level, 96GB+ Mac |

For reference, Ollama gets ~41 tok/s on Qwen3.5-9B on the same machine. So these numbers are 2-3x faster.
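For anyone reproducing the speed numbers: the measurement described below records one timestamp per decoded token and discards the ramp-up. This helper and the warmup value are my own sketch, not Rapid-MLX code.

```python
def decode_tok_per_s(token_times: list, warmup: int = 8) -> float:
    """Steady-state decode speed in tokens/sec.

    token_times: one time.perf_counter() timestamp per generated token.
    The first `warmup` tokens are skipped so prefill and cache warm-up
    don't skew the number (matching "steady-state decode, not first-token").
    """
    steady = token_times[warmup:]
    if len(steady) < 2:
        raise ValueError("need at least two post-warmup tokens")
    return (len(steady) - 1) / (steady[-1] - steady[0])

# e.g. timestamps spaced exactly 10 ms apart -> 100 tok/s
```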

Model Quality Baselines (HumanEval + tinyMMLU)

Speed isn't everything — here's how the models do on code generation and knowledge:

| Model | HumanEval (10) | tinyMMLU (10) | Tool Calling | MHI Score |
|---|---|---|---|---|
| Qwopus 27B | 80% | 90% | 100% | 92 |
| Qwen 3.5 27B | 40% | 100% | 100% | 82 |
| Qwen 3.5 35B (8bit) | 60% | 40% | 100% | 76 |
| Qwen 3.6 35B (4bit) | 20% | 30% | 100% | 62 |
| Llama 3.3 70B | 50% | 90% | varies | 56-83 |
| DeepSeek-R1 32B | 30% | 100% | varies | 49-79 |

MHI = Model-Harness Index: 50% tool calling + 30% HumanEval + 20% MMLU. Measures "how well does this model work as an agent backend."
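The formula is simple enough to sanity-check the table with a couple of lines of Python (the helper name is mine):

```python
def mhi(tool_calling: float, humaneval: float, mmlu: float) -> float:
    """Model-Harness Index as defined in the post:
    50% tool calling + 30% HumanEval + 20% MMLU, all on a 0-100 scale."""
    return 0.5 * tool_calling + 0.3 * humaneval + 0.2 * mmlu

mhi(100, 80, 90)  # Qwopus 27B -> 92.0
```

For the "varies" rows, plugging in the lowest and highest per-framework tool-calling rates gives the score range.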

Qwen 3.6 note: The low HumanEval/MMLU is likely a 4-bit quantization artifact on a day-0 release. Tool calling is flawless, though: if you just need an agent backend, it's the fastest option at 100 tok/s with 100% compatibility.

Interesting Findings

  1. Qwen 3.6 is blazing fast — 100 tok/s on a 35B MoE with 256 experts and 262K context. Only 3B active params per token keeps decode fast, and at 4-bit the full 35B weights fit in ~20GB.
  2. smolagents is the most forgiving framework — even DeepSeek-R1 and Llama 3.3 hit 100% with smolagents because it uses text-based code generation instead of structured function calling. If your model sucks at FC, try smolagents.
  3. Hermes Agent is the hardest test — 62 tools injected, multi-turn chains, streaming. Models that pass Hermes pass everything.
  4. 8-bit > 4-bit for quality — Qwen 3.5 35B at 8-bit scores 60% HumanEval vs the 4-bit version's lower scores. If you have the RAM, 8-bit is worth it.
  5. Don't use DeepSeek-R1 for tool calling — it's a reasoning model, not an agent model. 40-55% tool calling rate across frameworks. Great for math though.
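Finding #4 is easy to sanity-check with a back-of-the-envelope weight-memory estimate. The 5% overhead factor here is my own assumption, and real usage also grows with KV cache at long contexts:

```python
def quant_memory_gb(params_billion: float, bits: int,
                    overhead: float = 1.05) -> float:
    """Rough weight memory: params * (bits / 8) bytes per parameter,
    times a small assumed overhead for embeddings and runtime bookkeeping."""
    return params_billion * bits / 8 * overhead

quant_memory_gb(35, 8)  # ~36.8 GB, close to the 37 GB in the speed table
quant_memory_gb(35, 4)  # ~18.4 GB, in line with "~20 GB" for Qwen 3.6
```

So on a 256GB machine the 8-bit/4-bit choice is purely a quality trade-off, not a capacity one.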

How I Tested

All tests use the same methodology:

  • Tool calling: 7-11 API tests per harness — single tool, tool choice, multi-turn with tool results, streaming tool calls, many-tools injection (62 tools for Hermes), stress test (5 rapid calls checking for tag leaks), no-tool-needed (model should answer directly)
  • Framework-specific: Each framework's own test suite (PydanticAI structured output, LangChain with_structured_output, smolagents CodeAgent + ToolCallingAgent)
  • HumanEval: 10 tasks via completions endpoint, temp=0
  • MMLU: 10 tinyMMLU questions via completions endpoint
  • Speed: Measured at steady-state decode, not first-token

The server is Rapid-MLX — an OpenAI-compatible inference server built on Apple's MLX framework. All test code is open source in the repo under vllm_mlx/agents/testing.py and scripts/mhi_eval.py if you want to reproduce.

TL;DR

If you're running agents on Apple Silicon:

  • Best overall: Qwopus 27B (MHI 92, works with everything)
  • Fastest with perfect compatibility: Qwen 3.6 35B at 100 tok/s
  • Best quality-per-token: Qwen 3.5 35B 8-bit (60% HumanEval, 100% tools)
  • Budget pick: Qwen3.5-4B at 168 tok/s on a 16GB MacBook Air
  • Avoid for agents: DeepSeek-R1, Llama 3.3 (unless you use smolagents)

Happy to answer questions or run additional models if there's interest.

submitted by /u/Striking-Swim6702