M5-Max MacBook Pro 128GB RAM - Qwen3 Coder Next 8-Bit Benchmark

Reddit r/LocalLLaMA / 3/29/2026


Key Points

  • The article benchmarks Qwen3-Coder-Next in 8-bit quantization on Apple Silicon using two local inference backends—Apple MLX (via mlx-lm) and Ollama (llama.cpp-based)—to compare coding performance.
  • It measures throughput (tokens/sec), latency to the first token (TTFT), total wall-clock response time, and system memory usage using a shared Python harness with streaming enabled.
  • The reported headline result is that an M5-Max MacBook Pro with 128GB RAM achieves 72 tokens per second with the MLX backend for the 8-bit model.
  • The tests run multiple iterations per prompt and average results while excluding the cold-start iteration’s TTFT for the first prompt to reduce load-time bias.
  • The prompt suite includes six programming-oriented tasks ranging from short code completions to longer, reasoning-heavy implementations to assess practical coding capability.

Qwen3-Coder-Next 8-Bit Benchmark: MLX vs Ollama

TL;DR: An M5-Max with 128GB of RAM gets 72 tokens per second from Qwen3-Coder-Next 8-bit using MLX.

Overview

This benchmark compares two local inference backends — MLX (Apple's native ML framework) and Ollama (llama.cpp-based) — running the same Qwen3-Coder-Next model in 8-bit quantization on Apple Silicon. The goal is to measure raw throughput (tokens per second), time to first token (TTFT), and overall coding capability across a range of real-world programming tasks.

Methodology

Setup

  • MLX backend: mlx-lm v0.29.1 serving mlx-community/Qwen3-Coder-Next-8bit via its built-in OpenAI-compatible HTTP server on port 8080.
  • Ollama backend: Ollama serving qwen3-coder-next:Q8_0 via its OpenAI-compatible API on port 11434.
  • Both backends were accessed through the same Python benchmark harness using the OpenAI client library with streaming enabled (a minimal sketch of the harness setup follows this list).
  • Each test was run for 3 iterations per prompt. Results were averaged, excluding the first iteration's TTFT for the initial cold-start prompt (model load).
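
A minimal sketch of how that shared harness might be set up, assuming the ports and model identifiers listed above; the exact harness code isn't included in the post, so the function name and api_key placeholder are illustrative:

```python
# Sketch of the shared benchmark harness setup. Base URLs, model names, and
# the 0.7 temperature come from the post; everything else is illustrative.
from openai import OpenAI

BACKENDS = {
    "mlx": {
        "client": OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed"),
        "model": "mlx-community/Qwen3-Coder-Next-8bit",
    },
    "ollama": {
        "client": OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed"),
        "model": "qwen3-coder-next:Q8_0",
    },
}

def stream_completion(backend: str, prompt: str, max_tokens: int):
    """Issue one streamed chat completion against the chosen local backend."""
    cfg = BACKENDS[backend]
    return cfg["client"].chat.completions.create(
        model=cfg["model"],
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        temperature=0.7,  # temperature reported in the post
        stream=True,
    )
```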

Metrics

| Metric | Description |
|---|---|
| Tokens/sec (tok/s) | Output tokens generated per second. Higher is better. Approximated by counting streamed chunks (1 chunk ≈ 1 token). |
| TTFT (Time to First Token) | Latency from request sent to first token received. Lower is better. Measures prompt processing + initial decode. |
| Total Time | Wall-clock time for the full response. Lower is better. |
| Memory | System memory usage before and after each run, measured via psutil. |
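
One plausible way these metrics could be collected around a single streamed response (chunk counting and psutil readings follow the table above; dividing the token count by decode time, i.e. total time minus TTFT, is an assumption about how tok/s was computed):

```python
# Sketch of per-request metric collection. 1 streamed chunk ≈ 1 token, as noted
# above; memory is read via psutil. The decode-time denominator is an assumption.
import time
import psutil

def measure(stream) -> dict:
    mem_before_gb = psutil.virtual_memory().used / 1e9
    start = time.perf_counter()
    ttft = None
    chunks = 0
    for chunk in stream:
        if ttft is None:
            ttft = time.perf_counter() - start   # time to first token
        if chunk.choices and chunk.choices[0].delta.content:
            chunks += 1                          # 1 chunk ≈ 1 token
    total = time.perf_counter() - start
    mem_after_gb = psutil.virtual_memory().used / 1e9
    decode_time = max(total - (ttft or 0.0), 1e-9)
    return {
        "ttft_s": ttft,
        "total_s": total,
        "tok_per_sec": chunks / decode_time,
        "mem_before_gb": mem_before_gb,
        "mem_after_gb": mem_after_gb,
    }
```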

Test Suite

Six prompts were designed to cover a spectrum of coding tasks, from trivial completions to complex reasoning:

| Test | Description | Max Tokens | What It Measures |
|---|---|---|---|
| Short Completion | Write a palindrome check function | 150 | Minimal-latency code generation |
| Medium Generation | Implement an LRU cache class with type hints | 500 | Structured class design, API correctness |
| Long Reasoning | Explain async/await vs threading with examples | 1000 | Extended prose generation, technical accuracy |
| Debug Task | Find and fix bugs in merge sort + binary search | 800 | Bug identification, code comprehension, explanation |
| Complex Coding | Thread-safe bounded blocking queue with context manager | 1000 | Advanced concurrency patterns, API design |
| Code Review | Review 3 functions for performance/correctness/style | 1000 | Multi-function analysis, concrete suggestions |
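
For reference, the suite could be encoded for the harness roughly as follows; only the task names and max-token limits come from the table, so the prompt wording here is a hypothetical reconstruction:

```python
# Hypothetical prompt-suite definition: names and max_tokens match the table,
# the prompt text itself is invented for illustration.
PROMPT_SUITE = [
    {"name": "Short Completion",
     "prompt": "Write a Python function that checks whether a string is a palindrome.",
     "max_tokens": 150},
    {"name": "Medium Generation",
     "prompt": "Implement an LRU cache class in Python with full type hints.",
     "max_tokens": 500},
    {"name": "Long Reasoning",
     "prompt": "Explain async/await vs threading in Python, with code examples.",
     "max_tokens": 1000},
    {"name": "Debug Task",
     "prompt": "Find and fix the bugs in these merge sort and binary search implementations.",
     "max_tokens": 800},
    {"name": "Complex Coding",
     "prompt": "Implement a thread-safe bounded blocking queue usable as a context manager.",
     "max_tokens": 1000},
    {"name": "Code Review",
     "prompt": "Review these three functions for performance, correctness, and style.",
     "max_tokens": 1000},
]
```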

Results

Throughput (Tokens per Second)

| Test | Ollama (tok/s) | MLX (tok/s) | MLX Advantage |
|---|---|---|---|
| Short Completion | 32.51* | 69.62* | +114% |
| Medium Generation | 35.97 | 78.28 | +118% |
| Long Reasoning | 40.45 | 78.29 | +94% |
| Debug Task | 37.06 | 74.89 | +102% |
| Complex Coding | 35.84 | 76.99 | +115% |
| Code Review | 39.00 | 74.98 | +92% |
| Overall Average | 35.01 | 72.33 | +107% |

\* Short completion warm-run averages (excluding cold-start iterations).

Time to First Token (TTFT)

| Test | Ollama TTFT | MLX TTFT | MLX Advantage |
|---|---|---|---|
| Short Completion | 0.182s* | 0.076s* | 58% faster |
| Medium Generation | 0.213s | 0.103s | 52% faster |
| Long Reasoning | 0.212s | 0.105s | 50% faster |
| Debug Task | 0.396s | 0.179s | 55% faster |
| Complex Coding | 0.237s | 0.126s | 47% faster |
| Code Review | 0.405s | 0.176s | 57% faster |

\* Warm-run values only. Cold start was 65.3s (Ollama) vs 2.4s (MLX) for initial model load.

Cold Start

The first request to each backend includes model loading time:

| Backend | Cold Start TTFT | Notes |
|---|---|---|
| Ollama | 65.3 seconds | Loading 84 GB Q8_0 GGUF into memory |
| MLX | 2.4 seconds | Loading pre-sharded MLX weights |

MLX's cold start is 27x faster because MLX weights are pre-sharded for Apple Silicon's unified memory architecture, while Ollama must convert and map GGUF weights through llama.cpp.

Memory Usage

| Backend | Memory Before | Memory After (Stabilized) |
|---|---|---|
| Ollama | 89.5 GB | ~102 GB |
| MLX | 54.5 GB | ~93 GB |

Both backends settle to similar memory footprints once the model is fully loaded (~90-102 GB for an 84 GB model plus runtime overhead). MLX started with lower baseline memory because the model wasn't yet resident.

Capability Assessment

Beyond raw speed, the model produced high-quality outputs across all coding tasks on both backends (both serve the same Qwen3-Coder-Next model at 8-bit quantization, so output quality was essentially equivalent):

  • Bug Detection: Correctly identified both bugs in the test code (missing tail elements in merge, integer division and infinite loop in binary search; illustrated in the sketch after this list) across all iterations on both backends.
  • Code Generation: Produced well-structured, type-hinted implementations for LRU cache and blocking queue. Used appropriate stdlib components (OrderedDict, threading.Condition).
  • Code Review: Identified real issues (naive email regex, manual word counting vs Counter, type() vs isinstance()) and provided concrete improved implementations.
  • Consistency: Response quality was stable across iterations (same bugs found, same patterns used, similar token counts), even at the tested temperature of 0.7.
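
For illustration, the two bug classes described in the debug task look roughly like this; this is a reconstruction, not the post's actual test code:

```python
# Illustrative reconstruction of the debug-task bugs (not the original snippets):
# a merge step that drops remaining tail elements, and a binary search whose
# bound update can cause an infinite loop.
def merge(left: list, right: list) -> list:
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            result.append(left[i]); i += 1
        else:
            result.append(right[j]); j += 1
    # Bug in the flawed version: returning here silently drops whatever remains
    # in `left` or `right`. Fix: append both tails before returning.
    result.extend(left[i:])
    result.extend(right[j:])
    return result

def binary_search(items: list, target) -> int:
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2      # integer (floor) division is required here
        if items[mid] == target:
            return mid
        if items[mid] < target:
            lo = mid + 1          # writing `lo = mid` instead can loop forever
        else:
            hi = mid - 1
    return -1
```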

Conclusions

  1. MLX is 2x faster than Ollama for this model on Apple Silicon, averaging 72.3 tok/s vs 35.0 tok/s.
  2. TTFT is ~50% lower on MLX across all prompt types once warm.
  3. Cold start is dramatically better on MLX (2.4s vs 65.3s), which matters for interactive use.
  4. Qwen3-Coder-Next 8-bit at ~75 tok/s on MLX is fast enough for real-time coding assistance — responses feel instantaneous for short completions and stream smoothly for longer outputs.
  5. For local inference of large models on Apple Silicon, MLX is the clear winner over Ollama's llama.cpp backend, leveraging the unified memory architecture and Metal GPU acceleration more effectively.
submitted by /u/paddybuc