AI Navigate

llama.cpp on $500 MacBook Neo: Prompt: 7.8 t/s / Generation: 3.9 t/s on Qwen3.5 9B Q3_K_M

Reddit r/LocalLLaMA / 3/12/2026

📰 News · Tools & Practical Usage · Models & Research

Key Points

  • A llama.cpp build was compiled to run the 9B Qwen3.5 GGUF model (Q3_K_M) on a $500 MacBook Neo with 8 GB RAM and an Apple A18 Pro chip.
  • This demonstrates that large language models can operate on consumer hardware with careful optimization, albeit slowly.
  • Observed speeds were about 7.8 tokens per second for prompting and 3.9 tokens per second for generation on that device.
  • The setup used 4 CPU threads, a 4,096-token context, batch size 128, a q4_0-quantized KV cache (-ctk q4_0, -ctv q4_0), and full GPU offload (-ngl all) on the Metal device MTL0.
  • The model file is 4.4 GB on disk, illustrating the memory footprint required to run a 9B model on a laptop.
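The 4.4 GB on-disk size is consistent with Q3_K_M's mixed 3-/4-bit quantization. Assuming an average of roughly 3.9 bits per weight (an estimate on our part, not a figure from the post), a quick sanity check:

```shell
# ~9e9 weights at an assumed average of 3.9 bits/weight for Q3_K_M
awk 'BEGIN { printf "%.1f GB\n", 9e9 * 3.9 / 8 / 1e9 }'
# prints: 4.4 GB
```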

Just compiled llama.cpp on a MacBook Neo with 8 GB RAM, loaded the 9B Qwen3.5 model, and it works (slowly, but it works).

Config used:

Build
  • llama.cpp version: 8294 (76ea1c1c4)

Machine
  • Model: MacBook Neo (Mac17,5)
  • Chip: Apple A18 Pro
  • CPU: 6 cores (2 performance + 4 efficiency)
  • GPU: Apple A18 Pro, 5 cores, Metal supported
  • Memory: 8 GB unified

Model
  • Hugging Face repo: unsloth/Qwen3.5-9B-GGUF
  • GGUF file: models/Qwen3.5-9B-Q3_K_M.gguf
  • File size on disk: 4.4 GB

Launch hyperparams

./build/bin/llama-cli \
  -m models/Qwen3.5-9B-Q3_K_M.gguf \
  --device MTL0 \
  -ngl all \
  -c 4096 \
  -b 128 \
  -ub 64 \
  -ctk q4_0 \
  -ctv q4_0 \
  --reasoning on \
  -t 4 \
  -tb 6 \
  -cnv
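For reproducing throughput numbers like the prompt/generation speeds quoted above, llama.cpp also ships a llama-bench tool. A minimal sketch, assuming the same model path and thread count as the config (the -p/-n token counts here are arbitrary illustrative choices, not from the post):

```shell
# Benchmark prompt processing (pp) and token generation (tg) throughput
./build/bin/llama-bench \
  -m models/Qwen3.5-9B-Q3_K_M.gguf \
  -p 512 \
  -n 128 \
  -t 4
```

llama-bench reports tokens per second for each phase separately, which is how prompt and generation speeds end up as two distinct numbers.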
submitted by /u/Shir_man