tested 4 local models on iphone - benchmarks + the 9.9 vs 9.11 math trick

Reddit r/LocalLLaMA / 2026/3/25

💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • The post reports an on-device benchmark of four Q4-quantized local LLMs running fully on an iPhone 15 Pro Max without using the internet.
  • In a simple sanity check comparing “9.9 vs 9.11,” all four models answered correctly, though their internal reasoning styles differed substantially.
  • The author measured performance using GPU tokens/sec and time-to-first-token, with LFM2.5 VL (1.6B) showing both the highest tokens/sec and the lowest time-to-first-token.
  • The benchmark results suggest meaningful variation in both throughput and latency across small local models, even when quantized to the same Q4 level.
  • Readers are invited to request additional models to test next, indicating an ongoing community-driven evaluation effort.

did a local LLM benchmark on my iphone 15 pro max last night. tested 4 models, all Q4 quantized, running fully on-device with no internet.

first the sanity check. asked each one "which number is larger, 9.9 or 9.11" and all 4 got it right. the reasoning styles were pretty different though. qwen3.5 went full thinking mode with a step-by-step breakdown, minicpm literally just answered "9.9" and called it a day lmao :)
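For readers unfamiliar with the trick: 9.9 is numerically larger than 9.11, but models (and humans) sometimes read these as version numbers, where the part after the dot is the integer 11. A minimal sketch of both readings (variable names are illustrative, not from the post):

```python
# Numeric reading: 9.9 > 9.11 because 0.90 > 0.11.
a, b = 9.9, 9.11
assert a > b

# The common failure mode: treating them as version numbers,
# comparing (major, minor) tuples, where minor 11 beats minor 9.
version_a = tuple(int(p) for p in "9.9".split("."))   # (9, 9)
version_b = tuple(int(p) for p in "9.11".split("."))  # (9, 11)
print(version_b > version_a)  # True: as versions, 9.11 "wins"
```

That ambiguity is exactly why this prompt works as a quick sanity check.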

| Model | GPU Tokens/s | Time to First Token |
| --- | --- | --- |
| Qwen3.5 4B Q4 | 10.4 | 0.7s |
| LFM2.5 VL 1.6B | 44.6 | 0.2s |
| Gemma3 4B MLX Q4 | 15.6 | 0.9s |
| MiniCPM-V 4 | 16.1 | 0.6s |
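One rough way to combine the two columns above into a single latency figure, assuming a steady decode rate (a simplification; real decode speed varies with context length):

```python
# Estimate end-to-end time for a response of n_tokens:
# total ≈ time_to_first_token + n_tokens / tokens_per_sec
results = {
    "Qwen3.5 4B Q4":    (10.4, 0.7),
    "LFM2.5 VL 1.6B":   (44.6, 0.2),
    "Gemma3 4B MLX Q4": (15.6, 0.9),
    "MiniCPM-V 4":      (16.1, 0.6),
}

def total_time(tokens_per_sec, ttft, n_tokens=200):
    """Estimated seconds to produce n_tokens, including first-token latency."""
    return ttft + n_tokens / tokens_per_sec

for name, (tps, ttft) in sorted(results.items(),
                                key=lambda kv: total_time(*kv[1])):
    print(f"{name:17s} ~{total_time(tps, ttft):.1f}s for 200 tokens")
```

By this estimate LFM2.5 VL finishes a 200-token reply in roughly a third of the time of the 4B models, which matches the raw throughput gap.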

drop a comment if there's a model you want me to test next, i'll get back to everyone later today!

submitted by /u/EthanJohnson01