did a local LLM benchmark on my iphone 15 pro max last night. tested 4 models, all Q4 quantized, running fully on-device with no internet. first the sanity check. asked each one "which number is larger, 9.9 or 9.11" and all 4 got it right. the reasoning styles were pretty different though. qwen3.5 went full thinking mode with a step-by-step breakdown, minicpm literally just answered "9.9" and called it a day lmao :)
drop a comment if there's a model you want me to test next, i'll get back to everyone later today!
tested 4 local models on iphone - benchmarks + the 9.9 vs 9.11 math trick
Reddit r/LocalLLaMA / 2026/3/25
💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Models & Research
Key points
- The post reports an on-device benchmark of four Q4-quantized local LLMs running fully on an iPhone 15 Pro Max without using the internet.
- In a simple sanity check comparing “9.9 vs 9.11,” all four models answered correctly, though their visible reasoning styles differed substantially: one produced a step-by-step breakdown, while another answered with just the number.
- The author measured performance using GPU tokens/sec and time-to-first-token, with LFM2.5 VL (1.6B) showing the highest tokens/sec and relatively fast first-token times.
- The benchmark results suggest meaningful variation in both throughput and latency across small local models, even when quantized to the same Q4 level.
- Readers are invited to request additional models to test next, indicating an ongoing community-driven evaluation effort.
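The two metrics in the summary, tokens/sec and time-to-first-token, can be measured with a simple timing loop around any streaming generation API. The sketch below is illustrative only: `fake_stream_tokens` is a hypothetical stand-in for a real on-device inference stream, not the app or library the poster used.

```python
import time

def fake_stream_tokens(prompt, n_tokens=32):
    """Hypothetical stand-in for a real streaming inference API."""
    for i in range(n_tokens):
        yield f"tok{i}"

def measure(stream):
    """Return (time_to_first_token_s, tokens_per_sec, n_tokens) for a token stream."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start  # latency until the first token arrives
        count += 1
    total = time.perf_counter() - start
    tps = count / total if total > 0 else float("inf")
    return ttft, tps, count

ttft, tps, n = measure(fake_stream_tokens("which number is larger, 9.9 or 9.11?"))
print(f"TTFT: {ttft * 1000:.2f} ms, throughput: {tps:.1f} tok/s over {n} tokens")
```

In a real harness you would average both numbers over several runs and a fixed prompt set, since first-token latency in particular is noisy on mobile hardware.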