Help my llm isn't llming

Reddit r/LocalLLaMA / 4/12/2026

💬 Opinion · Tools & Practical Usage · Models & Research

Key Points

  • A Reddit user running llama.cpp on a MacBook Air M2 reports that Q4 and Q6 quantized variants of a Qwen3.5 9B model consume roughly the same RAM and generate at similar speeds, which they find unexpected.
  • They provide model details (UD-Q4_K_XL vs Q6_K), default llama.cpp sampling parameters, and mention they attempted to control for memory effects by purging between runs, limiting windows, and disabling swapping.
  • Using Activity Monitor and llama.cpp memory breakdown output after ~2.5 minutes of generation, they show different internal memory splits while overall “memory used” appears similar.
  • The post frames their issue as beginner confusion and asks the community for an explanation or help interpreting how quantization level relates to runtime memory and throughput on Apple Silicon via llama.cpp.
  • The key takeaway is a troubleshooting discussion about performance/memory behavior of quantized LLMs in a local inference setup rather than a new release or benchmark announcement.
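As background for the question below, a common rule of thumb (an assumption used here for illustration, not something benchmarked in the post) is that single-stream LLM decoding is memory-bandwidth-bound: each generated token streams the full set of weights once, so tokens/s scales roughly inversely with model size. A minimal sketch, assuming a ballpark bandwidth figure for an M2 MacBook Air:

```python
# Rough sanity check, NOT a measurement: if decoding were purely
# memory-bandwidth-bound, tokens/s would scale inversely with the
# bytes read per token (roughly the model file size).
ASSUMED_BANDWIDTH_GBPS = 100.0  # hypothetical unified-memory bandwidth

def est_tps(model_size_gb: float) -> float:
    """Upper-bound tokens/s if every token streams the full weights once."""
    return ASSUMED_BANDWIDTH_GBPS / model_size_gb

# File sizes from the post: Q4_K_XL = 5.97 GB, Q6_K = 7.46 GB.
q4, q6 = est_tps(5.97), est_tps(7.46)
print(f"Q4 ~ {q4:.1f} t/s, Q6 ~ {q6:.1f} t/s, ratio ~ {q4 / q6:.2f}")
```

Under that model the Q6_K file should decode roughly 25% slower than Q4_K_XL, which is why nearly identical speeds read as surprising.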

Long story short, for some reason the Q4 and Q6 quants seem to take the same amount of RAM on my MacBook Air M2 (16GB), and also generate at the same speed? I'm a beginner with little knowledge about this, and I hope some kind souls here can save me.

Here are some stats.

models: unsloth Qwen3.5 9B UD-Q4_K_XL (5.97GB) and unsloth Qwen3.5 9B Q6_K (7.46GB)

temp 0.8
top-k 40
top-p 0.95
These, along with the other sampling settings, are all llama.cpp defaults.
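For context, a run with these settings might look like the following llama.cpp invocation (the model filename and prompt are placeholders, not from the post; the sampling flags are spelled out even though they match the defaults):

```shell
# Hypothetical llama-cli invocation matching the settings above.
# Model path and prompt are placeholders.
./llama-cli \
  -m ./Qwen3.5-9B-UD-Q4_K_XL.gguf \
  --temp 0.8 \
  --top-k 40 \
  --top-p 0.95 \
  -p "Hello"
```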

I ran sudo purge every time before switching to the next model, closed all windows except Terminal and Activity Monitor, and made sure there was no swapping.

The memory each model uses is shown in the pictures. The right one is the Activity Monitor window, where I circled "memory used."

For some additional data, here is the llama_memory_breakdown_print output for Q4 and Q6, each after running for about 2.5 minutes and generating roughly 1425 and 1380 tokens respectively (time × t/s, a rough estimate). I changed the format a bit to make it more readable.

Q4:

| memory breakdown [MiB] | total | free | self | model | context | compute | unaccounted |
|---|---|---|---|---|---|---|---|
| MTL0 (Apple M2) | 12124 | 690 | 11433 | 5679 | 5178 | 575 | 0 |
| Host | | | 882 | 545 | 0 | 336 | |

Q6:

| memory breakdown [MiB] | total | free | self | model | context | compute | unaccounted |
|---|---|---|---|---|---|---|---|
| MTL0 (Apple M2) | 12124 | 477 | 11645 | 7102 | 4050 | 493 | 0 |
| Host | | | 1061 | 795 | 0 | 266 | |

(Here total = free + self + unaccounted, and self = model + context + compute, per the original output's grouping.)
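The two breakdowns can also be compared numerically. This is a minimal sketch that just re-derives the sums from the components as reported above (the component sums differ from the printed aggregates by 1-2 MiB, presumably rounding):

```python
# Compare the two llama.cpp memory-breakdown reports.
# All component numbers (MiB) are copied from the post.
runs = {
    "Q4": {"free": 690, "model": 5679, "context": 5178, "compute": 575},
    "Q6": {"free": 477, "model": 7102, "context": 4050, "compute": 493},
}

for name, r in runs.items():
    self_mib = r["model"] + r["context"] + r["compute"]
    total = r["free"] + self_mib  # "total" is the device working-set size
    print(f"{name}: self={self_mib} MiB, free={r['free']} MiB, total~{total} MiB")

# Q6 holds more model weights, but its context allocation came out smaller,
# so the overall footprints end up close to each other.
delta_model = runs["Q6"]["model"] - runs["Q4"]["model"]
delta_context = runs["Q4"]["context"] - runs["Q6"]["context"]
print(f"model grows by {delta_model} MiB, context shrinks by {delta_context} MiB")
```

Note that the reported Metal "total" (12124 MiB) is identical in both runs, which is one reason the headline memory numbers look so similar.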

submitted by /u/Nicking0413