I posted about my APEX quantization of Qwen3 Coder Next 80B yesterday and got a ton of great questions. Some people loved it, some people were skeptical, and one person asked "what exactly is the point of this when K quants already do mixed precision?"
It's a great question. I've been deep in this for the last few days running APEX on my own hardware and I want to break down what I've learned because I think most people are missing the bigger picture here.
So yes, K quants like Q4_K_M already apply different precision to different layers: attention gets higher precision, feed-forward gets lower. That's been in llama.cpp for a while and it works.
But here's the thing nobody is talking about.
MoE models have a coherence problem. I was reading this article last night and it clicked for me. When your coding agent is working across multiple files, different experts handle different tokens. The expert that processed your collision logic might not be the same expert that processes your entity initialization. The routing is efficient but the representation gets fragmented.
Think about that. Your agent is writing a function in one file that references a variable in another file and different experts handled each piece. What holds it all together?
The shared experts and attention layers. These fire on EVERY token no matter which routed experts get selected. They're the coherence layer. The glue. Without them your MoE model falls apart on complex multi-file coding sessions.
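Here's a toy sketch of how that plays out. This is not APEX's or llama.cpp's actual code, and every size here is made up; it just illustrates that the shared expert touches every token while routed experts come and go per token:

```python
import numpy as np

# Toy MoE layer: one always-on shared expert plus top-k routed experts.
# All sizes and names are illustrative, not the real 80B model.
rng = np.random.default_rng(0)

D = 8            # hidden size (toy)
N_EXPERTS = 16   # routed experts (the real model has far more)
TOP_K = 2        # routed experts activated per token

shared_expert = rng.standard_normal((D, D))              # fires on EVERY token
routed_experts = rng.standard_normal((N_EXPERTS, D, D))  # each fires rarely
router = rng.standard_normal((D, N_EXPERTS))

def moe_layer(x):
    """x: (D,) hidden state for one token. Returns (output, routed experts used)."""
    out = shared_expert @ x                  # coherence layer: always active
    scores = router.T @ x                    # router scores for THIS token
    top = np.argsort(scores)[-TOP_K:]        # pick the top-k routed experts
    weights = np.exp(scores[top]) / np.exp(scores[top]).sum()
    for w, e in zip(weights, top):
        out += w * (routed_experts[e] @ x)
    return out, set(top)

# Two tokens (say, from different files) can land on disjoint routed experts,
# but the shared expert (and attention, not shown) processed both of them.
_, experts_a = moe_layer(rng.standard_normal(D))
_, experts_b = moe_layer(rng.standard_normal(D))
print("token A routed experts:", experts_a, "| token B:", experts_b)
```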
This is where APEX changes the game.
APEX knows about MoE architecture. It keeps those shared experts and attention at Q8, near lossless. The routed experts that only fire 3% of the time? Those get compressed harder. You're preserving the exact layers that matter most for keeping your agent coherent across long sessions.
Standard K quants have no idea about MoE roles. They see a feed-forward layer and compress it the same whether it's a shared expert that fires on every token or a routed expert that fires on 3% of tokens.
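To make the contrast concrete, here's a rough sketch of what a role-aware policy looks like. This is NOT APEX's actual code; the tensor names follow llama.cpp-style GGUF conventions for MoE models ("shexp" for shared experts, "exps" for routed experts), and the specific quant-type choices are just plausible examples:

```python
def pick_quant(tensor_name: str) -> str:
    """Illustrative role-aware quant policy, not APEX's real one."""
    # Attention and shared-expert tensors fire on every token: near lossless.
    if "attn" in tensor_name or "shexp" in tensor_name:
        return "Q8_0"
    # Routed expert tensors fire on a small fraction of tokens: compress hard.
    if "ffn" in tensor_name and "exps" in tensor_name:
        return "Q3_K"
    # Everything else (embeddings, norms, output) lands in between.
    return "Q6_K"

for name in [
    "blk.0.attn_q.weight",         # attention
    "blk.0.ffn_gate_shexp.weight", # shared expert
    "blk.0.ffn_gate_exps.weight",  # routed experts
]:
    print(name, "->", pick_quant(name))
```

A flat K-quant recipe would treat the last two tensors identically; the whole point here is that their activation frequency is wildly different.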
Now here's where it gets even better.
I ran my APEX quantization with a code-calibrated imatrix. 50,575 code samples. Not Wikipedia, not general chat, CODE. That imatrix tells APEX which specific weights within those shared coherence layers fire most during code generation, tool calling, and error recovery.
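For context, here's roughly what an importance matrix captures, following the general llama.cpp idea of accumulating squared input activations over a calibration set. Toy random data stands in for the code corpus, with a few columns artificially made "hot":

```python
import numpy as np

# Toy imatrix: accumulate squared activations per input column over a
# calibration run, so quantization can spend its error budget on the
# columns that actually fire. Random data stands in for the code samples.
rng = np.random.default_rng(0)
D_IN = 16
N_SAMPLES = 1000

importance = np.zeros(D_IN)
for _ in range(N_SAMPLES):
    x = rng.standard_normal(D_IN)
    x[:4] *= 3.0          # pretend columns 0-3 fire strongly on code
    importance += x * x   # squared activation, summed per column
importance /= N_SAMPLES

# The hot columns dominate: those weights get preserved more faithfully.
print("most important columns:", sorted(np.argsort(importance)[-4:]))
```

Calibrate on Wikipedia and a different set of columns looks hot, which is exactly why the corpus choice matters.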
So it's three layers of optimization stacked:
- APEX preserves the shared/attention layers that maintain coherence across expert routing
- The code imatrix prioritizes the weights within those layers that actually fire during coding
- MoE routing means 97% of expert weights are idle per token so they compress aggressively with almost zero quality loss
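Some back-of-envelope arithmetic for why the stacking pays off. The parameter fractions and bits-per-weight below are assumptions I made up for illustration, not measurements of the actual 80B quant:

```python
# Assumed parameter split and per-tier bits/weight (illustrative only):
# most params sit in routed experts, so crushing them dominates file size
# while the always-on layers stay near lossless.
shared_frac, routed_frac, other_frac = 0.03, 0.90, 0.07
bpw = {"shared": 8.5, "routed": 3.4, "other": 6.6}  # ~Q8_0 / ~Q3_K / ~Q6_K

avg_bpw = (shared_frac * bpw["shared"]
           + routed_frac * bpw["routed"]
           + other_frac * bpw["other"])
size_gb = 80e9 * avg_bpw / 8 / 1e9  # 80B params at avg_bpw bits each

print(f"average {avg_bpw:.2f} bits/weight -> ~{size_gb:.0f} GB file")
```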
That's why Mudler's APEX I-Quality beats F16 on perplexity (6.527 vs 6.537). It's not just compressing less. It's compressing smarter. The coherence layers stay intact while everything else gets shrunk.
For anyone building coding agents on MoE models, this matters. A lot. Your agent staying coherent across a 10-file refactoring session is literally the difference between useful output and garbage.
APEX is still very new, maybe a week or two old, but I believe this is the way forward on quality and speed, especially for people with limited hardware like myself.
Again, I'm learning this just like anyone else, but I'm here to share what I'm learning as I learn it.
Credit to Mudler (Ettore Di Giacinto) for creating APEX and LocalAI.
Credit to the article that helped me connect the dots on the coherence problem: https://x.com/sudoingX/status/2040836083731333381
My APEX I-Quality quant with code-calibrated imatrix: https://huggingface.co/stacksnathan/Qwen3-Coder-Next-80B-APEX-I-Quality-GGUF
Mudler's APEX repo with tons of choices: https://huggingface.co/collections/mudler/apex-quants-gguf