Quick follow-up on APEX, the MoE-aware mixed-precision quant strategy. The original post was just about Qwen 3.5 35B-A3B ( https://www.reddit.com/r/LocalLLaMA/comments/1s9vzry/apex_moe_quantized_models_boost_with_33_faster/ ); since then the collection has grown to 30+ MoEs across most major families. Plus a new ultra-compressed tier landed.
Feedback so far
The reports coming back have been honestly better than I expected!
- Long context holds up. People report APEX I-Balanced and I-Compact retaining coherence well past 32k tokens on the 30-50B-class MoEs, even at sizes where uniform Q4_K starts visibly degrading. The hypothesis: keeping shared experts and edge layers high-precision (where rare/long-range tokens get routed and embedded) preserves the long-context behavior that aggressive uniform quants tend to break. The numbers back this up: these quants show by far the best KLD99 values across the models tested (see the sketch after this list for what that metric measures).
- Coding quants punch above their size. Qwen 3.6 35B-A3B users in particular have been flagging that I-Compact and I-Mini stay much closer to F16 on real code tasks than their size class would suggest.
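For anyone unfamiliar with the metric: KLD99 is the 99th percentile of per-token KL divergence between the full-precision model's next-token distribution and the quantized model's, so it captures worst-case token damage rather than the average. A minimal sketch of the computation, assuming you have raw logits dumped from both models on the same text; the names here are illustrative, not part of any APEX tooling:

```python
import numpy as np

def kld99(logits_fp: np.ndarray, logits_q: np.ndarray) -> float:
    """99th-percentile KL(P_fp || P_q) over tokens.

    logits_fp, logits_q: [n_tokens, vocab_size] raw logits from the
    full-precision and quantized models on the same token sequence.
    """
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    logp = log_softmax(logits_fp)  # reference distribution (F16)
    logq = log_softmax(logits_q)   # quantized distribution
    kl = (np.exp(logp) * (logp - logq)).sum(axis=-1)  # per-token KLD
    return float(np.percentile(kl, 99))
```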
Thanks to everyone reporting back; that feedback is what justified pushing further into the low-bit tier below.
Models added since the first post
Grouped by family; most are 30-70B-class MoEs that fit on one consumer GPU at I-Mini/I-Compact:
Qwen lineage
- Qwen 3.5 122B-A10B, Qwen 3.5 397B-A17B, Qwen 3.5 Claude-Distilled, Qwen 3.5 Fernflower (uncensored), Qwen 3.5 TQ
- Qwen 3.6 35B-A3B, +heretic, +Claude 4.6 distill, +Claude 4.7 distill
- Qwen3-Coder 30B, Qwen3-Coder Next
Frontier-size MoEs (rented Blackwell to quantize)
- MiniMax-M2.5, MiniMax-M2.7 — 228B / 24B active, the biggest yet
- Mistral-Small 4 119B-2603
- NVIDIA Nemotron-3-Super 120B-A12B
- GLM-4.7 Flash, Step-3.5 Flash
- Huihui3.5 67B-A3B
Hybrid Mamba / SSM MoEs
- Nemotron-3-Nano 30B-A3B, Nemotron-3-Nano-Omni Reasoning — multimodal (vision + audio + text)
- Holo3 35B-A3B
- LFM2 24B-A2B
Gemma 4 family
- gemma-4 26B-A4B-it (just re-quantized today with Google's updated chat template), +Claude Opus distill, +heretic, Gemopus-4 Preview
Community MoE merges
- Carnice MoE 35B-A3B, Carnice-Qwen3.6, Qwopus MoE 35B-A3B
New tier: I-Nano (IQ2_XXS)
Pushes mid-layer routed experts down to IQ2_XXS (2.06 bpw), near-edge layers to IQ2_S, edge layers to Q3_K, and keeps shared experts at Q5_K. About 20% smaller than I-Mini, and viable only on MoEs thanks to sparse per-token expert activation. Requires an imatrix. A rough sketch of the layout follows the size examples.
Examples:
- Qwen 3.5 35B-A3B: I-Mini 13 GB → I-Nano 11 GB
- Nemotron Omni 30B: I-Mini 18 GB → I-Nano 17 GB (smaller savings: denser shared expert)
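To make the layout concrete, here's a minimal sketch of the per-tensor assignment rule as described above. The band widths (`n_layers // 8`), the role names, and the Q4_K fallback for non-expert tensors are illustrative assumptions, not the actual apex-quant logic (that lives in the repo linked below):

```python
def inano_quant_type(layer_idx: int, n_layers: int, role: str) -> str:
    """Pick a GGUF quant type for one tensor under the I-Nano tier."""
    if role == "shared_expert":
        return "Q5_K"                 # shared experts stay high precision
    if role == "routed_expert":
        edge = max(2, n_layers // 8)  # assumed width of the edge band
        near = 2 * edge               # assumed width of the near-edge band
        if layer_idx < edge or layer_idx >= n_layers - edge:
            return "Q3_K"             # edge layers
        if layer_idx < near or layer_idx >= n_layers - near:
            return "IQ2_S"            # near-edge layers
        return "IQ2_XXS"              # mid-layer routed experts, ~2.06 bpw
    return "Q4_K"                     # attention/norm/other (assumed default)
```

Since routed experts dominate a MoE's parameter count but only a few fire per token, squeezing just the mid-layer ones to ~2 bpw captures most of the size win, while the Q5_K shared experts and Q3_K edges guard the long-context behavior flagged in the feedback above.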
Links
- Collection: https://huggingface.co/collections/mudler/apex-quants-gguf
- Project + paper: https://github.com/mudler/apex-quant
If you've used APEX quants and have feedback, comments welcome!