APEX MoE quants update: 25+ new models since the Qwen 3.5 post + new I-Nano tier

Reddit r/LocalLLaMA / 5/5/2026


Key Points

  • APEX (MoE-aware mixed-precision quantization) has expanded to 30+ MoE models since the original Qwen 3.5 35B-A3B post, with 25+ additional models reported.
  • Main feedback: I-Balanced / I-Compact maintain coherence past 32k tokens even on 30-50B-class MoEs, with possibly less degradation than uniform Q4_K.
  • For coding, Qwen 3.6 35B-A3B users report that I-Compact / I-Mini behave close to F16 on real code tasks for their size.
  • A new ultra-compressed tier, I-Nano (IQ2_XXS), pushes bit width down to about 2.06 bpw; it relies on MoE's sparse per-token expert activation and requires an imatrix.
  • The added models span the Qwen lineage, frontier-scale MoEs (quantized on rented Blackwell), the Gemma 4 family, and community MoE merges.

Quick follow-up on APEX, the MoE-aware mixed-precision quant strategy. The original post was just about Qwen 3.5 35B-A3B ( https://www.reddit.com/r/LocalLLaMA/comments/1s9vzry/apex_moe_quantized_models_boost_with_33_faster/ ); since then the collection has grown to 30+ MoEs across most major families. Plus a new ultra-compressed tier landed.

Feedback so far

The reports coming back have been honestly better than I expected!

  • Long context holds up. People report APEX I-Balanced and I-Compact retaining coherence well past 32k tokens on the 30-50B-class MoEs, even at sizes where uniform Q4_K starts visibly degrading. The hypothesis: keeping shared experts and edge layers high-precision (where rare/long-range tokens get routed and embedded) preserves the long-context behavior that aggressive uniform quants tend to break. The numbers back this up: these quants show by far the best KL99% values among the quants compared (a rough sketch of how such a metric is computed follows this list).
  • Coding quants punch above their size. Qwen 3.6 35B-A3B users in particular have been flagging that I-Compact and I-Mini stay surprisingly closer to F16 on real code tasks than their size class would suggest.
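For reference, here is a minimal sketch of how a percentile-based KL metric like KL99 can be computed, assuming you have already dumped per-token logits for the same evaluation text from both the F16 reference and the APEX quant. The file names and dump format are placeholders, not the exact tooling behind the reported numbers:

```python
import numpy as np

def log_softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    m = x.max(axis=-1, keepdims=True)
    return x - m - np.log(np.exp(x - m).sum(axis=-1, keepdims=True))

def kl_percentile(ref_logits: np.ndarray, quant_logits: np.ndarray, pct: float = 99.0) -> float:
    """Per-token KL(ref || quant) over [n_tokens, vocab] logit dumps,
    reduced to the given percentile (pct=99 gives a KL99-style number)."""
    ref_logp = log_softmax(ref_logits)
    quant_logp = log_softmax(quant_logits)
    # KL per token: sum_v p_ref(v) * (log p_ref(v) - log p_quant(v))
    kl_per_token = np.sum(np.exp(ref_logp) * (ref_logp - quant_logp), axis=-1)
    return float(np.percentile(kl_per_token, pct))

# Usage with hypothetical [n_tokens, vocab] dumps from the same eval text:
# ref = np.load("f16_logits.npy"); q = np.load("apex_ibalanced_logits.npy")
# print(f"KL99: {kl_percentile(ref, q):.4f} nats")
```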

Thanks to everyone reporting back; that's what justifies pushing further on the low-bit tiers below.

Models added since the first post

Grouped by family; most are 30-70B-class MoEs that fit on one consumer GPU at I-Mini/I-Compact:

Qwen lineage

  • Qwen 3.5 122B-A10B, Qwen 3.5 397B-A17B, Qwen3.5 Claude-Distilled, Qwen3.5 Fernflower (uncensored), Qwen3.5 TQ
  • Qwen 3.6 35B-A3B, +heretic, +Claude 4.6 distill, +Claude 4.7 distill
  • Qwen3-Coder 30B, Qwen3-Coder Next

Frontier-size MoEs (quantized on a rented Blackwell)

  • MiniMax-M2.5, MiniMax-M2.7 — 228B / 24B active, the biggest yet
  • Mistral-Small 4 119B-2603
  • NVIDIA Nemotron-3-Super 120B-A12B
  • GLM-4.7 Flash, Step-3.5 Flash
  • Huihui3.5 67B-A3B

Hybrid Mamba / SSM MoEs

  • Nemotron-3-Nano 30B-A3B, Nemotron-3-Nano-Omni Reasoning — multimodal (vision + audio + text)
  • Holo3 35B-A3B
  • LFM2 24B-A2B

Gemma 4 family

  • gemma-4 26B-A4B-it (just re-quantized today with Google's updated chat template), +Claude Opus distill, +heretic, Gemopus-4 Preview

Community MoE merges

  • Carnice MoE 35B-A3B, Carnice-Qwen3.6, Qwopus MoE 35B-A3B

New tier: I-Nano (IQ2_XXS)

Pushes mid-layer routed experts down to 2.06 bpw, near-edge layers to IQ2_S, edge layers to Q3_K, and shared experts to Q5_K (a rough sketch of this layout follows the examples below). About 20% smaller than I-Mini, and viable only on MoE thanks to sparse per-token expert activation. Requires imatrix.

Examples:

  • Qwen 3.5 35B-A3B: I-Mini 13 GB → I-Nano 11 GB
  • Nemotron Omni 30B: I-Mini 18 GB → I-Nano 17 GB (smaller savings, since its shared expert is denser)
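To make the tier layout concrete, here is a minimal sketch of how an I-Nano-style per-tensor assignment could look. It assumes llama.cpp-style GGUF tensor names (`ffn_*_exps` for routed experts, `ffn_*_shexp` for shared experts); the layer cutoffs and the Q4_K default for non-expert tensors are illustrative placeholders, not the actual APEX recipe:

```python
from collections import Counter

def i_nano_tier(tensor_name: str, layer: int, n_layers: int) -> str:
    """Map a tensor to a quant type under an I-Nano-style layout:
    shared experts and model edges high precision, mid-layer routed experts lowest."""
    edge = max(2, n_layers // 8)       # illustrative cutoff: first/last ~12% of layers
    near_edge = max(4, n_layers // 4)  # illustrative cutoff: next band toward the middle

    if "shexp" in tensor_name:                      # shared-expert weights
        return "Q5_K"
    if "_exps" in tensor_name:                      # routed-expert weights
        if layer < edge or layer >= n_layers - edge:
            return "Q3_K"                           # edge layers
        if layer < near_edge or layer >= n_layers - near_edge:
            return "IQ2_S"                          # near-edge layers
        return "IQ2_XXS"                            # mid-layer routed experts, ~2.06 bpw
    return "Q4_K"  # attention / dense tensors: placeholder default, not specified in the post

# Distribution across a hypothetical 48-layer MoE's routed-expert down-projections:
plan = Counter(i_nano_tier(f"blk.{i}.ffn_down_exps.weight", i, 48) for i in range(48))
print(plan)  # Counter({'IQ2_XXS': 24, 'Q3_K': 12, 'IQ2_S': 12})
```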

Links

If you've used APEX quants and have feedback, comments welcome!

submitted by /u/mudler_it