I've just released APEX (Adaptive Precision for EXpert Models): a novel MoE quantization technique that outperforms Unsloth Dynamic 2.0 on accuracy while being 2x smaller for MoE architectures. Benchmarked on Qwen3.5-35B-A3B, but the method applies to any MoE model. Half the size of Q8, with perplexity comparable to F16. Works with stock llama.cpp, no patches needed. Open source (of course!), with <3 from the github.com/mudler/LocalAI team!

Perplexity by itself doesn't tell the full story; KL divergence against the F16 reference tells a story perplexity doesn't.

Tiers for every GPU:
- I-Quality: 21.3 GB -- best accuracy
- I-Balanced: 23.6 GB -- best all-rounder
- I-Compact: 16.1 GB -- fits 24GB GPUs
- Mini: 12.2 GB -- fits 16GB VRAM

With TurboQuant, at 8K context, every APEX tier gets ~14% faster prompt processing (currently being benchmarked on a DGX Spark).

Models: http://huggingface.co/mudler/Qwen3.5-35B-A3B-APEX-GGUF
Method + technical paper: http://github.com/mudler/apex-quant
Run locally: http://github.com/mudler/LocalAI
Original post on twitter/X: https://x.com/mudler_it/status/2039364812463853708
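The post's contrast between perplexity and KL divergence can be illustrated with a minimal sketch (not APEX code; all distributions below are hypothetical). Perplexity aggregates the model's own token log-likelihoods, while KL divergence compares the quantized model's full output distribution token-by-token against the F16 reference, so it can expose distribution drift that perplexity averages away:

```python
import math

def perplexity(logprobs):
    # Perplexity = exp(mean negative log-likelihood) over the evaluated tokens.
    return math.exp(-sum(logprobs) / len(logprobs))

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_i p_i * log(p_i / q_i), with P as the F16 reference
    # distribution and Q as the quantized model's distribution for one token.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token distributions over a tiny 3-token vocabulary:
p_f16 = [0.7, 0.2, 0.1]    # full-precision reference
p_quant = [0.5, 0.3, 0.2]  # quantized model

print(kl_divergence(p_f16, p_quant))  # small positive value (~0.085 nats)
```

A near-zero KL divergence means the quantized model reproduces the reference distribution almost exactly, even on tokens where perplexity barely moves.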
APEX MoE quantization delivers 33% faster inference, plus a ~14% prompt-processing speedup with TurboQuant
Reddit r/LocalLLaMA / 4/2/2026
📰 News · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage · Models & Research
Key Points
- APEX (Adaptive Precision for EXpert Models) is an open-source MoE quantization method claiming about 33% faster inference while improving or matching accuracy relative to Unsloth Dynamic 2.0.
- The approach is demonstrated on Qwen3.5-35B-A3B and is reported to generalize to other MoE models, offering perplexity comparable to F16 while being about 2× smaller than Unsloth Dynamic 2.0 (and half the size of Q8).
- APEX works with stock llama.cpp without patches, making it easier to adopt for local LLM deployments.
- The release introduces multiple APEX “tiers” (I-Quality, I-Balanced, I-Compact, Mini) with specific VRAM footprints and accuracy tradeoffs, from ~21.3GB down to ~12.2GB.
- With TurboQuant, the post reports roughly 14% faster prompt processing at 8K context, with benchmarking underway on a DGX Spark, and points to published code and models on GitHub/Hugging Face.