I've just released APEX (Adaptive Precision for EXpert Models): a novel MoE quantization technique that outperforms Unsloth Dynamic 2.0 on accuracy while being 2x smaller for MoE architectures. Benchmarked on Qwen3.5-35B-A3B, but the method applies to any MoE model. Half the size of Q8, with perplexity comparable to F16. Works with stock llama.cpp, no patches needed. Open source (of course!), with <3 from the github.com/mudler/LocalAI team!

Perplexity by itself doesn't tell the full story; KL divergence against the F16 reference tells a story perplexity doesn't.

Tiers for every GPU:
- I-Quality: 21.3 GB -- best accuracy
- I-Balanced: 23.6 GB -- best all-rounder
- I-Compact: 16.1 GB -- fits 24GB GPUs
- Mini: 12.2 GB -- fits 16GB VRAM

With TurboQuant, at 8K context, every APEX tier gets ~14% faster prompt processing (currently being benchmarked on a DGX Spark).

Models: http://huggingface.co/mudler/Qwen3.5-35B-A3B-APEX-GGUF
Method + technical paper: http://github.com/mudler/apex-quant
Run locally: http://github.com/mudler/LocalAI
Original post on twitter/X: https://x.com/mudler_it/status/2039364812463853708
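The post's contrast between perplexity and KL divergence can be illustrated with a minimal sketch (not APEX code; all distributions below are hypothetical). Perplexity aggregates the model's own token log-likelihoods, while KL divergence compares the quantized model's full output distribution token-by-token against the F16 reference, so it can expose distribution drift that perplexity averages away:

```python
import math

def perplexity(logprobs):
    # Perplexity = exp(mean negative log-likelihood) over the evaluated tokens.
    return math.exp(-sum(logprobs) / len(logprobs))

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_i p_i * log(p_i / q_i), with P as the F16 reference
    # distribution and Q as the quantized model's distribution for one token.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token distributions over a tiny 3-token vocabulary:
p_f16 = [0.7, 0.2, 0.1]    # full-precision reference
p_quant = [0.5, 0.3, 0.2]  # quantized model

print(kl_divergence(p_f16, p_quant))  # small positive value (~0.085 nats)
```

A near-zero KL divergence means the quantized model reproduces the reference distribution almost exactly, even on tokens where perplexity barely moves.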
APEX MoE quantization delivers 33% faster inference, plus a ~14% prompt-processing speedup with TurboQuant
Reddit r/LocalLLaMA / 4/2/2026
📰 News · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage · Models & Research
Key Points
- APEX (Adaptive Precision for EXpert Models) is an open-source MoE quantization method claiming about 33% faster inference while improving or matching accuracy relative to Unsloth Dynamic 2.0.
- The approach is demonstrated on Qwen3.5-35B-A3B and is reported to generalize to other MoE models, offering perplexity comparable to F16 while being about 2× smaller than Unsloth Dynamic 2.0 (and half the size of Q8).
- APEX works with stock llama.cpp without patches, making it easier to adopt for local LLM deployments.
- The release introduces multiple APEX “tiers” (I-Quality, I-Balanced, I-Compact, Mini) with specific VRAM footprints and accuracy tradeoffs, from ~21.3GB down to ~12.2GB.
- With TurboQuant, the post reports roughly 14% faster prompt processing at 8K context, with benchmarking underway on a DGX Spark, and points to published code and models on GitHub/Hugging Face.