A KL Lens on Quantization: Fast, Forward-Only Sensitivity for Mixed-Precision SSM-Transformer Models
arXiv cs.AI / 4/16/2026
Key Points
- The paper targets deploying hybrid SSM-Transformer language models on edge devices via mixed-precision quantization, while mitigating the accuracy loss caused by uneven quantization effects across model components.
- It introduces a lightweight, surrogate-based, backpropagation-free sensitivity analysis method that uses only forward-pass metrics to rank which components are most vulnerable to quantization degradation.
- The authors argue and formally analyze that Kullback–Leibler (KL) divergence is a better quantization-sensitivity metric for language modeling than common alternatives like MSE and SQNR.
- Extensive experiments and ablation studies show KL-based component rankings correlate with observed performance drops and outperform other metrics, enabling more reliable mixed-precision decisions.
- The method is validated via real-world on-device profiling on Intel Lunar Lake hardware, where KL-guided mixed-precision achieves near-FP16 perplexity with throughput and model-size tradeoffs competitive with Uniform INT4.
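The forward-only, KL-based ranking described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: all names (`fake_quantize`, the component labels, the toy logits) are hypothetical, and real use would run the full model once at FP16 and once per candidate component quantized, comparing the resulting next-token distributions.

```python
# Hedged sketch of forward-only quantization-sensitivity ranking via KL
# divergence. No gradients/backprop: only forward-pass output distributions
# are compared. Component names and logit values are illustrative.
import math

def softmax(logits):
    # Numerically stable softmax over a logit vector.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) between two next-token distributions; eps guards log(0).
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def fake_quantize(x, bits=4):
    # Illustrative symmetric uniform quantizer: snap values to an int grid.
    qmax = 2 ** (bits - 1) - 1
    scale = (max(abs(v) for v in x) / qmax) or 1.0
    return [round(v / scale) * scale for v in x]

# Reference logits from a full-precision forward pass (toy values).
fp_logits = [2.0, 0.5, -1.0, 0.1]

# Hypothetical per-component results: model logits when ONLY that component
# is quantized to INT4, obtained from one forward pass each.
quantized_logits = {
    "attn_block_3": [2.0, 0.48, -1.05, 0.1],  # barely disturbed output
    "ssm_block_7":  [1.6, 0.9, -0.4, 0.3],    # more disturbed -> more sensitive
}

p_ref = softmax(fp_logits)
scores = {name: kl_divergence(p_ref, softmax(q))
          for name, q in quantized_logits.items()}
# Rank components most-sensitive first; these would be kept at higher precision.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In a mixed-precision policy, components at the top of `ranking` (largest KL shift) would be assigned higher bit-widths, while the rest could safely drop to INT4; unlike MSE or SQNR on intermediate activations, the KL score directly measures distortion of the model's output distribution.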

