EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation

arXiv cs.AI / 5/7/2026


Key Points

  • The paper introduces EdgeRazor, a lightweight framework designed to deploy large language models on resource-constrained devices using mixed-precision and extremely low-bit quantization-aware distillation.
  • EdgeRazor combines three modules: mixed-precision quantization-aware distillation, adaptive feature distillation that maps a 16-bit teacher to an n-bit student, and entropy-aware KL divergence that balances forward/reverse objectives based only on the teacher’s output distribution (a sketch follows after this list).
  • Compared with existing approaches, the method aims to overcome the accuracy drop of PTQ below 4-bit while avoiding the heavy compute demands of QAT by automating feature selection and reducing teacher-data dependence.
  • Experiments across base, instruction-tuned, and multimodal LLMs show that EdgeRazor at 1.88-bit outperforms all 3-bit contenders and beats the leading 2-bit PTQ methods by 11.3 points, at a 4–10× lower training budget than the leading QAT approach.
  • The authors report higher compression and faster inference, e.g., 1.58-bit Qwen3-0.6B cuts storage from 1.41 GB to 0.28 GB and speeds decoding by 15.1× versus a 16-bit baseline.
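
The entropy-aware KL term is the most self-contained of the three modules. The summary only says the forward/reverse balance comes from the teacher's output distribution, so the sketch below is one plausible reading, not the paper's formula: the function name `entropy_aware_kl`, the normalized-entropy gate, and the direction of that gate are all assumptions.

```python
import math

import torch
import torch.nn.functional as F

def entropy_aware_kl(student_logits: torch.Tensor,
                     teacher_logits: torch.Tensor) -> torch.Tensor:
    """Blend forward and reverse KL using only the teacher's distribution.

    Hypothetical reading of the entropy-aware KL idea: the gate is the
    teacher's normalized entropy, so the balance never looks at the student.
    """
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_p, s_p = t_logp.exp(), s_logp.exp()

    # Normalized teacher entropy in [0, 1]; depends only on the teacher.
    h = -(t_p * t_logp).sum(dim=-1) / math.log(teacher_logits.size(-1))

    # Per-token forward KL(teacher || student) and reverse KL(student || teacher).
    fwd = (t_p * (t_logp - s_logp)).sum(dim=-1)
    rev = (s_p * (s_logp - t_logp)).sum(dim=-1)

    # Assumed gate: an uncertain teacher favors mode-covering forward KL,
    # a confident teacher favors mode-seeking reverse KL.
    return (h * fwd + (1.0 - h) * rev).mean()
```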

Abstract

Recent years have witnessed increasing interest in deploying LLMs on resource-constrained devices, and quantization has emerged as a promising lightweight technique that converts full-precision model weights and activations into lower-bit formats. Existing weight quantization approaches can be roughly divided into three categories: Post-Training Quantization (PTQ), which calibrates quantized parameters on a small dataset without retraining but suffers severe performance degradation below 4-bit; Quantization-Aware Training (QAT), which searches low-bit parameters using surrogate gradients but demands substantial computational resources; and Quantization-Aware Distillation, which integrates QAT with knowledge transfer from a full-precision teacher but manually selects the features to distill and relies heavily on teacher-specific data. In this paper, we propose EdgeRazor, a lightweight framework for LLMs with mixed-precision and extremely low-bit weight quantization. The EdgeRazor framework contains three modules: Mixed-Precision Quantization-Aware Distillation for fine-grained control of precision, Adaptive Feature Distillation that derives an n-bit student from its 16-bit teacher, and Entropy-Aware KL Divergence applied to both human-annotated and distilled datasets, whose forward-reverse balance is determined solely by the teacher's output distribution. Empirical investigations of EdgeRazor are conducted on base, instruction-tuned, and multimodal LLMs. Notably, EdgeRazor at 1.88-bit surpasses all contenders at 3-bit precision, and outperforms the leading 2-bit PTQ methods by 11.3 points, with a 4–10× lower training budget than the leading QAT approach. EdgeRazor delivers higher compression ratios at all bit widths; the 1.58-bit Qwen3-0.6B reduces storage from 1.41 GB to 0.28 GB while accelerating decoding by 15.1× relative to the 16-bit baseline.
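
For context on the QAT setting the abstract contrasts with PTQ, the sketch below shows the generic "surrogate gradient" trick for extremely low-bit weights: a straight-through estimator over a ternary (~1.58-bit) quantizer. The `TernaryLinear` name and the absmean scaling rule are illustrative assumptions drawn from common ternary schemes, not EdgeRazor's actual quantizer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TernaryLinear(nn.Linear):
    """Linear layer with ternary weights {-1, 0, +1} (~1.58 bits/weight)
    and a straight-through estimator (STE) for training. A generic QAT
    sketch, not EdgeRazor's quantizer.
    """

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        scale = w.abs().mean().clamp(min=1e-8)          # per-tensor absmean scale (assumed)
        w_q = (w / scale).round().clamp(-1, 1) * scale  # ternary weights
        # STE: forward computes with w_q; backward treats quantization as identity.
        w_ste = w + (w_q - w).detach()
        return F.linear(x, w_ste, self.bias)
```

Swapping a model's `nn.Linear` layers for such a module and then distilling against the 16-bit teacher is the generic quantization-aware distillation recipe that, per the abstract, EdgeRazor refines with mixed precision, adaptive feature selection, and the entropy-aware KL objective.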