RAMP: Reinforcement Adaptive Mixed-Precision Quantization for Efficient On-Device LLM Inference
arXiv cs.LG / 3/19/2026
Key Points
- RAMP is a reinforcement-learning-based method that assigns a bit width to each layer (mixed-precision quantization), minimizing perplexity under a global bit budget for efficient on-device LLM inference.
- The policy conditions on an 11-dimensional embedding of activation statistics, weight properties, and structural descriptors to enable zero-shot transfer across model families and scales.
- Scale Folding is a preconditioning technique that migrates activation outliers into weights via per-channel scaling and normalization layer compensation to enable stable sub-4-bit quantization.
- On Llama 2 7B, RAMP achieves 5.54 perplexity at 3.68 GB (3.65 effective bits), outperforming uniform 4-bit AWQ and GPTQ; the learned policy also transfers zero-shot to Llama 2 13B and Mistral 7B.
- The HALO pipeline exports the bit allocations to GGUF for kernel-free inference on CPUs, GPUs, and edge devices while retaining 99.5% of FP16 performance.
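The summary does not include code, but the Scale Folding idea described above can be sketched. The snippet below is a minimal, illustrative implementation assuming a SmoothQuant-style per-channel scale: outlier-heavy activation channels are divided by a scale `s` that is folded into the weight rows, leaving the matmul mathematically unchanged. The function name, the `alpha` balance factor, and all numerics are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def scale_fold(W, x_absmax, alpha=0.5):
    """Sketch of Scale-Folding-style preconditioning (assumed
    SmoothQuant-like): pick a per-input-channel scale s so that
    (x / s) @ (diag(s) @ W) == x @ W exactly, while x / s has a
    flatter range and is easier to quantize at low bit widths.
    alpha trades off activation range vs. weight range."""
    w_absmax = np.abs(W).max(axis=1)              # per-channel weight range
    s = x_absmax**alpha / np.maximum(w_absmax**(1 - alpha), 1e-8)
    s = np.maximum(s, 1e-8)                       # keep scales positive
    W_folded = W * s[:, None]                     # fold scales into weights
    return W_folded, s                            # at runtime, x / s is folded
                                                  # into the preceding norm layer

# Equivalence check with one synthetic outlier channel.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
x[:, 0] *= 50.0                                   # activation outlier channel
W = rng.normal(size=(8, 16))
W_f, s = scale_fold(W, np.abs(x).max(axis=0))
print(np.allclose((x / s) @ W_f, x @ W))          # → True
```

The compensation via normalization layers mentioned in the key points corresponds to the last line of the function's contract: instead of dividing activations by `s` at runtime, the division can be absorbed into the gain of the preceding normalization layer, so no extra kernel is needed.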
