RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On Device LLM Inference
arXiv cs.LG / 3/19/2026
Key Points
- RAMP is a reinforcement learning-based method that performs per-layer mixed-precision quantization to minimize perplexity under a global bit budget for efficient on-device LLM inference.
- The policy conditions on an 11-dimensional embedding of activation statistics, weight properties, and structural descriptors to enable zero-shot transfer across model families and scales.
- Scale Folding is a preconditioning technique that migrates activation outliers into weights via per-channel scaling and normalization layer compensation to enable stable sub-4-bit quantization.
- On Llama 2 7B, RAMP achieves 5.54 perplexity at a 3.68 GB footprint (3.65 effective bits), outperforming uniform 4-bit AWQ and GPTQ, and the learned policy transfers zero-shot to Llama 2 13B and Mistral 7B. The HALO pipeline exports the resulting allocations to GGUF for kernel-free inference on CPUs, GPUs, and edge devices while retaining 99.5% of FP16 performance.
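The global bit-budget constraint in the first key point can be illustrated with a simple greedy stand-in for the learned policy: start every layer at the lowest available precision, then spend the remaining budget upgrading the layers scored most sensitive. This is only a sketch under assumed inputs (per-layer sensitivity scores and parameter counts); RAMP itself learns the allocation with reinforcement learning rather than this heuristic, and all names below are hypothetical.

```python
import heapq

def allocate_bits(sensitivity, num_params, choices=(2, 3, 4, 8), budget=3.65):
    """Greedy per-layer bit allocation under an average-bit budget.

    sensitivity: hypothetical per-layer importance scores (higher = keep precise)
    num_params:  parameter count per layer, used to weight the average
    budget:      target effective bits per weight across the whole model
    """
    n = len(sensitivity)
    bits = [min(choices)] * n          # everyone starts at the cheapest precision
    total = sum(num_params)

    # Max-heap on sensitivity: upgrade the most sensitive layer first.
    heap = [(-s, i) for i, s in enumerate(sensitivity)]
    heapq.heapify(heap)
    while heap:
        neg_s, i = heapq.heappop(heap)
        higher = [c for c in choices if c > bits[i]]
        if not higher:
            continue                   # already at the top precision
        nxt = min(higher)
        weighted = sum(b * p for b, p in zip(bits, num_params))
        new_avg = (weighted - bits[i] * num_params[i] + nxt * num_params[i]) / total
        if new_avg <= budget:
            bits[i] = nxt
            heapq.heappush(heap, (neg_s, i))  # may be upgraded again later
    return bits
```

A layer that would push the weighted average over the budget is simply skipped, so the result always respects the global constraint by construction.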
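The Scale Folding idea in the third key point, migrating activation outliers into weights via per-channel scales while the preceding normalization layer compensates, can be sketched in a few lines of NumPy. This is a minimal illustration of the invariance trick, not the paper's exact formulation: the function name, the `alpha` smoothing exponent, and the use of a norm gain `gamma` are assumptions here.

```python
import numpy as np

def scale_fold(gamma, W, act_absmax, alpha=0.5):
    """Fold per-channel activation scales into the weights.

    gamma:      gain of the normalization layer feeding this linear layer
    W:          weight matrix of shape (out_features, in_features)
    act_absmax: per-channel activation magnitudes (outlier statistics)

    Channels with large activations get s > 1: the norm gain absorbs 1/s
    (shrinking the activation range for quantization) and the matching
    weight columns absorb s, so the layer's output is unchanged.
    """
    s = act_absmax ** alpha
    s = np.clip(s / s.mean(), 1e-5, None)  # normalize scales around 1
    gamma_folded = gamma / s               # norm layer compensates
    W_folded = W * s[None, :]              # weights absorb the scales
    return gamma_folded, W_folded
```

Because the scaling cancels exactly (`W s · x / s = W x`), the transform is lossless in FP16; its purpose is purely to tame activation outliers so that sub-4-bit weight quantization afterward stays stable.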
Related Articles

Reduce the burden on veteran engineers of training junior staff: generating "ladder diagrams" for PLC control with AI
日経XTECH

Your AI generated code is "almost right", and that is actually WORSE than it being "wrong".
Dev.to

Lessons from Academic Plagiarism Tools for SaaS Product Development
Dev.to

Windsurf’s New Pricing Explained: Simpler AI Coding or Hidden Trade-Offs?
Dev.to

Building Production RAG Systems with PostgreSQL: Complete Implementation Guide
Dev.to