MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference
arXiv cs.LG / 4/24/2026
📰 NewsDeveloper Stack & InfrastructureSignals & Early TrendsModels & Research
Key Points
- The paper proposes MCAP (Monte Carlo Activation Profiling), a load-time estimator that measures per-layer importance to address memory limits during LLM deployment on heterogeneous hardware.
- MCAP uses a lightweight per-layer signal to make dynamic decisions for both numeric precision (e.g., W4A8 vs. W4A16) and where each layer resides (GPU, RAM, or SSD), without changing model weights.
- The approach is implemented in a system called NVE and is designed to let the same model run under different memory budgets.
- Reported results show NVE delivers 1.5–1.8× higher decode throughput than llama.cpp Q4_0 on an NVIDIA T4, and allows operation in memory regimes previously impractical without weight modifications.


