MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference

arXiv cs.LG / 4/24/2026


Key Points

  • The paper proposes MCAP (Monte Carlo Activation Profiling), a load-time estimator that measures per-layer importance to address memory limits during LLM deployment on heterogeneous hardware.
  • MCAP uses a lightweight per-layer signal to make dynamic decisions for both numeric precision (e.g., W4A8 vs. W4A16) and where each layer resides (GPU, RAM, or SSD), without changing model weights.
  • The approach is implemented in a system called NVE and is designed to let the same model run under different memory budgets.
  • Reported results show NVE delivering 1.5–1.8× higher decode throughput than llama.cpp Q4_0 on an NVIDIA T4, and running models in memory regimes that were previously infeasible without modifying the weights.

Abstract

Deploying large language models to heterogeneous hardware is often constrained by memory, not compute. We introduce MCAP (Monte Carlo Activation Profiling), a load-time per-layer importance estimator that enables dynamic precision and memory placement decisions on the target device. MCAP produces a lightweight per-layer signal that drives both precision dispatch (W4A8 vs. W4A16) and residency tier (GPU, RAM, SSD), allowing a single set of weights to operate across diverse memory budgets. Our system, NVE, achieves 1.5–1.8× higher decode throughput than llama.cpp Q4_0 on an NVIDIA T4 and enables models to run in memory regimes previously infeasible without modifying weights.
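
To make the idea of one per-layer signal driving both decisions concrete, here is a minimal Python sketch. It is not the NVE implementation, and the abstract does not specify the actual policy. The function plan_layers, the LayerPlan type, the greedy ordering, the precise_fraction parameter, and the choice to give more important layers W4A16 are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class LayerPlan:
    layer: int
    tier: str       # "GPU", "RAM", or "SSD"
    precision: str  # "W4A16" or "W4A8"


def plan_layers(importance, layer_bytes, gpu_budget, ram_budget,
                precise_fraction=0.5):
    """Hypothetical greedy planner driven by a per-layer importance score.

    Assumed policy (not taken from the paper): layers are visited from most
    to least important; each goes to the fastest tier that still has room,
    and the top `precise_fraction` of layers keep 16-bit activations.
    """
    order = sorted(range(len(importance)),
                   key=lambda i: importance[i], reverse=True)
    precise = set(order[:int(len(order) * precise_fraction)])
    remaining = {"GPU": gpu_budget, "RAM": ram_budget}

    plans = {}
    for i in order:
        tier = "SSD"  # fallback tier, treated here as unbounded
        for candidate in ("GPU", "RAM"):
            if layer_bytes[i] <= remaining[candidate]:
                remaining[candidate] -= layer_bytes[i]
                tier = candidate
                break
        precision = "W4A16" if i in precise else "W4A8"
        plans[i] = LayerPlan(i, tier, precision)

    # Return plans in layer order so they can be applied at load time.
    return [plans[i] for i in range(len(importance))]


if __name__ == "__main__":
    # Toy scores and sizes, invented for the example.
    for plan in plan_layers([0.9, 0.1, 0.5, 0.3], [100] * 4,
                            gpu_budget=200, ram_budget=100):
        print(plan)
```

Changing gpu_budget or ram_budget changes where layers land without touching the weights themselves, which is the property the abstract emphasizes.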
