MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference

arXiv cs.LG / 4/24/2026

Key Points

The paper proposes MCAP (Monte Carlo Activation Profiling), a load-time estimator that measures per-layer importance to address memory limits during LLM deployment on heterogeneous hardware.
MCAP uses a lightweight per-layer signal to make dynamic decisions for both numeric precision (e.g., W4A8 vs. W4A16) and where each layer resides (GPU, RAM, or SSD), without changing model weights.
The approach is implemented in a system called NVE and is designed to let the same model run under different memory budgets.
Reported results show NVE delivers 1.5–1.8× higher decode throughput than llama.cpp Q4_0 on an NVIDIA T4, and allows operation in memory regimes previously impractical without weight modifications.
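As a rough illustration of how one per-layer signal could drive both decisions described above, the sketch below ranks layers by an importance score, fills the GPU and then RAM budgets greedily, and spills the rest to SSD. The summary does not describe NVE's actual policy, so everything here is an assumption made for illustration: the function `plan_layers`, the `LayerPlan` type, the greedy ordering, the `precision_threshold` parameter, and the choice to give more important layers W4A16.

```python
from dataclasses import dataclass


@dataclass
class LayerPlan:
    index: int
    tier: str       # "GPU", "RAM", or "SSD"
    precision: str  # "W4A16" or "W4A8"


def plan_layers(importance, layer_bytes, gpu_budget, ram_budget,
                precision_threshold=0.5):
    """Hypothetical greedy policy, not the paper's algorithm.

    Layers are visited from most to least important. Each one goes to the
    fastest tier that still has room; SSD is treated as unbounded.
    """
    order = sorted(range(len(importance)),
                   key=lambda i: importance[i], reverse=True)
    remaining = {"GPU": gpu_budget, "RAM": ram_budget}
    plans = {}
    for i in order:
        tier = "SSD"
        for candidate in ("GPU", "RAM"):
            if layer_bytes[i] <= remaining[candidate]:
                remaining[candidate] -= layer_bytes[i]
                tier = candidate
                break
        # Assumption: higher-importance layers keep 16-bit activations.
        precision = "W4A16" if importance[i] >= precision_threshold else "W4A8"
        plans[i] = LayerPlan(i, tier, precision)
    # Return plans in original layer order.
    return [plans[i] for i in range(len(importance))]


if __name__ == "__main__":
    for plan in plan_layers([0.9, 0.2, 0.6, 0.1], [100] * 4,
                            gpu_budget=200, ram_budget=100):
        print(plan)
```

Changing only `gpu_budget` and `ram_budget` yields a different plan for the same scores, which mirrors the claim that one set of weights can run under different memory budgets.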

Abstract

Deploying large language models to heterogeneous hardware is often constrained by memory, not compute. We introduce MCAP (Monte Carlo Activation Profiling), a load-time per-layer importance estimator that enables dynamic precision and memory placement decisions on the target device. MCAP produces a lightweight per-layer signal that drives both precision dispatch (W4A8 vs. W4A16) and residency tier (GPU, RAM, SSD), allowing a single set of weights to operate across diverse memory budgets. Our system, NVE, achieves 1.5-1.8x higher decode throughput than llama.cpp Q4_0 on NVIDIA T4 and enables models to run in memory regimes previously infeasible without modifying weights.
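The abstract does not say how the per-layer signal is computed. One plausible reading of "Monte Carlo activation profiling" is to push a small number of random probe inputs through the model at load time and record an activation statistic for each layer. The toy sketch below does this with plain Python callables standing in for layers. The function `profile_layers`, the Gaussian probes, the mean-absolute-activation statistic, and the normalization by the largest score are all assumptions for illustration, not the paper's estimator.

```python
import random


def profile_layers(layers, input_dim, num_samples=64, seed=0):
    """Hypothetical Monte Carlo estimate of a per-layer activation signal.

    `layers` is a list of callables mapping a list of floats to a list of
    floats. Returns one score per layer, scaled so the largest score is 1.0.
    """
    rng = random.Random(seed)
    totals = [0.0] * len(layers)
    for _ in range(num_samples):
        # Draw one random probe input (assumed standard normal).
        x = [rng.gauss(0.0, 1.0) for _ in range(input_dim)]
        for i, layer in enumerate(layers):
            x = layer(x)
            # Accumulate the mean absolute activation of this layer's output.
            totals[i] += sum(abs(v) for v in x) / len(x)
    means = [t / num_samples for t in totals]
    peak = max(means) or 1.0  # avoid dividing by zero if all outputs are 0
    return [m / peak for m in means]


if __name__ == "__main__":
    toy_layers = [
        lambda x: [2.0 * v for v in x],   # doubles every activation
        lambda x: [0.5 * v for v in x],   # halves them again
    ]
    print(profile_layers(toy_layers, input_dim=8))
```

Because the pass only reads activations, a profiler of this shape leaves the weights untouched, which is consistent with the abstract's claim that the model runs without modified weights.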
