Compressed-Sensing-Guided, Inference-Aware Structured Reduction for Large Language Models

arXiv cs.CL / 4/17/2026

📰 NewsDeveloper Stack & InfrastructureModels & Research

Key Points

  • The paper proposes a unified framework that combines prompt compression and model pruning/structured sparsity to enable dynamic, inference-aware execution in large language models.
  • It uses compressed sensing with random measurement operators to infer task- and token-conditioned “supports” (which latent substructures should run) during decoding, rather than applying a static compressed model offline.
  • The recovered supports are compiled into hardware-efficient sparse execution paths across model components such as blocks, attention heads, channels, and feed-forward substructures.
  • The work includes formal sample-complexity guarantees under restricted isometry or mutual incoherence assumptions, and constrains recovery to GPU-efficient structures for deployable speedups.
  • The approach is framed as a measurement-and-recovery problem with explicit approximation guarantees, aiming to reduce parameter/memory costs and decoding latency simultaneously.

Abstract

Large language models deliver strong generative performance but at the cost of massive parameter counts, memory use, and decoding latency. Prior work has shown that pruning and structured sparsity can preserve accuracy under substantial compression, while prompt-compression methods reduce latency by removing redundant input tokens. However, these two directions remain largely separate. Most model-compression methods are static and optimized offline, and they do not exploit the fact that different prompts and decoding steps activate different latent computational pathways. Prompt-compression methods reduce sequence length, but they do not adapt the executed model subnetwork. We propose a unified compressed-sensing-guided framework for dynamic LLM execution. Random measurement operators probe latent model usage, sparse recovery estimates task-conditioned and token-adaptive support sets, and the recovered supports are compiled into hardware-efficient sparse execution paths over blocks, attention heads, channels, and feed-forward substructures. The framework introduces five key contributions: task-conditioned measurements, so different prompts induce different sparse supports; token-adaptive recovery, so active substructures are re-estimated during decoding; formal sample-complexity bounds under restricted isometry or mutual incoherence assumptions; compile-to-hardware constraints that restrict recovery to GPU-efficient structures; and a joint objective that unifies prompt compression with model reduction. Together, these components recast LLM inference as a measurement-and-recovery problem with explicit approximation guarantees and deployment-oriented speedup constraints.