Compression Method Matters: Benchmark-Dependent Output Dynamics in LLM Prompt Compression
arXiv cs.CL · March 26, 2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The study argues that prompt compression should not be judged only by input-token reduction, because compression can change output length and total inference cost in benchmark-dependent ways.
- Using 5,400 API calls across three benchmarks and multiple providers under aggressive compression (r=0.3), it finds that DeepSeek shows extreme output expansion on MBPP (56x, low instruction survival probability) but much less on HumanEval (5x, higher survival probability), while GPT-4o-mini is comparatively stable.
- The authors introduce instruction survival probability (Ψ) as a structural metric to explain conflicting prior findings, showing that prompt structure and truncation effects matter more than provider identity alone.
- They propose the Compression Robustness Index (CRI) to enable safer cross-benchmark evaluation, warning that single-benchmark tests can lead to misleading conclusions about “compression safety” and efficiency.
- Companion NVML-based energy measurements suggest that input-token savings may overstate real energy (joule) savings, motivating benchmark-diverse and structure-aware compression policies for deployment.
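The interaction described above, where input savings can be swamped by benchmark-dependent output expansion, can be sketched numerically. A minimal Python illustration using the summary's DeepSeek figures (r = 0.3 compression, 56x expansion on MBPP); the paper's exact Ψ and CRI definitions are not given here, so `naive_cri` below is a hypothetical placeholder, not the authors' metric:

```python
def total_tokens(input_tokens: int, output_tokens: int) -> int:
    """Proxy for total inference cost: input plus output tokens."""
    return input_tokens + output_tokens

def output_expansion(compressed_out: int, baseline_out: int) -> float:
    """Output expansion factor, e.g. 56x on MBPP vs 5x on HumanEval."""
    return compressed_out / baseline_out

def naive_cri(expansions: list[float]) -> float:
    """Hypothetical robustness index (NOT the paper's CRI):
    worst-case output expansion across benchmarks; lower is safer."""
    return max(expansions)

# Worked example with assumed baseline sizes (1000 in / 200 out tokens).
baseline_in, baseline_out = 1000, 200
compressed_in = int(0.3 * baseline_in)            # r = 0.3 -> 300 input tokens
mbpp_out = 56 * baseline_out                      # 56x output expansion on MBPP

before = total_tokens(baseline_in, baseline_out)  # 1200 total tokens
after = total_tokens(compressed_in, mbpp_out)     # 11500 total tokens
# Aggressive compression *increased* total tokens roughly 9.6x here,
# which is exactly why input-only accounting can mislead.
```

The same arithmetic with HumanEval's 5x expansion (1300 total tokens) barely changes the total, illustrating why a single benchmark cannot certify a compression method as safe.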