The Compression Paradox in LLM Inference: Provider-Dependent Energy Effects of Prompt Compression

arXiv cs.CL · March 26, 2026


Key Points

  • The study tests prompt compression on 28,421 API trials across three LLM providers (OpenAI GPT-4o-mini, Anthropic Claude-3.5-Sonnet, DeepSeek-Chat) using multiple benchmarks and compression ratios.
  • It finds that compression can cause severe quality degradation, with benchmark pass rates dropping from 26.0% at the baseline to 1.5% at r=0.7.
  • Energy effects are highly provider-dependent: DeepSeek shows major output expansion under heavy compression (up to 21→798 tokens at r=0.3), driving energy increases as high as +2,140%.
  • In contrast, GPT-4o-mini exhibits mixed energy outcomes (including energy reductions at some ratios), indicating that input-token reduction alone cannot be assumed to improve inference efficiency.
  • The authors conclude that, for the evaluated settings, better energy–quality tradeoffs come from model selection and output-length control rather than prompt compression.

Abstract

The rapid proliferation of Large Language Models has created an environmental paradox: the very technology that could help solve climate challenges is itself becoming a significant contributor to global carbon emissions. We test whether prompt compression improves inference energy efficiency in 28,421 successful API trials (28,428 planned) across three providers (OpenAI GPT-4o-mini, Anthropic Claude-3.5-Sonnet, and DeepSeek-Chat), five benchmarks (HumanEval, MBPP, GSM8K, MATH, MMLU), and four compression ratios (r ∈ {1.0, 0.7, 0.5, 0.3}). Energy is estimated with a token-based proxy calibrated against local direct measurements, and quality is tracked with benchmark pass rates. Compression produced substantial quality loss (overall pass rate 26.0% at baseline vs. 1.5% at r=0.7) and strongly provider-dependent energy behavior. DeepSeek exhibited output expansion under compression (21 → 798 tokens at r=0.3), corresponding to energy increases up to +2,140%, while GPT-4o-mini showed mixed effects, including a reduction at r=0.5. These results indicate that input-token reduction alone is not a reliable energy optimization strategy in production inference. For the evaluated settings, model selection and output-length control provided more consistent energy–quality tradeoffs than prompt compression.
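The mechanism behind the paradox can be sketched with a toy token-based energy proxy. The coefficients and prompt lengths below are illustrative assumptions, not the paper's calibrated values; only the DeepSeek output counts (21 → 798 tokens at r=0.3) come from the abstract. The point is structural: if decoding costs more per token than prefill, a compressed prompt that triggers output expansion can raise total energy even as input tokens fall.

```python
# Toy token-based energy proxy. E_IN and E_OUT are made-up illustrative
# coefficients (the paper calibrates its proxy against local measurements).
E_IN = 1.0    # assumed energy units per input (prompt) token
E_OUT = 4.0   # assumed per-token cost of decoding, higher than prefill

def energy(input_tokens: int, output_tokens: int) -> float:
    """Proxy: total energy is linear in prompt and completion token counts."""
    return E_IN * input_tokens + E_OUT * output_tokens

# Baseline (r=1.0): hypothetical 500-token prompt, 21-token completion
# (21 is the DeepSeek baseline output count reported in the abstract).
baseline = energy(input_tokens=500, output_tokens=21)

# Heavy compression (r=0.3): 70% fewer prompt tokens, but the completion
# expands to 798 tokens, as observed for DeepSeek.
compressed = energy(input_tokens=150, output_tokens=798)

print(f"baseline:   {baseline:.0f}")
print(f"compressed: {compressed:.0f} ({(compressed / baseline - 1) * 100:+.0f}%)")
```

Even with these conservative made-up coefficients, the compressed trial costs several times the baseline; the exact percentage differs from the paper's +2,140% because the real calibrated coefficients and prompt lengths are not given here.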