PoC: Performance-oriented Context Compression for Large Language Models via Performance Prediction

arXiv cs.CL / 3/23/2026


Key Points

  • PoC shifts the focus from fixing a compression ratio to enforcing a user-defined performance floor, enabling more reliable and predictable LLM context compression decisions.
  • The approach uses a lightweight performance predictor to automatically identify the most aggressive compression ratio that satisfies the performance constraint, before applying an off-the-shelf compressor.
  • The authors compare a simple context-agnostic predictor with a more sophisticated context-aware predictor, finding the latter yields lower prediction error and better overall performance on QA and summarization tasks.
  • The proposed method promises more reliable, efficient, and performance-aware deployment of context compression for LLMs, with potential reductions in inference costs.

Abstract

While context compression can mitigate the growing inference costs of Large Language Models (LLMs) by shortening contexts, existing methods that specify a target compression ratio or length suffer from unpredictable performance degradation, hindering their reliable deployment. We introduce a paradigm shift to Performance-oriented Context Compression (PoC), where developers specify an acceptable performance floor instead of a compression ratio. PoC employs a lightweight performance predictor to automatically find the most aggressive compression ratio that satisfies this constraint before steering an off-the-shelf compressor. We design and compare two predictor variants: a simple context-agnostic predictor and a more sophisticated context-aware one that considers the input's inherent compressibility. On both question-answering and summarization benchmarks, the context-aware predictor consistently achieves lower performance prediction error than the context-agnostic predictor, and the resulting context-aware PoC attains superior overall performance. Our work paves the way for more reliable, efficient, and performance-aware deployment of context compression for LLMs.
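The selection step described above — using a predictor to pick the most aggressive ratio that still meets the performance floor — can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' implementation: the function names, the candidate-ratio grid, and the toy predictor are all assumptions.

```python
# Hypothetical sketch of the PoC ratio-selection loop. All names here
# (select_compression_ratio, toy_predictor, the ratio grid) are
# illustrative assumptions, not the paper's actual API.
from typing import Callable, List


def select_compression_ratio(
    context: str,
    predictor: Callable[[str, float], float],  # predicted task performance at a ratio
    performance_floor: float,                  # user-specified minimum acceptable score
    candidate_ratios: List[float],             # fraction of tokens kept (smaller = more aggressive)
) -> float:
    """Return the most aggressive kept-fraction whose predicted
    performance still satisfies the floor; fall back to no
    compression (1.0) if no candidate meets the constraint."""
    for ratio in sorted(candidate_ratios):  # try the most aggressive candidates first
        if predictor(context, ratio) >= performance_floor:
            return ratio
    return 1.0  # constraint unsatisfiable: keep the full context


# Toy context-agnostic predictor: assumes performance rises with the
# fraction of context retained (purely for demonstration).
def toy_predictor(context: str, ratio: float) -> float:
    return 0.5 + 0.5 * ratio


ratio = select_compression_ratio(
    "some long context", toy_predictor, 0.8, [0.1, 0.25, 0.5, 0.75]
)
# With this toy predictor, 0.75 is the smallest kept-fraction
# predicted to meet the 0.8 floor.
```

A context-aware variant, as the paper describes, would additionally condition the predictor on properties of the specific input (its inherent compressibility), rather than using a single global performance-versus-ratio curve.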