Large Language Model as Token Compressor and Decompressor

arXiv cs.CL / 3/27/2026


Key Points

  • The paper proposes that an off-the-shelf large language model (LLM) can serve as both a token compressor and decompressor by learning an internal representation.
  • It introduces a self-expressive autoencoding framework that fine-tunes a pretrained LLM to convert long text into discrete, variable-length latent codes (“Z-tokens”) and reconstruct the original text exactly.
  • The learned Z-token representation is content-adaptive, allocating more tokens to semantically dense segments while aggressively compressing redundant or predictable regions using lightweight LoRA-based adapter heads.
  • Experiments report up to 18× token reduction on datasets such as Wikipedia, CNN/DailyMail, HotpotQA, and long-query corpora, while maintaining reconstruction fidelity and downstream task performance.
  • The approach is positioned as enabling token-efficient long-context reasoning, including prompt compression and autoregressive generation directly in the Z-token space.

Abstract

In this paper, we establish the novel insight that an off-the-shelf LLM can function as an excellent token compressor and decompressor. To demonstrate this, we design a self-expressive autoencoding framework that fine-tunes a pretrained LLM to translate long texts into a compact internal language of discrete, variable-length latent codes, termed Z-tokens, and to reconstruct the original text exactly from them. The resulting representation is content-adaptive: semantically dense segments receive more Z-tokens, while redundant or predictable regions are aggressively compressed, via lightweight LoRA-based adapter heads. Empirically, our method achieves up to 18× token reduction on Wikipedia, CNN/DailyMail, HotpotQA, and Qulac-style long-query datasets, while preserving reconstruction fidelity and downstream performance. This simple yet effective design supports applications including prompt compression and autoregressive generation directly in the Z-token space, offering a potential pathway toward token-efficient long-context reasoning.
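The two properties the abstract claims, lossless round-tripping and content-adaptive token allocation, can be illustrated with a deliberately simple toy. The sketch below is not the paper's method: it uses `zlib` as a stand-in for the fine-tuned LLM encoder/decoder, and the function name `allocate_z_tokens` and the use of compressed byte length as a proxy for "semantic density" are illustrative assumptions.

```python
import zlib

def allocate_z_tokens(segments, total_budget):
    """Toy content-adaptive budget: segments that compress poorly
    (a crude proxy for semantic density) get more Z-token slots.
    The paper's framework learns this allocation with an LLM; zlib
    is only a stand-in here."""
    costs = [len(zlib.compress(s.encode("utf-8"))) for s in segments]
    total = sum(costs)
    # Every segment keeps at least one slot; the rest scale with density.
    return [max(1, round(total_budget * c / total)) for c in costs]

# Exact round-trip: compress, then reconstruct the original text exactly.
text = "redundant " * 40 + "Entangled qubit pairs correlate measurement outcomes."
codes = zlib.compress(text.encode("utf-8"))            # toy "Z-tokens"
restored = zlib.decompress(codes).decode("utf-8")      # lossless reconstruction

# A highly repetitive segment vs. a short but information-dense one.
segments = ["yes yes yes " * 20,
            "Quantum entanglement links particle states across distance."]
alloc = allocate_z_tokens(segments, total_budget=10)
```

Even though the repetitive segment is four times longer in characters, it receives fewer slots than the dense one, mirroring the abstract's claim that predictable regions are compressed aggressively while dense segments keep more Z-tokens.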