Compressing Sequences in the Latent Embedding Space: $K$-Token Merging for Large Language Models

arXiv cs.CL / 4/17/2026


Key Points

  • The paper tackles the quadratic compute and memory cost of LLM self-attention on long prompts via token compression, targeting inefficiencies in the latent embedding space rather than operating only in token space, as prior prompt-compression methods do.
  • It introduces K-Token Merging, which uses a lightweight encoder to merge each contiguous block of K token embeddings into a single embedding while keeping generation tied to the original vocabulary.
  • The compressed sequence is processed by a LoRA-adapted LLM, combining latent-space compression with parameter-efficient adaptation.
  • Experiments across structural reasoning, sentiment classification, and code editing tasks show favorable trade-offs, reaching up to 75% input-length reduction with minimal performance loss and positioning the method on the Pareto frontier.
  • Overall, the work offers a practical pathway to accelerate long-context LLM inference: compress representations in embedding space while preserving output semantics.
  • The approach stays compatible with existing generation mechanisms (decoding over the original vocabulary) while shrinking the effective sequence length the model attends over.
  • The evaluation indicates that latent-space compression preserves task performance better than approaches that compress only in token space, improving efficiency without a proportional loss in accuracy.
  • The framework is lightweight enough for real systems, relying on simple merging operations over fixed-size embedding blocks.
  • These results make K-Token Merging a strong candidate for long-context deployment where latency and memory are the binding constraints.

Abstract

Large Language Models (LLMs) incur significant computational and memory costs when processing long prompts, as full self-attention scales quadratically with input length. Token compression aims to address this challenge by reducing the number of tokens representing inputs. However, existing prompt-compression approaches primarily operate in token space and overlook inefficiencies in the latent embedding space. In this paper, we propose K-Token Merging, a latent-space compression framework that merges each contiguous block of K token embeddings into a single embedding via a lightweight encoder. The compressed sequence is processed by a LoRA-adapted LLM, while generation remains in the original vocabulary. Experiments on structural reasoning (Textualized Tree), sentiment classification (Amazon Reviews), and code editing (CommitPackFT) show that K-Token Merging lies on the Pareto frontier of performance vs. compression, achieving up to 75% input length reduction with minimal performance degradation.
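The abstract describes the core operation: concatenate each contiguous block of K token embeddings and map it to a single embedding with a lightweight encoder, shrinking the sequence the LLM attends over. The paper's actual encoder architecture is not specified here, so the sketch below assumes the simplest possible form, a single linear projection; the function name `k_token_merge` and the weight shape are illustrative, not the authors' implementation.

```python
import numpy as np

def k_token_merge(embeddings: np.ndarray, W: np.ndarray, K: int) -> np.ndarray:
    """Merge each contiguous block of K token embeddings into one embedding.

    A linear layer stands in for the paper's "lightweight encoder".

    embeddings: (T, d) token embeddings, with T divisible by K
    W:          (K*d, d) merge weights, projecting a block back to model dim d
    returns:    (T // K, d) compressed sequence fed to the LoRA-adapted LLM
    """
    T, d = embeddings.shape
    assert T % K == 0, "pad the prompt so its length is a multiple of K"
    blocks = embeddings.reshape(T // K, K * d)  # concatenate each K-token block
    return blocks @ W                           # one embedding per block

# Toy usage: a 12-token prompt with K=4 becomes 3 embeddings,
# i.e. the 75% input-length reduction quoted in the abstract.
rng = np.random.default_rng(0)
K, d = 4, 8
emb = rng.standard_normal((12, d))
W = rng.standard_normal((K * d, d)) / np.sqrt(K * d)
compressed = k_token_merge(emb, W, K)
print(compressed.shape)  # (3, 8)
```

Note that only the input side is compressed: the merged embeddings replace the prompt's token embeddings, while generation still decodes token by token over the original vocabulary, which is what keeps the method compatible with standard decoding.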