Compressing Sequences in the Latent Embedding Space: $K$-Token Merging for Large Language Models

arXiv cs.CL / 4/17/2026


Key Points

  • The paper tackles the quadratic compute and memory cost of LLM self-attention on long prompts via token compression, targeting inefficiencies in the latent embedding space rather than operating only in token space, as prior prompt-compression methods do.
  • It introduces K-Token Merging, which uses a lightweight encoder to merge each contiguous block of K token embeddings into a single embedding while keeping generation tied to the original vocabulary.
  • The compressed sequence is processed by a LoRA-adapted LLM, combining latent-space compression with parameter-efficient adaptation.
  • Experiments across structural reasoning, sentiment classification, and code editing tasks show favorable trade-offs, reaching up to 75% input-length reduction with minimal performance loss and positioning the method on the Pareto frontier.
  • Overall, the work offers a practical pathway to accelerate long-context LLM inference: compress representations in embedding space while preserving output semantics.
  • The approach stays compatible with existing generation mechanisms (decoding over the original vocabulary) while shrinking the effective sequence length the model attends over.
  • The evaluation indicates that latent-space compression preserves task performance better than approaches that compress only in token space, improving efficiency without a proportional loss in accuracy.
  • The framework is lightweight enough for real systems, relying on simple merging operations over fixed-size embedding blocks.
  • These results make K-Token Merging a strong candidate for long-context deployment where latency and memory are the binding constraints.

Abstract

Large Language Models (LLMs) incur significant computational and memory costs when processing long prompts, as full self-attention scales quadratically with input length. Token compression aims to address this challenge by reducing the number of tokens representing inputs. However, existing prompt-compression approaches primarily operate in token space and overlook inefficiencies in the latent embedding space. In this paper, we propose K-Token Merging, a latent-space compression framework that merges each contiguous block of K token embeddings into a single embedding via a lightweight encoder. The compressed sequence is processed by a LoRA-adapted LLM, while generation remains in the original vocabulary. Experiments on structural reasoning (Textualized Tree), sentiment classification (Amazon Reviews), and code editing (CommitPackFT) show that K-Token Merging lies on the Pareto frontier of performance vs. compression, achieving up to 75% input length reduction with minimal performance degradation.
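The abstract describes the core operation: concatenate each contiguous block of K token embeddings and map it to a single embedding with a lightweight encoder, shrinking the sequence the LLM attends over. The paper's actual encoder architecture is not specified here, so the sketch below assumes the simplest possible form, a single linear projection; the function name `k_token_merge` and the weight shape are illustrative, not the authors' implementation.

```python
import numpy as np

def k_token_merge(embeddings: np.ndarray, W: np.ndarray, K: int) -> np.ndarray:
    """Merge each contiguous block of K token embeddings into one embedding.

    A linear layer stands in for the paper's "lightweight encoder".

    embeddings: (T, d) token embeddings, with T divisible by K
    W:          (K*d, d) merge weights, projecting a block back to model dim d
    returns:    (T // K, d) compressed sequence fed to the LoRA-adapted LLM
    """
    T, d = embeddings.shape
    assert T % K == 0, "pad the prompt so its length is a multiple of K"
    blocks = embeddings.reshape(T // K, K * d)  # concatenate each K-token block
    return blocks @ W                           # one embedding per block

# Toy usage: a 12-token prompt with K=4 becomes 3 embeddings,
# i.e. the 75% input-length reduction quoted in the abstract.
rng = np.random.default_rng(0)
K, d = 4, 8
emb = rng.standard_normal((12, d))
W = rng.standard_normal((K * d, d)) / np.sqrt(K * d)
compressed = k_token_merge(emb, W, K)
print(compressed.shape)  # (3, 8)
```

Note that only the input side is compressed: the merged embeddings replace the prompt's token embeddings, while generation still decodes token by token over the original vocabulary, which is what keeps the method compatible with standard decoding.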