YOCO++: Enhancing YOCO with KV Residual Connections for Efficient LLM Inference

arXiv cs.CL / 4/16/2026

Key Points

  • The paper introduces YOCO++, an improved cross-layer key-value (KV) compression approach for efficient LLM inference that targets smaller quality loss than prior KV compression methods.
  • YOCO++ enhances YOCO by adding weighted residual connections that link each bottom-half layer’s KV to the bottom layer, increasing effective model capacity without changing training/inference efficiency.
  • The method aims to preserve the benefits of reduced KV-cache memory usage at a fixed compression rate, addressing the common tradeoff between compression and performance.
  • Experiments report state-of-the-art results among cross-layer KV compression techniques at a 50% KV cache compression rate, beating a standard Transformer baseline.
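To give a rough sense of what a 50% KV-cache compression rate means for memory, here is a back-of-the-envelope sketch. All model dimensions below are hypothetical placeholders, not figures from the paper; the calculation only illustrates that caching KVs for half the layers halves the cache.

```python
def kv_cache_bytes(cached_layers, n_heads, head_dim, seq_len, dtype_bytes=2):
    """Bytes needed to cache keys AND values (hence the factor of 2)
    for `cached_layers` layers, assuming fp16 (2 bytes per element)."""
    return 2 * cached_layers * n_heads * head_dim * seq_len * dtype_bytes

# Hypothetical 32-layer model; cross-layer sharing that caches KVs for
# only the bottom half of the layers cuts the cache size by 50%.
full = kv_cache_bytes(cached_layers=32, n_heads=32, head_dim=128, seq_len=4096)
half = kv_cache_bytes(cached_layers=16, n_heads=32, head_dim=128, seq_len=4096)
print(full // 1024**2, half // 1024**2)  # → 2048 1024 (MiB)
```

At these (made-up) dimensions the full cache is 2 GiB per sequence, so a 50% compression rate recovers 1 GiB, which is where the practical appeal of cross-layer sharing comes from.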

Abstract

Cross-layer key-value (KV) compression has proven effective for efficient inference of large language models (LLMs). Although such methods reduce the memory consumption of the KV cache, they usually introduce non-negligible performance degradation. In this work, we aim to enhance the performance of YOCO, a cross-layer KV compression method that shares the KVs of the middle layer with the top-half layers. We propose YOCO++, an enhanced YOCO that incorporates a weighted residual connection between the KVs of each bottom-half layer and the bottom layer. Compared to YOCO, YOCO++ increases model capacity while maintaining the same training and inference efficiency. Our experiments show that YOCO++ achieves state-of-the-art performance among cross-layer KV compression methods at a 50% KV cache compression rate, outperforming the standard Transformer.
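The weighted residual connection described in the abstract can be sketched roughly as follows. This is a minimal, dependency-free illustration under stated assumptions: the per-layer scalar weights and the exact combination rule (local KV plus a weighted copy of the bottom layer's KV) are guesses, since the paper's formula is not reproduced here, and KV tensors are flattened to plain lists of floats for readability.

```python
def apply_kv_residual(layer_kvs, weights):
    """Hedged sketch of a YOCO++-style weighted KV residual: each
    bottom-half layer's KV receives a weighted contribution from the
    bottom (first) layer's KV. `layer_kvs` is a list of per-layer KV
    tensors (here: flat lists of floats); `weights` holds one scalar
    per non-bottom layer. The scalar parameterization is an assumption."""
    bottom = layer_kvs[0]
    mixed = [bottom]  # the bottom layer keeps its own KV unchanged
    for kv, w in zip(layer_kvs[1:], weights):
        # residual connection: local KV plus weighted bottom-layer KV
        mixed.append([x + w * b for x, b in zip(kv, bottom)])
    return mixed

# Toy example with three "layers" of 2-element KVs and made-up weights:
kvs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(apply_kv_residual(kvs, weights=[0.5, 0.1]))
```

Note that nothing here grows the KV cache: the residual mix can be computed from KVs that are materialized anyway, which is consistent with the abstract's claim that YOCO++ keeps YOCO's training and inference efficiency while adding capacity.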