Universal YOCO for Efficient Depth Scaling

arXiv cs.CL / 4/3/2026


Key Points

  • The paper argues that while test-time scaling improves LLM reasoning, conventional Transformer inference strategies do not scale compute efficiently due to looping overhead and KV-cache growth with depth.
  • It introduces Universal YOCO (YOCO-U), combining the YOCO decoder-decoder architecture with recursive computation to improve the capability-efficiency tradeoff beyond either approach alone.
  • YOCO-U uses a Universal Self-Decoder that runs multiple iterations via parameter sharing, while restricting iterations to shallow, efficient-attention layers to control overhead.
  • The approach aims to keep a constant global KV cache and provide linear pre-filling, using partial recursion to increase representational depth with limited additional cost.
  • Experiments reportedly show YOCO-U remains competitive on general and long-context benchmarks, indicating that integrating efficient-attention designs with recursion is a promising path for scalable LLM inference.
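The core idea in the points above, reusing one shallow set of weights across multiple iterations instead of stacking distinct layers, can be sketched in a toy form. This is an illustrative sketch only, not the paper's implementation: the single weight matrix stands in for a full efficient-attention block, and all names and sizes here are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

def make_layer(rng, d_model):
    # Toy stand-in for one "efficient-attention" self-decoder layer
    # (a single matrix instead of attention + MLP; illustrative only).
    return rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

def universal_self_decoder(x, shared_layer, n_iters):
    # Parameter sharing: the SAME shallow layer is applied n_iters times,
    # increasing effective depth without adding any parameters.
    for _ in range(n_iters):
        x = np.tanh(x @ shared_layer)
    return x

def stacked_decoder(x, layers):
    # Conventional depth scaling: each extra layer brings its own weights.
    for w in layers:
        x = np.tanh(x @ w)
    return x

n_iters = 4
shared = make_layer(rng, d_model)
stack = [make_layer(rng, d_model) for _ in range(n_iters)]

x = rng.standard_normal((8, d_model))
y_shared = universal_self_decoder(x, shared, n_iters)
y_stack = stacked_decoder(x, stack)

shared_params = shared.size                # constant as depth grows
stacked_params = sum(w.size for w in stack)  # grows linearly with depth
print(shared_params, stacked_params)  # 4096 16384
```

At equal effective depth, the recursive variant holds its parameter count fixed while the stacked variant's grows linearly, which is the capability-efficiency lever YOCO-U exploits by confining recursion to shallow layers.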

Abstract

The rise of test-time scaling has remarkably boosted the reasoning and agentic proficiency of Large Language Models (LLMs). Yet, standard Transformers struggle to scale inference-time compute efficiently, as conventional looping strategies suffer from high computational overhead and a KV cache that inflates alongside model depth. We present Universal YOCO (YOCO-U), which combines the YOCO decoder-decoder architecture with recursive computation to achieve a synergistic effect greater than either alone. Built on the YOCO framework, YOCO-U implements a Universal Self-Decoder that performs multiple iterations via parameter sharing, while confining the iterative process to shallow, efficient-attention layers. This combination yields a favorable capability-efficiency tradeoff that neither YOCO nor recursion achieves independently. The YOCO architecture provides a constant global KV cache and linear pre-filling, while partial recursion enhances representational depth with limited overhead. Together, YOCO-U improves token utility and scaling behavior while maintaining efficient inference. Empirical results confirm that YOCO-U remains highly competitive in general and long-context benchmarks, demonstrating that the integration of efficient-attention architectures and recursive computation is a promising direction for scalable LLMs.
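The abstract's contrast between a depth-proportional KV cache and YOCO's constant global cache can be made concrete with a back-of-the-envelope memory count. The sizes below are hypothetical, and the accounting is deliberately simplified: it ignores the O(1) state of the efficient-attention self-decoder layers and counts only per-layer key/value tensors.

```python
def kv_cache_bytes(seq_len, d_model, n_layers, bytes_per_elem=2):
    # Each cached token stores a key and a value vector per layer:
    # 2 * d_model elements, at bytes_per_elem each (e.g. fp16).
    return 2 * seq_len * d_model * n_layers * bytes_per_elem

# Standard Transformer: every decoder layer keeps its own KV cache,
# so cache size scales with depth (hypothetical 32-layer model).
standard = kv_cache_bytes(seq_len=32_768, d_model=4096, n_layers=32)

# YOCO-style decoder-decoder: the cross-decoder reads one global KV
# cache produced once, independent of how deep the model is.
yoco_like = kv_cache_bytes(seq_len=32_768, d_model=4096, n_layers=1)

print(standard // yoco_like)  # 32
```

Under these toy assumptions, depth-proportional caching costs 32x the memory of a single global cache at the same sequence length, which is why adding depth through recursion on top of YOCO, rather than through extra cached layers, keeps long-context inference cheap.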