Universal YOCO for Efficient Depth Scaling

arXiv cs.CL / 4/3/2026


Key Points

  • The paper argues that while test-time scaling improves LLM reasoning, conventional Transformer inference strategies do not scale compute efficiently due to looping overhead and KV-cache growth with depth.
  • It introduces Universal YOCO (YOCO-U), combining the YOCO decoder-decoder architecture with recursive computation to improve the capability-efficiency tradeoff beyond either approach alone.
  • YOCO-U uses a Universal Self-Decoder that runs multiple iterations via parameter sharing, while restricting iterations to shallow, efficient-attention layers to control overhead.
  • The approach aims to keep a constant global KV cache and provide linear pre-filling, using partial recursion to increase representational depth with limited additional cost.
  • Experiments reportedly show YOCO-U remains competitive on general and long-context benchmarks, indicating that integrating efficient-attention designs with recursion is a promising path for scalable LLM inference.
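The core idea in the points above, reusing one shallow set of weights across multiple iterations instead of stacking distinct layers, can be sketched in a toy form. This is an illustrative sketch only, not the paper's implementation: the single weight matrix stands in for a full efficient-attention block, and all names and sizes here are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

def make_layer(rng, d_model):
    # Toy stand-in for one "efficient-attention" self-decoder layer
    # (a single matrix instead of attention + MLP; illustrative only).
    return rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

def universal_self_decoder(x, shared_layer, n_iters):
    # Parameter sharing: the SAME shallow layer is applied n_iters times,
    # increasing effective depth without adding any parameters.
    for _ in range(n_iters):
        x = np.tanh(x @ shared_layer)
    return x

def stacked_decoder(x, layers):
    # Conventional depth scaling: each extra layer brings its own weights.
    for w in layers:
        x = np.tanh(x @ w)
    return x

n_iters = 4
shared = make_layer(rng, d_model)
stack = [make_layer(rng, d_model) for _ in range(n_iters)]

x = rng.standard_normal((8, d_model))
y_shared = universal_self_decoder(x, shared, n_iters)
y_stack = stacked_decoder(x, stack)

shared_params = shared.size                # constant as depth grows
stacked_params = sum(w.size for w in stack)  # grows linearly with depth
print(shared_params, stacked_params)  # 4096 16384
```

At equal effective depth, the recursive variant holds its parameter count fixed while the stacked variant's grows linearly, which is the capability-efficiency lever YOCO-U exploits by confining recursion to shallow layers.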

Abstract

The rise of test-time scaling has remarkably boosted the reasoning and agentic proficiency of Large Language Models (LLMs). Yet, standard Transformers struggle to scale inference-time compute efficiently, as conventional looping strategies suffer from high computational overhead and a KV cache that inflates alongside model depth. We present Universal YOCO (YOCO-U), which combines the YOCO decoder-decoder architecture with recursive computation to achieve a synergistic effect greater than either alone. Built on the YOCO framework, YOCO-U implements a Universal Self-Decoder that performs multiple iterations via parameter sharing, while confining the iterative process to shallow, efficient-attention layers. This combination yields a favorable capability-efficiency tradeoff that neither YOCO nor recursion achieves independently. The YOCO architecture provides a constant global KV cache and linear pre-filling, while partial recursion enhances representational depth with limited overhead. Together, YOCO-U improves token utility and scaling behavior while maintaining efficient inference. Empirical results confirm that YOCO-U remains highly competitive in general and long-context benchmarks, demonstrating that the integration of efficient-attention architectures and recursive computation is a promising direction for scalable LLMs.
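The abstract's contrast between a depth-proportional KV cache and YOCO's constant global cache can be made concrete with a back-of-the-envelope memory count. The sizes below are hypothetical, and the accounting is deliberately simplified: it ignores the O(1) state of the efficient-attention self-decoder layers and counts only per-layer key/value tensors.

```python
def kv_cache_bytes(seq_len, d_model, n_layers, bytes_per_elem=2):
    # Each cached token stores a key and a value vector per layer:
    # 2 * d_model elements, at bytes_per_elem each (e.g. fp16).
    return 2 * seq_len * d_model * n_layers * bytes_per_elem

# Standard Transformer: every decoder layer keeps its own KV cache,
# so cache size scales with depth (hypothetical 32-layer model).
standard = kv_cache_bytes(seq_len=32_768, d_model=4096, n_layers=32)

# YOCO-style decoder-decoder: the cross-decoder reads one global KV
# cache produced once, independent of how deep the model is.
yoco_like = kv_cache_bytes(seq_len=32_768, d_model=4096, n_layers=1)

print(standard // yoco_like)  # 32
```

Under these toy assumptions, depth-proportional caching costs 32x the memory of a single global cache at the same sequence length, which is why adding depth through recursion on top of YOCO, rather than through extra cached layers, keeps long-context inference cheap.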