When Agents Go Quiet: Output Generation Capacity and Format-Cost Separation for LLM Document Synthesis

arXiv cs.AI / 4/21/2026

📰 News · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The paper identifies a failure mode for LLM coding agents called “output stalling,” where the agent silently outputs empty responses when tasked with generating large, format-heavy documents.
  • It defines Output Generation Capacity (OGC), a formal measure of an agent's effective ability to produce output given its current context state, which is empirically smaller than the raw context window and better predicts what the agent can actually emit.
  • The authors prove a Format-Cost Separation Theorem that shows deferred template rendering is always at least as token-efficient as direct generation for formats with overhead multiplier μ_f > 1, providing tight bounds on token savings.
  • An Adaptive Strategy Selection framework chooses among direct, chunked, or deferred generation strategies based on the ratio between estimated output cost and available OGC.
  • Experiments across Claude 3.5 Sonnet, GPT-4o, and Llama 3.1 70B show that deferred rendering reduces LLM generation tokens by 48–72% and eliminates output stalling entirely; the framework is released as GEN-PILOT, an open-source MCP server.
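The Adaptive Strategy Selection rule above can be sketched as a simple threshold function on the ratio of estimated output cost to available OGC. The function name, thresholds, and safety margin below are illustrative assumptions, not the paper's implementation:

```python
# Hypothetical sketch of Adaptive Strategy Selection.
# The decision is driven by r = estimated_output_tokens / available_ogc;
# the cutoffs (0.8, 2.0) are invented for illustration.

def select_strategy(estimated_output_tokens: int, available_ogc: int,
                    safety_margin: float = 0.8) -> str:
    """Pick a generation strategy from the output-cost-to-capacity ratio.

    - direct:   output comfortably fits within capacity
    - chunked:  output exceeds capacity but can be emitted in pieces
    - deferred: output is large and format-heavy; emit a template plus
                data and render the document outside the LLM
    """
    r = estimated_output_tokens / available_ogc
    if r <= safety_margin:
        return "direct"
    elif r <= 2.0:  # hypothetical cutoff between chunking and deferring
        return "chunked"
    else:
        return "deferred"

print(select_strategy(3_000, 8_000))   # small document fits directly
print(select_strategy(12_000, 8_000))  # moderate overflow -> chunked
print(select_strategy(40_000, 8_000))  # large, format-heavy -> deferred
```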

Abstract

LLM-powered coding agents suffer from a poorly understood failure mode we term output stalling: the agent silently produces empty responses when attempting to generate large, format-heavy documents. We present a theoretical framework that explains and prevents this failure through three contributions. (1) We introduce Output Generation Capacity (OGC), a formal measure of an agent's effective ability to produce output given its current context state, distinct from and empirically smaller than the raw context window. (2) We prove a Format-Cost Separation Theorem showing that deferred template rendering is always at least as token-efficient as direct generation for any format with overhead multiplier μ_f > 1, and derive tight bounds on the savings. (3) We formalize Adaptive Strategy Selection, a decision framework that maps the ratio of estimated output cost to available OGC into an optimal generation strategy (direct, chunked, or deferred). We validate the theory through controlled experiments across three models (Claude 3.5 Sonnet, GPT-4o, Llama 3.1 70B), four document types, and an ablation study isolating each component's contribution. Deferred rendering reduces LLM generation tokens by 48–72% across all conditions and eliminates output stalling entirely. We instantiate the framework as GEN-PILOT, an open-source MCP server, demonstrating that the theory translates directly into a practical tool.
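The intuition behind the Format-Cost Separation Theorem can be illustrated with a toy cost model: if direct generation inflates the content by the format's overhead multiplier μ_f, while deferred rendering has the LLM emit only the raw values and leaves the markup to a template engine, then deferred is never more expensive whenever μ_f > 1. The cost functions and the rendering step below are an assumed sketch, not the paper's formal model:

```python
# Toy illustration of the format-cost separation idea (assumed cost model):
# direct generation makes the LLM emit content plus format overhead,
# deferred rendering makes it emit only the data.
from string import Template

def direct_cost(content_tokens: int, mu_f: float) -> float:
    # Tokens the LLM generates when writing the formatted document itself:
    # content inflated by the format overhead multiplier mu_f.
    return mu_f * content_tokens

def deferred_cost(content_tokens: int, template_tokens: int = 0) -> float:
    # The LLM emits only the raw values; markup is added by the renderer.
    return content_tokens + template_tokens

# Deterministic rendering step: no LLM tokens are spent on the markup.
row = Template("| $name | $value |").substitute(name="latency_p50", value="83ms")
print(row)  # | latency_p50 | 83ms |

# For any mu_f > 1, deferred is at least as cheap per unit of content.
assert deferred_cost(1000) <= direct_cost(1000, mu_f=1.6)
```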