LLMs Corrupt Your Documents When You Delegate

arXiv cs.CL / 4/20/2026


Key Points

  • The paper introduces DELEGATE-52, a benchmark to test how reliably LLMs handle long delegated workflows that require extensive professional document editing across 52 domains.
  • In experiments with 19 LLMs, even leading frontier models corrupted documents during delegation, degrading on average about 25% of document content by the end of long workflows.
  • The study finds that agentic tool use does not improve performance on DELEGATE-52, indicating tool use alone doesn’t prevent document degradation.
  • Degradation is shown to worsen with larger document size, longer interaction length, and the presence of distractor files, with errors that can be sparse but severe.
  • The authors conclude that current LLMs are unreliable delegates because they can introduce silent, compounding errors that undermine document correctness over time.
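The paper reports corruption as a fraction of document content, but the summary above does not specify the exact metric. As a rough illustration only (not the authors' method), one simple way to quantify how much of a reference document survives an LLM's edits is a line-level diff; the function name and line-based granularity here are assumptions for the sketch:

```python
import difflib

def corruption_fraction(reference: str, edited: str) -> float:
    """Estimate the fraction of reference lines altered or dropped in the
    edited document, using longest-matching-block alignment (illustrative
    metric only; not the metric used in the DELEGATE-52 paper)."""
    ref_lines = reference.splitlines()
    edited_lines = edited.splitlines()
    if not ref_lines:
        return 0.0
    matcher = difflib.SequenceMatcher(a=ref_lines, b=edited_lines)
    # Count reference lines preserved verbatim in the edited version.
    preserved = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - preserved / len(ref_lines)

# One changed line out of four -> 25% of the content corrupted.
print(corruption_fraction("a\nb\nc\nd", "a\nX\nc\nd"))
```

Under a metric like this, sparse but severe errors (e.g., a single corrupted formula in a large spreadsheet) would register as a small fraction, which is consistent with the paper's point that damage can be silent even when the measured degradation looks modest.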

Abstract

Large Language Models (LLMs) are poised to disrupt knowledge work, with the emergence of delegated work as a new interaction paradigm (e.g., vibe coding). Delegation requires trust: the expectation that the LLM will faithfully execute the task without introducing errors into documents. We introduce DELEGATE-52 to study the readiness of AI systems in delegated workflows. DELEGATE-52 simulates long delegated workflows that require in-depth document editing across 52 professional domains, such as coding, crystallography, and music notation. Our large-scale experiment with 19 LLMs reveals that current models degrade documents during delegation: even frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows, with other models failing more severely. Additional experiments reveal that agentic tool use does not improve performance on DELEGATE-52, and that degradation severity is exacerbated by document size, length of interaction, and presence of distractor files. Our analysis shows that current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interactions.