DeGenTWeb: A First Look at LLM-dominant Websites

arXiv cs.AI / 5/4/2026

💬 Opinion · Signals & Early Trends · Models & Research

Key Points

  • The paper argues that earlier claims about LLM-generated content “taking over” the web lack representative sampling and clear, transparent methodology.
  • It introduces DeGenTWeb, a system for systematically identifying LLM-dominant websites—sites where content is largely produced by LLMs with minimal human input.
  • When tuned to minimize falsely attributing human-authored content to LLMs, off-the-shelf detectors of LLM-generated text perform much worse than advertised; the authors therefore adapt them for use on web pages and aggregate results across multiple pages to categorize websites more accurately.
  • Using DeGenTWeb, they find LLM-dominant sites are highly prevalent in Common Crawl data and in Bing search results, and their share increases over time.
  • They conclude that accurately identifying such sites will likely become increasingly difficult as newer LLMs improve at producing text that evades detectors.

Abstract

Many recent news reports have claimed that content generated by large language models (LLMs) is taking over the web. However, these claims are typically not based on a representative sample of the web, and the methodology underlying them is often opaque. Moreover, when aiming to minimize the chances of falsely attributing human-authored content to LLMs, we find that detectors of LLM-generated text perform much worse than advertised. Consequently, we lack an understanding of the true prevalence and characteristics of LLM content on the web. We describe DeGenTWeb, which systematically identifies LLM-dominant websites: sites whose content has been generated using LLMs with little human input. We show how to adapt detectors of LLM-generated text for use on web pages, and how to aggregate detection results from multiple pages on a site for accurate site-level categorization. Using DeGenTWeb, we find that LLM-dominant sites are highly prevalent both in data from Common Crawl and in Bing's search results, and that this share is growing over time. We also show that continuing to accurately identify such sites appears challenging given the capabilities of the latest LLMs.
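The abstract describes aggregating per-page detection results into a site-level label. The paper does not spell out its aggregation rule here, but a minimal sketch of the general idea is below: score each sampled page with a detector, flag pages at a conservative threshold (to limit false attribution of human writing to LLMs), and call a site LLM-dominant when flagged pages exceed some fraction. The function name, thresholds, and rule are all illustrative assumptions, not the authors' calibrated method.

```python
from statistics import mean

def classify_site(page_scores, page_threshold=0.9, site_fraction=0.5):
    """Illustrative site-level aggregation of per-page detector scores.

    page_scores: detector outputs in [0, 1] for sampled pages, where
    higher means more likely LLM-generated. A high page_threshold
    reflects the paper's emphasis on avoiding false positives on
    human-authored text. Thresholds here are hypothetical, not the
    values used by DeGenTWeb.
    """
    if not page_scores:
        raise ValueError("need at least one sampled page")
    # Flag pages the detector is confident about, then label the site
    # LLM-dominant if flagged pages exceed the chosen fraction.
    flagged = [score >= page_threshold for score in page_scores]
    return mean(flagged) >= site_fraction
```

In such a scheme the per-page threshold controls the false-positive rate on individual pages, while the site-level fraction makes the final label robust to a few misclassified pages.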