DeGenTWeb: A First Look at LLM-dominant Websites

arXiv cs.AI / 5/4/2026

💬 Opinion · Signals & Early Trends · Models & Research

Key Points

  • The paper argues that earlier claims about LLM-generated content “taking over” the web lack representative sampling and clear, transparent methodology.
  • It introduces DeGenTWeb, a system for systematically identifying LLM-dominant websites—sites where content is largely produced by LLMs with minimal human input.
  • When tuned to minimize falsely attributing human-authored content to LLMs, off-the-shelf detectors of LLM-generated text perform much worse than advertised; the authors therefore adapt them for use on web pages and aggregate results across multiple pages to categorize websites more accurately.
  • Using DeGenTWeb, they find LLM-dominant sites are highly prevalent in Common Crawl data and in Bing search results, and their share increases over time.
  • They conclude that accurately identifying such sites will likely become increasingly difficult as newer LLMs improve at producing text that evades detectors.

Abstract

Many recent news reports have claimed that content generated by large language models (LLMs) is taking over the web. However, these claims are typically not based on a representative sample of the web, and the methodology underlying them is often opaque. Moreover, when aiming to minimize the chances of falsely attributing human-authored content to LLMs, we find that detectors of LLM-generated text perform much worse than advertised. Consequently, we lack an understanding of the true prevalence and characteristics of LLM content on the web. We describe DeGenTWeb, which systematically identifies LLM-dominant websites: sites whose content has been generated using LLMs with little human input. We show how to adapt detectors of LLM-generated text for use on web pages, and how to aggregate detection results from multiple pages on a site for accurate site-level categorization. Using DeGenTWeb, we find that LLM-dominant sites are highly prevalent both in data from Common Crawl and in Bing's search results, and that this share is growing over time. We also show that continuing to accurately identify such sites appears challenging given the capabilities of the latest LLMs.
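The abstract describes aggregating per-page detection results into a site-level label. The paper does not spell out its aggregation rule here, but a minimal sketch of the general idea is below: score each sampled page with a detector, flag pages at a conservative threshold (to limit false attribution of human writing to LLMs), and call a site LLM-dominant when flagged pages exceed some fraction. The function name, thresholds, and rule are all illustrative assumptions, not the authors' calibrated method.

```python
from statistics import mean

def classify_site(page_scores, page_threshold=0.9, site_fraction=0.5):
    """Illustrative site-level aggregation of per-page detector scores.

    page_scores: detector outputs in [0, 1] for sampled pages, where
    higher means more likely LLM-generated. A high page_threshold
    reflects the paper's emphasis on avoiding false positives on
    human-authored text. Thresholds here are hypothetical, not the
    values used by DeGenTWeb.
    """
    if not page_scores:
        raise ValueError("need at least one sampled page")
    # Flag pages the detector is confident about, then label the site
    # LLM-dominant if flagged pages exceed the chosen fraction.
    flagged = [score >= page_threshold for score in page_scores]
    return mean(flagged) >= site_fraction
```

In such a scheme the per-page threshold controls the false-positive rate on individual pages, while the site-level fraction makes the final label robust to a few misclassified pages.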