75% of Sites Blocking AI Bots Still Get Cited. Here Is Why Blocking Does Not Work.

Dev.to / 5/1/2026

💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Industry & Market Moves

Key Points

  • A cross-platform analysis by Position Digital (April 2026) found that 75% of websites that actively block AI crawlers still appear in AI-generated answers from ChatGPT, Perplexity, and Gemini.
  • The study argues that blocking mainly prevents crawl control, not citations, because AI systems can draw from multiple inputs beyond live crawling.
  • The report highlights that models may already have been trained on a site’s content before robots.txt/server blocks were added, and third-party mentions can create separate citation paths.
  • It also notes that content freshness can outweigh crawl permission, with a majority of highly cited pages updated or published within the past month.
  • The overall takeaway is that “shutting the door” with robots.txt/blocks may not reduce AI references and can reduce brands’ control over how their information is framed.

Seventy-five percent of websites that actively block AI crawlers through robots.txt, meta tags, or server-level rules still appear in AI-generated answers from ChatGPT, Perplexity, and Gemini. Blocking does not stop citations. It stops you from controlling them.

That number comes from new cross-platform citation analysis published by Position Digital in April 2026, and it dismantles the most common instinct brands have when they discover AI engines are using their content: shut the door.

Why Brands Block AI Bots

The logic feels sound. OpenAI, Google, Anthropic, and Perplexity all send crawlers across the web to ingest content. Their bots announce themselves with user-agent tokens such as GPTBot, ChatGPT-User, Google-Extended, CCBot, and PerplexityBot. You can add them to your robots.txt file and tell them to stay out.

Many sites did exactly that. After the AI training data controversies of 2023-2024, publishers ranging from major news outlets to niche SaaS blogs added Disallow rules targeting known AI user agents.
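A typical block of this kind looks like the following. The user-agent tokens shown are illustrative: each vendor documents its own tokens, and they can change, so verify against vendor documentation before relying on them.

```
# Disallow known AI crawlers site-wide
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

# Everyone else may crawl normally
User-agent: *
Allow: /
```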

The result: most of them still show up in AI answers anyway.
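You can check what a given set of rules actually blocks with Python's standard-library robots.txt parser. A minimal sketch, run here against an inline ruleset rather than a live site:

```python
from urllib import robotparser

# Inline robots.txt rules: block GPTBot, allow everyone else
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# A compliant AI crawler honoring these rules is turned away...
print(rp.can_fetch("GPTBot", "https://example.com/article"))       # False
# ...but the rules say nothing about content already in training sets,
# pasted URLs, or third-party pages that quote you.
print(rp.can_fetch("Mozilla/5.0", "https://example.com/article"))  # True
```

Note what this does and does not prove: the parser only tells you what a *compliant* crawler would do with your rules. It says nothing about the other citation paths described below.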

The Data: Blocking vs. Citation Reality

Position Digital's April 2026 analysis tracked AI citation patterns across ChatGPT, Perplexity, and Gemini for thousands of domains. The key finding: 75% of sites with active AI bot blocks still appeared in AI-generated responses for queries related to their content.

Separate data from Demand Local:

  • 76.4% of ChatGPT's top-cited pages were updated within the last 30 days.
  • 50% of Perplexity citations came from content less than 13 weeks old.
  • Reddit appeared in 46.4% of AI responses. YouTube in 31.8%.
  • Google AI Overviews showed a 46.7% relative click reduction.

Four Reasons Blocking Fails

1. AI Engines Use Multiple Data Sources

ChatGPT does not learn only from live web crawls. Its knowledge comes from training datasets, RAG pipelines, and user-submitted content. When someone pastes a URL into ChatGPT and asks for a summary, that content enters the system regardless of robots.txt.

2. Training Data Already Contains Your Content

If your website was publicly accessible before you added bot blocks, AI models likely already trained on your content. Adding a robots.txt file today does not retroactively remove it.

3. Third-Party Mentions Create Independent Citation Paths

Other sites can still mention you, link to you, and quote your content. AI engines cite these third-party sources constantly. When you block your own site, you surrender control of your AI narrative.

4. Content Freshness Outranks Crawl Permission

Over three-quarters of ChatGPT's top citations are from pages updated within 30 days. A page updated weekly will outrank a static competitor regardless of crawler policy.

What to Do Instead: The GEO Offensive

  1. Allow crawling and optimize for it. Create an llms.txt file that gives AI crawlers a structured summary.
  2. Publish fresh content weekly. Keeps pages in the freshness window.
  3. Structure content for AI extraction. Front-load the answer in the first 1-2 sentences of each section.
  4. Build entity authority across 6+ domains. Brand mentions on independent sites signal credibility.
  5. Track AI visibility actively. Measure citation rates across platforms.

FAQ

Does robots.txt block AI training? No. It only tells compliant crawlers not to access your site.

Can I opt out entirely? Not practically. Third-party mentions bypass your blocks.

Most effective action? Update key content every 30 days. Freshness is the strongest citation signal.

Check your AI Visibility Score free at audit.searchless.ai.