How I tracked which AI bots actually crawl my site

Dev.to / 4/26/2026


Key Points

  • The author launched a new easerva.com domain and used CloudFront standard access logs stored in S3 to empirically track which AI and search bots actually crawl their site, not just what they expect in theory.
  • A 5-day log analysis (filtering by specific user-agent strings) found activity from Bingbot, Googlebot, OAI-SearchBot, and especially ClaudeBot, while live-fetch agents like ChatGPT-User and Claude-User had zero hits.
  • ClaudeBot generated 80 requests in five days, all targeting robots.txt and sitemap.xml, suggesting aggressive early-stage discovery behavior that can still be normal but surprising in volume.
  • Bingbot acted as the “canary” by reaching real content and exposing a bug where IndexNow submitted URLs that didn’t exist in S3, causing 403 errors; the author fixed this by adjusting CloudFront error handling (403→404) and deriving IndexNow URLs from the sitemap.
  • The findings indicate that persistent index crawlers may probe discovery endpoints early, while live-fetch agents remain quiet until there is a user query that triggers real-time browsing.

I launched a new domain two weeks ago and wanted to know which AI bots were actually showing up — not theoretically, but in my CloudFront logs. So I built a small tracker that parses access logs from S3 and reports hits per bot per URL.

After 5 days, here's what the data shows.

The setup

The site is easerva.com — static HTML on S3 + CloudFront, zero JavaScript, JSON-LD on every page, sitemap submitted to GSC and Bing Webmaster Tools, IndexNow integrated.

I enabled CloudFront standard logging (free, writes gzipped logs to S3 every few minutes), then wrote a script that filters by user-agent string for the bots that matter: Googlebot, Bingbot, OAI-SearchBot, ChatGPT-User, GPTBot, PerplexityBot, ClaudeBot, Claude-User, Applebot.
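The tracker itself isn't reproduced in this post, but the core loop is simple enough to sketch. Here's a minimal version under the assumption that the gzipped logs have been synced to a local directory first (e.g. via `aws s3 sync`); the `BOTS` list mirrors the user-agents named above, while the `tally` name and local-directory approach are mine, not necessarily the author's:

```python
import glob
import gzip
from collections import defaultdict

# Substrings matched against the user-agent field; these mirror the
# bot names as they appear in the official UA strings.
BOTS = ["Googlebot", "Bingbot", "OAI-SearchBot", "ChatGPT-User", "GPTBot",
        "PerplexityBot", "ClaudeBot", "Claude-User", "Applebot"]

def tally(log_dir):
    """Count hits per bot per URL across gzipped CloudFront standard logs."""
    hits = defaultdict(lambda: defaultdict(int))
    for path in glob.glob(f"{log_dir}/*.gz"):
        with gzip.open(path, "rt") as f:
            for line in f:
                if line.startswith("#"):  # skip CloudFront header lines
                    continue
                fields = line.rstrip("\n").split("\t")
                if len(fields) < 11:
                    continue
                uri, ua = fields[7], fields[10]  # cs-uri-stem, cs(User-Agent)
                for bot in BOTS:
                    if bot in ua:
                        hits[bot][uri] += 1
                        break
    return hits
```

From the returned nested dict, the hits/URLs/errors table above is just a matter of summing per bot (error counting would additionally read `sc-status`, field 8).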

The 5-day results

Bot                Type                        Hits   URLs   Errors
Bingbot            Search crawler                16      8        3
OAI-SearchBot      Persistent index crawler      28      2        0
ChatGPT-User       Live fetch agent               0      0        0
PerplexityBot      Persistent index crawler       0      0        0
Googlebot          Search crawler                10      4        0
ClaudeBot          Persistent index crawler      80      2        0
Claude-User        Live fetch agent               0      0        0

Three things jumped out

ClaudeBot is hungry. 80 hits in 5 days, all on /robots.txt and /sitemap.xml, with no content fetches yet. This is normal early-stage discovery — crawlers poll permissions before allocating crawl budget — but the volume surprised me: 40 robots.txt fetches is far more than either Googlebot or Bingbot made.

Bingbot is the canary. Only 16 hits, but unlike Claude and OpenAI it followed through to actual content. It also surfaced a real bug for me: 3 of those hits were 403 errors on URLs I hadn't actually published. My IndexNow code was generating URLs from a template pattern instead of from real S3 objects, so it was advertising pages that didn't exist. CloudFront returned 403 (S3's default for missing objects with restrictive bucket policies) instead of 404. I fixed both — added a CloudFront custom error response to rewrite 403 → 404, and refactored IndexNow to derive submitted URLs from the sitemap.
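The fix — deriving the IndexNow submission list from the sitemap instead of a template — can be sketched roughly like this. The function names are hypothetical, and the endpoint/payload shape follows the public IndexNow protocol as I understand it (a JSON POST with `host`, `key`, and `urlList`), not the author's actual code:

```python
import json
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def urls_from_sitemap(sitemap_xml: str) -> list[str]:
    """Extract <loc> entries so IndexNow only advertises pages that exist."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS)]

def submit_indexnow(urls: list[str], host: str, key: str) -> int:
    """POST the sitemap-derived URL list to the IndexNow endpoint."""
    body = json.dumps({"host": host, "key": key, "urlList": urls}).encode()
    req = urllib.request.Request(
        "https://api.indexnow.org/indexnow", data=body,
        headers={"Content-Type": "application/json; charset=utf-8"})
    return urllib.request.urlopen(req).status
```

Since the sitemap is itself generated from real S3 objects, this closes the gap that let phantom URLs get submitted.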

Live-fetch agents are silent. Zero hits from ChatGPT-User or Claude-User. Makes sense — these only fire when a user asks the AI a question that requires real-time browsing, and a brand-new domain isn't relevant to any query yet. Worth noting: as of December 2025, OpenAI's docs explicitly state ChatGPT-User does NOT respect robots.txt, since user-initiated fetches are treated as proxy human browsing.

What I'm operating on

  • Persistent crawlers (OAI-SearchBot, ClaudeBot, PerplexityBot) build indexes. Live-fetch agents (ChatGPT-User, Claude-User) fetch on demand. Different timing patterns, different optimization implications. Track them separately.
  • Don't read into early-stage silence. Discovery → robots.txt polling → sitemap fetch → content crawl is a multi-week process for new domains. Repeated robots.txt fetches are a good sign.
  • Bingbot surfaces bugs early because it follows through to content URLs faster than the AI-native crawlers. Watch its error column.

Setting up the same tracking on AWS

  1. Create an S3 bucket with BucketOwnerPreferred ownership and an ACL grant for CloudFront's log delivery canonical user
  2. Enable Standard Logging on your CloudFront distribution, point at the bucket
  3. Wait ~30 minutes, hit your site, confirm .gz files appear
  4. Parse logs: fields are tab-separated; counting from zero, the URI stem is field 7 and the user-agent is field 10
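To make step 4 concrete, here's an illustrative (shortened) log line and the zero-indexed field extraction. Real CloudFront standard-log lines carry more trailing fields, and the user-agent arrives URL-encoded, so decode it before matching if your patterns contain spaces:

```python
from urllib.parse import unquote

# A shortened, made-up CloudFront standard-log line (tab-separated):
# date, time, edge, bytes, ip, method, host, uri-stem, status, referer, user-agent
sample = "\t".join([
    "2026-04-20", "12:00:00", "IAD89-C1", "512", "1.2.3.4", "GET",
    "easerva.com", "/sitemap.xml", "200", "-",
    "Mozilla/5.0%20(compatible;%20Bingbot/2.0)",
])

fields = sample.split("\t")
uri = fields[7]                    # cs-uri-stem
user_agent = unquote(fields[10])   # cs(User-Agent), URL-encoded in the raw log
```

Simple substring matches like `"Bingbot" in user_agent` work even without decoding, since the bot names themselves contain no spaces.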

Standard logging is free. Real-time via Kinesis costs money and isn't needed at low traffic.

Source for my tracker is on GitHub if you want to fork it instead of writing your own.

What I'm watching next

The transition from robots.txt polling to actual content crawling — when ClaudeBot and OAI-SearchBot start fetching /providers/... URLs instead of just /robots.txt. That's the signal the site has moved from "discovered" to "being indexed." I'll post a 30-day follow-up.

If you're tracking AI bot patterns on your own site, I'd love to hear what you're seeing.