AI Crawler Monitoring: Track GPTBot, ClaudeBot & PerplexityBot Traffic

Two years ago, "monitoring crawler traffic" meant watching Googlebot. Today it means watching a fleet: GPTBot, OAI-SearchBot, ClaudeBot, Claude-Web, PerplexityBot, Perplexity-User, Google-Extended, Bytespider, Amazonbot, Applebot-Extended, CCBot, and a dozen more that change their user-agent string every quarter. Some are crawling to cite you in AI answers. Some are crawling to train on you. Most of your existing monitoring stack treats them identically: as traffic that does not convert.

This guide is the practical playbook for monitoring AI crawler traffic: how to identify each bot, how to verify it is the bot it claims to be, how to read its behavior, what to allow versus block in robots.txt, and how to alert when an AI crawler is the actual cause of your next 5xx spike.

This is the inbound side of AI. For the outbound side - whether your brand appears in AI answers - see AI Search Visibility Monitoring.

Why AI Crawler Monitoring Is Suddenly A Topic

Three things changed at once:

AI search products started citing live web content. ChatGPT Search, Perplexity, Google AI Overviews, Bing Copilot — all fetch pages in real time to ground answers. If their crawler cannot reach you, you do not get cited.
AI labs started crawling to train models. Separate user agents, sometimes overlapping IP ranges, often more aggressive than search crawlers, and not always declared.
Volume scaled. Real-time AI search traffic plus training crawls plus agent-style "browse the web" fetches now adds up to a meaningful share of bot requests, and shows up as latency spikes, 429s, and 5xx storms on under-provisioned sites.

If your site logs do not separate AI crawlers from human traffic, you are flying blind on three problems at once: SEO/AEO visibility, infrastructure load, and content licensing.

For broader robots.txt and sitemap regressions that block crawlers entirely, see Sitemap & robots.txt Monitoring.

The AI Crawler Roster (As Of 2026)

User agents change. Always verify against the vendor's docs before alerting on a specific string. Categories below reflect the declared intent in each vendor's documentation - not a guarantee.

OpenAI

User agent	Purpose	Documented?
`GPTBot`	Model training	Yes
`OAI-SearchBot`	ChatGPT Search results (live citations)	Yes
`ChatGPT-User`	On-demand user-initiated fetches	Yes

OpenAI publishes IP ranges for verification. Treat all three as separate bots in monitoring - blocking GPTBot does not block search citations.

Anthropic

User agent	Purpose	Documented?
`ClaudeBot`	Model training	Yes
`Claude-Web`	Live answer grounding	Yes
`anthropic-ai`	Legacy / unspecified	Sometimes seen

Perplexity

User agent	Purpose	Documented?
`PerplexityBot`	Indexing for Perplexity Search	Yes
`Perplexity-User`	On-demand user query fetches (sometimes ignores robots.txt)	Disputed

Perplexity has been publicly accused of fetching pages despite robots.txt blocks via "user-driven" fetches. Monitor both PerplexityBot and Perplexity-User separately and alert if Perplexity-User continues hitting paths you have disallowed.

Google

User agent	Purpose
`Googlebot`	Classic search index
`Google-Extended`	Opt-out flag for Gemini training (no actual crawler with this exact name)
`GoogleOther`	Misc product crawls

Google-Extended is unusual: it is only a robots.txt token, not a real user agent string in logs. You opt out of Gemini training by adding a User-agent: Google-Extended group with Disallow: /, while keeping Googlebot allowed.

Microsoft

User agent	Purpose
`Bingbot`	Bing search index
`Bing-Copilot` (varies)	Live Copilot grounding fetches

Meta

User agent	Purpose
`Meta-ExternalAgent`	Meta AI training
`Meta-ExternalFetcher`	On-demand fetches
`FacebookExternalHit`	Link previews (not AI but commonly mis-classified)

Apple

User agent	Purpose
`Applebot`	Spotlight / Siri search
`Applebot-Extended`	Opt-out flag for Apple AI training

Amazon

User agent	Purpose
`Amazonbot`	Alexa answers and Amazon AI training

Others worth watching

CCBot — Common Crawl, the upstream dataset many models train on.
Bytespider — ByteDance / TikTok. Aggressive, not always well-behaved on robots.txt.
Diffbot — knowledge-graph extractor used downstream by AI products.
cohere-ai, Cohere-AI — Cohere training.
YouBot — You.com search.
Mistral-Crawl — Mistral training (sporadic).

Maintain this list in code, not in your head. Bots change names. New ones appear monthly.
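
One way to keep the roster in code is a small lookup table that maps a user-agent substring to a family and a purpose. The sketch below is illustrative only: the names `BOT_REGISTRY`, `BotInfo`, and `classify_user_agent` are hypothetical, the entries cover just part of the roster above, and the purpose labels follow this article's categories rather than any vendor guarantee. Google-Extended is left out because, as noted above, it never appears in logs.

```python
from typing import NamedTuple


class BotInfo(NamedTuple):
    family: str
    purpose: str  # train, search, user-fetch, index, unknown


# Lowercase substring -> classification. Partial list taken from the roster
# above; extend it from each vendor's documentation.
BOT_REGISTRY = {
    "gptbot": BotInfo("gptbot", "train"),
    "oai-searchbot": BotInfo("oai-searchbot", "search"),
    "chatgpt-user": BotInfo("chatgpt-user", "user-fetch"),
    "claudebot": BotInfo("claudebot", "train"),
    "claude-web": BotInfo("claude-web", "search"),
    "perplexitybot": BotInfo("perplexitybot", "index"),
    "perplexity-user": BotInfo("perplexity-user", "user-fetch"),
    "googlebot": BotInfo("googlebot", "index"),
}

UNKNOWN = BotInfo("unknown", "unknown")


def classify_user_agent(ua_string: str) -> BotInfo:
    """Return the first registry entry whose token occurs in the UA string."""
    ua = ua_string.lower()
    for token, info in BOT_REGISTRY.items():
        if token in ua:
            return info
    return UNKNOWN
```

This classifies the claimed identity only. It says nothing about whether the request really came from that vendor.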

Verify Before You Trust The User Agent

User-agent strings are trivially spoofable. A scraper claiming to be GPTBot may be a competitor scraping your prices. Verify with one of:

1. Reverse DNS + forward DNS lookup

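The classic Googlebot pattern. It applies to any vendor whose crawler IPs carry PTR records under a domain the vendor documents. The OpenAI hostname below is a placeholder, so confirm the expected domain in the vendor's docs, or fall back to published IP ranges: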

# 1. Reverse DNS the source IP
host 20.171.207.34
# -> ptr record ending in .openai.com or similar

# 2. Forward DNS the result back to an IP
host the-ptr-record-you-got.openai.com

# 3. Compare to the original IP — must match

If forward DNS does not match the source IP, the request is spoofed regardless of user agent.
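
The same three steps can be sketched in Python with the standard `socket` module. This is a minimal sketch, not a hardened implementation: `verify_by_dns` is a hypothetical name, the allowed hostname suffixes are an assumption you would fill in from each vendor's documentation, and the resolver functions are injectable so the logic can be exercised without live DNS.

```python
import socket


def _reverse(ip: str) -> str:
    # PTR lookup: returns the primary hostname for the IP.
    return socket.gethostbyaddr(ip)[0]


def _forward(hostname: str) -> set:
    # A/AAAA lookup: returns every address the hostname resolves to.
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}


def verify_by_dns(ip, allowed_suffixes, reverse=_reverse, forward=_forward) -> bool:
    """True only if the PTR name is under an allowed suffix AND resolves back to ip."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    # Suffixes should start with a dot so "evilgooglebot.com" cannot match.
    if not any(hostname.endswith(suffix) for suffix in allowed_suffixes):
        return False
    try:
        return ip in forward(hostname)
    except OSError:
        return False
```

In production you would likely cache results per IP, since doing two DNS lookups on every request adds latency.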

2. Published IP ranges

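Several vendors publish CIDR ranges for their crawlers, OpenAI, Perplexity, and Microsoft (for Bingbot) among them. Not every vendor does, so confirm against each vendor's current documentation. Pull the lists daily, store them, and tag requests whose source IP falls inside a known range as verified.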
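
Once the ranges are stored, the membership test itself is short with the standard `ipaddress` module. The sketch below assumes you have already fetched and parsed a vendor's list into CIDR strings. The function names are hypothetical, and the ranges in the usage example are reserved documentation ranges, not real crawler ranges.

```python
import ipaddress


def load_networks(cidrs):
    """Parse CIDR strings (IPv4 or IPv6) into network objects."""
    return [ipaddress.ip_network(cidr, strict=False) for cidr in cidrs]


def ip_in_ranges(ip: str, networks) -> bool:
    """True if ip falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    # Compare only against networks of the same IP version.
    return any(addr in net for net in networks if net.version == addr.version)


# Usage with placeholder ranges (documentation-only address space).
example_networks = load_networks(["192.0.2.0/24", "2001:db8::/32"])
is_verified = ip_in_ranges("192.0.2.77", example_networks)
```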

3. HTTPS verifier endpoints

Some providers offer signed verifier endpoints. Useful for higher-trust signals when blocking is a destructive action.

Anything that does not pass at least one verification path is "claimed bot, unverified" - log it, do not trust it, and watch for misbehavior.

For the broader pattern of distinguishing real traffic from automation, see DDoS Detection & Traffic Spike Monitoring.

What To Log Per Request

Add these fields to your access logs (or analytics pipeline) for every request:

ua_string - full user agent
bot_family - gptbot, claudebot, perplexitybot, googlebot, unknown, etc.
bot_purpose - train, search, user-fetch, index, unknown
bot_verified - boolean from rDNS + IP-range check
src_ip
path, status, bytes, response_time_ms
referer
accept_language
robots_decision - what robots.txt would have said for this UA + path

The bot_verified flag is the single most important field. Aggregate everything else against it.
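
As a rough illustration, the field list above maps onto a small record type that serializes to one JSON line per request. The class name `BotRequestLog` and the allowed values for `robots_decision` are assumptions made for this sketch. The field names themselves come from the list above.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class BotRequestLog:
    ua_string: str
    bot_family: str
    bot_purpose: str
    bot_verified: bool
    src_ip: str
    path: str
    status: int
    bytes: int  # response size; name kept to match the field list above
    response_time_ms: float
    referer: Optional[str] = None
    accept_language: Optional[str] = None
    robots_decision: str = "unknown"  # assumed values: allow, disallow, unknown

    def to_json(self) -> str:
        """Serialize to a single JSON line with stable key order."""
        return json.dumps(asdict(self), sort_keys=True)
```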

robots.txt: Opt-In vs Opt-Out

robots.txt is the contract layer. AI bots that respect it will honor your rules; the ones that ignore it are exactly the ones you want to detect and alert on.

A pragmatic default for most sites:

# Allow live citations and search indexing
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Claude-Web
Allow: /

# Disallow model training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

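# PerplexityBot is listed in the roster as a search-indexing crawler, not a
# training crawler. Blocking it is a visibility trade-off: delete this group
# if you want to stay in the Perplexity Search index.
User-agent: PerplexityBot
Disallow: /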

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: cohere-ai
Disallow: /

Sitemap: https://example.com/sitemap.xml

This template:

Allows real-time citation fetches from ChatGPT, Perplexity, Claude, Bing Copilot.
Blocks bulk training crawls.
Leaves classic search (Googlebot, Bingbot) untouched.

Adjust to your strategy. The point is: be deliberate, version the file, and monitor for unintended regressions.

For monitoring robots.txt itself - regressions, missing files, unintended Disallow: / after a deploy - see Sitemap & robots.txt Monitoring.

Monitoring Patterns

Volume by bot family

Daily and hourly counts per bot_family, segmented by bot_purpose. New spikes worth alerting on:

A bot family appearing for the first time.
An order-of-magnitude jump in request count.
A new path becoming the top destination.
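
The first two signals in that list can be sketched as a comparison between two count snapshots, for example yesterday's and today's requests per bot family. The function name and the default factor of 10 (one order of magnitude) are assumptions for this sketch, and the third signal (a new top path) is not covered here.

```python
def volume_alerts(previous: dict, current: dict, jump_factor: float = 10.0) -> list:
    """Compare per-family request counts and flag new families and large jumps."""
    alerts = []
    for family, count in sorted(current.items()):
        before = previous.get(family, 0)
        if before == 0 and count > 0:
            alerts.append((family, "first_seen"))
        elif before > 0 and count >= before * jump_factor:
            alerts.append((family, "volume_jump"))
    return alerts
```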

Verified vs unverified

If 90% of requests claiming to be GPTBot fail verification (rDNS or IP range), you have a scraper problem, not a crawler problem. Alert when the unverified rate crosses 25% on any single bot family over 1h.
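
A minimal sketch of that check, assuming each log record has already been reduced to a `(bot_family, bot_verified)` pair for the window being evaluated. The function names are hypothetical and the 25% default mirrors the threshold suggested above.

```python
from collections import defaultdict


def unverified_rates(records) -> dict:
    """records: iterable of (bot_family, bot_verified) pairs for one window."""
    totals = defaultdict(int)
    unverified = defaultdict(int)
    for family, verified in records:
        totals[family] += 1
        if not verified:
            unverified[family] += 1
    return {family: unverified[family] / totals[family] for family in totals}


def families_over_threshold(records, threshold: float = 0.25) -> list:
    """Bot families whose unverified share exceeds the threshold."""
    rates = unverified_rates(records)
    return sorted(family for family, rate in rates.items() if rate > threshold)
```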

Status code distribution per bot

Bot	2xx	3xx	4xx	5xx
GPTBot	95%	3%	2%	0%
ClaudeBot	92%	3%	4%	1%
PerplexityBot	88%	4%	6%	2%

A jump in 5xx for any bot family typically means your origin is the bottleneck, not the crawler. Alert on it.
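
A table like the one above can be derived from raw logs by bucketing status codes into classes per bot family. This sketch assumes records arrive as `(bot_family, status)` pairs, and the function name is hypothetical.

```python
from collections import Counter, defaultdict


def status_distribution(records) -> dict:
    """Return {family: {"2xx": share, "4xx": share, ...}} from (family, status) pairs."""
    counts = defaultdict(Counter)
    for family, status in records:
        counts[family][f"{status // 100}xx"] += 1
    return {
        family: {cls: n / sum(counter.values()) for cls, n in counter.items()}
        for family, counter in counts.items()
    }
```

Classes with no requests are simply absent from the result, so read them with `.get("5xx", 0.0)` when comparing against an alert threshold.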

Latency per bot

p50, p95, p99 response time per bot_family. AI crawlers tend to hammer specific paths; latency rising on the path they are crawling tells you whether your origin is keeping up.
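
For small volumes, the percentiles can be computed directly with the standard library. For high-traffic sites you would more likely rely on histogram metrics in your monitoring system. The sketch below assumes `(bot_family, response_time_ms)` pairs. The function name is hypothetical.

```python
import statistics
from collections import defaultdict


def latency_percentiles(records) -> dict:
    """Return {family: {"p50": ..., "p95": ..., "p99": ...}} in milliseconds."""
    by_family = defaultdict(list)
    for family, response_time_ms in records:
        by_family[family].append(response_time_ms)

    result = {}
    for family, samples in by_family.items():
        if len(samples) < 2:
            continue  # statistics.quantiles needs at least two data points
        # n=100 yields 99 cut points; index k-1 is the k-th percentile.
        cuts = statistics.quantiles(samples, n=100)
        result[family] = {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
    return result
```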

Robots.txt compliance

For every request from a bot whose robots.txt rules would have disallowed that path, log a robots_violation event. Aggregate by bot family. Perplexity-User has been the persistent offender here historically. Alert when a single bot family exceeds N violations per day.
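
One way to compute the robots.txt decision per request is Python's built-in `urllib.robotparser`. This is a sketch under assumptions: the inline robots.txt is a trimmed example, the helper names are hypothetical, and the standard-library parser's matching rules may differ in detail from how a given crawler interprets the file, so treat the result as an approximation.

```python
from urllib.robotparser import RobotFileParser

# Trimmed example policy; in practice, load the robots.txt you actually serve.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def robots_decision(parser: RobotFileParser, product_token: str, path: str) -> str:
    """Return "allow" or "disallow" for this bot token and path."""
    return "allow" if parser.can_fetch(product_token, path) else "disallow"


def is_violation(parser: RobotFileParser, product_token: str, path: str) -> bool:
    """A request that was actually served on a disallowed path is a violation."""
    return robots_decision(parser, product_token, path) == "disallow"
```

Emit a robots_violation event whenever `is_violation` is true for a request you observed, then count those events per bot family.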

Cost / bandwidth attribution

Bytes served per bot family. Knowing that 12% of egress goes to model training crawls is a useful number for legal, finance, and capacity planning.

Sample Alert Rules

Severity	Trigger	Window	Action
Critical	5xx rate > 1% for any bot family	5 min	Page on-call, snapshot top paths
Critical	New bot user agent crosses 100 req/min	5 min	Page on-call (potential scraper)
High	Unverified rate > 25% for a known bot	1 h	Notify SRE channel
High	robots.txt violations > 100 from any bot	24 h	Notify SEO + legal
Medium	Bytes/day per bot > N (capacity budget)	24 h	Notify platform
Info	New bot family observed	1 h	Slack digest

Tune thresholds to your traffic. For the alerting policy that keeps these from becoming pager fatigue, see Alert Fatigue.

Rate Limiting Without Breaking Citations

If GPTBot is genuinely overwhelming a path, rate-limit it - do not 5xx it. A 5xx tells the bot to retry; a 429 with Retry-After tells it to back off and try later. Returning broken responses to a search-citation bot can cost you the citation.

Pattern:

Per-bot family token bucket (separate from anonymous-user limits).
Higher limit for citation bots (OAI-SearchBot, Perplexity-User, Claude-Web).
Lower limit for training and bulk-indexing bots (GPTBot, ClaudeBot, PerplexityBot).
Always return 429 with Retry-After, never a generic 503.
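
The pattern above can be sketched as a token bucket that reports how long the caller should wait. The class and function names are hypothetical, the clock is injectable for testing, and the rate and capacity values are placeholders to tune against your own traffic. Keep one bucket per bot family, with larger values for citation bots.

```python
import math
import time


class TokenBucket:
    def __init__(self, rate_per_sec: float, capacity: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.updated = clock()

    def try_acquire(self):
        """Return (allowed, retry_after_seconds)."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True, 0
        # Whole seconds until one full token is available again.
        return False, math.ceil((1 - self.tokens) / self.rate)


def rate_limit_response(bucket: TokenBucket):
    """None means serve normally; otherwise (status, headers) for a 429."""
    allowed, retry_after = bucket.try_acquire()
    if allowed:
        return None
    return 429, {"Retry-After": str(retry_after)}
```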

For the broader rate-limit pattern - thresholds, headers, retry semantics - see API Rate Limit Monitoring.

Edge Cases You Will Hit

"User-driven" fetches that ignore robots.txt

Perplexity-User and ChatGPT-User explicitly model themselves as "the user is asking, not the bot." That can mean different rules apply. Monitor what they actually do, not what the documentation says.

Spoofed Googlebot for SEO cloaking

Scrapers pretend to be Googlebot to bypass paywalls or get cleaner HTML. Reverse-DNS verification catches this. Combine with SEO Cloaking Detection.

Bot traffic during JavaScript-heavy renders

If your site is a SPA, the bot may render JS. That changes performance and cost. See JavaScript SEO & Googlebot Rendering.

`User-agent: *` interactions

A Disallow: / under User-agent: * does not apply to a bot that has its own group. Under the Robots Exclusion Protocol (RFC 9309), a crawler follows the most specific group that names it and ignores the * group entirely. Be explicit per bot.
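
This precedence is easy to confirm with Python's standard `urllib.robotparser`. The robots.txt content and bot names below are illustrative, and the standard-library parser is only a stand-in for how a compliant crawler would read the file.

```python
from urllib.robotparser import RobotFileParser

# A blanket block for everyone, plus a narrower group for GPTBot.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: GPTBot
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# GPTBot has its own group, so only /private/ is off limits for it.
gptbot_public = parser.can_fetch("GPTBot", "/public/page")
gptbot_private = parser.can_fetch("GPTBot", "/private/page")
# A bot with no group of its own falls back to the * group.
other_public = parser.can_fetch("SomeOtherBot", "/public/page")
```

Here the blanket Disallow: / does not reach GPTBot at all. If the intent was to block it everywhere, its own group needs Disallow: / as well.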

CDN / WAF caching

If Cloudflare or your WAF serves cached responses to bots, your origin logs will under-count. Capture bot identity at the edge (Cloudflare Workers, Fastly VCL, etc.), not just at the origin.

Security headers and bot identity

Aggressive bot blocking can leak HTML structure or break previews. Validate your headers and content policies. See HTTP Security Headers Monitoring.

A Reference Dashboard

The bare minimum AI crawler dashboard:

Top: total bot requests vs human requests today, week-over-week.
By bot family: stacked area chart of requests/hour, last 14 days.
Compliance: robots.txt violations per bot, last 24 h.
Verification: verified vs unverified rate per bot, last 7 d.
Health: 5xx and p95 latency per bot, last 24 h.
Top paths per bot: the URLs each AI crawler is hammering.
New bots: any user agent seen for the first time in the last 7 days.

Tie this to alerts. A dashboard without alerts is a museum exhibit.

AI Crawler Monitoring Checklist

Per-request bot_family, bot_purpose, bot_verified fields logged
Reverse DNS / IP-range verification implemented for top 5 bot families
robots.txt reflects a clear opt-in vs opt-out policy for training vs citation
robots.txt file is monitored for regressions on deploy
Volume, 5xx rate, latency, and bytes alerts per bot family
robots_violation events tracked and alerted when threshold exceeded
Rate limits configured separately for citation and training bots
Dashboard reviewed weekly; new bot families investigated
Edge / CDN identity capture (not just origin) so cache hits are counted
Legal / SEO stakeholders looped into training-crawler policy decisions

How Webalert Helps

Webalert focuses on the external side of crawler monitoring - making sure crawlers can actually reach your site and that what they fetch is what you intend:

External HTTP monitoring - Detect when AI-traffic spikes cause 5xx for real users, regardless of internal dashboards.
Content validation - Catch the day a deploy serves your homepage as a login page to GPTBot. See Response Body Validation Monitoring.
robots.txt monitoring - Alert when robots.txt changes unexpectedly or returns 5xx. See Sitemap & robots.txt Monitoring.
Multi-region checks - Some AI bots crawl from a specific region; verify you respond healthily from there.
TLS and DNS checks - Bots will not crawl a site with broken certs or DNS issues; catch these before crawl budget is wasted.
Status page - Communicate downtime to bots and humans alike, via a real status page that you control.

Example Webalert check tuned for crawler health:

URL: https://example.com/robots.txt
Method: GET
Expected status: 200
Must contain: User-agent: GPTBot, Sitemap:
Must not contain: Disallow: / directly under User-agent: * unless intentional
Response time: under 800ms
Region: US + EU
Alert: immediate on any change to body

A second check on your sitemap, a third on a representative page that AI bots should be able to reach, and you have closed the loop.

Summary

AI crawler monitoring is now a real category. The bots are many, the rules are evolving, and your existing analytics most likely treats them as noise. Log per bot, verify identity, encode a clear robots.txt policy, alert on volume / 5xx / compliance / verification, and rate-limit citation bots differently from training bots.

The teams that figure this out first will know - within hours - whether ChatGPT Search is citing them, whether Claude is training on them, and whether the next 5xx storm came from a real user surge or from Bytespider discovering their site.
