
Two years ago, "monitoring crawler traffic" meant watching Googlebot. Today it means watching a fleet: GPTBot, OAI-SearchBot, ClaudeBot, Claude-Web, PerplexityBot, Perplexity-User, Google-Extended, Bytespider, Amazonbot, Applebot-Extended, CCBot, and a dozen more that change their user-agent string every quarter. Some are crawling to cite you in AI answers. Some are crawling to train on you. Most of your existing monitoring stack treats them identically: as traffic that does not convert.
This guide is the practical playbook for monitoring AI crawler traffic. How to identify each bot, how to verify it is the bot it claims to be, how to read its behavior, what to allow versus block in robots.txt, and how to alert when an AI crawler is the actual cause of your next 5xx spike.
This is the inbound side of AI. For the outbound side - whether your brand appears in AI answers - see AI Search Visibility Monitoring.
Why AI Crawler Monitoring Is Suddenly A Topic
Three things changed at once:
- AI search products started citing live web content. ChatGPT Search, Perplexity, Google AI Overviews, Bing Copilot — all fetch pages in real time to ground answers. If their crawler cannot reach you, you do not get cited.
- AI labs started crawling to train models. Separate user agents, sometimes overlapping IP ranges, often more aggressive than search crawlers, and not always declared.
- Volume scaled. Real-time AI search traffic plus training crawls plus agent-style "browse the web" fetches now adds up to a meaningful share of bot requests, and shows up as latency spikes, 429s, and 5xx storms on under-provisioned sites.
If your site logs do not separate AI crawlers from human traffic, you are flying blind on three problems at once: SEO/AEO visibility, infrastructure load, and content licensing.
For broader robots.txt and sitemap regressions that block crawlers entirely, see Sitemap & robots.txt Monitoring.
The AI Crawler Roster (As Of 2026)
User agents change. Always verify against the vendor's docs before alerting on a specific string. Categories below reflect the declared intent in each vendor's documentation - not a guarantee.
OpenAI
| User agent | Purpose | Documented? |
|---|---|---|
| GPTBot | Model training | Yes |
| OAI-SearchBot | ChatGPT Search results (live citations) | Yes |
| ChatGPT-User | On-demand user-initiated fetches | Yes |
OpenAI publishes IP ranges for verification. Treat all three as separate bots in monitoring - blocking GPTBot does not block search citations.
Anthropic
| User agent | Purpose | Documented? |
|---|---|---|
| ClaudeBot | Model training | Yes |
| Claude-Web | Live answer grounding | Yes |
| anthropic-ai | Legacy / unspecified | Sometimes seen |
Perplexity
| User agent | Purpose | Documented? |
|---|---|---|
| PerplexityBot | Indexing for Perplexity Search | Yes |
| Perplexity-User | On-demand user query fetches (sometimes ignores robots.txt) | Disputed |
Perplexity has been publicly accused of fetching pages despite robots.txt blocks via "user-driven" fetches. Monitor both PerplexityBot and Perplexity-User separately and alert if Perplexity-User continues hitting paths you have disallowed.
Google
| User agent | Purpose |
|---|---|
| Googlebot | Classic search index |
| Google-Extended | Opt-out flag for Gemini training (no actual crawler with this exact name) |
| GoogleOther | Misc product crawls |
Google-Extended is unusual: it is only a robots.txt token, not a real user agent string in logs. You opt out of Gemini training by adding a User-agent: Google-Extended group with Disallow: / while keeping Googlebot allowed.
Microsoft
| User agent | Purpose |
|---|---|
| Bingbot | Bing search index |
| Bing-Copilot (varies) | Live Copilot grounding fetches |
Meta
| User agent | Purpose |
|---|---|
| Meta-ExternalAgent | Meta AI training |
| Meta-ExternalFetcher | On-demand fetches |
| FacebookExternalHit | Link previews (not AI but commonly mis-classified) |
Apple
| User agent | Purpose |
|---|---|
| Applebot | Spotlight / Siri search |
| Applebot-Extended | Opt-out flag for Apple AI training |
Amazon
| User agent | Purpose |
|---|---|
| Amazonbot | Alexa answers and Amazon AI training |
Others worth watching
- CCBot - Common Crawl, the upstream dataset many models train on.
- Bytespider - ByteDance / TikTok. Aggressive, not always well-behaved on robots.txt.
- Diffbot - knowledge-graph extractor used downstream by AI products.
- cohere-ai, Cohere-AI - Cohere training.
- YouBot - You.com search.
- Mistral-Crawl - Mistral training (sporadic).
Maintain this list in code, not in your head. Bots change names. New ones appear monthly.
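One way to keep the roster in code is a small table of user-agent tokens mapped to a family and a declared purpose. The sketch below is illustrative only: the BotRule type, the classify function, and the subset of bots chosen are assumptions, and the purposes mirror the tables above rather than any vendor guarantee.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BotRule:
    token: str    # substring to look for in the User-Agent header
    family: str   # value for the bot_family log field
    purpose: str  # train | search | user-fetch | index


# Illustrative subset of the roster above. If one token is a substring of
# another, list the longer token first so the first match is the right one.
BOT_RULES = [
    BotRule("OAI-SearchBot", "oai-searchbot", "search"),
    BotRule("ChatGPT-User", "chatgpt-user", "user-fetch"),
    BotRule("GPTBot", "gptbot", "train"),
    BotRule("ClaudeBot", "claudebot", "train"),
    BotRule("Claude-Web", "claude-web", "search"),
    BotRule("PerplexityBot", "perplexitybot", "index"),
    BotRule("Perplexity-User", "perplexity-user", "user-fetch"),
    BotRule("Bytespider", "bytespider", "train"),
    BotRule("CCBot", "ccbot", "train"),
    BotRule("Googlebot", "googlebot", "index"),
]


def classify(user_agent: str) -> Optional[BotRule]:
    """Return the first rule whose token appears in the user agent, else None."""
    ua = user_agent.lower()
    for rule in BOT_RULES:
        if rule.token.lower() in ua:
            return rule
    return None
```

A match here only says what the request claims to be. Treat the result as unverified until one of the checks in the next section passes.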
Verify Before You Trust The User Agent
User-agent strings are trivially spoofable. A scraper claiming to be GPTBot may be a competitor scraping your prices. Verify with one of:
1. Reverse DNS + forward DNS lookup
The classic Googlebot pattern, now adopted by OpenAI:
# 1. Reverse DNS the source IP
host 20.171.207.34
# -> ptr record ending in .openai.com or similar
# 2. Forward DNS the result back to an IP
host the-ptr-record-you-got.openai.com
# 3. Compare to the original IP — must match
If forward DNS does not match the source IP, the request is spoofed regardless of user agent.
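The same round trip can be sketched in Python with the standard socket module. The function name verify_rdns and the injectable resolver parameters are assumptions made so the logic can be tested without network access. The hostname suffixes are yours to supply from each vendor's documentation, and the final comparison is a plain string match, so treat IPv6 handling as out of scope for this sketch.

```python
import socket
from typing import Callable, Iterable, List


def _reverse(ip: str) -> str:
    # PTR lookup; raises an OSError subclass when no record exists.
    return socket.gethostbyaddr(ip)[0]


def _forward(hostname: str) -> List[str]:
    # Forward lookup; returns every address the name resolves to.
    return [info[4][0] for info in socket.getaddrinfo(hostname, None)]


def verify_rdns(
    ip: str,
    allowed_suffixes: Iterable[str],
    reverse: Callable[[str], str] = _reverse,
    forward: Callable[[str], List[str]] = _forward,
) -> bool:
    """True only if PTR(ip) ends in an allowed suffix and resolves back to ip."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:
        return False
    # Suffixes should start with a dot so "evilgooglebot.com" cannot match.
    if not any(hostname.endswith(suffix) for suffix in allowed_suffixes):
        return False
    try:
        return ip in forward(hostname)
    except OSError:
        return False
```

DNS lookups are slow relative to request handling, so in practice the result would be cached per source IP rather than computed inline on every request.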
2. Published IP ranges
OpenAI, Perplexity, and Microsoft publish CIDR ranges for their crawlers; other vendors vary, so check each vendor's documentation. Pull the published ranges daily, store them, and tag requests whose source IP falls inside a known range as verified.
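Once the ranges are stored, the membership test is a few lines with the standard ipaddress module. This is a minimal sketch: load_ranges and family_for_ip are hypothetical names, and the CIDRs in the usage note are documentation-reserved placeholders, not real vendor ranges.

```python
import ipaddress
from typing import Dict, Iterable, List, Optional, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]


def load_ranges(cidrs_by_family: Dict[str, Iterable[str]]) -> Dict[str, List[Network]]:
    """Parse CIDR strings (as fetched from vendor-published lists) into networks."""
    return {
        family: [ipaddress.ip_network(cidr, strict=False) for cidr in cidrs]
        for family, cidrs in cidrs_by_family.items()
    }


def family_for_ip(src_ip: str, ranges: Dict[str, List[Network]]) -> Optional[str]:
    """Return the bot family whose published ranges contain src_ip, else None."""
    addr = ipaddress.ip_address(src_ip)
    for family, networks in ranges.items():
        # An IPv4 address is never "in" an IPv6 network (and vice versa),
        # so mixed lists are safe to scan.
        if any(addr in net for net in networks):
            return family
    return None
```

For example, with ranges = load_ranges({"gptbot": ["192.0.2.0/24"]}), a request from 192.0.2.77 maps to gptbot, and bot_verified is true only when that family matches the one claimed by the user agent.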
3. HTTPS verifier endpoints
Some providers offer signed verifier endpoints. Useful for higher-trust signals when blocking is a destructive action.
Anything that does not pass at least one verification path is "claimed bot, unverified" - log it, do not trust it, and watch for misbehavior.
For the broader pattern of distinguishing real traffic from automation, see DDoS Detection & Traffic Spike Monitoring.
What To Log Per Request
Add these fields to your access logs (or analytics pipeline) for every request:
- ua_string - full user agent
- bot_family - gptbot, claudebot, perplexitybot, googlebot, unknown, etc.
- bot_purpose - train, search, user-fetch, index, unknown
- bot_verified - boolean from rDNS + IP-range check
- src_ip
- path, status, bytes, response_time_ms
- referer
- accept_language
- robots_decision - what robots.txt would have said for this UA + path
The bot_verified flag is the single most important field. Aggregate everything else against it.
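As a rough illustration, the fields above could be carried as one structured record per request and written as a JSON line. The BotLogRecord class and to_json_line helper are assumptions for this sketch; the field names come from the list above, and the types are guesses.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class BotLogRecord:
    ua_string: str
    bot_family: str
    bot_purpose: str
    bot_verified: bool
    src_ip: str
    path: str
    status: int
    bytes: int
    response_time_ms: float
    referer: str
    accept_language: str
    robots_decision: str  # "allow" or "disallow" for this UA + path


def to_json_line(record: BotLogRecord) -> str:
    """Serialize one record as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)
```

Keeping bot_verified as a real boolean rather than a string makes it straightforward to group and filter on in most log pipelines.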
robots.txt: Opt-In vs Opt-Out
robots.txt is the contract layer. AI bots that respect it will honor your rules; the ones that ignore it are exactly the ones you want to detect and alert on.
A pragmatic default for most sites:
# Allow live citations and search indexing
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: Perplexity-User
Allow: /
User-agent: Claude-Web
Allow: /
# Disallow model training crawlers
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: cohere-ai
Disallow: /
Sitemap: https://example.com/sitemap.xml
This template:
- Allows real-time citation fetches from ChatGPT, Perplexity, Claude, Bing Copilot.
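- Blocks bulk training crawls. PerplexityBot is grouped here even though the roster above lists it as a search indexer; move it to the allow group if Perplexity citations matter to you.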
- Leaves classic search (Googlebot, Bingbot) untouched.
Adjust to your strategy. The point is: be deliberate, version the file, and monitor for unintended regressions.
For monitoring robots.txt itself - regressions, missing files, unintended Disallow: / after a deploy - see Sitemap & robots.txt Monitoring.
Monitoring Patterns
Volume by bot family
Daily and hourly counts per bot_family, segmented by bot_purpose. New spikes worth alerting on:
- A bot family appearing for the first time.
- An order-of-magnitude jump in request count.
- A new path becoming the top destination.
Verified vs unverified
If 90% of requests claiming to be GPTBot fail rDNS verification, you have a scraper problem, not a crawler problem. Alert when unverified rate crosses 25% on any single bot family over 1h.
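The rate itself is a simple ratio per family over the window. A minimal sketch, assuming the input is a list of (bot_family, bot_verified) pairs already filtered to the last hour; both function names are hypothetical, and the 25% default mirrors the threshold above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def unverified_rates(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Fraction of requests per bot family that failed verification."""
    totals: Dict[str, int] = defaultdict(int)
    unverified: Dict[str, int] = defaultdict(int)
    for family, verified in records:
        totals[family] += 1
        if not verified:
            unverified[family] += 1
    return {family: unverified[family] / totals[family] for family in totals}


def families_over_threshold(rates: Dict[str, float], threshold: float = 0.25) -> List[str]:
    """Families whose unverified rate is strictly above the threshold."""
    return sorted(family for family, rate in rates.items() if rate > threshold)
```

On low-volume families a handful of spoofed requests can swing the ratio, so a minimum request count per window is worth adding before paging anyone.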
Status code distribution per bot
| Bot | 2xx | 3xx | 4xx | 5xx |
|---|---|---|---|---|
| GPTBot | 95% | 3% | 2% | 0% |
| ClaudeBot | 92% | 3% | 4% | 1% |
| PerplexityBot | 88% | 4% | 6% | 2% |
A jump in 5xx for any bot family typically means you are the bottleneck, not them. Alert.
Latency per bot
p50, p95, p99 response time per bot_family. AI crawlers tend to hammer specific paths; latency rising on the path they are crawling tells you whether your origin is keeping up.
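If your log store does not compute percentiles for you, they can be approximated directly from raw samples. The sketch below uses the nearest-rank definition, which is one of several common conventions; the function names and the (bot_family, response_time_ms) input shape are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def percentile(sorted_values: List[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    rank = (pct * len(sorted_values) + 99) // 100  # integer ceiling of pct * n / 100
    return sorted_values[rank - 1]


def latency_summary(samples: Iterable[Tuple[str, float]]) -> Dict[str, Dict[str, float]]:
    """p50 / p95 / p99 response time per bot family."""
    by_family: Dict[str, List[float]] = defaultdict(list)
    for family, response_time_ms in samples:
        by_family[family].append(response_time_ms)
    summary = {}
    for family, values in by_family.items():
        values.sort()
        summary[family] = {f"p{p}": percentile(values, p) for p in (50, 95, 99)}
    return summary
```

Grouping by (bot_family, path) instead of bot_family alone gives the per-path view described above, at the cost of sparser samples in each group.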
Robots.txt compliance
For every request from a bot whose robots.txt rules would have disallowed that path, log a robots_violation event. Aggregate by bot family. Perplexity-User has been the persistent offender here historically. Alert when a single bot family exceeds N violations per day.
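Python's standard urllib.robotparser can stand in for the "what would robots.txt have said" check. This is a sketch under assumptions: the helper names are hypothetical, and the standard-library parser uses simple first-match prefix rules, so its answers can differ from a given crawler's own parser on files with wildcards or overlapping Allow/Disallow rules.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been fetched or read from disk."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def robots_decision(parser: RobotFileParser, bot_token: str, path: str) -> str:
    """Value for the robots_decision log field."""
    return "allow" if parser.can_fetch(bot_token, path) else "disallow"


def is_robots_violation(parser: RobotFileParser, bot_token: str, path: str) -> bool:
    """True when a request hit a path the bot's rules disallow."""
    # Fetching robots.txt itself is expected even under "Disallow: /".
    if path == "/robots.txt":
        return False
    return not parser.can_fetch(bot_token, path)
```

Pass the bot's product token (for example GPTBot), not the full user-agent header, and evaluate against the robots.txt version that was live when the request arrived so a recent policy change does not create false violations.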
Cost / bandwidth attribution
Bytes served per bot family. Knowing that 12% of egress goes to model training crawls is a useful number for legal, finance, and capacity planning.
Sample Alert Rules
| Severity | Trigger | Window | Action |
|---|---|---|---|
| Critical | 5xx rate > 1% for any bot family | 5 min | Page on-call, snapshot top paths |
| Critical | New bot user agent crosses 100 req/min | 5 min | Page on-call (potential scraper) |
| High | Unverified rate > 25% for a known bot | 1 h | Notify SRE channel |
| High | robots.txt violations > 100 from any bot | 24 h | Notify SEO + legal |
| Medium | Bytes/day per bot > N (capacity budget) | 24 h | Notify platform |
| Info | New bot family observed | 1 h | Slack digest |
Tune thresholds to your traffic. For the alerting policy that keeps these from becoming pager fatigue, see Alert Fatigue.
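Rules like these are easier to review and tune when they live as data rather than scattered conditionals. The sketch below encodes three rows of the table; the AlertRule type, the metric key names, and the shape of the input (metrics keyed by window length in minutes) are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class AlertRule:
    severity: str
    name: str
    window_minutes: int
    fires: Callable[[Dict[str, float]], bool]


# Thresholds copied from the table above; metric keys are hypothetical.
RULES = [
    AlertRule("critical", "5xx rate > 1%", 5, lambda m: m.get("rate_5xx", 0.0) > 0.01),
    AlertRule("high", "unverified rate > 25%", 60, lambda m: m.get("unverified_rate", 0.0) > 0.25),
    AlertRule("high", "robots.txt violations > 100", 1440, lambda m: m.get("robots_violations", 0) > 100),
]


def evaluate(metrics_by_window: Dict[int, Dict[str, float]]) -> List[str]:
    """Return the rules that fire for one bot family, given metrics per window."""
    fired = []
    for rule in RULES:
        metrics = metrics_by_window.get(rule.window_minutes, {})
        if rule.fires(metrics):
            fired.append(f"{rule.severity}: {rule.name}")
    return fired
```

Each rule reads only the metrics for its own window, so a 5-minute 5xx burst and a 24-hour violation count are evaluated independently for the same bot family.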
Rate Limiting Without Breaking Citations
If GPTBot is genuinely overwhelming a path, rate-limit it - do not 5xx it. A 5xx tells the bot to retry; a 429 with Retry-After tells it to back off and try later. Returning broken responses to a search-citation bot can cost you the citation.
Pattern:
- Per-bot family token bucket (separate from anonymous-user limits).
- Higher limit for citation bots (OAI-SearchBot, Perplexity-User, Claude-Web).
- Lower limit for training bots (GPTBot, ClaudeBot, PerplexityBot).
- Always return 429 with Retry-After, never a generic 503.
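A per-family token bucket that produces the 429 and Retry-After pair might look like the sketch below. The limits are invented numbers, the names are hypothetical, and the state is in-process only; a real deployment would more likely keep the buckets at the edge or in a shared store.

```python
import math
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class TokenBucket:
    rate_per_sec: float
    capacity: float
    tokens: float
    updated_at: float

    def take(self, now: float) -> float:
        """Return 0.0 if a token was taken, else seconds until one is available."""
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_sec)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return 0.0
        return (1.0 - self.tokens) / self.rate_per_sec


# Hypothetical limits as (refill per second, burst size), keyed by bot_purpose.
LIMITS = {"search": (10.0, 20.0), "user-fetch": (10.0, 20.0), "train": (1.0, 2.0)}
DEFAULT_LIMIT = (1.0, 2.0)


def check(buckets: Dict[str, TokenBucket], family: str, purpose: str, now: float) -> Tuple[int, Dict[str, str]]:
    """Return (200, {}) to let the request through, or (429, headers) to throttle it."""
    if family not in buckets:
        rate, capacity = LIMITS.get(purpose, DEFAULT_LIMIT)
        buckets[family] = TokenBucket(rate, capacity, capacity, now)
    wait = buckets[family].take(now)
    if wait == 0.0:
        return 200, {}
    # Retry-After takes whole seconds, so round the wait up.
    return 429, {"Retry-After": str(math.ceil(wait))}
```

Passing the clock in as an argument (time.monotonic() in production) keeps the bucket deterministic under test.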
For the broader rate-limit pattern - thresholds, headers, retry semantics - see API Rate Limit Monitoring.
Edge Cases You Will Hit
"User-driven" fetches that ignore robots.txt
Perplexity-User and ChatGPT-User explicitly model themselves as "the user is asking, not the bot." That can mean different rules apply. Monitor what they actually do, not what the documentation says.
Spoofed Googlebot for SEO cloaking
Scrapers pretend to be Googlebot to bypass paywalls or get cleaner HTML. Reverse-DNS verification catches this. Combine with SEO Cloaking Detection.
Bot traffic during JavaScript-heavy renders
If your site is a SPA, the bot may render JS. That changes performance and cost. See JavaScript SEO & Googlebot Rendering.
User-agent: * interactions
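A Disallow: / under User-agent: * does not apply to an AI bot that has its own group: a crawler that follows the Robots Exclusion Protocol obeys only the most specific group that names it and ignores the * group. Repeat every rule you want enforced in each bot's own group.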
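This interaction is easy to check locally before a deploy. The sketch below uses Python's urllib.robotparser, which applies the same group-selection behavior described here; the sample file and the allowed helper are made up for illustration, and individual crawlers may still parse edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# A catch-all block plus a bot-specific group that only restricts /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: GPTBot
Disallow: /private/
"""


def allowed(robots_txt: str, bot_token: str, path: str) -> bool:
    """Evaluate robots.txt content for one bot token and one path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(bot_token, path)
```

With this file, allowed(ROBOTS_TXT, "GPTBot", "/blog/post") is true: GPTBot's own group wins and the catch-all Disallow: / is ignored for it. A bot with no group of its own is still blocked everywhere.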
CDN / WAF caching
If Cloudflare or your WAF serves cached responses to bots, your origin logs will under-count. Capture bot identity at the edge (Cloudflare Workers, Fastly VCL, etc.), not just at the origin.
Security headers and bot identity
Aggressive bot blocking can leak HTML structure or break previews. Validate your headers and content policies. See HTTP Security Headers Monitoring.
A Reference Dashboard
The bare minimum AI crawler dashboard:
- Top: total bot requests vs human requests today, week-over-week.
- By bot family: stacked area chart of requests/hour, last 14 days.
- Compliance: robots.txt violations per bot, last 24 h.
- Verification: verified vs unverified rate per bot, last 7 d.
- Health: 5xx and p95 latency per bot, last 24 h.
- Top paths per bot: the URLs each AI crawler is hammering.
- New bots: any user agent seen for the first time in the last 7 days.
Tie this to alerts. A dashboard without alerts is a museum exhibit.
AI Crawler Monitoring Checklist
- Per-request bot_family, bot_purpose, bot_verified fields logged
- Reverse DNS / IP-range verification implemented for top 5 bot families
- robots.txt reflects a clear opt-in vs opt-out policy for training vs citation
- robots.txt file is monitored for regressions on deploy
- Volume, 5xx rate, latency, and bytes alerts per bot family
- robots_violation events tracked and alerted when threshold exceeded
- Rate limits configured separately for citation and training bots
- Dashboard reviewed weekly; new bot families investigated
- Edge / CDN identity capture (not just origin) so cache hits are counted
- Legal / SEO stakeholders looped into training-crawler policy decisions
How Webalert Helps
Webalert focuses on the external side of crawler monitoring - making sure crawlers can actually reach your site and that what they fetch is what you intend:
- External HTTP monitoring - Detect when AI-traffic spikes cause 5xx for real users, regardless of internal dashboards.
- Content validation - Catch the day a deploy serves your homepage as a login page to GPTBot. See Response Body Validation Monitoring.
- robots.txt monitoring - Alert when robots.txt changes unexpectedly or returns 5xx. See Sitemap & robots.txt Monitoring.
- Multi-region checks - Some AI bots crawl from a specific region; verify you respond healthily from there.
- TLS and DNS checks - Bots will not crawl a site with broken certs or DNS issues; catch these before crawl budget is wasted.
- Status page - Communicate downtime to bots and humans alike, via a real status page that you control.
Example Webalert check tuned for crawler health:
- URL: https://example.com/robots.txt
- Method: GET
- Expected status: 200
- Must contain: User-agent: GPTBot, Sitemap:
- Must not contain: Disallow: / directly under User-agent: * unless intentional
- Response time: under 800ms
- Region: US + EU
- Alert: immediate on any change to body
A second check on your sitemap, a third on a representative page that AI bots should be able to reach, and you have closed the loop.
Summary
AI crawler monitoring is now a real category. The bots are many, the rules are evolving, and your existing analytics most likely treats them as noise. Log per bot, verify identity, encode a clear robots.txt policy, alert on volume / 5xx / compliance / verification, and rate-limit citation bots differently from training bots.
The teams that figure this out first will know - within hours - whether ChatGPT Search is citing them, whether Claude is training on them, and whether the next 5xx storm came from a real user surge or from Bytespider discovering their site.