Cloudflare Workers Monitoring: D1, R2, KV & Edge Errors

Cloudflare Workers, D1, R2, KV: Edge Monitoring

A Cloudflare Worker fails differently from a regular backend. There is no log file to tail. No SSH access. No top to run. The request lands at one of 300+ edge colos, executes inside an isolate for a few milliseconds, and either returns a response or doesn't — and if it doesn't, your only window into what happened is whatever you remembered to instrument before the failure.

The Cloudflare developer platform has gone from "edge compute" to a full application stack in the last few years: Workers for compute, D1 for SQLite-on-edge, R2 for object storage, KV for key-value, Durable Objects for stateful actors, Queues, Hyperdrive, Vectorize, Workers AI. Each one has its own limits, its own failure modes, and its own monitoring story — and the built-in Cloudflare dashboards give you a useful but lagging view.

This guide is the production-monitoring layer for the Cloudflare developer platform: the limits that matter, the failure modes you will actually hit, how to use the built-in observability (Workers Logs, Tail Workers, Trace Events, Workers Analytics Engine), and where you need external monitoring to fill the gaps the platform doesn't cover.

The Platform At a Glance

The Cloudflare developer stack in mid-2026 covers:

Workers — JavaScript/TypeScript/Rust/WASM compute running in V8 isolates at the edge
D1 — distributed SQLite, with read-replica regions and a single primary
R2 — S3-compatible object storage, zero egress fees
KV — eventually-consistent key-value with global read replication
Durable Objects — single-instance stateful objects with strong consistency
Queues — message queues with at-least-once delivery
Pub/Sub — MQTT-style messaging
Hyperdrive — connection pooling for traditional PostgreSQL, MySQL, and Postgres-compatible databases
Vectorize — vector database for embeddings
Workers AI — model bindings (Llama, Mistral, Whisper, embeddings)
Email Workers — receive and process email
Cron Triggers — scheduled Worker invocations
Browser Rendering — headless Chromium

Each one has its own quota structure, error model, and observability surface.

Workers: The Compute Layer

What a Worker actually is

V8 isolates, not containers. Boot time is essentially zero (~5ms cold-start vs ~500ms for a typical Lambda container) because there is no container. The trade-off: you can run JavaScript, TypeScript, Rust-compiled-to-WASM, and a curated set of bindings — not arbitrary native binaries.

The limits that matter

Limit	Free	Paid (Standard)
CPU time per request	10ms	30,000ms
Wall-clock time	30 seconds	30 seconds
Memory	128MB	128MB
Subrequests per request	50	1,000
Request body	100MB	500MB
Response body	unlimited	unlimited
Environment variables	64	128
Secrets	64	128
Routes per zone	100	1,000

The CPU-time limit is the one that bites first. CPU time counts only the time your code is on-CPU; time spent awaiting network I/O is not counted. But a Worker that does heavy parsing, crypto, or compression can blow past the Free plan's 10ms even with otherwise modest logic.

Common Worker failure modes

Exceeded CPU time (outcome exceededCpu) — Worker terminated mid-execution; user gets a 1102 error or a partial response
Exceeded subrequest limit — too many fetch() calls; thrown as an exception you can catch
Exceeded request/response body size — silent truncation or 4xx return depending on configuration
Script error — uncaught exception; user gets 1101 unless you catch it
Out-of-memory — rare but possible; manifests as 1107
Resource limit reached — fetch to a bound service that's over quota; per-binding

The 1xxx error codes (1101, 1102, 1107, etc.) are Cloudflare's edge error pages. Your monitoring needs to surface these as the user sees them, not as your code sees them.
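One way to surface these codes as the user sees them is to have a synthetic check classify the response it receives. The sketch below is illustrative only: `classifyEdgeError` is a hypothetical helper, and the body patterns it looks for (`error code: 1102` or `Error 1101`) are assumptions about how the edge error page renders for your client. The code-to-cause mapping is taken from the failure-mode list above.

```typescript
// Cause labels taken from the failure-mode list in this guide.
const EDGE_ERROR_CAUSES: Record<string, string> = {
    '1101': 'script error (uncaught exception)',
    '1102': 'exceeded CPU time',
    '1107': 'out of memory',
};

interface EdgeErrorResult {
    code: string;
    cause: string;
}

// Assumption: the error body contains "error code: 1102" or "Error 1101".
// Adjust the pattern to whatever your synthetic check actually receives.
function classifyEdgeError(status: number, body: string): EdgeErrorResult | null {
    if (status < 500) return null;
    const match = /error(?: code:)? (1\d{3})\b/i.exec(body);
    if (!match) return null;
    const code = match[1];
    if (code === undefined) return null;
    return { code, cause: EDGE_ERROR_CAUSES[code] ?? 'other edge error' };
}
```

Feeding the check's status and body through a function like this lets you alert on "1102 rate" separately from generic 5xx noise.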

Built-in observability

Cloudflare ships four primary observability surfaces for Workers:

Workers Logs — structured logs from console.log calls, queryable in the dashboard and via Logpush. 200K events/day on free plan, paid for more.
Tail Workers — a separate Worker that receives every request event from a target Worker as a TailEvent. Lets you build your own logging/analytics. Indispensable for high-volume tracking.
Trace Events — request lifecycle events including exceptions, subrequest details, and CPU time consumed
Workers Analytics Engine — write custom time-series metrics from inside your Worker; query via SQL. Cheap, high-cardinality friendly. The metric infrastructure most Cloudflare-native apps end up using.

Plus the Workers dashboard itself shows: requests per second, errors per second, CPU time p50/p99, subrequest count, duration.
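As a rough illustration of what a Tail Worker can do with the events it receives, the sketch below rolls a batch up into per-script totals and top exception messages. The `TailItemLike` interface is an assumption: it mirrors only the fields this example reads (`scriptName`, `outcome`, `exceptions`), so check the event type in `@cloudflare/workers-types` before relying on it. `summarizeTailEvents` is a hypothetical helper name.

```typescript
// Minimal shape of the fields this sketch reads from each tail event.
// Assumption: a subset of the event type delivered to a tail() handler.
interface TailItemLike {
    scriptName: string | null;
    outcome: string; // e.g. "ok", "exception", "exceededCpu"
    exceptions: { name: string; message: string }[];
}

interface ScriptSummary {
    total: number;
    failed: number;
    topErrors: Map<string, number>;
}

// Count invocations, failures, and exception messages per producing script.
function summarizeTailEvents(events: TailItemLike[]): Map<string, ScriptSummary> {
    const byScript = new Map<string, ScriptSummary>();
    for (const event of events) {
        const name = event.scriptName ?? 'unknown';
        let summary = byScript.get(name);
        if (!summary) {
            summary = { total: 0, failed: 0, topErrors: new Map<string, number>() };
            byScript.set(name, summary);
        }
        summary.total += 1;
        if (event.outcome !== 'ok') summary.failed += 1;
        for (const ex of event.exceptions) {
            // Truncate messages so one noisy error cannot bloat the summary.
            const key = `${ex.name}: ${ex.message.slice(0, 100)}`;
            summary.topErrors.set(key, (summary.topErrors.get(key) ?? 0) + 1);
        }
    }
    return byScript;
}
```

Inside a Tail Worker you would call this from the `tail(events, env, ctx)` handler and forward the summary to your log sink or to Analytics Engine.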

Wrap your handler

The single highest-value instrumentation pattern:

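// Assumes METRICS is an Analytics Engine dataset binding and
// handle() is your application's router (not shown).
interface Env {
    METRICS: AnalyticsEngineDataset;
}

export default {
    async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
        // Date.now() in Workers only advances on I/O, so this measures
        // time spent waiting on I/O, not CPU time.
        const start = Date.now();
        const route = new URL(request.url).pathname;
        try {
            const response = await handle(request, env);
            // blobs: [route, status]; doubles: [latency ms]; index: outcome
            env.METRICS.writeDataPoint({
                blobs: [route, response.status.toString()],
                doubles: [Date.now() - start],
                indexes: ['ok'],
            });
            return response;
        } catch (err) {
            // err is `unknown` in strict TypeScript; narrow before reading .message
            const message = err instanceof Error ? err.message.slice(0, 100) : 'unknown';
            env.METRICS.writeDataPoint({
                blobs: [route, '500', message],
                doubles: [Date.now() - start],
                indexes: ['error'],
            });
            throw err; // rethrow so the platform still records the exception
        }
    },
};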

Then query Analytics Engine via SQL for per-route latency, error rate, and top error messages.
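A sketch of what that query can look like, assuming the data points were written by the wrapper above (so `blob1` is the route, `index1` is the outcome, and `double1` is latency in ms). The dataset name `worker_metrics` used in the usage note is hypothetical. Rows are weighted by `_sample_interval` because Analytics Engine may sample writes. The second function computes the per-route error rate from the returned rows on the client side.

```typescript
// Build a per-route, per-outcome query for the last `minutes` minutes.
function buildRouteQuery(dataset: string, minutes: number): string {
    // The dataset name is interpolated into SQL, so restrict it to an identifier.
    if (!/^[A-Za-z_][A-Za-z0-9_]*$/.test(dataset)) {
        throw new Error(`unsafe dataset name: ${dataset}`);
    }
    if (!Number.isInteger(minutes) || minutes <= 0) {
        throw new Error('minutes must be a positive integer');
    }
    return [
        'SELECT',
        '  blob1 AS route,',
        '  index1 AS outcome,',
        '  SUM(_sample_interval) AS requests,',
        '  SUM(_sample_interval * double1) / SUM(_sample_interval) AS avg_latency_ms',
        `FROM ${dataset}`,
        `WHERE timestamp > NOW() - INTERVAL '${minutes}' MINUTE`,
        'GROUP BY route, outcome',
    ].join('\n');
}

interface RouteRow {
    route: string;
    outcome: string;
    requests: number | string; // counts may arrive as strings in JSON
}

// Fold the (route, outcome) rows into an error rate per route.
function errorRateByRoute(rows: RouteRow[]): Map<string, number> {
    const totals = new Map<string, { all: number; errors: number }>();
    for (const row of rows) {
        const count = Number(row.requests);
        const entry = totals.get(row.route) ?? { all: 0, errors: 0 };
        entry.all += count;
        if (row.outcome === 'error') entry.errors += count;
        totals.set(row.route, entry);
    }
    const rates = new Map<string, number>();
    totals.forEach((t, route) => {
        rates.set(route, t.all === 0 ? 0 : t.errors / t.all);
    });
    return rates;
}
```

You would send the string from `buildRouteQuery('worker_metrics', 5)` to the Analytics Engine SQL API with an API token and pass the returned rows to `errorRateByRoute`.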

What to alert on

Error rate per Worker > 1% for 5 minutes
CPU time p99 > 80% of your plan's limit
Subrequest count p99 > 80% of the subrequest limit
Tail Worker event rate dropping (= your Worker stopped serving)
Per-route latency p95 / p99 above baseline
Routes config drift (alert on route changes via API audit)
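The first three thresholds above can be expressed as two small checks. This is a minimal sketch under stated assumptions: the input is one bucket per minute, oldest first, and the function names are hypothetical. How you fetch the buckets (Analytics Engine, the dashboard API, a Tail Worker) is up to you.

```typescript
interface MinuteBucket {
    requests: number;
    errors: number;
}

// True when the error rate exceeded `threshold` in each of the last `minutes`
// one-minute buckets. Minutes with no traffic do not count as breaches.
function errorRateSustained(buckets: MinuteBucket[], threshold = 0.01, minutes = 5): boolean {
    if (buckets.length < minutes) return false;
    return buckets
        .slice(-minutes)
        .every((b) => b.requests > 0 && b.errors / b.requests > threshold);
}

// True when a p99 value (CPU ms, subrequest count) exceeds `fraction` of its limit.
function nearLimit(p99: number, limit: number, fraction = 0.8): boolean {
    return p99 > limit * fraction;
}
```

Requiring every bucket in the window to breach avoids paging on a single bad minute, at the cost of detecting an incident a few minutes later.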

For the broader serverless landscape see Serverless Monitoring: Lambda, Vercel, Edge Functions. For the Cloudflare origin/CDN side see Cloudflare Monitoring: Detect Origin Outages.

D1: SQLite on the Edge

The mental model

D1 is SQLite, but distributed. Each D1 database has:

One primary in a single region (where writes go)
Optional read replicas in additional regions (eventually consistent, ~seconds of lag)
All accessible from any Worker via a binding

Writes from a Worker far from the primary cross the network and take 50-200ms+; reads from the local replica are sub-10ms. This locality difference is the single thing you have to monitor for in D1.

Limits to watch

Limit	Value
Database size	10GB per DB
Row size	1MB
Query duration	30 seconds
Read-replica lag	typically < 5s, can spike under load
Concurrent connections	bounded by Worker isolate concurrency
Statements per batch	30,000
Bound expressions per query	100

Failure modes

Replica lag spike — read just after a write returns stale data; the worst class of D1 bug
Primary-region degradation — writes slow down everywhere; reads continue locally
Query timeout — long-running query gets killed at 30s; the result is a 5xx-equivalent error to the Worker
Database-too-large — hitting the 10GB ceiling silently fails new writes (or requires sharding)
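Schema migration drift — D1 migrations are plain SQL files applied with Wrangler (wrangler d1 migrations apply), and nothing checks that the deployed Worker code matches the applied schema; mis-managed migrations are the most common production D1 outage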

What to monitor

Query latency p50/p95/p99 split by read vs write
Read-replica lag (use a synthetic write-then-read-from-replica probe)
Storage size vs 10GB limit
Per-query type ("SELECT ... FROM table_x") top-N slow queries
Connection / binding error rate
Migration job completion (track every schema migration as a heartbeat — see Cron Job Monitoring)
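A minimal sketch of the first and fourth items: wrap each D1 call so it records latency tagged as read or write, with a normalized label for top-N slow-query reports. Everything here is hypothetical helper code, not a D1 API. The read/write split is a keyword heuristic, and the `record` callback is where you would write to Analytics Engine.

```typescript
type QueryKind = 'read' | 'write';

// Classify by leading keyword. Heuristic only: a "WITH ... INSERT" statement
// would be counted as a read.
function queryKind(sql: string): QueryKind {
    return /^\s*(select|with|explain|pragma)\b/i.test(sql) ? 'read' : 'write';
}

// Collapse literals and whitespace so similar statements share one label.
function queryLabel(sql: string): string {
    return sql
        .replace(/'[^']*'/g, '?')
        .replace(/\b\d+\b/g, '?')
        .replace(/\s+/g, ' ')
        .trim()
        .slice(0, 80);
}

interface QuerySample {
    kind: QueryKind;
    label: string;
    ms: number;
    ok: boolean;
}

// Run a query, then report its duration and outcome whether it succeeded or threw.
async function timedQuery<T>(
    sql: string,
    run: () => Promise<T>,
    record: (sample: QuerySample) => void,
    now: () => number = Date.now,
): Promise<T> {
    const start = now();
    let ok = false;
    try {
        const result = await run();
        ok = true;
        return result;
    } finally {
        record({ kind: queryKind(sql), label: queryLabel(sql), ms: now() - start, ok });
    }
}
```

In a Worker the call would look something like `timedQuery(sql, () => env.DB.prepare(sql).bind(id).first(), record)`. Because the query is I/O, the `Date.now()` delta is meaningful here.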

Synthetic D1 health probe

A cheap monitoring pattern: a Worker that runs every minute and:

Writes a row to a monitoring_pings table with a UUID and timestamp
Waits 500ms
Reads the row back from a nearby read replica
Records latency, success, and the read-vs-write lag

Surface this on your dashboard as the "D1 end-to-end health" signal.
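The four steps above can be sketched as follows. This is an illustration under assumptions, not a definitive implementation: `DatabaseLike` is a structural subset of the D1 binding (prepare, bind, run, first), the `monitoring_pings` columns (`id`, `created_at`) are an assumed schema, and `probeD1` is a hypothetical name. The ID, sleep function, and clock are injected so the probe can be exercised outside a Worker.

```typescript
// Structural subset of the D1 binding API used by this probe.
interface StatementLike {
    bind(...values: unknown[]): StatementLike;
    run(): Promise<unknown>;
    first<T>(): Promise<T | null>;
}

interface DatabaseLike {
    prepare(sql: string): StatementLike;
}

interface ProbeResult {
    ok: boolean;      // false if the write or the read threw
    writeMs: number;  // latency of the INSERT
    readMs: number;   // latency of the SELECT
    visible: boolean; // whether the row was readable after the wait
}

async function probeD1(
    db: DatabaseLike,
    id: string,
    sleep: (ms: number) => Promise<void>,
    now: () => number = Date.now,
): Promise<ProbeResult> {
    try {
        const t0 = now();
        await db
            .prepare('INSERT INTO monitoring_pings (id, created_at) VALUES (?, ?)')
            .bind(id, t0)
            .run();
        const t1 = now();
        await sleep(500); // give replication a moment, per step 2
        const t2 = now();
        const row = await db
            .prepare('SELECT id FROM monitoring_pings WHERE id = ?')
            .bind(id)
            .first<{ id: string }>();
        const t3 = now();
        return { ok: true, writeMs: t1 - t0, readMs: t3 - t2, visible: row !== null };
    } catch {
        return { ok: false, writeMs: 0, readMs: 0, visible: false };
    }
}
```

From a scheduled handler you would pass `env.DB`, a fresh UUID, and a `setTimeout`-based sleep, then write the result to Analytics Engine. Whether the SELECT is actually served by a replica depends on how read replication is configured for the database; this sketch does not control that. A result with `ok: true` and `visible: false` is the stale-read signal.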

R2: Object Storage Without Egress Fees

Where R2 wins

S3-compatible API, zero egress fees, ~120 data-center backed POPs. The killer feature is the egress economics — serving large assets from R2 is roughly 80% cheaper than S3 once egress is included.

Limits

Limit	Value
Object size	5TB
Bucket size	unlimited
Multipart parts	10,000
Object key length	1024 bytes
List operations	1000 keys per response

Failure modes

Operation rate limit — class A operations (writes) are ~1000/sec/bucket; spikes get throttled with 429s
Multipart upload abandonment — partial uploads accumulate (and bill) until lifecycle rules clean them up
Signed URL expiry timing bugs — server clock skew between Worker and S3-compatible client tools occasionally produces 403s
CORS / public-bucket misconfiguration — silent 4xx for browser clients
R2 → public Worker domain misconfiguration — origin pulls fail in subtle ways

What to monitor

Class A / Class B operations per minute (track against your bill)
429 / 503 rate from R2 API calls
Multipart upload abandonment rate (track via S3-compatible ListMultipartUploads)
Bucket size trend (cost planning)
Lifecycle rule status (recent execution timestamp)
Public-domain HTTP availability (treat as a CDN)
Signed URL TTL window — alert if you're issuing URLs with TTL > 24h (security signal too)
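Two of these checks are easy to sketch as pure functions. The first filters a ListMultipartUploads result for uploads older than a cutoff; the `MultipartUpload` shape is an assumed subset of what your S3 client returns, and the 24-hour default is an arbitrary choice. The second reads the TTL from a SigV4 presigned URL via its `X-Amz-Expires` query parameter, which is expressed in seconds. Both function names are hypothetical.

```typescript
// Assumed subset of one entry from an S3-compatible ListMultipartUploads response.
interface MultipartUpload {
    key: string;
    uploadId: string;
    initiated: Date;
}

// Uploads started more than `maxAgeHours` ago that were never completed or aborted.
function staleUploads(uploads: MultipartUpload[], now: Date, maxAgeHours = 24): MultipartUpload[] {
    const cutoff = now.getTime() - maxAgeHours * 3600 * 1000;
    return uploads.filter((u) => u.initiated.getTime() < cutoff);
}

// TTL in seconds of a SigV4 presigned URL, or null if the parameter is absent.
function presignedTtlSeconds(url: string): number | null {
    const match = /[?&]X-Amz-Expires=(\d+)/i.exec(url);
    return match ? Number(match[1]) : null;
}

// True when a presigned URL outlives the 24h policy mentioned above.
function ttlTooLong(url: string, maxSeconds = 24 * 3600): boolean {
    const ttl = presignedTtlSeconds(url);
    return ttl !== null && ttl > maxSeconds;
}
```

Tracking `staleUploads(...).length` over time gives you the abandonment trend; a number that keeps growing usually means the lifecycle rule is missing or not running.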

KV: Eventually Consistent Key-Value

Mental model

A globally distributed key-value store with eventually consistent reads. Writes take ~60 seconds to fully propagate worldwide. Reads that miss the local colo's cache take 30-100ms; cached reads take 5-15ms.

Limits

Limit	Value
Value size	25MB
Key length	512 bytes
Write rate per key	1/sec
Read rate per key	very high
Operations per Worker invocation	1,000
Metadata size	1024 bytes per key

Failure modes

Hot-key contention — writing to the same key > 1/sec gets rejected. The single biggest KV gotcha; happens when teams use KV for rate limiting or counters (it's not designed for those use cases).
Eventual-consistency surprises — write-then-read-back from a different colo can return the old value for up to 60 seconds. KV is not a database.
Cache-miss latency — first read of a key in a colo can take 100ms+; if your app expects sub-10ms reads, this is a regression source.
Quota exhaustion — read/write/delete operations are billed; runaway loops are expensive fast.

What to monitor

Operation count per minute (writes vs reads vs deletes)
Per-binding latency p99
Read-vs-write ratio (a normal app reads >> writes; a write-heavy ratio suggests misuse)
Top-N hot keys (any single key > 1/sec is a problem)
KV operation error rate
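Since there is no built-in hot-key view, the top-N hot keys item comes down to counting your own write log. A minimal sketch, assuming you already record each put() as a (key, timestamp) pair: `findHotKeys` is a hypothetical helper, and the 0.8 writes/sec warning threshold is an arbitrary stand-in for "approaching 1/sec".

```typescript
interface KvWrite {
    key: string;
    timestampMs: number;
}

interface HotKey {
    key: string;
    writesPerSec: number;
}

// Keys whose average write rate inside the window is at or above `warnPerSec`,
// sorted hottest first.
function findHotKeys(
    writes: KvWrite[],
    windowStartMs: number,
    windowMs = 60000,
    warnPerSec = 0.8,
): HotKey[] {
    const counts = new Map<string, number>();
    for (const w of writes) {
        // Ignore writes outside [windowStartMs, windowStartMs + windowMs).
        if (w.timestampMs < windowStartMs || w.timestampMs >= windowStartMs + windowMs) continue;
        counts.set(w.key, (counts.get(w.key) ?? 0) + 1);
    }
    const hot: HotKey[] = [];
    counts.forEach((count, key) => {
        const writesPerSec = count / (windowMs / 1000);
        if (writesPerSec >= warnPerSec) hot.push({ key, writesPerSec });
    });
    return hot.sort((a, b) => b.writesPerSec - a.writesPerSec);
}
```

A one-minute average hides short bursts: two writes to the same key within one second can be rejected even when the average is far below 1/sec. Treat this as a trend signal and keep the KV operation error rate as the hard alert.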

KV vs Durable Objects vs D1 — pick the right primitive

A common monitoring story is "KV is broken" when really the team picked KV for a use case it doesn't fit:

KV — global config, feature flags, low-write cache. Eventual consistency.
Durable Objects — counters, rate limits, per-user state, anything needing strong consistency
D1 — relational data, queries, joins

Migrating off the wrong primitive often fixes "KV latency issues" because the new design uses the right tool.

Durable Objects: Strongly Consistent Edge State

Brief but worth covering. A Durable Object is a single-instance stateful actor — one DO, one location, one consistency boundary. Use them for: counters, presence (chat rooms), rate limits, per-tenant state, websocket coordination, distributed locks.
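To make the rate-limit use case concrete, here is a sketch of the state logic a rate-limiting Durable Object might hold: a fixed-window counter kept in the object's storage. `StorageLike` is an assumed structural subset of the Durable Object storage API (get and put), and `RateLimiterCore` is a hypothetical class. The Durable Object class wiring, bindings, and routing are omitted.

```typescript
// Assumed subset of Durable Object storage used by this sketch.
interface StorageLike {
    get<T>(key: string): Promise<T | undefined>;
    put<T>(key: string, value: T): Promise<void>;
}

interface WindowState {
    windowId: number;
    count: number;
}

// Fixed-window limiter: at most `limit` calls per `windowMs` window.
class RateLimiterCore {
    private storage: StorageLike;
    private limit: number;
    private windowMs: number;

    constructor(storage: StorageLike, limit: number, windowMs: number) {
        this.storage = storage;
        this.limit = limit;
        this.windowMs = windowMs;
    }

    async allow(nowMs: number): Promise<boolean> {
        const windowId = Math.floor(nowMs / this.windowMs);
        const current = await this.storage.get<WindowState>('window');
        // A stored count from an earlier window no longer applies.
        const count = current && current.windowId === windowId ? current.count : 0;
        if (count >= this.limit) return false;
        await this.storage.put<WindowState>('window', { windowId, count: count + 1 });
        return true;
    }
}
```

Because each limiter key maps to one object in one location, the counter has a single consistency boundary, which is exactly what the same logic on KV lacks.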

What to monitor:

DO invocation count + latency per class
DO hibernate / wake events (cold-start equivalent)
DO storage size (1GB per object)
Alarm scheduling and execution latency (cron-equivalent inside a DO)
WebSocket connection count per DO (for chat / presence patterns)
DO migration / class-rename safety (these are easy to get wrong in production)

Supporting Services (Brief)

Queues — at-least-once delivery; monitor queue depth, consumer lag, retry count, DLQ size
Pub/Sub — MQTT broker; monitor publish/subscribe rate, connection count, retained messages
Hyperdrive — DB connection pooling for upstream Postgres / MySQL; monitor cached query hit ratio, upstream connection count, error rate; for the broader DB story see API Rate Limit Monitoring
Vectorize — vector DB for embeddings; monitor index size, query latency, recall — see vector database monitoring
Workers AI — model bindings; per-binding latency p95/p99, error rate, neuron usage vs quota, daily cost
Browser Rendering — headless Chromium for screenshots/PDFs; concurrent-browser count, request timeout rate

Cron Triggers — The Most-Missed Failure Surface

Cloudflare's Cron Triggers fire scheduled Worker invocations. The default monitoring tells you "this cron is configured." It does not tell you "this cron actually ran and completed successfully last time it was supposed to."

The fix is a heartbeat pattern: each scheduled invocation pings a monitoring endpoint with (cron_name, timestamp, success). An external monitor alerts if the heartbeat doesn't arrive within the expected window.

export default {
    async scheduled(event: ScheduledController, env: Env, ctx: ExecutionContext) {
        try {
            await doScheduledWork(env);
            // waitUntil keeps the invocation alive until the ping completes
            ctx.waitUntil(
                fetch('https://webalert.io/heartbeat/your-cron-id?status=ok')
            );
        } catch (err) {
            // Report the failure, then rethrow so the platform records it too
            ctx.waitUntil(
                fetch('https://webalert.io/heartbeat/your-cron-id?status=fail')
            );
            throw err;
        }
    },
};

See Cron Job Monitoring: Background Tasks for the broader pattern.

Wrangler Deploy Monitoring

wrangler deploy is fast — typically 5-15 seconds — and the failure modes are:

Build error — TypeScript / bundler errors caught at deploy time
Configuration error — bound resource missing, secret missing, route conflict
Quota exhaustion — too many Workers / routes for plan
Account / API token issues — silent in some CI integrations

Monitor:

Deploy duration over time (sudden jumps indicate bundling issues)
Deploy failure rate
Time-since-last-successful-deploy (alert if > N days for a service that should be active)
Post-deploy synthetic check (hit the worker's URL within 30s of deploy; alert if the new version isn't returning 200)
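The post-deploy check in the last item can be sketched as a small retry loop run from CI after wrangler deploy. The function name and defaults are assumptions: six attempts five seconds apart keep the check inside the 30-second window. The HTTP call and the sleep are injected, so nothing here depends on a particular runtime.

```typescript
// Resolves to the HTTP status code returned for `url`.
type StatusProbe = (url: string) => Promise<number>;

// Poll until the URL returns 200 or the attempts run out.
async function waitForHealthy(
    url: string,
    probe: StatusProbe,
    sleep: (ms: number) => Promise<void>,
    attempts = 6,
    delayMs = 5000,
): Promise<boolean> {
    for (let i = 0; i < attempts; i++) {
        try {
            if ((await probe(url)) === 200) return true;
        } catch {
            // Network error: treat as not yet healthy and retry.
        }
        // No need to wait after the final attempt.
        if (i < attempts - 1) await sleep(delayMs);
    }
    return false;
}
```

In CI, `probe` can be as simple as `async (u) => (await fetch(u)).status`; fail the pipeline or page someone when the function resolves to false. Checking a version marker in the response body as well would confirm that the new version, not the old one, is the one answering.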

Workers AI: The New Cost Center

If you're using Workers AI, monitor it like any other AI/LLM service plus a Cloudflare-specific lens:

Per-model latency p95/p99 (Llama 3 vs Mistral vs embeddings have very different latency profiles)
Per-model error rate
Neuron consumption vs daily quota
Cost per request (some models are dramatically more expensive than others)
Cost per user / per tenant (catch runaway usage)

See AI/LLM API Monitoring and AI Agent Monitoring for the broader AI-monitoring picture.

Why You Still Need External Monitoring

The Cloudflare dashboard lags by 30-90 seconds. The Cloudflare status page lags by 5-15 minutes (sometimes longer for regional incidents). Workers Logs are great for after-the-fact debugging but they're not real-time alerting.

External monitoring fills three gaps:

Time-to-detect — a 1-minute external HTTP check beats every other layer for "did the user get a 200?"
The platform itself failing — when Cloudflare is the problem, Cloudflare's own monitoring can't tell you
Multi-perspective verification — external checks from regions outside Cloudflare's edge confirm reachability

The combination that works: Workers Analytics Engine for per-route metrics inside the platform, Tail Workers for deep logs, and external HTTP monitoring with multi-region checks for time-to-detect.

See Multi-Region Monitoring: Why Location Matters, CDN Monitoring, and DDoS Monitoring: Detect & Mitigate Traffic Spikes for the broader edge-monitoring patterns.

Cloudflare Workers Monitoring Checklist

Every Worker wraps its fetch handler in try/catch + Analytics Engine write
Per-route latency p50/p95/p99 dashboards in Analytics Engine SQL
Per-Worker error rate alerting
CPU-time p99 alerting at 80% of plan limit
Subrequest count p99 alerting at 80% of limit
Tail Worker for high-value Workers shipping structured logs
D1 query latency split read vs write
D1 read-replica lag synthetic probe
D1 storage-size trend tracked vs 10GB ceiling
R2 class A/B operation rate tracked
R2 multipart abandonment lifecycle policy in place
KV top-N hot-key audit (no key written > 1/sec)
KV operation count tracked (cost)
Durable Objects per-class invocation latency + alarm execution lag
Cron Triggers heartbeat to external monitor
Wrangler deploy success rate + post-deploy synthetic check
Workers AI per-model latency + cost + quota
External HTTP monitoring from multiple regions on every public Worker URL
Status-page subscription for cloudflarestatus.com (lagging indicator only)

How Webalert Helps Monitor Cloudflare Workers Apps

Webalert is the external-monitoring layer that complements Cloudflare's built-in observability:

HTTP monitoring — Public Worker URL, R2 public domain, custom domains; 1-minute resolution
Multi-region checks — Confirm Worker reachability from regions outside Cloudflare's network; catch routing / DNS issues invisible to the dashboard
Content validation — Hit a /internal/workers-health endpoint that surfaces Analytics Engine summary data; alert when error rate or CPU-time p99 crosses threshold
Heartbeat monitoring — Cron Triggers ping a heartbeat URL; missed heartbeats alert immediately
SSL certificate monitoring — Custom-domain cert expiry and chain validation
Response time alerts — Catch Workers degrading from "consistently fast" to "occasionally slow"
Multi-channel alerts — Email, SMS, Slack, Discord, Microsoft Teams, webhooks
Status page — Communicate degraded Worker regions or AI-binding outages to users
5-minute setup — Add hostnames, point at heartbeats, set thresholds

See features and pricing.

Summary

The Cloudflare developer platform is now a full stack — Workers, D1, R2, KV, Durable Objects, Queues, Hyperdrive, Vectorize, Workers AI — each with its own limits and failure modes.
The CPU-time limit, the subrequest limit, and the KV hot-key 1/sec rule are the three Worker constraints that bite earliest in production.
Wrap every Worker's fetch handler in try/catch + Analytics Engine write. Per-route latency, error rate, and top error messages fall out for free.
D1 read-replica lag is the most-overlooked source of bugs; run a synthetic write-then-read probe.
R2 is great for storage cost but multipart abandonment and operation rate limits still need monitoring.
KV is eventually consistent and rate-limited per key; using KV for counters or rate limiting is a frequent root cause of "KV is broken" tickets. Use Durable Objects instead.
Cron Triggers need a heartbeat pattern — Cloudflare won't tell you when one silently stops firing.
Workers AI is a new cost center; track per-model latency, cost, and quota daily.
The Cloudflare dashboard lags real time by ~30-90 seconds, the status page lags by 5-15 minutes. External multi-region HTTP monitoring is what catches outages first.

A well-instrumented Cloudflare Workers app pairs platform-native observability (Analytics Engine, Tail Workers, Logs) with external monitoring (HTTP checks, heartbeats, multi-region). The platform tells you why something failed; external monitoring tells you that it failed — even when the platform itself is the problem.

Frequently Asked Questions

How do you monitor a Cloudflare Worker?

Combine three layers: (1) In-Worker instrumentation — wrap fetch in try/catch and write per-route metrics (status code, latency, error type) to Analytics Engine. (2) Platform observability — enable Tail Workers and Logs for real-time debugging. (3) External monitoring — run multi-region HTTP checks against your Worker's hostname so you detect outages, region-specific failures, and platform incidents before the Cloudflare dashboard catches up.

What's the CPU-time limit for Cloudflare Workers and how do I detect when I hit it?

The Free plan limits CPU time to 10 ms per request; the Paid plan to 30 seconds. When you exceed it, the Worker is killed and returns a 1102 error. CPU time is hard to measure from inside the Worker: Date.now() and performance.now() only advance when the Worker performs I/O, so deltas around CPU-bound code read as zero. Take the number from the platform instead: the Workers dashboard and the GraphQL Analytics API report CPU-time percentiles per Worker. Alert when p99 CPU time crosses 80% of your limit. That is your early warning before users start hitting 1102 errors.

How do you detect D1 read-replica lag?

D1 uses eventual consistency between primary and read replicas. To detect lag: write a row with a server timestamp, immediately read it back via a separate query, and measure the delay until the row appears. Run this synthetic write-then-read probe every minute from a Worker and alert if lag exceeds your tolerance (typically 1-5 seconds). This is the single most-overlooked source of "ghost bugs" in D1 deployments.

Can you monitor Cloudflare KV hot keys?

KV is rate-limited to 1 write per second per key. There's no direct dashboard for "which keys are hot" — you have to instrument it yourself. Log every KV put() with the key name and timestamp into Analytics Engine, then aggregate by key over a 1-minute window. Alert on any key approaching 1 write/sec. If you need higher write throughput per key, use Durable Objects for serializable state instead of KV.

How do you monitor Cloudflare Cron Triggers?

Cloudflare won't notify you when a Cron Trigger silently stops firing. Use the heartbeat pattern: at the end of every successful cron execution, have the Worker fetch() a unique heartbeat URL from your monitoring service. The monitor expects a ping at the scheduled interval and alerts if it doesn't arrive within a grace window. This catches misconfigured schedules, account-level cron limits being hit, and platform issues.
