Cloudflare Workers, D1, R2, KV: Edge Monitoring

Webalert Team
May 16, 2026

A Cloudflare Worker fails differently from a regular backend. There is no log file to tail. No SSH access. No top to run. The request lands at one of 300+ edge colos, executes inside an isolate for a few milliseconds, and either returns a response or doesn't — and if it doesn't, your only window into what happened is whatever you remembered to instrument before the failure.

The Cloudflare developer platform has gone from "edge compute" to a full application stack in the last few years: Workers for compute, D1 for SQLite-on-edge, R2 for object storage, KV for key-value, Durable Objects for stateful actors, Queues, Hyperdrive, Vectorize, Workers AI. Each one has its own limits, its own failure modes, and its own monitoring story — and the built-in Cloudflare dashboards give you a useful but lagging view.

This guide is the production-monitoring layer for the Cloudflare developer platform: the limits that matter, the failure modes you will actually hit, how to use the built-in observability (Logs, Tail Workers, Trace Events, Workers Analytics Engine), and where you need external monitoring to fill the gaps the platform doesn't cover.


The Platform At a Glance

The Cloudflare developer stack in mid-2026 covers:

  • Workers — JavaScript/TypeScript/Rust/WASM compute running in V8 isolates at the edge
  • D1 — distributed SQLite, with read-replica regions and a single primary
  • R2 — S3-compatible object storage, zero egress fees
  • KV — eventually-consistent key-value with global read replication
  • Durable Objects — single-instance stateful objects with strong consistency
  • Queues — message queues with at-least-once delivery
  • Pub/Sub — MQTT-style messaging
  • Hyperdrive — connection pooling for PostgreSQL, MySQL, and Postgres-compatible databases
  • Vectorize — vector database for embeddings
  • Workers AI — model bindings (Llama, Mistral, Whisper, embeddings)
  • Email Workers — receive and process email
  • Cron Triggers — scheduled Worker invocations
  • Browser Rendering — headless Chromium

Each one has its own quota structure, error model, and observability surface.


Workers: The Compute Layer

What a Worker actually is

V8 isolates, not containers. Boot time is essentially zero (~5ms cold-start vs ~500ms for a typical Lambda container) because there is no container. The trade-off: you can run JavaScript, TypeScript, Rust-compiled-to-WASM, and a curated set of bindings — not arbitrary native binaries.

The limits that matter

Limit                      Free          Paid (Standard)
CPU time per request       10ms          30,000ms
Wall-clock time            30 seconds    30 seconds
Memory                     128MB         128MB
Subrequests per request    50            1,000
Request body               100MB         500MB
Response body              unlimited     unlimited
Environment variables      64            128
Secrets                    64            128
Routes per zone            100           1,000

The CPU-time limit is the one that bites first. CPU time counts only the time your code is on-CPU; time spent awaiting network I/O is not counted. But a Worker that does heavy parsing, crypto, or compression can blow past the free plan's 10ms even with otherwise modest logic.

Common Worker failure modes

  • Exceeded CPU time (exceeded_cpu) — Worker terminated mid-execution; user gets a 1102 error or a partial response
  • Exceeded subrequest limit — too many fetch() calls; thrown as an exception you can catch
  • Exceeded request/response body size — silent truncation or 4xx return depending on configuration
  • Script error — uncaught exception; user gets 1101 unless you catch it
  • Out-of-memory — rare but possible; manifests as 1107
  • Resource limit reached — fetch to a bound service that's over quota; per-binding

The 1xxx error codes (1101, 1102, 1107, etc.) are Cloudflare's edge error pages. Your monitoring needs to surface these as the user sees them, not as your code sees them.
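
To surface these codes the way a user sees them, an external probe can pull the code out of the failed response body. The following is a minimal sketch, assuming the edge error body contains text of the form "error code: 1102" (check a real failing response from your own zone before relying on this). The function name and label map are illustrative; the labels simply repeat the failure modes listed above.

```typescript
// Labels mirror the failure modes described in this section.
const EDGE_ERROR_LABELS: Record<number, string> = {
  1101: "uncaught script exception",
  1102: "exceeded CPU time",
  1107: "out of memory",
};

interface EdgeError {
  code: number;
  label: string;
}

// Assumption: the error body includes "error code: 1101" or similar.
// Returns null when no 1xxx code is found in the body.
function extractEdgeError(body: string): EdgeError | null {
  const match = /error code:?\s*(1\d{3})\b/i.exec(body);
  if (!match) return null;
  const code = Number(match[1]);
  return { code, label: EDGE_ERROR_LABELS[code] ?? "other edge error" };
}
```

A probe would call extractEdgeError on the body of any non-2xx response and tag the alert with the returned label, so "1102 spike" and "1101 spike" show up as different incidents.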

Built-in observability

Cloudflare ships four primary observability surfaces for Workers:

  1. Workers Logs — structured logs from console.log calls, queryable in the dashboard and via Logpush. 200K events/day on free plan, paid for more.
  2. Tail Workers — a separate Worker that receives every request event from a target Worker as a TailEvent. Lets you build your own logging/analytics. Indispensable for high-volume tracking.
  3. Trace Events — request lifecycle events including exceptions, subrequest details, and CPU time consumed
  4. Workers Analytics Engine — write custom time-series metrics from inside your Worker; query via SQL. Cheap, high-cardinality friendly. The metric infrastructure most Cloudflare-native apps end up using.

Plus the Workers dashboard itself shows: requests per second, errors per second, CPU time p50/p99, subrequest count, duration.
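
As a rough illustration of the "build your own logging" idea behind Tail Workers, the sketch below reduces a batch of tail events to one summary line. It assumes each event exposes scriptName, outcome, and exceptions fields; TailItem is a trimmed-down local stand-in rather than the platform's full type, and summarizeTail is a hypothetical helper name.

```typescript
// Trimmed-down local stand-in for the items a tail() handler receives.
interface TailItem {
  scriptName: string | null;
  outcome: string; // for example "ok" or "exception"
  exceptions: { name: string; message: string }[];
}

interface TailSummary {
  total: number;
  byOutcome: Record<string, number>;
  exceptionMessages: string[];
}

// Counts events per outcome and collects truncated exception messages.
function summarizeTail(events: TailItem[]): TailSummary {
  const summary: TailSummary = { total: 0, byOutcome: {}, exceptionMessages: [] };
  for (const ev of events) {
    summary.total += 1;
    summary.byOutcome[ev.outcome] = (summary.byOutcome[ev.outcome] ?? 0) + 1;
    for (const ex of ev.exceptions) {
      summary.exceptionMessages.push(`${ex.name}: ${ex.message}`.slice(0, 200));
    }
  }
  return summary;
}

// Tail Worker handler object: logs one JSON summary line per batch.
// In a real Worker module this object would be the default export.
const tailWorker = {
  async tail(events: TailItem[]): Promise<void> {
    console.log(JSON.stringify(summarizeTail(events)));
  },
};
```

Keeping the summarizing logic in a plain function makes it easy to unit test without deploying anything.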

Wrap your handler

The single highest-value instrumentation pattern:

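// Assumes METRICS is a Workers Analytics Engine dataset binding and that
// handle() is your application's own request handler.
interface Env {
    METRICS: AnalyticsEngineDataset;
}

export default {
    async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
        const start = Date.now();
        const route = new URL(request.url).pathname;
        try {
            const response = await handle(request, env);
            // blob1 = route, blob2 = status, double1 = latency in ms
            env.METRICS.writeDataPoint({
                blobs: [route, response.status.toString()],
                doubles: [Date.now() - start],
                indexes: ['ok'],
            });
            return response;
        } catch (err) {
            // err is `unknown` under strict TypeScript, so narrow it before reading .message
            const message = err instanceof Error ? err.message.slice(0, 100) : 'unknown';
            // blob3 = truncated error message
            env.METRICS.writeDataPoint({
                blobs: [route, '500', message],
                doubles: [Date.now() - start],
                indexes: ['error'],
            });
            // Rethrow so the platform still records the invocation as failed
            throw err;
        }
    },
};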

Then query Analytics Engine via SQL for per-route latency, error rate, and top error messages.
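
A sketch of what such a query might look like, built as a string so it can be sent through the Analytics Engine SQL API. The dataset name is hypothetical, and the column mapping assumes the writeDataPoint call above (blob1 = route, double1 = latency, index1 = 'ok' or 'error'). Weighting by _sample_interval is there so that sampled data still produces sensible totals.

```typescript
// Column mapping follows the writeDataPoint call above:
// blob1 = route, double1 = latency (ms), index1 = 'ok' | 'error'.
function buildRouteStatsQuery(dataset: string, windowMinutes: number): string {
  // Reject anything that is not a plain identifier before interpolating it.
  if (!/^[A-Za-z_][A-Za-z0-9_]*$/.test(dataset)) {
    throw new Error(`unexpected dataset name: ${dataset}`);
  }
  const minutes = Math.max(1, Math.floor(windowMinutes));
  return [
    "SELECT",
    "  blob1 AS route,",
    "  index1 AS outcome,",
    "  SUM(_sample_interval) AS requests,",
    "  SUM(_sample_interval * double1) / SUM(_sample_interval) AS avg_latency_ms",
    `FROM ${dataset}`,
    `WHERE timestamp > NOW() - INTERVAL '${minutes}' MINUTE`,
    "GROUP BY route, outcome",
    "ORDER BY requests DESC",
  ].join("\n");
}
```

Grouping by both route and outcome yields one row per (route, ok/error) pair, from which the per-route error rate is errors divided by total requests.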

What to alert on

  • Error rate per Worker > 1% for 5 minutes
  • CPU time p99 > 80% of your plan's limit
  • Subrequest count p99 > 80% of the subrequest limit
  • Tail Worker event rate dropping (= your Worker stopped serving)
  • Per-route latency p95 / p99 above baseline
  • Routes config drift (alert on route changes via API audit)
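
The first rule above ("error rate > 1% for 5 minutes") can be evaluated over per-minute buckets. This is one possible reading of that rule, not a prescribed one: it fires only when every one of the last five minutes is above the threshold. The bucket shape and function name are assumptions.

```typescript
interface MinuteBucket {
  requests: number;
  errors: number;
}

// True only when each of the most recent `minutes` buckets exceeds the
// threshold, so a single noisy minute does not trigger an alert.
function errorRateBreached(
  buckets: MinuteBucket[], // oldest first, one entry per minute
  threshold = 0.01,
  minutes = 5,
): boolean {
  if (buckets.length < minutes) return false;
  return buckets
    .slice(-minutes)
    .every((b) => b.requests > 0 && b.errors / b.requests > threshold);
}
```

Minutes with zero requests count as "not breached" here; whether silence should instead raise its own alert is a separate decision (the Tail Worker event-rate rule covers that case).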

For the broader serverless landscape see Serverless Monitoring: Lambda, Vercel, Edge Functions. For the Cloudflare origin/CDN side see Cloudflare Monitoring: Detect Origin Outages.


D1: SQLite on the Edge

The mental model

D1 is SQLite, but distributed. Each D1 database has:

  • One primary in a single region (where writes go)
  • Optional read replicas in additional regions (eventually consistent, ~seconds of lag)
  • All accessible from any Worker via a binding

Writes from a Worker far from the primary cross the network and take 50-200ms+; reads from the local replica are sub-10ms. This locality difference is the first thing to monitor in D1.

Limits to watch

Limit                          Value
Database size                  10GB per DB
Row size                       1MB
Query duration                 30 seconds
Read-replica lag               typically < 5s, can spike under load
Concurrent connections         bounded by Worker isolate concurrency
Statements per batch           30,000
Bound expressions per query    100

Failure modes

  • Replica lag spike — read just after a write returns stale data; the worst class of D1 bug
  • Primary-region degradation — writes slow down everywhere; reads continue locally
  • Query timeout — long-running query gets killed at 30s; the result is a 5xx-equivalent error to the Worker
  • Database-too-large — hitting the 10GB ceiling silently fails new writes (or requires sharding)
  • Schema migration drift — D1 migrations are applied per database through Wrangler (wrangler d1 migrations apply), so staging and production can fall out of step; mis-managed migrations are the most common production D1 outage

What to monitor

  • Query latency p50/p95/p99 split by read vs write
  • Read-replica lag (use a synthetic write-then-read-from-replica probe)
  • Storage size vs 10GB limit
  • Per-query type ("SELECT ... FROM table_x") top-N slow queries
  • Connection / binding error rate
  • Migration job completion (track every schema migration as a heartbeat — see Cron Job Monitoring)

Synthetic D1 health probe

A cheap monitoring pattern: a Worker that runs every minute and:

  1. Writes a row to a monitoring_pings table with a UUID and timestamp
  2. Waits 500ms
  3. Reads the row back from a nearby read replica
  4. Records latency, success, and the read-vs-write lag

Surface this on your dashboard as the "D1 end-to-end health" signal.
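
The four steps can be sketched roughly as follows. The code takes a writer and a reader handle so that the caller decides how reads reach a replica (for example through D1's read-replication sessions, which you should verify against the current D1 docs). The ProbeDb and ProbeStatement interfaces are local stand-ins for the prepare/bind/run/first calls used here, and the column names id and created_at are assumptions.

```typescript
// Minimal structural types for the D1 calls this sketch uses.
interface ProbeStatement {
  bind(...values: unknown[]): ProbeStatement;
  run(): Promise<unknown>;
  first<T>(): Promise<T | null>;
}
interface ProbeDb {
  prepare(sql: string): ProbeStatement;
}

interface ProbeResult {
  ok: boolean; // true when the read saw the row that was just written
  writeMs: number;
  readMs: number;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function runD1Probe(
  writer: ProbeDb,
  reader: ProbeDb,
  id: string, // a fresh unique ID per run, e.g. a UUID
  waitMs = 500,
): Promise<ProbeResult> {
  // Step 1: write a uniquely identifiable row.
  const t0 = Date.now();
  await writer
    .prepare("INSERT INTO monitoring_pings (id, created_at) VALUES (?, ?)")
    .bind(id, t0)
    .run();
  const writeMs = Date.now() - t0;

  // Step 2: give replication a moment.
  await sleep(waitMs);

  // Step 3: read the row back through the reader handle.
  const t1 = Date.now();
  const row = await reader
    .prepare("SELECT id FROM monitoring_pings WHERE id = ?")
    .bind(id)
    .first<{ id: string }>();

  // Step 4: report success and both latencies.
  return { ok: row?.id === id, writeMs, readMs: Date.now() - t1 };
}
```

An ok value of false after the wait means replica lag exceeded waitMs for that run; counting those per hour gives a usable lag signal even without an exact lag figure.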


R2: Object Storage Without Egress Fees

Where R2 wins

S3-compatible API, zero egress fees, ~120 data-center backed POPs. The killer feature is the egress economics — serving large assets from R2 is roughly 80% cheaper than S3 once egress is included.

Limits

Limit                Value
Object size          5TB
Bucket size          unlimited
Multipart parts      10,000
Object key length    1024 bytes
List operations      1000 keys per response

Failure modes

  • Operation rate limit — class A operations (writes) are ~1000/sec/bucket; spikes get throttled with 429s
  • Multipart upload abandonment — partial uploads accumulate (and bill) until lifecycle rules clean them up
  • Signed URL expiry timing bugs — server clock skew between Worker and S3-compatible client tools occasionally produces 403s
  • CORS / public-bucket misconfiguration — silent 4xx for browser clients
  • R2 → public Worker domain misconfiguration — origin pulls fail in subtle ways

What to monitor

  • Class A / Class B operations per minute (track against your bill)
  • 429 / 503 rate from R2 API calls
  • Multipart upload abandonment rate (track via S3-compatible ListMultipartUploads)
  • Bucket size trend (cost planning)
  • Lifecycle rule status (recent execution timestamp)
  • Public-domain HTTP availability (treat as a CDN)
  • Signed URL TTL window — alert if you're issuing URLs with TTL > 24h (security signal too)
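
Two of the checks above reduce to small pure functions. The sketch below assumes you already have the entries from an S3-compatible ListMultipartUploads call (reduced here to a hypothetical PendingUpload shape) and that signed URLs are SigV4 presigned URLs carrying an X-Amz-Expires query parameter. The 24-hour defaults mirror the threshold mentioned above and are otherwise arbitrary.

```typescript
// One entry from a ListMultipartUploads response, reduced to the fields used here.
interface PendingUpload {
  key: string;
  uploadId: string;
  initiated: Date;
}

// Uploads still open after maxAgeHours are treated as abandoned.
function findAbandonedUploads(
  uploads: PendingUpload[],
  now: Date,
  maxAgeHours = 24,
): PendingUpload[] {
  const cutoff = now.getTime() - maxAgeHours * 3_600_000;
  return uploads.filter((u) => u.initiated.getTime() < cutoff);
}

// Reads the TTL in seconds from a presigned URL's X-Amz-Expires parameter.
function signedUrlTtlSeconds(url: string): number | null {
  const raw = new URL(url).searchParams.get("X-Amz-Expires");
  if (raw === null) return null;
  const ttl = Number(raw);
  return Number.isFinite(ttl) ? ttl : null;
}

// True when the URL's TTL is longer than the allowed maximum.
function ttlTooLong(url: string, maxSeconds = 24 * 3600): boolean {
  const ttl = signedUrlTtlSeconds(url);
  return ttl !== null && ttl > maxSeconds;
}
```

The abandonment rate is then the number of abandoned uploads divided by the number of uploads started in the same window.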

KV: Eventually Consistent Key-Value

Mental model

A globally distributed key-value store with eventually consistent reads. Writes take ~60 seconds to fully propagate worldwide. Reads that miss the local colo's cache take 30-100ms; cached reads take 5-15ms.

Limits

Limit                               Value
Value size                          25MB
Key length                          512 bytes
Write rate per key                  1/sec
Read rate per key                   very high
Operations per Worker invocation    1,000
Metadata size                       1024 bytes per key

Failure modes

  • Hot-key contention — writing to the same key > 1/sec gets rejected. The single biggest KV gotcha; happens when teams use KV for rate limiting or counters (it's not designed for those use cases).
  • Eventual-consistency surprises — write-then-read-back from a different colo can return the old value for up to 60 seconds. KV is not a database.
  • Cache-miss latency — first read of a key in a colo can take 100ms+; if your app expects sub-10ms reads, this is a regression source.
  • Quota exhaustion — read/write/delete operations are billed; runaway loops are expensive fast.

What to monitor

  • Operation count per minute (writes vs reads vs deletes)
  • Per-binding latency p99
  • Read-vs-write ratio (a normal app reads >> writes; a write-heavy ratio suggests misuse)
  • Top-N hot keys (any single key > 1/sec is a problem)
  • KV operation error rate
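
A top-N hot-key audit can run over whatever write log you already collect. The sketch below assumes you record each KV write as a (key, timestamp) pair somewhere, for example as Analytics Engine data points; the KvWrite shape and function name are hypothetical. It buckets writes by wall-clock second, which is an approximation: two writes 100ms apart that straddle a second boundary are not flagged.

```typescript
interface KvWrite {
  key: string;
  atMs: number; // epoch milliseconds of the write
}

// Returns keys written more than once within a single wall-clock second,
// sorted by their worst per-second write count.
function findHotKeys(writes: KvWrite[]): { key: string; maxPerSecond: number }[] {
  const perSecond = new Map<string, number>(); // "key + second" -> count
  const worst = new Map<string, number>(); // key -> highest count in any second
  for (const w of writes) {
    const bucket = `${w.key}\u0000${Math.floor(w.atMs / 1000)}`;
    const count = (perSecond.get(bucket) ?? 0) + 1;
    perSecond.set(bucket, count);
    if (count > (worst.get(w.key) ?? 0)) worst.set(w.key, count);
  }
  return Array.from(worst.entries())
    .filter(([, n]) => n > 1)
    .map(([key, maxPerSecond]) => ({ key, maxPerSecond }))
    .sort((a, b) => b.maxPerSecond - a.maxPerSecond);
}
```

Any key this returns is a candidate for moving to a Durable Object.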

KV vs Durable Objects vs D1 — pick the right primitive

A common monitoring story is "KV is broken" when really the team picked KV for a use case it doesn't fit:

  • KV — global config, feature flags, low-write cache. Eventual consistency.
  • Durable Objects — counters, rate limits, per-user state, anything needing strong consistency
  • D1 — relational data, queries, joins

Migrating off the wrong primitive often fixes "KV latency issues" because the new design uses the right tool.
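
For the counter case, a Durable Object version might look like the sketch below. CounterStorage is a local stand-in for the get/put slice of Durable Object storage, and the class and key names are illustrative. The point is only that the read-modify-write happens in one place instead of racing across colos as it would with KV.

```typescript
// Local stand-in for the slice of Durable Object storage used here.
interface CounterStorage {
  get<T>(key: string): Promise<T | undefined>;
  put<T>(key: string, value: T): Promise<void>;
}

class Counter {
  private readonly storage: CounterStorage;

  constructor(state: { storage: CounterStorage }) {
    this.storage = state.storage;
  }

  // All calls for one object ID run in a single instance, so this
  // read-modify-write does not race against writers in other colos.
  async increment(by = 1): Promise<number> {
    const current = (await this.storage.get<number>("count")) ?? 0;
    const next = current + by;
    await this.storage.put("count", next);
    return next;
  }

  // HTTP entry point: each request bumps the counter and returns the new value.
  async fetch(_request: Request): Promise<Response> {
    return new Response(String(await this.increment()));
  }
}
```

A rate limiter follows the same shape, with a timestamp window stored alongside the count.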


Durable Objects: Strongly Consistent Edge State

Brief but worth covering. A Durable Object is a single-instance stateful actor — one DO, one location, one consistency boundary. Use them for: counters, presence (chat rooms), rate limits, per-tenant state, websocket coordination, distributed locks.

What to monitor:

  • DO invocation count + latency per class
  • DO hibernate / wake events (cold-start equivalent)
  • DO storage size (1GB per object)
  • Alarm scheduling and execution latency (cron-equivalent inside a DO)
  • WebSocket connection count per DO (for chat / presence patterns)
  • DO migration / class-rename safety (these are easy to get wrong in production)

Supporting Services (Brief)

  • Queues — at-least-once delivery; monitor queue depth, consumer lag, retry count, DLQ size
  • Pub/Sub — MQTT broker; monitor publish/subscribe rate, connection count, retained messages
  • Hyperdrive — DB connection pooling for upstream Postgres / MySQL; monitor cached query hit ratio, upstream connection count, error rate; for the broader DB story see API Rate Limit Monitoring
  • Vectorize — vector DB for embeddings; monitor index size, query latency, recall — see vector database monitoring
  • Workers AI — model bindings; per-binding latency p95/p99, error rate, neuron usage vs quota, daily cost
  • Browser Rendering — headless Chromium for screenshots/PDFs; concurrent-browser count, request timeout rate

Cron Triggers — The Most-Missed Failure Surface

Cloudflare's Cron Triggers fire scheduled Worker invocations. The default monitoring tells you "this cron is configured." It does not tell you "this cron actually ran and completed successfully last time it was supposed to."

The fix is a heartbeat pattern: each scheduled invocation pings a monitoring endpoint with (cron_name, timestamp, success). An external monitor alerts if the heartbeat doesn't arrive within the expected window.

export default {
    async scheduled(event, env, ctx) {
        try {
            await doScheduledWork(env);
            // waitUntil keeps the invocation alive until the heartbeat request settles
            ctx.waitUntil(
                fetch('https://webalert.io/heartbeat/your-cron-id?status=ok')
            );
        } catch (err) {
            // Report the failure, then rethrow so the run is still recorded as failed
            ctx.waitUntil(
                fetch('https://webalert.io/heartbeat/your-cron-id?status=fail')
            );
            throw err;
        }
    },
};

See Cron Job Monitoring: Background Tasks for the broader pattern.
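
On the monitor side, "the heartbeat didn't arrive within the expected window" comes down to a small comparison. This is an illustrative sketch of that check with hypothetical field names, not a description of how any particular monitoring service implements it.

```typescript
interface HeartbeatState {
  lastSeenMs: number | null; // epoch ms of the last successful ping, null if never seen
  expectedIntervalMs: number; // how often the cron should fire
  graceMs: number; // slack for scheduling jitter and run time
}

// Overdue when no ping has ever arrived, or when the last one is older
// than one full interval plus the grace period.
function isHeartbeatOverdue(state: HeartbeatState, nowMs: number): boolean {
  if (state.lastSeenMs === null) return true;
  return nowMs - state.lastSeenMs > state.expectedIntervalMs + state.graceMs;
}
```

For a cron that fires every 5 minutes and usually finishes in seconds, a grace period of a minute or two avoids false alarms from normal jitter.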


Wrangler Deploy Monitoring

wrangler deploy is fast — typically 5-15 seconds — and the failure modes are:

  • Build error — TypeScript / bundler errors caught at deploy time
  • Configuration error — bound resource missing, secret missing, route conflict
  • Quota exhaustion — too many Workers / routes for plan
  • Account / API token issues — silent in some CI integrations

Monitor:

  • Deploy duration over time (sudden jumps indicate bundling issues)
  • Deploy failure rate
  • Time-since-last-successful-deploy (alert if > N days for a service that should be active)
  • Post-deploy synthetic check (hit the worker's URL within 30s of deploy; alert if the new version isn't returning 200)
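
The post-deploy check in the last bullet can be a short polling loop run from CI right after wrangler deploy. The sketch below takes the fetch function as a parameter so it can be tested without a network; the names, the 2-second interval, and the result shape are assumptions, while the 30-second budget follows the bullet above.

```typescript
type FetchLike = (url: string) => Promise<{ status: number }>;

// Polls the Worker URL until it returns 200 or the time budget runs out.
async function postDeployCheck(
  url: string,
  fetchFn: FetchLike,
  budgetMs = 30_000,
  intervalMs = 2_000,
): Promise<{ ok: boolean; attempts: number; lastStatus: number | null }> {
  const deadline = Date.now() + budgetMs;
  let attempts = 0;
  let lastStatus: number | null = null;
  do {
    attempts += 1;
    try {
      lastStatus = (await fetchFn(url)).status;
      if (lastStatus === 200) return { ok: true, attempts, lastStatus };
    } catch {
      lastStatus = null; // a network error counts as a failed attempt
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  } while (Date.now() < deadline);
  return { ok: false, attempts, lastStatus };
}
```

In CI this might be called as postDeployCheck(workerUrl, (u) => fetch(u)), failing the pipeline when ok is false. Checking only for a 200 does not prove the new version is live; comparing a version string in the response body would tighten the check if your Worker exposes one.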

Workers AI: The New Cost Center

If you're using Workers AI, monitor it like any other AI/LLM service plus a Cloudflare-specific lens:

  • Per-model latency p95/p99 (Llama 3 vs Mistral vs embeddings have very different latency profiles)
  • Per-model error rate
  • Neuron consumption vs daily quota
  • Cost per request (some models are dramatically more expensive than others)
  • Cost per user / per tenant (catch runaway usage)
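
Neuron consumption versus quota and per-tenant usage can both come from the same usage log. The sketch below assumes you record a neuron count per request yourself (how you obtain that number is outside this example); the record shape, the function name, and the budget parameters are all hypothetical.

```typescript
interface AiUsageRecord {
  tenant: string;
  model: string;
  neurons: number;
}

interface UsageReport {
  totalNeurons: number;
  quotaUsed: number; // fraction of the daily quota; values above 1 mean over quota
  byTenant: Record<string, number>;
  overBudgetTenants: string[];
}

// Totals neuron usage, compares it to the daily quota, and flags
// tenants that have consumed more than their individual budget.
function summarizeAiUsage(
  records: AiUsageRecord[],
  dailyQuota: number,
  perTenantBudget: number,
): UsageReport {
  const byTenant: Record<string, number> = {};
  let totalNeurons = 0;
  for (const r of records) {
    totalNeurons += r.neurons;
    byTenant[r.tenant] = (byTenant[r.tenant] ?? 0) + r.neurons;
  }
  const overBudgetTenants = Object.keys(byTenant)
    .filter((t) => byTenant[t] > perTenantBudget)
    .sort();
  const quotaUsed = dailyQuota > 0 ? totalNeurons / dailyQuota : 0;
  return { totalNeurons, quotaUsed, byTenant, overBudgetTenants };
}
```

Alerting when quotaUsed passes a threshold such as 0.8, or when overBudgetTenants is non-empty, catches runaway usage before the daily quota is gone.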

See AI/LLM API Monitoring and AI Agent Monitoring for the broader AI-monitoring picture.


Why You Still Need External Monitoring

The Cloudflare dashboard lags by 30-90 seconds. The Cloudflare status page lags by 5-15 minutes (sometimes longer for regional incidents). Workers Logs are great for after-the-fact debugging but they're not real-time alerting.

External monitoring fills three gaps:

  1. Time-to-detect — a 1-minute external HTTP check beats every other layer for "did the user get a 200?"
  2. The platform itself failing — when Cloudflare is the problem, Cloudflare's own monitoring can't tell you
  3. Multi-perspective verification — external checks from regions outside Cloudflare's edge confirm reachability

The combination that works: Workers Analytics Engine for per-route metrics inside the platform, Tail Workers for deep logs, and external HTTP monitoring with multi-region checks for time-to-detect.

See Multi-Region Monitoring: Why Location Matters, CDN Monitoring, and DDoS Monitoring: Detect & Mitigate Traffic Spikes for the broader edge-monitoring patterns.


Cloudflare Workers Monitoring Checklist

  • Every Worker wraps its fetch handler in try/catch + Analytics Engine write
  • Per-route latency p50/p95/p99 dashboards in Analytics Engine SQL
  • Per-Worker error rate alerting
  • CPU-time p99 alerting at 80% of plan limit
  • Subrequest count p99 alerting at 80% of limit
  • Tail Worker for high-value Workers shipping structured logs
  • D1 query latency split read vs write
  • D1 read-replica lag synthetic probe
  • D1 storage-size trend tracked vs 10GB ceiling
  • R2 class A/B operation rate tracked
  • R2 multipart abandonment lifecycle policy in place
  • KV top-N hot-key audit (no key written > 1/sec)
  • KV operation count tracked (cost)
  • Durable Objects per-class invocation latency + alarm execution lag
  • Cron Triggers heartbeat to external monitor
  • Wrangler deploy success rate + post-deploy synthetic check
  • Workers AI per-model latency + cost + quota
  • External HTTP monitoring from multiple regions on every public Worker URL
  • Status-page subscription for cloudflarestatus.com (lagging indicator only)

How Webalert Helps Monitor Cloudflare Workers Apps

Webalert is the external-monitoring layer that complements Cloudflare's built-in observability:

  • HTTP monitoring — Public Worker URL, R2 public domain, custom domains; 1-minute resolution
  • Multi-region checks — Confirm Worker reachability from regions outside Cloudflare's network; catch routing / DNS issues invisible to the dashboard
  • Content validation — Hit a /internal/workers-health endpoint that surfaces Analytics Engine summary data; alert when error rate or CPU-time p99 crosses threshold
  • Heartbeat monitoring — Cron Triggers ping a heartbeat URL; missed heartbeats alert immediately
  • SSL certificate monitoring — Custom-domain cert expiry and chain validation
  • Response time alerts — Catch Workers degrading from "consistently fast" to "occasionally slow"
  • Multi-channel alerts — Email, SMS, Slack, Discord, Microsoft Teams, webhooks
  • Status page — Communicate degraded Worker regions or AI-binding outages to users
  • 5-minute setup — Add hostnames, point at heartbeats, set thresholds

See features and pricing.


Summary

  • The Cloudflare developer platform is now a full stack — Workers, D1, R2, KV, Durable Objects, Queues, Hyperdrive, Vectorize, Workers AI — each with its own limits and failure modes.
  • The CPU-time limit, the subrequest limit, and the KV hot-key 1/sec rule are the three platform constraints that bite earliest in production.
  • Wrap every Worker's fetch handler in try/catch + Analytics Engine write. Per-route latency, error rate, and top error messages fall out for free.
  • D1 read-replica lag is the most-overlooked source of bugs; run a synthetic write-then-read probe.
  • R2 is great for storage cost but multipart abandonment and operation rate limits still need monitoring.
  • KV is eventually consistent and rate-limited per key; using KV for counters or rate limiting is a frequent root cause of "KV is broken" tickets. Use Durable Objects instead.
  • Cron Triggers need a heartbeat pattern — Cloudflare won't tell you when one silently stops firing.
  • Workers AI is a new cost center; track per-model latency, cost, and quota daily.
  • The Cloudflare dashboard lags real time by ~30-90 seconds, the status page lags by 5-15 minutes. External multi-region HTTP monitoring is what catches outages first.

A well-instrumented Cloudflare Workers app pairs platform-native observability (Analytics Engine, Tail Workers, Logs) with external monitoring (HTTP checks, heartbeats, multi-region). The platform tells you why something failed; external monitoring tells you that it failed — even when the platform itself is the problem.

