
Outgoing webhooks - the ones your app fires at customers - get plenty of attention. The dangerous ones are the incoming webhooks: Stripe sending you a payment_intent.succeeded, GitHub posting a push event, Clerk telling you a user signed up, Shopify announcing a new order, Twilio confirming an SMS delivery.
If your endpoint silently breaks, partners do not call you. They retry a few times, then disable the webhook, and the bug shows up days later as missing revenue, missing users, missing data, or a confused customer support ticket.
This guide focuses on monitoring the receiving side of webhooks: can the provider reach your URL, does it return the right 2xx quickly, are signatures verified, are retries observable, and are failures caught before the integration is disabled.
For the outgoing side, see Webhook Monitoring.
How Incoming Webhooks Actually Fail
Most teams design the happy path and skip the failure modes. The common silent failures:
- TLS handshake fails (expired cert, weak ciphers, missing chain).
- DNS for the webhook subdomain stops resolving.
- Endpoint returns 500 because a downstream dependency is down.
- Endpoint returns 200 but never processes the event (background job dropped).
- Endpoint takes 12 seconds to respond, provider times out at 10s.
- Endpoint returns 401/403 because signature secret was rotated.
- WAF or rate limit blocks the provider's IP range.
- Body parser rejects the payload after a framework upgrade.
- Endpoint moved from `/webhooks/stripe` to `/api/webhooks/stripe` and the dashboard URL was not updated.
- Partner disables the webhook after N consecutive failures.
All of these are invisible to standard uptime checks of the homepage or login page.
The Provider Timeout Reality
Each provider has an aggressive timeout and retry policy. Miss it and the event is treated as failed:
| Provider | Timeout | Retry policy |
|---|---|---|
| Stripe | ~30s read | Retries with exponential backoff for up to 3 days; disables the endpoint after an extended failure window |
| GitHub | 10s | Retries up to 3 times with backoff, then disables |
| Shopify | 5s | Retries for 48 hours, disables after 19 failed |
| Clerk | 15s | Retries with backoff, surfaces failures in dashboard |
| Twilio | 15s | Configurable retry, falls back to fallback URL |
| Slack | 3s | Single retry then drop for some event types |
| Auth0 | 15s | Retries with exponential backoff |
If your handler runs migrations, sends emails, or writes to a slow database inline, you will miss these timeouts.
Rule of thumb: respond 200 OK in under 1 second, do the real work asynchronously, and monitor the queue. See Job Queue Monitoring.
What to Monitor
1. Reachability
Can the provider reach your URL at all? Check from outside your network:
- DNS resolves to the expected IP.
- TLS certificate is valid, chain complete, not expiring.
- HTTPS handshake succeeds.
- HTTP 405 returned for `GET` is fine if `POST` is the only method (assert it).
- WAF rules do not 403 the provider's IP ranges.
2. Latency
Webhook endpoints should respond fast. Monitor:
- p50, p95, p99 response time per endpoint.
- Time-to-first-byte under 500ms target.
- No timeouts above provider thresholds.
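As a rough illustration of the latency checks above, the sketch below computes percentiles over a window of response times and compares p95 against a provider timeout. The timeout values are copied from the table in this guide; the function names and the 50% safety margin are assumptions made for the example, not recommendations from any provider.

```python
from statistics import quantiles

# Provider timeouts in seconds, copied from the table in this guide.
PROVIDER_TIMEOUT_S = {"stripe": 30.0, "github": 10.0, "shopify": 5.0, "slack": 3.0}


def latency_report(samples_ms: list[float]) -> dict[str, float]:
    """Return p50/p95/p99 in milliseconds for a window of response times.

    Needs at least two samples; quantiles() raises StatisticsError otherwise.
    """
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cuts = quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def breaches_timeout(provider: str, p95_ms: float, margin: float = 0.5) -> bool:
    """True when p95 exceeds `margin` (a fraction) of the provider timeout.

    The 0.5 default is an arbitrary illustrative margin.
    """
    return p95_ms > PROVIDER_TIMEOUT_S[provider] * 1000 * margin
```

Alerting on a fraction of the timeout rather than the timeout itself leaves room to react before deliveries actually start failing.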
3. Status Codes
Track per provider and per event type:
- 2xx rate over rolling window.
- 4xx rate (signature failures, malformed bodies).
- 5xx rate (your bugs).
- 408/504 timeouts (your slow handlers).
4. Signature Verification
Most providers sign requests. Monitor:
- Signature verification success rate.
- Drops in success rate after a deploy (secret rotation gone wrong).
- Clock skew failures (HMAC includes a timestamp).
5. Idempotency
Providers retry. Your handler must be safe to call twice:
- Track duplicate event IDs received per hour.
- Alert if the number of duplicates detected and skipped drops below the expected baseline (the idempotency check may be broken and duplicates are being processed as new events).
- Alert if duplicate processing spikes (means provider is retrying because you are slow or returning 5xx).
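One way to get both idempotency and a duplicate counter is to let a unique key on the event ID reject repeats. The sketch below uses SQLite and an in-process `Counter` purely as stand-ins; the table name, metric names, and helper functions are hypothetical.

```python
import sqlite3
from collections import Counter

metrics = Counter()  # stand-in for a real metrics client


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the event store; the primary key on id is what enforces idempotency."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS webhook_events "
        "(id TEXT PRIMARY KEY, type TEXT, payload TEXT)"
    )
    return conn


def record_event(conn: sqlite3.Connection, event_id: str, event_type: str, payload: str) -> bool:
    """Insert the event once. Returns True if it is new, False if it is a duplicate."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO webhook_events (id, type, payload) VALUES (?, ?, ?)",
                (event_id, event_type, payload),
            )
    except sqlite3.IntegrityError:
        # The provider retried an event we already stored.
        metrics["webhook.events.duplicate"] += 1
        return False
    metrics["webhook.events.new"] += 1
    return True
```

Only enqueue work when `record_event` returns `True`; the duplicate counter then doubles as the retry signal described above.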
6. Partner-Side Failure State
The truth source is the provider dashboard:
- Stripe: events with `pending_webhooks > 0` after N minutes.
- GitHub: webhook `last_response.status` and `last_response.code`.
- Shopify: webhook delivery status in admin.
- Clerk: failed events in webhook dashboard.
Where the provider exposes an API, scrape it and surface failures alongside your own monitoring.
A Safe Webhook Endpoint Shape
The minimal, safe webhook handler does three things:
- Verify signature.
- Enqueue work.
- Return 200.
POST /webhooks/stripe
1. Read raw body (do not parse first).
2. Verify Stripe-Signature header using endpoint secret.
3. Insert event row { id, type, payload, status: 'received' } with idempotent unique index on id.
4. Enqueue background job to process event.
5. Return 200 OK.
Everything heavy - DB writes, emails, downstream API calls, ML, PDF generation - runs in the background. The webhook handler stays under 200ms typical.
This is also what Health Check Endpoint Design recommends for any high-frequency endpoint.
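The five handler steps above can be sketched in Python with only the standard library. This is a minimal illustration, not Stripe's own code: in production the official Stripe library's webhook verification is the safer choice. The in-memory `jobs` queue and `seen_ids` set are stand-ins for a real job queue and a unique index, and returning 400 on a bad signature is an assumption about the status you want to send.

```python
import hashlib
import hmac
import json
import queue
import time
from typing import Optional

TOLERANCE_S = 300  # reject timestamps more than 5 minutes off
jobs: "queue.Queue[str]" = queue.Queue()  # stand-in for a real job queue
seen_ids: set = set()                     # stand-in for a unique index on event id


def verify_stripe_signature(raw_body: bytes, header: str, secret: str,
                            now: Optional[float] = None) -> bool:
    """Check a Stripe-Signature header of the form 't=<unix>,v1=<hex>[,v1=...]'."""
    parts = [p.strip().split("=", 1) for p in header.split(",") if "=" in p]
    timestamp = next((v for k, v in parts if k == "t"), None)
    candidates = [v for k, v in parts if k == "v1"]
    if timestamp is None or not timestamp.isdigit() or not candidates:
        return False
    current = time.time() if now is None else now
    if abs(current - int(timestamp)) > TOLERANCE_S:
        return False  # replayed or badly skewed request
    # The signed payload is the timestamp, a dot, and the raw (unparsed) body.
    signed = timestamp.encode() + b"." + raw_body
    expected = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    return any(hmac.compare_digest(expected.encode(), c.encode()) for c in candidates)


def handle_webhook(raw_body: bytes, signature_header: str, secret: str,
                   now: Optional[float] = None) -> int:
    """Return the HTTP status the endpoint should send."""
    if not verify_stripe_signature(raw_body, signature_header, secret, now):
        return 400
    event = json.loads(raw_body)  # parse only after the signature checks out
    if event["id"] not in seen_ids:  # duplicates are acknowledged but not re-enqueued
        seen_ids.add(event["id"])
        jobs.put(event["id"])
    return 200
```

Note that a retried event still gets a 200: the provider only needs to know the event was received, not that it was processed again.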
Monitoring From Outside (External Synthetic)
External monitoring catches problems that internal metrics cannot. Internal metrics depend on the request reaching your app; external monitoring catches DNS, TLS, WAF, CDN, and routing issues.
Set up an external check that posts a benign test payload to a dedicated test endpoint:
POST https://api.example.com/webhooks/stripe/_healthcheck
Content-Type: application/json
X-Webalert-Probe: true
{ "type": "ping" }
Assert:
- Status 200.
- Response body contains `"ok": true` (use content validation rather than only status).
- Response time below 500ms.
- TLS certificate valid for at least 14 days.
Do not call the real handler with fake events; expose a dedicated probe path that exercises the same TLS, WAF, routing, and framework code without writing data.
For securing that probe, see Monitor Authenticated APIs.
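If you want to prototype such a probe before configuring a monitoring service, a small script can cover the status, body, and latency assertions (certificate expiry is not covered here). The sketch below keeps the assertions in a pure function so they can be tested without network access; `ProbeResult`, `evaluate`, and `run_probe` are hypothetical names.

```python
import json
import time
import urllib.error
import urllib.request
from dataclasses import dataclass


@dataclass
class ProbeResult:
    status: int
    body: bytes
    elapsed_ms: float


def evaluate(result: ProbeResult, max_ms: float = 500.0) -> list:
    """Return the list of failed assertions; an empty list means healthy."""
    failures = []
    if result.status != 200:
        failures.append(f"status {result.status} != 200")
    try:
        ok = json.loads(result.body).get("ok") is True
    except (ValueError, AttributeError):  # not JSON, or JSON that is not an object
        ok = False
    if not ok:
        failures.append('body does not contain "ok": true')
    if result.elapsed_ms >= max_ms:
        failures.append(f"took {result.elapsed_ms:.0f}ms, limit {max_ms:.0f}ms")
    return failures


def run_probe(url: str, timeout_s: float = 5.0) -> ProbeResult:
    """POST the ping payload. DNS/TLS/connect errors raise URLError; treat that as down."""
    req = urllib.request.Request(
        url,
        data=b'{"type": "ping"}',
        method="POST",
        headers={"Content-Type": "application/json", "X-Webalert-Probe": "true"},
    )
    start = time.monotonic()
    try:
        with urllib.request.urlopen(req, timeout=timeout_s) as resp:
            status, body = resp.status, resp.read()
    except urllib.error.HTTPError as err:  # non-2xx responses still carry a status and body
        status, body = err.code, err.read()
    return ProbeResult(status, body, (time.monotonic() - start) * 1000)
```

A typical call would be `evaluate(run_probe("https://api.example.com/webhooks/stripe/_healthcheck"))`, run from a host outside your own network.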
Monitoring Signature Verification
Signature failures are the single most common silent webhook bug after a deploy. Track:
| Metric | Why |
|---|---|
| `webhook.signature.valid` count | Baseline traffic |
| `webhook.signature.invalid` count | Alert on spikes |
| `webhook.signature.invalid_rate` | Catch secret rotation regressions |
| `webhook.timestamp.skew_seconds` | NTP drift breaks HMAC-with-timestamp |
| `webhook.body.read_bytes` | Body parsed before signature can break verification |
Alert when invalid signature rate exceeds 1% on a per-provider basis. A single bad deploy that swaps the wrong secret will spike this to ~100% instantly.
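The 1% rule can be expressed as a small per-provider counter. This is only a sketch with an in-process store; the `min_events` floor is an assumption added so that a single bad request in a quiet window does not page anyone.

```python
from collections import Counter, defaultdict

# provider name -> counts of valid / invalid signatures in the current window
counts = defaultdict(Counter)


def record_signature_result(provider: str, valid: bool) -> None:
    counts[provider]["valid" if valid else "invalid"] += 1


def invalid_rate(provider: str) -> float:
    """Fraction of requests in the window whose signature failed verification."""
    c = counts[provider]
    total = c["valid"] + c["invalid"]
    return c["invalid"] / total if total else 0.0


def should_alert(provider: str, threshold: float = 0.01, min_events: int = 100) -> bool:
    """Alert when the invalid rate exceeds the threshold on enough traffic."""
    c = counts[provider]
    total = c["valid"] + c["invalid"]
    return total >= min_events and invalid_rate(provider) > threshold
```

In a real deployment the window would be reset or rolled on a schedule, and the counters would live in your metrics system rather than in process memory.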
Webhook Endpoints and Rate Limits
Providers can burst. Stripe will send tens of events per second during checkout spikes. GitHub Actions can deliver hundreds of workflow_run events at once.
Monitor:
- Concurrent request count per webhook endpoint.
- Rate of 429s your endpoint returns (you should not be returning 429s to a webhook provider; they will treat it as a failure).
- WAF rule blocks per provider IP range.
- Connection pool saturation on the database.
If your endpoint is rate limited, see API Rate Limit Monitoring. For webhook endpoints, the answer is almost always "enqueue and return 200, do not throttle the provider."
End-to-End Delivery Monitoring
Status code is not enough. To prove the end-to-end pipeline works, monitor:
| Stage | Signal |
|---|---|
| Provider attempted delivery | Provider dashboard / API |
| Endpoint received request | Application logs / metrics |
| Signature verified | webhook.signature.valid |
| Event row written | DB count by event id |
| Background job enqueued | Queue depth / enqueued counter |
| Background job processed | Processed counter, duration |
| Side effect completed | Domain-level metric (order created, user provisioned) |
Alert when the last stage falls behind the first stage by more than your SLO budget.
This is the same shape as Stripe Payment Monitoring: the webhook is the start of a pipeline; success at the endpoint does not mean success at the business level.
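To make the stage comparison concrete, the sketch below takes one counter per stage for the same time window and reports where events were lost. The stage keys are hypothetical labels for the rows of the table above, and `budget` stands for whatever event count your SLO allows to be in flight.

```python
# Hypothetical stage keys, in pipeline order, matching the table above.
STAGES = ["delivered", "received", "verified", "stored",
          "enqueued", "processed", "completed"]


def stage_drift(counts: dict) -> dict:
    """For each stage after the first, how many events did not make it from the previous stage."""
    return {cur: counts[prev] - counts[cur] for prev, cur in zip(STAGES, STAGES[1:])}


def pipeline_lag(counts: dict) -> int:
    """Events the provider delivered that have not yet produced their side effect."""
    return counts[STAGES[0]] - counts[STAGES[-1]]


def over_budget(counts: dict, budget: int) -> bool:
    """True when the last stage trails the first by more than the SLO budget."""
    return pipeline_lag(counts) > budget
```

The per-stage drift tells you where to look: a gap at `verified` points at secrets, a gap at `processed` points at the queue or workers.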
Provider-Specific Notes
Stripe
- Use the dedicated endpoint secret per Stripe webhook destination.
- Read the raw body before parsing JSON to keep the signature valid.
- Watch the `pending_webhooks` field on the Event API; non-zero for long means you are behind.
- Stripe will retry for up to 3 days. Do not treat a single failure as fatal, but track the trend.
GitHub
- Verify `X-Hub-Signature-256` using HMAC-SHA256 with the webhook secret.
- Respond in under 10 seconds. Long workflows must run async.
- After 3 failed deliveries with no successes, GitHub may disable the webhook. Monitor the `last_response.code` via the API.
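A minimal sketch of the `X-Hub-Signature-256` check: the header carries `sha256=` followed by the hex HMAC-SHA256 of the raw request body, keyed with the webhook secret. The function name is hypothetical.

```python
import hashlib
import hmac


def verify_github_signature(raw_body: bytes, header: str, secret: str) -> bool:
    """Compare the X-Hub-Signature-256 header against our own HMAC of the raw body."""
    if not header.startswith("sha256="):
        return False
    expected = "sha256=" + hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected.encode(), header.encode())
```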
Shopify
- 5 second timeout. Absolutely no synchronous work.
- Verify `X-Shopify-Hmac-Sha256` using the shared secret.
- 19 failures over 48 hours disables the webhook. Track delivery status in the admin API.
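Shopify's header differs from GitHub's mainly in encoding: it is the base64-encoded HMAC-SHA256 of the raw body rather than a hex string. A short sketch, with a hypothetical function name:

```python
import base64
import hashlib
import hmac


def verify_shopify_hmac(raw_body: bytes, header: str, secret: str) -> bool:
    """Compare X-Shopify-Hmac-Sha256 against the base64 HMAC-SHA256 of the raw body."""
    digest = hmac.new(secret.encode(), raw_body, hashlib.sha256).digest()
    expected = base64.b64encode(digest)
    return hmac.compare_digest(expected, header.encode())
```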
Clerk
- Webhooks are signed with Svix headers (`svix-id`, `svix-timestamp`, `svix-signature`).
- Verify timestamp within tolerance to prevent replay.
- Use the Clerk dashboard to track failed deliveries.
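For illustration, the Svix scheme can be checked by hand: the signed content is `id.timestamp.body`, the key is the base64-decoded part of the secret after the `whsec_` prefix, and the header holds one or more space-separated `v1,<base64>` entries. The sketch below assumes that layout and a 5-minute tolerance; in production the official Svix library is the safer route.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional


def verify_svix(raw_body: bytes, svix_id: str, svix_timestamp: str,
                svix_signature: str, secret: str,
                now: Optional[float] = None, tolerance_s: int = 300) -> bool:
    """Verify Svix-style headers, rejecting timestamps outside the tolerance window."""
    if not svix_timestamp.isdigit():
        return False
    current = time.time() if now is None else now
    if abs(current - int(svix_timestamp)) > tolerance_s:
        return False  # too old or too far in the future: possible replay
    key = base64.b64decode(secret.removeprefix("whsec_"))
    signed = f"{svix_id}.{svix_timestamp}.".encode() + raw_body
    expected = base64.b64encode(hmac.new(key, signed, hashlib.sha256).digest())
    # The header may list several signatures (for example during secret rotation).
    for entry in svix_signature.split(" "):
        version, _, sig = entry.partition(",")
        if version == "v1" and hmac.compare_digest(expected, sig.encode()):
            return True
    return False
```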
Twilio
- Signature is in `X-Twilio-Signature` and is computed over the full URL plus sorted POST params - any reverse proxy that rewrites the URL breaks verification.
- Configure a fallback URL where possible.
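The URL sensitivity is easier to see in code. For form-encoded POSTs, the signature is the base64 HMAC-SHA1 (keyed with the account auth token) of the full URL followed by each parameter name and value, sorted by name. The sketch below assumes one value per parameter name; the function names are hypothetical.

```python
import base64
import hashlib
import hmac


def twilio_signature(url: str, params: dict, auth_token: str) -> str:
    """Compute the expected X-Twilio-Signature for a form-encoded POST."""
    # Full URL as Twilio requested it, then key+value pairs sorted by key.
    data = url + "".join(key + params[key] for key in sorted(params))
    digest = hmac.new(auth_token.encode(), data.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()


def verify_twilio_signature(url: str, params: dict, header: str, auth_token: str) -> bool:
    expected = twilio_signature(url, params, auth_token)
    return hmac.compare_digest(expected.encode(), header.encode())
```

If a proxy terminates TLS and your app reconstructs the URL as `http://...` or with a different host or path, `url` no longer matches what Twilio signed and every request fails verification.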
Slack
- 3 second timeout for many event types - extremely aggressive.
- Respond first, process later, always.
Securing the Endpoint
Webhook endpoints are public. They need their own security checks:
- TLS only, modern ciphers - see TLS Configuration Monitoring.
- Reject requests without a valid signature, never "log and continue."
- Reject replayed timestamps outside a tight window (5 minutes typical).
- Restrict by provider IP range where the provider publishes one (Stripe and GitHub do; confirm availability in each provider's documentation).
- Rate-limit unauthenticated callers, but never the provider.
- Do not echo the payload back in errors - it can contain PII.
- Log signature failures with hashed identifiers, not raw secrets.
For securing the perimeter, see HTTP Security Headers Monitoring.
Alerting Thresholds
Critical
- Webhook endpoint 5xx rate over 1% for 5 minutes.
- Signature verification success rate drops below 99%.
- Provider dashboard reports webhook disabled.
- Endpoint p95 latency exceeds provider timeout.
- TLS certificate expires in under 7 days on webhook host.
High
- Webhook duplicate-processing rate spikes (provider retrying you).
- Background job queue depth for webhook events grows for 10 minutes.
- End-to-end stage drift (provider sent, side effect not done) exceeds SLO.
- 4xx rate from provider IP ranges spikes (likely body or signature issue).
Informational
- New event type observed (could mean a provider rolled out a feature).
- Webhook URL responded with redirect (providers do not follow redirects reliably).
- Spike in webhook traffic outside business hours.
For routing these alerts without burning the team out, see Alert Fatigue.
Incoming Webhook Monitoring Checklist
- External synthetic check on each webhook endpoint
- Dedicated `_healthcheck` probe path that exercises TLS/WAF/framework
- Content assertion on the probe response, not only status code
- Per-provider 2xx/4xx/5xx tracking
- Signature verification success rate monitored and alerted
- p95 latency below provider timeout, with margin
- Idempotency key on event id with monitoring of duplicate rate
- Background queue depth and processing lag monitored
- Provider-side failure dashboard scraped or alerted
- TLS certificate expiry tracked on webhook host
- WAF and IP allow-lists reviewed when providers change ranges
- Runbook documented per provider for "webhook disabled" recovery
For the runbook side of this, see Incident Runbook Template.
How Webalert Helps
Webalert is built for exactly this kind of external, content-aware monitoring:
- HTTP monitoring - POST a probe payload to `/webhooks/stripe/_healthcheck`, `/webhooks/github/_healthcheck`, and assert status, latency, and response body.
- Content validation - Assert `"ok": true`, `"signature": "verified"`, or your custom marker so a wrong handler returning 200 does not look healthy.
- TLS and certificate monitoring - Catch expiring certs and broken chains on the webhook hostname before the provider does.
- Multi-region checks - Detect routing or WAF failures that only affect certain provider IP ranges.
- Latency alerts - Trigger when p95 approaches a provider timeout, well before delivery actually fails.
- Status page integration - Distinguish "site up" from "webhooks healthy" so customers and partners see the truth.
- Alert routing - Notify the integration owner directly, not just the on-call engineer.
Example Webalert check:
- URL: `https://api.example.com/webhooks/stripe/_healthcheck`
- Method: `POST`
- Body: `{"type":"ping"}`
- Expected status: `200`
- Must contain: `"ok":true`
- Must not contain: `signature_invalid`
- Response time: under 800ms
- Region: US + EU
- TLS expiry warning: 14 days
Summary
Incoming webhooks fail quietly: a provider retries silently, then gives up and disables the integration. Status-code-only monitoring will not catch it.
Monitor reachability, latency, signature verification, idempotency, and end-to-end delivery. Probe from outside, assert on response body, and watch the provider's own delivery status. Done well, you find out before Stripe disables your endpoint - not after.