Incoming Webhook Monitoring: Retries, Failures & Security

Outgoing webhooks - the ones your app fires at customers - get plenty of attention. The dangerous ones are the incoming webhooks: Stripe sending you a payment_intent.succeeded, GitHub posting a push event, Clerk telling you a user signed up, Shopify announcing a new order, Twilio confirming an SMS delivery.

If your endpoint silently breaks, partners do not call you. They retry a few times, then disable the webhook, and the bug shows up days later as missing revenue, missing users, missing data, or a confused customer support ticket.

This guide focuses on monitoring the receiving side of webhooks: can the provider reach your URL, does it return the right 2xx quickly, are signatures verified, are retries observable, and are failures caught before the integration is disabled.

For the outgoing side, see Webhook Monitoring.

How Incoming Webhooks Actually Fail

Most teams design the happy path and skip the failure modes. The common silent failures:

TLS handshake fails (expired cert, weak ciphers, missing chain).
DNS for the webhook subdomain stops resolving.
Endpoint returns 500 because a downstream dependency is down.
Endpoint returns 200 but never processes the event (background job dropped).
Endpoint takes 12 seconds to respond, provider times out at 10s.
Endpoint returns 401/403 because signature secret was rotated.
WAF or rate limit blocks the provider's IP range.
Body parser rejects the payload after a framework upgrade.
Endpoint moved from /webhooks/stripe to /api/webhooks/stripe and the dashboard URL was not updated.
Partner disables the webhook after N consecutive failures.

All of these are invisible to standard uptime checks of the homepage or login page.

The Provider Timeout Reality

Each provider has an aggressive timeout and retry policy. Miss it and the event is treated as failed:

Provider	Timeout	Retry policy
Stripe	~30s; endpoint may be disabled after an extended failure window	Retries with exponential backoff for up to 3 days
GitHub	10s	No automatic retries; failed deliveries must be redelivered manually or through the API
Shopify	5s	Retries for 48 hours, disables after 19 failed attempts
Clerk	15s	Retries with backoff, surfaces failures in dashboard
Twilio	15s	Configurable retry, falls back to fallback URL
Slack	3s	Single retry then drop for some event types
Auth0	15s	Retries with exponential backoff

If your handler runs migrations, sends emails, or writes to a slow database inline, you will miss these timeouts.

Rule of thumb: respond 200 OK in under 1 second, do the real work asynchronously, and monitor the queue. See Job Queue Monitoring.

What to Monitor

1. Reachability

Can the provider reach your URL at all? Check from outside your network:

DNS resolves to the expected IP.
TLS certificate is valid, chain complete, not expiring.
HTTPS handshake succeeds.
A GET that returns HTTP 405 is fine if POST is the only allowed method (assert on the 405 explicitly).
WAF rules do not 403 the provider's IP ranges.
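
A minimal sketch of the DNS, TLS, and certificate-expiry parts of this check, using only the Python standard library. The function names are illustrative assumptions, and the check only means something when it runs from a host outside your own network.

```python
import socket
import ssl


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Complete a verified TLS handshake and return the certificate's notAfter string.

    Raises OSError (including ssl.SSLError) if DNS resolution, the TCP
    connection, or certificate validation fails.
    """
    context = ssl.create_default_context()  # verifies chain and hostname
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: float) -> float:
    """Convert a notAfter value such as 'Jan  5 09:34:43 2018 GMT' to days remaining."""
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400
```

Any exception from fetch_not_after is itself the alert: the provider would hit the same DNS or handshake failure.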

2. Latency

Webhook endpoints should respond fast. Monitor:

p50, p95, p99 response time per endpoint.
Time to first byte, with a target under 500ms.
No responses slower than the provider's timeout.
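
As a rough illustration, the percentiles can be computed from raw response times with the standard library. The function name and the provider_timeout_ms parameter are assumptions made for this sketch; in practice your metrics backend probably does this for you.

```python
import statistics


def latency_summary(samples_ms, provider_timeout_ms):
    """Summarize response times (in milliseconds) for one webhook endpoint."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
        # Requests the provider would already have given up on.
        "over_timeout": sum(1 for s in samples_ms if s >= provider_timeout_ms),
    }
```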

3. Status Codes

Track per provider and per event type:

2xx rate over rolling window.
4xx rate (signature failures, malformed bodies).
5xx rate (your bugs).
408/504 timeouts (your slow handlers).

4. Signature Verification

Most providers sign requests. Monitor:

Signature verification success rate.
Drops in success rate after a deploy (secret rotation gone wrong).
Clock skew failures (schemes that sign a timestamp reject requests when your server clock drifts).

5. Idempotency

Providers retry. Your handler must be safe to call twice:

Track duplicate event IDs received per hour.
Alert if the number of duplicates your idempotency check detects and skips drops well below its usual baseline (the check may be broken, so duplicates may be processed twice).
Alert if duplicate deliveries spike (the provider is retrying because you are slow or returning 5xx).
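
One way to count duplicates per hour, sketched with an in-memory set. The class and attribute names are hypothetical, and a real handler would rely on a unique index in the database rather than process memory.

```python
from collections import Counter


class DuplicateTracker:
    """Counts repeated event IDs, bucketed by hour of receipt."""

    def __init__(self):
        self.seen = set()
        self.duplicates_per_hour = Counter()

    def observe(self, event_id: str, received_at: float) -> bool:
        """Record one delivery; received_at is Unix seconds. Returns True for a duplicate."""
        if event_id in self.seen:
            self.duplicates_per_hour[int(received_at // 3600)] += 1
            return True
        self.seen.add(event_id)
        return False
```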

6. Partner-Side Failure State

The source of truth is the provider's own delivery record:

Stripe: events with pending_webhooks > 0 after N minutes.
GitHub: webhook last_response.status and last_response.code.
Shopify: webhook delivery status in admin.
Clerk: failed events in webhook dashboard.

Where the provider exposes an API, scrape it and surface failures alongside your own monitoring.
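
For the Stripe case, the check might look like the sketch below. It assumes you have already fetched recent events from the provider API as dictionaries carrying the id, created (Unix seconds), and pending_webhooks fields; the fetching itself is omitted, and the function name and default threshold are illustrative.

```python
def stale_pending_events(events, now, max_age_seconds=600):
    """Return IDs of events that still have undelivered webhooks after max_age_seconds."""
    return [
        e["id"]
        for e in events
        if e["pending_webhooks"] > 0 and now - e["created"] > max_age_seconds
    ]
```

A non-empty result means the provider believes it has not delivered those events successfully, whatever your own logs say.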

A Safe Webhook Endpoint Shape

The minimal, safe webhook handler does three things:

Verify signature.
Enqueue work.
Return 200.

POST /webhooks/stripe

1. Read raw body (do not parse first).
2. Verify Stripe-Signature header using endpoint secret.
3. Insert event row { id, type, payload, status: 'received' } with idempotent unique index on id.
4. Enqueue background job to process event.
5. Return 200 OK.

Everything heavy - further DB writes, emails, downstream API calls, ML, PDF generation - runs in the background. The webhook handler typically stays under 200ms.

This is also what Health Check Endpoint Design recommends for any high-frequency endpoint.
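
The five steps can be sketched in framework-free Python. Everything here is a stand-in chosen for illustration: the signature is a plain hex HMAC-SHA256 of the body rather than any provider's real scheme, the queue is an in-process queue.Queue, and the table lives in an in-memory SQLite database.

```python
import hashlib
import hmac
import json
import queue
import sqlite3

jobs = queue.Queue()  # stand-in for a real background job queue
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE webhook_events "
    "(id TEXT PRIMARY KEY, type TEXT, payload TEXT, status TEXT)"
)


def handle_webhook(raw_body: bytes, signature: str, secret: bytes) -> int:
    """Return the HTTP status code the endpoint should send."""
    # Steps 1-2: verify over the raw bytes, before any JSON parsing.
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return 401
    try:
        event = json.loads(raw_body)
        event_id, event_type = event["id"], event["type"]
    except (ValueError, KeyError, TypeError):
        return 400
    # Step 3: the primary key makes a redelivered event a no-op.
    with db:
        cur = db.execute(
            "INSERT OR IGNORE INTO webhook_events VALUES (?, ?, ?, 'received')",
            (event_id, event_type, raw_body.decode("utf-8")),
        )
    # Step 4: enqueue only the first delivery of each event.
    if cur.rowcount == 1:
        jobs.put(event_id)
    # Step 5: acknowledge; duplicates also get 200 so the provider stops retrying.
    return 200
```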

Monitoring From Outside (External Synthetic)

External monitoring catches problems that internal metrics cannot. Internal metrics depend on the request reaching your app; external monitoring catches DNS, TLS, WAF, CDN, and routing issues.

Set up an external check that posts a benign test payload to a dedicated test endpoint:

POST https://api.example.com/webhooks/stripe/_healthcheck
Content-Type: application/json
X-Webalert-Probe: true

{ "type": "ping" }

Assert:

Status 200.
Response body contains "ok": true (use content validation rather than only status).
Response time below 500ms.
TLS certificate valid for at least 14 days.

Do not call the real handler with fake events; expose a dedicated probe path that exercises the same TLS, WAF, routing, and framework code without writing data.
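
The assertions on the probe response could be expressed as a small function like this one. It is a sketch: the name and the max_ms default are assumptions, the HTTP call is left out, and the TLS expiry check is handled separately.

```python
import json


def evaluate_probe(status: int, body_text: str, elapsed_ms: float, max_ms: float = 500):
    """Return a list of failed assertions; an empty list means the probe passed."""
    failures = []
    if status != 200:
        failures.append("unexpected status")
    try:
        body = json.loads(body_text)
    except ValueError:
        body = None
    # Content validation: a login page or CDN error served with 200 must not pass.
    if not (isinstance(body, dict) and body.get("ok") is True):
        failures.append("body missing ok: true")
    if elapsed_ms >= max_ms:
        failures.append("too slow")
    return failures
```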

For securing that probe, see Monitor Authenticated APIs.

Monitoring Signature Verification

Signature failures are the single most common silent webhook bug after a deploy. Track:

Metric	Why
`webhook.signature.valid` count	Baseline traffic
`webhook.signature.invalid` count	Alert on spikes
`webhook.signature.invalid_rate`	Catch secret rotation regressions
`webhook.timestamp.skew_seconds`	NTP drift breaks HMAC-with-timestamp
`webhook.body.read_bytes`	Parsing or re-serializing the body before verification changes the bytes and breaks the signature

Alert when invalid signature rate exceeds 1% on a per-provider basis. A single bad deploy that swaps the wrong secret will spike this to ~100% instantly.
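
As a sketch of how these metrics could be emitted, here is a verifier for a Stripe-style timestamped signature, where the header carries t=<timestamp> and v1=<hex HMAC-SHA256 of "timestamp.body">. The Counter is a stand-in for your metrics client, and the function names and 300-second tolerance are assumptions; in production the provider's official library is the safer choice.

```python
import hashlib
import hmac
import time
from collections import Counter

metrics = Counter()  # stand-in for a real metrics client


def verify_stripe_style(raw_body: bytes, header: str, secret: str,
                        tolerance: int = 300, now=None) -> bool:
    """Verify a 't=...,v1=...' signature header and record metrics."""
    now = time.time() if now is None else now
    pairs = [p.split("=", 1) for p in header.split(",") if "=" in p]
    timestamp = next((v for k, v in pairs if k.strip() == "t"), None)
    candidates = [v for k, v in pairs if k.strip() == "v1"]
    try:
        sent_at = int(timestamp)
    except (TypeError, ValueError):
        metrics["webhook.signature.invalid"] += 1
        return False
    skew = abs(now - sent_at)
    metrics["webhook.timestamp.skew_seconds"] = skew  # last observed skew
    signed = timestamp.encode() + b"." + raw_body
    expected = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    matches = any(
        hmac.compare_digest(expected.encode(), c.encode()) for c in candidates
    )
    ok = matches and skew <= tolerance
    metrics["webhook.signature.valid" if ok else "webhook.signature.invalid"] += 1
    return ok


def invalid_rate() -> float:
    """Share of verifications that failed; compare against the 1% alert threshold."""
    total = metrics["webhook.signature.valid"] + metrics["webhook.signature.invalid"]
    return metrics["webhook.signature.invalid"] / total if total else 0.0
```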

Webhook Endpoints and Rate Limits

Providers can burst. Stripe will send tens of events per second during checkout spikes. GitHub Actions can deliver hundreds of workflow_run events at once.

Monitor:

Concurrent request count per webhook endpoint.
Rate of 429s your endpoint returns (you should not be returning 429s to a webhook provider; they will treat it as a failure).
WAF rule blocks per provider IP range.
Connection pool saturation on the database.

If your endpoint is rate limited, see API Rate Limit Monitoring. For webhook endpoints, the answer is almost always "enqueue and return 200, do not throttle the provider."

End-to-End Delivery Monitoring

Status code is not enough. To prove the end-to-end pipeline works, monitor:

Stage	Signal
Provider attempted delivery	Provider dashboard / API
Endpoint received request	Application logs / metrics
Signature verified	`webhook.signature.valid`
Event row written	DB count by event id
Background job enqueued	Queue depth / enqueued counter
Background job processed	Processed counter, duration
Side effect completed	Domain-level metric (order created, user provisioned)

Alert when the last stage falls behind the first stage by more than your SLO budget.

This is the same shape as Stripe Payment Monitoring: the webhook is the start of a pipeline; success at the endpoint does not mean success at the business level.
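
One possible way to compute that drift, assuming you can count how many events reached each stage in the same time window. The stage keys and function names are hypothetical labels for the rows in the table above.

```python
PIPELINE_STAGES = [
    "delivered", "received", "verified", "stored",
    "enqueued", "processed", "completed",
]


def stage_drift(counts):
    """Events lost or still in flight at each stage, relative to provider deliveries."""
    first = counts[PIPELINE_STAGES[0]]
    return {stage: first - counts.get(stage, 0) for stage in PIPELINE_STAGES[1:]}


def first_lagging_stage(counts, budget):
    """Return the earliest stage whose gap exceeds the SLO budget, or None."""
    for stage, gap in stage_drift(counts).items():
        if gap > budget:
            return stage
    return None
```

Reporting the earliest lagging stage tells you where to look first: a gap that opens at "stored" points at the database, one that opens at "processed" points at the workers.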

Provider-Specific Notes

Stripe

Use the dedicated endpoint secret per Stripe webhook destination.
Read the raw body before parsing JSON to keep the signature valid.
Watch the pending_webhooks field on the Event API; non-zero for long means you are behind.
Stripe will retry for up to 3 days. Do not treat a single failure as fatal, but track the trend.

GitHub

Verify X-Hub-Signature-256 using HMAC-SHA256 with the webhook secret.
Respond in under 10 seconds. Long workflows must run async.
GitHub does not automatically retry failed deliveries, so a missed event stays missed until you redeliver it. Monitor last_response.code via the API and redeliver failures.
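
A minimal sketch of the X-Hub-Signature-256 check: the header value is the string "sha256=" followed by the hex HMAC-SHA256 of the raw request body. The function name is illustrative.

```python
import hashlib
import hmac


def verify_github_signature(raw_body: bytes, header_value: str, secret: str) -> bool:
    """Compare the X-Hub-Signature-256 header against our own HMAC of the raw body."""
    digest = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    expected = "sha256=" + digest
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), header_value.encode())
```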

Shopify

5 second timeout. Absolutely no synchronous work.
Verify X-Shopify-Hmac-Sha256 using the shared secret.
19 failed deliveries over 48 hours disable the webhook. Track delivery status in the admin API.
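
The Shopify check differs from GitHub's mainly in encoding: the header carries the base64-encoded HMAC-SHA256 of the raw body, not a hex string. A sketch, with an illustrative function name:

```python
import base64
import hashlib
import hmac


def verify_shopify_hmac(raw_body: bytes, header_value: str, shared_secret: str) -> bool:
    """Compare X-Shopify-Hmac-Sha256 against the base64 HMAC of the raw body."""
    digest = hmac.new(shared_secret.encode(), raw_body, hashlib.sha256).digest()
    expected = base64.b64encode(digest)
    return hmac.compare_digest(expected, header_value.encode())
```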

Clerk

Webhooks are signed with Svix headers (svix-id, svix-timestamp, svix-signature).
Verify timestamp within tolerance to prevent replay.
Use the Clerk dashboard to track failed deliveries.
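
For reference, a sketch of Svix-style verification as I understand the scheme: the signed content is "svix-id.svix-timestamp.body", the key is the base64-decoded part of the secret after the whsec_ prefix, and svix-signature holds a space-separated list of "v1,<base64 signature>" entries. Treat those details as assumptions to confirm against the Svix documentation, and prefer the official library in production.

```python
import base64
import hashlib
import hmac
import time


def verify_svix_style(raw_body: bytes, svix_id: str, svix_timestamp: str,
                      svix_signature: str, secret: str,
                      tolerance: int = 300, now=None) -> bool:
    """Verify Svix-style headers, rejecting timestamps outside the tolerance window."""
    now = time.time() if now is None else now
    try:
        sent_at = int(svix_timestamp)
    except ValueError:
        return False
    if abs(now - sent_at) > tolerance:
        return False  # too old or too far in the future: possible replay
    encoded_key = secret[len("whsec_"):] if secret.startswith("whsec_") else secret
    key = base64.b64decode(encoded_key)
    signed = f"{svix_id}.{svix_timestamp}.".encode() + raw_body
    expected = base64.b64encode(hmac.new(key, signed, hashlib.sha256).digest())
    # The header may list several signatures, for example during secret rotation.
    for entry in svix_signature.split(" "):
        version, _, signature = entry.partition(",")
        if version == "v1" and hmac.compare_digest(expected, signature.encode()):
            return True
    return False
```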

Twilio

Signature is in X-Twilio-Signature and is computed over the full URL plus sorted POST params - any reverse proxy that rewrites the URL breaks verification.
Configure a fallback URL where possible.
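
To see why a rewritten URL breaks verification, here is a sketch of the signature computation for form-encoded POSTs: the URL, followed by each parameter name and value in sorted key order, is signed with HMAC-SHA1 using the auth token and then base64-encoded. The function names are illustrative.

```python
import base64
import hashlib
import hmac


def twilio_signature(url: str, params: dict, auth_token: str) -> str:
    """Compute the expected X-Twilio-Signature for a form-encoded POST."""
    data = url + "".join(key + params[key] for key in sorted(params))
    digest = hmac.new(auth_token.encode(), data.encode("utf-8"), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()


def verify_twilio(url: str, params: dict, header_value: str, auth_token: str) -> bool:
    """url must be exactly what Twilio requested, not what your proxy forwarded."""
    expected = twilio_signature(url, params, auth_token)
    return hmac.compare_digest(expected.encode(), header_value.encode())
```

If a proxy terminates TLS and forwards plain http:// internally, rebuild the public https:// URL before verifying.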

Slack

3 second timeout for many event types - extremely aggressive.
Respond first, process later, always.

Securing the Endpoint

Webhook endpoints are public. They need their own security checks:

TLS only, modern ciphers - see TLS Configuration Monitoring.
Reject requests without a valid signature, never "log and continue."
Reject replayed timestamps outside a tight window (5 minutes typical).
Restrict by provider IP range when published (Stripe and GitHub publish IP lists; check each provider's current documentation before relying on one).
Rate-limit unauthenticated callers, but never the provider.
Do not echo the payload back in errors - it can contain PII.
Log signature failures with hashed identifiers, not raw secrets.

For securing the perimeter, see HTTP Security Headers Monitoring.

Alerting Thresholds

Critical

Webhook endpoint 5xx rate over 1% for 5 minutes.
Signature verification success rate drops below 99%.
Provider dashboard reports webhook disabled.
Endpoint p95 latency exceeds provider timeout.
TLS certificate expires in under 7 days on webhook host.
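
The critical rules above could be evaluated along these lines. The metric keys are hypothetical names for values your monitoring already collects, and the "for 5 minutes" condition is assumed to be baked into the 5xx_rate_5m input.

```python
def critical_alerts(m: dict, provider_timeout_ms: float) -> list:
    """Return the names of critical conditions that currently hold."""
    alerts = []
    if m["5xx_rate_5m"] > 0.01:
        alerts.append("5xx rate over 1%")
    if m["signature_success_rate"] < 0.99:
        alerts.append("signature success below 99%")
    if m["provider_disabled"]:
        alerts.append("provider disabled webhook")
    if m["p95_ms"] > provider_timeout_ms:
        alerts.append("p95 exceeds provider timeout")
    if m["tls_days_remaining"] < 7:
        alerts.append("TLS certificate expires in under 7 days")
    return alerts
```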

High

Webhook duplicate-processing rate spikes (provider retrying you).
Background job queue depth for webhook events grows for 10 minutes.
End-to-end stage drift (provider sent, side effect not done) exceeds SLO.
4xx rate from provider IP ranges spikes (likely body or signature issue).

Informational

New event type observed (could mean a provider rolled out a feature).
Webhook URL responded with redirect (providers do not follow redirects reliably).
Spike in webhook traffic outside business hours.

For routing these alerts without burning the team out, see Alert Fatigue.

Incoming Webhook Monitoring Checklist

External synthetic check on each webhook endpoint
Dedicated _healthcheck probe path that exercises TLS/WAF/framework
Content assertion on the probe response, not only status code
Per-provider 2xx/4xx/5xx tracking
Signature verification success rate monitored and alerted
p95 latency below provider timeout, with margin
Idempotency key on event id with monitoring of duplicate rate
Background queue depth and processing lag monitored
Provider-side failure dashboard scraped or alerted
TLS certificate expiry tracked on webhook host
WAF and IP allow-lists reviewed when providers change ranges
Runbook documented per provider for "webhook disabled" recovery

For the runbook side of this, see Incident Runbook Template.

How Webalert Helps

Webalert is built for exactly this kind of external, content-aware monitoring:

HTTP monitoring - POST a probe payload to /webhooks/stripe/_healthcheck, /webhooks/github/_healthcheck, and assert status, latency, and response body.
Content validation - Assert "ok": true, "signature": "verified", or your custom marker so a wrong handler returning 200 does not look healthy.
TLS and certificate monitoring - Catch expiring certs and broken chains on the webhook hostname before the provider does.
Multi-region checks - Detect routing or WAF failures that only affect certain provider IP ranges.
Latency alerts - Trigger when p95 approaches a provider timeout, well before delivery actually fails.
Status page integration - Distinguish "site up" from "webhooks healthy" so customers and partners see the truth.
Alert routing - Notify the integration owner directly, not just the on-call engineer.

Example Webalert check:

URL: https://api.example.com/webhooks/stripe/_healthcheck
Method: POST
Body: {"type":"ping"}
Expected status: 200
Must contain: "ok":true
Must not contain: signature_invalid
Response time: under 800ms
Region: US + EU
TLS expiry warning: 14 days

Summary

Incoming webhooks fail quietly: a provider retries silently, eventually disables the integration, and stops calling. Status-code-only monitoring will not catch it.

Monitor reachability, latency, signature verification, idempotency, and end-to-end delivery. Probe from outside, assert on response body, and watch the provider's own delivery status. Done well, you find out before Stripe disables your endpoint - not after.
