
Retrying a failed request is one of the most reasonable things a program can do. It's also one of the most dangerous. The instinct — "it failed, try again" — is exactly right for a one-off network blip and exactly wrong during an outage, where it does the one thing you least want: it adds load to a system that's already on its knees. Done naively, retries don't help a struggling service recover — they keep it down, and can take down everything around it.
This is the retry storm, and it's a leading cause of outages that won't end on their own. This guide explains why retries amplify failures, and the three techniques — exponential backoff, jitter, and retry budgets — that make retries safe.
Why Naive Retries Make Things Worse
Picture a service that's briefly overloaded and starts returning errors. Every client that gets an error immediately retries. Now the service is handling its normal traffic plus a wave of retries — more load, precisely when it has less capacity. It returns more errors, which trigger more retries, which add more load. The system has entered a feedback loop that sustains its own failure.
Three things make this worse:
- Immediate retries hit the service again while it's still struggling, giving it no chance to recover.
- Fixed-interval retries from many clients synchronize — everyone who failed at 12:00:00 retries at 12:00:01, creating a thundering herd of simultaneous requests in coordinated waves.
- Layered retries multiply. If each of three tiers of services makes up to three attempts, a single user request can become 3 × 3 × 3 = 27 requests to the bottom layer. The amplification grows exponentially with the number of tiers that retry.
The cruel irony: the retry, meant to improve reliability, becomes the mechanism that prevents recovery. Fixing this is about retrying less aggressively and less in unison.
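The layered-retry arithmetic above can be checked with a few lines of Python. This is only an illustrative worst-case model: the function name is hypothetical, and it assumes every tier exhausts all of its attempts for every request it receives.

```python
def worst_case_requests(attempts_per_tier: int, retrying_tiers: int) -> int:
    """Worst-case requests reaching the bottom layer for one user request.

    Each retrying tier makes up to `attempts_per_tier` attempts (the first
    try plus its retries) for every request it receives, so the counts
    multiply from tier to tier.
    """
    return attempts_per_tier ** retrying_tiers


# How the worst case grows as more tiers retry (three attempts each).
for tiers in range(1, 5):
    print(f"{tiers} retrying tier(s): {worst_case_requests(3, tiers)} requests")
```

One common way to limit this multiplication is to retry at a single layer only, rather than at every hop.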
Exponential Backoff: Slow Down
The first fix is to wait longer between each successive retry, growing the delay exponentially:
Attempt 1: wait 1s
Attempt 2: wait 2s
Attempt 3: wait 4s
Attempt 4: wait 8s (and so on, up to a cap)
Each failed attempt roughly doubles the wait before the next. This does two things: it gives the struggling service progressively more breathing room to recover, and it sharply reduces the total number of requests a client sends during an outage. A maxRetries limit and a maxDelay cap keep it bounded — you don't want to retry forever or wait an hour between attempts.
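As a minimal sketch of the schedule above, assuming a 1-second base delay and a 30-second cap (both illustrative values), a capped exponential backoff loop might look like this. The names backoff_delay and call_with_backoff are hypothetical, and the injectable sleep parameter exists only so the loop can be exercised without real waiting.

```python
import time


def backoff_delay(attempt: int, base: float = 1.0, max_delay: float = 30.0) -> float:
    """Delay after failed attempt number `attempt` (1-based).

    Doubles each time (base, 2*base, 4*base, ...) and never exceeds max_delay.
    """
    return min(max_delay, base * 2 ** (attempt - 1))


def call_with_backoff(operation, max_retries: int = 4, base: float = 1.0,
                      max_delay: float = 30.0, sleep=time.sleep):
    """Call `operation`, retrying up to `max_retries` times with backoff.

    Catching Exception keeps the sketch short; real code should retry only
    errors known to be transient (timeouts, 5xx responses, and similar).
    """
    for attempt in range(1, max_retries + 1):
        try:
            return operation()
        except Exception:
            sleep(backoff_delay(attempt, base, max_delay))
    # Final attempt: if this one fails too, the exception reaches the caller.
    return operation()
```

With the defaults, a persistently failing operation is called five times in total (one initial attempt plus four retries), with waits of 1s, 2s, 4s, and 8s between calls.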
Exponential backoff alone is a massive improvement over immediate or fixed retries. But it has a hidden flaw, and that's where jitter comes in.
Jitter: Spread Out
Backoff fixes how often one client retries. It does not fix the synchronization problem. If a thousand clients all fail at the same instant and all use the same backoff schedule, they all wait 1s, then all retry together; all wait 2s, then all retry together. The load arrives in synchronized spikes — the thundering herd survives, just at wider intervals.
Jitter fixes this by adding randomness to each delay so clients spread out instead of marching in lockstep:
- Full jitter: wait a random time between 0 and the backoff value (e.g. random(0, 4s) instead of exactly 4s). This is the most effective at flattening the spikes.
- Equal jitter: wait half the backoff plus a random amount up to the other half, keeping some minimum spacing while still de-synchronizing.
Jitter turns coordinated waves into a smooth, manageable trickle of retries. Backoff and jitter are not alternatives; you want both: backoff to reduce volume, jitter to break synchronization. Many well-designed client libraries, including the AWS SDKs, apply exponential backoff with jitter by default.
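The two jitter variants can be sketched as small functions layered on a capped exponential backoff. This is an illustration under assumed parameters (1-second base, 30-second cap); the function names are hypothetical, and the rng parameter is there only so a seeded generator can be passed in for reproducible runs.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, max_delay: float = 30.0) -> float:
    """Capped exponential backoff for a 1-based attempt number."""
    return min(max_delay, base * 2 ** (attempt - 1))


def full_jitter(attempt: int, base: float = 1.0, max_delay: float = 30.0,
                rng=random) -> float:
    """Uniform random delay between 0 and the backoff value."""
    return rng.uniform(0.0, backoff_delay(attempt, base, max_delay))


def equal_jitter(attempt: int, base: float = 1.0, max_delay: float = 30.0,
                 rng=random) -> float:
    """Half the backoff as a fixed floor, plus a random share of the other half."""
    half = backoff_delay(attempt, base, max_delay) / 2
    return half + rng.uniform(0.0, half)
```

For the third attempt, where the plain backoff is 4s, full_jitter returns a value somewhere in the 0s to 4s range, while equal_jitter stays in the 2s to 4s range.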
Retry Budgets and Circuit Breakers
Backoff and jitter make individual clients well-behaved. Two more controls protect the system as a whole, and a third keeps the retries themselves correct:
- Retry budgets. Cap retries as a fraction of total requests — say, no more than 10% of traffic may be retries. When a dependency is broadly down, retrying is futile and harmful, so the budget cuts retries off rather than letting them flood. This prevents the system-wide amplification that backoff-per-client can't see.
- Circuit breakers. When a dependency is clearly down, stop sending requests (and therefore retries) entirely for a cool-down period. A circuit breaker is the natural partner to retries: retry the transient failures, trip the breaker on the sustained ones.
- Retry only what's safe. Retries assume the operation can be repeated without harm. Anything that changes state, such as a payment or an order, needs an idempotency key so that a retried request doesn't double-charge or double-process. Retrying non-idempotent operations is its own category of bug.
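A retry budget can be sketched as a pair of counters. This is a deliberately simplified, hypothetical RetryBudget class: it approximates "no more than 10% of traffic" as retries being at most 10% of first attempts, and it counts over the object's whole lifetime. A production version would more likely use a sliding time window or a token bucket so that budget saved up during healthy periods cannot be spent in one burst.

```python
class RetryBudget:
    """Allow retries only while they stay under a fraction of first attempts."""

    def __init__(self, ratio: float = 0.1):
        self.ratio = ratio      # assumed limit: retries <= ratio * requests
        self.requests = 0       # first attempts seen so far
        self.retries = 0        # retries granted so far

    def record_request(self) -> None:
        """Call once for every first attempt (not for retries)."""
        self.requests += 1

    def try_acquire_retry(self) -> bool:
        """Return True and count the retry if the budget allows one more."""
        if self.retries + 1 <= self.ratio * self.requests:
            self.retries += 1
            return True
        return False


# Usage: after 100 first attempts, only 10 of 50 requested retries are granted.
budget = RetryBudget(ratio=0.1)
for _ in range(100):
    budget.record_request()
granted = sum(budget.try_acquire_retry() for _ in range(50))
print(granted)
```

A caller that is denied a retry should surface the original error rather than wait, since the budget being exhausted suggests the dependency is broadly unhealthy.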
Together these form a layered defense: backoff and jitter at the client, budgets and breakers at the system, idempotency for correctness.
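The circuit breaker described above can be sketched as a small wrapper around calls to a dependency. This is a single-threaded illustration, not a production implementation: the class names, the threshold of 5 consecutive failures, and the 30-second cool-down are all assumptions, and the clock is injectable only so the behavior can be exercised without waiting.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the request is not sent."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0        # consecutive failures observed
        self.opened_at = None    # None means the breaker is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                raise CircuitOpenError("dependency marked down; request not sent")
            # Cool-down elapsed: fall through and allow a trial request.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # Trip on sustained failure, or re-trip if the trial request failed.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the breaker and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In a retry loop, a CircuitOpenError should end the loop immediately instead of being treated as one more retryable failure.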
What to Monitor
Retry behavior is often invisible until it causes an outage — so measure it directly:
- Retry rate. What fraction of requests are retries? A rising retry rate is an early warning that something downstream is degrading, often before error rates fully spike. Watch it alongside the four golden signals (latency, traffic, errors, and saturation).
- Retry-driven traffic amplification. Compare inbound request volume to useful work done. A growing gap means retries are inflating load.
- Error and latency on dependencies — the 5xx and timeout rates that trigger retries in the first place.
- Circuit breaker trips, which tell you when retries are (correctly) being suppressed.
A sudden spike in retry rate with flat successful throughput is the signature of a brewing retry storm — catch it there and you can intervene before it becomes self-sustaining.
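As a rough sketch of how these signals could be computed from per-window counters, the following compares two consecutive measurement windows. The WindowStats shape, the function names, and every threshold (a doubling of retry rate, a 5% minimum rate, a 10% throughput tolerance) are assumptions chosen for illustration and would need tuning against real traffic.

```python
import math
from dataclasses import dataclass


@dataclass
class WindowStats:
    first_attempts: int   # original requests sent in the window
    retries: int          # retry requests sent in the window
    successes: int        # requests that completed successfully


def retry_rate(s: WindowStats) -> float:
    """Fraction of all requests sent that were retries."""
    total = s.first_attempts + s.retries
    return s.retries / total if total else 0.0


def amplification(s: WindowStats) -> float:
    """Requests sent per successful result; 1.0 means no wasted load."""
    total = s.first_attempts + s.retries
    return total / s.successes if s.successes else math.inf


def looks_like_storm(prev: WindowStats, curr: WindowStats, rate_jump: float = 2.0,
                     min_rate: float = 0.05, throughput_tolerance: float = 0.1) -> bool:
    """Retry rate spiked while successful throughput stayed flat or fell."""
    rate = retry_rate(curr)
    spiked = rate >= min_rate and rate >= rate_jump * retry_rate(prev)
    flat = curr.successes <= prev.successes * (1 + throughput_tolerance)
    return spiked and flat
```

A window whose retry count jumps from 20 to 300 while successes stay near 990 would be flagged, whereas traffic that simply doubles with the same retry rate would not.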
How Webalert Helps
Retry storms are driven by failing dependencies — and the sooner you know a dependency is degrading, the sooner you can act before retries amplify it:
- Outside-in checks on your endpoints and the third-party services you depend on, catching the failures that trigger retry waves.
- Latency and error-rate tracking that surfaces the early degradation (rising timeouts and 5xxs) that precedes a storm.
- Multi-region monitoring to tell a genuine dependency outage apart from a localized network blip, so you retry the transient and escalate the real.
- Fast, deduplicated alerts that tell you about sustained failure without adding their own noise.
Webalert won't configure your backoff for you — but it gives you the early signal that lets backoff, jitter, and breakers do their job before a blip becomes a storm.
Summary
Retries are essential for surviving transient failures and catastrophic during sustained ones, because they add load exactly when a system can least handle it. Naive immediate or fixed-interval retries create self-sustaining retry storms, amplified by synchronization and by retries stacking across service tiers.
The fix is a layered one: exponential backoff to reduce how often each client retries, jitter to keep clients from retrying in synchronized waves, retry budgets to cap retry load system-wide, and circuit breakers to stop retrying a dependency that's clearly down — with idempotency ensuring the retries that do happen are safe. Use all of them together, monitor your retry rate as a leading indicator, and retries go back to being what they should be: a quiet safety net, not the cause of your next outage.