5xx Server Errors Explained: 500, 502, 503, 504 Fix Guide

A 5xx status code means the server knows something went wrong on its side and is owning it. Unlike a 4xx error - which says "you (the client) did something wrong" - a 5xx is your problem to fix. And the specific code carries an enormous amount of information about which layer broke.

This is the per-code diagnostic guide. For each of the major 5xx codes - 500, 502, 503, 504, plus Cloudflare's 520-526 and a few rarer cases - it covers what the code actually means, the most common real-world causes, and how to confirm and fix it fast.

For the broader monitoring policy (alert thresholds, error budgets), see 5xx Error Rate Monitoring. For all status codes including 1xx, 2xx, 3xx, 4xx, see HTTP Status Codes Explained.

The 5xx Family At A Glance

Code	Name	Layer that failed	Owner
500	Internal Server Error	Application	App team
501	Not Implemented	Application / framework	App team
502	Bad Gateway	Proxy ↔ upstream	Platform / infra
503	Service Unavailable	Application / load balancer	App / platform
504	Gateway Timeout	Proxy waiting on upstream	Platform / app
505	HTTP Version Not Supported	Server config	Platform
507	Insufficient Storage	Disk / object storage	Platform
508	Loop Detected	App routing	App team
511	Network Authentication Required	Captive portal	Network
520-526	Cloudflare-specific	Cloudflare ↔ origin	Origin / Cloudflare

The most useful diagnostic step is always: which layer returned the 5xx? A 502 from the load balancer means something very different from a 502 from the WAF in front of it.

500 Internal Server Error

The server encountered an unexpected condition that prevented it from fulfilling the request.

A 500 means the application crashed or threw an uncaught exception while handling the request. The framework caught it and turned it into a generic 500. It is the most common 5xx in poorly observed systems because everything that is not specifically handled bubbles up as 500.

Common causes

Uncaught exception in a route handler (null deref, divide by zero, missing key).
Database query exception (connection lost, deadlock, broken constraint).
Misconfiguration loaded at request time (missing env var, bad credentials).
Out-of-memory kill (OOM) on the worker, restart in progress.
A deploy that included a startup bug only visible under traffic.

Diagnose

curl -sSI -L https://example.com/path-that-fails | head -10
curl -sSL https://example.com/path-that-fails | head -40

Then check:

Application logs around the request id / trace id.
Error tracker (Sentry, Rollbar, Honeybadger, etc).
Recent deploy timeline. See CI/CD Pipeline Monitoring.
Database error log for connection pool exhaustion or slow query timeouts. See Database Monitoring.

Fix

Roll back the recent deploy if the spike correlates.
Add the missing null check / error handler.
Increase connection pool size or DB instance size if it is saturated.
Add a real exception boundary so the error is logged with context, not just turned into 500.
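An exception boundary of the kind described in the last item could look roughly like the sketch below. It is framework-agnostic and deliberately minimal: `exception_boundary`, the request dict, and the `request_id` field are hypothetical names chosen for illustration, not any particular framework's API.

```python
import logging
import uuid

logger = logging.getLogger("app.errors")


def exception_boundary(handler):
    """Wrap a request handler so failures are logged with context.

    `handler` is a hypothetical callable that takes a request dict and
    returns (status, body). All names here are illustrative assumptions.
    """
    def wrapped(request):
        request_id = request.get("request_id") or uuid.uuid4().hex
        try:
            return handler(request)
        except Exception:
            # logger.exception records the stack trace; the extra fields
            # make the log line searchable by request id and route.
            logger.exception(
                "unhandled error request_id=%s method=%s path=%s",
                request_id, request.get("method"), request.get("path"),
            )
            # Echo the id so a user report can be matched to a log line.
            return 500, {"error": "internal_error", "request_id": request_id}
    return wrapped


@exception_boundary
def checkout(request):
    # Simulated bug: a missing "cart" key raises KeyError.
    return 200, {"total": request["cart"]["total"]}
```

The client still gets a 500, but the log now carries the request id, method, and path next to the stack trace.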

A 500 is the application's problem. The proxy is just the messenger.

501 Not Implemented

The server does not support the functionality required to fulfill the request.

Rarer than 500. Two real-world causes:

The HTTP method is not supported (e.g. PATCH on a server that only knows GET/POST/PUT/DELETE).
The framework returned this for an unimplemented endpoint or feature flag off in production.

If users hit a 501 on a request your client sent, your client and your server disagree about the API contract. Audit and align.

502 Bad Gateway

The server, while acting as a gateway or proxy, received an invalid response from the upstream server.

This is the proxy's verdict on the upstream. NGINX, an AWS ALB, a Kubernetes ingress, Cloudflare, or any reverse proxy returns 502 when the thing it tried to talk to gave it something it could not parse.

Common causes

Upstream service is down or restarting (no listening process on the port).
Upstream process crashed mid-response.
Upstream timed out (sometimes also returns 504, depending on the proxy).
Upstream returned malformed HTTP (missing status line, bad chunked transfer).
TLS handshake to the upstream failed (cert mismatch, expired chain).
Connection pool exhausted between proxy and upstream.

Diagnose

# Confirm the proxy is reachable
curl -sSI https://example.com

# Compare with hitting the origin directly (if accessible)
curl -sSI https://origin.example.com

Then:

Check the upstream process is running and listening. See Port Monitoring.
Check upstream logs for crashes around the time of the 502.
Check container restarts and OOM kills. See Docker Container Monitoring and Kubernetes Monitoring.
Check NGINX error.log for upstream prematurely closed connection or recv() failed.

Fix

Restart or scale the upstream service.
Increase health-check sensitivity so the proxy stops sending traffic to a dying pod sooner.
Add a circuit breaker so the proxy gives a graceful 503 rather than a confusing 502.
Tighten the deploy strategy - 502s during deploy often mean traffic was sent to a not-yet-ready pod. See Health Check Endpoint Design.
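The circuit-breaker idea can be illustrated with a small sketch. This is one simple variant (consecutive-failure counting with a fixed cool-down), not how any specific proxy implements it; `CircuitBreaker`, `proxy_request`, and `call_upstream` are hypothetical names.

```python
import time


class CircuitBreaker:
    """Open after `max_failures` consecutive failures; retry after `reset_after` seconds."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: let a trial request through once the cool-down has passed.
        return self.clock() - self.opened_at >= self.reset_after

    def record(self, ok):
        if ok:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()


def proxy_request(breaker, call_upstream):
    """Return (status, headers). `call_upstream` is a hypothetical callable
    that returns an HTTP status or raises ConnectionError."""
    if not breaker.allow():
        # Deliberate refusal: 503 with a hint about when to retry.
        return 503, {"Retry-After": str(int(breaker.reset_after))}
    try:
        status = call_upstream()
    except ConnectionError:
        breaker.record(ok=False)
        return 502, {}
    breaker.record(ok=status < 500)
    return status, {}
```

While the breaker is open, clients see a 503 with Retry-After instead of a stream of 502s, and the upstream gets a pause in which to recover.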

503 Service Unavailable

The server is currently unable to handle the request due to a temporary overload or scheduled maintenance.

503 is the "we are intentionally not serving this right now" code. It says the proxy or the app deliberately refused.

Common causes

App is in maintenance mode.
Rate limiter or queue depth circuit breaker tripped.
Auto-scaler has no healthy pods.
Load balancer has zero healthy targets.
Application returned 503 because a critical dependency (DB, cache, queue) is down.

Diagnose

Check the Retry-After header - a well-behaved 503 includes it.
Check if a maintenance window is intentional. See Scheduled Maintenance Windows.
Check load balancer healthy-target count.
Check downstream dependencies (DB, cache, queue) for outages.
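Retry-After can carry either a number of seconds or an HTTP date, so a client or monitor has to handle both. A minimal parsing sketch, using only the standard library (`retry_after_seconds` is a hypothetical helper name):

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None):
    """Parse a Retry-After header value into seconds to wait, or None."""
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        return int(value)  # delay-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None  # unparseable: treat as absent
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0, int((when - now).total_seconds()))
```

A 503 whose Retry-After is missing or unparseable (the function returns None) is worth flagging, because clients then have no guidance on when to come back.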

Fix

If maintenance is intentional but unannounced, document it on the status page.
If unintentional: roll back, scale up, or restart the dependency that pushed the app into an "unhealthy" state.
Add load shedding rather than crashing - returning 503 for excess load is healthier than 502 from crashed pods.

503 is often a better outcome than 500 or 502 - it means you saw the load and chose to shed rather than crash. But it has to come with an explanation on the status page or in your runbook.
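Load shedding can be as simple as a cap on concurrent requests. The sketch below is one possible shape, assuming a threaded server; `LoadShedder` and its parameters are hypothetical, and real systems often shed on queue depth or latency instead.

```python
import threading


class LoadShedder:
    """Reject work beyond `max_in_flight` concurrent requests."""

    def __init__(self, max_in_flight, retry_after=5):
        self.max_in_flight = max_in_flight
        self.retry_after = retry_after
        self._in_flight = 0
        self._lock = threading.Lock()

    def handle(self, handler):
        """Run `handler` (a hypothetical zero-argument callable returning a
        status) unless the server is already at capacity."""
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                # Shed: refuse quickly and tell the client when to retry.
                return 503, {"Retry-After": str(self.retry_after)}
            self._in_flight += 1
        try:
            return handler(), {}
        finally:
            with self._lock:
                self._in_flight -= 1
```

The lock is held only while the counter changes, never while the handler runs, so shedding stays cheap even when the handlers themselves are slow.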

504 Gateway Timeout

The proxy did not receive a timely response from the upstream server.

504 is the timeout-flavoured cousin of 502. The proxy reached the upstream, but the upstream took too long to respond.

Common causes

Slow database query blocking the worker.
Synchronous external API call exceeding the proxy timeout.
Long-running migration or background job in the request path.
Network partition between proxy and upstream.
An auth provider or third-party identity service is slow. See Auth Provider Monitoring.
An LLM or AI API call exceeded the timeout. See AI/LLM API Monitoring.

Diagnose

Look at p95/p99 latency for the affected route.
Trace a slow request end-to-end with distributed tracing - see OpenTelemetry Monitoring.
Check database slow-query log.
Check synchronous third-party calls in the request path.

Fix

Move slow operations behind a job queue. See Job Queue Monitoring.
Add timeouts at every external call, shorter than the proxy timeout, so you fail fast and gracefully.
Add database query indexes; tune connection pool.
Increase proxy timeout only as a last resort - long timeouts hide problems and tie up worker capacity.
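To make "shorter than the proxy timeout" concrete, the sketch below splits a proxy timeout across the sequential outbound calls in a request. The weighting scheme, the `margin` reserve, and the call names are assumptions for illustration, not a standard formula.

```python
def call_timeouts(proxy_timeout, calls, margin=0.2):
    """Split a proxy timeout across sequential outbound calls.

    Reserves `margin` (a fraction of the proxy timeout) for the app's own
    work, then divides the rest in proportion to each call's weight.
    `calls` maps a hypothetical call name to its relative weight.
    """
    if not 0 <= margin < 1:
        raise ValueError("margin must be in [0, 1)")
    budget = proxy_timeout * (1 - margin)
    total_weight = sum(calls.values())
    return {name: budget * weight / total_weight for name, weight in calls.items()}
```

With a 60-second proxy timeout and a 25% margin, a database call weighted 1 and a payments call weighted 3 get 11.25 and 33.75 seconds. Even if both run to their limits, the app can still return its own error before the proxy gives up with a 504.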

505 HTTP Version Not Supported

The server does not support the HTTP protocol version used in the request.

Almost always a misconfiguration: a client trying HTTP/2 against a server that only speaks HTTP/1.1, or an old HTTP/0.9 request being rejected. Modern servers and clients negotiate version cleanly, so this is rare. Verify with:

curl --http1.1 -sSI https://example.com
curl --http2 -sSI https://example.com

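If one works and the other does not, the failing version is the unsupported one. Check the protocol version on each response's status line, because curl --http2 falls back to HTTP/1.1 when the server does not offer HTTP/2 during TLS negotiation.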

507 Insufficient Storage

The server is unable to store the representation needed to complete the request.

Originates from WebDAV but increasingly seen in modern APIs to indicate disk-full / object-store-full conditions. Common when:

A file-upload service ran out of disk on the worker.
An object storage bucket hit a quota or billing limit.
A database disk is full and writes are failing.

If you see this in production, treat it as a hard outage of the write path until storage is freed or expanded.

508 Loop Detected

The server detected an infinite loop while processing the request.

Almost always a misconfigured redirect chain or recursive include. Use:

curl -sSI -L --max-redirs 10 https://example.com

And see Redirect Chain Monitoring. The fix is to break the loop in the redirect rules.
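If the redirect rules are available as data, a loop can also be found offline. A minimal sketch, assuming the rules have been flattened into a dict of source path to target path (the function name and the rule format are hypothetical):

```python
def find_redirect_loop(start, redirects, max_hops=20):
    """Follow `redirects` (a dict of url -> Location target) from `start`.

    Returns the list of URLs forming the loop, or None if the chain ends
    (or if `max_hops` is reached without revisiting a URL).
    """
    seen = []
    url = start
    while url in redirects and len(seen) < max_hops:
        if url in seen:
            # Slice from the first visit so only the cycle is reported.
            return seen[seen.index(url):] + [url]
        seen.append(url)
        url = redirects[url]
    return None
```

For rules such as /a → /b, /b → /c, /c → /b, starting at /a reports the cycle /b, /c, /b, which points directly at the rule pair to break.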

511 Network Authentication Required

The client needs to authenticate to gain network access.

This is the captive-portal code (hotel Wi-Fi, conference network). If a user sees this for your site, they have a network problem, not a site problem.

Cloudflare-Specific 5xx (520-526)

Cloudflare returns its own 5xx codes when it cannot reach or interpret your origin. These are extremely valuable diagnostically.

520 Web Server Returned an Unknown Error

The origin returned an empty, unknown, or unexpected response. Common causes:

Origin process crashed mid-response.
Origin returned non-HTTP data on port 80/443.
Connection was reset mid-response.

Look at origin logs around the time of the 520. Often correlates with OOM kills.

521 Web Server Is Down

Cloudflare could not establish a TCP connection to the origin. Causes:

Origin process is not running.
Firewall is blocking Cloudflare's IP ranges (very common after a security rule change).
Origin is overloaded and refusing connections.

Verify Cloudflare's IP ranges are allow-listed in your origin firewall and security group.
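That verification can be scripted. The sketch below compares a list of required CIDR ranges against a firewall allow-list. It takes both lists as input, since the current Cloudflare ranges should come from Cloudflare's published list rather than be hard-coded; `uncovered_ranges` is a hypothetical helper name.

```python
import ipaddress


def uncovered_ranges(required, allowed):
    """Return the required CIDR ranges not fully covered by the allow-list."""
    allowed_nets = [ipaddress.ip_network(a) for a in allowed]
    missing = []
    for cidr in required:
        net = ipaddress.ip_network(cidr)
        # subnet_of raises TypeError across IP versions, so compare versions first.
        covered = any(
            net.version == a.version and net.subnet_of(a) for a in allowed_nets
        )
        if not covered:
            missing.append(cidr)
    return missing
```

Any range it returns is a candidate explanation for a 521 that appeared right after a firewall or security-group change.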

522 Connection Timed Out

Cloudflare's TCP connection to the origin timed out: the origin did not complete the handshake in time, or stopped acknowledging packets after connecting. Usually:

Origin is overloaded.
Network path is saturated.
Application is hanging in the request handler.

This is a TCP-level timeout. The closest Cloudflare equivalent of a 504 is 524, where the connection succeeds but the HTTP response arrives too late.

523 Origin Is Unreachable

Routing problem. Cloudflare cannot find a network path to the origin IP. Often DNS or BGP-level:

Origin DNS record points at a non-routable IP.
Origin server was moved without updating DNS.

524 A Timeout Occurred

Cloudflare connected to origin, request was sent, but the origin took longer than 100 seconds to respond. Either:

The endpoint is genuinely slow (move to a queue).
A streaming or long-poll endpoint is hitting the Cloudflare timeout - use a different endpoint or move behind a WebSocket. See WebSocket Monitoring.

525 SSL Handshake Failed

TLS handshake between Cloudflare and the origin failed. Causes:

Origin has no certificate installed, or is not listening for TLS on port 443.
Origin does not support SNI or a TLS version that Cloudflare offers.
Cipher suite mismatch.

See TLS Configuration Monitoring.

526 Invalid SSL Certificate

Origin presented an invalid certificate (wrong host, expired chain, untrusted CA) and "Full (strict)" mode is enforcing validation.
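Expiry is the one certificate failure that can be predicted well in advance. A minimal sketch, assuming the certificate's notAfter string has already been obtained in the format Python's `ssl` module reports (for example via `getpeercert()["notAfter"]`); `days_until_expiry` is a hypothetical helper name.

```python
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter timestamp.

    `not_after` uses the "Jan  5 09:34:43 2018 GMT" style format.
    `now` is seconds since the epoch; it defaults to the current time.
    """
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires - now) / 86400
```

A negative result means the certificate has already expired. Alerting at some threshold such as 14 or 30 days is a policy choice rather than anything the function decides.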

For broader Cloudflare-related incidents - origin outages, edge errors, propagation lag - see Cloudflare Monitoring.

Differentiating 502 vs 503 vs 504

These three look similar at a glance and are constantly confused. The crisp distinctions:

502 = "I (the proxy) got something garbled or nothing from the upstream."
503 = "I am refusing to serve right now, on purpose or because nothing healthy exists."
504 = "I waited too long for the upstream to answer."

If you cannot tell which is firing in your stack, look at the proxy logs - NGINX, Envoy, and ALB access logs all distinguish between them.
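The three distinctions can be written down as a small decision function. This is a simplified model of the reasoning, not the logic of any particular proxy, and the parameter names are hypothetical; real proxies differ in edge cases such as connect timeouts.

```python
def proxy_status(healthy_targets, connected, timed_out, valid_response):
    """Pick the status a proxy would plausibly return for one upstream attempt."""
    if healthy_targets == 0:
        return 503  # nothing healthy to route to
    if timed_out:
        return 504  # upstream did not answer in time
    if not connected or not valid_response:
        return 502  # refused, reset, or unparseable response
    return None  # upstream answered; pass its response through
```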

How To Tell Where The 5xx Comes From

Production stacks layer many proxies and you need to know which layer returned the error:

Client → CDN/WAF → Load Balancer → Reverse Proxy → App

Add a unique server identifier header at each layer:

Cloudflare adds cf-ray automatically.
ALB adds x-amzn-trace-id.
NGINX should set x-served-by: nginx-<pod>.
Your app should return a custom x-app-version header.

Then:

curl -sSI https://example.com | grep -iE 'cf-ray|x-amzn-trace|x-served-by|x-app-version'

If you see cf-ray but no app headers - Cloudflare returned the 5xx, never reached the app. If you see app headers - your application generated the 5xx.

This single trick saves hours of "is it Cloudflare or us?" arguments during incidents.
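The same attribution can be automated once response headers are being collected. The sketch below assumes the four header names listed above and checks them deepest-first; `attribute_layer` and the layer labels are hypothetical, and the result is a best guess, since a layer may pass through or strip headers from another.

```python
def attribute_layer(status, headers):
    """Guess the deepest layer that handled a 5xx response, from its headers."""
    if not 500 <= status <= 599:
        return None
    present = {name.lower() for name in headers}
    # Ordered deepest-first: the app's header implies the outer layers were traversed.
    for header, layer in [
        ("x-app-version", "app"),
        ("x-served-by", "reverse proxy"),
        ("x-amzn-trace-id", "load balancer"),
        ("cf-ray", "cdn"),
    ]:
        if header in present:
            return layer
    return "unknown"
```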

Per-Code Cheat Sheet

Code	First check	Most common cause	Owner
500	App logs / error tracker	Uncaught exception, recent deploy	App
501	Method support	Wrong HTTP verb / unimplemented	App
502	Upstream health	Crashed worker, OOM, malformed response	Platform
503	LB target count, maintenance flag	No healthy targets / intentional	Platform / app
504	Upstream latency	Slow query, slow third-party call	App / platform
507	Disk / object store	Storage full	Platform
508	Redirect chain	Redirect loop	App
520	Origin logs	Origin returned garbage / crashed	Origin
521	Origin process + firewall	Origin down / blocking CF	Origin
522	Origin latency	Origin overloaded	Origin
523	Origin DNS / route	DNS pointing at unreachable IP	Origin
524	Endpoint duration	Origin took > 100s	Origin
525	Origin TLS	Handshake failure (no cert, cipher mismatch)	Origin
526	Origin cert validity	Cert wrong host / chain broken	Origin

Monitoring 5xx Errors The Right Way

A few signals are worth tracking continuously, not just looking at after the fact:

5xx rate per endpoint - not site-wide. A 5xx storm on /checkout and a quiet homepage look identical in a site-wide chart.
5xx rate per status code - 500 spikes mean app crashes; 502/504 spikes mean infra issues. Treat them differently.
5xx rate per region - regional spikes mean ISP / CDN issues, not app bugs.
5xx by layer - using the headers above, attribute each 5xx to the layer that returned it.
p95 latency on the same route - 504s are usually preceded by climbing latency.
External multi-region checks - so you know when 5xx hits real customers, not just internal monitoring.
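The first two signals are easy to compute from access-log records. A minimal sketch, assuming the logs have already been parsed into (path, status) pairs; `error_rates` is a hypothetical helper name and the record format is an assumption.

```python
from collections import Counter, defaultdict


def error_rates(records):
    """records: iterable of (path, status) pairs.

    Returns {path: (5xx_rate, {status: count})} so each endpoint can be
    judged on its own rate and its own mix of codes.
    """
    totals = Counter()
    errors = defaultdict(Counter)
    for path, status in records:
        totals[path] += 1
        if 500 <= status <= 599:
            errors[path][status] += 1
    return {
        path: (sum(errors[path].values()) / total, dict(errors[path]))
        for path, total in totals.items()
    }
```

With eight healthy homepage requests and three failures out of four on /checkout, the site-wide rate is 25% while the /checkout rate is 75%, which is the distinction a site-wide chart hides.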

For the alerting policy (thresholds, deduplication, paging rules), see 5xx Error Rate Monitoring and Alert Fatigue.

When 5xx Is Actually a 200 Lie

The opposite trap also happens: the app catches every exception, logs it, and returns 200 with an empty body or a friendly error page. The 5xx rate looks healthy because there are no 5xx codes - but real users are seeing a broken site.

Defend against this with content assertions on every monitored URL. See Response Body Validation Monitoring.
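A content assertion can be very small. The sketch below shows the idea on an already-fetched response; `check_response` and its arguments are hypothetical, and the expected text would be whatever marks the page as genuinely working.

```python
def check_response(status, body, must_contain):
    """Return (ok, reason) for a monitored response."""
    if status != 200:
        return False, f"unexpected status {status}"
    if not body.strip():
        return False, "empty body"
    if must_contain not in body:
        # A 200 that lacks the expected text is treated as a failure.
        return False, f"missing expected text {must_contain!r}"
    return True, "ok"
```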

5xx Diagnosis Checklist

Captured the exact status code (500 ≠ 502 ≠ 504)
Identified the layer that returned it (Cloudflare, LB, proxy, app)
Correlated with recent deploys
Checked app logs / error tracker for stack traces
Checked upstream health (process running, ports listening)
Checked database / dependency status
Checked TLS certificate validity end-to-end
Checked CDN / WAF rules and IP allow-lists
Checked region distribution of failures
Documented the trigger and remediation in the runbook
Added or improved an alert so it does not surprise you next time

For the post-incident write-up, see Incident Post-Mortem Template.

How Webalert Helps

Webalert is built to catch 5xx errors from the outside, with enough detail to start the diagnosis:

External multi-region checks - 500/502/503/504 are recorded with the region that saw them, so you can tell global outages from regional ones.
Per-status code alerting - Separate notification rules for 5xx as a class, and for specific codes like 502 or 504 when you want them.
Content validation - Catch the "200 with a broken page" version of 5xx, where the app pretended everything was fine. See Response Body Validation Monitoring.
Latency alerts - Climbing latency typically precedes 504s; alert before the timeout fires.
TLS expiry warnings - Catch the cause behind 525 / 526 weeks before it happens.
Public status page - Customers see the incident as your monitor sees it, in real time.
Multi-channel alerts - Slack, Discord, Microsoft Teams, SMS.

Example Webalert check tuned for 5xx detection:

URL: https://example.com/checkout
Method: GET
Regions: US, EU, APAC
Frequency: every 60 seconds
Pass condition: HTTP 200 + body contains Continue to payment + response time under 2000ms
Alert: page on first 5xx in any region; SMS if 5xx for 3 consecutive checks
Tag: business-critical
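The escalation rule in that example (page on the first failure, SMS after three consecutive failures) can be expressed as a small state machine. The sketch below only illustrates the rule's logic; it is not Webalert's implementation, and `evaluate` and the action labels are hypothetical.

```python
def evaluate(results, sms_after=3):
    """results: per-check booleans (True = passed), oldest first.

    Returns one action per check: "page" on the first failure of a streak,
    "sms" when the streak reaches `sms_after`, otherwise None.
    """
    actions = []
    streak = 0
    for passed in results:
        streak = 0 if passed else streak + 1
        if streak == 1:
            actions.append("page")
        elif streak == sms_after:
            actions.append("sms")
        else:
            actions.append(None)
    return actions
```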

Summary

5xx codes are not interchangeable. 500 is the app crashing; 502 is the proxy failing to talk to the app; 503 is the app or proxy intentionally refusing; 504 is the proxy waiting too long. Cloudflare 520-526 narrow the problem to the edge ↔ origin link.

Knowing which code, from which layer, on which endpoint, in which region, is the difference between "the website is broken" and "the auth dependency on /checkout is timing out in EU-West." The faster you can say the second sentence, the shorter the incident.

