
The first thing every team writes when they ship to production is a health check endpoint. Usually one line:
@app.get("/health")
def health():
    return {"status": "ok"}
It returns 200. Your uptime monitor is happy. Kubernetes is happy. Then a year later you have an outage and discover your health check has been returning 200 the whole time — even though the database has been disconnected for ten minutes, your queue is unreachable, and your downstream payment provider has been down for an hour.
Health checks are deceptively easy to write and incredibly easy to write badly. The bad version is worse than no health check at all because it gives you false confidence: "the monitor says we're up, so we must be up."
This guide covers what a real health check should do, the difference between liveness, readiness, and startup probes, when to do shallow checks versus deep checks, and the common mistakes that turn health endpoints into dashboard decoration rather than actual monitoring signal.
What a Health Check Endpoint Is For
Health checks have three distinct audiences, and they want different answers:
- Uptime monitors (Webalert, Pingdom, etc.) want to know: "Is your service reachable from the public internet, can it accept HTTP requests, and is it functionally able to serve traffic?"
- Container orchestrators (Kubernetes, ECS, Nomad) want to know: "Is this specific instance alive? Should I send it traffic? Should I restart it?"
- Other services in your system want to know: "Should I depend on you right now? Should I retry, fail over, or back off?"
These three needs aren't the same. A single /health endpoint that tries to answer all three usually answers none of them well. The better approach is a small set of purpose-built endpoints, each answering one question clearly.
Liveness, Readiness, and Startup Probes
If you only remember three terms from this guide, make them these:
Liveness (/livez)
Question: Is the process alive and not deadlocked?
Behavior: Should be cheap, fast, and almost always return 200. It should only fail when the process is in an unrecoverable state — deadlocked, completely unresponsive, internally corrupted. The expected response to a liveness failure is "kill and restart this process."
Should not check: Database connectivity, external dependencies, or anything that depends on something outside this process. If your database is down, your liveness check should still pass — restarting the pod won't fix the database.
Implementation: Often as simple as returning 200 with no body. Some teams add a single internal check (e.g., "can the request handler thread pool accept work?").
Readiness (/readyz)
Question: Should this instance receive traffic right now?
Behavior: Returns 200 when the instance is fully ready to serve requests, 503 when it's not. The orchestrator removes 503-returning instances from the load balancer until they recover.
Should check:
- Application is fully booted (database connection pool initialized, caches warmed if applicable)
- Critical dependencies needed for most requests are reachable
- Recent dependency failures aren't above a threshold
Should not check: Optional dependencies, heavy operations, or anything that takes more than ~1 second.
Startup (/startupz)
Question: Has this instance finished its initial boot sequence?
Behavior: Used by Kubernetes to delay liveness and readiness probes until the application has finished slow startup tasks (loading large models, warming caches, running migrations). Once startup succeeds, it's ignored — only liveness and readiness apply.
Implementation: Returns 503 during boot, 200 once initialization is complete. Most apps don't need this; it's specifically for slow-starting workloads.
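To make the split concrete, here is a minimal FastAPI sketch of all three probes. The app_state flags are hypothetical placeholders that your own boot and shutdown code would set; the point is that /livez touches nothing outside the process, /readyz reflects whether this instance should receive traffic, and /startupz only reports whether boot has finished.

from fastapi import FastAPI, Response

app = FastAPI()

# Hypothetical in-process flags, set by your own startup/shutdown code.
app_state = {"booted": False, "ready": False}


@app.get("/livez")
def livez():
    # Liveness: no external dependencies. If this handler runs, the process
    # is alive; a deadlocked or wedged process simply won't answer.
    return Response(status_code=200)


@app.get("/readyz")
def readyz():
    # Readiness: 200 only when this instance should receive traffic.
    if app_state["ready"]:
        return Response(status_code=200)
    return Response(status_code=503)


@app.get("/startupz")
def startupz():
    # Startup: 503 until slow boot work (migrations, cache warming) is done.
    if app_state["booted"]:
        return Response(status_code=200)
    return Response(status_code=503)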
What /health (the public one) Should Be
In addition to the orchestrator probes above, you typically want a public /health endpoint that uptime monitors can hit. This one's purpose is different: it answers the question "is this service usefully available right now?"
A good public /health endpoint:
- Returns 200 when the service can serve real traffic — not just "the process is alive"
- Returns 503 (Service Unavailable) when it cannot — never 500, which implies a bug
- Includes a small JSON body indicating which dependencies were checked and their state, useful for debugging
- Responds in under a second in the common case
- Doesn't require authentication (so external monitors can hit it)
- Doesn't leak sensitive details — never expose dependency hostnames, credentials, version SHAs in detail, or internal IPs
A reasonable shape:
{
  "status": "ok",
  "version": "1.42.7",
  "checks": {
    "database": "ok",
    "cache": "ok",
    "queue": "ok"
  }
}
When something is failing:
{
  "status": "degraded",
  "version": "1.42.7",
  "checks": {
    "database": "ok",
    "cache": "fail",
    "queue": "ok"
  }
}
Return HTTP 200 for "ok", 503 for "fail", and your choice for "degraded" (200 if the service can still mostly function, 503 if the failed dependency makes most requests fail).
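A sketch of how that mapping can look in code, continuing the FastAPI app from the probe example. The individual check functions are placeholders, and the decision that only the database and queue are critical is an assumption for illustration; your own split will differ.

from fastapi.responses import JSONResponse

# Placeholder checks; real versions would ping each dependency with a short
# timeout (see the sections below) and return "ok", "degraded", or "fail".
async def check_database() -> str:
    return "ok"


async def check_cache() -> str:
    return "ok"


async def check_queue() -> str:
    return "ok"


# Dependencies whose failure makes the whole service unusable (assumed here).
CRITICAL = {"database", "queue"}


@app.get("/health")
async def health():
    checks = {
        "database": await check_database(),
        "cache": await check_cache(),
        "queue": await check_queue(),
    }
    failed = {name for name, state in checks.items() if state != "ok"}

    if not failed:
        status, code = "ok", 200
    elif failed & CRITICAL:
        status, code = "fail", 503      # a critical dependency is down
    else:
        status, code = "degraded", 200  # still able to serve most traffic

    return JSONResponse(
        status_code=code,
        content={"status": status, "version": "1.42.7", "checks": checks},
    )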
Shallow vs Deep Health Checks
The most contentious health-check question: how thoroughly should you check dependencies?
Shallow checks
Just confirm the process is alive and accepting requests. Don't touch the database, queue, or any external dependency. Some teams call this "I'm here."
Pros: Fast, no false negatives from transient dependency blips, no risk of cascading failures.
Cons: A service whose database has been down for an hour still passes its shallow check. You won't know there's a problem from the health endpoint alone — you have to monitor each dependency separately.
Deep checks
Verify every critical dependency on each request: ping the database, ping the cache, ping any downstream service.
Pros: A single check tells you whether the service can actually do its job.
Cons:
- Slow — each dependency adds latency
- Cascading failures — if the cache flaps, every service that checks the cache flaps in lockstep
- Can take down the whole fleet — a slow dependency check can cause every readiness probe to time out, removing every instance from the load balancer
- Often shows incidents you don't actually need to alert on at the service level
The right balance
For most production services:
- Liveness: shallow only. Don't touch anything outside the process.
- Readiness: shallow or very lightweight checks. A query like SELECT 1 is fine; a complex aggregation isn't.
- Public /health: lightweight checks of critical dependencies (the ones whose failure means your service can't usefully serve any traffic). Cache the result for 5–10 seconds so a flood of monitoring traffic doesn't hammer your dependencies.
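One way to get that 5–10 second caching: wrap the dependency checks in a small time-based cache so a burst of monitor traffic reuses a recent result. A minimal sketch, reusing the placeholder check functions from the /health example above; the 10-second TTL is an assumption you'd tune.

import time

_cache = {"expires": 0.0, "result": None}
CACHE_TTL_SECONDS = 10


async def cached_dependency_checks() -> dict:
    # Reuse the last result while it is still fresh, so a flood of monitor
    # requests doesn't turn into a flood of database and cache pings.
    now = time.monotonic()
    if _cache["result"] is not None and now < _cache["expires"]:
        return _cache["result"]

    result = {
        "database": await check_database(),
        "cache": await check_cache(),
        "queue": await check_queue(),
    }
    _cache["result"] = result
    _cache["expires"] = now + CACHE_TTL_SECONDS
    return result

The /health handler would call cached_dependency_checks() instead of hitting each dependency directly on every request.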
A common refinement: categorize dependencies as critical vs degraded-mode. The database is usually critical. The recommendation engine is usually degraded-mode. If the recommendation engine is down, your service is degraded but not down — return "degraded", not "fail".
What to Check (and How)
Database
Use a fast query that exercises the connection but doesn't touch tables:
- Postgres / MySQL: SELECT 1
- MongoDB: db.runCommand({ ping: 1 })
- Redis: PING
Time-bound the query (e.g., 500ms timeout). A slow database ping is itself a signal — but a deep check shouldn't hang for 30 seconds.
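Filling in the database placeholder from the earlier sketch, here is a time-bounded Postgres ping. It assumes an asyncpg connection pool purely for illustration; the 500 ms budget mirrors the suggestion above.

import asyncio

import asyncpg  # assumed driver for this example


async def check_database(pool: asyncpg.Pool) -> str:
    try:
        # SELECT 1 exercises the connection without touching any tables;
        # wait_for caps the whole check at 500 ms so it can never hang.
        await asyncio.wait_for(pool.fetchval("SELECT 1"), timeout=0.5)
        return "ok"
    except (asyncio.TimeoutError, OSError, asyncpg.PostgresError):
        return "fail"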
For replicated databases, decide what "healthy" means:
- Reads from any replica? Just check one.
- Writes to primary? Check primary specifically.
- Replica lag matters? Check it but threshold loosely (e.g., fail at 10 minutes lag, not 1 second).
Cache
A GET of a known key, or PING for Redis. Cache failures often shouldn't fail your overall health — most apps degrade gracefully when the cache is down. Mark it as "degraded", not "fail".
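For example, with redis-py's asyncio client (an assumption for this sketch), the ping can be time-bounded the same way and any failure reported as degraded rather than a hard fail:

import asyncio

import redis.asyncio as redis  # assumed client for this example


async def check_cache(client: redis.Redis) -> str:
    try:
        await asyncio.wait_for(client.ping(), timeout=0.5)
        return "ok"
    except Exception:
        # A down cache usually means slower responses, not a broken service,
        # so report it as degraded instead of failing the whole endpoint.
        return "degraded"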
Message queue
A connection check, not a publish. Publishing test messages on every health check pollutes your queue and can cascade.
Downstream services
Generally, don't ping them on each health check. Reasons:
- Their health is their concern, not yours
- Cascading failures: a flap in their health propagates across your whole fleet
- It hammers them with traffic that isn't real user traffic
Instead, monitor the downstream service from your application telemetry (track outbound call success rate, error rate, latency) and surface it via metrics, not via your health endpoint.
The exception: if your service is unable to do anything useful without that downstream — e.g., a payment proxy that has no purpose without the payment provider — then check it, but cache the result and fail closed gracefully.
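If you do have such a hard dependency, a cautious pattern is a short, time-bounded probe whose result goes through the same cache as the other checks. A sketch, assuming httpx as the HTTP client; the provider URL is a placeholder.

import httpx  # assumed HTTP client for this example

PROVIDER_HEALTH_URL = "https://payments.example.com/health"  # placeholder


async def check_payment_provider() -> str:
    try:
        # Short timeout so a slow provider can't stall your own health endpoint.
        async with httpx.AsyncClient(timeout=1.0) as client:
            resp = await client.get(PROVIDER_HEALTH_URL)
        return "ok" if resp.status_code == 200 else "fail"
    except httpx.HTTPError:
        # Unreachable provider: this proxy can't do useful work, so report fail.
        return "fail"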
Disk space, memory, threads
Optional. Useful in some environments. Generally:
- Disk space — useful if you have a known failure mode (logs filling up, write-heavy workloads)
- Memory — usually better tracked as a process metric than a health-check signal
- Thread pool exhaustion — the request to your health endpoint won't be served at all if threads are exhausted, so the health check itself signals this
HTTP Status Codes for Health Endpoints
Use the right code so your monitors and orchestrators can act correctly:
- 200 OK — Service is healthy and ready to serve traffic
- 503 Service Unavailable — Service is up but cannot serve traffic right now (warming up, dependency down, draining for shutdown). Include Retry-After if you can estimate when you'll be ready.
- 500 Internal Server Error — Reserve for bugs in the health check itself. Never for a known dependency failure — that's 503.
- 429 Too Many Requests — Rare on health checks, but valid if you've rate-limited the endpoint
- 204 No Content — Sometimes used for liveness checks where no body is needed; HTTP 200 with empty body is also fine
Most uptime monitors treat any non-2xx as failure, so 503 will (correctly) trigger alerts. Don't return 200 with "status": "fail" in the body and expect monitors to read the body — they often don't.
For more on what each status code means and how to monitor them, see HTTP Status Codes Explained: A Monitoring Guide.
Common Mistakes
1) Always returning 200
The classic. Your health endpoint returns {"status": "ok"} regardless of state. Your monitor passes forever. You discover problems only when users complain. Solution: actually check something.
2) Checking too much
The opposite mistake. Your health endpoint pings the database, cache, queue, three downstream services, validates a JWT, and runs a query. It takes 4 seconds. Under load, every health check times out, all instances are marked unhealthy, and Kubernetes removes them from the load balancer. Your service is now down because of the health check. Solution: cache results, time-bound dependency checks, and separate critical from optional.
3) Not separating internal from external detail
Your /health endpoint exposes detailed dependency info, version SHAs, and internal hostnames. An attacker scrapes it and learns your stack. Solution: have a public /health with minimal info and a separate authenticated /internal/health with full diagnostics.
4) Ignoring degraded states
Health is binary in your code: ok or fail. But "degraded" is a real state — recommendation engine down, search slow, but core flows still work. If you only have ok/fail, you'll either over-alert (failing the whole service) or under-alert (claiming healthy while features are broken). Solution: support "degraded" and treat it differently from "fail".
5) Health check failing means traffic stops, but the failure is transient
Your readiness check fails for 200ms because of a network blip. Kubernetes removes the pod. Now the survivor pods are overloaded; their checks start failing too. The whole deployment cascades into outage. Solution: configure failure thresholds (e.g., 3 consecutive failures before marking unhealthy), tune readiness probes carefully in Kubernetes, and never have your readiness probe depend on a transient external service.
6) Putting the database in the liveness check
Database goes down. Every pod's liveness check fails. Kubernetes restarts every pod. Pods come back up, find database still down, restart-loop. Now you have a thrashing fleet making the recovery harder. Solution: liveness checks should never depend on anything outside the process. Database failures belong in readiness or /health, not liveness.
7) No version or build info
Health endpoint returns just "ok". During an incident, you can't tell which version is running on which instance, whether a deploy went out, or whether the fix landed. Solution: include version, build SHA (truncated), and start time in the response. Useful for debugging without leaking sensitive info.
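One common way to do this is to capture the version, a truncated SHA, and the start time once at boot and merge them into every health response. The environment variable names below are assumptions; use whatever your build pipeline actually sets.

import os
from datetime import datetime, timezone

# Captured once at import time; merge this dict into the /health response body.
BUILD_INFO = {
    "version": os.environ.get("APP_VERSION", "unknown"),
    "sha": os.environ.get("GIT_SHA", "unknown")[:7],  # truncated, not the full SHA
    "started_at": datetime.now(timezone.utc).isoformat(),
}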
8) Forgetting graceful shutdown
Your service starts shutting down. The shutdown takes 30 seconds (drain connections, flush buffers). During those 30 seconds, your health endpoint still returns 200. Traffic keeps arriving and getting dropped. Solution: have your shutdown handler immediately flip readiness to 503 so the load balancer drains traffic away before the actual shutdown starts.
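Continuing the earlier sketch, a minimal version of that handoff flips the readiness flag the moment shutdown begins, so /readyz returns 503 while connections drain. This assumes the app_state flag from the probe example and Unix signal handling; frameworks and platforms vary in how they expose shutdown hooks.

import asyncio
import signal


async def install_shutdown_handler() -> None:
    # Call this from application startup, inside the running event loop.
    loop = asyncio.get_running_loop()

    def begin_shutdown() -> None:
        # Flip readiness first: /readyz starts returning 503, the load
        # balancer drains traffic away, and only then should the real
        # shutdown work (closing connections, flushing buffers) begin.
        app_state["ready"] = False

    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(sig, begin_shutdown)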
Integrating with Kubernetes
If you run on Kubernetes, the probe configuration matters as much as the endpoint:
livenessProbe:
  httpGet:
    path: /livez
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 1
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 2
  failureThreshold: 2
startupProbe:
  httpGet:
    path: /startupz
    port: 8080
  periodSeconds: 5
  failureThreshold: 30
Notes:
- initialDelaySeconds for liveness should be longer than your worst-case startup. Otherwise Kubernetes restarts you mid-boot. Or use a startup probe.
- failureThreshold for readiness should usually be 1–3. Too low and a single blip removes the pod; too high and a real failure goes unnoticed for a minute.
- timeoutSeconds must be longer than your check's worst-case latency. A 1-second timeout on a 1.5-second check creates a slowly cascading outage.
- Use a startup probe for any service that takes more than a few seconds to boot. This separates "starting" from "broken."
For more, see Kubernetes Monitoring: Health Checks and Pod Uptime.
Health Check Monitoring from Outside
Even with great in-cluster probes, you still want external monitoring of the public /health endpoint:
- External checks validate user-perspective reachability — DNS, certificates, ingress, load balancer, the full path your users take. Internal probes don't.
- Catches issues your cluster doesn't see — ingress controller bugs, certificate expiry, DNS misconfigurations
- Multi-region — confirms regional reachability matches your service's claims
- History and SLA reporting — your uptime monitor's record is your audit trail
Configure the external monitor to:
- Hit /health (the public endpoint), not /livez or /readyz
- Validate the response body — confirm the JSON contains "status": "ok", not just that it's a 200
- Check from multiple regions — see Multi-Region Monitoring: Why Location Matters
- Run every 1 minute for production services
- Alert on 3 consecutive failures to suppress single-blip noise
For more on monitoring API endpoints generally, see API Uptime Monitoring with Health Checks.
Health Checks for Microservices
In a microservices system, each service has its own health endpoints — and they shouldn't deeply check each other. The temptation is "let me check that the user service is up" in the order service's health check; the result is a graph of cascading dependencies where one failing service marks the whole system unhealthy.
Better approach:
- Each service health-checks itself, including its own database and cache
- Inter-service health is monitored externally, via metrics and tracing (latency, error rate of outbound calls)
- An aggregate "system health" view can compose individual service health into a system status, but each service's own /health reflects only its own state
See Microservices Monitoring: Health Checks and Service Mesh for the broader picture.
Health Endpoint Design Checklist
For every service you ship to production:
- /livez — shallow, always returns 200 unless the process is unrecoverable
- /readyz — checks the app is fully booted; returns 503 during shutdown drain
- /startupz — only if startup takes more than a few seconds
- /health — public, minimal JSON body, includes critical dependency status
- Database check uses a fast query (SELECT 1) with a short timeout
- Cache failures return "degraded", not "fail"
- Downstream service status comes from metrics, not direct pings on each check
- HTTP 200 for healthy, 503 for unhealthy — never 500 for known failures
- Internal authenticated /internal/health for full diagnostics; public /health minimal
- Shutdown handler flips readiness to 503 before the actual shutdown
- Liveness probe doesn't depend on anything outside the process
- External uptime monitor hits /health from multiple regions every minute
- Body content validation in the external monitor — not just 200, but "status": "ok"
How Webalert Helps Monitor Health Endpoints
Webalert is built to monitor health endpoints exactly the way they're meant to be monitored:
- Content validation — Confirm the response body contains "status": "ok" or your custom signal, not just a 200
- Custom headers and authentication — Hit private health endpoints with Authorization: Bearer tokens or custom keys
- Multi-region checks — Verify the public health endpoint from every region your users come from
- 1-minute check intervals — Detect issues within a minute, not five
- Response time alerts — Catch a slow-but-passing health check before it degrades the user experience
- Status codes — Treat 503 differently from 500, with custom alert rules
- Multi-channel alerts — Email, SMS, Slack, Discord, Microsoft Teams, webhooks
- Status page — Surface external monitoring results to your customers automatically
- 5-minute setup — Add the URL, configure body validation, set thresholds, and you're live
See features and pricing for details.
Summary
- Health checks have three audiences (uptime monitors, orchestrators, other services), each wanting a different answer; design separate endpoints rather than one overloaded /health.
- Liveness checks should be shallow and only fail when the process is unrecoverable — never depend on the database or external services.
- Readiness checks should reflect whether the instance can serve traffic right now, including during shutdown drain.
- Public /health should check critical dependencies lightly, return 503 (not 500) on failure, and include a minimal JSON body for debugging without leaking internals.
- Avoid both extremes: the always-200 health check that hides real failures and the deep-check endpoint that creates cascading outages.
- Configure Kubernetes probes carefully — wrong timeouts and thresholds turn health checks into outage amplifiers.
- Monitor the public /health from outside your cluster with content validation and multi-region checks.
A health check is not a checkbox. It's the most-called endpoint in your service and the contract between your application and every system that depends on it. Designed well, it shortens incidents. Designed badly, it causes them.