
The first thing every team writes when they ship to production is a health check endpoint. Usually one line:
@app.get("/health")
def health():
    return {"status": "ok"}
It returns 200. Your uptime monitor is happy. Kubernetes is happy. Then a year later you have an outage and discover your health check has been returning 200 the whole time — even though the database has been disconnected for ten minutes, your queue is unreachable, and your downstream payment provider has been down for an hour.
Health checks are deceptively easy to write and incredibly easy to write badly. The bad version is worse than no health check at all because it gives you false confidence: "the monitor says we're up, so we must be up."
This guide covers what a real health check should do, the difference between liveness, readiness, and startup probes, when to do shallow checks versus deep checks, and the common mistakes that turn health endpoints into dashboard decoration rather than actual monitoring signal.
What a Health Check Endpoint Is For
Health checks have three distinct audiences, and they want different answers:
- Uptime monitors (Webalert, Pingdom, etc.) want to know: "Is your service reachable from the public internet, can it accept HTTP requests, and is it functionally able to serve traffic?"
- Container orchestrators (Kubernetes, ECS, Nomad) want to know: "Is this specific instance alive? Should I send it traffic? Should I restart it?"
- Other services in your system want to know: "Should I depend on you right now? Should I retry, fail over, or back off?"
These three needs aren't the same. A single /health endpoint that tries to answer all three usually answers none of them well. The better approach is a small set of purpose-built endpoints, each answering one question clearly.
Liveness, Readiness, and Startup Probes
If you only remember three terms from this guide, make them these:
Liveness (/livez)
Question: Is the process alive and not deadlocked?
Behavior: Should be cheap, fast, and almost always return 200. It should only fail when the process is in an unrecoverable state — deadlocked, completely unresponsive, internally corrupted. The expected response to a liveness failure is "kill and restart this process."
Should not check: Database connectivity, external dependencies, or anything that depends on something outside this process. If your database is down, your liveness check should still pass — restarting the pod won't fix the database.
Implementation: Often as simple as returning 200 with no body. Some teams add a single internal check (e.g., "can the request handler thread pool accept work?").
Readiness (/readyz)
Question: Should this instance receive traffic right now?
Behavior: Returns 200 when the instance is fully ready to serve requests, 503 when it's not. The orchestrator removes 503-returning instances from the load balancer until they recover.
Should check:
- Application is fully booted (database connection pool initialized, caches warmed if applicable)
- Critical dependencies needed for most requests are reachable
- Recent dependency failures aren't above a threshold
Should not check: Optional dependencies, heavy operations, or anything that takes more than ~1 second.
Startup (/startupz)
Question: Has this instance finished its initial boot sequence?
Behavior: Used by Kubernetes to delay liveness and readiness probes until the application has finished slow startup tasks (loading large models, warming caches, running migrations). Once startup succeeds, it's ignored — only liveness and readiness apply.
Implementation: Returns 503 during boot, 200 once initialization is complete. Most apps don't need this; it's specifically for slow-starting workloads.
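To make the split concrete, here is a minimal FastAPI sketch of all three probes. The app_state flags are hypothetical placeholders that your own boot and shutdown code would set; the point is that /livez touches nothing outside the process, /readyz reflects whether this instance should receive traffic, and /startupz only reports whether boot has finished.

from fastapi import FastAPI, Response

app = FastAPI()

# Hypothetical in-process flags, set by your own startup/shutdown code.
app_state = {"booted": False, "ready": False}


@app.get("/livez")
def livez():
    # Liveness: no external dependencies. If this handler runs, the process
    # is alive; a deadlocked or wedged process simply won't answer.
    return Response(status_code=200)


@app.get("/readyz")
def readyz():
    # Readiness: 200 only when this instance should receive traffic.
    if app_state["ready"]:
        return Response(status_code=200)
    return Response(status_code=503)


@app.get("/startupz")
def startupz():
    # Startup: 503 until slow boot work (migrations, cache warming) is done.
    if app_state["booted"]:
        return Response(status_code=200)
    return Response(status_code=503)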
What /health (the public one) Should Be
In addition to the orchestrator probes above, you typically want a public /health endpoint that uptime monitors can hit. This one's purpose is different: it answers the question "is this service usefully available right now?"
A good public /health endpoint:
- Returns 200 when the service can serve real traffic — not just "the process is alive"
- Returns 503 (Service Unavailable) when it cannot — never 500, which implies a bug
- Includes a small JSON body indicating which dependencies were checked and their state, useful for debugging
- Responds in under a second in the common case
- Doesn't require authentication (so external monitors can hit it)
- Doesn't leak sensitive details — never expose dependency hostnames, credentials, version SHAs in detail, or internal IPs
A reasonable shape:
{
  "status": "ok",
  "version": "1.42.7",
  "checks": {
    "database": "ok",
    "cache": "ok",
    "queue": "ok"
  }
}
When something is failing:
{
  "status": "degraded",
  "version": "1.42.7",
  "checks": {
    "database": "ok",
    "cache": "fail",
    "queue": "ok"
  }
}
Return HTTP 200 for "ok", 503 for "fail", and your choice for "degraded" (200 if the service can still mostly function, 503 if the failed dependency makes most requests fail).
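A sketch of how that mapping can look in code, continuing the FastAPI app from the probe example. The individual check functions are placeholders, and the decision that only the database and queue are critical is an assumption for illustration; your own split will differ.

from fastapi.responses import JSONResponse

# Placeholder checks; real versions would ping each dependency with a short
# timeout (see the sections below) and return "ok", "degraded", or "fail".
async def check_database() -> str:
    return "ok"


async def check_cache() -> str:
    return "ok"


async def check_queue() -> str:
    return "ok"


# Dependencies whose failure makes the whole service unusable (assumed here).
CRITICAL = {"database", "queue"}


@app.get("/health")
async def health():
    checks = {
        "database": await check_database(),
        "cache": await check_cache(),
        "queue": await check_queue(),
    }
    failed = {name for name, state in checks.items() if state != "ok"}

    if not failed:
        status, code = "ok", 200
    elif failed & CRITICAL:
        status, code = "fail", 503      # a critical dependency is down
    else:
        status, code = "degraded", 200  # still able to serve most traffic

    return JSONResponse(
        status_code=code,
        content={"status": status, "version": "1.42.7", "checks": checks},
    )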
Shallow vs Deep Health Checks
The most contentious health-check question: how thoroughly should you check dependencies?
Shallow checks
Just confirm the process is alive and accepting requests. Don't touch the database, queue, or any external dependency. Some teams call this "I'm here."
Pros: Fast, no false negatives from transient dependency blips, no risk of cascading failures.
Cons: A service whose database has been down for an hour still passes its shallow check. You won't know there's a problem from the health endpoint alone — you have to monitor each dependency separately.
Deep checks
Verify every critical dependency on each request: ping the database, ping the cache, ping any downstream service.
Pros: A single check tells you whether the service can actually do its job.
Cons:
- Slow — each dependency adds latency
- Cascading failures — if the cache flaps, every service that checks the cache flaps in lockstep
- Can take down the whole fleet — a slow dependency check can cause every readiness probe to time out, removing every instance from the load balancer
- Often shows incidents you don't actually need to alert on at the service level
The right balance
For most production services:
- Liveness: shallow only. Don't touch anything outside the process.
- Readiness: shallow or very lightweight checks. A query like SELECT 1 is fine; a complex aggregation isn't.
- Public /health: lightweight checks of critical dependencies (the ones whose failure means your service can't usefully serve any traffic). Cache the result for 5–10 seconds so a flood of monitoring traffic doesn't hammer your dependencies.
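One way to get that 5–10 second caching: wrap the dependency checks in a small time-based cache so a burst of monitor traffic reuses a recent result. A minimal sketch, reusing the placeholder check functions from the /health example above; the 10-second TTL is an assumption you'd tune.

import time

_cache = {"expires": 0.0, "result": None}
CACHE_TTL_SECONDS = 10


async def cached_dependency_checks() -> dict:
    # Reuse the last result while it is still fresh, so a flood of monitor
    # requests doesn't turn into a flood of database and cache pings.
    now = time.monotonic()
    if _cache["result"] is not None and now < _cache["expires"]:
        return _cache["result"]

    result = {
        "database": await check_database(),
        "cache": await check_cache(),
        "queue": await check_queue(),
    }
    _cache["result"] = result
    _cache["expires"] = now + CACHE_TTL_SECONDS
    return result

The /health handler would call cached_dependency_checks() instead of hitting each dependency directly on every request.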
A common refinement: categorize dependencies as critical vs degraded-mode. The database is usually critical. The recommendation engine is usually degraded-mode. If the recommendation engine is down, your service is degraded but not down — return "degraded", not "fail".
What to Check (and How)
Database
Use a fast query that exercises the connection but doesn't touch tables:
- Postgres / MySQL: SELECT 1
- MongoDB: db.runCommand({ ping: 1 })
- Redis: PING
Time-bound the query (e.g., 500ms timeout). A slow database ping is itself a signal — but a deep check shouldn't hang for 30 seconds.
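Filling in the database placeholder from the earlier sketch, here is a time-bounded Postgres ping. It assumes an asyncpg connection pool purely for illustration; the 500 ms budget mirrors the suggestion above.

import asyncio

import asyncpg  # assumed driver for this example


async def check_database(pool: asyncpg.Pool) -> str:
    try:
        # SELECT 1 exercises the connection without touching any tables;
        # wait_for caps the whole check at 500 ms so it can never hang.
        await asyncio.wait_for(pool.fetchval("SELECT 1"), timeout=0.5)
        return "ok"
    except (asyncio.TimeoutError, OSError, asyncpg.PostgresError):
        return "fail"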
For replicated databases, decide what "healthy" means:
- Reads from any replica? Just check one.
- Writes to primary? Check primary specifically.
- Replica lag matters? Check it but threshold loosely (e.g., fail at 10 minutes lag, not 1 second).
Cache
A GET of a known key, or PING for Redis. Cache failures often shouldn't fail your overall health — most apps degrade gracefully when the cache is down. Mark it as "degraded", not "fail".
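For example, with redis-py's asyncio client (an assumption for this sketch), the ping can be time-bounded the same way and any failure reported as degraded rather than a hard fail:

import asyncio

import redis.asyncio as redis  # assumed client for this example


async def check_cache(client: redis.Redis) -> str:
    try:
        await asyncio.wait_for(client.ping(), timeout=0.5)
        return "ok"
    except Exception:
        # A down cache usually means slower responses, not a broken service,
        # so report it as degraded instead of failing the whole endpoint.
        return "degraded"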
Message queue
A connection check, not a publish. Publishing test messages on every health check pollutes your queue and can cascade.
Downstream services
Generally, don't ping them on each health check. Reasons:
- Their health is their concern, not yours
- Cascading failures: a flap in their health propagates across your whole fleet
- It hammers them with traffic that isn't real user traffic
Instead, monitor the downstream service from your application telemetry (track outbound call success rate, error rate, latency) and surface it via metrics, not via your health endpoint.
The exception: if your service is unable to do anything useful without that downstream — e.g., a payment proxy that has no purpose without the payment provider — then check it, but cache the result and fail closed gracefully.
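If you do have such a hard dependency, a cautious pattern is a short, time-bounded probe whose result goes through the same cache as the other checks. A sketch, assuming httpx as the HTTP client; the provider URL is a placeholder.

import httpx  # assumed HTTP client for this example

PROVIDER_HEALTH_URL = "https://payments.example.com/health"  # placeholder


async def check_payment_provider() -> str:
    try:
        # Short timeout so a slow provider can't stall your own health endpoint.
        async with httpx.AsyncClient(timeout=1.0) as client:
            resp = await client.get(PROVIDER_HEALTH_URL)
        return "ok" if resp.status_code == 200 else "fail"
    except httpx.HTTPError:
        # Unreachable provider: this proxy can't do useful work, so report fail.
        return "fail"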
Disk space, memory, threads
Optional. Useful in some environments. Generally:
- Disk space — useful if you have a known failure mode (logs filling up, write-heavy workloads)
- Memory — usually better tracked as a process metric than a health-check signal
- Thread pool exhaustion — the request to your health endpoint won't be served at all if threads are exhausted, so the health check itself signals this
HTTP Status Codes for Health Endpoints
Use the right code so your monitors and orchestrators can act correctly:
- 200 OK — Service is healthy and ready to serve traffic
- 503 Service Unavailable — Service is up but cannot serve traffic right now (warming up, dependency down, draining for shutdown). Include Retry-After if you can estimate when you'll be ready.
- 500 Internal Server Error — Reserve for bugs in the health check itself. Never for a known dependency failure — that's 503.
- 429 Too Many Requests — Rare on health checks, but valid if you've rate-limited the endpoint
- 204 No Content — Sometimes used for liveness checks where no body is needed; HTTP 200 with empty body is also fine
Most uptime monitors treat any non-2xx as failure, so 503 will (correctly) trigger alerts. Don't return 200 with "status": "fail" in the body and expect monitors to read the body — they often don't.
For more on what each status code means and how to monitor them, see HTTP Status Codes Explained: A Monitoring Guide.
Common Mistakes
1) Always returning 200
The classic. Your health endpoint returns {"status": "ok"} regardless of state. Your monitor passes forever. You discover problems only when users complain. Solution: actually check something.
2) Checking too much
The opposite mistake. Your health endpoint pings the database, cache, queue, three downstream services, validates a JWT, and runs a query. It takes 4 seconds. Under load, every health check times out, all instances are marked unhealthy, and Kubernetes removes them from the load balancer. Your service is now down because of the health check. Solution: cache results, time-bound dependency checks, and separate critical from optional.
3) Not separating internal from external detail
Your /health endpoint exposes detailed dependency info, version SHAs, and internal hostnames. An attacker scrapes it and learns your stack. Solution: have a public /health with minimal info and a separate authenticated /internal/health with full diagnostics.
4) Ignoring degraded states
Health is binary in your code: ok or fail. But "degraded" is a real state — recommendation engine down, search slow, but core flows still work. If you only have ok/fail, you'll either over-alert (failing the whole service) or under-alert (claiming healthy while features are broken). Solution: support "degraded" and treat it differently from "fail".
5) Health check failing means traffic stops, but the failure is transient
Your readiness check fails for 200ms because of a network blip. Kubernetes removes the pod. Now the survivor pods are overloaded; their checks start failing too. The whole deployment cascades into outage. Solution: configure failure thresholds (e.g., 3 consecutive failures before marking unhealthy), tune readiness probes carefully in Kubernetes, and never have your readiness probe depend on a transient external service.
6) Putting the database in the liveness check
Database goes down. Every pod's liveness check fails. Kubernetes restarts every pod. Pods come back up, find database still down, restart-loop. Now you have a thrashing fleet making the recovery harder. Solution: liveness checks should never depend on anything outside the process. Database failures belong in readiness or /health, not liveness.
7) No version or build info
Health endpoint returns just "ok". During an incident, you can't tell which version is running on which instance, whether a deploy went out, or whether the fix landed. Solution: include version, build SHA (truncated), and start time in the response. Useful for debugging without leaking sensitive info.
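One common way to do this is to capture the version, a truncated SHA, and the start time once at boot and merge them into every health response. The environment variable names below are assumptions; use whatever your build pipeline actually sets.

import os
from datetime import datetime, timezone

# Captured once at import time; merge this dict into the /health response body.
BUILD_INFO = {
    "version": os.environ.get("APP_VERSION", "unknown"),
    "sha": os.environ.get("GIT_SHA", "unknown")[:7],  # truncated, not the full SHA
    "started_at": datetime.now(timezone.utc).isoformat(),
}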
8) Forgetting graceful shutdown
Your service starts shutting down. The shutdown takes 30 seconds (drain connections, flush buffers). During those 30 seconds, your health endpoint still returns 200. Traffic keeps arriving and getting dropped. Solution: have your shutdown handler immediately flip readiness to 503 so the load balancer drains traffic away before the actual shutdown starts.
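Continuing the earlier sketch, a minimal version of that handoff flips the readiness flag the moment shutdown begins, so /readyz returns 503 while connections drain. This assumes the app_state flag from the probe example and Unix signal handling; frameworks and platforms vary in how they expose shutdown hooks.

import asyncio
import signal


async def install_shutdown_handler() -> None:
    # Call this from application startup, inside the running event loop.
    loop = asyncio.get_running_loop()

    def begin_shutdown() -> None:
        # Flip readiness first: /readyz starts returning 503, the load
        # balancer drains traffic away, and only then should the real
        # shutdown work (closing connections, flushing buffers) begin.
        app_state["ready"] = False

    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(sig, begin_shutdown)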
Integrating with Kubernetes
If you run on Kubernetes, the probe configuration matters as much as the endpoint:
livenessProbe:
  httpGet:
    path: /livez
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 1
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 2
  failureThreshold: 2
startupProbe:
  httpGet:
    path: /startupz
    port: 8080
  periodSeconds: 5
  failureThreshold: 30
Notes:
- initialDelaySeconds for liveness should be longer than your worst-case startup. Otherwise Kubernetes restarts you mid-boot. Or use a startup probe.
- failureThreshold for readiness should usually be 1–3. Too low and a single blip removes the pod; too high and a real failure goes unnoticed for a minute.
- timeoutSeconds must be longer than your check's worst-case latency. A 1-second timeout on a 1.5-second check creates a slowly cascading outage.
- Use a startup probe for any service that takes more than a few seconds to boot. This separates "starting" from "broken."
For more, see Kubernetes Monitoring: Health Checks and Pod Uptime.
Health Check Monitoring from Outside
Even with great in-cluster probes, you still want external monitoring of the public /health endpoint:
- External checks validate user-perspective reachability — DNS, certificates, ingress, load balancer, the full path your users take. Internal probes don't.
- Catches issues your cluster doesn't see — ingress controller bugs, certificate expiry, DNS misconfigurations
- Multi-region — confirms regional reachability matches your service's claims
- History and SLA reporting — your uptime monitor's record is your audit trail
Configure the external monitor to:
- Hit /health (the public endpoint), not /livez or /readyz
- Validate the response body — confirm the JSON contains "status": "ok", not just that it's a 200
- Check from multiple regions — see Multi-Region Monitoring: Why Location Matters
- Run every 1 minute for production services
- Alert on 3 consecutive failures to suppress single-blip noise
For more on monitoring API endpoints generally, see API Uptime Monitoring with Health Checks.
Health Checks for Microservices
In a microservices system, each service has its own health endpoints — and they shouldn't deeply check each other. The temptation is "let me check that the user service is up" in the order service's health check; the result is a graph of cascading dependencies where one failing service marks the whole system unhealthy.
Better approach:
- Each service health-checks itself, including its own database and cache
- Inter-service health is monitored externally, via metrics and tracing (latency, error rate of outbound calls)
- An aggregate "system health" view can compose individual service health into a system status, but each service's own /health reflects only its own state
See Microservices Monitoring: Health Checks and Service Mesh for the broader picture.
Health Endpoint Design Checklist
For every service you ship to production:
- /livez — shallow, always returns 200 unless the process is unrecoverable
- /readyz — checks the app is fully booted; returns 503 during shutdown drain
- /startupz — only if startup takes more than a few seconds
- /health — public, minimal JSON body, includes critical dependency status
- Database check uses a fast query (SELECT 1) with a short timeout
- Cache failures return "degraded", not "fail"
- Downstream service status comes from metrics, not direct pings on each check
- HTTP 200 for healthy, 503 for unhealthy — never 500 for known failures
- Internal authenticated /internal/health for full diagnostics; public /health minimal
- Shutdown handler flips readiness to 503 before the actual shutdown
- Liveness probe doesn't depend on anything outside the process
- External uptime monitor hits /health from multiple regions every minute
- Body content validation in the external monitor — not just 200, but "status": "ok"
How Webalert Helps Monitor Health Endpoints
Webalert is built to monitor health endpoints exactly the way they're meant to be monitored:
- Content validation — Confirm the response body contains "status": "ok" or your custom signal, not just a 200
- Custom headers and authentication — Hit private health endpoints with Authorization: Bearer tokens or custom keys
- Multi-region checks — Verify the public health endpoint from every region your users come from
- 1-minute check intervals — Detect issues within a minute, not five
- Response time alerts — Catch a slow-but-passing health check before it degrades the user experience
- Status codes — Treat 503 differently from 500, with custom alert rules
- Multi-channel alerts — Email, SMS, Slack, Discord, Microsoft Teams, webhooks
- Status page — Surface external monitoring results to your customers automatically
- 5-minute setup — Add the URL, configure body validation, set thresholds, and you're live
See features and pricing for details.
Summary
- Health checks have three audiences (uptime monitors, orchestrators, other services), each wanting a different answer; design separate endpoints rather than one overloaded /health.
- Liveness checks should be shallow and only fail when the process is unrecoverable — never depend on the database or external services.
- Readiness checks should reflect whether the instance can serve traffic right now, including during shutdown drain.
- Public /health should check critical dependencies lightly, return 503 (not 500) on failure, and include a minimal JSON body for debugging without leaking internals.
- Avoid both extremes: the always-200 health check that hides real failures and the deep-check endpoint that creates cascading outages.
- Configure Kubernetes probes carefully — wrong timeouts and thresholds turn health checks into outage amplifiers.
- Monitor the public /health from outside your cluster with content validation and multi-region checks.
A health check is not a checkbox. It's the most-called endpoint in your service and the contract between your application and every system that depends on it. Designed well, it shortens incidents. Designed badly, it causes them.