Kubernetes Monitoring: Health Checks, Pod Uptime, and Alerting

Kubernetes is excellent at keeping your workloads running. When a pod crashes, it restarts it. When a node fails, it reschedules to another. When demand spikes, it scales out.

But Kubernetes doesn't tell you whether your users can actually reach your application.

A pod can be marked "Running" while returning 500s. A service can exist while routing to zero healthy pods. An ingress can be configured while SSL termination is broken. Kubernetes is watching the containers — but nobody is watching what users see.

This guide covers the full Kubernetes monitoring picture: what the cluster watches internally, what you need to watch externally, and how to build alerting that catches real failures before your users do.

What Kubernetes Monitors Internally

Before adding external monitoring, understand what Kubernetes already does for you.

Liveness probes

A liveness probe tells Kubernetes whether a container is alive. If the probe fails, Kubernetes kills the container and restarts it.

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3

If your app deadlocks or gets into an unrecoverable state, the liveness probe catches it and triggers a restart.

Readiness probes

A readiness probe tells Kubernetes whether a container is ready to accept traffic. If the probe fails, Kubernetes removes the pod from the service's endpoints — traffic stops routing to it.

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 2

Readiness probes prevent traffic from reaching pods that are still warming up, running database migrations, or temporarily overloaded.

Startup probes

A startup probe gives slow-starting containers time to become ready before liveness probes kick in. Useful for apps with long initialization times.

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 10

What internal probes can't tell you

Kubernetes probes check individual containers from inside the cluster. They don't check:

Whether external users can reach your service through the ingress
Whether SSL/TLS termination is working at the load balancer
Whether DNS resolves your domain to the correct IP
Whether the service is accessible from all geographic regions
Whether response content is correct (not just status code 200)
Whether response times are within acceptable limits for users

For all of that, you need external monitoring.

The Kubernetes Monitoring Stack

A complete monitoring strategy for Kubernetes has three layers:

Layer 3: External monitoring (user experience)
   ↕  HTTP checks, SSL, DNS, response time from outside the cluster

Layer 2: Internal observability (cluster internals)
   ↕  Prometheus, Grafana, resource metrics, pod status

Layer 1: Kubernetes internal probes (container health)
   ↕  Liveness, readiness, startup probes

Most teams set up Layer 1 (probes) automatically. Many add Layer 2 (Prometheus + Grafana) eventually. But Layer 3 — the external view that mirrors what users experience — is often the last to be added, even though it's the most important for catching real outages.

This guide focuses on Layer 3, which complements your existing K8s health check and observability tooling.

What to Monitor Externally

1. Your public-facing ingress endpoints

The ingress controller is where external traffic enters your cluster. A misconfigured ingress rule, a broken TLS secret, or a failed ingress controller deployment can make your application completely unreachable — while all pods inside the cluster appear healthy.

What to monitor:

Primary domain (https://yourapp.com)
API subdomain (https://api.yourapp.com/health)
Any public-facing service endpoints
Canary or staging endpoints (if publicly accessible)

Check every 1 minute. Kubernetes deployments and rollouts happen frequently — you want immediate detection if a rollout breaks the ingress.

2. Service health endpoints

Well-designed Kubernetes applications expose dedicated health endpoints:

/healthz or /health — Basic liveness check (is the process running?)
/readyz or /ready — Readiness check (is the process ready for traffic?)
/metrics — Prometheus-format metrics (scraped by Prometheus, not for external checks)

Monitor the /healthz and /readyz endpoints from outside the cluster. A 200 response from /readyz through the ingress means traffic is routing correctly through the ingress → service → pod chain.

Content validation tip: Don't just check for status 200. Add a content check that verifies the response body contains "status":"ok" or equivalent. This catches cases where your app returns 200 with an error body (a surprisingly common failure mode).

3. SSL/TLS certificates

Kubernetes manages TLS certificates in several ways:

cert-manager — Automatically provisions and renews Let's Encrypt certificates, storing them as Kubernetes secrets
Cloud-managed certs — GKE Managed Certificates, AWS ACM via ALB Ingress Controller
Manual secrets — Self-managed certificates stored as TLS secrets

All of these can fail silently. cert-manager renewal can fail if DNS validation records were removed. Cloud-managed cert renewal can fail after cluster upgrades. Manually managed certs simply expire.

Monitor certificate expiry with alerts at 30, 14, and 7 days. Also monitor certificate validity (correct domain, valid chain) — catch mismatches immediately.

4. DNS resolution

Your cluster might be perfectly healthy while DNS resolution is broken. This happens when:

External DNS controller (which syncs K8s services to your DNS provider) has an error
A Helm release or Terraform apply updated DNS records incorrectly
Your DNS provider has an outage

Monitor DNS resolution for all critical domains. Verify they resolve to the expected load balancer IP or hostname.

5. Response time

Kubernetes has many failure modes that don't cause outages but severely degrade performance:

Resource starvation — Pods running near CPU/memory limits slow down
Pod eviction — During node pressure, low-priority pods are evicted and restarting
Horizontal Pod Autoscaler lag — Traffic spike before HPA has scaled out
Cold starts — New pods handling traffic before warm-up is complete
Noisy neighbor — Another workload on the same node consuming resources

Monitor response time trends. Alert when sustained response time exceeds 2-3x your baseline.

6. Background job heartbeats

Kubernetes CronJobs run scheduled tasks inside the cluster. If a CronJob fails — because the image can't be pulled, the pod can't be scheduled, or the job itself errors — it fails silently from the outside.

Use heartbeat monitoring: configure your CronJob to ping a URL when it completes successfully. If the heartbeat is missed within the expected window, you get an alert.

# Example: CronJob that pings Webalert heartbeat on completion
apiVersion: batch/v1
kind: CronJob
metadata:
  name: data-processor
spec:
  schedule: "0 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: processor
            image: your-processor:latest
            command:
            - /bin/sh
            - -c
            - |
              python process.py && \
              curl -s https://app.web-alert.io/api/heartbeat/YOUR_HEARTBEAT_ID

Common Kubernetes Failure Modes and How to Catch Them

CrashLoopBackOff

What happens: A container crashes repeatedly. Kubernetes keeps restarting it with increasing delays (1s, 2s, 4s, 8s...). Eventually the pod is stuck in CrashLoopBackOff.

External symptom: The service returns 503 or times out because all pods are unavailable.

Detection: External HTTP monitor returns 503 or times out. Alert fires within 1-2 minutes of the crash loop starting.

ImagePullBackOff

What happens: A pod can't pull its container image. This happens when a new deployment references a non-existent image tag, a registry is unreachable, or pull secrets are missing.

External symptom: The new pods never become ready. The deployment stalls. If the rollout replaced all old pods first, the service is unavailable.

Detection: External HTTP monitor detects the service going down during a deployment. Response time spike before the outage can provide early warning.

OOMKilled

What happens: A container exceeds its memory limit and is killed by the OS. Kubernetes restarts it. If the memory issue persists, the cycle continues.

External symptom: Intermittent 503s or connection resets as pods restart.

Detection: External HTTP monitor detects intermittent failures. Alert fires if consecutive checks fail.

Node Not Ready

What happens: A worker node becomes unavailable. Pods on that node are rescheduled to other nodes after a timeout (default 5 minutes).

External symptom: During rescheduling, if there aren't enough resources on other nodes or if the rescheduled pods take time to start, the service is degraded or unavailable.

Detection: External HTTP monitor detects increased error rates or timeouts during the node failure window.

Ingress Controller Failure

What happens: The ingress controller pod crashes or is misconfigured. All traffic to all services behind the ingress stops routing.

External symptom: All services return 502/504 or are completely unreachable.

Detection: External HTTP monitor detects all monitored endpoints failing simultaneously. The "all at once" pattern suggests infrastructure-level failure rather than application-level.

Failed Rollout

What happens: A new deployment has a bug. The new pods start but return errors. Kubernetes may or may not roll back depending on your deployment configuration.

External symptom: Error rate increases as traffic routes to unhealthy new pods while old pods are terminated.

Detection: External HTTP monitor detects rising error rates immediately after a deployment. Alerts fire within the first minutes, giving the team time to roll back before all old pods are replaced.

Setting Up Health Check Endpoints

If your application doesn't have dedicated health endpoints yet, add them. They're the foundation of both Kubernetes probes and external monitoring.

Minimal health endpoint

# FastAPI example
@app.get("/healthz")
async def health():
    return {"status": "ok"}

// Express example
app.get('/healthz', (req, res) => {
  res.json({ status: 'ok' });
});

// Go example
http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
    w.Header().Set("Content-Type", "application/json")
    w.Write([]byte(`{"status":"ok"}`))
})

Rich readiness endpoint

A readiness endpoint should verify that all dependencies (database, cache, external services) are reachable:

@app.get("/readyz")
async def readiness():
    checks = {}

    # Check database
    try:
        await db.execute("SELECT 1")
        checks["database"] = "ok"
    except Exception as e:
        checks["database"] = f"error: {str(e)}"

    # Check cache
    try:
        await redis.ping()
        checks["cache"] = "ok"
    except Exception as e:
        checks["cache"] = f"error: {str(e)}"

    all_ok = all(v == "ok" for v in checks.values())
    status_code = 200 if all_ok else 503

    return JSONResponse(
        content={"status": "ok" if all_ok else "degraded", "checks": checks},
        status_code=status_code
    )

This gives both Kubernetes readiness probes and external monitoring a meaningful signal — not just "the process is alive" but "all dependencies are healthy."

Alerting for Kubernetes

What to alert on

Condition	Severity	Channel
Service endpoint returns 5xx	Critical	SMS + Slack/Discord
Service endpoint unreachable (timeout)	Critical	SMS + Slack/Discord
SSL certificate expires in 7 days	High	Slack/Discord + email
Response time >3x baseline sustained	High	Slack/Discord
CronJob heartbeat missed	High	Slack/Discord
SSL certificate expires in 30 days	Medium	Email
DNS resolution changed unexpectedly	Medium	Slack/Discord

Alert on deployments

The riskiest moment for any Kubernetes service is during a deployment. Consider temporarily increasing monitoring sensitivity around deployments:

Check every 30 seconds during active rollouts (if your monitoring tool supports it)
Alert after 1 consecutive failure (instead of the usual 2-3) during the deployment window
Enable all notification channels during planned releases

Recovery notifications

Always configure recovery notifications. Kubernetes often resolves issues automatically (pod restart, rescheduling, HPA scale-out). A recovery notification confirms the incident is over without requiring the team to manually verify.

Multi-Region Monitoring for Kubernetes

If your cluster serves users globally, monitor from multiple regions.

A Kubernetes node failure in one availability zone might only affect users routed to that zone. Multi-region monitoring catches failures that are geographically isolated:

Regional load balancer routing issue
Availability zone with degraded node pool
CDN routing users in a specific geography to a broken origin

Check from at minimum 3 geographic regions: US, Europe, and Asia-Pacific. This also helps you catch DNS propagation issues after cluster migrations or IP changes.

How Webalert Monitors Kubernetes

Webalert provides the external monitoring layer that complements your cluster's internal health checks:

HTTP/HTTPS monitoring — Check ingress endpoints every 1 minute with content validation
SSL monitoring — Certificate expiry and validity for cert-manager and cloud-managed certs
DNS monitoring — Verify external DNS controller is keeping records in sync
TCP port monitoring — Check any publicly exposed service port
Response time tracking — Detect degradation from resource starvation, HPA lag, or noisy neighbors
Heartbeat monitoring — Verify Kubernetes CronJobs complete successfully on schedule
Multi-region checks — Catch availability zone and regional failures
Deployment-aware alerting — Tight thresholds during rollouts, on-call escalation when things go wrong
Status pages — Keep users informed during cluster incidents

Kubernetes watches your containers. Webalert watches what your users see.

See features and pricing for the full details.

Summary

Kubernetes probes (liveness, readiness, startup) monitor containers from inside the cluster. They're essential but only see part of the picture.
External monitoring catches what probes miss: ingress failures, SSL issues, DNS problems, geographic outages, and content errors.
Monitor the full chain: domain → DNS → SSL → ingress → service → pod → response body.
Key external checks: HTTP on ingress endpoints, SSL certificate validity and expiry, DNS resolution, response time, CronJob heartbeats.
Common K8s failures detected by external monitoring: CrashLoopBackOff (service goes down), ImagePullBackOff (deployment breaks service), OOMKilled (intermittent errors), node failures (temporary degradation), failed rollouts (error spike during deployment).
Alert with context: critical for outages, high for degradation, medium for upcoming expiries. Always send recovery notifications.

Your cluster is self-healing. Your monitoring should be self-aware.

Monitor what your users see, not just what kubectl shows

Start monitoring free with Webalert →

See features and pricing. No credit card required.