Cron Dead-Man Switch Monitoring: Catch Missed Jobs Fast

Webalert Team
May 26, 2026
11 min read

The worst cron failures are the ones that do not fail at all. The job simply stops running. No exception. No 500. No alert. Backups stop happening. Invoices stop sending. Stale data piles up. Two weeks later, an auditor or a customer notices.

Most monitoring is wired for failures. A cron job that does not run produces no failure to monitor. You need the inverse: a dead-man switch that alerts when an expected heartbeat is missing.

This guide covers how to design dead-man switches for cron jobs, scheduled tasks, and background pipelines: grace periods, missed vs failed jobs, last-success timestamps, idempotency, and how to alert without drowning in noise.

For broader background-task monitoring, see Cron Job Monitoring.


Why Cron Jobs Disappear

Cron jobs and scheduled tasks vanish for boring reasons:

  • A server was decommissioned and its crontab was never migrated to the replacement.
  • A Kubernetes CronJob has suspend: true after a Helm rollout.
  • A scheduler container is in CrashLoopBackOff for an hour before the alert fires, or no alert fires because liveness probes pass.
  • The job started, panicked early, and still exited 0 by accident.
  • The cron schedule was changed from */5 * * * * to */5 * * * 1-5 and nobody noticed the weekend gap.
  • A new region was launched without the cron config.
  • DST or timezone changes pushed the job into a maintenance window.
  • The deploy pipeline switched to a new image that no longer includes the cron binary.

None of these throw an exception. The job just stops.


What a Dead-Man Switch Actually Is

A dead-man switch is a monitor that alerts on silence.

The pattern:

  1. The job pings a known URL or writes a heartbeat at the end of a successful run.
  2. The monitor expects the ping at a known schedule.
  3. If the ping does not arrive within expected_interval + grace_period, the monitor alerts.

It inverts the usual "alert on error" pattern into "alert on absence of success."

A trivial implementation:

# At the END of a successful cron run
curl -fsS -m 10 https://ping.example.com/cron/nightly-billing/ok || true

The monitor knows the schedule and the grace period. If no ping arrives, it pages.

If the job is only partially successful, do not ping. The heartbeat should be sent only on complete success, so that anything less leaves the monitor waiting and triggers the alert.
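
The monitor side of this pattern can be sketched in a few lines. This is a minimal illustration, not how any particular monitoring service implements it; the function name heartbeat_overdue and its parameters are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_overdue(last_ping_at, expected_interval, grace_period, now):
    """Return True when the next heartbeat is past its deadline.

    last_ping_at: timezone-aware datetime of the most recent successful ping.
    expected_interval, grace_period: timedelta values configured for the job.
    """
    # The alert condition from the pattern: expected_interval + grace_period
    # has elapsed since the last ping and nothing new has arrived.
    deadline = last_ping_at + expected_interval + grace_period
    return now > deadline


# Example values: a daily job that last pinged at 03:04:11 UTC,
# checked the next morning with a 2-hour grace period.
last_ping = datetime(2026, 5, 26, 3, 4, 11, tzinfo=timezone.utc)
check_time = datetime(2026, 5, 27, 5, 30, 0, tzinfo=timezone.utc)
overdue = heartbeat_overdue(last_ping, timedelta(days=1), timedelta(hours=2), check_time)
```

Here the deadline is 05:04:11 UTC on the second day, so a check at 05:30 reports the job as overdue.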


Missed vs Failed: Two Different Alerts

These are not the same problem and should not share the same alert:

| Type | Meaning | Detection | Owner |
| --- | --- | --- | --- |
| Failed | Job ran, errored | Exception, non-zero exit, error log | App owner |
| Missed | Job did not run | No heartbeat in expected window | Platform / scheduler owner |
| Late | Ran but past SLA | Ping arrived after grace period | Platform owner |
| Skipped | Ran but exited intentionally without doing work | Heartbeat with status=skipped | App owner |

Combine them and you cannot triage. A "failed" alert tells you to check application logs. A "missed" alert tells you to check the scheduler.

Track all four. Alert on missed and failed with different runbooks.
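
One way to keep the four types apart is to classify each scheduled run from its heartbeat, or from the lack of one. The sketch below is illustrative only: the Heartbeat class, the "pending" state, and the assumption that a job wrapper reports status "failed" when the job errors are not part of the article.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Heartbeat:
    received_at: datetime
    status: str  # assumed values: "ok", "failed", "skipped"


def classify_run(scheduled_at, grace, heartbeat: Optional[Heartbeat], now):
    """Classify one scheduled run as ok, failed, skipped, late, missed, or pending."""
    deadline = scheduled_at + grace
    if heartbeat is None:
        # No heartbeat yet: only a problem once the grace period has elapsed.
        return "missed" if now > deadline else "pending"
    if heartbeat.status == "failed":
        return "failed"
    if heartbeat.status == "skipped":
        return "skipped"
    if heartbeat.received_at > deadline:
        return "late"
    return "ok"
```

Routing then becomes a lookup on the returned string: "failed" and "skipped" go to the app owner, "missed" and "late" to the platform or scheduler owner.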


Designing the Heartbeat

A heartbeat should answer three questions: what ran, when, and what it did.

Bad heartbeat:

GET /cron-ok

You cannot tell which job ran, when, or whether it succeeded.

Better heartbeat:

POST /heartbeat
{
  "job": "nightly-billing",
  "run_id": "2026-05-26T03:00:00Z-7f3a",
  "started_at": "2026-05-26T03:00:00Z",
  "finished_at": "2026-05-26T03:04:11Z",
  "duration_ms": 251000,
  "rows_processed": 4821,
  "status": "ok"
}

Now you can:

  • Detect missed jobs (no row arrived in the window).
  • Detect slow jobs (duration_ms trend).
  • Detect skipped jobs (status: skipped).
  • Detect "ran but did nothing" (rows_processed == 0 for a job that should always do work).
  • Reconcile with application logs via run_id.

Store the last 30 days of heartbeats per job. Many silent failures only show up as a trend, for example "this job has processed zero rows for 5 days."
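
As a rough sketch of both halves, the snippet below assembles a heartbeat with the fields shown above and counts how many of the most recent runs did no work. The helper names build_heartbeat and consecutive_empty_runs are hypothetical, and the code assumes all timestamps are UTC and stored in the same ISO 8601 format.

```python
from datetime import datetime, timezone

ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def build_heartbeat(job, run_id, started_at, finished_at, rows_processed, status="ok"):
    """Assemble a heartbeat payload; datetimes are assumed to be in UTC."""
    return {
        "job": job,
        "run_id": run_id,
        "started_at": started_at.strftime(ISO_FORMAT),
        "finished_at": finished_at.strftime(ISO_FORMAT),
        "duration_ms": int((finished_at - started_at).total_seconds() * 1000),
        "rows_processed": rows_processed,
        "status": status,
    }


def consecutive_empty_runs(heartbeats):
    """Count the most recent consecutive ok runs that processed zero rows."""
    count = 0
    # Same-format UTC ISO 8601 strings sort chronologically as plain text.
    for hb in sorted(heartbeats, key=lambda h: h["finished_at"], reverse=True):
        if hb["status"] == "ok" and hb["rows_processed"] == 0:
            count += 1
        else:
            break
    return count
```

A monitor could alert when consecutive_empty_runs exceeds a per-job threshold for jobs that should always have work to do.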


Grace Periods, Done Right

A 5-minute cron job that pages at exactly 5 minutes is a noise machine. Pick grace periods carefully.

Rule of thumb:

| Schedule | Suggested grace period |
| --- | --- |
| Every minute | 2 minutes |
| Every 5 minutes | 10 minutes |
| Hourly | 30 minutes |
| Every 4 hours | 1 hour |
| Daily | 2 hours |
| Weekly | 6 hours |
| Monthly | 24 hours |

Grace period covers:

  • Job retry on first failure.
  • Cluster scheduling delays.
  • DST transitions on daily jobs.
  • Deploy windows when schedulers restart.

For business-critical jobs (billing, payouts, backups), keep grace tight and ensure the runbook is clear. For non-critical jobs (cleanup, cache warming), keep grace generous to avoid noise.
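
If you want a default grace period in configuration rather than in people's heads, the rule-of-thumb table can be encoded as a lookup. This is only a sketch: treating "monthly" as 28 days and picking the row at or below the job's interval are assumptions made for this example, and critical or noisy jobs should still be tuned by hand as described above.

```python
from datetime import timedelta

# (schedule interval, suggested grace) pairs taken from the rule-of-thumb table.
# "Monthly" is approximated as 28 days here, which is an assumption.
GRACE_TABLE = [
    (timedelta(minutes=1), timedelta(minutes=2)),
    (timedelta(minutes=5), timedelta(minutes=10)),
    (timedelta(hours=1), timedelta(minutes=30)),
    (timedelta(hours=4), timedelta(hours=1)),
    (timedelta(days=1), timedelta(hours=2)),
    (timedelta(weeks=1), timedelta(hours=6)),
    (timedelta(days=28), timedelta(hours=24)),
]


def suggested_grace(interval):
    """Return the grace of the largest table row whose interval is <= the given one."""
    chosen = GRACE_TABLE[0][1]  # anything faster than one minute gets the smallest grace
    for row_interval, grace in GRACE_TABLE:
        if interval >= row_interval:
            chosen = grace
    return chosen
```

For example, a job that runs every 15 minutes falls back to the "every 5 minutes" row and gets a 10-minute default.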


The Last-Success Timestamp Pattern

For long-running pipelines and complex schedules, you do not need a per-run heartbeat. You need a last-success timestamp the monitor can read.

Pattern:

  1. After each successful run, write cron_last_success_at to a known location:
    • A database table.
    • A health endpoint that exposes the value.
    • A status file in object storage (for example, an S3 object).
  2. A monitor polls the value.
  3. If now() - cron_last_success_at > expected_interval + grace, alert.

The advantage: it survives missed runs. Even if the job is down for hours, the timestamp does not change, and any monitor can read it.

A simple endpoint:

GET /internal/health/cron/nightly-billing

200 OK
{
  "job": "nightly-billing",
  "last_success_at": "2026-05-26T03:04:11Z",
  "expected_interval_minutes": 1440,
  "grace_minutes": 120,
  "stale": false
}

The monitor checks the endpoint and asserts:

  • Status 200.
  • stale: false.
  • last_success_at within the expected window.

This is exactly the pattern documented in Health Check Endpoint Design, specialised for cron jobs.
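
The body of such an endpoint can be computed from the stored timestamp alone. The function below is a minimal sketch that returns the same fields as the example response; the name cron_health is hypothetical, and wiring it into a web framework and loading last_success_at from storage are left out.

```python
from datetime import datetime, timedelta, timezone


def cron_health(job, last_success_at, expected_interval_minutes, grace_minutes, now=None):
    """Build the health payload for one job; datetimes are assumed to be in UTC."""
    now = now or datetime.now(timezone.utc)
    allowed_age = timedelta(minutes=expected_interval_minutes + grace_minutes)
    return {
        "job": job,
        "last_success_at": last_success_at.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "expected_interval_minutes": expected_interval_minutes,
        "grace_minutes": grace_minutes,
        # Stale once now - last_success_at exceeds expected_interval + grace.
        "stale": now - last_success_at > allowed_age,
    }
```

Computing stale on the server keeps the check simple for any monitor, while exposing last_success_at lets a stricter monitor apply its own window.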


Idempotency and Retries

A dead-man switch assumes you can re-run a missed job safely. Many cannot.

Make jobs idempotent before you tighten alerting:

  • Use a unique run_id per logical period (not per attempt).
  • Use unique constraints on side effects (unique(invoice_id, period)).
  • Read-then-write within a transaction or use upserts.
  • Track processed_at rather than re-querying "what is new since last run."
  • Truncate temp tables on start, not on success.

When the job is idempotent, you can:

  • Retry automatically on missed runs.
  • Replay a window after an incident.
  • Stop fearing duplicate heartbeats.

For job pipelines that need this property, see Job Queue Monitoring.
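
To make the unique-constraint idea concrete, here is a small sketch using SQLite from the Python standard library. The charges table, its columns, and the function names are invented for the example; the same shape applies to upserts in other databases.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a uniqueness guard on the side effect."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE charges ("
        " invoice_id TEXT NOT NULL,"
        " period TEXT NOT NULL,"
        " amount_cents INTEGER NOT NULL,"
        " UNIQUE (invoice_id, period))"
    )
    return conn


def bill_period(conn, invoice_id, period, amount_cents):
    """Record a charge once per (invoice_id, period); repeated calls are no-ops.

    Returns True only when a new row was written.
    """
    cur = conn.execute(
        "INSERT OR IGNORE INTO charges (invoice_id, period, amount_cents)"
        " VALUES (?, ?, ?)",
        (invoice_id, period, amount_cents),
    )
    conn.commit()
    return cur.rowcount == 1
```

Because the second call for the same invoice and period writes nothing, re-running a missed or half-finished job cannot double-charge.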


Kubernetes CronJob Specifics

Kubernetes CronJobs hide a lot of failure modes behind passive defaults:

  • concurrencyPolicy: Allow (the default) lets a slow job overlap with the next run and exhaust resources.
  • startingDeadlineSeconds unset means there is no catch-up deadline; if the controller counts more than 100 missed schedules, it stops starting the Job and only logs an error.
  • successfulJobsHistoryLimit defaults to 3 and failedJobsHistoryLimit to 1, so run history disappears quickly.
  • suspend: true is set by Helm rollouts more often than people expect.
  • ImagePullBackOff on a CronJob's Pod is silent unless you alert on it.

Monitoring checklist:

  • Alert on CronJob suspend: true when not expected.
  • Alert when lastScheduleTime is too old.
  • Alert when status.lastSuccessfulTime is too old (the field is part of the batch/v1 CronJob status; confirm that your cluster version populates it).
  • Alert on Job failed count > 0.
  • Alert when no Pods were created for a scheduled run.

Pair these with external dead-man switches; do not trust Kubernetes alone to tell you a CronJob ran.
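
Part of the checklist can be evaluated from the CronJob object itself, for instance the JSON printed by kubectl get cronjob <name> -o json. The sketch below inspects spec.suspend, status.lastScheduleTime, and status.lastSuccessfulTime on an already-parsed dict; the function names and the max_age parameter are assumptions, and fetching the object is left to whatever client you already use.

```python
from datetime import datetime, timedelta, timezone


def parse_k8s_time(value):
    """Parse a Kubernetes RFC 3339 timestamp such as 2026-05-26T03:00:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def cronjob_problems(cronjob, max_age, now, suspend_expected=False):
    """Return a list of human-readable problems found on one CronJob object."""
    problems = []
    spec = cronjob.get("spec", {})
    status = cronjob.get("status", {})

    if spec.get("suspend", False) and not suspend_expected:
        problems.append("suspended")

    for field in ("lastScheduleTime", "lastSuccessfulTime"):
        raw = status.get(field)
        if raw is None:
            # Never scheduled or never succeeded, as far as the API knows.
            problems.append(f"{field} missing")
        elif now - parse_k8s_time(raw) > max_age:
            problems.append(f"{field} too old")
    return problems
```

An empty list means the object looks healthy from the cluster's point of view, which still says nothing about whether the job did its work.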


Schedulers That Lie

Several common schedulers report success when they actually failed:

  • System cron with > /dev/null 2>&1 and no set -e will mask script failures.
  • Airflow can finish a DAG run as success when a sensor configured with soft_fail=True times out: the sensor and its downstream tasks are skipped rather than failed.
  • Cloud Scheduler, EventBridge, and scheduled Lambda invocations treat a 2xx HTTP response or an accepted invocation as success; they cannot see whether the target actually did the work.
  • Heroku Scheduler does not retry. Missed = lost.
  • Sidekiq-cron / BullMQ repeatable jobs can silently stop firing if the Redis keys that hold the schedule expire, are evicted, or are flushed.

The dead-man switch is the outside-in check that does not trust any of these schedulers. The job itself must claim success.
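
One way to let the job claim success itself is a thin wrapper that runs the real command and pings only on exit code 0. The sketch below uses only the standard library; run_and_report and ping are hypothetical names, and the ping URL would be whatever your monitor assigns to the job.

```python
import subprocess
import urllib.request


def ping(url, timeout=10):
    """Send the heartbeat; network errors are swallowed, like `|| true` with curl."""
    try:
        urllib.request.urlopen(url, timeout=timeout).close()
    except OSError:
        # URLError and HTTPError are OSError subclasses. A failed ping must not
        # fail the job; the monitor will alert on the missing heartbeat instead.
        pass


def run_and_report(command, send_ping):
    """Run the job command and call send_ping only when it exits 0.

    Returns the command's exit code so cron still sees the real result.
    """
    result = subprocess.run(command)
    if result.returncode == 0:
        send_ping()
    return result.returncode
```

A crontab entry would then call the wrapper instead of the job, for example with send_ping set to lambda: ping("https://ping.example.com/cron/nightly-billing/ok"), so a masked failure or a redirect to /dev/null can no longer produce a false heartbeat.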


Schedule the Monitor, Not Just the Job

A dead-man switch only works if the monitor itself is running. Monitor the monitor:

  • The dead-man switch service has its own heartbeat to an external service.
  • External uptime checks confirm the monitor endpoint is responsive.
  • Alert if the monitor restarts and has not received any heartbeats for an unexpected interval.

If you run your dead-man switch inside the same cluster as the cron jobs it watches, an outage takes both out simultaneously. Run it externally - that is exactly what services like Webalert exist for.


SLO and Reporting

Once dead-man switches are in place, you can express cron reliability as an SLO:

  • Schedule adherence: % of expected runs that arrived within grace.
  • Success rate: % of runs with status: ok.
  • End-to-end freshness: time since last_success_at across critical jobs.

A simple weekly report:

| Job | Expected runs | Heartbeats | Failed | Late | Success % |
| --- | --- | --- | --- | --- | --- |
| nightly-billing | 7 | 7 | 0 | 0 | 100% |
| hourly-cache-rebuild | 168 | 167 | 1 | 2 | 98.2% |
| weekly-report-export | 1 | 0 | 0 | 0 | 0% |

The "0 heartbeats, 0 failures" row is the one a status-code-only monitor would miss. That is exactly the silent failure dead-man switches catch.
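
Schedule adherence is straightforward to compute from the stored heartbeats. The sketch below counts a run as adherent when its heartbeat arrived within grace, which is one reasonable reading of the metric rather than a fixed definition; under that reading it happens to reproduce the percentages in the sample report. The function names are assumptions.

```python
def missed_runs(expected_runs, heartbeats):
    """Runs for which no heartbeat arrived at all."""
    return max(expected_runs - heartbeats, 0)


def schedule_adherence(expected_runs, heartbeats, late):
    """Percentage of expected runs whose heartbeat arrived within grace.

    Returns None when no runs were expected in the reporting window.
    """
    if expected_runs == 0:
        return None
    on_time = heartbeats - late
    return round(100.0 * on_time / expected_runs, 1)
```

For hourly-cache-rebuild this gives (167 - 2) / 168, or 98.2%, with one missed run; for weekly-report-export it gives 0.0% with one missed run, which is the row to escalate.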


Alerting Thresholds

Critical

  • Business-critical job (billing, payouts, backups) missed by more than grace period.
  • Last-success timestamp older than 2x expected interval.
  • Job ran successfully but processed 0 rows for N consecutive runs.
  • Dead-man switch monitor itself has not received heartbeats for any job for 15 minutes.

High

  • Non-critical job missed.
  • Job duration trending 50% slower week over week.
  • CronJob suspend: true set without a change ticket.
  • Grace-period exhaustion approaching (job arriving close to deadline).

Informational

  • New job heartbeat observed (deploy added a job).
  • Job removed (deploy removed a job).
  • Schedule changed.

Route critical alerts to on-call, high alerts to the job owner, informational to a chat channel. See Alert Fatigue.
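
The routing rule can be written down as data plus one small decision function. This sketch covers only the missed-job thresholds listed above; the destination names and function names are placeholders, not identifiers from any real alerting tool.

```python
from datetime import timedelta

# Destination per severity, following the routing described above.
ROUTES = {
    "critical": "on-call",
    "high": "job-owner",
    "informational": "chat-channel",
}


def severity_for_missed_job(business_critical, last_success_age, expected_interval):
    """Pick a severity for a job that has missed its grace period."""
    if business_critical:
        return "critical"
    if last_success_age > 2 * expected_interval:
        # Last success older than 2x the expected interval is critical for any job.
        return "critical"
    return "high"


def route_missed_job(business_critical, last_success_age, expected_interval):
    """Return the destination that should receive the missed-job alert."""
    return ROUTES[severity_for_missed_job(business_critical, last_success_age, expected_interval)]
```

Keeping the thresholds in one function makes it easier to review them alongside the runbooks when a job's criticality changes.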


Cron Dead-Man Switch Checklist

  • Every critical job has a dead-man switch with grace period
  • Heartbeats include job name, run id, status, duration, work done
  • Missed and failed alerts are separate with separate runbooks
  • Last-success timestamp exposed via health endpoint
  • External monitor reads the health endpoint from outside the cluster
  • Jobs are idempotent so missed runs can be safely re-run
  • Kubernetes CronJob suspend, lastScheduleTime, lastSuccessfulTime alerted on
  • Schedulers are not trusted to self-report success
  • Dead-man switch service has its own heartbeat
  • Weekly schedule-adherence report reviewed
  • Runbook covers: re-run, backfill, escalation, communications
  • Daylight Saving and timezone behaviour documented per job

When a missed-job incident does happen, a clear Incident Runbook keeps recovery from taking longer than detection did.


How Webalert Helps

Webalert is the external monitor a dead-man switch needs:

  • Heartbeat endpoints - Provide a unique URL per job that your cron pings on success. Webalert alerts when the expected ping is late or missing.
  • Health endpoint polling - Poll your /internal/health/cron/<job> endpoint and assert stale: false, last_success_at recency, and expected schedule metadata.
  • Content validation - Confirm the response actually says success, not just returns 200. A misconfigured endpoint that returns {} for everything will not pass content checks.
  • Schedule-aware alerting - Define expected interval and grace period per job; Webalert pages on absence, not just failure.
  • Multi-region checks - Detect when an endpoint is healthy in one region but unreachable in another.
  • Outside-the-cluster - Webalert runs externally, so a cluster outage cannot also silence your dead-man switch.
  • Alert routing - Send "missed" alerts to platform on-call and "failed" alerts to the job owner.

Example Webalert check:

  • URL: https://api.example.com/internal/health/cron/nightly-billing
  • Method: GET
  • Headers: Authorization: Bearer <probe-token>
  • Expected status: 200
  • Must contain: "stale":false
  • Must not contain: "status":"error"
  • Response time: under 1500ms
  • Region: US + EU

For the bearer-token piece, see Monitor Authenticated APIs.


Summary

Cron jobs that crash fail loudly. Cron jobs that stop running fail silently, and silence is exactly what status-code monitoring cannot detect.

Build dead-man switches with explicit grace periods, separate missed vs failed alerts, expose last-success timestamps, make jobs idempotent, and run the monitor outside the cluster. Done well, the next "Why didn't this job run?" question gets answered before the customer asks it.

