
The worst cron failures are the ones that do not fail at all. The job simply stops running. No exception. No 500. No alert. Backups stop happening. Invoices stop sending. Stale data piles up. Two weeks later, an auditor or a customer notices.
Most monitoring is wired for failures. A cron job that does not run produces no failure to monitor. You need the inverse: a dead-man switch that alerts when an expected heartbeat is missing.
This guide covers how to design dead-man switches for cron jobs, scheduled tasks, and background pipelines: grace periods, missed vs failed jobs, last-success timestamps, idempotency, and how to alert without drowning in noise.
For broader background-task monitoring, see Cron Job Monitoring.
Why Cron Jobs Disappear
Cron jobs and scheduled tasks vanish for boring reasons:
- A server was decommissioned and the crontab moved with it - sort of.
- A Kubernetes CronJob has `suspend: true` after a Helm rollout.
- A scheduler container is in `CrashLoopBackOff` for an hour before the alert fires, or no alert fires because liveness probes pass.
- The job started, hit a `panic` early, exited 0 by accident.
- The cron schedule was changed from `*/5 * * * *` to `*/5 * * * 1-5` and nobody noticed the weekend gap.
- A new region was launched without the cron config.
- DST or timezone changes pushed the job into a maintenance window.
- The deploy pipeline switched to a new image that no longer includes the cron binary.
None of these throw an exception. The job just stops.
What a Dead-Man Switch Actually Is
A dead-man switch is a monitor that alerts on silence.
The pattern:
- The job pings a known URL or writes a heartbeat at the end of a successful run.
- The monitor expects the ping at a known schedule.
- If the ping does not arrive within `expected_interval + grace_period`, the monitor alerts.
It inverts the usual "alert on error" pattern into "alert on absence of success."
A trivial implementation:
# At the END of a successful cron run
curl -fsS -m 10 https://ping.example.com/cron/nightly-billing/ok || true
The monitor knows the schedule and the grace period. If no ping arrives, it pages.
If the job is only partially successful, do not ping. The heartbeat should be sent only on complete success, so that a partial run looks like silence and the dead-man switch fires.
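The same "ping only after complete success" rule can be sketched in Python. This is a minimal illustration, not a definitive implementation: `run_with_heartbeat` and `http_ping` are hypothetical helper names, and the URL in the usage note is the placeholder from the curl example.

```python
import urllib.request
from typing import Callable


def http_ping(url: str, timeout: float = 10.0) -> None:
    """Send the heartbeat; urlopen raises on network errors and 4xx/5xx."""
    with urllib.request.urlopen(url, timeout=timeout):
        pass


def run_with_heartbeat(job: Callable[[], None], ping: Callable[[], None]) -> None:
    """Run the job, then ping. If the job raises, the ping is never sent."""
    job()  # any exception propagates here, so the monitor sees silence
    try:
        ping()
    except Exception:
        # A flaky ping endpoint should not turn a good run into a failed one;
        # this mirrors the `|| true` in the curl example.
        pass
```

A caller would wrap the real job, for example `run_with_heartbeat(run_billing, lambda: http_ping("https://ping.example.com/cron/nightly-billing/ok"))`, where `run_billing` is whatever function does the work.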
Missed vs Failed: Two Different Alerts
These are not the same problem and should not share the same alert:
| Type | Meaning | Detection | Owner |
|---|---|---|---|
| Failed | Job ran, errored | Exception, non-zero exit, error log | App owner |
| Missed | Job did not run | No heartbeat in expected window | Platform / scheduler owner |
| Late | Ran but past SLA | Ping arrived after grace period | Platform owner |
| Skipped | Ran but exited intentionally without doing work | Heartbeat with status=skipped | App owner |
Combine them and you cannot triage. A "failed" alert tells you to check application logs. A "missed" alert tells you to check the scheduler.
Track all four. Alert on missed and failed with different runbooks.
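The four states in the table can be told apart with a small classifier. The sketch below assumes the monitor knows when a heartbeat is due (`due_at`), that failed runs report `status` equal to `"error"`, and that a `pending` state exists for runs still inside their window; all of these names are illustrative assumptions rather than part of any particular tool.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def classify_run(
    due_at: datetime,
    grace: timedelta,
    now: datetime,
    heartbeat_at: Optional[datetime] = None,
    status: Optional[str] = None,
) -> str:
    """Return one of: pending, missed, failed, skipped, late, ok."""
    deadline = due_at + grace
    if heartbeat_at is None:
        # Silence: either still inside the window, or a scheduler problem.
        return "missed" if now > deadline else "pending"
    if status == "error":
        return "failed"   # the job ran and reported an error
    if status == "skipped":
        return "skipped"  # the job ran but intentionally did no work
    if heartbeat_at > deadline:
        return "late"     # success arrived, but after the grace period
    return "ok"
```

Routing then follows the table: `missed` and `late` go to the platform or scheduler owner, `failed` and `skipped` to the application owner.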
Designing the Heartbeat
A heartbeat should answer three questions: what ran, when, and what it did.
Bad heartbeat:
GET /cron-ok
You cannot tell which job ran, when, or whether it succeeded.
Better heartbeat:
POST /heartbeat
{
  "job": "nightly-billing",
  "run_id": "2026-05-26T03:00:00Z-7f3a",
  "started_at": "2026-05-26T03:00:00Z",
  "finished_at": "2026-05-26T03:04:11Z",
  "duration_ms": 251300,
  "rows_processed": 4821,
  "status": "ok"
}
Now you can:
- Detect missed jobs (no row arrived in the window).
- Detect slow jobs (`duration_ms` trend).
- Detect skipped jobs (`status: skipped`).
- Detect "ran but did nothing" (`rows_processed == 0` for a job that should always do work).
- Reconcile with application logs via `run_id`.
Store the last 30 days of heartbeats per job. Many silent failures show up as "this job has processed zero rows for 5 days."
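With stored heartbeats, the "ran but did nothing" case reduces to counting a streak. A minimal sketch, assuming heartbeats are dictionaries shaped like the payload above and ordered oldest first; `zero_work_streak` is a hypothetical helper name.

```python
from typing import Any, Dict, List


def zero_work_streak(heartbeats: List[Dict[str, Any]]) -> int:
    """Count the most recent consecutive ok heartbeats with zero rows.

    `heartbeats` must be ordered oldest first. The streak ends at the first
    heartbeat (walking backwards) that did real work or was not ok.
    """
    streak = 0
    for hb in reversed(heartbeats):
        if hb.get("status") == "ok" and hb.get("rows_processed", 0) == 0:
            streak += 1
        else:
            break
    return streak
```

An alert rule could then compare the streak with a per-job threshold, for example paging when a daily job reaches a streak of 5.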
Grace Periods, Done Right
A 5-minute cron job that pages at exactly 5 minutes is a noise machine. Pick grace periods carefully.
Rule of thumb:
| Schedule | Suggested grace period |
|---|---|
| Every minute | 2 minutes |
| Every 5 minutes | 10 minutes |
| Hourly | 30 minutes |
| Every 4 hours | 1 hour |
| Daily | 2 hours |
| Weekly | 6 hours |
| Monthly | 24 hours |
Grace period covers:
- Job retry on first failure.
- Cluster scheduling delays.
- DST transitions on daily jobs.
- Deploy windows when schedulers restart.
For business-critical jobs (billing, payouts, backups), keep grace tight and ensure the runbook is clear. For non-critical jobs (cleanup, cache warming), keep grace generous to avoid noise.
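The rule-of-thumb table can be encoded as a lookup so that new jobs get a default grace period. This sketch makes two assumptions that are not in the table itself: schedules that fall between rows take the grace of the nearest shorter row, and "monthly" is approximated as 30 days.

```python
from datetime import timedelta

# (schedule interval, suggested grace) pairs from the rule-of-thumb table.
GRACE_TABLE = [
    (timedelta(minutes=1), timedelta(minutes=2)),
    (timedelta(minutes=5), timedelta(minutes=10)),
    (timedelta(hours=1), timedelta(minutes=30)),
    (timedelta(hours=4), timedelta(hours=1)),
    (timedelta(days=1), timedelta(hours=2)),
    (timedelta(weeks=1), timedelta(hours=6)),
    (timedelta(days=30), timedelta(hours=24)),  # "monthly", approximated
]


def suggested_grace(interval: timedelta) -> timedelta:
    """Return the grace of the longest table row not exceeding `interval`."""
    grace = GRACE_TABLE[0][1]  # anything under a minute uses the first row
    for row_interval, row_grace in GRACE_TABLE:
        if interval >= row_interval:
            grace = row_grace
    return grace
```

Treat the result as a starting point: tighten it for billing, payouts, and backups, and loosen it for cleanup or cache-warming jobs.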
The Last-Success Timestamp Pattern
For long-running pipelines and complex schedules, you do not need a per-run heartbeat. You need a last-success timestamp the monitor can read.
Pattern:
- After each successful run, write `cron_last_success_at` to a known location:
  - A database table.
  - A health endpoint that exposes the value.
  - An S3 object.
  - A status file in object storage.
- A monitor polls the value.
- If `now() - cron_last_success_at > expected_interval + grace`, alert.
The advantage: it survives missed runs. Even if the job is down for hours, the timestamp does not change, and any monitor can read it.
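The staleness rule from the pattern is a one-line comparison. A minimal sketch, with `is_stale` as a hypothetical function name and timezone-aware UTC datetimes assumed throughout:

```python
from datetime import datetime, timedelta, timezone


def is_stale(
    last_success_at: datetime,
    now: datetime,
    expected_interval: timedelta,
    grace: timedelta,
) -> bool:
    """True when the last success is older than interval plus grace."""
    return now - last_success_at > expected_interval + grace
```

For a daily job with a 2-hour grace period, a success recorded at 03:04:11 stays fresh until 05:04:11 the following day and is stale one second later.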
A simple endpoint:
GET /internal/health/cron/nightly-billing
200 OK
{
  "job": "nightly-billing",
  "last_success_at": "2026-05-26T03:04:11Z",
  "expected_interval_minutes": 1440,
  "grace_minutes": 120,
  "stale": false
}
The monitor checks the endpoint and asserts:
- Status 200.
- `stale: false`.
- `last_success_at` within the expected window.
This is exactly the pattern documented in Health Check Endpoint Design, specialised for cron jobs.
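On the monitor side, the three assertions can be sketched as a function over the HTTP status and response body. This is an illustration under the assumption that the endpoint returns the JSON fields shown above; `check_cron_health` is a hypothetical name, and the monitor recomputes the window itself rather than trusting `stale` alone.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import List


def check_cron_health(status_code: int, body: str, now: datetime) -> List[str]:
    """Return a list of problems; an empty list means the check passed."""
    if status_code != 200:
        return [f"unexpected status {status_code}"]
    try:
        data = json.loads(body)
    except ValueError:
        return ["body is not valid JSON"]
    if not isinstance(data, dict):
        return ["body is not a JSON object"]

    problems = []
    if data.get("stale") is not False:
        problems.append("stale is not false")
    try:
        # fromisoformat on older Pythons rejects a trailing "Z".
        raw = str(data["last_success_at"]).replace("Z", "+00:00")
        last = datetime.fromisoformat(raw)
        if last.tzinfo is None:
            last = last.replace(tzinfo=timezone.utc)  # assume UTC
        window = timedelta(
            minutes=data["expected_interval_minutes"] + data["grace_minutes"]
        )
    except (KeyError, TypeError, ValueError):
        problems.append("missing or malformed timing fields")
    else:
        if now - last > window:
            problems.append("last_success_at outside expected window")
    return problems
```

Note that an endpoint returning `{}` with a 200 status fails both content checks, which is the point of asserting on the body and not only on the status code.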
Idempotency and Retries
A dead-man switch assumes you can re-run a missed job safely. Many cannot.
Make jobs idempotent before you tighten alerting:
- Use a unique `run_id` per logical period (not per attempt).
- Use unique constraints on side effects (`unique(invoice_id, period)`).
- Read-then-write within a transaction or use upserts.
- Track `processed_at` rather than re-querying "what is new since last run."
- Truncate temp tables on start, not on success.
When the job is idempotent, you can:
- Retry automatically on missed runs.
- Replay a window after an incident.
- Stop fearing duplicate heartbeats.
For job pipelines that need this property, see Job Queue Monitoring.
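A unique constraint plus a per-period `run_id` is often enough to make a re-run harmless. The sketch below uses an in-memory SQLite database purely for illustration; the `invoices` table, its columns, and the `bill_period` function are hypothetical, and `INSERT OR IGNORE` is SQLite syntax that other databases spell differently.

```python
import sqlite3
from typing import Iterable, Tuple


def create_schema(conn: sqlite3.Connection) -> None:
    conn.execute(
        """
        CREATE TABLE invoices (
            invoice_id   TEXT NOT NULL,
            period       TEXT NOT NULL,
            amount_cents INTEGER NOT NULL,
            run_id       TEXT NOT NULL,
            UNIQUE (invoice_id, period)
        )
        """
    )


def bill_period(
    conn: sqlite3.Connection, period: str, charges: Iterable[Tuple[str, int]]
) -> int:
    """Insert invoices for a period; repeated calls do not create duplicates."""
    run_id = f"nightly-billing:{period}"  # one id per logical period, not per attempt
    with conn:  # single transaction: commit on success, roll back on error
        for invoice_id, amount_cents in charges:
            conn.execute(
                "INSERT OR IGNORE INTO invoices "
                "(invoice_id, period, amount_cents, run_id) VALUES (?, ?, ?, ?)",
                (invoice_id, period, amount_cents, run_id),
            )
    row = conn.execute(
        "SELECT COUNT(*) FROM invoices WHERE period = ?", (period,)
    ).fetchone()
    return row[0]
```

Calling `bill_period` twice for the same period leaves the row count unchanged, which is what makes an automatic retry after a missed-run alert safe.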
Kubernetes CronJob Specifics
Kubernetes CronJobs hide a lot of failure modes behind passive defaults:
- `concurrencyPolicy: Allow` lets a slow job overlap with the next run and exhaust resources.
- `startingDeadlineSeconds` unset means missed runs may never be scheduled.
- `successfulJobsHistoryLimit` is small by default; you lose history.
- `suspend: true` is set by Helm rollouts more often than people expect.
- `ImagePullBackOff` on a CronJob is silent unless you alert on it.
Monitoring checklist:
- Alert on CronJob `suspend: true` when not expected.
- Alert when `lastScheduleTime` is too old.
- Alert when `lastSuccessfulTime` is too old (Kubernetes 1.28+).
- Alert on Job `failed` count > 0.
- Alert when no Pods were created for a scheduled run.
Pair these with external dead-man switches; do not trust Kubernetes alone to tell you a CronJob ran.
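Several of the checklist items can be evaluated from the CronJob object itself, for example the JSON printed by `kubectl get cronjob <name> -o json`. The sketch below reads only `spec.suspend`, `status.lastScheduleTime`, and `status.lastSuccessfulTime`; the function name and the age thresholds are assumptions to be tuned per job.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List


def _parse_k8s_time(value: str) -> datetime:
    # Kubernetes timestamps are RFC 3339, e.g. "2026-05-26T03:00:00Z".
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def cronjob_problems(
    cronjob: Dict[str, Any],
    now: datetime,
    max_schedule_age: timedelta,
    max_success_age: timedelta,
    suspend_expected: bool = False,
) -> List[str]:
    """Return human-readable problems found on one CronJob object."""
    spec = cronjob.get("spec", {})
    status = cronjob.get("status", {})
    problems = []
    if spec.get("suspend", False) and not suspend_expected:
        problems.append("suspended unexpectedly")
    for field, max_age in (
        ("lastScheduleTime", max_schedule_age),
        ("lastSuccessfulTime", max_success_age),
    ):
        raw = status.get(field)
        if raw is None:
            # Never scheduled or never succeeded, or the cluster does not
            # report this field.
            problems.append(f"{field} missing")
        elif now - _parse_k8s_time(raw) > max_age:
            problems.append(f"{field} too old")
    return problems
```

This covers only what the API server reports, so it complements the external heartbeat and does not replace it.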
Schedulers That Lie
Several common schedulers report success when they actually failed:
- System cron with `> /dev/null 2>&1` and no `set -e` will mask script failures.
- Airflow marks a DAG as success if a sensor times out with `mode=poke` in some configurations.
- Cloud Scheduler, EventBridge, and scheduled Lambda invocations treat a 2xx HTTP response from the target as success; they cannot see whether the target actually did the work.
- Heroku Scheduler does not retry. Missed = lost.
- Sidekiq-cron / BullMQ repeatable jobs can silently stop firing when the Redis key expires.
The dead-man switch is the outside-in check that does not trust any of these schedulers. The job itself must claim success.
Schedule the Monitor, Not Just the Job
A dead-man switch only works if the monitor itself is running. Monitor the monitor:
- The dead-man switch service has its own heartbeat to an external service.
- External uptime checks confirm the monitor endpoint is responsive.
- Alert if the monitor restarts and has not received any heartbeats for an unexpected interval.
If you run your dead-man switch inside the same cluster as the cron jobs it watches, an outage takes both out simultaneously. Run it externally - that is exactly what services like Webalert exist for.
SLO and Reporting
Once dead-man switches are in place, you can express cron reliability as an SLO:
- Schedule adherence: % of expected runs that arrived within grace.
- Success rate: % of runs with `status: ok`.
- End-to-end freshness: time since `last_success_at` across critical jobs.
A simple weekly report:
| Job | Expected runs | Heartbeats | Failed | Late | Success % |
|---|---|---|---|---|---|
| nightly-billing | 7 | 7 | 0 | 0 | 100% |
| hourly-cache-rebuild | 168 | 167 | 1 | 2 | 98.2% |
| weekly-report-export | 1 | 0 | 0 | 0 | 0% |
The "0 heartbeats, 0 failures" row is the one a status-code-only monitor would miss. That is exactly the silent failure dead-man switches catch.
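The percentages in the report can be computed from three counters. The sketch below takes one reading that reproduces the figures in the table: heartbeats are sent only on success, and a run counts only if its heartbeat arrived within grace. The function name is hypothetical.

```python
def on_time_success_pct(expected_runs: int, heartbeats: int, late: int) -> float:
    """Percentage of expected runs whose heartbeat arrived within grace."""
    if expected_runs <= 0:
        raise ValueError("expected_runs must be positive")
    on_time = heartbeats - late
    return round(100.0 * on_time / expected_runs, 1)
```

For `hourly-cache-rebuild`, 167 heartbeats with 2 late leaves 165 on-time runs out of 168 expected, which rounds to 98.2%.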
Alerting Thresholds
Critical
- Business-critical job (billing, payouts, backups) missed by more than grace period.
- Last-success timestamp older than 2x expected interval.
- Job ran successfully but processed 0 rows for N consecutive runs.
- Dead-man switch monitor itself has not received heartbeats for any job for 15 minutes.
High
- Non-critical job missed.
- Job duration trending 50% slower week over week.
- CronJob `suspend: true` set without a change ticket.
- Grace-period exhaustion approaching (job arriving close to deadline).
Informational
- New job heartbeat observed (deploy added a job).
- Job removed (deploy removed a job).
- Schedule changed.
Route critical alerts to on-call, high alerts to the job owner, informational to a chat channel. See Alert Fatigue.
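The routing rule can be written as a small lookup. In this sketch the event names and destination labels are made up for illustration; only the three severity tiers and their destinations come from the thresholds above.

```python
ROUTES = {
    "critical": "on-call",
    "high": "job-owner",
    "informational": "chat-channel",
}

# Hypothetical event names grouped by the tiers listed above.
CRITICAL_EVENTS = {"last_success_2x_interval", "zero_rows_streak", "monitor_silent"}
HIGH_EVENTS = {"duration_regression", "unexpected_suspend", "near_deadline"}
INFO_EVENTS = {"job_added", "job_removed", "schedule_changed"}


def route_alert(event: str, business_critical: bool = False) -> str:
    """Map an event to a destination; a missed run depends on job criticality."""
    if event == "missed":
        severity = "critical" if business_critical else "high"
    elif event in CRITICAL_EVENTS:
        severity = "critical"
    elif event in HIGH_EVENTS:
        severity = "high"
    elif event in INFO_EVENTS:
        severity = "informational"
    else:
        raise ValueError(f"unknown event: {event}")
    return ROUTES[severity]
```

Raising on unknown events is a deliberate choice here, so that a new alert type cannot be silently dropped by the router.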
Cron Dead-Man Switch Checklist
- Every critical job has a dead-man switch with grace period
- Heartbeats include job name, run id, status, duration, work done
- Missed and failed alerts are separate with separate runbooks
- Last-success timestamp exposed via health endpoint
- External monitor reads the health endpoint from outside the cluster
- Jobs are idempotent so missed runs can be safely re-run
- Kubernetes CronJob `suspend`, `lastScheduleTime`, `lastSuccessfulTime` alerted on
- Schedulers are not trusted to self-report success
- Dead-man switch service has its own heartbeat
- Weekly schedule-adherence report reviewed
- Runbook covers: re-run, backfill, escalation, communications
- Daylight Saving and timezone behaviour documented per job
When a missed-job incident does happen, a clear Incident Runbook makes recovery much faster.
How Webalert Helps
Webalert is the external monitor a dead-man switch needs:
- Heartbeat endpoints - Provide a unique URL per job that your cron pings on success. Webalert alerts when the expected ping is late or missing.
- Health endpoint polling - Poll your `/internal/health/cron/<job>` endpoint and assert `stale: false`, `last_success_at` recency, and expected schedule metadata.
- Content validation - Confirm the response actually says success, not just returns 200. A misconfigured endpoint that returns `{}` for everything will not pass content checks.
- Schedule-aware alerting - Define expected interval and grace period per job; Webalert pages on absence, not just failure.
- Multi-region checks - Detect when an endpoint is healthy in one region but unreachable in another.
- Outside-the-cluster - Webalert runs externally, so a cluster outage cannot also silence your dead-man switch.
- Alert routing - Send "missed" alerts to platform on-call and "failed" alerts to the job owner.
Example Webalert check:
- URL: `https://api.example.com/internal/health/cron/nightly-billing`
- Method: `GET`
- Headers: `Authorization: Bearer <probe-token>`
- Expected status: `200`
- Must contain: `"stale":false`
- Must not contain: `"status":"error"`
- Response time: under 1500ms
- Region: US + EU
For the bearer-token piece, see Monitor Authenticated APIs.
Summary
Cron jobs that error fail loudly. Cron jobs that stop running fail silently, and silence is exactly what status-code monitoring cannot detect.
Build dead-man switches with explicit grace periods, separate missed vs failed alerts, expose last-success timestamps, make jobs idempotent, and run the monitor outside the cluster. Done well, the next "Why didn't this job run?" question gets answered before the customer asks it.