How to Write an Incident Runbook That Works

Webalert Team
March 19, 2026
8 min read

At 3 AM, an alert fires. The on-call engineer opens the notification, sees "Database connection pool exhausted," and needs to act.

Without a runbook, they spend the first 15 minutes figuring out where to look, which credentials to use, and what commands are safe to run. With a runbook, they open the linked document and follow a step-by-step procedure that has worked before.

Runbooks are the difference between panicked debugging and calm execution. This guide shows how to write, structure, and maintain runbooks that your on-call team will actually use during incidents.


What Is a Runbook?

A runbook is a documented procedure for handling a specific operational scenario. It answers: "When X happens, do Y."

Good runbooks are:

  • Specific — One runbook per alert or failure mode, not a general troubleshooting guide
  • Actionable — Step-by-step commands and decisions, not abstract advice
  • Current — Updated after every incident that reveals gaps
  • Accessible — Linked directly from alert notifications, not buried in a wiki
  • Tested — Verified periodically so steps still work

A runbook is not a post-mortem, an architecture document, or a knowledge base article. It is a procedure to follow under pressure.


Runbook Structure Template

Every runbook should follow a consistent structure so engineers can navigate it quickly during incidents. Start with a short header:

  • Title: Clear description of the scenario (e.g., "Database Connection Pool Exhausted")
  • Severity: Expected incident severity
  • Owner: Team responsible for maintaining this runbook
  • Last verified: Date the runbook was last tested or confirmed accurate
  • Related alerts: Which monitoring alerts trigger this runbook

Symptoms

What does this failure look like? List observable signals:

  • Alert name and message
  • User-visible symptoms (slow pages, error messages, failed transactions)
  • Dashboard indicators (metric spikes, log patterns)

Impact Assessment

Quick questions to determine scope:

  • How many users are affected?
  • Which services or features are impacted?
  • Is data at risk?
  • What is the revenue impact per minute?

Diagnostic Steps

Ordered steps to confirm the root cause:

  1. Check a specific dashboard or metric
  2. Run a specific command or query
  3. Look for a specific log pattern
  4. Confirm or rule out a specific hypothesis

Each step should include the exact command, URL, or query — not "check the database."

Resolution Steps

Ordered steps to fix the issue:

  1. Immediate mitigation (stop the bleeding)
  2. Root cause fix (if safe to apply now)
  3. Verification (confirm the fix worked)

Include exact commands with placeholders for environment-specific values. Mark any destructive or irreversible steps with warnings.

Escalation

When and how to escalate:

  • If diagnostic steps do not identify the cause within [X] minutes, escalate to [team/person]
  • If the fix requires access you do not have, contact [person/team]
  • If user impact exceeds [threshold], trigger the full incident response process

Post-Incident

  • Link to post-mortem template
  • Update this runbook if any steps were wrong or missing
  • File follow-up tickets for permanent fixes
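
A consistent structure is easy to check automatically. The following is a minimal sketch, assuming runbooks are stored as plain text with each section heading on its own line. The names `REQUIRED_SECTIONS` and `missing_sections` are illustrative and not part of any tool.

```python
# Section headings from the template above.
REQUIRED_SECTIONS = [
    "Symptoms",
    "Impact Assessment",
    "Diagnostic Steps",
    "Resolution Steps",
    "Escalation",
    "Post-Incident",
]


def missing_sections(runbook_text: str) -> list[str]:
    """Return required headings that do not appear on a line of their own."""
    headings = {line.strip() for line in runbook_text.splitlines()}
    return [name for name in REQUIRED_SECTIONS if name not in headings]


# A hypothetical, incomplete draft.
draft = """Title: Database Connection Pool Exhausted
Symptoms
Diagnostic Steps
Resolution Steps
"""

print(missing_sections(draft))
```

A check like this could run whenever a runbook is edited, so gaps are caught before an incident rather than during one.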

Runbook Examples by Scenario

Example 1: High Error Rate on API

Symptoms: Monitoring alert for >5% 5xx error rate on /api/v2/* endpoints.

Diagnostic steps:

  1. Open API dashboard — check which endpoints have elevated errors
  2. Check recent deployments — was anything deployed in the last 30 minutes?
  3. Check downstream dependencies — are database, cache, or third-party APIs healthy?
  4. Check application logs for the most common error type

Resolution:

  • If caused by recent deployment: roll back to previous version
  • If caused by dependency failure: check dependency status page, enable circuit breaker if available
  • If caused by traffic spike: verify auto-scaling, consider temporary rate limiting
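
Diagnostic step 1 asks which endpoints have elevated errors. If the dashboard is unavailable, the same answer can be derived from access logs. This is a rough sketch that assumes the logs have already been parsed into (endpoint, status code) pairs; the function names and sample data are hypothetical.

```python
from collections import Counter

ERROR_RATE_THRESHOLD = 0.05  # the >5% threshold from the alert above


def error_rate_by_endpoint(requests):
    """Map each endpoint to its share of 5xx responses.

    `requests` is an iterable of (endpoint, status_code) pairs.
    """
    totals, errors = Counter(), Counter()
    for endpoint, status in requests:
        totals[endpoint] += 1
        if 500 <= status <= 599:
            errors[endpoint] += 1
    return {ep: errors[ep] / totals[ep] for ep in totals}


def endpoints_over_threshold(requests, threshold=ERROR_RATE_THRESHOLD):
    """Return the endpoints whose 5xx rate exceeds the threshold, sorted by name."""
    rates = error_rate_by_endpoint(requests)
    return sorted(ep for ep, rate in rates.items() if rate > threshold)


# Hypothetical sample: 2 of 20 requests to /api/v2/orders failed.
sample = (
    [("/api/v2/orders", 200)] * 18
    + [("/api/v2/orders", 503)] * 2
    + [("/api/v2/users", 200)] * 20
)
print(endpoints_over_threshold(sample))
```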

Example 2: SSL Certificate Expiring

Symptoms: Monitoring alert for SSL certificate expiring within 7 days.

Diagnostic steps:

  1. Confirm which domain and certificate are affected
  2. Check if auto-renewal is configured
  3. Check renewal logs for errors

Resolution:

  1. If auto-renewal failed: manually trigger renewal via certificate manager
  2. If certificate is managed externally: contact the certificate provider
  3. Verify renewed certificate is serving correctly from all regions
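
To confirm the alert and later verify the renewal, it helps to compute the remaining lifetime from the certificate's expiry timestamp. The sketch below assumes the timestamp is in the `notAfter` format that Python's `ssl.SSLSocket.getpeercert()` returns; fetching the certificate itself is left out, and the dates are made up for illustration.

```python
import ssl
from datetime import datetime, timezone

WARN_DAYS = 7  # matches the 7-day alert window above


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter timestamp.

    `not_after` looks like "Jun  1 12:00:00 2026 GMT".
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def expiring_soon(not_after: str, now: datetime, warn_days: int = WARN_DAYS) -> bool:
    """True when the certificate expires within `warn_days` days."""
    return days_until_expiry(not_after, now) < warn_days


# Hypothetical values for illustration.
now = datetime(2026, 5, 28, 12, 0, 0, tzinfo=timezone.utc)
print(days_until_expiry("Jun  1 12:00:00 2026 GMT", now))
```

Running the same check after renewal, against each region's endpoint, covers resolution step 3.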

Example 3: Scheduled Job Missed

Symptoms: Heartbeat monitoring alert — expected signal not received.

Diagnostic steps:

  1. Check if the job scheduler is running
  2. Check job logs for errors or timeouts
  3. Check if the job completed but failed to send the heartbeat signal
  4. Check resource availability (disk space, memory, queue depth)

Resolution:

  1. If scheduler stopped: restart the scheduler service
  2. If job timed out: investigate data volume or resource constraints
  3. If heartbeat endpoint changed: update the job's heartbeat URL
  4. Manually trigger the missed job if safe and idempotent
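
The logic behind this runbook can be written down in a few lines. This sketch assumes the job reports on a fixed interval with a grace period; `heartbeat_missed` and `triage` are hypothetical helpers, and the triage order simply follows the diagnostic steps above.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_missed(last_seen: datetime, interval: timedelta,
                     grace: timedelta, now: datetime) -> bool:
    """True when the next expected signal is overdue by more than `grace`."""
    return now > last_seen + interval + grace


def triage(scheduler_running: bool, job_completed: bool) -> str:
    """Pick the first resolution step based on diagnostic steps 1 to 3."""
    if not scheduler_running:
        return "restart the scheduler service"
    if job_completed:
        return "check the job's heartbeat URL"
    return "investigate job logs for errors or timeouts"


# Hypothetical hourly job that last reported at 02:00 UTC.
last_seen = datetime(2026, 3, 19, 2, 0, tzinfo=timezone.utc)
now = datetime(2026, 3, 19, 3, 10, tzinfo=timezone.utc)
print(heartbeat_missed(last_seen, timedelta(hours=1), timedelta(minutes=5), now))
print(triage(scheduler_running=True, job_completed=True))
```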

Writing Tips for Better Runbooks

Use exact commands, not descriptions

Bad: "Check the database connection count."

Good: "Run SELECT count(*) FROM pg_stat_activity WHERE state = 'active'; on the primary database. If count exceeds 90, proceed to resolution step 1."

Include expected output

After each diagnostic command, note what "normal" and "abnormal" results look like. The person running the command at 3 AM may not know the baseline.
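
One way to make sure no step ships without its expected output is to store steps as structured data and render the runbook from it. A minimal sketch follows; the `DiagnosticStep` fields are an assumption about what a team might record, the query and the threshold of 90 come from the example above, and the "normal" baseline is a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiagnosticStep:
    command: str      # exact command or query to run
    where: str        # system to run it on
    normal: str       # what a healthy result looks like
    abnormal: str     # what confirms the problem
    if_abnormal: str  # what to do next


def render(number: int, step: DiagnosticStep) -> str:
    """Format one step as plain text for a runbook page."""
    return "\n".join([
        f"{number}. Run on {step.where}: {step.command}",
        f"   Normal: {step.normal}",
        f"   Abnormal: {step.abnormal}",
        f"   If abnormal: {step.if_abnormal}",
    ])


step = DiagnosticStep(
    command="SELECT count(*) FROM pg_stat_activity WHERE state = 'active';",
    where="the primary database",
    normal="count well below 90 (placeholder; record your own baseline)",
    abnormal="count exceeds 90",
    if_abnormal="proceed to resolution step 1",
)
print(render(1, step))
```

Because every field is required, a step without a documented baseline fails at authoring time instead of at 3 AM.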

Mark dangerous steps clearly

If a step can cause data loss, downtime, or is irreversible, make it unmistakable:

WARNING: This command will restart the production database. All active connections will be dropped. Only proceed if resolution steps 1-3 have failed.
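
If runbook steps are wrapped in scripts, the same warning can be enforced in code. This is only a sketch of one possible guard: `warning_banner` builds the message in the format shown above, and `confirm_destructive` requires the operator to type the exact target name. Both helpers are hypothetical.

```python
def warning_banner(action: str, consequence: str, precondition: str) -> str:
    """Build a warning line in the same shape as the example above."""
    return (
        f"WARNING: This command will {action}. {consequence} "
        f"Only proceed if {precondition}."
    )


def confirm_destructive(target: str, typed: str) -> bool:
    """Allow the step only if the operator typed the target name exactly."""
    return typed.strip() == target


banner = warning_banner(
    "restart the production database",
    "All active connections will be dropped.",
    "resolution steps 1-3 have failed",
)
print(banner)
# A wrapper script would show the banner, read the operator's input,
# and run the command only when confirm_destructive() returns True.
```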

Keep steps sequential

Do not write branching decision trees. Write a linear sequence. If the diagnosis branches, create separate runbooks for each branch and link between them.

Link to everything

Every tool reference should include a direct URL. Every credential reference should link to the secrets manager. Every dashboard mention should include the exact dashboard URL with relevant filters pre-applied.

Write for the newest team member

The person using this runbook might be a new hire on their first on-call shift. Do not assume knowledge of internal systems, shortcuts, or tribal knowledge.


Connecting Runbooks to Monitoring Alerts

The highest-value runbooks are the ones linked directly from alert notifications. When an engineer receives an alert, the runbook link should be one click away.

How to implement this

  1. For each monitoring alert, write a corresponding runbook
  2. Include the runbook URL in the alert notification template
  3. When the alert fires, the notification contains a direct link to the procedure

This eliminates the "where do I find the runbook?" problem entirely.

Most monitoring tools support custom text in alert messages. Add the runbook URL there.
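
As a rough illustration of the idea, independent of any particular monitoring tool: keep a mapping from alert name to runbook URL, and build the notification text from it. The alert names, the `runbooks.example.com` URLs, and both helper functions below are placeholders.

```python
# Hypothetical mapping from alert name to runbook URL.
RUNBOOK_URLS = {
    "db-connection-pool-exhausted":
        "https://runbooks.example.com/db-connection-pool-exhausted",
    "api-5xx-rate": "https://runbooks.example.com/api-5xx-rate",
}


def format_alert(alert_name: str, message: str, runbooks=RUNBOOK_URLS) -> str:
    """Return notification text with the runbook link on its own line."""
    url = runbooks.get(alert_name)
    link = f"Runbook: {url}" if url else "Runbook: none linked yet"
    return f"[{alert_name}] {message}\n{link}"


def alerts_without_runbooks(alert_names, runbooks=RUNBOOK_URLS):
    """List configured alerts that still have no runbook URL."""
    return [name for name in alert_names if name not in runbooks]


print(format_alert("db-connection-pool-exhausted",
                   "Database connection pool exhausted"))
print(alerts_without_runbooks(["api-5xx-rate", "ssl-expiring"]))
```

The second helper doubles as a coverage report: any alert it returns will page someone without telling them what to do.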

Which alerts need runbooks first?

Prioritize by frequency and impact:

  • Alerts that fire most often
  • Alerts that wake people up (critical/high severity)
  • Alerts where the resolution is not obvious
  • Alerts that new team members are likely to handle

You do not need a runbook for every alert on day one. Start with the top 5-10 and expand.
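
Choosing the first 5-10 can be as simple as sorting alert history. The sketch below ranks alerts that lack a runbook by severity first and firing count second. That ordering is an assumption, and the alert data is invented; adjust the weights to match how your team weighs frequency against impact.

```python
from dataclasses import dataclass

# Assumed weights; higher means more urgent.
SEVERITY_WEIGHT = {"critical": 3, "high": 2, "medium": 1, "low": 0}


@dataclass
class AlertStats:
    name: str
    fires_last_90_days: int
    severity: str
    has_runbook: bool


def runbook_backlog(alerts, limit=10):
    """Names of alerts that most need a runbook, most urgent first."""
    candidates = [a for a in alerts if not a.has_runbook]
    candidates.sort(
        key=lambda a: (SEVERITY_WEIGHT[a.severity], a.fires_last_90_days),
        reverse=True,
    )
    return [a.name for a in candidates[:limit]]


# Invented alert history for illustration.
alerts = [
    AlertStats("api-5xx-rate", 14, "critical", False),
    AlertStats("ssl-expiring", 2, "high", False),
    AlertStats("disk-usage-warning", 40, "medium", False),
    AlertStats("db-connection-pool-exhausted", 6, "critical", True),
    AlertStats("heartbeat-missed", 9, "high", False),
]
print(runbook_backlog(alerts, limit=3))
```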


Maintaining Runbooks Over Time

A runbook that was accurate six months ago may be dangerously wrong today. Infrastructure changes, services migrate, credentials rotate, and dashboards move.

Review triggers

Update runbooks when:

  • An incident reveals the runbook was incomplete or wrong
  • Infrastructure or architecture changes
  • A new team member follows the runbook and finds issues
  • The quarterly review comes due

Ownership

Assign each runbook to a team, not a person. Individuals leave; teams persist. The owning team is responsible for keeping the runbook current.

Version and date

Always include a "last verified" date. If a runbook has not been verified in 6+ months, treat it as potentially outdated and verify before trusting it in an incident.
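
The staleness rule is simple enough to automate. A minimal sketch, assuming each runbook's "last verified" date is available as a `date` and treating six months as 182 days; the runbook names and dates are made up.

```python
from datetime import date, timedelta

MAX_AGE = timedelta(days=182)  # roughly six months


def is_stale(last_verified: date, today: date, max_age: timedelta = MAX_AGE) -> bool:
    """True when the runbook has gone unverified for longer than `max_age`."""
    return today - last_verified > max_age


def stale_runbooks(last_verified_by_name: dict, today: date) -> list:
    """Names of runbooks that need re-verification, oldest first."""
    stale = [
        (verified, name)
        for name, verified in last_verified_by_name.items()
        if is_stale(verified, today)
    ]
    return [name for _, name in sorted(stale)]


# Invented example data.
verified = {
    "api-5xx-rate": date(2026, 1, 10),
    "ssl-expiring": date(2025, 9, 1),
    "heartbeat-missed": date(2025, 6, 15),
}
print(stale_runbooks(verified, today=date(2026, 3, 19)))
```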


How Webalert Helps

Webalert provides the monitoring signals that trigger runbook execution:

  • HTTP/HTTPS checks with configurable alert messages — include runbook URLs directly in notifications
  • Response time alerts for latency-based runbook triggers
  • Heartbeat monitoring for scheduled job runbooks
  • SSL and DNS alerts for certificate and infrastructure runbooks
  • Multi-channel notifications — deliver runbook links via Email, SMS, Slack, Discord, Teams, or webhooks
  • On-call scheduling — route alerts (with runbook links) to the right responder
  • Incident timelines — track whether runbook steps were followed and how long resolution took

Good monitoring fires the right alert. Good runbooks tell the responder exactly what to do next.

See features and pricing for details.


Summary

  • Runbooks turn panicked debugging into calm, step-by-step execution.
  • Structure every runbook consistently: symptoms, impact, diagnostics, resolution, escalation, post-incident.
  • Write exact commands with expected outputs, not vague descriptions.
  • Link runbooks directly from alert notifications so they are one click away.
  • Start with your top 5-10 most frequent or impactful alerts.
  • Maintain runbooks after every incident and on a quarterly review cycle.
  • Assign ownership to teams, not individuals.

The best on-call teams are not the ones that debug fastest. They are the ones that documented the fix last time.

