
On-call doesn't "break" teams because engineers aren't capable. It breaks teams when the system is missing fundamentals: unclear ownership, noisy alerts, no escalation, and no shared expectations.
The good news: you don't need a huge SRE org to run on-call well. You need a few simple guardrails that make incidents predictable, repeatable, and fair.
In this guide, you'll learn how to build an on-call setup that keeps customers safe and humans sane.
1) Define what "on-call" means for your team
Before you tune alerts or buy tools, set expectations.
- What counts as an incident? A customer-facing outage? Elevated error rates? Payment delays?
- What is the objective? "Restore service quickly" is not the same as "find root cause immediately."
- What's the response window? 10 minutes? 30 minutes? Business-hours only for some systems?
- Who owns what? If everything is "everyone's job," it becomes nobody's job at 03:12.
Write this down in a one-page "on-call contract." It prevents most confusion before it starts.
2) Use a tiered severity model (and actually use it)
A severity model is your shared language for urgency.
A simple version:
- SEV1: Major outage or critical customer impact. Immediate response + escalation.
- SEV2: Partial degradation. Rapid response; escalate if not mitigated quickly.
- SEV3: Minor issue. Acknowledge and create a task; fix in working hours.
- SEV4: Informational. No page. Useful for trends and diagnostics.
Rule of thumb: Only SEV1 and SEV2 should page by default. If everything pages, nothing is urgent—and your team learns to ignore alerts.
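If you want to make that rule hard to break, encode it once. Here is a minimal sketch in Python; `send_page`, `create_ticket`, and `log_for_trends` are hypothetical stand-ins for whatever paging, ticketing, and logging tools you actually use:

```python
# Minimal severity-routing sketch for the tiers above.
# send_page / create_ticket / log_for_trends are hypothetical stand-ins
# for your real paging, ticketing, and logging integrations.

PAGING_SEVERITIES = {"SEV1", "SEV2"}  # only these tiers wake someone up

def send_page(summary: str) -> None:
    print(f"PAGE: {summary}")

def create_ticket(summary: str) -> None:
    print(f"TICKET: {summary}")

def log_for_trends(summary: str) -> None:
    print(f"LOG: {summary}")

def route_alert(severity: str, summary: str) -> str:
    """Route an alert by severity: page, ticket, or just log it."""
    if severity in PAGING_SEVERITIES:
        send_page(summary)        # immediate human response
        return "paged"
    if severity == "SEV3":
        create_ticket(summary)    # acknowledge, fix in working hours
        return "ticketed"
    log_for_trends(summary)       # SEV4: informational, no page
    return "logged"
```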
3) Make pages actionable, not just "something is wrong"
A good alert answers three questions in the first 10 seconds:
- What is broken? ("API error rate above 5%")
- Who is impacted? ("EU region, checkout flows")
- What should I do first? ("Check latest deploy; roll back if started in last 30 minutes")
If your first move is "ask in chat what to do," the alert needs more context (or it should be a dashboard metric, not a pager).
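One way to enforce this is to make those three answers required fields on every paging alert. The sketch below uses an illustrative Python structure; the field names and the runbook URL are assumptions, not any particular tool's schema:

```python
from dataclasses import dataclass

@dataclass
class Page:
    """Forces every page to answer the three questions up front.
    Field names are illustrative, not a specific tool's schema."""
    what: str          # what is broken
    who: str           # who is impacted
    first_action: str  # what the responder should try first
    runbook_url: str   # deeper context if the first action doesn't work

checkout_errors = Page(
    what="API error rate above 5%",
    who="EU region, checkout flows",
    first_action="Check latest deploy; roll back if it started in the last 30 minutes",
    runbook_url="https://wiki.example.com/runbooks/checkout-errors",  # placeholder
)
```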
4) Add escalation paths so one person isn't trapped
When the page hits, responders need to know what happens next if they can't resolve quickly.
Define:
- Primary: first responder
- Secondary: backup who can take over or swarm
- Domain experts: database, payments, infra, etc.
- Decision owner: who can authorize rollback, traffic shifts, vendor escalation
Also define time-based escalation, for example:
- No mitigation in 10 minutes → notify secondary
- SEV1 persists 20 minutes → notify decision owner
- Vendor dependency suspected → start vendor escalation immediately
Escalation removes guilt and guesswork. Escalating isn't a personal failure; it's the system working as designed.
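If your tooling allows it, these time-based rules can live as data rather than tribal knowledge. A rough sketch, with `notify` standing in for your real paging integration:

```python
# Time-based escalation rules expressed as data, so they can be applied
# automatically instead of being a judgment call at 03:12.
# notify() is a hypothetical stand-in for your paging tool.

def notify(role: str, message: str) -> None:
    print(f"NOTIFY {role}: {message}")

# (minutes without mitigation, does the rule apply to this severity?, who to bring in)
ESCALATION_RULES = [
    (10, lambda sev: True,          "secondary"),
    (20, lambda sev: sev == "SEV1", "decision_owner"),
]

def check_escalation(severity: str, minutes_without_mitigation: int) -> None:
    """Notify every role whose time threshold has been crossed."""
    for threshold, applies, role in ESCALATION_RULES:
        if minutes_without_mitigation >= threshold and applies(severity):
            notify(role, f"{severity}: no mitigation after {threshold} min")
```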
5) Optimize for mitigation first (rollback > root cause)
Most teams accidentally optimize on-call for "find the root cause," which can prolong outages.
Instead, optimize for this sequence:
- Detect
- Mitigate
- Communicate
- Diagnose
- Prevent recurrence
Mitigation might be a rollback, feature flag off, traffic shift, rate limit, or safe-mode toggle. Restore service first, then investigate with a clear head.
If mitigation is slow today, invest there:
- Feature flags for risky paths
- One-command rollback with a checklist
- Safe-mode toggles for non-essential features
- Runbooks for common failure modes
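As a concrete example of the feature-flag item above, a kill switch for a non-essential path can be this small. The flag here is read from an environment variable purely for illustration; in practice it would come from your feature-flag service:

```python
import os

# Minimal kill-switch sketch for a risky, non-essential code path.
# Reading the flag from an environment variable is only for illustration.

def recommendations_enabled() -> bool:
    """Non-essential feature: on by default, instantly off during an incident."""
    return os.environ.get("FLAG_RECOMMENDATIONS", "on") == "on"

def render_checkout_page(cart: list[str]) -> str:
    page = f"Checkout: {len(cart)} items"
    if recommendations_enabled():
        page += " | You may also like: ..."  # the risky/expensive path
    return page

# During an incident: set FLAG_RECOMMENDATIONS=off (or flip the flag in your
# flag service) to shed the risky path without a rollback.
```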
6) Reduce alert fatigue with ownership and SLO thinking
Alert fatigue comes from alerts that are "interesting" but not "urgent."
To reduce noise:
- Assign an owner to each paging alert (a team or service). Unowned alerts rot.
- Page on user impact, not on every metric twitch (use SLO/error-budget thinking where possible).
- Deduplicate and group so one incident doesn't send 12 pages.
A simple weekly habit: review your top paging alerts and ask:
- Did this help us act faster?
- Did it fire when users were fine?
- Should it be downgraded from paging?
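That weekly review can be semi-automated from a paging export. The sketch below assumes a CSV with `alert_name`, `user_impact`, and `action_taken` columns; those names, and the "noisy" threshold, are assumptions to adapt:

```python
import csv
from collections import Counter

# Sketch of a weekly paging review. Assumes an export of last week's pages
# as a CSV with these columns (names are illustrative):
#   alert_name, user_impact ("yes"/"no"), action_taken ("yes"/"no")

def review_pages(path: str) -> None:
    pages = Counter()
    noisy = Counter()  # fired while users were fine, or needed no action
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            pages[row["alert_name"]] += 1
            if row["user_impact"] == "no" or row["action_taken"] == "no":
                noisy[row["alert_name"]] += 1

    for alert, total in pages.most_common():
        flag = "  <- consider downgrading from paging" if noisy[alert] >= total / 2 else ""
        print(f"{alert}: {total} pages, {noisy[alert]} noisy{flag}")

# review_pages("pages_last_week.csv")
```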
7) Treat communication as part of incident response
During an incident, silence creates chaos—internally and externally.
Set a lightweight comms loop:
- Internal: one channel/thread, one incident lead, updates every X minutes for SEV1
- External: status page updates with a clear "next update by" time (if relevant)
Good communication prevents duplicated work and builds trust.
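A small helper can keep internal updates consistent and always include the "next update by" time. The format and 15-minute cadence below are assumptions; match them to your on-call contract:

```python
from datetime import datetime, timedelta, timezone

# Sketch of a consistent internal update format for SEV1 incidents.
# The fields and the 15-minute cadence are assumptions, not a standard.

def incident_update(status: str, actions: str, cadence_min: int = 15) -> str:
    now = datetime.now(timezone.utc)
    next_update = now + timedelta(minutes=cadence_min)
    return (
        f"[{now:%H:%M} UTC] Status: {status}\n"
        f"Current actions: {actions}\n"
        f"Next update by: {next_update:%H:%M} UTC"
    )

print(incident_update(
    status="SEV1 - checkout errors in EU, mitigation in progress",
    actions="Rolling back latest deploy; payments on-call engaged",
))
```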
8) Keep postmortems small, blameless, and repeatable
Postmortems aren't about blame. They're about improving the system.
Keep them short:
- What happened?
- Impact (users, duration, revenue if relevant)
- Timeline (key events)
- What went well / what didn't
- Action items (owners + due dates)
Most importantly: every postmortem should produce at least one concrete prevention step. If postmortems don't lead to changes, people stop doing them.
9) Protect humans: rotations, recovery, and fairness
A sustainable on-call system includes human constraints.
- Rotation length: 1 week is common; shorter can work for small teams.
- Handoff: a 10-minute checklist at start/end of rotation.
- Recovery time: if someone gets paged at night, consider comp time or a late start.
- Fairness: track who gets paged and fix hot spots.
If on-call feels unfair, it will eventually become unstaffable.
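Fairness is much easier to fix once it's measured. A rough sketch, assuming a CSV export of pages with `responder` and ISO-8601 `timestamp` columns (illustrative names):

```python
import csv
from collections import Counter
from datetime import datetime

# Sketch for spotting on-call hot spots. Assumes a CSV export of pages with
# "responder" and "timestamp" (ISO 8601) columns -- names are illustrative.

NIGHT_HOURS = range(0, 7)  # hours counted as sleep-disrupting

def paging_load(path: str) -> None:
    total = Counter()
    nights = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            person = row["responder"]
            hour = datetime.fromisoformat(row["timestamp"]).hour
            total[person] += 1
            if hour in NIGHT_HOURS:
                nights[person] += 1
    for person, count in total.most_common():
        print(f"{person}: {count} pages, {nights[person]} overnight")

# paging_load("pages_this_quarter.csv")
```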
The "good on-call" checklist
If you only do five things this month, do these:
- Only page on SEV1/SEV2
- Make every page actionable
- Define escalation paths
- Invest in rollback/feature flags
- Write lightweight postmortems with owners
That's enough to move from chaos to control.
Final Thoughts
On-call is one of the fastest ways to learn where your system—and your process—needs strengthening. Done well, it's not a tax. It's a feedback loop that improves reliability while keeping people healthy.