
The worst possible time to invent a definition is mid-incident at 2am. "Is this a SEV1 or a SEV2?" should have an answer that takes five seconds, not a debate. Incident severity levels are that answer: a shared scale that tells everyone, instantly, how bad something is, who needs to wake up, and how fast to move. Get them right and your response is calm and proportionate. Get them wrong and you either over-page on trivia (burning out your team) or under-react to a real outage (burning customer trust).
This guide explains what severity levels are, a practical SEV1–SEV5 scale you can adopt, how to classify consistently, and the common mistakes that make severity schemes backfire.
What Incident Severity Levels Are
A severity level is a single label that captures the business impact of an incident — not its technical complexity. A one-line config typo that takes down checkout is a high-severity incident; a gnarly bug in an internal tool used by three people is not. Severity answers "how much does this hurt customers and the business right now?"
Severity drives almost everything downstream:
- Who gets paged and whether you wake people up.
- How fast you're expected to respond and how often you update stakeholders.
- Whether you open a war room, post to a public status page, and notify leadership.
- Whether a post-mortem is mandatory.
Two naming conventions dominate: SEV (SEV1 is the most severe, and higher numbers are less severe) and P, for priority (P1 is the most severe). As labels for incident impact they're interchangeable; pick one and use it everywhere. This guide uses SEV.
A Practical SEV1–SEV5 Scale
Most organizations land on three to five levels. More than five and nobody can tell them apart; fewer than three and you can't distinguish "the site is down" from "a button is misaligned." Here's a battle-tested five-level scale you can adapt:
| Level | Name | Impact | Example | Response |
|---|---|---|---|---|
| SEV1 | Critical | Full outage or critical business function down for many/all users; data loss; security breach | Checkout completely down; site returns 500s globally | All hands, page immediately, exec + status page, 24/7 until resolved |
| SEV2 | Major | Major feature broken or severe degradation; significant subset of users affected; no workaround | Login failing for 30% of users; severe latency | Page on-call, war room, frequent updates |
| SEV3 | Minor | Partial/degraded functionality with a workaround; limited user impact | One non-critical API slow; minor feature broken | Handle in business hours, normal priority |
| SEV4 | Low | Minor issue, cosmetic, or affecting very few users | UI glitch; typo; isolated edge case | Backlog / next sprint |
| SEV5 | Informational | No current user impact; early warning | Cert expiring in 20 days; disk at 70% | Track and schedule proactively |
The exact wording matters less than two things: each level has a concrete definition and at least one real example so classification isn't a judgment call at 2am.
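One way to keep definitions and examples from drifting apart is to store the scale as data that tooling and docs can both read. The sketch below is a minimal illustration, assuming the five-level table above; the names `Severity`, `LevelDefinition`, `SCALE`, and `describe` are hypothetical and not part of any particular incident tool.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Lower number = more severe, matching the SEV convention in this guide."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4
    SEV5 = 5


@dataclass(frozen=True)
class LevelDefinition:
    name: str
    impact: str    # the concrete definition
    example: str   # at least one real example per level


# Wording condensed from the table above; adapt it to your own organization.
SCALE = {
    Severity.SEV1: LevelDefinition(
        "Critical", "Full outage, data loss, or security breach",
        "Checkout completely down"),
    Severity.SEV2: LevelDefinition(
        "Major", "Major feature broken for many users, no workaround",
        "Login failing for 30% of users"),
    Severity.SEV3: LevelDefinition(
        "Minor", "Degraded functionality with a workaround",
        "One non-critical API slow"),
    Severity.SEV4: LevelDefinition(
        "Low", "Cosmetic issue or very few users affected",
        "UI glitch"),
    Severity.SEV5: LevelDefinition(
        "Informational", "No current user impact; early warning",
        "Cert expiring in 20 days"),
}


def describe(level: Severity) -> str:
    """Render a one-line reminder a responder can read in seconds."""
    d = SCALE[level]
    return f"{level.name} ({d.name}): {d.impact}. Example: {d.example}"
```

Because `Severity` is an `IntEnum`, comparisons such as `Severity.SEV1 < Severity.SEV2` follow the "lower is worse" rule directly.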
How to Define Severity for Your Org
Generic scales are a starting point; the value comes from tailoring them. Define severity along a few clear axes so anyone can classify quickly:
- Scope — how many users are affected? (all / a segment / a few / none)
- Functionality — is a critical business function impacted (revenue, auth, data integrity) or a peripheral one?
- Workaround — is there a viable workaround, or are users fully blocked?
- Data & security — any data loss, corruption, or security exposure escalates severity sharply, often straight to SEV1.
- Duration / trend — is it stable, recovering, or getting worse?
A useful tie-breaker rule: when in doubt, round up. It's cheaper to downgrade a SEV2 that turns out to be a SEV3 than to discover a "SEV3" was actually taking down revenue for an hour. You can always lower severity as you learn more.
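The axes and the round-up rule can be sketched as a small classification function. This is an illustrative example only: the `Impact` fields, the scope labels, and the thresholds are assumptions to tune for your organization, and the duration/trend axis is left out to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class Impact:
    scope: str                # "all", "segment", "few", or "none"
    critical_function: bool   # revenue, auth, or data integrity affected
    has_workaround: bool      # users can still get the job done another way
    data_or_security: bool    # data loss, corruption, or security exposure
    uncertain: bool = False   # responder is torn between two levels


def classify(impact: Impact) -> int:
    """Return a SEV number (1 = most severe). Thresholds are illustrative."""
    if impact.data_or_security:
        # Data and security issues escalate sharply, often straight to SEV1.
        return 1
    if impact.scope == "none":
        level = 5
    elif impact.scope == "all" and impact.critical_function:
        level = 1
    elif impact.scope in ("all", "segment") and not impact.has_workaround:
        level = 2
    elif impact.scope in ("all", "segment"):
        level = 3
    else:
        level = 4
    # Tie-breaker: when in doubt, round up (toward SEV1).
    if impact.uncertain and level > 1:
        level -= 1
    return level
```

For example, `classify(Impact("segment", True, False, False))` models "login failing for a segment of users with no workaround" and lands on SEV2, while setting `uncertain=True` on a would-be SEV3 bumps it to SEV2 until more is known.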
Tie each level explicitly to response expectations — target response time, update cadence, and who's notified — so the label does real work. A severity scheme that doesn't change behavior is just paperwork.
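A hedged sketch of what "tying each level to response expectations" might look like in configuration form follows. Every number and recipient name here is a placeholder assumption, not a recommendation from this guide; `ResponsePolicy` and `POLICIES` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional, Tuple


@dataclass(frozen=True)
class ResponsePolicy:
    page_immediately: bool             # wake someone up, or wait for business hours
    respond_within: timedelta          # target time to first response
    update_every: Optional[timedelta]  # None = no scheduled stakeholder updates
    notify: Tuple[str, ...]            # who hears about it


# Placeholder values: replace them with targets your team has agreed on.
POLICIES = {
    1: ResponsePolicy(True, timedelta(minutes=5), timedelta(minutes=30),
                      ("on-call", "leadership", "status-page")),
    2: ResponsePolicy(True, timedelta(minutes=15), timedelta(hours=1),
                      ("on-call", "war-room")),
    3: ResponsePolicy(False, timedelta(hours=8), None, ("owning-team",)),
    4: ResponsePolicy(False, timedelta(days=14), None, ("backlog",)),
    5: ResponsePolicy(False, timedelta(days=30), None, ("backlog",)),
}


def should_wake_someone(sev: int) -> bool:
    """The label does real work: it decides whether a human is paged now."""
    return POLICIES[sev].page_immediately
```

If two levels end up with identical policies, that is a hint they may not need to be separate levels.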
Severity vs Priority
These two are often conflated, and keeping them distinct sharpens decisions:
- Severity = how bad the impact is (objective, customer-facing).
- Priority = the order in which you'll work on it (which factors in severity plus effort, dependencies, and timing).
Usually they align — a SEV1 is also top priority. But not always: a SEV4 cosmetic bug on your pricing page right before a major launch might get bumped to high priority despite low severity. Tracking both prevents "everything is a SEV1" inflation while still letting business context drive what gets done first.
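To make the distinction concrete, here is a small sketch that stores severity and priority as separate values. The `business_boost` field is a hypothetical simplification standing in for effort, dependencies, and timing; real prioritization usually involves more judgment than a single integer.

```python
from dataclasses import dataclass


@dataclass
class WorkItem:
    title: str
    severity: int            # 1 (most severe) to 5, set from impact alone
    business_boost: int = 0  # hypothetical: levels to raise priority for context

    @property
    def priority(self) -> int:
        """Priority as a P-number (1 = work on first); capped at P1."""
        return max(1, self.severity - self.business_boost)


items = [
    WorkItem("Slow non-critical API", severity=3),
    WorkItem("Typo on pricing page before launch", severity=4, business_boost=2),
]

# Work order follows priority; the recorded severity stays untouched.
queue = sorted(items, key=lambda item: item.priority)
```

Here the pricing-page typo is worked on first even though its severity remains SEV4, so the incident record still reflects actual impact.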
Connecting Severity to Your Response
Severity is the trigger that wires the rest of your incident process together:
- Alerting & paging. Severity determines whether an alert pages a human at night or files a ticket for morning. Mapping the right things to the right severity is the single best defense against alert fatigue.
- Escalation. Each level should have an escalation policy: who's paged first, who's the backup, and when it escalates to leadership.
- On-call. Your on-call rotation defines who responds to each severity, and protecting responders from low-severity noise is key to running on-call without burnout.
- Runbooks. High-severity incidents should point to a response runbook so the responder isn't improvising.
- Communication. SEV1/SEV2 typically trigger public status updates and stakeholder notifications on a defined cadence.
- Post-incident. Major severities should mandate a blameless post-mortem and a recovery checklist.
This is also where SLOs and error budgets connect: a severity high enough to burn a meaningful chunk of your error budget is, by definition, worth a serious response.
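As a rough worked example of the error-budget connection, the sketch below estimates how much of a budget a single incident consumes. It assumes a time-based availability SLO over a 30-day window and treats a partial outage as proportional to the fraction of users affected; both are simplifying assumptions, and the function names are hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO such as 0.999."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_burned(outage_minutes: float, affected_fraction: float,
                  slo: float, window_days: int = 30) -> float:
    """Fraction of the window's error budget consumed by one incident."""
    budget = error_budget_minutes(slo, window_days)
    return outage_minutes * affected_fraction / budget


# A 99.9% SLO over 30 days allows about 43.2 minutes of downtime.
budget = error_budget_minutes(0.999)

# A 20-minute full outage burns roughly 46% of that budget.
burn = budget_burned(outage_minutes=20, affected_fraction=1.0, slo=0.999)
```

An incident that consumes nearly half a month's budget in 20 minutes is a strong signal that a high severity was warranted.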
Common Mistakes
- Severity inflation. When everything is a SEV1, nothing is — and responders stop trusting the pager. Reserve the top level for genuine critical impact.
- Too many levels. Six or seven levels that nobody can distinguish create classification paralysis. Three to five is the sweet spot.
- Vague definitions. "Major impact" means different things to different people at 2am. Anchor each level with concrete examples.
- Confusing severity with effort. A hard-to-fix bug isn't automatically high severity; severity is about impact, not difficulty.
- Set-and-forget classification. Severity should be re-evaluated as an incident evolves — start high, downgrade as scope narrows (or escalate if it spreads).
- No link to action. If the severity label doesn't change who's paged or how fast you respond, it's decoration.
How Webalert Helps
Accurate severity starts with accurate, fast detection — you can't classify what you don't know is broken:
- Outside-in monitoring across regions catches full and partial outages from the user's perspective, so you can judge real scope, not just internal health.
- Content validation flags "false green" failures (a `200 OK` returning the wrong page) that internal checks miss, so a real SEV1 isn't mistaken for healthy. See response body validation.
- Severity-aware alerting routes the right incidents to the right people, helping you page on real impact and avoid alert fatigue.
- Status pages to communicate the right message at the right severity, and history to inform the post-mortem.
Faster, more accurate detection means you classify correctly the first time — and respond proportionately.
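To illustrate the "false green" idea in general terms, here is a minimal sketch of a content check using only the Python standard library. It is not Webalert's API; `fetch`, `is_false_green`, and the marker string are hypothetical, and a real check would also handle timeouts, redirects, and retries.

```python
import urllib.error
import urllib.request
from typing import Tuple


def fetch(url: str, timeout: float = 10.0) -> Tuple[int, str]:
    """Return (status_code, body) for a URL; HTTP errors return an empty body."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        return err.code, ""


def is_false_green(status_code: int, body: str, expected_marker: str) -> bool:
    """True when the status looks healthy but the expected content is missing."""
    return 200 <= status_code < 300 and expected_marker not in body


# Example with canned responses (no network needed):
healthy = is_false_green(200, "<h1>Checkout</h1>", "Checkout")
broken = is_false_green(200, "<h1>Maintenance</h1>", "Checkout")
```

In this example `healthy` is `False` and `broken` is `True`: the second response returns a 2xx status but not the page users expect, which is exactly the case a status-only check would miss.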
Summary
Incident severity levels are a shared language for impact: a SEV1–SEV5 (or P1–P5) scale where each level has a concrete definition, a real example, and a clear set of response expectations. Define severity by scope, functionality, workaround, and data/security impact — not by how hard the fix is — and when in doubt, round up.
Wire severity into paging, escalation, on-call, runbooks, communication, and post-mortems so the label drives real behavior. Avoid severity inflation and vague definitions, keep it to three to five levels, and re-assess as incidents evolve. Do that, and "is this a SEV1 or a SEV2?" stops being a 2am debate and starts being a five-second decision.