
Your site went down. You fixed it. Everyone moved on.
Three months later, the exact same issue brings your service down again.
Sound familiar?
This is what happens when teams skip post-mortems — or write ones that gather dust in a forgotten Google Doc.
A good post-mortem isn't just documentation. It's a system for organizational learning. Done right, it prevents the same incident from happening twice.
In this guide, we'll cover how to write post-mortems that actually work — plus a ready-to-use template you can copy today.
What Is an Incident Post-Mortem?
A post-mortem (also called an incident review or retrospective) is a structured analysis of an incident after it's resolved.
The goals are simple:
- Understand what happened
- Identify the root cause
- Document the impact
- Create action items to prevent recurrence
- Share learnings across the team
Post-mortems aren't about assigning blame. They're about building systems that fail less often.
The Case for Blameless Post-Mortems
Here's a hard truth:
If people fear punishment, they'll hide information.
When engineers worry about being blamed, they:
- Minimize their involvement
- Omit important details
- Avoid speaking up about risks they noticed
This kills your ability to learn from incidents.
What "Blameless" Actually Means
Blameless doesn't mean "no accountability." It means:
- Focusing on systems and processes, not individuals
- Assuming people made reasonable decisions given what they knew at the time
- Asking "what failed?" instead of "who failed?"
- Creating an environment where honesty is rewarded
The engineers closest to the incident often have the best insights. A blameless culture ensures you actually hear them.
When Should You Write a Post-Mortem?
Not every hiccup needs a formal post-mortem. Here's when you should write one:
Always Write a Post-Mortem When:
- Customer-facing services were impacted
- The incident lasted longer than 30 minutes
- Data was lost or corrupted
- Revenue was directly affected
- The incident required multiple teams to resolve
- On-call engineers were paged outside business hours
Consider a Lighter Review When:
- The incident was caught before customer impact
- Recovery was quick and straightforward
- The root cause is already well understood
Pro Tip: Set Clear Thresholds
Define your triggers in advance. For example:
- Severity 1 (Critical): Full post-mortem within 48 hours
- Severity 2 (Major): Post-mortem within 1 week
- Severity 3 (Minor): Brief incident summary
Having predefined rules removes the "should we write one?" debate.
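If it helps, you can even write those thresholds down as data your incident tooling (or just your runbook) can read, so the rule is applied the same way every time. Here's a minimal sketch in Python, assuming the three severity levels above; the structure and names are hypothetical, not any particular tool's API:

```python
from datetime import timedelta

# Hypothetical severity-to-post-mortem policy; adapt names and deadlines to your process.
POSTMORTEM_POLICY = {
    "sev1": {"format": "full post-mortem",       "due_within": timedelta(hours=48)},
    "sev2": {"format": "full post-mortem",       "due_within": timedelta(days=7)},
    "sev3": {"format": "brief incident summary", "due_within": None},
}

def postmortem_requirement(severity: str) -> str:
    """Spell out what a given severity requires, so nobody debates it mid-incident."""
    policy = POSTMORTEM_POLICY[severity]
    if policy["due_within"] is None:
        return f"{severity}: {policy['format']}, no formal deadline"
    return f"{severity}: {policy['format']} due within {policy['due_within']}"

print(postmortem_requirement("sev1"))  # sev1: full post-mortem due within 2 days, 0:00:00
```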
The Post-Mortem Template
Here's a battle-tested template you can copy and adapt:
📋 Incident Post-Mortem Template
Incident Title: [Brief descriptive title]
Date of Incident: [YYYY-MM-DD]
Duration: [Start time] - [End time] ([X] hours/minutes)
Severity: [Critical / Major / Minor]
Author: [Name]
Post-Mortem Date: [YYYY-MM-DD]
1. Executive Summary
2-3 sentences describing what happened and the business impact.
On [date], [service/system] experienced [issue type] for [duration]. This resulted in [impact: e.g., "approximately 2,000 users unable to complete checkout"]. The root cause was [brief root cause].
2. Timeline
Chronological list of key events. Use UTC times.
| Time (UTC) | Event |
|---|---|
| 14:00 | First alert triggered (API latency > 5s) |
| 14:05 | On-call engineer acknowledged alert |
| 14:12 | Initial investigation identified database connection pool exhaustion |
| 14:25 | Decision made to restart application servers |
| 14:32 | Services restored, monitoring confirmed |
| 14:45 | Incident declared resolved |
3. Root Cause Analysis
What was the underlying cause? Go beyond the surface-level symptom.
Immediate Cause: The application ran out of database connections.
Underlying Cause: A recent code deployment introduced a connection leak in the user authentication module. Connections were opened but not properly closed on certain error paths.
Contributing Factors:
- No connection pool monitoring alerts
- Code review didn't catch the missing connection close
- Load testing didn't cover the error scenarios
4. Impact Assessment
Quantify the impact as specifically as possible.
| Metric | Value |
|---|---|
| Duration | 32 minutes |
| Users affected | ~2,000 |
| Failed transactions | 847 |
| Estimated revenue loss | $12,400 |
| Support tickets created | 23 |
| SLA breach | Yes (99.9% monthly target) |
5. What Went Well
Acknowledge what worked. This reinforces good practices.
- Alerting triggered within 2 minutes of the issue starting
- On-call response was fast (5 minutes to acknowledge)
- Team communication in Slack was clear and organized
- Status page was updated promptly
6. What Went Wrong
Be honest about failures. This is where learning happens.
- Connection leak wasn't caught in code review
- No automated test for connection cleanup
- Database connection pool had no alerting
- Initial diagnosis took 7 minutes due to unclear metrics
7. Action Items
Specific, assignable, measurable tasks. Each item needs an owner and deadline.
| Action | Owner | Deadline | Status |
|---|---|---|---|
| Add connection pool monitoring alert | @sarah | 2025-12-10 | To Do |
| Fix connection leak in auth module | @mike | 2025-12-08 | Done |
| Add integration test for connection cleanup | @sarah | 2025-12-15 | To Do |
| Update code review checklist for resource cleanup | @team-lead | 2025-12-12 | To Do |
| Document database connection best practices | @mike | 2025-12-20 | To Do |
8. Lessons Learned
Key takeaways for the broader team.
Monitor your connection pools — Database connections are a finite resource. Alert before exhaustion, not after.
Error paths need testing too — Happy-path testing isn't enough. Explicitly test failure scenarios.
Resource cleanup is a code review priority — Add it to your review checklist.
9. Follow-Up
Schedule a review to ensure action items are completed.
- Action item review date: 2025-12-22
- Responsible: @team-lead
Best Practices for Effective Post-Mortems
1. Write It While Memories Are Fresh
Schedule the post-mortem within 24-48 hours of resolution. Details fade quickly.
2. Involve the Right People
Include everyone who was involved in detection, response, and resolution. Their perspectives matter.
3. Stick to Facts, Not Opinions
"The deployment happened at 14:00" ✓ "The deployment was rushed" ✗ (opinion)
4. Use the "5 Whys" Technique
Keep asking "why" until you reach the true root cause:
- Why did the site go down? → Database connections exhausted
- Why were connections exhausted? → Connections weren't being released
- Why weren't they released? → Missing cleanup code in error handler
- Why was cleanup code missing? → Not caught in code review
- Why wasn't it caught? → No checklist item for resource cleanup
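To make that chain concrete, here's a minimal sketch of the leak, the fix, and the error-path test that would have caught it. It's illustrative only: the function names and query are invented for this example, and Python stands in for whatever language and database driver your team actually uses.

```python
import pytest
from unittest.mock import MagicMock

def authenticate_leaky(get_conn, user_id):
    """Buggy version: if the query raises, the connection is never released."""
    conn = get_conn()
    row = conn.execute(
        "SELECT active FROM users WHERE id = ?", (user_id,)
    ).fetchone()                     # an exception here skips conn.close()
    conn.close()
    return bool(row and row[0])

def authenticate_fixed(get_conn, user_id):
    """Fixed version: the connection is released on every path, including errors."""
    conn = get_conn()
    try:
        row = conn.execute(
            "SELECT active FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        return bool(row and row[0])
    finally:
        conn.close()                 # runs even when the query raises

def test_connection_released_when_query_fails():
    """The error-path regression test that was missing before the incident."""
    conn = MagicMock()
    conn.execute.side_effect = RuntimeError("simulated database error")
    with pytest.raises(RuntimeError):
        authenticate_fixed(lambda: conn, user_id=42)
    conn.close.assert_called_once()  # the leaky version fails this assertion
```

The same pattern applies to pooled connections: return them to the pool in a finally block, or use the context manager your driver provides.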
5. Make Action Items SMART
- Specific: "Add database connection alert" not "improve monitoring"
- Measurable: Clear definition of done
- Assignable: One owner per item
- Realistic: Achievable within the timeframe
- Time-bound: Specific deadline
6. Share Widely
Post-mortems lose value if they're hidden. Share across:
- Engineering team
- Product team
- Customer support (so they understand what happened)
- Leadership (summarized)
7. Track Completion
Action items without follow-up are just wishes. Schedule a review to verify completion.
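If you want to make that review harder to skip, the tracking itself can be a few lines of code. A minimal sketch, assuming a flat list of action items; the ActionItem structure and field names are hypothetical, not tied to any particular tracker:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    description: str
    owner: str           # exactly one owner per item
    deadline: date
    done: bool = False

def overdue_items(items, today=None):
    """Return open action items whose deadline has passed."""
    today = today or date.today()
    return [item for item in items if not item.done and item.deadline < today]

# Example data mirroring the action items table above.
items = [
    ActionItem("Add connection pool monitoring alert", "@sarah", date(2025, 12, 10)),
    ActionItem("Fix connection leak in auth module", "@mike", date(2025, 12, 8), done=True),
]

for item in overdue_items(items, today=date(2025, 12, 22)):
    print(f"OVERDUE: {item.description} (owner {item.owner}, due {item.deadline})")
```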
Common Post-Mortem Mistakes
❌ Assigning Blame
"John deployed the bad code" → "A deployment introduced a regression"
❌ Being Too Vague
"Improve monitoring" → "Add alert when connection pool exceeds 80% capacity"
❌ Skipping the Impact Section
Without quantified impact, it's hard to prioritize prevention efforts.
❌ No Action Items
A post-mortem without action items is just storytelling.
❌ Action Items Without Owners
If everyone owns it, no one owns it.
❌ Never Following Up
The action items need to actually get done. Schedule the review.
How Monitoring Prevents Future Incidents
Many incidents share a common thread: they could have been caught earlier.
Better monitoring means:
- Faster detection (minutes vs. hours)
- Smaller blast radius
- More context for debugging
- Proactive alerts before customer impact
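Using the connection-pool incident as an example, "proactive" can be as simple as a periodic threshold check that fires well before exhaustion. A minimal sketch; the pool and send_alert objects and the 80% threshold are illustrative stand-ins for whatever your stack provides:

```python
POOL_ALERT_THRESHOLD = 0.80  # alert well before the pool is actually exhausted

def check_pool_utilization(pool, send_alert):
    """Alert when connection pool utilization crosses the threshold."""
    in_use = pool.checked_out      # connections currently held by the application
    capacity = pool.max_size       # maximum connections the pool will open
    utilization = in_use / capacity
    if utilization >= POOL_ALERT_THRESHOLD:
        send_alert(
            f"DB connection pool at {utilization:.0%} ({in_use}/{capacity}); "
            "investigate before exhaustion"
        )
    return utilization
```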
The best post-mortem is the one you never have to write — because your monitoring caught the issue before it became an incident.
Final Thoughts
Incidents will happen. That's the nature of complex systems.
What separates great teams from struggling ones isn't avoiding all failures — it's learning from each failure so it doesn't happen again.
A well-written post-mortem is your best tool for that learning.
Start with the template above. Adapt it to your team's needs. And most importantly: actually complete the action items.
Ready to catch issues before they become incidents?
Start monitoring for free with Webalert — get instant alerts when your site goes down, so you can respond faster and write shorter post-mortems.