Incident Postmortem Template
A structured postmortem template for analyzing system incidents, identifying root causes, and preventing recurrence.
Overview
A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, and the follow-up actions to prevent recurrence. It follows a blameless culture approach.
When to Use
- After any significant production incident or outage
- When SLA/SLO targets are breached
- To document lessons learned for the team
- Required by compliance or regulatory standards
Template
# Incident Postmortem: [Incident Title]
## Metadata
- **Date**: YYYY-MM-DD
- **Severity**: P1 / P2 / P3 / P4
- **Duration**: HH:MM
- **Impact**: [Services affected, users impacted]
- **Reporter**: [Name]
- **Status**: Resolved
## Summary
[2-3 sentence description of what happened and its impact]
## Timeline (UTC)
| Time | Event |
|------|-------|
| 09:00 | Alert fired: database connection pool exhausted |
| 09:05 | Engineer acknowledged alert |
| 09:15 | Service temporarily scaled up |
| 09:45 | Root cause identified: connection leak in v2.3.1 |
| 10:00 | Hotfix deployed, incident resolved |
## Root Cause
[Detailed explanation of what caused the incident. Include code snippets, configuration, or architecture diagrams if relevant.]
## Impact Assessment
- **Users affected**: ~15,000
- **Error rate peak**: 42%
- **Revenue impact**: Minimal (degraded performance, not full outage)
## Resolution
[Steps taken to resolve the incident, including any workarounds.]
## Lessons Learned
### What Went Well
- Monitoring detected the issue within 5 minutes
- Runbook was up-to-date and effective
### What Went Wrong
- Connection leak was not caught in staging
- Rollback procedure was slower than expected
## Action Items
| Action | Owner | Due Date | Priority |
|--------|-------|----------|----------|
| Add connection pool monitoring | @sre-team | 2026-06-18 | High |
| Improve staging load tests | @backend-team | 2026-06-25 | Medium |
| Document faster rollback steps | @sre-team | 2026-06-15 | High |
Severity Levels
| Level | Description | Example |
|---|---|---|
| P1 | Critical outage | Complete service unavailability |
| P2 | Major degradation | Core features broken |
| P3 | Minor impact | Non-critical features affected |
| P4 | Informational | No user impact, potential risk |
Best Practices
- Blameless: Focus on system and process failures, not individuals
- Specific: Include exact times, metrics, and affected systems
- Actionable: Every postmortem must have follow-up action items
- Timely: Publish within 48 hours of incident resolution
- Shared: Distribute to all stakeholders and store centrally
Common Mistakes
- Blame-focused: Naming individuals instead of analyzing systems
- Vague timelines: Missing exact timestamps
- No action items: Documenting without preventing recurrence
Related Resources
Runbook Template
A reusable template for operational runbooks: incident response, deployment procedures, and routine tasks.
GuideCI/CD Pipeline Guide
A practical guide to building CI/CD pipelines with GitHub Actions, testing, deployment strategies, and rollback procedures.
GuideSecurity Best Practices Guide
A comprehensive guide to application security: authentication, authorization, input validation, secrets management, and common vulnerability prevention.