On-Call and Incident Response Playbook
A practical playbook for on-call engineers: triage, escalation, communication, and postmortems. Reduce MTTR and build a resilient incident response culture.
Introduction
Incidents are inevitable. What separates resilient teams from fragile ones is not the absence of failures, but the speed and quality of their response. This playbook provides a structured approach to handling production incidents — from the first alert to the postmortem.
The Incident Response Lifecycle
Detect → Triage → Mitigate → Resolve → Postmortem
↑                                          │
└────────── Monitor & Communicate ─────────┘
1. Detection
Alerting Principles
| Alert | Why It Matters | Threshold |
|---|---|---|
| Error rate spike | Users are seeing failures | > 0.1% of requests for 2 minutes |
| Latency p99 | Degraded user experience | > 500ms for 5 minutes |
| Saturation | Resource exhaustion approaching | CPU > 80%, memory > 85%, disk > 90% |
| Dependency failure | Downstream service is down | Health check fails 3 times |
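To make the "for N minutes" part of these thresholds concrete, here is a minimal sketch of a sustained-breach check. The function name `breached_for` and the one-sample-per-minute layout are assumptions made for illustration; they do not come from any particular monitoring tool.

```python
def breached_for(samples, threshold, minutes):
    """Return True if the last `minutes` samples all exceed `threshold`.

    Assumes one sample per minute, oldest first (an assumption of this sketch).
    """
    if minutes < 1:
        raise ValueError("minutes must be at least 1")
    if len(samples) < minutes:
        return False  # not enough history to decide yet
    return all(value > threshold for value in samples[-minutes:])


# Error rate as a fraction of requests; 0.001 is the 0.1% threshold from the table.
error_rates = [0.0004, 0.0015, 0.0021]
print(breached_for(error_rates, threshold=0.001, minutes=2))  # True
```

Requiring every sample in the window to breach the threshold is one way to avoid paging on a single bad minute; real alerting systems may define "sustained" differently.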
Alert Fatigue Is Real
If an alert fires and the on-call engineer does not take action, it is not an alert — it is noise. Remove or downgrade any alert whose false positive rate is above 80%.
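One way to apply the 80% rule during an alert review is sketched below. It assumes that a page where nobody took action counts as a false positive, and that you can export a count of pages fired and pages acted on per alert; the `history` structure and the alert names are hypothetical.

```python
def noisy_alerts(history, max_false_positive_rate=0.8):
    """Return names of alerts whose false positive rate exceeds the limit.

    `history` maps alert name -> (times_fired, times_actioned).
    A page that led to no action is treated as a false positive.
    """
    noisy = []
    for name, (fired, actioned) in history.items():
        if fired == 0:
            continue  # never fired, nothing to judge
        false_positive_rate = (fired - actioned) / fired
        if false_positive_rate > max_false_positive_rate:
            noisy.append(name)
    return sorted(noisy)


history = {
    "disk-usage-warning": (50, 4),   # 92% of pages needed no action
    "checkout-error-rate": (10, 9),  # 10% of pages needed no action
}
print(noisy_alerts(history))  # ['disk-usage-warning']
```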
2. Triage
The First-Minute Checklist
When paged, answer these questions in order:
- What is failing? — service name, endpoint, region
- Who is affected? — all users, a subset, internal only?
- When did it start? — exact time of first failure (check deployment logs)
- What changed? — any deploy, config change, or dependency shift?
- Is it getting worse? — trend of error rate over time
Severity Levels
| Severity | Definition | Response Time | Example |
|---|---|---|---|
| SEV-1 | Complete service outage or data loss | 15 minutes | Payment system down for all users |
| SEV-2 | Major functionality degraded | 30 minutes | Search returns empty for 50% of users |
| SEV-3 | Minor impact or workaround exists | 2 hours | Admin dashboard slow, API still fast |
| SEV-4 | No user impact, potential risk | Next business day | Log volume spike, no errors yet |
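The severity table can be encoded so that tooling and humans agree on the response target. The sketch below is one possible encoding; the three boolean inputs to `classify` are a simplification chosen for illustration, and real triage usually needs judgment beyond them.

```python
from datetime import timedelta

# Response targets copied from the severity table above.
RESPONSE_TARGETS = {
    "SEV-1": timedelta(minutes=15),
    "SEV-2": timedelta(minutes=30),
    "SEV-3": timedelta(hours=2),
    "SEV-4": None,  # next business day, so no fixed duration
}


def classify(full_outage_or_data_loss, major_degradation, user_impact):
    """Map three yes/no triage answers to a severity level, worst case first."""
    if full_outage_or_data_loss:
        return "SEV-1"
    if major_degradation:
        return "SEV-2"
    if user_impact:
        return "SEV-3"
    return "SEV-4"


severity = classify(False, True, True)
print(severity, RESPONSE_TARGETS[severity])  # SEV-2 0:30:00
```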
3. Mitigation
Stop the Bleeding First
Your first goal is not to fix the root cause — it is to restore service. Prefer rollback over forward-fix during an incident.
# Rollback a bad deployment
kubectl rollout undo deployment/api-service
# Flip a feature flag kill switch (turns checkout-v2 off)
curl -X POST "https://config-service/flags/checkout-v2" \
  -H "Content-Type: application/json" \
  -d '{"enabled": false}'
# Scale up to absorb load
kubectl scale deployment/api-service --replicas=20
Common Mitigation Tactics
| Problem | Fast Mitigation |
|---|---|
| Bad deployment | Rollback to last known good version |
| Traffic spike | Scale horizontally, enable rate limiting |
| Dependency failure | Enable circuit breaker, serve stale cache |
| Database overload | Kill slow queries, add read replicas |
| Configuration error | Revert config, restart with previous values |
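For the "enable circuit breaker, serve stale cache" row, the following is a minimal sketch of the idea, not a production implementation or any specific library's API. The class name, its parameters, and the injectable `clock` are all assumptions made for illustration.

```python
import time


class CircuitBreaker:
    """Open after `max_failures` consecutive failures; retry after `reset_after` seconds."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed
        self.stale = None      # last successful response, served on failure

    def call(self, fetch):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return self.stale  # circuit open: do not touch the dependency
            self.opened_at = None  # cool-down elapsed: allow one trial call
        try:
            result = fetch()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return self.stale
        self.failures = 0
        self.stale = result
        return result


def failing_dependency():
    raise RuntimeError("dependency is down")


breaker = CircuitBreaker(max_failures=2)
print(breaker.call(lambda: "fresh price list"))  # fresh price list
print(breaker.call(failing_dependency))          # fresh price list (stale copy)
```

While the circuit is open the dependency is not called at all, which gives it room to recover and keeps callers from stacking up on timeouts.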
4. Communication
Internal Status Updates
Post in your incident channel every 10 minutes:
[SEV-2] Checkout latency elevated
- Started: 14:32 UTC
- Impact: ~30% of checkout requests time out
- Cause: database connection pool exhausted after v2.4.1 deploy
- Mitigation: rolled back to v2.4.0 at 14:45, monitoring recovery
- ETA: 15:00 UTC if trend holds
- Commander: @alice
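If updates are posted by a bot or a script, a small template keeps every update in the same shape. This is only a sketch: the `StatusUpdate` class and its field names are hypothetical and simply mirror the fields in the example above.

```python
from dataclasses import dataclass


@dataclass
class StatusUpdate:
    severity: str
    title: str
    started: str
    impact: str
    cause: str
    mitigation: str
    eta: str
    commander: str

    def render(self):
        """Render the update in the same layout as the example above."""
        return "\n".join([
            f"[{self.severity}] {self.title}",
            f"- Started: {self.started}",
            f"- Impact: {self.impact}",
            f"- Cause: {self.cause}",
            f"- Mitigation: {self.mitigation}",
            f"- ETA: {self.eta}",
            f"- Commander: {self.commander}",
        ])


update = StatusUpdate(
    severity="SEV-2",
    title="Checkout latency elevated",
    started="14:32 UTC",
    impact="~30% of checkout requests time out",
    cause="database connection pool exhausted after v2.4.1 deploy",
    mitigation="rolled back to v2.4.0 at 14:45, monitoring recovery",
    eta="15:00 UTC if trend holds",
    commander="@alice",
)
print(update.render())
```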
External Communication
| Severity | External Notice? | Who |
|---|---|---|
| SEV-1 | Yes, immediate | Customer support + status page |
| SEV-2 | Yes, if > 30 min | Customer support + status page |
| SEV-3 | No, unless asked | Internal only |
| SEV-4 | No | Internal only |
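The table above reduces to a small decision function. The sketch below is one way to write it; the function name and the `customer_asked` flag (covering the "unless asked" case for SEV-3) are assumptions made for illustration.

```python
def needs_external_notice(severity, minutes_elapsed, customer_asked=False):
    """Decide whether to notify customers, following the table above."""
    if severity == "SEV-1":
        return True  # immediate notice
    if severity == "SEV-2":
        return minutes_elapsed > 30
    if severity == "SEV-3":
        return customer_asked  # internal only unless someone asks
    return False  # SEV-4 stays internal


print(needs_external_notice("SEV-2", minutes_elapsed=45))  # True
print(needs_external_notice("SEV-3", minutes_elapsed=45))  # False
```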
Blameless Communication Rules
- Do not name individuals as causes
- Do not use “human error” as a root cause
- Focus on what happened, what was done, and what is next
5. Resolution
Definition of Resolved
An incident is resolved when:
- Error rates return to baseline for 10 minutes
- All mitigations are stable
- No new symptoms have appeared
- The incident commander declares “all clear”
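The four conditions above can be checked together before anyone closes the incident. The following sketch assumes one error-rate sample per minute and treats "at baseline" as "at or below a baseline value you supply"; both are assumptions, as is the function name.

```python
def is_resolved(error_rates, baseline, mitigations_stable,
                new_symptoms, all_clear, minutes=10):
    """Return True only when all four resolution criteria hold.

    `error_rates` holds one sample per minute, oldest first.
    """
    recent = error_rates[-minutes:]
    at_baseline = len(recent) >= minutes and all(r <= baseline for r in recent)
    return at_baseline and mitigations_stable and not new_symptoms and all_clear


quiet = [0.0005] * 10
print(is_resolved(quiet, baseline=0.001, mitigations_stable=True,
                  new_symptoms=False, all_clear=True))   # True
print(is_resolved(quiet, baseline=0.001, mitigations_stable=True,
                  new_symptoms=False, all_clear=False))  # False
```

The second call shows that healthy metrics alone are not enough: the incident commander's declaration is still required.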
After All Clear
- Stop the clock (log total incident duration)
- Schedule the postmortem within 24 hours for SEV-1 and SEV-2 incidents
- Create follow-up tickets with owners and due dates
- Update runbooks with anything learned
6. Postmortem
The Five Whys
Ask “why” recursively until you reach a systemic issue, not a symptom.
Problem: Payment API returned 500 errors for 20 minutes.
Why? → Database connection pool was exhausted.
Why? → v2.4.1 increased the default pool size, but its new retry logic did not close connections.
Why? → The change was not tested under load.
Why? → Load tests do not cover the checkout flow.
Why? → Load test scenarios were last updated 6 months ago.
Action: Add the checkout flow to weekly load tests; require a passing load test in CI.
Postmortem Template
# Postmortem: [Incident Name] ([SEV-X])
## Summary
- Date: 2024-06-12
- Duration: 23 minutes
- Impact: 12% of checkout attempts failed
## Timeline
- 14:32 — First alert: error rate spike on /api/checkout
- 14:35 — On-call acknowledged
- 14:40 — Identified connection pool exhaustion
- 14:45 — Rolled back to v2.4.0
- 14:55 — Error rates returned to baseline
## Root Cause
v2.4.1 introduced a retry loop that leaked database connections.
## What Went Well
- Rollback completed in under 5 minutes
- Monitoring clearly pointed to connection pool exhaustion
## What Went Wrong
- Load tests did not cover the new retry logic
- No connection leak detection in staging
## Action Items
| Action | Owner | Due Date |
|--------|-------|----------|
| Add checkout flow to load tests | @bob | 2024-06-19 |
| Add connection leak alert | @alice | 2024-06-15 |
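The duration in the summary should follow from the timeline, so it is worth checking the arithmetic. The sketch below derives the numbers from the timestamps in the template; the helper name `minutes_between` is hypothetical, and it assumes both times fall on the same day.

```python
from datetime import datetime


def minutes_between(start, end):
    """Whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


print(minutes_between("14:32", "14:35"))  # 3  (time to acknowledge)
print(minutes_between("14:32", "14:45"))  # 13 (time to mitigate)
print(minutes_between("14:32", "14:55"))  # 23 (total duration, matches the summary)
```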
Best Practices
- Rotate on-call fairly — no one should be on-call more than 1 week in 4
- Compensate for off-hours — pay extra or give time off in lieu
- Shadow on-call — new engineers shadow for 2-4 weeks before taking the pager
- Automate runbooks — if a runbook step is manual, add it to your automation backlog
- Review alerts quarterly — remove noise, tune thresholds, fix flapping alerts
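The "1 week in 4" rule implies a rotation of at least four people. A plain round-robin schedule, sketched below, is one simple way to satisfy it; the function name and the engineer names are placeholders.

```python
def build_rotation(engineers, weeks):
    """Assign one engineer per week in round-robin order.

    With at least 4 engineers, nobody is on call more than 1 week in 4.
    """
    if len(engineers) < 4:
        raise ValueError("need at least 4 engineers to keep each at 1 week in 4")
    return [engineers[week % len(engineers)] for week in range(weeks)]


rotation = build_rotation(["alice", "bob", "carol", "dave"], weeks=6)
print(rotation)  # ['alice', 'bob', 'carol', 'dave', 'alice', 'bob']
```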
Common Mistakes
- Skipping postmortems because “we are too busy”
- Blaming individuals instead of fixing systems
- Forward-fixing during an incident instead of rolling back
- Communicating too late to customers
- Not having a secondary on-call for escalation
- Keeping the same person on-call for several weeks in a row
Frequently Asked Questions
What if I do not know how to fix the issue?
That is expected. Your job is to contain the impact and find the right person — not to know every system. Escalate early and clearly. A 5-minute escalation is better than a 30-minute solo struggle.
How do I balance incident response with feature work?
Incidents are unplanned work. Track them. If a team spends > 20% of sprint capacity on incidents, that is a signal to invest in reliability (tests, automation, refactoring) rather than new features.
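As a worked example of the 20% signal, the sketch below compares tracked incident hours against sprint capacity. The function name and the hour figures are made up for illustration.

```python
def should_invest_in_reliability(incident_hours, sprint_capacity_hours, limit=0.20):
    """Return True when incidents consume more than `limit` of sprint capacity."""
    if sprint_capacity_hours <= 0:
        raise ValueError("sprint capacity must be positive")
    return incident_hours / sprint_capacity_hours > limit


# 70 of 300 hours is about 23%, which is over the 20% line.
print(should_invest_in_reliability(70, 300))  # True
# 30 of 300 hours is 10%.
print(should_invest_in_reliability(30, 300))  # False
```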
Should junior engineers be on-call?
Yes, with mentorship. Shadowing senior engineers during incidents is one of the fastest ways to learn how systems fail. Start junior engineers on low-severity rotations and pair each of them with a senior engineer for the first month.