intermediate By StackPractices

Incident Response — Structured Handling for Production Outages

A practical guide to incident response: declaring incidents, building an incident command structure, communication protocols, and reducing mean time to resolution with structured processes.

Note: This guide follows English-language naming conventions and terminology standards common in international development teams. Examples use English identifiers and comments to maximize compatibility across codebases and tooling.

Overview

Incident response is the structured process of reacting to unplanned service disruptions. Without structure, incidents devolve into chaos: too many people talking, no clear decision-maker, and unclear communication to stakeholders. A defined response process reduces mean time to resolution (MTTR), minimizes customer impact, and reduces stress on responders.

This guide covers incident declaration, roles, communication, and resolution workflows.

When to Use

  • You experience production outages that lack clear ownership
  • Multiple engineers jump into incidents without coordination
  • Communication to stakeholders during outages is inconsistent or missing
  • Your MTTR is trending upward or exceeds your SLO
  • You want to practice and improve response capabilities proactively

Core Concepts

| Concept | Description |
| --- | --- |
| Incident | An unplanned disruption or degradation of service |
| Incident Commander (IC) | Single decision-maker who coordinates response |
| Severity | Impact classification (Sev1 = critical, Sev4 = minor) |
| MTTR | Mean Time To Resolution: the average time from the start of an incident to its resolution |
| Communication Lead | Person responsible for stakeholder updates |
| Postmortem | Blameless review after incident resolution |

Incident Severity Classification

| Severity | Criteria | Response | Communication |
| --- | --- | --- | --- |
| Sev1 | Complete outage, revenue stopped, data loss | All hands, war room | Executive notification, status page, customer comms |
| Sev2 | Major degradation, core feature broken | On-call team + backup | Status page, internal channels |
| Sev3 | Partial impact, workaround available | Primary on-call | Internal ticket, no external comms |
| Sev4 | Minor issue, minimal user impact | Best effort | Track in ticket, no urgency |
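
The severity table above can be turned into a small triage helper. The following is a minimal sketch, not a definitive rule set: the `Impact` fields and the "highest matching severity wins" ordering are assumptions made for illustration, and your own criteria may need more signals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Impact:
    """Hypothetical triage signals; adapt the fields to your own criteria."""
    complete_outage: bool = False
    revenue_stopped: bool = False
    data_loss: bool = False
    core_feature_broken: bool = False
    partial_impact: bool = False


def classify_severity(impact: Impact) -> str:
    """Return the highest severity whose criteria match the observed impact."""
    if impact.complete_outage or impact.revenue_stopped or impact.data_loss:
        return "Sev1"
    if impact.core_feature_broken:
        return "Sev2"
    if impact.partial_impact:
        return "Sev3"
    # Nothing above matched: treat as a minor issue.
    return "Sev4"
```

Checking the most severe criteria first means an incident that matches several rows is never under-classified, which fits the "when in doubt, declare" principle.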

Step-by-Step Incident Response

1. Detect and Declare

Recognize when an alert becomes an incident:

## Incident Declaration Checklist

- [ ] Alert received and acknowledged
- [ ] Initial triage confirms user impact
- [ ] Severity assessed (Sev1-4)
- [ ] Incident Commander assigned
- [ ] Incident channel created (e.g., #incident-2024-001)
- [ ] Status page updated (Sev1/Sev2)
- [ ] Stakeholders notified (Sev1)

Declaration principles:

  • When in doubt, declare. Downgrading is easier than catching up.
  • Sev1 incidents get an Incident Commander immediately.
  • Create a dedicated channel for every Sev1/Sev2 incident.
  • Log start time, trigger, and initial assessment.
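
As an illustration of the declaration principles, the sketch below records the start time, trigger, and initial assessment, and derives a channel name in the style of `#incident-2024-001`. The `Incident` class, the `declare_incident` function, and the sequence-number scheme are hypothetical; creating the actual chat channel is left to your own tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Incident:
    channel: str
    severity: str
    trigger: str
    initial_assessment: str
    started_at: datetime
    commander: Optional[str] = None


def declare_incident(sequence: int, severity: str, trigger: str,
                     initial_assessment: str,
                     started_at: Optional[datetime] = None,
                     commander: Optional[str] = None) -> Incident:
    """Create the incident record; the caller supplies the sequence number."""
    started_at = started_at or datetime.now(timezone.utc)
    # Sev1 incidents get an Incident Commander immediately.
    if severity == "Sev1" and commander is None:
        raise ValueError("Sev1 incidents need an Incident Commander at declaration")
    channel = f"#incident-{started_at.year}-{sequence:03d}"
    return Incident(channel, severity, trigger, initial_assessment,
                    started_at, commander)
```

For example, `declare_incident(1, "Sev1", "checkout error-rate alert", "payments failing", commander="alice")` yields a record whose channel uses the current year and a zero-padded sequence number.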

2. Assign Roles

Clear roles prevent chaos:

| Role | Responsibilities | Required For |
| --- | --- | --- |
| Incident Commander | Makes all decisions, assigns tasks, controls scope | Sev1, Sev2 |
| Technical Lead | Investigates root cause, proposes fixes | Sev1, Sev2 |
| Communication Lead | Writes status updates, manages stakeholder comms | Sev1 |
| Scribe | Documents timeline, actions, and decisions | Sev1 |
| Responder | Executes tasks assigned by IC | All |

## Incident Command Structure

                   Incident Commander
                            │
           ┌────────────────┼────────────────┐
           │                │                │
       Technical      Communication       Scribe
         Lead             Lead
           │
      Responders

Role best practices:

  • IC does not investigate directly; they coordinate
  • Only the IC, or the Communication Lead acting on the IC's behalf, speaks for the incident team to stakeholders
  • Rotate the IC once the current person has been on for more than 2 hours
  • Scribe timestamps every major action and decision
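
The role requirements and the two-hour rotation rule can be checked mechanically. This is a minimal sketch: the `REQUIRED_ROLES` mapping is transcribed from the roles table, while the function names and the `assignments` dictionary shape are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

# Roles required per severity, transcribed from the roles table.
REQUIRED_ROLES = {
    "Sev1": ["Incident Commander", "Technical Lead", "Communication Lead",
             "Scribe", "Responder"],
    "Sev2": ["Incident Commander", "Technical Lead", "Responder"],
    "Sev3": ["Responder"],
    "Sev4": ["Responder"],
}

# Rotate the IC after more than 2 hours on shift.
MAX_IC_SHIFT = timedelta(hours=2)


def missing_roles(severity: str, assignments: dict) -> list:
    """Return required roles that nobody has been assigned to yet."""
    return [role for role in REQUIRED_ROLES[severity]
            if not assignments.get(role)]


def ic_needs_rotation(shift_started: datetime, now: datetime) -> bool:
    """True once the current IC has been on for more than MAX_IC_SHIFT."""
    return now - shift_started > MAX_IC_SHIFT
```

A bot in the incident channel could call `missing_roles` right after declaration and `ic_needs_rotation` on a timer, but how you surface the result is up to your tooling.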

3. Communicate Effectively

Communication is as important as technical response:

| Audience | Channel | Frequency | Content |
| --- | --- | --- | --- |
| Response team | Incident channel | Continuous | Status, hypotheses, actions |
| Internal stakeholders | #incidents or Slack | Every 15-30 min (Sev1) | Impact, ETA, what we know |
| Executives | Email/Slack DM | Every 30-60 min (Sev1) | Business impact, recovery plan |
| Customers | Status page | Every 15-30 min (Sev1/2) | What is affected, ETA, workarounds |

## Status Update Template

**Incident:** #incident-2024-001
**Severity:** Sev1
**Started:** 14:30 UTC
**Status:** [Investigating / Identified / Monitoring / Resolved]

**Impact:** [What is broken and who is affected]
**What we know:** [Current understanding of root cause]
**What we are doing:** [Active remediation steps]
**ETA:** [Estimated time to resolution or next update]
**Workaround:** [Any available workaround for users]

Next update: 15:00 UTC

Communication principles:

  • Under-promise and over-deliver on ETAs
  • Do not speculate on root cause until confident
  • Update even if nothing has changed (“still investigating”)
  • Close the loop: notify when resolved, then follow up with postmortem timeline
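
Filling in the status update template by hand under pressure is error-prone, so it may help to render it from structured fields. The sketch below is one possible approach: `render_status_update` and its parameters are hypothetical, timestamps are assumed to already be in UTC, and the 30-minute default interval is the upper bound of the Sev1 cadence in the table.

```python
from datetime import datetime, timedelta, timezone

ALLOWED_STATUSES = {"Investigating", "Identified", "Monitoring", "Resolved"}


def render_status_update(channel, severity, started_at, status, impact,
                         known, doing, eta, workaround, now,
                         interval=timedelta(minutes=30)):
    """Render the status update template and compute the next update time."""
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"unknown status: {status}")
    next_update = now + interval
    return "\n".join([
        f"**Incident:** {channel}",
        f"**Severity:** {severity}",
        f"**Started:** {started_at:%H:%M} UTC",
        f"**Status:** {status}",
        "",
        f"**Impact:** {impact}",
        f"**What we know:** {known}",
        f"**What we are doing:** {doing}",
        f"**ETA:** {eta}",
        f"**Workaround:** {workaround}",
        "",
        f"Next update: {next_update:%H:%M} UTC",
    ])
```

Committing to a computed "Next update" time in every message supports the principle of updating even when nothing has changed.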

4. Investigate and Mitigate

Structured technical response:

## Investigation Steps

1. **Confirm scope:** What is broken? For whom? Since when?
2. **Identify changes:** What deployed recently? Any config changes?
3. **Check dependencies:** Are downstream services healthy?
4. **Review logs and metrics:** Find the first error, the spike, the divergence
5. **Form hypothesis:** What is the most likely cause?
6. **Test hypothesis:** Can you reproduce or validate the theory?
7. **Implement fix:** Rollback, config change, scale up, patch
8. **Verify recovery:** Confirm metrics return to normal, user reports resolved
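
Step 2 of the investigation ("Identify changes") is often the fastest path to a hypothesis. The sketch below filters a change log down to changes applied shortly before impact began. The `Change` record, the `suspect_changes` function, and the two-hour lookback window are assumptions for illustration; in practice the data would come from your own deploy and config tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Change:
    kind: str          # e.g. "deploy" or "config"
    description: str
    applied_at: datetime


def suspect_changes(changes, impact_started, lookback=timedelta(hours=2)):
    """Return changes applied within `lookback` before impact, newest first."""
    window_start = impact_started - lookback
    candidates = [c for c in changes
                  if window_start <= c.applied_at <= impact_started]
    # The most recent change is usually the first one worth checking.
    return sorted(candidates, key=lambda c: c.applied_at, reverse=True)
```

An empty result does not rule out a change-related cause (a slow-burning change can take longer than the window to show impact), so widen the lookback before discarding the hypothesis.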

Mitigation strategies:

| Strategy | When to Use | Risk |
| --- | --- | --- |
| Rollback | Recent deployment caused issue | Low, if tested |
| Feature flag disable | Specific feature is broken | Very low |
| Scale up | Capacity exhaustion | Low, but may mask root cause |
| Circuit breaker | Dependency is failing | Low, degrades functionality |
| Traffic shift | Regional or deployment issue | Medium, requires prep |
| Manual intervention | Data corruption, complex state | High, requires expertise |
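
When several strategies apply, trying the lowest-risk one first is a reasonable default. The sketch below encodes the mitigation table and ranks the matching strategies by risk. The condition keys (such as `capacity_exhaustion`) and the `rank_mitigations` function are hypothetical labels, and the ranking is an assumption rather than a rule stated by this guide.

```python
from typing import NamedTuple

RISK_ORDER = {"very low": 0, "low": 1, "medium": 2, "high": 3}


class Mitigation(NamedTuple):
    condition: str   # hypothetical label for the "When to Use" column
    strategy: str
    risk: str


# Transcribed from the mitigation strategies table.
MITIGATIONS = [
    Mitigation("recent_deployment", "Rollback", "low"),
    Mitigation("specific_feature_broken", "Feature flag disable", "very low"),
    Mitigation("capacity_exhaustion", "Scale up", "low"),
    Mitigation("dependency_failing", "Circuit breaker", "low"),
    Mitigation("regional_issue", "Traffic shift", "medium"),
    Mitigation("data_corruption", "Manual intervention", "high"),
]


def rank_mitigations(observed_conditions):
    """Return applicable strategies, lowest risk first (table order on ties)."""
    matches = [m for m in MITIGATIONS if m.condition in observed_conditions]
    return sorted(matches, key=lambda m: RISK_ORDER[m.risk])
```

The output is a suggestion for the Incident Commander, who still makes the final call on which mitigation to attempt.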

5. Resolve and Close

Formalize the end of an incident:

## Resolution Checklist

- [ ] Service fully restored and verified
- [ ] Monitoring shows green for 15+ minutes
- [ ] Status page updated to "Resolved"
- [ ] Final communication sent to stakeholders
- [ ] Scribe has complete timeline documented
- [ ] Postmortem scheduled within 48 hours
- [ ] Incident formally closed in tracking system

Resolution principles:

  • Do not close until you have monitoring confirmation
  • Keep the incident channel open for 24 hours for follow-up questions
  • Schedule postmortem before memory fades
  • Track MTTR and incident frequency as operational metrics
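
Two of the points above are easy to compute: MTTR as the average of resolution durations, and the "green for 15+ minutes" check before closing. The following is a minimal sketch; the function names and the input shapes (pairs of timestamps, and health samples ordered oldest first) are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def mean_time_to_resolution(incidents):
    """Average of (resolved_at - started_at) over a list of timestamp pairs."""
    if not incidents:
        raise ValueError("need at least one resolved incident")
    total = sum((resolved - started for started, resolved in incidents),
                timedelta())
    return total / len(incidents)


def green_long_enough(checks, now, required=timedelta(minutes=15)):
    """True if monitoring has been continuously healthy for `required`.

    `checks` is a list of (timestamp, healthy) samples, oldest first.
    """
    green_since = None
    for timestamp, healthy in checks:
        if not healthy:
            green_since = None          # any red sample resets the clock
        elif green_since is None:
            green_since = timestamp     # start of the current green streak
    return green_since is not None and now - green_since >= required
```

Resetting the clock on any unhealthy sample is deliberately strict, since marking an incident resolved too early tends to lead to re-opened incidents.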

Best Practices

  • Practice before you need it. Run game days and chaos engineering exercises.
  • Start with mitigation, not root cause. Fix the user impact first; investigate after.
  • One incident commander. Decision authority must be clear and singular.
  • Communicate early and often. Silence during an incident creates panic.
  • Document everything. The scribe’s notes are the foundation of the postmortem.
  • Learn from every incident. If the same incident happens twice, your process is broken.

Common Mistakes

  • No clear IC. Multiple people giving orders creates confusion and delay.
  • Skipping communication. Stakeholders make their own (often wrong) assumptions.
  • Chasing root cause before mitigating. Users do not care why it broke; they care that it works.
  • Forgetting to verify. Marking resolved too early leads to re-opened incidents.
  • No follow-up. Incidents without postmortems are wasted learning opportunities.

Variants

  • Automated incident response: Auto-remediation runbooks triggered by alerts
  • Follow-the-sun response: Regional teams hand off incidents across time zones
  • External dependency incidents: Pre-defined escalation to third-party vendors
  • Security incident response: Separate playbook for breaches and data exposure
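
To make the automated incident response variant more concrete, the sketch below maps alert names to remediation functions and escalates to a human when no runbook exists. Everything here is hypothetical: the `RUNBOOKS` registry, the `runbook` decorator, the alert dictionary shape, and the `disk_usage_high` alert name. The remediation itself is a placeholder that only describes what it would do.

```python
from typing import Callable, Dict

# Hypothetical registry: alert name -> remediation function.
RUNBOOKS: Dict[str, Callable[[dict], str]] = {}


def runbook(alert_name: str):
    """Decorator that registers a function as the runbook for an alert."""
    def register(func):
        RUNBOOKS[alert_name] = func
        return func
    return register


@runbook("disk_usage_high")
def clean_temp_files(alert: dict) -> str:
    # Placeholder: a real runbook would call your own tooling here.
    return f"would clean temp files on {alert['host']}"


def handle_alert(alert: dict) -> str:
    """Run the matching runbook, or escalate when none is registered."""
    action = RUNBOOKS.get(alert["name"])
    if action is None:
        return "escalate: no runbook, page the on-call"
    return action(alert)
```

Keeping the "no runbook" path as an explicit escalation means automation never silently swallows an alert it does not understand.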

FAQ

Q: When should I declare an incident vs. handle it as a normal alert?
Declare when user-impacting symptoms are confirmed and the standard alert response is insufficient. When in doubt, declare.

Q: Who should be Incident Commander?
A senior engineer who is available and not actively debugging. The IC coordinates; they do not investigate.

Q: How do I run an effective postmortem?
Schedule it within 48 hours and focus on process and system improvements, not blame. See the Postmortem Guide.

Q: What if we cannot find the root cause?
That is okay. Document what you know, what you tried, and what you will monitor. Some incidents remain partially unexplained.

Conclusion

Incident response is a team sport with clear rules. By declaring early, assigning roles, communicating relentlessly, and focusing on mitigation before investigation, you turn chaotic outages into structured, learnable events.