Level: Intermediate. By StackPractices

Alert Management — On-Call Alerting Best Practices

A practical guide to alert management: reducing alert fatigue, defining severity levels, escalation policies, on-call rotation design, and building a sustainable alerting culture.

Note: This guide follows English-language naming conventions and terminology standards common in international development teams. Examples use English identifiers and comments to maximize compatibility across codebases and tooling.

Overview

Alerting is how your systems tell you something needs attention. Done poorly, it creates noise, burnout, and slower incident response. Done well, it gives the right person the right information at the right time so they can act decisively.

This guide covers alert design, severity classification, on-call structures, escalation policies, and sustainable operational practices.

When to Use

  • Your team receives more than 5 alerts per person per week
  • Alerts are frequently ignored or treated as noise
  • Critical alerts are missed due to volume
  • You are establishing or redesigning an on-call rotation
  • Alert fatigue is causing burnout or attrition

Core Concepts

| Concept | Description |
| --- | --- |
| Alert Fatigue | Desensitization caused by too many low-value alerts |
| Severity | Classification of alert urgency (critical, high, medium, low, info) |
| Escalation | Automatically routing unacknowledged alerts to the next responder |
| On-Call Rotation | Scheduled responsibility for incident response |
| Alert Budget | Maximum acceptable alert volume per time period |
| Runbook | Step-by-step guide for responding to a specific alert |

Severity Levels

Define clear, actionable severity levels:

| Level | Name | Response Time | Channel | Example |
| --- | --- | --- | --- | --- |
| P1 | Critical | 5 minutes | Page/SMS | Service down, revenue impact, data loss |
| P2 | High | 30 minutes | Page/Slack | Degraded performance, partial outage |
| P3 | Medium | 4 hours | Slack/Email | Capacity threshold, non-urgent anomaly |
| P4 | Low | 1-2 business days | Ticket | Cleanup needed, non-urgent optimization |
| P5 | Info | None | Dashboard only | Metrics for context, not action required |

Severity design principles:

  • P1 means drop everything and respond immediately
  • P2 means respond within 30 minutes; it pages, but it does not require dropping everything
  • P3 and below do not page; they create tickets or Slack messages
  • If everything is P1, nothing is P1
  • Review severity distribution monthly; aim for <10% P1
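
The severity table and the monthly review target can be kept as data that tooling can check. The following is a minimal sketch, not a definitive implementation: the names `SeverityPolicy`, `POLICIES`, `p1_share`, and `p1_share_within_target` are hypothetical, and the P4 response time assumes 8-hour business days.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    name: str
    response_minutes: Optional[int]  # None means no response is expected
    channels: tuple
    pages: bool


# Values mirror the severity table above. P4's "1-2 business days" is stored
# as its upper bound, counting 8-hour business days (an assumption).
POLICIES = {
    "P1": SeverityPolicy("Critical", 5, ("page", "sms"), True),
    "P2": SeverityPolicy("High", 30, ("page", "slack"), True),
    "P3": SeverityPolicy("Medium", 240, ("slack", "email"), False),
    "P4": SeverityPolicy("Low", 2 * 8 * 60, ("ticket",), False),
    "P5": SeverityPolicy("Info", None, ("dashboard",), False),
}


def p1_share(fired_levels):
    """Return the fraction of fired alerts that were P1 (0.0 if none fired)."""
    if not fired_levels:
        return 0.0
    return Counter(fired_levels)["P1"] / len(fired_levels)


def p1_share_within_target(fired_levels, target=0.10):
    """Check the monthly review goal of keeping P1 below 10% of all alerts."""
    return p1_share(fired_levels) < target
```

For example, a month with one P1 among twenty alerts gives a P1 share of 5%, which is inside the target; a month where half the alerts are P1 is not.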

Step-by-Step Alert Management

1. Design Alerts That Matter

Every paging alert must be actionable and tied to user impact; lower severities can track capacity and trends without paging anyone:

# Example: Prometheus alert rules with severity

groups:
  - name: service_alerts
    rules:
      # P1: User-facing service is down
      - alert: ServiceDown
        expr: up{job=~"api|web|payment"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} is down"
          runbook_url: "https://wiki/runbooks/service-down"

      # P2: Elevated error rate but service still responding
      - alert: HighErrorRate
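        # Aggregate per service so that $labels.service is populated in the summary
        expr: |
          (
            sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum by (service) (rate(http_requests_total[5m]))
          ) > 0.05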
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "High error rate in {{ $labels.service }}"

      # P3: Capacity warning — no immediate action needed
      - alert: DiskSpaceWarning
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.15
        for: 10m
        labels:
          severity: medium
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"

      # P4: Low priority, track in a ticket but do not page
      - alert: HighMemoryUsage
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.1
        for: 30m
        labels:
          severity: low
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"

Alert design checklist:

  • Alert on symptoms users feel (errors, latency), not causes (disk full)
  • Every P1/P2 alert must have a runbook link
  • Set a for: duration so the alert fires only on sustained failure, which prevents flapping
  • Include summary and description that tell the responder what to check
  • Add labels for service, environment, team, and severity
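
The checklist above can be enforced automatically before rules are deployed. Below is a minimal sketch that assumes each rule has already been loaded into a plain dictionary with the same keys as the Prometheus example (`for`, `labels`, `annotations`). The function name `lint_alert_rule` and the exact messages are hypothetical.

```python
REQUIRED_LABELS = ("service", "environment", "team", "severity")
PAGING_SEVERITIES = ("critical", "high")  # P1 and P2 in this guide


def lint_alert_rule(rule):
    """Return a list of checklist violations for one alert rule dict."""
    problems = []
    labels = rule.get("labels", {})
    annotations = rule.get("annotations", {})

    # Every alert should carry the routing labels from the checklist.
    for name in REQUIRED_LABELS:
        if name not in labels:
            problems.append(f"missing label: {name}")

    # Paging alerts must link to a runbook.
    if labels.get("severity") in PAGING_SEVERITIES and "runbook_url" not in annotations:
        problems.append("paging alert has no runbook_url annotation")

    # Without a 'for' duration a single bad sample can fire the alert.
    if not rule.get("for"):
        problems.append("no 'for' duration, so the alert may flap")

    # The responder needs both a summary and a description.
    for key in ("summary", "description"):
        if not annotations.get(key):
            problems.append(f"missing annotation: {key}")

    return problems
```

Running this in CI against every rule file is one way to stop a P1/P2 alert from shipping without a runbook; an empty list means the rule passes the checklist.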

2. Build On-Call Rotations

Design rotations that are fair and sustainable:

| Rotation Type | Best For | Structure |
| --- | --- | --- |
| Primary/Secondary | Small teams (3-6) | One primary, one backup |
| Follow-the-sun | Global teams | 8-hour shifts across time zones |
| Weekly rotation | Medium teams (6-12) | One week on, 5-11 weeks off |
| Daily rotation | Large teams (12+) | One day on, rest of week off |

# Example: PagerDuty rotation configuration
# Primary: Weekly rotation, 6 engineers
# Secondary: Next person in rotation
# Escalation: Manager after 15 minutes

Rotation best practices:

  • Limit on-call frequency to no more than 1 week in 4
  • Ensure handoff between shifts includes active incidents
  • Compensate for on-call time (pay or time off)
  • Allow opt-out for personal events with coverage
  • Track and review incident frequency per rotation
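
To make the weekly primary/secondary structure concrete, here is a minimal sketch of a round-robin schedule generator with a check for the "1 week in 4" limit. It does not use any scheduling tool's API; `weekly_rotation`, `meets_frequency_limit`, and the engineer names are hypothetical.

```python
from datetime import date, timedelta


def weekly_rotation(engineers, start, weeks):
    """Assign a primary and secondary on-call for each week, round-robin.

    The secondary is the next person in the rotation, so every engineer
    backs up the week before their own primary shift.
    """
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary and secondary")
    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        secondary = engineers[(week + 1) % len(engineers)]
        schedule.append((start + timedelta(weeks=week), primary, secondary))
    return schedule


def meets_frequency_limit(engineers, max_weeks_on=1, window_weeks=4):
    """True if round-robin primary duty stays within max_weeks_on per window."""
    # In a simple round-robin each engineer is primary 1 week in len(engineers).
    return len(engineers) * max_weeks_on >= window_weeks


team = ["ana", "ben", "chloe", "dev", "eli", "fay"]
schedule = weekly_rotation(team, date(2024, 1, 1), weeks=8)
```

With six engineers each person is primary one week in six, which satisfies the limit; a three-person team would not, which is a signal to add people or share the rotation with another team.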

3. Define Escalation Policies

Ensure unacknowledged alerts reach a human:

Escalation Path Example:

Alert Fires
    → Primary on-call (page + SMS)
        → Acknowledged? (stop)
        → Not acknowledged in 5 min
            → Secondary on-call (page)
                → Acknowledged? (stop)
                → Not acknowledged in 10 min
                    → Engineering Manager (page)
                        → Not acknowledged in 15 min
                            → Director of Engineering (page)

Escalation principles:

  • Escalate quickly for P1 (5-10 minute intervals)
  • Escalate more slowly for P2 (30-60 minute intervals)
  • Keep earlier responders notified as an alert escalates, so whoever becomes available first can pick it up
  • Set up team-wide Slack channels for visibility
  • Log all escalations for post-incident review
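
The escalation path above can be modeled as a list of steps with timeouts, which makes it easy to test who should have been paged at any moment. This is a minimal sketch that assumes each timeout counts from when that step's responder was paged, not from when the alert fired. `EscalationStep`, `P1_POLICY`, and `paged_so_far` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EscalationStep:
    responder: str
    timeout_minutes: Optional[int]  # how long to wait for an ack; None for the last step


# Mirrors the example path: 5 min, then 10 min, then 15 min (per-step timeouts).
P1_POLICY = [
    EscalationStep("primary on-call", 5),
    EscalationStep("secondary on-call", 10),
    EscalationStep("engineering manager", 15),
    EscalationStep("director of engineering", None),
]


def paged_so_far(policy, minutes_since_fired, acked_at=None):
    """Return the responders paged up to a point in time.

    acked_at is the number of minutes after firing at which someone
    acknowledged the alert, or None if nobody has acknowledged it yet.
    """
    paged = []
    step_start = 0  # minutes after firing at which the current step is paged
    for step in policy:
        not_reached = step_start > minutes_since_fired
        already_acked = acked_at is not None and acked_at <= step_start
        if not_reached or already_acked:
            break
        paged.append(step.responder)
        if step.timeout_minutes is None:
            break
        step_start += step.timeout_minutes
    return paged
```

Under these assumptions the manager is paged 15 minutes after the alert fires and the director after 30. An acknowledgement at minute 7 stops the chain at the secondary.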

4. Create Runbooks for Every Alert

A runbook turns an alert into a solvable problem:

# Runbook: ServiceDown

## Alert
ServiceDown: `{{ $labels.job }}` is down

## Impact
Users cannot access `{{ $labels.job }}`. Revenue impact if payment or API.

## Diagnosis Steps
1. Check service health endpoint: `curl http://{{ $labels.instance }}/health`
2. Check if pod is running: `kubectl get pods -l app={{ $labels.job }}`
3. Check recent deployments: `kubectl rollout history deployment/{{ $labels.job }}`
4. Check resource usage: `kubectl top pod -l app={{ $labels.job }}`
5. Check logs: `kubectl logs -l app={{ $labels.job }} --tail=100`

## Resolution Steps
1. If pod crashed: `kubectl rollout restart deployment/{{ $labels.job }}`
2. If resource exhausted: Scale deployment or node pool
3. If deployment caused issue: `kubectl rollout undo deployment/{{ $labels.job }}`
4. If dependency down: Check dependency status and escalate to owning team

## Escalation
If unresolved in 15 minutes, escalate to: platform-team@company.com

Runbook best practices:

  • One runbook per P1/P2 alert
  • Include diagnosis, resolution, and escalation steps
  • Link runbook directly in alert notification
  • Review and update runbooks quarterly
  • Measure runbook effectiveness (time to resolve when followed)

5. Reduce Alert Fatigue

Actively measure and reduce alert volume:

| Metric | Target | Action if Exceeded |
| --- | --- | --- |
| Alerts per person per week | < 5 | Tune thresholds, remove noisy alerts |
| P1 alerts per month | < 2 | Fix root causes, not symptoms |
| Alert acknowledgment time | < 5 min for P1 | Improve runbooks, training |
| False positive rate | < 10% | Increase for: duration, add conditions |
| Alerts without runbooks | 0 | Create missing runbooks |
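
Three of these targets can be computed directly from an export of fired alerts. The following is a minimal sketch that assumes each alert has been annotated by the responder as actionable or not. The `AlertEvent` fields and `fatigue_report` are hypothetical; your alerting tool's export format will differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertEvent:
    name: str
    severity: str      # "P1" .. "P5"
    actionable: bool   # False means the responder judged it a false positive
    has_runbook: bool


def fatigue_report(events, people_on_call, weeks):
    """Summarise alert volume against three of the targets in the table above."""
    total = len(events)
    per_person_week = total / (people_on_call * weeks)
    false_positive_rate = (
        sum(1 for e in events if not e.actionable) / total if total else 0.0
    )
    missing_runbooks = sorted({e.name for e in events if not e.has_runbook})
    return {
        "alerts_per_person_per_week": per_person_week,
        "false_positive_rate": false_positive_rate,
        "alerts_without_runbooks": missing_runbooks,
        "within_targets": (
            per_person_week < 5
            and false_positive_rate < 0.10
            and not missing_runbooks
        ),
    }
```

A report like this is a natural input to the monthly alert review: each value that misses its target points at the corresponding action in the table.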

Fatigue reduction tactics:

  • Consolidate: Group related alerts into one notification
  • Suppress: Silence known maintenance windows
  • Deduplicate: One alert per incident, not per affected host
  • Auto-remediate: Auto-restart, auto-scale for known recoverable issues
  • Delete: Remove alerts that repeatedly fire without anyone taking action
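
Consolidation and deduplication both come down to grouping alerts by a shared key and sending one notification per group. Here is a minimal sketch of that idea, assuming alerts are dictionaries of labels. `group_alerts` and the label names used as the default key are illustrative, not any tool's API.

```python
from collections import defaultdict


def group_alerts(alerts, group_by=("alertname", "service")):
    """Collapse per-host alerts into one notification per group key."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)

    # One notification per group, listing the affected instances.
    return [
        {
            **dict(zip(group_by, key)),
            "count": len(members),
            "instances": sorted(a.get("instance", "unknown") for a in members),
        }
        for key, members in groups.items()
    ]


firing = [
    {"alertname": "HighErrorRate", "service": "api", "instance": "api-2"},
    {"alertname": "HighErrorRate", "service": "api", "instance": "api-1"},
    {"alertname": "HighErrorRate", "service": "api", "instance": "api-3"},
    {"alertname": "DiskSpaceWarning", "service": "db", "instance": "db-1"},
]
notifications = group_alerts(firing)
```

Here four firing alerts become two notifications, so an incident that affects three API hosts reaches the responder once instead of three times.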

Best Practices

  • Alert on symptoms, not causes. A full disk is a cause; slow requests are the symptom.
  • Every alert must be actionable. If the response is “wait and see,” it should not page.
  • Use a for: duration to prevent flapping. Require a sustained threshold breach before alerting.
  • Separate paging from logging. Not everything that is interesting needs to wake someone up.
  • Review alerts monthly. Track which alerts fire, which are acknowledged, and which are ignored.
  • Compensate on-call fairly. On-call is work; treat it as such.

Common Mistakes

  • Alerting on everything. More alerts do not mean better coverage; they mean more noise.
  • No escalation path. If the primary does not respond, the alert dies silently.
  • Missing runbooks. An alert without a runbook forces the responder to guess.
  • Ignoring alert fatigue. High alert volume leads to burnout and missed critical alerts.
  • Static thresholds on cyclical metrics. CPU spikes during batch jobs are normal; alert on deviation from baseline instead.
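
Alerting on deviation from baseline can be as simple as comparing the current value with the same point in earlier cycles. The sketch below uses a z-score against historical samples. It is one possible approach under that assumption, not a recommendation of a specific anomaly-detection method, and `deviates_from_baseline` and the sample values are hypothetical.

```python
from statistics import mean, stdev


def deviates_from_baseline(current, history, threshold=3.0):
    """True if current is more than threshold standard deviations from the
    mean of history (samples from the same point in earlier cycles, for
    example the same hour on previous days)."""
    if len(history) < 2:
        return False  # not enough data to estimate a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return current != baseline
    return abs(current - baseline) / spread > threshold


# CPU percentage during the nightly batch window on five previous nights.
batch_window_cpu = [80, 85, 82, 78, 84]
normal_night = deviates_from_baseline(83, batch_window_cpu)
unusual_night = deviates_from_baseline(99, batch_window_cpu)
```

A static threshold of 75% would fire every night in this example, while the baseline comparison stays quiet at 83% and fires only for the 99% reading.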

Variants

  • No-ops alerting: Fully automated remediation without human involvement
  • Severity-based routing: Different channels for different severities (Slack for P3, PagerDuty for P1)
  • Team-specific ownership: Alerts route to the team that owns the service
  • AI-assisted alerting: Anomaly detection that adjusts thresholds dynamically

FAQ

Q: How many alerts per week is too many?
A: More than 5 actionable alerts per person per week is excessive. If you are paging more than once per week, something is wrong with your system or your thresholds.

Q: Should I alert on CPU and memory usage?
A: Generally no, unless those metrics directly correlate with user-impacting symptoms. Alert on request latency, error rate, and throughput instead.

Q: How do I handle noisy alerts I cannot fix immediately?
A: Temporarily silence them with an expiration date, create a ticket to fix the root cause, and schedule the fix within the current sprint.

Q: What is the difference between an alert and a dashboard?
A: Alerts notify you of something that requires action. Dashboards help you understand what is happening. Use alerts for urgent issues; dashboards for investigation.

Conclusion

Good alerting is a product you build for your on-call engineers. It should be precise, actionable, and respectful of their time. By designing alerts around user impact, creating clear runbooks, and actively reducing noise, you build an operational culture that is sustainable and effective.