beginner By StackPractices

Monitoring and Alerting Policy Template

A policy template that defines how alerts are configured, routed, escalated, and reviewed across services and infrastructure.

Note: This guide follows English-language naming conventions and terminology standards common in international development teams. Examples use English identifiers and comments to maximize compatibility across codebases and tooling.

Overview

A Monitoring and Alerting Policy defines how an organization detects problems, notifies the right people, and escalates when issues are not resolved quickly. Without a clear policy, teams suffer from alert fatigue, missed incidents, or inconsistent response times. This template provides a structured framework for alert thresholds, severity levels, routing rules, escalation paths, and regular review.

When to Use

  • Setting up a new observability platform or monitoring stack.
  • Onboarding a new service or team into the alerting system.
  • Reviewing alert quality after a period of noise or missed incidents.
  • Defining on-call responsibilities and escalation paths.
  • Preparing for an audit of operational maturity or incident response.

Prerequisites

  • A monitoring and observability platform such as Prometheus, Datadog, Grafana, or New Relic, plus a paging tool such as PagerDuty.
  • A list of critical services and infrastructure components.
  • Defined on-call rotations and escalation contacts.
  • A communication channel for alerts, such as Slack, Microsoft Teams, or email.
  • An incident response process that alerts will trigger.

Solution

Policy Template

1. Alert Severity Levels

| Severity | Response Time | Example | Notification Channel |
| --- | --- | --- | --- |
| P1 - Critical | Immediate (5 min) | Service down, data loss, revenue impact | Page on-call + executive notification |
| P2 - High | 15 minutes | Degraded performance, failed backups | Page on-call + Slack alert |
| P3 - Medium | 1 hour | High error rate, resource pressure | Slack or email to owning team |
| P4 - Low | Next business day | Capacity warning, non-urgent drift | Email or dashboard notice |
| P5 - Informational | None | Usage metrics, trend data | Dashboard only |
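
One way to keep tooling in step with this table is to hold it as data. The following is a minimal sketch, not a definitive implementation: the `SeverityPolicy` class, the `should_page` helper, and the channel identifiers are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SeverityPolicy:
    """One row of the severity table (hypothetical structure)."""
    label: str
    response_time: str
    pages_on_call: bool
    channels: Tuple[str, ...]


# Values copied from the severity table; channel names are illustrative.
SEVERITY_POLICIES = {
    "P1": SeverityPolicy("Critical", "Immediate (5 min)", True, ("page", "executive")),
    "P2": SeverityPolicy("High", "15 minutes", True, ("page", "slack")),
    "P3": SeverityPolicy("Medium", "1 hour", False, ("slack", "email")),
    "P4": SeverityPolicy("Low", "Next business day", False, ("email", "dashboard")),
    "P5": SeverityPolicy("Informational", "None", False, ("dashboard",)),
}


def should_page(severity: str) -> bool:
    """Return True when the severity level pages the on-call engineer."""
    return SEVERITY_POLICIES[severity].pages_on_call
```

Keeping the table in one structure means a change to the policy document and a change to the alerting configuration can be reviewed together.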

2. Alert Categories

| Category | Purpose | Examples |
| --- | --- | --- |
| Availability | Detect service unreachability | HTTP 5xx, connection timeout, health check failure |
| Performance | Detect latency and throughput issues | p99 latency > 500ms, queue depth high |
| Capacity | Detect resource exhaustion | CPU > 85%, disk > 80%, memory pressure |
| Error rate | Detect unusual failure rates | Error rate > 1% for 5 minutes |
| Security | Detect suspicious activity | Failed logins, rate limit hits, blocked traffic |
| Business | Detect revenue or workflow impact | Failed payments, order drop, signup failure |
| Data health | Detect pipeline or data quality issues | Stale data, missing partitions, sync lag |

3. Alert Routing Matrix

| Team | Primary Hours | On-Call Hours | Channels | Escalation Path |
| --- | --- | --- | --- | --- |
| Platform team | 08:00 - 18:00 UTC | 24/7 | PagerDuty, #platform-alerts | Manager, then VP Engineering |
| Application team | 08:00 - 18:00 UTC | 24/7 | PagerDuty, #app-alerts | Team lead, then Engineering manager |
| Security team | 24/7 | 24/7 | PagerDuty, #security-alerts | Security lead, then CISO |
| Database team | 08:00 - 18:00 UTC | 24/7 | PagerDuty, #db-alerts | DBA lead, then Platform manager |
| Business operations | Business hours | None | Email, Slack | Operations manager |
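
The routing matrix can be sketched as a lookup from owning team to channels and escalation path. This is an illustrative sketch only: the `Route` class, the team keys, and the `channels_for` helper are hypothetical, and the rule that only P1 and P2 use PagerDuty is an assumption carried over from the severity levels.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Route:
    """Channels and escalation path for one team (hypothetical structure)."""
    channels: Tuple[str, ...]
    escalation_path: Tuple[str, ...]


# Rows copied from the routing matrix; the dictionary keys are illustrative.
ROUTES = {
    "platform": Route(("PagerDuty", "#platform-alerts"), ("Manager", "VP Engineering")),
    "application": Route(("PagerDuty", "#app-alerts"), ("Team lead", "Engineering manager")),
    "security": Route(("PagerDuty", "#security-alerts"), ("Security lead", "CISO")),
    "database": Route(("PagerDuty", "#db-alerts"), ("DBA lead", "Platform manager")),
    "business_operations": Route(("Email", "Slack"), ("Operations manager",)),
}

# Assumption: only these severities page the on-call engineer.
PAGING_SEVERITIES = {"P1", "P2"}


def channels_for(team: str, severity: str) -> Tuple[str, ...]:
    """Return the channels an alert should reach for a team and severity."""
    route = ROUTES.get(team)
    if route is None:
        # Every alert needs an owning team; fail loudly instead of guessing.
        raise ValueError(f"No route defined for team {team!r}")
    if severity in PAGING_SEVERITIES:
        return route.channels
    # Lower severities skip the pager and use chat or email only.
    return tuple(c for c in route.channels if c != "PagerDuty")
```

Raising an error for an unknown team, rather than falling back to a shared queue, keeps unowned alerts visible during review.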

4. Alert Threshold Guidelines

| Signal | Warning Threshold | Critical Threshold | Evaluation Window |
| --- | --- | --- | --- |
| HTTP error rate | > 1% for 5 min | > 5% for 2 min | Rolling 5 min |
| Response latency p99 | > 500ms for 10 min | > 1s for 5 min | Rolling 10 min |
| CPU utilization | > 70% for 10 min | > 90% for 5 min | Rolling 5 min |
| Disk utilization | > 75% for 1 hour | > 90% for 15 min | Rolling 15 min |
| Memory utilization | > 80% for 10 min | > 95% for 5 min | Rolling 5 min |
| Queue depth | > 1000 for 10 min | > 5000 for 5 min | Rolling 5 min |
| Failed backup | N/A | Any failed backup | Per job run |
| SSL certificate expiry | < 30 days | < 7 days | Daily check |
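
A threshold of the form "> X for N minutes" only fires when the signal stays above X for the whole period. A minimal sketch of that check follows, assuming one sample per minute with the oldest sample first; the `Threshold` class and the `breached` and `classify` functions are hypothetical names, and the constants use the HTTP error rate row as an example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Threshold:
    limit: float            # value the signal must exceed
    sustained_minutes: int  # how long it must stay above the limit


def breached(samples: Sequence[float], threshold: Threshold) -> bool:
    """True if the most recent `sustained_minutes` samples all exceed the limit.

    Assumes one sample per minute, oldest first.
    """
    n = threshold.sustained_minutes
    if len(samples) < n:
        return False  # not enough history to decide
    return all(value > threshold.limit for value in samples[-n:])


def classify(samples: Sequence[float],
             warning: Threshold,
             critical: Threshold) -> Optional[str]:
    """Return 'critical', 'warning', or None, checking critical first."""
    if breached(samples, critical):
        return "critical"
    if breached(samples, warning):
        return "warning"
    return None


# HTTP error rate row, expressed in percent.
HTTP_ERROR_WARNING = Threshold(limit=1.0, sustained_minutes=5)
HTTP_ERROR_CRITICAL = Threshold(limit=5.0, sustained_minutes=2)
```

For example, `classify([0.5, 2, 2, 2, 2, 2], HTTP_ERROR_WARNING, HTTP_ERROR_CRITICAL)` yields a warning because the last five samples exceed 1%, while a single one-minute spike does not fire at all.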

5. Escalation Rules

| Severity | Initial Alert | Escalate If Unacknowledged After | Escalate If Unresolved After | Final Escalation |
| --- | --- | --- | --- | --- |
| P1 | Page on-call immediately | 5 min | 15 min | Executive notification + war room |
| P2 | Page on-call | 15 min | 30 min | Manager page |
| P3 | Slack to owning team | 1 hour | 4 hours | Manager notification |
| P4 | Email or dashboard | Next business day | N/A | Weekly review |
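
The escalation timers can be sketched as a small decision function. This sketch rests on one reading of the table, stated here as an assumption: the "no acknowledgment" time is when an unacknowledged alert is escalated, and the "still unresolved" time is when the final escalation starts regardless of acknowledgment. The function name and return values are hypothetical.

```python
# (minutes until unacknowledged escalation, minutes until final escalation),
# copied from the escalation table. P4 is handled by scheduled review,
# so it has no timers here.
ESCALATION_MINUTES = {
    "P1": (5, 15),
    "P2": (15, 30),
    "P3": (60, 240),
}


def escalation_step(severity: str,
                    minutes_since_alert: float,
                    acknowledged: bool,
                    resolved: bool) -> str:
    """Return the escalation action due for an alert at this moment."""
    if resolved or severity not in ESCALATION_MINUTES:
        return "none"
    ack_deadline, resolve_deadline = ESCALATION_MINUTES[severity]
    if minutes_since_alert >= resolve_deadline:
        return "final_escalation"
    if not acknowledged and minutes_since_alert >= ack_deadline:
        return "escalate_unacknowledged"
    return "none"
```

Running this against recorded incident timelines during an escalation drill is one way to confirm that the timers match what the paging tool actually does.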

6. Alert Review and Maintenance

| Activity | Frequency | Owner | Output |
| --- | --- | --- | --- |
| Alert quality review | Weekly | On-call engineer | Top noisy alerts, tuning actions |
| Alert runbook review | Monthly | SRE team | Updated runbooks for each alert |
| Threshold calibration | Quarterly | Observability team | Threshold adjustments with evidence |
| On-call retro | After major incident | Incident commander | Alert improvements, follow-up tasks |
| Policy review | Annually | Engineering leadership | Updated policy document |

Explanation

This policy turns raw monitoring signals into actionable alerts. By assigning severity, routing, and escalation rules, the organization ensures that critical problems get fast attention while low-priority warnings do not disrupt on-call engineers. The review and maintenance section prevents alert fatigue by continuously tuning thresholds and removing noisy alerts.

Variants

  • Cloud-native alerting policy: Uses Prometheus Alertmanager, Grafana OnCall, or PagerDuty for container and serverless environments.
  • Enterprise IT monitoring policy: Focuses on infrastructure, network, and service desk integration.
  • Security alerting policy: Emphasizes SIEM rules, threat detection, and incident response triggers.
  • Business operations alerting: Tracks KPIs, revenue, and customer-facing metrics with business-hour notifications.
  • Developer self-service alerting: Allows teams to define their own alert rules within guardrails.

Best Practices

  • Alert on symptoms that affect users, not just internal metrics.
  • Use multi-window, multi-burn-rate thresholds to reduce false positives.
  • Require every alert to have an associated runbook or troubleshooting link.
  • Route alerts to the team that can fix the problem, not a central queue.
  • Keep alert messages concise and include context such as severity, service, and impact.
  • Review noisy alerts weekly and tune or delete them.
  • Test escalation paths during regular drills.
  • Document alert thresholds and the rationale for changes.
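
The multi-window, multi-burn-rate practice listed above can be sketched as follows. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target); the alert fires only when both a long and a short window exceed the same burn-rate threshold, so a problem that has already recovered does not page. The function names are hypothetical, and the 99.9% SLO and 14.4 threshold in the usage note are illustrative values, not part of this policy.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being consumed (1.0 = exactly on budget)."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_alert(long_window_error_ratio: float,
                 short_window_error_ratio: float,
                 slo_target: float,
                 burn_rate_threshold: float) -> bool:
    """Fire only when both windows burn budget faster than the threshold."""
    long_burn = burn_rate(long_window_error_ratio, slo_target)
    short_burn = burn_rate(short_window_error_ratio, slo_target)
    return long_burn >= burn_rate_threshold and short_burn >= burn_rate_threshold
```

With a 99.9% SLO and a threshold of 14.4, `should_alert(0.02, 0.02, 0.999, 14.4)` fires because both windows show a 2% error ratio, while `should_alert(0.02, 0.0005, 0.999, 14.4)` stays quiet because the short window shows the issue has subsided.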

Common Mistakes

  • Alerting on every metric threshold without considering user impact.
  • Sending all alerts to a single channel with no routing.
  • Using the same severity for every alert.
  • Failing to require acknowledgment or to track resolution time.
  • Ignoring alerts that fire repeatedly without action.
  • Missing escalation paths for severe incidents.
  • Failing to review and retire stale alerts after system changes.

FAQs

What is alert fatigue and how do we avoid it?

Alert fatigue happens when on-call engineers receive too many low-value alerts. Avoid it by tuning thresholds, grouping related alerts, suppressing known issues, and regularly deleting alerts that do not lead to action.
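
Grouping and suppression can be sketched in a few lines. This is a simplified illustration, assuming each alert is a `(service, alert name)` pair; the `group_alerts` function and the example names are hypothetical, and real alert managers offer richer grouping and inhibition rules.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Tuple

Alert = Tuple[str, str]  # (service, alert name)


def group_alerts(alerts: Iterable[Alert],
                 suppressed: FrozenSet[Alert] = frozenset()) -> Dict[str, List[str]]:
    """Collapse duplicate alerts per service and drop suppressed known issues."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for service, name in alerts:
        if (service, name) in suppressed:
            continue  # known issue: do not notify again
        if name not in grouped[service]:
            grouped[service].append(name)  # keep first occurrence only
    return dict(grouped)
```

For example, two `HighLatency` firings and one `ErrorRate` firing for the same service become a single notification listing both alert names, instead of three separate messages.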

Should every alert page someone?

No. Only P1 and P2 alerts should page the on-call engineer. Lower-severity alerts should use Slack, email, or dashboard notifications so that pages stay reserved for issues that need an immediate response.

How do we know if our thresholds are right?

Track the ratio of actionable alerts to total alerts, measure mean time to acknowledge and resolve, and review false-positive rates. If an alert fires frequently without action, it is a candidate for tuning or removal.
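
These review metrics can be computed from a simple export of alert history. The sketch below assumes each record notes whether the alert led to action and how long acknowledgment took; the `AlertRecord` class, the function names, and the `min_firings` default are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class AlertRecord:
    name: str
    actionable: bool                 # did someone need to act on it?
    minutes_to_ack: Optional[float]  # None if never acknowledged


def actionable_ratio(records: Sequence[AlertRecord]) -> float:
    """Share of fired alerts that led to action (0.0 when there is no data)."""
    if not records:
        return 0.0
    return sum(r.actionable for r in records) / len(records)


def mean_time_to_ack(records: Sequence[AlertRecord]) -> Optional[float]:
    """Average acknowledgment time in minutes over acknowledged alerts."""
    acked = [r.minutes_to_ack for r in records if r.minutes_to_ack is not None]
    return mean(acked) if acked else None


def tuning_candidates(records: Sequence[AlertRecord], min_firings: int = 3) -> List[str]:
    """Alerts that fired at least `min_firings` times and never led to action."""
    fired = Counter(r.name for r in records)
    acted_on = {r.name for r in records if r.actionable}
    return sorted(name for name, count in fired.items()
                  if count >= min_firings and name not in acted_on)
```

The list returned by `tuning_candidates` is a reasonable starting agenda for the weekly alert quality review.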