intermediate By StackPractices

Site Reliability Engineering — SRE Practices and Error Budgets

A practical guide to SRE: defining SLIs, SLOs, and SLAs, managing error budgets, toil reduction, on-call rotations, and building a culture of reliability.

Note: This guide follows English-language naming conventions and terminology standards common in international development teams. Examples use English identifiers and comments to maximize compatibility across codebases and tooling.

Overview

Site Reliability Engineering (SRE), pioneered at Google, applies software engineering principles to operations. Instead of treating reliability as a separate function, SRE teams write code to automate operations, manage infrastructure, and measure system health through Service Level Objectives (SLOs). The core tenet: reliability is a feature, not an afterthought. SRE balances the need for velocity (shipping features) with the need for stability (keeping systems running) through error budgets, toil budgets, and blameless postmortems.

When to Use

  • You operate production systems where downtime has business impact
  • Development and operations teams are in conflict over release velocity vs stability
  • You need objective, measurable definitions of “reliable”
  • Manual operational work consumes significant engineering time
  • Incident response is reactive and ad-hoc rather than structured

The Hierarchy of Reliability Concepts

| Concept | Definition | Example |
|---------|------------|---------|
| SLI | Service Level Indicator: what you measure | “99th percentile request latency” |
| SLO | Service Level Objective: target over time | “p99 latency < 200ms over 30 days” |
| SLA | Service Level Agreement: contract with penalty | “99.9% uptime or 10% service credit” |
| Error budget | 1 - SLO; amount of acceptable failure | 0.1% error budget ≈ 43 minutes of downtime/month |

Defining SLIs

Choose indicators that users actually care about:

| User-facing | System-facing |
|-------------|---------------|
| Request latency | CPU utilization |
| Error rate | Memory pressure |
| Throughput | Queue depth |
| Availability | Replication lag |

Latency SLI example:

SLI = proportion of requests with latency < 200ms
measured over a 1-minute window
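
As a minimal sketch of the latency SLI above, the proportion can be computed directly from the latencies seen in one window. The function name `latency_sli` and the sample latencies are hypothetical, and treating an empty window as fully compliant is an assumption that some teams would replace with skipping the window.

```python
from typing import Sequence


def latency_sli(latencies_ms: Sequence[float], threshold_ms: float = 200.0) -> float:
    """Return the fraction of requests with latency strictly below threshold_ms."""
    if not latencies_ms:
        # Assumption: a window with no traffic counts as fully compliant.
        return 1.0
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return good / len(latencies_ms)


# Hypothetical latencies (in ms) observed during one 1-minute window.
window = [120.0, 95.0, 310.0, 180.0, 199.9, 200.0, 45.0, 150.0]
print(f"SLI = {latency_sli(window):.3f}")  # 6 of 8 requests are under 200 ms -> 0.750
```

A request at exactly 200 ms counts as bad here because the SLI is defined with a strict `<`.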

Setting SLOs

  1. Start with what you can measure — do not set an SLO you cannot track
  2. Base on historical performance — look at the last 30-90 days, pick the 50th percentile, not the best case
  3. Leave headroom — if you are at 99.9%, set SLO at 99.5% to allow for growth
  4. Review quarterly — tighten or relax based on business needs and technical capability

| SLO | Error budget (monthly) | Use case |
|-----|------------------------|----------|
| 99% | 7.3 hours | Internal tools, non-critical |
| 99.9% | 43 minutes | Customer-facing services |
| 99.99% | 4.3 minutes | Core revenue systems |
| 99.999% | 26 seconds | Rarely justified; extremely expensive |
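
The monthly budgets in the table follow from `1 - SLO` multiplied by the length of the window. The sketch below assumes a 30-day window and a hypothetical helper named `error_budget`; the table's rounded figures shift slightly depending on whether a month is taken as 30 days or about 30.4 days.

```python
from datetime import timedelta


def error_budget(slo_percent: float, window: timedelta = timedelta(days=30)) -> timedelta:
    """Return the downtime allowed within `window` for an availability SLO."""
    if not 0.0 < slo_percent <= 100.0:
        raise ValueError("slo_percent must be greater than 0 and at most 100")
    # Error budget = (1 - SLO) * window length.
    return window * (1.0 - slo_percent / 100.0)


for slo in (99.0, 99.9, 99.99, 99.999):
    minutes = error_budget(slo).total_seconds() / 60
    print(f"{slo}% SLO -> {minutes:.2f} minutes per 30 days")
```

With a 30-day window, a 99.9% SLO works out to 43.2 minutes of allowed downtime.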

Error Budget Policy

IF error_budget_remaining > 50%:
    → Full release velocity

IF 25% < error_budget_remaining ≤ 50%:
    → Require SRE review for risky changes

IF 0% < error_budget_remaining ≤ 25%:
    → Freeze all non-critical releases
    → Prioritize reliability work

IF error_budget_remaining ≤ 0% (budget exhausted):
    → All feature work stops
    → Only reliability fixes and mitigation
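
A policy like this can be encoded so that release tooling reads the tier instead of a person interpreting a dashboard. The following is a sketch under stated assumptions: the names `ReleasePolicy`, `release_policy`, and `budget_remaining` are hypothetical, the budget is computed from good and total event counts, and the boundary values of exactly 50% and 25% are assigned to the more cautious tier.

```python
from enum import Enum


class ReleasePolicy(Enum):
    FULL_VELOCITY = "full release velocity"
    SRE_REVIEW = "SRE review required for risky changes"
    FREEZE_NON_CRITICAL = "freeze non-critical releases, prioritize reliability work"
    RELIABILITY_ONLY = "feature work stops, reliability fixes and mitigation only"


def budget_remaining(slo: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget left (1.0 = untouched, <= 0 = exhausted)."""
    allowed_bad = (1.0 - slo) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        # No budget exists (100% SLO or no traffic): any failure exhausts it.
        return 1.0 if actual_bad == 0 else 0.0
    return 1.0 - actual_bad / allowed_bad


def release_policy(remaining: float) -> ReleasePolicy:
    """Map the remaining budget fraction to a policy tier."""
    if remaining > 0.50:
        return ReleasePolicy.FULL_VELOCITY
    if remaining > 0.25:
        return ReleasePolicy.SRE_REVIEW
    if remaining > 0.0:
        return ReleasePolicy.FREEZE_NON_CRITICAL
    return ReleasePolicy.RELIABILITY_ONLY


# Hypothetical window: 99.9% SLO, 1,000,000 requests, 600 of them failed.
remaining = budget_remaining(slo=0.999, good_events=999_400, total_events=1_000_000)
print(f"{remaining:.0%} of budget left: {release_policy(remaining).value}")
```

In this example 600 of the roughly 1,000 allowed failures are spent, so about 40% of the budget remains and risky changes need SRE review.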

Toil Reduction

Toil is manual, repetitive, automatable operational work with no enduring value.

Toil typeAutomation approach
Manual scalingHorizontal pod autoscaling, cluster autoscaler
Manual deploymentsCI/CD pipelines with automated canary analysis
Manual log reviewAlerting on derived metrics, not raw logs
Ticket-driven changesSelf-service portals with guardrails
On-call pages for known issuesAuto-remediation runbooks

Toil budget: Google recommends capping toil at 50% of an SRE’s time. The other 50% goes to project work that improves the system.
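
One way to check the 50% cap is to compare logged toil hours against total hours. This is an illustrative sketch only: the category names, the hours, and the helper `toil_fraction` are hypothetical, and it assumes time is already tracked per category.

```python
TOIL_CAP = 0.50  # the 50% cap described above


def toil_fraction(hours_by_category: dict[str, float], toil_categories: set[str]) -> float:
    """Return the share of logged hours that falls into toil categories."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    toil = sum(hours for name, hours in hours_by_category.items() if name in toil_categories)
    return toil / total


# Hypothetical week of logged time for one engineer.
week = {
    "manual deployments": 9.0,
    "ticket-driven changes": 8.0,
    "on-call pages": 5.0,
    "automation project": 14.0,
    "design review": 4.0,
}
toil = {"manual deployments", "ticket-driven changes", "on-call pages"}

share = toil_fraction(week, toil)
status = "over" if share > TOIL_CAP else "within"
print(f"Toil share: {share:.0%} ({status} the {TOIL_CAP:.0%} cap)")
```

Here 22 of 40 hours are toil, which puts the week at 55% and over the cap.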

On-Call Rotation Design

| Pattern | Best for | Roster size |
|---------|----------|-------------|
| Primary/secondary | Small teams, critical services | 4-6 people |
| Follow-the-sun | Global teams, 24/7 coverage | 3+ regions |
| No on-call (pagerless) | Teams with mature automation | Requires significant investment |

On-call health metrics:

  • Pages per shift (target: < 2)
  • Time to acknowledge (target: < 5 minutes)
  • Time to resolve (track, but do not target — quality over speed)
  • Post-incident action items closed within 30 days (target: 100%)
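
The first two targets above can be checked per shift from page timestamps. The sketch below assumes each page records when it fired and when it was acknowledged; the `Page` type, the `shift_health` helper, and the sample timestamps are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Page:
    fired_at: datetime
    acknowledged_at: datetime


def shift_health(
    pages: list[Page],
    pages_target: int = 2,
    ack_target: timedelta = timedelta(minutes=5),
) -> dict[str, object]:
    """Summarize one shift against the page-count and acknowledgement targets."""
    ack_times = [page.acknowledged_at - page.fired_at for page in pages]
    return {
        "pages": len(pages),
        "pages_ok": len(pages) < pages_target,  # target: fewer than 2 pages
        "slowest_ack": max(ack_times, default=timedelta(0)),
        "ack_ok": all(ack < ack_target for ack in ack_times),  # target: under 5 minutes
    }


# Hypothetical shift with two pages.
shift = [
    Page(datetime(2026, 6, 20, 14, 32), datetime(2026, 6, 20, 14, 35)),
    Page(datetime(2026, 6, 21, 2, 10), datetime(2026, 6, 21, 2, 17)),
]
report = shift_health(shift)
print(report)
```

This shift misses both targets: it has two pages rather than fewer than two, and the slowest acknowledgement took 7 minutes.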

Blameless Postmortem Template

## Incident: [Short description] — [Date]

### Impact
- Duration: 23 minutes
- Affected users: ~1,200
- Revenue impact: $0 (free tier)

### Timeline
- 14:32 — Monitoring alert fired
- 14:35 — On-call acknowledged
- 14:40 — Root cause identified (DB connection pool exhaustion)
- 14:55 — Service fully recovered

### Root Cause
The connection pool was sized for 100 connections. A deployment doubled traffic without scaling the pool.

### Contributing Factors
- No load test for the new deployment
- Connection pool size was not exposed as a tunable
- Alert threshold was too high (only fired at 95% error rate)

### Action Items
| Owner | Task | Due |
|-------|------|-----|
| @alice | Add connection pool autoscaling | 2026-07-15 |
| @bob | Run load tests in staging | 2026-07-01 |
| @charlie | Lower error rate alert to 1% | 2026-06-30 |

### Lessons Learned
We need to treat connection pools as elastic resources, not fixed constants.

Common Mistakes

  • Setting SLOs too high — 99.999% sounds impressive but costs 10x more than 99.9% for marginal benefit
  • Using SLAs as SLOs — SLAs are external contracts; SLOs are internal targets. SLOs should be stricter than SLAs.
  • No error budget policy — without consequences for burning budget, SLOs are meaningless
  • Toil that is “just part of the job” — if it is repetitive and manual, it is toil. Automate it.
  • Blameful postmortems — focusing on who made a mistake creates fear and hides systemic issues

FAQ

What is the difference between SRE and DevOps? DevOps is a cultural movement and set of practices. SRE is a specific implementation of DevOps principles with quantitative reliability targets and a 50% toil cap.

How do I convince management to adopt SLOs? Frame SLOs as risk management. They answer “How fast can we ship without breaking customer trust?” Error budgets create a data-driven conversation between engineering and product.

Should every team have an SRE? Not necessarily. Start with SLOs and error budgets. As toil grows, dedicate engineering time to automation. When that is not enough, form a dedicated SRE team.