Level: intermediate

Monitoring and Alerting — Metrics, Logs, and Dashboards

A practical guide to observability: the three pillars (metrics, logs, traces), RED and USE methods, alert design, and building dashboards that actually help.

Topics: devops

Introduction

You cannot improve what you cannot measure. Monitoring tells you when systems are unhealthy; alerting wakes you up when action is needed. But poorly designed alerting creates fatigue, burnout, and ignored pages. This guide covers the three pillars of observability, how to design actionable alerts, and how to build dashboards that help during incidents.

The Three Pillars of Observability

| Pillar | What It Answers | Example Tools |
| --- | --- | --- |
| Metrics | What is the system doing over time? | Prometheus, Datadog, CloudWatch |
| Logs | What happened in detail? | ELK, Loki, Splunk |
| Traces | Where did the request go and how long did each step take? | Jaeger, Zipkin, OpenTelemetry |

Metrics

Time-series data about system health. Cheap to store, fast to query.

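# Application-level custom metrics (requires the prometheus_client package)
from prometheus_client import Counter, Histogram

# Counter: total requests, labeled by HTTP method and endpoint
request_count = Counter('http_requests_total', 'Total requests', ['method', 'endpoint'])
# Histogram: request latency in seconds, labeled by endpoint
request_latency = Histogram('http_request_duration_seconds', 'Request latency', ['endpoint'])

def handle_request(request):
    # If paths contain IDs, label by route template (e.g. /orders/{id})
    # rather than the raw path, to keep label cardinality bounded.
    request_count.labels(method=request.method, endpoint=request.path).inc()
    # time() records the duration of the block as one histogram observation
    with request_latency.labels(endpoint=request.path).time():
        return process(request)  # process() stands in for the application's own handler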

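Key metric types: Counters (only increase, and reset to zero when the process restarts), Gauges (go up and down), Histograms (count observations into buckets, which makes percentile estimates possible).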
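
To make the histogram idea concrete, here is a minimal pure-Python sketch of cumulative buckets and a percentile estimate derived from them. The bucket bounds, the sample latencies, and the helper names `observe` and `estimate_quantile` are assumptions chosen for illustration; this is a simplified model of the technique, not the code of any particular metrics library.

```python
from bisect import bisect_left

# Upper bounds (seconds) of cumulative buckets; illustrative values.
BUCKET_BOUNDS = [0.1, 0.25, 0.5, 1.0, float("inf")]


def observe(counts, value):
    """Increment every cumulative bucket whose upper bound is >= value."""
    first = bisect_left(BUCKET_BOUNDS, value)
    for i in range(first, len(counts)):
        counts[i] += 1


def estimate_quantile(counts, q):
    """Estimate quantile q by linear interpolation inside the matching bucket."""
    total = counts[-1]
    if total == 0:
        return None
    rank = q * total
    for i, cumulative in enumerate(counts):
        if cumulative >= rank:
            upper = BUCKET_BOUNDS[i]
            if upper == float("inf"):
                # Cannot interpolate into the open-ended bucket;
                # report the largest finite bound instead.
                return BUCKET_BOUNDS[i - 1]
            lower = BUCKET_BOUNDS[i - 1] if i > 0 else 0.0
            below = counts[i - 1] if i > 0 else 0
            in_bucket = cumulative - below
            if in_bucket == 0:
                return lower
            return lower + (upper - lower) * (rank - below) / in_bucket
    return None


counts = [0] * len(BUCKET_BOUNDS)
for latency in [0.05, 0.08, 0.12, 0.2, 0.3, 0.4, 0.45, 0.6, 0.9, 2.0]:
    observe(counts, latency)

p50 = estimate_quantile(counts, 0.5)
p99 = estimate_quantile(counts, 0.99)
print(counts, p50, p99)
```

The estimate is only as precise as the bucket layout: any value that lands in the open-ended last bucket is reported as the largest finite bound, so bucket bounds should bracket the latencies you care about.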

Logs

Structured or unstructured records of discrete events.

{
  "timestamp": "2024-06-12T10:23:45Z",
  "level": "error",
  "service": "payment-api",
  "trace_id": "abc-123",
  "message": "Payment processor returned 503",
  "context": { "user_id": 456, "amount": 99.99 }
}

Rule: Use structured logs (JSON) in production. They are parseable and searchable, and a shared field such as trace_id lets you correlate them with traces.
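
One way to emit logs in this shape from Python is a custom formatter on the standard `logging` module. The sketch below is an assumption about how such a setup could look: the `JsonFormatter` class and `build_logger` helper are hypothetical names, and the field layout simply mirrors the example record above.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname.lower(),
            "service": self.service,
            # trace_id and context arrive through logging's `extra` argument.
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(entry)


def build_logger(service):
    """Return a logger that writes JSON lines to stderr (hypothetical helper)."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger = logging.getLogger(service)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger("payment-api")
logger.error(
    "Payment processor returned 503",
    extra={"trace_id": "abc-123", "context": {"user_id": 456, "amount": 99.99}},
)
```

Passing `trace_id` through `extra` keeps call sites short; in a real service the ID would more likely be injected automatically, for example by a logging filter.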

Traces

Follow a single request across services.

[Gateway] 2ms → [Auth] 15ms → [Orders] 45ms → [DB] 30ms → [Payment] 120ms

Traces reveal where latency actually lives. A request at the p99 of 500ms might spend 400ms in payment and only 100ms across everything else.
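
As a small worked example, the span durations from the trace above can be turned into a per-service share of total latency. The sketch assumes the spans run strictly one after another, which is a simplification: real traces nest and overlap spans. The function name `latency_breakdown` is hypothetical.

```python
# Span durations in milliseconds, taken from the example trace above.
SPANS = [("Gateway", 2), ("Auth", 15), ("Orders", 45), ("DB", 30), ("Payment", 120)]


def latency_breakdown(spans):
    """Return total latency and each span's share of it, slowest first.

    Assumes sequential, non-overlapping spans.
    """
    total = sum(ms for _, ms in spans)
    ranked = sorted(spans, key=lambda span: span[1], reverse=True)
    shares = [(name, ms, round(100 * ms / total, 1)) for name, ms in ranked]
    return total, shares


total_ms, shares = latency_breakdown(SPANS)
for name, ms, pct in shares:
    print(f"{name:<8} {ms:>4} ms  {pct:>5}%")
```

For this trace the total is 212 ms, and the payment span alone accounts for more than half of it.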

The RED Method (for Services)

Monitor every service with these three metrics:

| Metric | Question | Example Threshold |
| --- | --- | --- |
| Rate | How many requests per second? | Baseline: 1000 req/s |
| Errors | What percentage of requests fail? | Alert if > 0.1% for 2 minutes |
| Duration | How long do requests take? | Alert if p99 > 500ms for 5 minutes |
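
The three RED numbers can be computed from raw request records for a time window. The following is a minimal sketch under stated assumptions: the `Request` record, the `red_summary` function, and the synthetic data are all hypothetical, and the percentile uses the simple nearest-rank definition rather than whatever your metrics backend implements.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    failed: bool


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def red_summary(requests, window_seconds):
    """Compute Rate, Errors, and Duration for one window of requests."""
    if not requests:
        return None
    errors = sum(1 for r in requests if r.failed)
    return {
        "rate_per_s": len(requests) / window_seconds,
        "error_pct": 100 * errors / len(requests),
        "p99_ms": percentile([r.duration_ms for r in requests], 99),
    }


# Synthetic 10-second window: 983 normal, 15 slow, and 2 failed requests.
window = [Request(80, False)] * 983 + [Request(900, False)] * 15 + [Request(50, True)] * 2
summary = red_summary(window, window_seconds=10)
print(summary)
```

In this window both example thresholds from the table would be breached: the error percentage is above 0.1% and the p99 is above 500ms. A real alert would also require the breach to persist for the stated duration before firing.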

The USE Method (for Resources)

Monitor every resource (CPU, disk, network, memory) with these three:

| Metric | Question | Example Threshold |
| --- | --- | --- |
| Utilization | How busy is the resource? | CPU > 80% |
| Saturation | How much work is queued? | Disk queue depth > 10 |
| Errors | How many errors occurred? | Network packet drops > 0.1% |
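
A USE check can be sketched as one sample per resource compared against the three thresholds. Everything named here (`ResourceSample`, `THRESHOLDS`, `use_findings`) is hypothetical, and applying the same limits to every resource is a simplification; the table's thresholds each refer to a different resource type.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    name: str
    utilization_pct: float  # how busy the resource is
    saturation: float       # queued work, e.g. disk queue depth
    error_pct: float        # e.g. share of packets dropped


# Limits copied from the table above; tune them per resource in practice.
THRESHOLDS = {"utilization_pct": 80.0, "saturation": 10.0, "error_pct": 0.1}


def use_findings(sample):
    """Return one message per USE dimension that exceeds its limit."""
    findings = []
    for field, limit in THRESHOLDS.items():
        value = getattr(sample, field)
        if value > limit:
            findings.append(f"{sample.name}: {field}={value} exceeds {limit}")
    return findings


disk = ResourceSample("disk-1", utilization_pct=55.0, saturation=14.0, error_pct=0.0)
print(use_findings(disk))
```

The sample disk shows why saturation is tracked separately: utilization looks healthy at 55%, yet work is already queuing.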

Alert Design

Good Alerts Are Actionable

A good alert answers three questions:

  1. What is wrong? — clear metric name and threshold breached
  2. Where is it wrong? — service name, region, environment
  3. What should I do? — link to runbook or suggested action

Bad Alert Examples

| Bad Alert | Why It Is Bad |
| --- | --- |
| “CPU high” | On which server? For how long? What do I do? |
| “Disk usage > 90%” | Is this normal? Is it growing? Which service is affected? |
| “Log error rate increased” | By how much? Is it a spike or a trend? |

Good Alert Example

[SEV-2] payment-api p99 latency > 500ms in us-east-1
- Current: 750ms (baseline: 200ms)
- Duration: 8 minutes
- Runbook: https://wiki/runbooks/payment-latency
- Suggested action: Check payment processor status page
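
One way to enforce the three questions is to build alert messages from a structured record instead of free text. The sketch below is an illustration only: the `Alert` dataclass and its field names are assumptions, and the rendered layout copies the example above.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    severity: str
    service: str
    region: str
    metric: str
    threshold: str
    current: str
    baseline: str
    duration_minutes: int
    runbook_url: str = ""
    suggested_action: str = ""

    def render(self):
        # An alert with neither a runbook nor a suggested action cannot
        # answer "what should I do?", so refuse to render it.
        if not (self.runbook_url or self.suggested_action):
            raise ValueError("alert needs a runbook link or a suggested action")
        lines = [
            f"[{self.severity}] {self.service} {self.metric} > {self.threshold} in {self.region}",
            f"- Current: {self.current} (baseline: {self.baseline})",
            f"- Duration: {self.duration_minutes} minutes",
        ]
        if self.runbook_url:
            lines.append(f"- Runbook: {self.runbook_url}")
        if self.suggested_action:
            lines.append(f"- Suggested action: {self.suggested_action}")
        return "\n".join(lines)


alert = Alert(
    severity="SEV-2", service="payment-api", region="us-east-1",
    metric="p99 latency", threshold="500ms", current="750ms", baseline="200ms",
    duration_minutes=8,
    runbook_url="https://wiki/runbooks/payment-latency",
    suggested_action="Check payment processor status page",
)
print(alert.render())
```

Making the "what" and "where" fields required and rejecting alerts without any guidance means a vague alert like “CPU high” fails at definition time rather than at 3 a.m.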

Alert Severity

| Severity | Response Time | Action |
| --- | --- | --- |
| Page (Critical) | 5 minutes | Wake someone up |
| Ticket (Warning) | 4 hours | Create a ticket for business hours |
| Log (Info) | None | Record for dashboards and analysis |

Rule: If an alert fires and no one takes action, downgrade it to a ticket or log.
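
The downgrade rule can be applied mechanically if you record, for each time an alert fires, whether anyone acted on it. This sketch assumes such a history exists as `(alert_name, action_taken)` pairs; the function name `downgrade_candidates` and the `min_fires` cutoff are hypothetical choices, not an established convention.

```python
from collections import defaultdict


def downgrade_candidates(history, min_fires=3):
    """Return alerts that fired at least min_fires times and never led to action."""
    fired = defaultdict(int)
    acted = defaultdict(int)
    for name, action_taken in history:
        fired[name] += 1
        if action_taken:
            acted[name] += 1
    return sorted(
        name for name, count in fired.items()
        if count >= min_fires and acted[name] == 0
    )


# Illustrative history: (alert name, did the responder take action?)
history = (
    [("cpu-high", False)] * 4
    + [("payment-latency", True), ("payment-latency", False)]
    + [("disk-90", False)]
)
print(downgrade_candidates(history))
```

Requiring a minimum number of firings avoids downgrading an alert on the strength of a single quiet page; `disk-90` above fired only once, so it is left alone for now.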

Dashboard Design

The 5-Second Rule

A dashboard should tell you if the system is healthy in 5 seconds.

| Row | Purpose | Example Panels |
| --- | --- | --- |
| Row 1: Health | Is the system up? | Error rate, availability SLA, throughput |
| Row 2: Latency | Are we fast enough? | p50, p95, p99 latency by endpoint |
| Row 3: Resources | Are we running out of capacity? | CPU, memory, disk, network |
| Row 4: Business | Are users happy? | Sign-ups, checkouts, active sessions |

Dashboard Anti-Patterns

  • 50 panels on one screen — information overload
  • Dashboards no one looks at — if it is not reviewed weekly, delete it
  • Static thresholds that never change — tune alerts as baselines shift

Best Practices

  • Instrument before you need it — adding metrics during an incident is too late
  • Use percentiles, not averages — averages hide outliers; p95 and p99 tell the real story
  • Correlation IDs everywhere — tie logs, metrics, and traces to a single request ID
  • Alert on symptoms, not causes — “users cannot check out” is better than “CPU is high”
  • Review alerts quarterly — remove noise, tune thresholds, consolidate duplicates
  • Test your runbooks — a runbook that has not been tested in 6 months is probably wrong
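
The point about percentiles versus averages is easy to see with numbers. The data below is synthetic and chosen for illustration, and `nearest_rank` is a hypothetical helper using the simple nearest-rank definition of a percentile.

```python
import math
import statistics


def nearest_rank(values, pct):
    """Smallest sample with at least pct percent of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Illustrative data: 98 fast requests and 2 that hit a 5-second timeout.
latencies_ms = [100] * 98 + [5000] * 2

mean_ms = statistics.mean(latencies_ms)
median_ms = statistics.median(latencies_ms)
p95_ms = nearest_rank(latencies_ms, 95)
p99_ms = nearest_rank(latencies_ms, 99)
print(mean_ms, median_ms, p95_ms, p99_ms)
```

The mean comes out at 198 ms, a latency that no request actually experienced, while the p99 is 5000 ms. The p95 is still 100 ms here, which is a reminder that the percentile you pick determines which outliers you can see.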

Common Mistakes

  • Alerting on every possible failure mode — alert fatigue kills response quality
  • Not having a “canary” metric — deploy a change and watch a single golden metric
  • Ignoring baseline shifts — if p99 drifts from 100ms to 300ms over a month, investigate before it becomes an incident
  • Dashboards without owners — someone must own and maintain each dashboard
  • No post-incident metric review — after every incident, ask “what metric would have caught this earlier?”
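
A slow baseline shift like the p99 drift mentioned above can be flagged by comparing an early window with a recent one. This is a rough sketch under stated assumptions: one p99 sample per day, a 7-day window, and a 1.5x ratio as the cutoff are all arbitrary illustrative choices, and `baseline_shift` is a hypothetical name.

```python
import statistics


def baseline_shift(daily_p99_ms, window=7, max_ratio=1.5):
    """Compare the median of the first and last `window` days of p99 samples."""
    if len(daily_p99_ms) < 2 * window:
        return None  # not enough history to compare two windows
    old = statistics.median(daily_p99_ms[:window])
    new = statistics.median(daily_p99_ms[-window:])
    ratio = new / old
    return {"old_ms": old, "new_ms": new, "ratio": ratio, "shifted": ratio > max_ratio}


# Illustrative month: p99 drifts linearly from 100 ms to 300 ms.
drifting = [100 + 200 * day / 29 for day in range(30)]
steady = [100.0] * 30

print(baseline_shift(drifting))
print(baseline_shift(steady))
```

Using medians over a window keeps a single bad day from triggering the check, so it reacts to sustained drift rather than spikes.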

Frequently Asked Questions

Should I build my own monitoring or buy a SaaS?

Buy until monitoring itself becomes a strategic differentiator. Prometheus and Grafana are free to license but require expertise to run. Datadog and New Relic cost money but work almost immediately. Start with SaaS; move to self-hosted only if the cost savings justify the operational overhead.

What is the difference between monitoring and observability?

Monitoring asks known questions (“is CPU high?”). Observability enables asking unknown questions (“why is this user experiencing 5-second latency?”). Monitoring is a subset of observability. You need both.

How many alerts should a service have?

3-5 critical alerts (pages), 5-10 warnings (tickets), unlimited info (logs/dashboards). More than 10 critical alerts usually means you are paging on internal causes rather than on symptoms that users actually feel.