Monitoring and Alerting — Metrics, Logs, and Dashboards
A practical guide to observability: the three pillars (metrics, logs, traces), RED and USE methods, alert design, and building dashboards that actually help.
Monitoring and Alerting — Metrics, Logs, and Dashboards
Introduction
You cannot improve what you cannot measure. Monitoring tells you when systems are unhealthy; alerting wakes you up when action is needed. But poorly designed alerting creates fatigue, burnout, and ignored pages. This guide covers the three pillars of observability, how to design actionable alerts, and how to build dashboards that help during incidents.
The Three Pillars of Observability
| Pillar | What It Answers | Example Tools |
|---|---|---|
| Metrics | What is the system doing over time? | Prometheus, Datadog, CloudWatch |
| Logs | What happened in detail? | ELK, Loki, Splunk |
| Traces | Where did the request go and how long did each step take? | Jaeger, Zipkin, OpenTelemetry |
Metrics
Time-series data about system health. Cheap to store, fast to query.
# Application-level custom metric
from prometheus_client import Counter, Histogram
request_count = Counter('http_requests_total', 'Total requests', ['method', 'endpoint'])
request_latency = Histogram('http_request_duration_seconds', 'Request latency', ['endpoint'])
def handle_request(request):
request_count.labels(method=request.method, endpoint=request.path).inc()
with request_latency.labels(endpoint=request.path).time():
return process(request)
Key metric types: Counters (always increase), Gauges (go up and down), Histograms (buckets of values).
Logs
Structured or unstructured records of discrete events.
{
"timestamp": "2024-06-12T10:23:45Z",
"level": "error",
"service": "payment-api",
"trace_id": "abc-123",
"message": "Payment processor returned 503",
"context": { "user_id": 456, "amount": 99.99 }
}
Rule: Use structured logs (JSON) in production. They are parseable, searchable, and correlate with traces.
Traces
Follow a single request across services.
[Gateway] 2ms → [Auth] 15ms → [Orders] 45ms → [DB] 30ms → [Payment] 120ms
Traces reveal where latency actually lives. A p99 of 500ms might be 400ms in payment and 100ms everywhere else.
The RED Method (for Services)
Monitor every service with these three metrics:
| Metric | Question | Example Threshold |
|---|---|---|
| Rate | How many requests per second? | Baseline: 1000 req/s |
| Errors | What percentage of requests fail? | Alert if > 0.1% for 2 minutes |
| Duration | How long do requests take? | Alert if p99 > 500ms for 5 minutes |
The USE Method (for Resources)
Monitor every resource (CPU, disk, network, memory) with these three:
| Metric | Question | Example Threshold |
|---|---|---|
| Utilization | How busy is the resource? | CPU > 80% |
| Saturation | How much work is queued? | Disk queue depth > 10 |
| Errors | How many errors occurred? | Network packet drops > 0.1% |
Alert Design
Good Alerts Are Actionable
A good alert answers three questions:
- What is wrong? — clear metric name and threshold breached
- Where is it wrong? — service name, region, environment
- What should I do? — link to runbook or suggested action
Bad Alert Examples
| Bad Alert | Why It Is Bad |
|---|---|
| ”CPU high” | On which server? For how long? What do I do? |
| ”Disk usage > 90%“ | Is this normal? Is it growing? Which service is affected? |
| ”Log error rate increased” | By how much? Is it a spike or a trend? |
Good Alert Example
[SEV-2] payment-api p99 latency > 500ms in us-east-1
- Current: 750ms (baseline: 200ms)
- Duration: 8 minutes
- Runbook: https://wiki/runbooks/payment-latency
- Suggested action: Check payment processor status page
Alert Severity
| Severity | Response Time | Action |
|---|---|---|
| Page (Critical) | 5 minutes | Wake someone up |
| Ticket (Warning) | 4 hours | Create a ticket for business hours |
| Log (Info) | None | Record for dashboards and analysis |
Rule: If an alert fires and no one takes action, downgrade it to a ticket or log.
Dashboard Design
The 5-Second Rule
A dashboard should tell you if the system is healthy in 5 seconds.
| Row | Purpose | Example Panels |
|---|---|---|
| Row 1: Health | Is the system up? | Error rate, availability SLA, throughput |
| Row 2: Latency | Are we fast enough? | p50, p95, p99 latency by endpoint |
| Row 3: Resources | Are we running out of capacity? | CPU, memory, disk, network |
| Row 4: Business | Are users happy? | Sign-ups, checkouts, active sessions |
Dashboard Anti-Patterns
- 50 panels on one screen — information overload
- Dashboards no one looks at — if it is not reviewed weekly, delete it
- Static thresholds that never change — tune alerts as baselines shift
Best Practices
- Instrument before you need it — adding metrics during an incident is too late
- Use percentiles, not averages — averages hide outliers; p95 and p99 tell the real story
- Correlation IDs everywhere — tie logs, metrics, and traces to a single request ID
- Alert on symptoms, not causes — “users cannot check out” is better than “CPU is high”
- Review alerts quarterly — remove noise, tune thresholds, consolidate duplicates
- Test your runbooks — a runbook that has not been tested in 6 months is probably wrong
Common Mistakes
- Alerting on every possible failure mode — alert fatigue kills response quality
- Not having a “canary” metric — deploy a change and watch a single golden metric
- Ignoring baseline shifts — if p99 drifts from 100ms to 300ms over a month, investigate before it becomes an incident
- Dashboards without owners — someone must own and maintain each dashboard
- No post-incident metric review — after every incident, ask “what metric would have caught this earlier?”
Frequently Asked Questions
Should I build my own monitoring or buy a SaaS?
Buy until it is a strategic differentiator. Prometheus + Grafana is free but requires expertise. Datadog/New Relic cost money but work immediately. Start with SaaS; move to self-hosted only if costs justify the operational overhead.
What is the difference between monitoring and observability?
Monitoring asks known questions (“is CPU high?”). Observability enables asking unknown questions (“why is this user experiencing 5-second latency?”). Monitoring is a subset of observability. You need both.
How many alerts should a service have?
3-5 critical alerts (pages), 5-10 warnings (tickets), unlimited info (logs/dashboards). More than 10 critical alerts means you are alerting on symptoms, not user impact.