
Logging, Monitoring & Observability Guide

A guide to building observable systems with structured logging, metrics, and distributed tracing.

Introduction

Observability is the ability to understand a system’s internal state by examining its outputs. The three pillars — logs, metrics, and traces — provide different perspectives on system behavior.

The Three Pillars

| Pillar | Question | Granularity | Retention |
| --- | --- | --- | --- |
| Logs | What happened? | High (individual events) | Days to weeks |
| Metrics | How is it trending? | Low (aggregated) | Months to years |
| Traces | Where did time go? | Medium (request paths) | Days to weeks |

Structured Logging

Replace free-form text with machine-parseable JSON.

Format

{
  "timestamp": "2026-06-11T14:32:01Z",
  "level": "ERROR",
  "message": "Payment failed",
  "service": "billing-api",
  "trace_id": "abc123",
  "user_id": "user_456",
  "amount": 99.99,
  "error": "Card declined",
  "duration_ms": 245
}
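
As a rough illustration of how events in this shape could be produced with only the Python standard library, the sketch below defines a small formatter. The class name `JsonFormatter`, the logger name `billing`, and the way extra fields are detected are assumptions made for this example, not part of any library; exception details are left out for brevity.

```python
import json
import logging
import sys
from datetime import datetime, timezone

# Attribute names present on every LogRecord; anything else arrived via `extra=`.
_STANDARD_ATTRS = set(vars(logging.LogRecord("", 0, "", 0, "", (), None)))
_STANDARD_ATTRS |= {"message", "asctime"}


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter: renders each record as a single JSON object."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        event = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "message": record.getMessage(),
            "service": self.service,
        }
        # Copy caller-supplied fields (trace_id, user_id, ...) into the event.
        for key, value in vars(record).items():
            if key not in _STANDARD_ATTRS:
                event[key] = value
        # default=str keeps non-JSON types (dates, decimals) from raising.
        return json.dumps(event, default=str)


if __name__ == "__main__":
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter(service="billing-api"))
    log = logging.getLogger("billing")
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    log.error("Payment failed", extra={"trace_id": "abc123", "amount": 99.99})
```

Because every event is one JSON object per line, a log pipeline can filter on fields such as `trace_id` or `level` without regular expressions.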

Implementation (Python)

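import logging
import sys

import structlog

# filter_by_level defers to the stdlib logger, whose root level defaults to
# WARNING. Configure it first, or the INFO event below is silently dropped.
logging.basicConfig(format="%(message)s", stream=sys.stdout, level=logging.INFO)

structlog.configure(
    processors=[
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,  # adds the "level" field
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.JSONRenderer(),  # must be last: renders the event dict
    ],
    wrapper_class=structlog.stdlib.BoundLogger,
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
)

logger = structlog.get_logger()
logger.info("payment_processed", user_id="123", amount=49.99)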

Log Levels

| Level | Use Case | Example |
| --- | --- | --- |
| DEBUG | Development detail | Variable values, loop iterations |
| INFO | Normal operations | Request completed, job started |
| WARN | Unexpected but handled | Retry attempted, deprecated API used |
| ERROR | Failed operation | Request failed, exception caught |
| FATAL | System unavailability | Database connection lost |

Metrics

Metrics are numeric data points collected over time.

Metric Types

| Type | Description | Example |
| --- | --- | --- |
| Counter | Only increases | Requests served, errors occurred |
| Gauge | Can go up or down | Current queue size, memory usage |
| Histogram | Distribution of values | Request duration, payload size |
| Summary | Calculated percentiles | p95 latency, p99 latency |
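
To make the Histogram type more concrete, here is a minimal pure-Python sketch of the bookkeeping behind it: each observation lands in the first bucket whose upper bound is at least the value, and buckets are reported cumulatively, in the style of Prometheus `le` buckets. The `Histogram` class and its bucket bounds are hypothetical and written for this illustration only; a real client library also handles labels and thread safety.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class Histogram:
    """Illustrative histogram with fixed upper bounds (not a real client)."""

    bounds: tuple  # sorted upper bounds, e.g. seconds
    counts: list = field(init=False)
    total: float = field(default=0.0, init=False)
    count: int = field(default=0, init=False)

    def __post_init__(self) -> None:
        # One slot per bound plus a final catch-all (+Inf) slot.
        self.counts = [0] * (len(self.bounds) + 1)

    def observe(self, value: float) -> None:
        # bisect_left returns the first index whose bound is >= value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value
        self.count += 1

    def cumulative(self) -> list:
        """Return (upper_bound, observations <= upper_bound) pairs."""
        result, running = [], 0
        for bound, n in zip((*self.bounds, float("inf")), self.counts):
            running += n
            result.append((bound, running))
        return result


if __name__ == "__main__":
    latency = Histogram(bounds=(0.1, 0.5, 1.0))
    for seconds in (0.05, 0.1, 0.3, 2.0):
        latency.observe(seconds)
    print(latency.cumulative())
```

Only the bucket counts, the running sum, and the observation count are kept, which is why a histogram stays cheap no matter how many requests it records.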

Implementation (Prometheus)

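from prometheus_client import Counter, Histogram, start_http_server

# Counter labelled by method and status so errors can be sliced out later
requests_total = Counter('http_requests_total', 'Total requests', ['method', 'status'])
# Histogram of request latency in seconds (library default buckets)
request_duration = Histogram('http_request_duration_seconds', 'Request duration')

@request_duration.time()  # records how long each call takes
def handle_request():
    # ... process request
    # Count the request once its outcome is known
    requests_total.labels(method='GET', status='200').inc()

# Serves /metrics on port 8000 from a background thread; the process must keep running
start_http_server(8000)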

Distributed Tracing

Traces follow a request across multiple services.

Trace ID: abc123
├── Service A: 5ms  (HTTP request received)
├── Service B: 12ms (Auth check)
├── Service C: 45ms (Database query)
│   ├── Connection acquire: 2ms
│   ├── Query execution: 30ms
│   └── Result mapping: 13ms
└── Service D: 8ms  (Response formatting)
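
One way to read a trace like this is to compute each span's self time, meaning its duration minus the time covered by its children. The sketch below rebuilds the example tree with hypothetical `Span` objects; the 70 ms root duration is an assumption (the sum of the four services), and the code assumes child spans run one after another without overlapping.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical in-memory span: a name, a duration, and child spans."""

    name: str
    duration_ms: float
    children: list["Span"] = field(default_factory=list)

    def self_time_ms(self) -> float:
        # Time not accounted for by children (assumes sequential children).
        return self.duration_ms - sum(c.duration_ms for c in self.children)


def slowest_self_time(root: Span) -> Span:
    """Walk the tree and return the span with the largest self time."""
    best, stack = root, [root]
    while stack:
        span = stack.pop()
        if span.self_time_ms() > best.self_time_ms():
            best = span
        stack.extend(span.children)
    return best


def example_trace() -> Span:
    """The trace from the diagram above, with an assumed 70 ms root span."""
    return Span("Trace abc123", 70, [
        Span("Service A", 5),
        Span("Service B", 12),
        Span("Service C", 45, [
            Span("Connection acquire", 2),
            Span("Query execution", 30),
            Span("Result mapping", 13),
        ]),
        Span("Service D", 8),
    ])


if __name__ == "__main__":
    hot = slowest_self_time(example_trace())
    print(f"{hot.name}: {hot.self_time_ms()} ms")
```

Service C has the longest duration, but all of its time is spent in its children, so the query execution is where the time actually went.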

Implementation (OpenTelemetry)

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Register a global tracer provider for this process
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Batch finished spans and export them over OTLP/gRPC
# (with no arguments the exporter targets a local collector endpoint)
span_processor = BatchSpanProcessor(OTLPSpanExporter())
trace.get_tracer_provider().add_span_processor(span_processor)

# Everything inside the with-block is timed and recorded as one span
with tracer.start_as_current_span("process_payment") as span:
    span.set_attribute("payment.amount", 99.99)
    span.set_attribute("payment.currency", "USD")
    # ... business logic

Alerting

Alert on symptoms, not causes.

Alerting Rules

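# Prometheus alerting rule
groups:
  - name: api_alerts
    rules:
      - alert: HighErrorRate
        # Share of 5xx responses among all responses over 5 minutes, per scrape job.
        # A bare rate() of 5xx responses would measure errors per second, not a ratio.
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (job) (rate(http_requests_total[5m])) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.job }}"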
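
The `for: 2m` clause in the rule above means the condition has to hold continuously before the alert fires, which filters out short spikes. The sketch below imitates that inactive, pending, firing progression in plain Python. The `AlertState` class is hypothetical and heavily simplified; it is not how Prometheus is implemented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    """Simplified model of a threshold alert with a hold duration."""

    threshold: float
    hold_seconds: float
    pending_since: Optional[float] = None

    def evaluate(self, now: float, value: float) -> str:
        """Return 'inactive', 'pending', or 'firing' for one evaluation."""
        if value <= self.threshold:
            # Condition cleared: the hold timer starts over next time.
            self.pending_since = None
            return "inactive"
        if self.pending_since is None:
            self.pending_since = now
        if now - self.pending_since >= self.hold_seconds:
            return "firing"
        return "pending"


if __name__ == "__main__":
    alert = AlertState(threshold=0.1, hold_seconds=120)
    # (timestamp in seconds, observed value) pairs, one per evaluation
    samples = [(0, 0.2), (60, 0.3), (120, 0.2), (180, 0.05), (240, 0.2)]
    for now, value in samples:
        print(now, alert.evaluate(now, value))
```

A single dip below the threshold resets the timer, so the final sample in the usage example starts a new pending period instead of firing immediately.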

Alert Severity Levels

| Severity | Response Time | Example |
| --- | --- | --- |
| Critical | Immediate | Service down, data loss risk |
| Warning | Within 1 hour | Elevated error rate, high latency |
| Info | Next business day | Capacity approaching limit |

Best Practices

  • Use correlation IDs: Pass trace_id through every service call
  • Log at boundaries: Entry/exit of requests, jobs, and transactions
  • Avoid logging sensitive data: No passwords, tokens, or PII
  • Set SLOs and error budgets: Define what “good” means and measure against it
  • Avoid alert fatigue: Page only for actionable, critical issues
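
The SLO and error budget item above comes down to simple arithmetic, sketched here. The function names, the 30-day window, and the 99.9% target are illustrative assumptions and not recommendations.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1")
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent.

    Returns 1.0 when nothing has been spent and a negative number once
    the budget is exhausted.
    """
    allowed_failures = (1 - slo) * total_requests
    if allowed_failures == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    return 1 - failed_requests / allowed_failures


if __name__ == "__main__":
    # A 99.9% availability target over 30 days
    print(round(error_budget_minutes(0.999), 1), "minutes of budget")
    # 250 failures out of 1,000,000 requests against the same target
    print(round(budget_remaining(0.999, 1_000_000, 250), 2), "of budget left")
```

Tracking the remaining budget gives a concrete signal for when to slow down releases and when there is room to take risks.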

Common Mistakes

  • Logging everything at INFO level
  • Metrics without labels (no dimensions to slice by)
  • Alerting on CPU usage instead of user-facing symptoms
  • Storing logs indefinitely without a retention policy