intermediate By StackPractices

Metrics and Dashboards — From Raw Data to Actionable Insights

A practical guide to metrics and dashboards: instrumenting applications, choosing metric types, building effective dashboards, and creating alerting pipelines with Prometheus, Grafana, and Datadog.

Note: This guide follows English-language naming conventions and terminology standards common in international development teams. Examples use English identifiers and comments to maximize compatibility across codebases and tooling.

Overview

Metrics are numerical measurements collected over time that tell you how your systems behave. Dashboards visualize those metrics to make patterns visible. Together, they form the foundation of operational awareness, enabling teams to spot trends, detect anomalies, and make data-driven decisions.

This guide covers metric types, instrumentation patterns, dashboard design, and alert creation.

When to Use

  • You need to monitor system health and performance over time
  • You want to detect trends before they become incidents
  • Your team needs a shared operational picture
  • You are establishing SLOs and need to measure compliance
  • You want to reduce MTTR with visual, queryable data

Core Concepts

| Concept | Description |
| --- | --- |
| Counter | A cumulative metric that only increases (requests served, errors) |
| Gauge | A metric that can go up or down (temperature, queue depth, CPU) |
| Histogram | Samples observations into configurable buckets (request duration) |
| Summary | Similar to histogram but calculates percentiles client-side |
| Cardinality | Number of unique time series (high cardinality = expensive) |
| SLI / SLO / SLA | Service Level Indicator, Objective, and Agreement |

Metric Types and When to Use Them

| Type | Use Case | Example | Do Not Use For |
| --- | --- | --- | --- |
| Counter | Counting events | http_requests_total | Values that decrease |
| Gauge | Point-in-time values | memory_usage_bytes, queue_size | Rates or cumulative counts |
| Histogram | Distribution of values | request_duration_seconds | Exact percentile calculation (use summary) |
| Summary | Pre-computed percentiles | request_latency_quantile | When you need histogram heatmaps |
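
Counters are cumulative, so they become useful once turned into a rate. The following is a minimal sketch, assuming hypothetical scrape samples, of how an increase and a per-second rate can be derived from raw counter values while treating any drop as a process restart. It only illustrates the idea; Prometheus' own rate() additionally extrapolates to the edges of the query window, which this sketch does not attempt.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (unix_timestamp_seconds, counter_value)


def counter_increase(samples: Sequence[Sample]) -> float:
    """Total increase across the samples, treating any drop as a counter reset."""
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            # The process restarted and the counter began again from zero,
            # so everything counted since the restart is new.
            increase += curr
    return increase


def counter_rate(samples: Sequence[Sample]) -> float:
    """Average per-second rate between the first and last sample."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed if elapsed > 0 else 0.0


# Hypothetical scrapes 15 s apart, with a restart before the fourth sample.
scrapes = [(0, 100), (15, 130), (30, 160), (45, 20), (60, 50)]
print(counter_increase(scrapes))  # 30 + 30 + 20 + 30
print(counter_rate(scrapes))      # the increase divided by 60 seconds
```

This reset handling is the reason a value that can legitimately decrease belongs in a gauge: a drop in a counter would be misread as a restart.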

Step-by-Step Metrics and Dashboards

1. Instrument Your Applications

Expose metrics in a format your collector understands:

# Example: Python application metrics with Prometheus client
import time

from flask import Flask, jsonify
from prometheus_client import Counter, Gauge, Histogram, start_http_server

app = Flask(__name__)


class OrderNotFound(Exception):
    """Placeholder: raised when an order ID does not exist."""


def fetch_order(order_id):
    """Placeholder lookup; replace with your data access code."""
    raise OrderNotFound(order_id)


# Define metrics
http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'status', 'path']
)

request_duration_seconds = Histogram(
    'request_duration_seconds',
    'HTTP request duration',
    ['method', 'path'],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)

active_connections = Gauge(
    'active_connections',
    'Number of active connections'
)

# Instrument your code
@app.route("/api/orders/<order_id>")
def get_order(order_id):
    start = time.perf_counter()
    active_connections.inc()

    try:
        order = fetch_order(order_id)
        # Label with the route template, not the raw URL, to keep cardinality bounded
        http_requests_total.labels(method='GET', status='200', path='/api/orders').inc()
        return jsonify(order)
    except OrderNotFound:
        http_requests_total.labels(method='GET', status='404', path='/api/orders').inc()
        return jsonify({"error": "Not found"}), 404
    finally:
        request_duration_seconds.labels(method='GET', path='/api/orders').observe(time.perf_counter() - start)
        active_connections.dec()

if __name__ == "__main__":
    # Expose metrics on :8000/metrics, separate from the application port
    start_http_server(8000)
    app.run(port=5000)

// Example: Spring Boot with Micrometer
@Configuration
public class MetricsConfig {
    
    @Bean
    public MeterRegistryCustomizer<MeterRegistry> metricsCommonTags() {
        return registry -> registry.config()
            .commonTags("application", "orders-service");
    }
}

@Service
public class OrderService {
    private final Counter orderCounter;
    private final Timer orderTimer;
    
    public OrderService(MeterRegistry registry) {
        this.orderCounter = Counter.builder("orders.processed")
            .description("Total orders processed")
            .register(registry);
        this.orderTimer = Timer.builder("orders.processing.time")
            .description("Order processing time")
            .register(registry);
    }
    
    public Order processOrder(OrderRequest request) {
        // Timer.record(Supplier) avoids the checked Exception that recordCallable declares
        return orderTimer.record(() -> {
            Order result = doProcess(request);
            orderCounter.increment();
            return result;
        });
    }
}

Instrumentation checklist:

  • Instrument the four golden signals: latency, traffic, errors, saturation
  • Add labels for dimensions you will filter by (service, environment, endpoint)
  • Use consistent naming: a _total suffix for counters and base units in the name (_seconds for durations, _bytes for sizes)
  • Avoid high-cardinality labels (user IDs, session IDs, request IDs)
  • Measure business metrics (orders placed, payments processed) alongside technical metrics
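
To make the cardinality point in the checklist concrete, here is a rough sketch of how label choices multiply into time series. The label value counts are hypothetical, and the histogram estimate assumes each label combination exposes one series per finite bucket plus the +Inf bucket, _sum, and _count. Treat the results as upper bounds, since not every label combination occurs in practice.

```python
from math import prod
from typing import Dict


def estimate_series(label_values: Dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_values.values())


def histogram_series(label_values: Dict[str, int], finite_buckets: int) -> int:
    """Histograms add one series per bucket (plus +Inf), _sum, and _count."""
    return estimate_series(label_values) * (finite_buckets + 3)


# Hypothetical counts of distinct values per label.
bounded = {"method": 5, "status": 6, "path": 40}
print(estimate_series(bounded))  # 5 * 6 * 40 = 1200

# Adding a per-user label multiplies every existing combination.
unbounded = {**bounded, "user_id": 100_000}
print(estimate_series(unbounded))  # 120000000

# A histogram with six finite buckets, labeled by method and path.
print(histogram_series({"method": 5, "path": 40}, finite_buckets=6))  # 200 * 9 = 1800
```

One extra unbounded label turns about a thousand series into over a hundred million, which is why user, session, and request IDs belong in logs or traces rather than metric labels.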

2. Collect and Store Metrics

Set up a metrics pipeline:

# Example: Prometheus scrape configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

Collection best practices:

  • Scrape every 10-30 seconds (faster for high-frequency changes)
  • Use service discovery (Kubernetes, Consul, DNS) instead of static targets
  • Run collectors in each region/zone to minimize latency
  • Use remote write for long-term storage (Thanos, Cortex, VictoriaMetrics)
  • Use federation for hierarchical aggregation (edge → regional → global)
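
The scrape interval is a trade-off between resolution and ingestion load. The back-of-the-envelope sketch below, assuming a hypothetical fleet of 100,000 active series, shows how the interval changes the number of samples the storage layer has to absorb. It ignores compression and retention, so it is only a starting point for capacity planning.

```python
SECONDS_PER_DAY = 86_400


def samples_per_second(active_series: int, scrape_interval_s: float) -> float:
    """Average ingestion rate when every series is scraped once per interval."""
    return active_series / scrape_interval_s


def samples_per_day(active_series: int, scrape_interval_s: float) -> float:
    """Total samples written in 24 hours at a fixed scrape interval."""
    return active_series * SECONDS_PER_DAY / scrape_interval_s


ACTIVE_SERIES = 100_000  # hypothetical number of active series
for interval in (10, 15, 30, 60):
    print(
        f"{interval:>2}s interval: "
        f"{samples_per_second(ACTIVE_SERIES, interval):>8.0f} samples/s, "
        f"{samples_per_day(ACTIVE_SERIES, interval):>13,.0f} samples/day"
    )
```

Halving the interval doubles the ingestion rate, so shorter intervals are best reserved for the jobs that actually need the extra resolution.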

3. Build Effective Dashboards

Design dashboards that tell a story:

| Dashboard Type | Purpose | Key Panels |
| --- | --- | --- |
| Service overview | Health of a single service | Error rate, latency p95, throughput, resource usage |
| Golden signals | Cross-service health | RED metrics (Rate, Errors, Duration) per service |
| Business KPI | Impact on revenue/usage | Conversions, active users, transaction volume |
| Infrastructure | Cluster/node health | CPU, memory, disk, network across all nodes |
| Incident response | Drill-down during incidents | Detailed per-endpoint latency, error breakdown, logs |

Example: Grafana dashboard JSON snippet (simplified)
{
  "dashboard": {
    "title": "Orders Service - Golden Signals",
    "panels": [
      {
        "title": "Request Rate",
        "type": "timeseries",
        "targets": [{
          "expr": "sum(rate(http_requests_total{service=\"orders\"}[5m])) by (status)"
        }]
      },
      {
        "title": "Error Rate",
        "type": "stat",
        "targets": [{
          "expr": "sum(rate(http_requests_total{service=\"orders\",status=~\"5..\"}[5m])) / sum(rate(http_requests_total{service=\"orders\"}[5m]))"
        }],
        "thresholds": [
          {"value": 0.001, "color": "green"},
          {"value": 0.01, "color": "yellow"},
          {"value": 0.05, "color": "red"}
        ]
      },
      {
        "title": "Latency p95",
        "type": "timeseries",
        "targets": [{
          "expr": "histogram_quantile(0.95, sum(rate(request_duration_seconds_bucket{service=\"orders\"}[5m])) by (le))"
        }]
      }
    ]
  }
}
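
The Latency p95 panel relies on histogram_quantile, which estimates a quantile from cumulative bucket counts rather than from raw observations. The sketch below approximates that estimation with linear interpolation inside the bucket that contains the target rank. The bucket counts are invented for illustration, and the code is a simplified model of the idea, not Prometheus' implementation.

```python
from math import inf
from typing import List, Tuple

Bucket = Tuple[float, float]  # (upper bound "le", cumulative count)


def histogram_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate the q-quantile (0..1) from cumulative histogram buckets."""
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(buckets)
    total = ordered[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # the lowest bucket is assumed to start at 0
    for bound, count in ordered:
        if count >= rank:
            if bound == inf:
                # The quantile lies beyond the last finite bucket; the best
                # available answer is that bucket's upper bound.
                return prev_bound
            if count == prev_count:
                return prev_bound
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for the bucket layout used earlier in this guide.
observed = [(0.01, 10), (0.05, 50), (0.1, 200), (0.5, 900),
            (1.0, 980), (5.0, 1000), (inf, 1000)]
print(histogram_quantile(0.95, observed))  # lands between 0.5 and 1.0
```

Because the estimate interpolates within a bucket, its accuracy depends on bucket boundaries: place boundaries close to the latency thresholds you care about.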

Dashboard design principles:

  • Put the most important panels at the top-left
  • Use consistent colors: green = good, yellow = warning, red = critical
  • Add links to related dashboards, logs, and traces
  • Keep the number of panels per dashboard under 20
  • Use template variables for service and environment so one dashboard serves many targets

4. Define SLIs and SLOs

Translate metrics into reliability targets:

# Example: SLI queries for common objectives

# Availability SLI: % of successful requests
(
  sum(rate(http_requests_total{status!~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) * 100

# Latency SLI: % of requests under threshold
(
  sum(rate(request_duration_seconds_bucket{le="0.5"}[5m]))
  /
  sum(rate(request_duration_seconds_bucket{le="+Inf"}[5m]))
) * 100

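# Error budget: total errors the SLO allows over the window
# SLO: 99.9% availability
# Error budget: 0.1% of total requests per 30 days
0.001 * sum(increase(http_requests_total[30d]))

# Remaining budget: allowed errors minus errors already observed
0.001 * sum(increase(http_requests_total[30d]))
  - sum(increase(http_requests_total{status=~"5.."}[30d]))

| Objective | SLI | SLO | Measurement Window |
| --- | --- | --- | --- |
| Availability | Successful requests / total requests | 99.9% | 30 days |
| Latency | Requests under 200ms / total requests | 99% under 200ms | 7 days |
| Error rate | Error responses / total responses | < 0.1% | 1 hour |
| Throughput | Requests per second | > 1000 rps | 5 minutes |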
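
The error budget arithmetic behind the availability objective can be worked through in a few lines. This is a minimal sketch with made-up request counts; the function and field names are illustrative, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    allowed_errors: float     # failures the SLO tolerates in the window
    remaining_errors: float   # allowed minus observed (negative = SLO missed)
    consumed_fraction: float  # share of the budget already spent


def error_budget(slo: float, total_requests: int, failed_requests: int) -> ErrorBudget:
    """Compute the error budget for a request-based SLO such as 0.999."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    allowed = (1 - slo) * total_requests
    consumed = failed_requests / allowed if allowed > 0 else float("inf")
    return ErrorBudget(allowed, allowed - failed_requests, consumed)


# Hypothetical 30-day window: 10 million requests, 4,000 of them failed.
budget = error_budget(slo=0.999, total_requests=10_000_000, failed_requests=4_000)
print(f"allowed:   {budget.allowed_errors:,.0f}")
print(f"remaining: {budget.remaining_errors:,.0f}")
print(f"consumed:  {budget.consumed_fraction:.0%}")
```

With these numbers the service may fail about 10,000 requests in the window and has spent roughly 40% of that allowance. A negative remaining value means the objective was missed.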

5. Create Meaningful Alerts

Alert on symptoms, not causes:

# Example: Prometheus alerting rules
groups:
  - name: service_alerts
    rules:
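      # Group by service so {{ $labels.service }} resolves in the annotations;
      # a bare sum() would drop every label. Assumes series carry a service label.
      - alert: HighErrorRate
        expr: |
          (
            sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum by (service) (rate(http_requests_total[5m]))
          ) > 0.01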
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate in {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: LatencyDegradation
        expr: |
          histogram_quantile(0.95,
            sum(rate(request_duration_seconds_bucket[5m])) by (le)
          ) > 1.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Latency p95 above 1s"

Alert design principles:

  • Alert on user-impacting symptoms (error rate, latency), not causes (disk full)
  • Use a for: duration so an alert fires only after a sustained threshold breach, which reduces noise
  • Add runbook links and dashboard links to every alert
  • Severity levels: page for critical (user impact), ticket for warning (trending)
  • Regularly review alert frequency and tune thresholds
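
The effect of a for: duration can be modeled as a small state machine: a breach puts the alert into a pending state, and it only fires once the breach has lasted the full duration without interruption. The sketch below is a simplified model with hypothetical evaluation data, not how any particular alerting engine is implemented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SustainedAlert:
    threshold: float
    for_seconds: float
    breached_since: Optional[float] = None  # time the current breach began

    def evaluate(self, now: float, value: float) -> str:
        """Return 'inactive', 'pending', or 'firing' for one evaluation cycle."""
        if value <= self.threshold:
            # Any recovery resets the clock, which filters out brief spikes.
            self.breached_since = None
            return "inactive"
        if self.breached_since is None:
            self.breached_since = now
        if now - self.breached_since >= self.for_seconds:
            return "firing"
        return "pending"


# Error ratio evaluated every 60 s against a 1% threshold with a 5-minute hold.
alert = SustainedAlert(threshold=0.01, for_seconds=300)
ratios = [0.02, 0.03, 0.005, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02]
states = [alert.evaluate(now=i * 60, value=v) for i, v in enumerate(ratios)]
print(states)
```

In this run the dip at the third evaluation resets the pending timer, so the alert fires only at the last evaluation, five minutes after the second breach began.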

Best Practices

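  • Name metrics consistently. Follow a namespace_name_unit pattern, for example orders_service_request_duration_seconds, and end counters with _total, as in orders_service_requests_total.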
  • Document your metrics. Every metric needs a description and unit.
  • Use histograms over averages. Averages hide outliers; histograms show distribution.
  • Cardinality is cost. Every unique label combination creates a new time series.
  • Dashboards are for exploration, not monitoring. Alerts notify; dashboards investigate.
  • Test your dashboards. Walk through incident scenarios to verify they provide answers.

Common Mistakes

  • High-cardinality metrics. Labeling by user ID or request ID explodes storage.
  • Alerting on everything. Too many alerts create noise and reduce response quality.
  • Missing units. A metric named latency is ambiguous — latency_seconds is clear.
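  • Averaging percentiles. The average of per-service p95s is not the p95 of the combined traffic. Sum the histogram buckets across services first, then compute the quantile.
  • No aggregation rules. Dashboards that query raw high-frequency series on every refresh become slow; precompute aggregates with recording rules first.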
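
The averaging-percentiles mistake is easy to demonstrate with numbers. The sketch below uses two hypothetical services with very different traffic volumes and a simple nearest-rank percentile; the latencies are invented for illustration.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in seconds: a busy fast service and a quiet slow one.
busy_service = [0.1] * 960
quiet_service = [2.0] * 40

p95_busy = percentile(busy_service, 95)
p95_quiet = percentile(quiet_service, 95)
averaged = (p95_busy + p95_quiet) / 2

# The correct approach: combine the observations, then take the percentile.
combined = percentile(busy_service + quiet_service, 95)

print(f"average of p95s:  {averaged:.2f}s")
print(f"p95 of all calls: {combined:.2f}s")
```

The averaged figure suggests a p95 above one second, while 96% of all requests actually finish in 0.1 seconds. Averaging ignores how much traffic each service handles, which is why the buckets need to be merged before the quantile is computed.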

Variants

  • Pull-based: Prometheus scrapes exporters (standard for Kubernetes)
  • Push-based: StatsD, Telegraf, or application pushes to collector (better for short-lived jobs)
  • Cloud-native: AWS CloudWatch, Google Cloud Monitoring, Azure Monitor (managed, but vendor-specific)
  • Enterprise: Datadog, New Relic, Dynatrace (rich features, usage-based pricing such as per host or per ingested volume)

FAQ

Q: How many metrics should my application expose? 10-50 well-chosen metrics beat 1000 auto-generated ones. Focus on the four golden signals and business KPIs.

Q: What scrape interval should I use? 15 seconds is a common choice. Use 5 seconds for critical systems and 60 seconds for slow-changing infrastructure.

Q: How do I handle metric cardinality? Keep label values bounded: use the status code class rather than every exact code, and the route template rather than the raw URL. Drop high-cardinality labels at ingestion if necessary.
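
One way to keep label values bounded is to normalize them before they reach the metric. The sketch below collapses numeric and UUID path segments into a placeholder and reduces status codes to their class. The helper names and the :id placeholder are illustrative choices; where your web framework exposes the matched route template, using that directly is usually simpler.

```python
import re

# Path segments that look like identifiers: all digits, or a UUID.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)


def normalize_path(path: str) -> str:
    """Drop the query string and replace ID-like segments with ':id'."""
    path = path.split("?", 1)[0]
    segments = [":id" if _ID_SEGMENT.match(s) else s for s in path.split("/")]
    return "/".join(segments)


def status_class(status: int) -> str:
    """Collapse an HTTP status code into its class, e.g. 404 -> '4xx'."""
    return f"{status // 100}xx"


print(normalize_path("/api/orders/12345?expand=items"))
print(normalize_path("/api/users/3f2b8c1e-9a4d-4e2b-8f6a-1c2d3e4f5a6b/profile"))
print(status_class(404))
```

After normalization, millions of distinct URLs map onto a handful of label values, so the number of series depends on the number of routes rather than the number of orders or users.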

Q: Should I use Prometheus or a SaaS solution? Prometheus is free but requires operational expertise. SaaS solutions reduce overhead but increase cost at scale. Many teams use both: Prometheus for real-time, SaaS for long-term.

Conclusion

Metrics and dashboards transform raw system data into operational intelligence. By instrumenting consistently, designing dashboards for decision-making, and alerting on symptoms rather than causes, you build an observability practice that reduces MTTR and improves system reliability.