Metrics and Dashboards — From Raw Data to Actionable Insights
A practical guide to metrics and dashboards: instrumenting applications, choosing metric types, building effective dashboards, and creating alerting pipelines with Prometheus, Grafana, and Datadog.
Note: This guide follows English-language naming conventions and terminology standards common in international development teams. Examples use English identifiers and comments to maximize compatibility across codebases and tooling.
Overview
Metrics are numerical measurements collected over time that tell you how your systems behave. Dashboards visualize those metrics to make patterns visible. Together, they form the foundation of operational awareness, enabling teams to spot trends, detect anomalies, and make data-driven decisions.
This guide covers metric types, instrumentation patterns, dashboard design, and alert creation.
When to Use
- You need to monitor system health and performance over time
- You want to detect trends before they become incidents
- Your team needs a shared operational picture
- You are establishing SLOs and need to measure compliance
- You want to reduce MTTR with visual, queryable data
Core Concepts
| Concept | Description |
|---|---|
| Counter | A cumulative metric that only increases (requests served, errors) |
| Gauge | A metric that can go up or down (temperature, queue depth, CPU) |
| Histogram | Samples observations into configurable buckets (request duration) |
| Summary | Similar to histogram but calculates percentiles client-side |
| Cardinality | Number of unique time series (high cardinality = expensive) |
| SLI / SLO / SLA | Service Level Indicator, Objective, and Agreement |
Metric Types and When to Use Them
| Type | Use Case | Example | Do Not Use For |
|---|---|---|---|
| Counter | Counting events | http_requests_total | Values that decrease |
| Gauge | Point-in-time values | memory_usage_bytes, queue_size | Rates or cumulative counts |
| Histogram | Distribution of values | request_duration_seconds | Exact percentile calculation (use summary) |
| Summary | Pre-computed percentiles | request_latency_quantile | When you need histogram heatmaps |
Step-by-Step Metrics and Dashboards
1. Instrument Your Applications
Expose metrics in a format your collector understands:
# Example: Python application metrics with Prometheus client
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time
# Define metrics
http_requests_total = Counter(
'http_requests_total',
'Total HTTP requests',
['method', 'status', 'path']
)
request_duration_seconds = Histogram(
'request_duration_seconds',
'HTTP request duration',
['method', 'path'],
buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)
active_connections = Gauge(
'active_connections',
'Number of active connections'
)
# Instrument your code
@app.route("/api/orders/<order_id>")
def get_order(order_id):
start = time.time()
active_connections.inc()
try:
order = fetch_order(order_id)
http_requests_total.labels(method='GET', status='200', path='/api/orders').inc()
return jsonify(order)
except OrderNotFound:
http_requests_total.labels(method='GET', status='404', path='/api/orders').inc()
return jsonify({"error": "Not found"}), 404
finally:
request_duration_seconds.labels(method='GET', path='/api/orders').observe(time.time() - start)
active_connections.dec()
# Expose metrics endpoint
start_http_server(8000)
// Example: Spring Boot with Micrometer
@Configuration
public class MetricsConfig {
@Bean
public MeterRegistryCustomizer<MeterRegistry> metricsCommonTags() {
return registry -> registry.config()
.commonTags("application", "orders-service");
}
}
@Service
public class OrderService {
private final Counter orderCounter;
private final Timer orderTimer;
public OrderService(MeterRegistry registry) {
this.orderCounter = Counter.builder("orders.processed")
.description("Total orders processed")
.register(registry);
this.orderTimer = Timer.builder("orders.processing.time")
.description("Order processing time")
.register(registry);
}
public Order processOrder(OrderRequest request) {
return orderTimer.recordCallable(() -> {
Order result = doProcess(request);
orderCounter.increment();
return result;
});
}
}
Instrumentation checklist:
- Instrument the four golden signals: latency, traffic, errors, saturation
- Add labels for dimensions you will filter by (service, environment, endpoint)
- Use consistent naming:
unitsuffix,totalfor counters,secondsfor duration - Avoid high-cardinality labels (user IDs, session IDs, request IDs)
- Measure business metrics (orders placed, payments processed) alongside technical metrics
2. Collect and Store Metrics
Set up a metrics pipeline:
# Example: Prometheus scrape configuration
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
Collection best practices:
- Scrape every 10-30 seconds (faster for high-frequency changes)
- Use service discovery (Kubernetes, Consul, DNS) instead of static targets
- Run collectors in each region/zone to minimize latency
- Use remote write for long-term storage (Thanos, Cortex, VictoriaMetrics)
- Federation for hierarchical aggregation (edge → regional → global)
3. Build Effective Dashboards
Design dashboards that tell a story:
| Dashboard Type | Purpose | Key Panels |
|---|---|---|
| Service overview | Health of a single service | Error rate, latency p95, throughput, resource usage |
| Golden signals | Cross-service health | RED metrics (Rate, Errors, Duration) per service |
| Business KPI | Impact on revenue/usage | Conversions, active users, transaction volume |
| Infrastructure | Cluster/node health | CPU, memory, disk, network across all nodes |
| Incident response | Drill-down during incidents | Detailed per-endpoint latency, error breakdown, logs |
// Example: Grafana dashboard JSON snippet (simplified)
{
"dashboard": {
"title": "Orders Service - Golden Signals",
"panels": [
{
"title": "Request Rate",
"type": "timeseries",
"targets": [{
"expr": "sum(rate(http_requests_total{service=\"orders\"}[5m])) by (status)"
}]
},
{
"title": "Error Rate",
"type": "stat",
"targets": [{
"expr": "sum(rate(http_requests_total{service=\"orders\",status=~\"5..\"}[5m])) / sum(rate(http_requests_total{service=\"orders\"}[5m]))"
}],
"thresholds": [
{"value": 0.001, "color": "green"},
{"value": 0.01, "color": "yellow"},
{"value": 0.05, "color": "red"}
]
},
{
"title": "Latency p95",
"type": "timeseries",
"targets": [{
"expr": "histogram_quantile(0.95, sum(rate(request_duration_seconds_bucket{service=\"orders\"}[5m])) by (le))"
}]
}
]
}
}
Dashboard design principles:
- Put the most important panels at the top-left
- Use consistent colors: green = good, yellow = warning, red = critical
- Add links to related dashboards, logs, and traces
- Keep the number of panels per dashboard under 20
- Use template variables for service, environment, and time range
4. Define SLIs and SLOs
Translate metrics into reliability targets:
# Example: SLI queries for common objectives
# Availability SLI: % of successful requests
(
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
) * 100
# Latency SLI: % of requests under threshold
(
sum(rate(request_duration_seconds_bucket{le="0.5"}[5m]))
/
sum(rate(request_duration_seconds_bucket{le="+Inf"}[5m]))
) * 100
# Error budget: remaining acceptable errors
# SLO: 99.9% availability
# Error budget: 0.1% of total requests per month
0.001 * sum(increase(http_requests_total[30d]))
| Objective | SLI | SLO | Measurement Window |
|---|---|---|---|
| Availability | Successful requests / total requests | 99.9% | 30 days |
| Latency | Requests under 200ms / total requests | 99% under 200ms | 7 days |
| Error rate | Error responses / total responses | < 0.1% | 1 hour |
| Throughput | Requests per second | > 1000 rps | 5 minutes |
5. Create Meaningful Alerts
Alert on symptoms, not causes:
# Example: Prometheus alerting rules
groups:
- name: service_alerts
rules:
- alert: HighErrorRate
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
) > 0.01
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate in {{ $labels.service }}"
description: "Error rate is {{ $value | humanizePercentage }}"
- alert: LatencyDegradation
expr: |
histogram_quantile(0.95,
sum(rate(request_duration_seconds_bucket[5m])) by (le)
) > 1.0
for: 10m
labels:
severity: warning
annotations:
summary: "Latency p95 above 1s"
Alert design principles:
- Alert on user-impacting symptoms (error rate, latency), not causes (disk full)
- Use
for:duration to reduce noise (require sustained threshold breach) - Add runbook links and dashboard links to every alert
- Severity levels: page for critical (user impact), ticket for warning (trending)
- Regularly review alert frequency and tune thresholds
Best Practices
- Name metrics consistently.
service_unitformat:orders_service_requests_total. - Document your metrics. Every metric needs a description and unit.
- Use histograms over averages. Averages hide outliers; histograms show distribution.
- Cardinality is cost. Every unique label combination creates a new time series.
- Dashboards are for exploration, not monitoring. Alerts notify; dashboards investigate.
- Test your dashboards. Walk through incident scenarios to verify they provide answers.
Common Mistakes
- High-cardinality metrics. Labeling by user ID or request ID explodes storage.
- Alerting on everything. Too many alerts create noise and reduce response quality.
- Missing units. A metric named
latencyis ambiguous —latency_secondsis clear. - Averaging percentiles. You cannot average p95s across services. Use histograms.
- No aggregation rules. Raw high-frequency metrics overwhelm dashboards; aggregate first.
Variants
- Pull-based: Prometheus scrapes exporters (standard for Kubernetes)
- Push-based: StatsD, Telegraf, or application pushes to collector (better for short-lived jobs)
- Cloud-native: AWS CloudWatch, Google Cloud Monitoring, Azure Monitor (managed, but vendor-specific)
- Enterprise: Datadog, New Relic, Dynatrace (rich features, per-host pricing)
FAQ
Q: How many metrics should my application expose? 10-50 well-chosen metrics beats 1000 auto-generated ones. Focus on the four golden signals and business KPIs.
Q: What scrape interval should I use? 15 seconds is standard. Use 5 seconds for critical systems, 60 seconds for slow-changing infrastructure.
Q: How do I handle metric cardinality? Use static label values (status code class, not exact URL). Drop high-cardinality labels at ingestion if necessary.
Q: Should I use Prometheus or a SaaS solution? Prometheus is free but requires operational expertise. SaaS solutions reduce overhead but increase cost at scale. Many teams use both: Prometheus for real-time, SaaS for long-term.
Conclusion
Metrics and dashboards transform raw system data into operational intelligence. By instrumenting consistently, designing dashboards for decision-making, and alerting on symptoms rather than causes, you build an observability practice that reduces MTTR and improves system reliability.