Metrics Collection
Collect, aggregate, and expose application and infrastructure metrics with Prometheus, StatsD, and OpenTelemetry for monitoring and alerting.
Note: This guide follows English-language naming conventions and terminology standards common in international development teams. Examples use English identifiers and comments to maximize compatibility across codebases and tooling.
Overview
Metrics collection transforms raw system behavior into time-series data that reveals performance trends, capacity limits, and anomalies. Unlike logs (discrete events) or traces (request journeys), metrics are numerical measurements aggregated over time — request rates, error percentages, queue depths, and memory usage. A well-designed metrics pipeline enables proactive alerting before users notice degradation.
When to Use
Use this resource when:
- You need quantitative SLIs for error budgets and SLO dashboards
- Alerting must fire before logs are aggregated (sub-minute detection)
- Capacity planning requires historical throughput and resource usage trends
- Debugging requires correlating metrics across services (CPU spike + latency increase)
Solution
Prometheus Metrics in Go
import "github.com/prometheus/client_golang/prometheus"
var (
requestDuration = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Help: "HTTP request latency",
Buckets: prometheus.DefBuckets,
},
[]string{"method", "status"},
)
activeConnections = prometheus.NewGauge(
prometheus.GaugeOpts{
Name: "active_connections",
Help: "Number of active connections",
},
)
)
func init() {
prometheus.MustRegister(requestDuration, activeConnections)
}
func handleRequest(w http.ResponseWriter, r *http.Request) {
activeConnections.Inc()
defer activeConnections.Dec()
start := time.Now()
defer func() {
requestDuration.WithLabelValues(
r.Method,
strconv.Itoa(w.Status()),
).Observe(time.Since(start).Seconds())
}()
// Handler logic...
}
StatsD Metrics (Node.js)
const StatsD = require('node-statsd');
const client = new StatsD({ host: 'localhost', port: 8125 });
function processPayment(orderId, amount) {
const start = Date.now();
try {
const result = paymentGateway.charge(amount);
client.increment('payment.success');
client.gauge('payment.amount', amount);
return result;
} catch (err) {
client.increment('payment.error', 1, ['gateway:stripe', 'error:declined']);
throw err;
} finally {
client.timing('payment.duration', Date.now() - start);
}
}
OpenTelemetry Metrics (Python)
from opentelemetry import metrics
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from opentelemetry.sdk.metrics import MeterProvider
reader = PrometheusMetricReader()
provider = MeterProvider(metric_readers=[reader])
metrics.set_meter_provider(provider)
meter = metrics.get_meter(__name__)
counter = meter.create_counter("orders.created", description="Orders created")
histogram = meter.create_histogram("order.value", description="Order value in USD")
def create_order(items, total):
counter.add(1, {"region": "us-east"})
histogram.record(total, {"region": "us-east"})
return Order(items=items, total=total)
Explanation
Metric types:
| Type | Use Case | Example |
|---|---|---|
| Counter | Monotonically increasing | Total requests, errors |
| Gauge | Value goes up and down | Active connections, queue depth |
| Histogram | Distribution of values | Request latency, payload size |
| Summary | Quantiles (client-side) | 99th percentile latency |
Cardinality danger:
- Good labels:
method=GET,status=200,region=us-east - Bad labels:
user_id=12345,session_id=abc— causes metric explosion - Rule of thumb: Keep unique label combinations under 10,000
Variants
| Backend | Collection | Best For |
|---|---|---|
| Prometheus | Pull (scrape) | Kubernetes; PromQL queries |
| StatsD | Push (UDP) | Legacy apps; simple counters |
| InfluxDB | Push (HTTP) | High cardinality; tags |
| Datadog | Agent push | SaaS; out-of-box dashboards |
| CloudWatch | AWS integration | AWS-native apps |
Best Practices
- Use histograms for latency: Counters and gauges lose distribution shape
- Add
lebuckets for SLOs:histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) - Name consistently:
subsystem_metric_unit(e.g.,http_requests_total) - Alert on rates, not totals:
rate(errors[5m]) > 0.01noterrors > 1000 - Separate metric and business logic: Keep instrumentation thin; never block on metric emission
Common Mistakes
- High-cardinality labels: User IDs as labels crash Prometheus storage
- Missing units:
request_durationwithout_secondsor_millisecondscreates confusion - Alerting on gauges: Queue depth alone doesn’t indicate failure; combine with processing rate
- No retention policy: Keeping 1-second resolution for 5 years wastes storage; implement a data retention policy.
- Forgetting to instrument failures: Only measuring success hides partial outages
Frequently Asked Questions
Q: How do I choose between Prometheus and StatsD? A: Use Prometheus for new cloud-native apps. Use StatsD for legacy apps where adding an HTTP endpoint is hard.
Q: What’s the performance overhead of metrics collection? A: Negligible for counters and gauges (<1%). Histograms with many buckets add slightly more; use predefined buckets.
Q: Should I collect metrics from the client (browser)? A: Yes. Core Web Vitals, API error rates, and navigation timing from real users are essential SLIs.
Related Resources
Metrics Collection and Alerting with Prometheus
Instrument applications and infrastructure with Prometheus metrics, configure alerting rules, and set up recording rules for efficient monitoring of service health
RecipePrometheus API Monitoring
Monitor API performance and health with Prometheus metrics, custom collectors, and alerting rules.
RecipeObservability Dashboards with Grafana and Prometheus
Build interactive Grafana dashboards that visualize Prometheus metrics with panels, variables, and alerts for comprehensive service observability
RecipeDistributed Tracing
Trace requests across distributed microservices with OpenTelemetry, Jaeger, and Zipkin for latency debugging and performance optimization.
RecipeStructured Logging
Implement structured logging with JSON output, correlation IDs, and log aggregation for production observability.