Complete Guide to Observability with the Grafana Stack
Set up metrics, logs, and traces with Grafana, Prometheus, Loki, and Tempo. Covers instrumentation, dashboards, alerting, and distributed tracing for production systems.
Note: This guide follows English-language naming conventions and terminology standards common in international development teams. Examples use English identifiers and comments to maximize compatibility across codebases and tooling.
Complete Guide to Observability with the Grafana Stack
Introduction
Observability means understanding your system’s internal state from its external outputs — metrics, logs, and traces. The Grafana stack (Prometheus for metrics, Loki for logs, Tempo for traces, Grafana for visualization) provides a complete open-source observability platform. This guide covers instrumentation, configuration, dashboards, alerting, and distributed tracing.
The Three Pillars
| Pillar | Tool | What It Answers |
|---|---|---|
| Metrics | Prometheus | How many? How fast? How long? |
| Logs | Loki | What happened? Why? |
| Traces | Tempo | Where did time go? What called what? |
Prometheus (Metrics)
Installation with Docker
# docker-compose.yml
services:
prometheus:
image: prom/prometheus:latest
ports:
- "9090:9090"
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- prometheus-data:/prometheus
volumes:
prometheus-data:
Configuration
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
rule_files:
- "alert_rules.yml"
alerting:
alertmanagers:
- static_configs:
- targets: ["alertmanager:9093"]
scrape_configs:
- job_name: "prometheus"
static_configs:
- targets: ["localhost:9090"]
- job_name: "node-exporter"
static_configs:
- targets: ["node-exporter:9100"]
- job_name: "app"
static_configs:
- targets: ["app:8080"]
metrics_path: /metrics
Instrumenting applications
Python
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
from flask import Flask, request
app = Flask(__name__)
REQUEST_COUNT = Counter(
"http_requests_total",
"Total HTTP requests",
["method", "endpoint", "status"]
)
REQUEST_LATENCY = Histogram(
"http_request_duration_seconds",
"HTTP request latency",
["method", "endpoint"]
)
@app.before_request
def before_request():
request.start_time = time.time()
@app.after_request
def after_request(response):
REQUEST_COUNT.labels(
method=request.method,
endpoint=request.endpoint,
status=response.status_code
).inc()
REQUEST_LATENCY.labels(
method=request.method,
endpoint=request.endpoint
).observe(time.time() - request.start_time)
return response
@app.route("/metrics")
def metrics():
return generate_latest(), 200, {"Content-Type": CONTENT_TYPE_LATEST}
Node.js
const promClient = require("prom-client");
const collectDefaultMetrics = promClient.collectDefaultMetrics;
collectDefaultMetrics({ register: promClient.register });
const httpRequestDuration = new promClient.Histogram({
name: "http_request_duration_seconds",
help: "HTTP request duration",
labelNames: ["method", "route", "status"],
buckets: [0.005, 0.01, 0.05, 0.1, 0.5, 1, 5],
});
app.use((req, res, next) => {
const start = Date.now();
res.on("finish", () => {
httpRequestDuration
.labels(req.method, req.route?.path || req.path, res.statusCode)
.observe((Date.now() - start) / 1000);
});
next();
});
app.get("/metrics", async (req, res) => {
res.set("Content-Type", promClient.register.contentType);
res.end(await promClient.register.metrics());
});
PromQL queries
# Request rate (requests per second)
rate(http_requests_total[5m])
# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))
# CPU usage per pod
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)
# Memory usage
container_memory_working_set_bytes{container!=""}
# Average latency by endpoint
avg by (endpoint) (rate(http_request_duration_seconds_sum[5m]) /
rate(http_request_duration_seconds_count[5m]))
Loki (Logs)
Installation
# docker-compose.yml
services:
loki:
image: grafana/loki:latest
ports:
- "3100:3100"
command: -config.file=/etc/loki/local-config.yaml
volumes:
- loki-data:/loki
promtail:
image: grafana/promtail:latest
volumes:
- /var/log:/var/log
- ./promtail.yml:/etc/promtail/promtail.yml
command: -config.file=/etc/promtail/promtail.yml
volumes:
loki-data:
Promtail configuration
# promtail.yml
server:
http_listen_port: 9080
positions:
filename: /tmp/positions.yaml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: app-logs
static_configs:
- targets: [localhost]
labels:
job: app
environment: production
__path__: /var/log/app/*.log
- job_name: docker-logs
docker_sd_configs:
- name: docker
filters:
- name: label
values: ["logging=promtail"]
relabel_configs:
- source_labels: ["__meta_docker_container_name"]
target_label: container
LogQL queries
# All logs from app job
{job="app"}
# Error logs with regex filter
{job="app"} |= "error" | json | line_format "{{.msg}}"
# Rate of error logs per minute
sum(rate({job="app"} |= "error" [1m])) by (level)
# Logs with label filter
{job="app", environment="production"} |= "timeout" | json | level="error"
Tempo (Traces)
Installation
# docker-compose.yml
services:
tempo:
image: grafana/tempo:latest
command: ["-config.file=/etc/tempo.yaml"]
volumes:
- ./tempo.yaml:/etc/tempo.yaml
- tempo-data:/var/tempo
ports:
- "14268:14268" # Jaeger ingest
- "3200:3200" # Tempo query
Configuration
# tempo.yaml
server:
http_listen_port: 3200
distributor:
receivers:
jaeger:
protocols:
thrift_http:
otlp:
protocols:
grpc:
http:
storage:
trace:
backend: local
local:
path: /var/tempo/traces
wal:
path: /var/tempo/wal
Instrumenting with OpenTelemetry
Python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.flask import FlaskInstrumentor
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
BatchSpanProcessor(OTLPSpanExporter(endpoint="http://tempo:4318/v1/traces"))
)
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)
tracer = trace.get_tracer(__name__)
@app.route("/users/<user_id>")
def get_user(user_id):
with tracer.start_as_current_span("get_user") as span:
span.set_attribute("user.id", user_id)
user = fetch_user(user_id)
span.set_attribute("user.name", user.name)
return user
Node.js
const { NodeSDK } = require("@opentelemetry/sdk-node");
const { OTLPTraceExporter } = require("@opentelemetry/exporter-trace-otlp-http");
const { ExpressInstrumentation } = require("@opentelemetry/instrumentation-express");
const { HttpInstrumentation } = require("@opentelemetry/instrumentation-http");
const sdk = new NodeSDK({
traceExporter: new OTLPTraceExporter({
url: "http://tempo:4318/v1/traces",
}),
instrumentations: [new ExpressInstrumentation(), new HttpInstrumentation()],
});
sdk.start();
Grafana (Visualization)
Data sources
# provisioning/datasources/datasources.yml
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
- name: Loki
type: loki
access: proxy
url: http://loki:3100
- name: Tempo
type: tempo
access: proxy
url: http://tempo:3200
jsonData:
tracesToLogs:
datasourceUid: loki
tags: ["job", "instance", "pod"]
Dashboard provisioning
# provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
- name: Default
folder: ""
type: file
options:
path: /var/lib/grafana/dashboards
Example dashboard JSON (RED metrics)
{
"dashboard": {
"title": "Service RED Metrics",
"panels": [
{
"title": "Request Rate",
"type": "stat",
"targets": [
{ "expr": "sum(rate(http_requests_total[5m]))" }
]
},
{
"title": "Error Rate",
"type": "stat",
"targets": [
{ "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))" }
]
},
{
"title": "P95 Latency",
"type": "stat",
"targets": [
{ "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))" }
]
},
{
"title": "Latency Over Time",
"type": "timeseries",
"targets": [
{ "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) by (endpoint)" }
]
}
]
}
}
Alerting
Prometheus alert rules
# alert_rules.yml
groups:
- name: app-alerts
rules:
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate on {{ $labels.job }}"
description: "Error rate is {{ $value | humanizePercentage }} for the last 5 minutes"
- alert: HighLatency
expr: |
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
for: 10m
labels:
severity: warning
annotations:
summary: "High P95 latency on {{ $labels.job }}"
description: "P95 latency is {{ $value }}s for the last 10 minutes"
- alert: PodDown
expr: up{job="app"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Pod {{ $labels.instance }} is down"
Alertmanager configuration
# alertmanager.yml
route:
receiver: slack
group_by: ["alertname", "severity"]
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receivers:
- name: slack
slack_configs:
- api_url: "https://hooks.slack.com/services/XXX"
channel: "#alerts"
send_resolved: true
Best Practices
- Use the RED method — Rate, Errors, Duration for every service
- Use the USE method — Utilization, Saturation, Errors for resources
- Label consistently — service, environment, version on all metrics
- Set retention wisely — 15s scrape for 15 days, then downsample
- Use structured logging — JSON logs with trace IDs for Loki correlation
- Inject trace IDs in logs — enables jumping from traces to logs in Grafana
- Use OpenTelemetry — vendor-neutral instrumentation standard
- Provision dashboards as code — version control your dashboards
- Alert on symptoms, not causes — alert on user-visible degradation
- Set SLO-based alerts — burn rate alerts catch sustained degradation
- Use Grafana templates — one dashboard for all services with variables
- Keep cardinality low — avoid high-cardinality labels (user IDs, request IDs)
Common Mistakes
- Using high-cardinality labels — explodes Prometheus memory usage
- Not setting retention limits — Prometheus disk fills up
- Alerting on every spike — alert fatigue kills signal quality
- Not correlating traces with logs — manual log searching wastes time
- Storing all logs at the same retention — expensive and unnecessary
- Not instrumenting downstream calls — blind spots in trace graphs
- Using counters without rate() — raw counters are meaningless
- Not versioning dashboards — ad-hoc changes break shared views
- Ignoring USE metrics for infrastructure — CPU and disk saturation missed
- Not testing alert rules — alerts fire incorrectly or not at all
Frequently Asked Questions
What is the difference between metrics, logs, and traces?
Metrics are numeric measurements aggregated over time (CPU usage, request count). Logs are discrete events with timestamps and context (error messages, audit trails). Traces follow a single request across service boundaries, showing the causal chain and timing of each step. All three are needed for full observability.
How do I correlate traces with logs in Grafana?
Configure Tempo’s tracesToLogs in the datasource config to link to Loki. When viewing a trace, Grafana shows a “Logs” tab that queries Loki for logs matching the trace’s service, span, and time range. Ensure your application logs include the trace ID as a field.
Should I use Grafana Cloud or self-host?
Grafana Cloud is managed and scales automatically — good for teams without ops capacity. Self-hosting gives full control and avoids per-metric pricing but requires maintenance capacity. For small teams, Grafana Cloud’s free tier covers up to 10k active metrics and 50GB logs.