Distributed Tracing
Trace requests across distributed microservices with OpenTelemetry, Jaeger, and Zipkin for latency debugging and performance optimization.
Note: This guide follows English-language naming conventions and terminology standards common in international development teams. Examples use English identifiers and comments to maximize compatibility across codebases and tooling.
Overview
Distributed tracing follows a single request as it travels through microservices, databases, message queues, and third-party APIs. Unlike logs (discrete events) or metrics (aggregated numbers), traces reveal the full journey — showing exactly where time is spent and which service causes delays. OpenTelemetry has become the industry standard for instrumenting applications and exporting traces to Jaeger, Zipkin, or cloud providers.
When to Use
Use this resource when:
- Debugging latency in microservices architectures
- Understanding call graphs across 10+ services
- Optimizing critical user journeys (checkout, login, search)
- Identifying cascading failures and retry storms
Solution
OpenTelemetry Auto-Instrumentation (Node.js)
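// Load this file before the rest of the application so auto-instrumentation
// can patch libraries as they are required.
const { NodeSDK } = require('@opentelemetry/sdk-node');
// Note: @opentelemetry/exporter-jaeger is deprecated upstream; recent Jaeger versions
// ingest OTLP directly, so an OTLP exporter is the usual choice for new setups.
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

const sdk = new NodeSDK({
  traceExporter: new JaegerExporter({
    // Jaeger collector HTTP endpoint
    endpoint: 'http://jaeger:14268/api/traces'
  }),
  instrumentations: [getNodeAutoInstrumentations()]
});

sdk.start();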
Custom Span Creation (Go)
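import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/trace"
)

// processOrder wraps order handling in a span, with a child span for the stock check.
// db is assumed to be an application-level inventory client defined elsewhere.
func processOrder(ctx context.Context, orderID string) error {
	tracer := otel.Tracer("order-service")
	ctx, span := tracer.Start(ctx, "processOrder",
		trace.WithAttributes(attribute.String("order.id", orderID)))
	defer span.End()

	// Child span for database call
	_, dbSpan := tracer.Start(ctx, "validateInventory")
	err := db.CheckStock(orderID)
	dbSpan.End()

	if err != nil {
		// RecordError only adds an event; set the status so the span is marked as failed
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		return err
	}

	// The description argument is only used for error statuses
	span.SetStatus(codes.Ok, "")
	return nil
}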
Propagation via HTTP Headers
from opentelemetry import trace
from opentelemetry.propagate import extract, inject
import requests

tracer = trace.get_tracer(__name__)


def handle_request(headers):
    # Extract parent context from incoming request
    context = extract(headers)
    with tracer.start_as_current_span("process-payment", context=context):
        # Outgoing request carries trace context
        outgoing_headers = {}
        inject(outgoing_headers)
        response = requests.post(
            "https://payment-api.example.com/charge",
            headers=outgoing_headers
        )
        return response.json()
Explanation
Trace anatomy:
- Trace: A complete user request (e.g., “add to cart”)
- Span: A single operation within the trace (e.g., “query database”)
- Span context: Trace ID + Span ID + flags, propagated across service boundaries
- Baggage: Key-value pairs shared across the entire trace
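The relationship between a trace and its spans can be sketched with a small data model. This is only an illustration: the `Span` fields and the `self_time_ms` helper are hypothetical names, not part of the OpenTelemetry API, and the calculation assumes that the children of a span run one after another without overlapping.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """Illustrative span record; field names are assumptions, not the OpenTelemetry API."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def self_time_ms(spans: list[Span]) -> dict[str, int]:
    """Duration of each span minus the time spent in its direct children.

    Assumes children of the same parent run sequentially (no overlap).
    """
    remaining = {s.span_id: s.duration_ms for s in spans}
    for s in spans:
        if s.parent_id in remaining:
            remaining[s.parent_id] -= s.duration_ms
    return {s.name: remaining[s.span_id] for s in spans}


# One trace ("add to cart") made of three spans that share a trace ID.
spans = [
    Span("t1", "a", None, "POST /cart", 0, 120),
    Span("t1", "b", "a", "query database", 10, 90),
    Span("t1", "c", "a", "publish event", 95, 110),
]
print(self_time_ms(spans))
# {'POST /cart': 25, 'query database': 80, 'publish event': 15}
```

In this made-up trace the database query accounts for 80 of the 120 ms, which is the kind of answer a trace view gives at a glance.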
W3C Trace Context standard:
traceparent: 00-traceid-spanid-flags
tracestate: Vendor-specific extensions
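To make the header layout concrete, here is a minimal sketch of parsing and formatting a version `00` `traceparent` value with the standard library only. The names `parse_traceparent`, `format_traceparent`, and `TraceParent` are hypothetical helpers for illustration; in practice the OpenTelemetry propagators handle this, and this sketch does not attempt the forward-compatibility rules for future header versions.

```python
import re
from typing import NamedTuple, Optional

# version-traceid-parentid-flags, all lowercase hex (W3C Trace Context, version 00).
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed fields, or None if the header is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version "ff" is forbidden, and all-zero IDs are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # The lowest bit of the flags byte is the "sampled" flag.
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


def format_traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    """Build a version 00 traceparent value for an outgoing request."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


parsed = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(parsed.trace_id, parsed.sampled)
# 4bf92f3577b34da6a3ce929d0e0e4736 True
```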
Sampling strategies:
- Head-based: Decide at the edge (simple; consistent)
- Tail-based: Decide after completion (catches rare errors; expensive)
- Probability: Keep a random percentage of traces, usually decided at the head (cheap; may miss edge cases)
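The difference between deciding up front and deciding after completion can be sketched as two small functions. Both are simplified illustrations with hypothetical names (`head_sample`, `tail_sample`) and a made-up span shape (dicts with `error` and `duration_ms` keys). The head-based version derives its decision from the trace ID, similar in spirit to trace-ID-ratio samplers but not a copy of any SDK's exact algorithm.

```python
def head_sample(trace_id: str, ratio: float) -> bool:
    """Head-based probability decision derived from the trace ID.

    Using the ID instead of a fresh random number means every service that
    sees the same trace ID reaches the same decision.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    low_64_bits = int(trace_id[-16:], 16)
    return low_64_bits < ratio * (1 << 64)


def tail_sample(spans: list[dict], slow_ms: int = 1000) -> bool:
    """Tail-based decision made once all spans of the trace are available.

    Keeps the trace if any span failed or ran longer than slow_ms.
    """
    return any(s["error"] or s["duration_ms"] > slow_ms for s in spans)


completed_trace = [
    {"name": "POST /checkout", "error": False, "duration_ms": 240},
    {"name": "charge card", "error": True, "duration_ms": 180},
]
print(tail_sample(completed_trace))  # True, because one span failed
```

The tail-based function needs every span of the trace buffered before it can run, which is where its extra cost comes from.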
Variants
| Backend | Best For | Notable Features |
|---|---|---|
| Jaeger | Open source, self-hosted | Native OpenTelemetry; good UI |
| Zipkin | Simple setups | Minimal resource footprint |
| AWS X-Ray | AWS-native apps | Service map; integration with ALB/Lambda |
| Datadog | Enterprise SaaS | APM + traces + logs unified |
| Grafana Tempo | Grafana stack | Cost-effective at scale |
Best Practices
- Instrument at framework level: Auto-instrument HTTP, gRPC, database, and message queue clients
- Add business attributes: user_id, order_id, tenant_id make traces actionable
- Keep cardinality low: Don’t put unique IDs in span names (use attributes instead)
- Sample aggressively in production: 1-5% is usually sufficient for debugging
- Link traces to logs: Include trace_id in log entries for cross-referencing
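As a rough illustration of linking traces to logs, the sketch below stamps a trace ID onto every log record using only the standard library. The `current_trace_id` context variable is a hypothetical stand-in for the tracer's notion of the current span; with a real SDK the ID would be read from the active span context instead. The logger name and format string are also assumptions.

```python
import contextvars
import io
import logging

# Hypothetical stand-in for "the current span's trace ID"; a tracing SDK tracks this for you.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the active trace ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True


def build_logger(stream) -> logging.Logger:
    """Create a logger whose output lines include the trace ID."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger = logging.getLogger("orders")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
log.info("order processed")
print(buffer.getvalue(), end="")
# INFO trace_id=4bf92f3577b34da6a3ce929d0e0e4736 order processed
```

With the ID in each line, a log search for one `trace_id` returns everything written while that request was being handled.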
Common Mistakes
- Missing context propagation: Traces split into disconnected fragments when services don’t forward the trace headers
- Span explosion: Creating spans for every loop iteration creates unreadable traces
- High-cardinality span names: Putting user IDs or session IDs in span names bloats storage and indexes; put them in attributes instead
- Sampling in dev: Keep sampling at 100% in development; full tracing makes it easy to verify instrumentation
- Ignoring async flows: Background jobs, callbacks, and timers need manual span parenting
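The async pitfall comes down to the "current span" living in per-context state that background workers do not automatically share. The sketch below shows the general idea with Python's `contextvars`: snapshot the caller's context and run the job inside it. The `current_span` variable and the `submit_with_context` helper are hypothetical; a tracing SDK keeps its own current-span slot, but the hand-off pattern is similar.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the tracer's "current span" slot.
current_span = contextvars.ContextVar("current_span", default=None)


def background_job() -> str:
    """Report which span this job would attach to as a child."""
    parent = current_span.get()
    return f"parent={parent}"


def submit_with_context(executor: ThreadPoolExecutor, fn):
    """Snapshot the caller's context and run fn inside it on the worker thread."""
    ctx = contextvars.copy_context()
    return executor.submit(ctx.run, fn)


current_span.set("handle-request")
with ThreadPoolExecutor(max_workers=1) as pool:
    linked = submit_with_context(pool, background_job).result()
print(linked)  # parent=handle-request
```

Submitting `background_job` directly, without the snapshot, generally leaves the worker without the caller's context, so the job's span would start a new, unrelated trace.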
Frequently Asked Questions
Q: Do I need to change my code for every function?
A: No. Auto-instrumentation covers HTTP, DB, and queue clients. Only add manual spans for critical business operations.
Q: What’s the performance overhead?
A: Typically <1% CPU and memory when sampling 1-5%. Head-based sampling is cheaper than tail-based.
Q: Can I trace frontend JavaScript too?
A: Yes. OpenTelemetry JS instruments browser apps, connecting user clicks to backend traces end-to-end.