intermediate By StackPractices

Distributed Tracing

Trace requests across distributed microservices with OpenTelemetry, Jaeger, and Zipkin for latency debugging and performance optimization.

Note: This guide follows English-language naming conventions and terminology standards common in international development teams. Examples use English identifiers and comments to maximize compatibility across codebases and tooling.

Overview

Distributed tracing follows a single request as it travels through microservices, databases, message queues, and third-party APIs. Unlike logs (discrete events) or metrics (aggregated numbers), traces reveal the full journey — showing exactly where time is spent and which service causes delays. OpenTelemetry has become the industry standard for instrumenting applications and exporting traces to Jaeger, Zipkin, or cloud providers.

When to Use

Use this resource when:

  • Debugging latency in microservices architectures
  • Understanding call graphs across 10+ services
  • Optimizing critical user journeys (checkout, login, search)
  • Identifying cascading failures and retry storms

Solution

OpenTelemetry Auto-Instrumentation (Node.js)

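// Load this file before any application code, for example with
// `node -r ./tracing.js app.js` (tracing.js is a placeholder filename),
// so auto-instrumentation can patch modules as they are required.
const { NodeSDK } = require('@opentelemetry/sdk-node');
// Note: @opentelemetry/exporter-jaeger is deprecated upstream; recent Jaeger
// versions accept OTLP, so an OTLP exporter is the usual choice for new setups.
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

const sdk = new NodeSDK({
  // Send spans to the Jaeger collector's HTTP endpoint
  traceExporter: new JaegerExporter({
    endpoint: 'http://jaeger:14268/api/traces'
  }),
  // Patch supported HTTP, database, and messaging libraries automatically
  instrumentations: [getNodeAutoInstrumentations()]
});

sdk.start();

// Flush buffered spans before the process exits
process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0));
});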

Custom Span Creation (Go)

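import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
    "go.opentelemetry.io/otel/trace"
)

func processOrder(ctx context.Context, orderID string) error {
    tracer := otel.Tracer("order-service")

    ctx, span := tracer.Start(ctx, "processOrder",
        trace.WithAttributes(attribute.String("order.id", orderID)))
    defer span.End()

    // Child span for the database call. Its context is kept in a separate
    // variable so later spans are still parented under processOrder.
    dbCtx, dbSpan := tracer.Start(ctx, "validateInventory")
    // db is the application's own inventory client (not shown); passing
    // dbCtx assumes CheckStock accepts a context as its first argument.
    err := db.CheckStock(dbCtx, orderID)
    dbSpan.End()

    if err != nil {
        // Attach the error as an event and mark the span as failed
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return err
    }

    // The description is only kept for error statuses, so leave it empty
    span.SetStatus(codes.Ok, "")
    return nil
}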

Propagation via HTTP Headers

from opentelemetry import trace
from opentelemetry.propagate import extract, inject
import requests

tracer = trace.get_tracer(__name__)

def handle_request(headers):
    # Extract parent context from incoming request
    context = extract(headers)
    
    with tracer.start_as_current_span("process-payment", context=context):
        # Outgoing request carries trace context
        outgoing_headers = {}
        inject(outgoing_headers)
        
        response = requests.post(
            "https://payment-api.example.com/charge",
            headers=outgoing_headers
        )
        return response.json()

Explanation

Trace anatomy:

  • Trace: A complete user request (e.g., “add to cart”)
  • Span: A single operation within the trace (e.g., “query database”)
  • Span context: Trace ID + Span ID + flags, propagated across service boundaries
  • Baggage: Key-value pairs shared across the entire trace
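
The relationship between these terms can be sketched with a small data model. This is an illustration only, not the OpenTelemetry API: SpanContext, Span, and start_span are hypothetical names, and the flags field is reduced to a single sampled bit.

```python
import secrets
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class SpanContext:
    trace_id: str         # 32 hex chars, shared by every span in the trace
    span_id: str          # 16 hex chars, unique to one span
    sampled: bool = True  # the "flags" part, reduced to one bit here


@dataclass
class Span:
    name: str
    context: SpanContext
    parent_span_id: Optional[str] = None
    attributes: dict = field(default_factory=dict)


def start_span(name: str, parent: Optional[Span] = None) -> Span:
    """Create a root span (new trace ID) or a child span (inherited trace ID)."""
    if parent is None:
        trace_id = secrets.token_hex(16)
        parent_span_id = None
        sampled = True
    else:
        trace_id = parent.context.trace_id
        parent_span_id = parent.context.span_id
        sampled = parent.context.sampled
    context = SpanContext(trace_id, secrets.token_hex(8), sampled)
    return Span(name=name, context=context, parent_span_id=parent_span_id)


# One trace ("add to cart") containing one child operation
root = start_span("add to cart")
child = start_span("query database", parent=root)
```

Both spans carry the same trace ID, and the child records the root's span ID as its parent. That shared ID is what lets a backend reassemble spans from different services into one trace.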

W3C Trace Context standard:

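  • traceparent: version-traceid-parentid-flags, where the version is currently 00, the trace ID is 32 hex characters, the parent span ID is 16 hex characters, and the flags are 2 hex characters (01 means sampled)
  • tracestate: Vendor-specific key-value extensions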
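
To make the header layout concrete, here is a minimal parser sketch for the traceparent value. It assumes the strict four-field form used by version 00 and ignores tracestate. TraceParent and parse_traceparent are illustrative names, not part of any library.

```python
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace ID (32 hex) - parent span ID (16 hex) - flags (2 hex)
_TRACEPARENT_RE = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    flags: str

    @property
    def sampled(self) -> bool:
        # The lowest bit of the flags byte is the "sampled" flag
        return bool(int(self.flags, 16) & 0x01)


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    parsed = TraceParent(*match.groups())
    # Version ff and all-zero IDs are not valid values
    if parsed.version == "ff":
        return None
    if set(parsed.trace_id) == {"0"} or set(parsed.parent_id) == {"0"}:
        return None
    return parsed


header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
parent = parse_traceparent(header)
```

In practice the propagator in an OpenTelemetry SDK does this parsing for you. The sketch is only meant to show which part of the header carries which piece of the span context.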

Sampling strategies:

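  • Head-based: Decide when the trace starts, at the edge, and propagate the decision downstream (simple; consistent across services)
  • Tail-based: Decide after the trace completes (catches rare errors and slow requests; expensive because every span must be buffered first)
  • Probability: Keep a random percentage of traces, usually as a head-based decision (cheap; may miss edge cases)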
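
The difference between deciding up front and deciding afterwards can be sketched in a few lines. This is a simplified illustration, not the algorithm of any particular SDK or collector. head_sample, tail_sample, the span dictionaries, and the 500 ms threshold are all assumptions made for the example.

```python
def head_sample(trace_id: str, ratio: float) -> bool:
    """Head-based probability sampling.

    The decision depends only on the trace ID, so every service that sees
    the same trace reaches the same answer without coordination.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    low_64_bits = int(trace_id[-16:], 16)
    return low_64_bits < int(ratio * (1 << 64))


def tail_sample(spans, slow_ms: float = 500.0) -> bool:
    """Tail-based sampling: inspect the finished trace, keep errors and slow ones."""
    has_error = any(span.get("error", False) for span in spans)
    is_slow = any(span["duration_ms"] >= slow_ms for span in spans)
    return has_error or is_slow


# Head-based: decided before any work is done
keep_early = head_sample("4bf92f3577b34da6a3ce929d0e0e4736", ratio=0.05)

# Tail-based: decided once all spans of the trace have been collected
finished_trace = [
    {"name": "checkout", "duration_ms": 120.0},
    {"name": "charge-card", "duration_ms": 80.0, "error": True},
]
keep_late = tail_sample(finished_trace)
```

The tail-based function needs the whole trace in memory before it can answer, which is where its extra cost comes from. The head-based function needs only the trace ID.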

Variants

| Backend | Best For | Notable Features |
| --- | --- | --- |
| Jaeger | Open source, self-hosted | Native OpenTelemetry; good UI |
| Zipkin | Simple setups | Minimal resource footprint |
| AWS X-Ray | AWS-native apps | Service map; integration with ALB/Lambda |
| Datadog | Enterprise SaaS | APM + traces + logs unified |
| Grafana Tempo | Grafana stack | Cost-effective at scale |

Best Practices

  • Instrument at framework level: Auto-instrument HTTP, gRPC, database, and message queue clients
  • Add business attributes: user_id, order_id, tenant_id make traces actionable
  • Keep cardinality low: Don’t put unique IDs in span names (use attributes instead)
  • Sample aggressively in production: 1-5% is usually sufficient for debugging
  • Link traces to logs: Include trace_id in log entries for cross-referencing
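
As one way to link traces to logs, the sketch below stamps every log record with the active trace ID using only the standard library. The current_trace_id variable is a hypothetical stand-in for whatever your tracing SDK exposes, and the "checkout" logger name and log format are assumptions.

```python
import contextvars
import io
import logging

# Hypothetical holder for the active trace ID; a tracing SDK tracks this itself
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the active trace ID onto each log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, only enrich it


def build_logger(stream) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger = logging.getLogger("checkout")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger(buffer)

# Simulate handling one request that belongs to a known trace
token = current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
log.info("payment authorized")
current_trace_id.reset(token)
```

With the trace ID in every log line, you can paste an ID from a slow trace into your log search and see exactly what that request logged.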

Common Mistakes

  1. Missing context propagation: Spans break across service boundaries if headers aren’t forwarded
  2. Span explosion: Creating spans for every loop iteration creates unreadable traces
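  3. High-cardinality span names: Putting user IDs or session IDs in span names bloats backend indexes and makes aggregation useless; record them as attributes instead
  4. Sampling in dev: Applying production sampling rates in development hides spans you need to see; trace 100% there so instrumentation is easy to verify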
  5. Ignoring async flows: Background jobs, callbacks, and timers need manual span parenting
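
The async-flow problem in the last item comes down to carrying the caller's context into the place where the work actually runs. The sketch below shows the idea with the standard library's contextvars. The current_span_id variable, background_job, and submit_with_context are hypothetical names, and a real tracing SDK provides its own context helpers for the same purpose.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Hypothetical holder for the active span ID
current_span_id = contextvars.ContextVar("current_span_id", default=None)


def background_job():
    """Return the span ID this job would use as its parent."""
    return current_span_id.get()


def submit_with_context(executor, fn, *args):
    """Capture the caller's context and run fn inside it on the worker thread."""
    ctx = contextvars.copy_context()
    return executor.submit(ctx.run, fn, *args)


# The request handler has an active span when it schedules background work
current_span_id.set("00f067aa0ba902b7")

with ThreadPoolExecutor(max_workers=1) as pool:
    parent_seen = submit_with_context(pool, background_job).result()
```

Without the explicit copy, a worker thread generally cannot be relied on to see the submitting thread's context variables, so the background span would start a new, disconnected trace.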

Frequently Asked Questions

Q: Do I need to change my code for every function? A: No. Auto-instrumentation covers HTTP, DB, and queue clients. Only add manual spans for critical business operations.

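Q: What’s the performance overhead? A: Typically <1% CPU and memory when sampling 1-5%, though the exact figure depends on span volume and exporter configuration, so measure it under your own load. Head-based sampling is cheaper than tail-based because unsampled traces are dropped before they are buffered or exported.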

Q: Can I trace frontend JavaScript too? A: Yes. OpenTelemetry JS instruments browser apps, connecting user clicks to backend traces end-to-end.