intermediate By StackPractices

Distributed Tracing — End-to-End Request Flow Across Microservices

A practical guide to distributed tracing: instrumenting applications, trace propagation, sampling strategies, and diagnosing latency in microservice architectures with OpenTelemetry, Jaeger, and Zipkin.

Note: This guide follows English-language naming conventions and terminology standards common in international development teams. Examples use English identifiers and comments to maximize compatibility across codebases and tooling.

Overview

Distributed tracing captures the full journey of a request as it travels through multiple services. Unlike logs and metrics, traces show causality and timing across service boundaries, making them essential for debugging latency, understanding dependencies, and optimizing request paths in distributed systems.

This guide covers instrumentation, trace context propagation, sampling, and operational practices.

When to Use

  • You operate a microservices architecture with more than 5 services
  • Debugging latency requires correlating logs across multiple services
  • You need to understand service dependencies and critical paths
  • Your mean time to resolution (MTTR) for cross-service issues exceeds 30 minutes
  • You want to measure end-to-end request latency, not just per-service metrics

Core Concepts

Concept       | Description
Trace         | A complete record of a single request’s journey through the system
Span          | A single operation within a trace (one unit of work)
Span Context  | Metadata propagated across service boundaries (trace ID, span ID, trace flags, trace state)
Parent-Child  | Relationship showing which span caused another span
Baggage       | Key-value pairs propagated alongside the trace context
Sampling      | Deciding which traces to capture (head, tail, or adaptive)
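
The relationships between these concepts can be sketched in a few lines of plain Python. This is an illustrative data model only, not the OpenTelemetry SDK: the `SpanContext`, `Span`, and `start_span` names are hypothetical, and real spans also carry timestamps, attributes, and status.

```python
from __future__ import annotations

import secrets
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SpanContext:
    """Identifiers that cross service boundaries with every request."""
    trace_id: str   # 32 hex chars, shared by every span in the trace
    span_id: str    # 16 hex chars, unique per span


@dataclass
class Span:
    name: str
    context: SpanContext
    parent_span_id: str | None = None            # None marks the root span
    baggage: dict = field(default_factory=dict)  # key-value pairs carried alongside


def start_span(name, parent=None, baggage=None):
    """Create a root span, or a child that inherits the parent's trace ID and baggage."""
    if parent is None:
        context = SpanContext(secrets.token_hex(16), secrets.token_hex(8))
        return Span(name, context, baggage=dict(baggage or {}))
    context = SpanContext(parent.context.trace_id, secrets.token_hex(8))
    return Span(name, context, parent.context.span_id, dict(parent.baggage))


root = start_span("GET /orders/{id}", baggage={"tenant": "acme"})
child = start_span("fetch_order_db", parent=root)
```

Both spans share one trace ID, and the child records the root's span ID as its parent. That pair of links is what lets a backend rebuild the request tree.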

Architecture

┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
│   API    │──→│   Auth   │──→│  Orders  │──→│ Payment  │
│ Gateway  │   │ Service  │   │ Service  │   │ Service  │
└────┬─────┘   └────┬─────┘   └────┬─────┘   └────┬─────┘
     │              │              │              │
     └──────────────┴───────┬──────┴──────────────┘
                            ▼
                    [Trace Collector]
                            ▼
                [Jaeger / Zipkin / Tempo]

Step-by-Step Distributed Tracing Setup

1. Instrument Your Application

Add OpenTelemetry SDK to your services:

# Example: Python Flask with OpenTelemetry
# Assumes `db`, `Order`, and `payment_client` are defined elsewhere in your application.
from flask import Flask, jsonify
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.resources import Resource, SERVICE_NAME

# Configure tracer provider
resource = Resource.create({SERVICE_NAME: "orders-service"})
provider = TracerProvider(resource=resource)
trace.set_tracer_provider(provider)

# Export to collector (Jaeger, Zipkin, or Tempo)
otlp_exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True)
provider.add_span_processor(BatchSpanProcessor(otlp_exporter))

# Auto-instrument Flask
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)

tracer = trace.get_tracer(__name__)

@app.route("/orders/<order_id>")
def get_order(order_id):
    with tracer.start_as_current_span("get_order") as span:
        span.set_attribute("order.id", order_id)
        
        # Add child spans for database calls
        with tracer.start_as_current_span("fetch_order_db") as db_span:
            order = db.query(Order).get(order_id)
            db_span.set_attribute("db.rows_returned", 1)
        
        # Add child spans for external calls
        with tracer.start_as_current_span("verify_payment") as payment_span:
            status = payment_client.verify(order.payment_id)
            payment_span.set_attribute("payment.status", status)
        
        return jsonify(order.to_dict())

// Example: Node.js Express with OpenTelemetry
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({ url: 'http://otel-collector:4317' }),
  instrumentations: [getNodeAutoInstrumentations()],
  serviceName: 'payment-service'
});
sdk.start();

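// Manual span creation (load Express after sdk.start() so auto-instrumentation can patch it)
const express = require('express');
const { trace, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('payment-service');

const app = express();
app.use(express.json());

// `processPayment` is your application's own function
app.post('/payments', async (req, res) => {
  const span = tracer.startSpan('process_payment');
  span.setAttribute('payment.amount', req.body.amount);
  
  try {
    const result = await processPayment(req.body);
    span.setAttribute('payment.status', 'success');
    res.json(result);
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });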
    res.status(500).json({ error: error.message });
  } finally {
    span.end();
  }
});

Instrumentation checklist:

  • Auto-instrument HTTP frameworks, database clients, and messaging libraries
  • Create manual spans for business operations (not just infrastructure)
  • Add attributes to spans for filtering and correlation
  • Record exceptions with stack traces
  • Set span status (OK, ERROR) explicitly

2. Propagate Trace Context

Ensure trace IDs flow across all service boundaries:

# Example: Propagate trace context in HTTP headers
import requests
from opentelemetry.propagate import inject
from opentelemetry import trace

def call_user_service(user_id):
    headers = {}
    inject(headers)  # Adds traceparent, tracestate headers
    
    response = requests.get(
        f"http://user-service/users/{user_id}",
        headers=headers
    )
    return response.json()

// Example: Spring Boot with trace propagation
@RestController
public class OrderController {
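    @Autowired
    private RestTemplate restTemplate;

    @Autowired
    private OrderService orderService;
    
    @GetMapping("/orders/{id}")
    public Order getOrder(@PathVariable String id) {
        // Trace context is propagated automatically only when the RestTemplate bean is instrumented
        // (for example, by the OpenTelemetry Java agent or a tracing-aware RestTemplateBuilder)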
        User user = restTemplate.getForObject(
            "http://user-service/users/{id}", User.class, id
        );
        return orderService.findById(id, user);
    }
}

Propagation requirements:

  • HTTP: Use traceparent and tracestate headers (W3C standard)
  • gRPC: Use metadata keys traceparent and tracestate
  • Message queues: Embed trace context in message attributes/headers
  • Async processing: Ensure context propagates to thread pools and callbacks
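
To make the header format concrete, here is a minimal sketch of building and parsing a W3C `traceparent` value using only the standard library. The helper names (`build_traceparent`, `parse_traceparent`, `outgoing_headers`) are hypothetical, and the sketch handles only version `00` and ignores `tracestate`. In real services the OpenTelemetry propagators do this for you.

```python
import re
import secrets

# Version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex trace flags
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def build_traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header: str):
    """Return (trace_id, parent_span_id, sampled), or None if the header is invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are not valid identifiers
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return trace_id, span_id, bool(int(flags, 16) & 0x01)


def outgoing_headers(incoming: dict) -> dict:
    """Continue the caller's trace if one is present, otherwise start a new trace."""
    parsed = parse_traceparent(incoming.get("traceparent", ""))
    if parsed:
        trace_id, _, sampled = parsed
    else:
        trace_id, sampled = secrets.token_hex(16), True
    # Each hop gets a fresh span ID but keeps the trace ID
    return {"traceparent": build_traceparent(trace_id, secrets.token_hex(8), sampled)}


incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
downstream = outgoing_headers(incoming)
```

The trace ID survives every hop while the span ID changes. A service that drops or regenerates the trace ID is where a trace breaks.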

3. Configure Sampling

Capture traces efficiently without overwhelming storage:

Sampling Type | When to Use                                | Trade-off
Head-based    | Decide at request start based on rate      | Simple, but may miss interesting slow traces
Tail-based    | Collect all spans, decide after completion | Catches slow/error traces, higher memory cost
Adaptive      | Adjust rate based on traffic patterns      | Best coverage, more complex configuration

# Example: OpenTelemetry Collector sampling configuration
# (probabilistic_sampler and tail_sampling ship with the Collector contrib distribution)
processors:
  probabilistic_sampler:
    sampling_percentage: 10  # Sample 10% of traces

  tail_sampling:
    decision_wait: 10s  # How long to buffer spans before deciding
    policies:
      - name: slow_requests
        type: latency
        latency:
          threshold_ms: 1000
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]

# Example: Programmatic sampling
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample 5% of root traces deterministically; child spans follow the parent's decision
sampler = ParentBased(root=TraceIdRatioBased(0.05))
provider = TracerProvider(sampler=sampler)
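
The reason ratio-based sampling gives every service the same answer is that the decision is a pure function of the trace ID. A minimal sketch of that idea follows. It is an illustration in the spirit of trace-ID ratio samplers, not the SDK's own code, and `should_sample` is a hypothetical name.

```python
import secrets


def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Keep a trace when the low 64 bits of its ID fall below ratio * 2**64."""
    low_64_bits = int(trace_id_hex, 16) & ((1 << 64) - 1)
    return low_64_bits < int(ratio * (1 << 64))


# Any service that sees the same trace ID reaches the same decision,
# with no coordination between services.
trace_id = secrets.token_hex(16)
decision_in_gateway = should_sample(trace_id, 0.05)
decision_in_orders = should_sample(trace_id, 0.05)

# Over many random IDs, the kept fraction approaches the configured ratio.
ids = [secrets.token_hex(16) for _ in range(10_000)]
kept_fraction = sum(should_sample(t, 0.05) for t in ids) / len(ids)
```

This only holds if every service uses the same ratio and the same comparison rule. Mixing ratios across services produces partial traces.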

Sampling best practices:

  • Start with 1-10% sampling in production
  • Always sample error traces and slow requests (tail-based)
  • Use consistent sampling across services (same trace ID → same decision)
  • Monitor sampling rate and storage costs
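
The "always keep errors and slow requests" rule is a tail-based decision: it can only be made once the spans of a trace have been buffered. A minimal sketch of that decision, assuming a hypothetical `FinishedSpan` record and a 1000 ms threshold, might look like this:

```python
from dataclasses import dataclass


@dataclass
class FinishedSpan:
    trace_id: str
    start_ms: float
    end_ms: float
    is_error: bool = False


def keep_trace(spans, latency_threshold_ms=1000.0):
    """Decide after completion: keep the trace if it was slow or contained an error."""
    if not spans:
        return False
    duration = max(s.end_ms for s in spans) - min(s.start_ms for s in spans)
    return duration >= latency_threshold_ms or any(s.is_error for s in spans)


def tail_sample(buffered, latency_threshold_ms=1000.0):
    """Group buffered spans by trace ID and return the IDs of traces worth keeping."""
    by_trace = {}
    for span in buffered:
        by_trace.setdefault(span.trace_id, []).append(span)
    return {tid for tid, spans in by_trace.items()
            if keep_trace(spans, latency_threshold_ms)}


buffered = [
    FinishedSpan("fast-ok", 0, 120),
    FinishedSpan("slow", 0, 1500),
    FinishedSpan("slow", 10, 1400),
    FinishedSpan("failed", 0, 80),
    FinishedSpan("failed", 5, 60, is_error=True),
]
kept = tail_sample(buffered)
```

The buffering is where the memory cost comes from: every span must be held until its trace is judged complete.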

4. Correlate with Logs and Metrics

Link traces to other observability signals:

# Example: Adding trace context to logs
import structlog
from opentelemetry import trace

logger = structlog.get_logger()

def log_with_trace(message, **kwargs):
    current_span = trace.get_current_span()
    span_context = current_span.get_span_context()
    
    logger.info(
        message,
        trace_id=format(span_context.trace_id, '032x'),
        span_id=format(span_context.span_id, '016x'),
        **kwargs
    )

# Usage
log_with_trace("Processing payment", payment_id="pay-123", amount=99.99)

Correlation patterns:

  • Logs: Include trace_id and span_id in every log entry
  • Metrics: Attach trace IDs to latency metrics as exemplars for drill-down; do not use trace_id as a metric label, because it is unbounded in cardinality
  • Errors: Attach trace context to error tracking (Sentry, Bugsnag)
  • Dashboards: Link from latency spikes directly to example traces
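
If you use the standard `logging` module rather than structlog, the same log correlation can be done once in a filter so that no call site has to remember it. The sketch below is self-contained and makes one loud assumption: `current_span_ids` is a hypothetical stand-in for the tracer's current span, which in a real service you would read with `trace.get_current_span()`.

```python
import contextvars
import io
import logging

# Hypothetical stand-in for the tracer's current span: (trace_id, span_id) as integers
current_span_ids = contextvars.ContextVar("current_span_ids", default=None)


class TraceContextFilter(logging.Filter):
    """Stamp every log record with the active trace and span IDs."""

    def filter(self, record):
        ids = current_span_ids.get()
        record.trace_id = format(ids[0], "032x") if ids else "-"
        record.span_id = format(ids[1], "016x") if ids else "-"
        return True  # never drop the record, only enrich it


def build_logger(stream):
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"))
    handler.addFilter(TraceContextFilter())
    logger = logging.getLogger("payments")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
current_span_ids.set((0x4BF92F3577B34DA6A3CE929D0E0E4736, 0x00F067AA0BA902B7))
log.info("Processing payment")
```

The IDs are formatted as zero-padded lowercase hex so they match what the trace backend displays and can be pasted straight into its search.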

5. Query and Analyze Traces

Use your trace backend to find and diagnose issues:

# Example: Jaeger search patterns (written as the fields of the Jaeger UI search form)

# Find traces for a specific service
Service: orders-service

# Find slow traces (>500ms)
Service: orders-service   Min Duration: 500ms

# Find error traces
Service: orders-service   Tags: error=true

# Find traces for a specific user
Service: orders-service   Tags: user.id=user-123

# Find traces that touched multiple services
# (not a search filter: open a matching trace and check the services listed in its timeline)

Common trace analysis queries:

  • Latency hotspots: Group by service, find slowest spans
  • Error correlation: Which services fail together?
  • Dependency mapping: Which services call which?
  • Bottleneck identification: Where is time spent in a trace?
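
Two of these analyses, bottleneck identification and dependency mapping, reduce to simple computations over the spans of a trace. The sketch below uses a hypothetical, hand-written list of span dictionaries with times in milliseconds. Real exports from Jaeger, Zipkin, or Tempo have their own schemas and would need mapping into this shape first.

```python
from collections import defaultdict

# Hypothetical flattened spans from a single trace (times in ms from trace start)
spans = [
    {"id": "a", "parent": None, "service": "api-gateway", "name": "GET /orders/{id}", "start": 0, "end": 480},
    {"id": "b", "parent": "a", "service": "auth-service", "name": "verify_token", "start": 5, "end": 35},
    {"id": "c", "parent": "a", "service": "orders-service", "name": "get_order", "start": 40, "end": 470},
    {"id": "d", "parent": "c", "service": "orders-service", "name": "fetch_order_db", "start": 45, "end": 95},
    {"id": "e", "parent": "c", "service": "payment-service", "name": "verify_payment", "start": 100, "end": 460},
]


def self_times(spans):
    """Span duration minus time spent in direct children (assumes children do not overlap)."""
    child_total = defaultdict(int)
    for s in spans:
        if s["parent"] is not None:
            child_total[s["parent"]] += s["end"] - s["start"]
    return {s["id"]: (s["end"] - s["start"]) - child_total[s["id"]] for s in spans}


def service_edges(spans):
    """Caller -> callee pairs, derived from parent-child links that cross services."""
    by_id = {s["id"]: s for s in spans}
    edges = set()
    for s in spans:
        parent = by_id.get(s["parent"])
        if parent and parent["service"] != s["service"]:
            edges.add((parent["service"], s["service"]))
    return edges


times = self_times(spans)
bottleneck_id = max(times, key=times.get)
bottleneck = next(s for s in spans if s["id"] == bottleneck_id)
edges = service_edges(spans)
```

Ranking by self time rather than total duration matters here: the root span is always the longest, but in this example almost all of its time is spent waiting on `verify_payment`.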

Best Practices

  • Instrument at the framework level first. HTTP clients, databases, and message queues give the most value with least effort.
  • Use semantic conventions. Follow OpenTelemetry semantic conventions for span names and attributes.
  • Avoid high-cardinality span names. User IDs in span names cause index explosion; keep names generic (for example, GET /users/{id}) and put the ID in an attribute instead.
  • Sample intelligently. Tail-based sampling captures the most important traces.
  • Keep trace depth reasonable. Limit to 50-100 spans per trace; deep nesting hurts readability.
  • Monitor the monitoring. Alert if trace collection rate drops or collector queue backs up.

Common Mistakes

  • Missing context propagation. A broken trace is worse than no trace — verify headers flow everywhere.
  • Over-instrumenting. Every loop iteration does not need a span. Instrument operations, not iterations.
  • Treating traces as a replacement for logs. Traces complement logs; they do not replace them.
  • Ignoring sampling costs. 100% sampling in high-traffic systems generates terabytes of data.
  • Not correlating with metrics. Traces show what happened; metrics show how often. Use both.
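
The sampling-cost point is easy to check with back-of-the-envelope arithmetic. The function below is a rough estimate only, and every input in the example (5,000 requests per second, 20 spans per request, 500 bytes per stored span) is an assumed figure for illustration, not a measurement.

```python
def daily_trace_bytes(requests_per_second, spans_per_request, bytes_per_span, sampling_ratio):
    """Rough daily storage estimate for trace data, before compression and indexing."""
    seconds_per_day = 86_400
    spans_per_second = requests_per_second * spans_per_request
    return spans_per_second * bytes_per_span * sampling_ratio * seconds_per_day


# Assumed workload: 5,000 req/s, 20 spans per request, 500 bytes per span
full = daily_trace_bytes(5_000, 20, 500, 1.0)
sampled = daily_trace_bytes(5_000, 20, 500, 0.05)

print(f"100% sampling: {full / 1e12:.2f} TB/day")   # 4.32 TB/day
print(f"5% sampling:   {sampled / 1e9:.0f} GB/day")  # 216 GB/day
```

Plug in your own traffic and span sizes. The estimate scales linearly in every input, so halving the sampling ratio halves the storage.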

Variants

  • Request shadowing: Duplicate traffic to a shadow environment with full tracing
  • Synthetic tracing: Inject fake requests to continuously monitor paths
  • eBPF-based tracing: Kernel-level tracing without application instrumentation
  • Service mesh tracing: Istio/Linkerd automatic trace propagation

FAQ

Q: What is the difference between distributed tracing and logging?
Logs are discrete events. Traces show causality and timing across services. Use both: traces for request flow, logs for detailed state.

Q: How much overhead does tracing add?
Figures of 1-5% CPU and memory are typical, but measure it for your own workload because it depends on span volume and exporter configuration. Sampling reduces it further. The overhead is usually worth the debugging speedup.

Q: Should I use Jaeger, Zipkin, or Tempo?
All support OpenTelemetry. Jaeger has the largest community. Zipkin is simpler. Tempo is Grafana-native and cost-efficient at scale.

Q: Can I trace asynchronous workflows?
Yes, but ensure trace context propagates across message queues, callbacks, and thread pools. This is the most common source of broken traces.
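
The thread-pool case can be sketched with the standard library alone. Here `current_trace_id` is a hypothetical context variable standing in for the tracer's active context, and `submit_with_context` is an illustrative helper, not a library function. OpenTelemetry's threading instrumentation exists to do the equivalent for you.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the tracer's active context
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


def handle_item(item):
    """Worker function: reports which trace it believes it belongs to."""
    return item, current_trace_id.get()


def submit_with_context(executor, fn, *args):
    """Snapshot the caller's context and run the task inside that snapshot."""
    ctx = contextvars.copy_context()
    return executor.submit(ctx.run, fn, *args)


current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")

with ThreadPoolExecutor(max_workers=2) as pool:
    # Plain submit: worker threads usually start with an empty context,
    # so the trace ID is typically missing here.
    plain = pool.submit(handle_item, "job-1").result()
    # Explicit snapshot: the worker sees the submitting thread's trace ID.
    propagated = submit_with_context(pool, handle_item, "job-2").result()
```

A fresh snapshot is taken per task because a single context object cannot be entered by two threads at once.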

Conclusion

Distributed tracing is essential for operating microservices at scale. By instrumenting your applications, propagating context faithfully, and sampling intelligently, you transform opaque cross-service failures into visual, debuggable request flows.