Observability | Feb 2026 | 5 min read

Distributed Tracing: Implementing Context Propagation at Scale

How we tracked transaction paths across asynchronous barriers, queue boundaries, and thread pools

#The Blind Spot: Asynchronous Context Loss

In distributed microservices, a single user transaction spans multiple services connected by HTTP, Kafka queues, and asynchronous execution pools. If a request loses its trace identifier (TraceID) along the way, troubleshooting a failure becomes nearly impossible. ThreadLocal-based diagnostic tools work well in synchronous code, but a ThreadLocal value is visible only to the thread that set it, so the context disappears as soon as execution hops to a worker in a custom thread pool or crosses a message queue.
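To make the failure mode concrete, here is a minimal JDK-only sketch. The class name `ThreadLocalLossDemo` and the `TRACE_ID` holder are hypothetical stand-ins for any ThreadLocal-backed diagnostic context; they are not taken from our codebase or from a tracing library.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical demo: TRACE_ID stands in for any ThreadLocal-backed trace context.
class ThreadLocalLossDemo {
    static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    // Sets a trace ID on the calling thread, then reads the ThreadLocal
    // from a pool worker and returns what the worker saw.
    static String traceIdSeenByWorker(String traceId) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        TRACE_ID.set(traceId);
        try {
            Callable<String> readOnWorker = TRACE_ID::get;
            return pool.submit(readOnWorker).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            TRACE_ID.remove(); // do not leak the value on the caller's thread
            pool.shutdown();
        }
    }
}
```

Calling `ThreadLocalLossDemo.traceIdSeenByWorker("4bf92f3577b34da6a3ce929d0e0e4736")` returns `null`: the worker thread has its own, empty slot for `TRACE_ID`, so anything it logs carries no trace identifier.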

Figure: Distributed Tracing Context Propagation Flow. TraceID propagation path spanning HTTP gateways, cross-thread boundaries, and Kafka message systems.

#Context Propagation across Thread Barriers

To bridge tracing context across Java thread pools, we wrapped all system executors in custom trace decorators. These decorators capture the OpenTelemetry context from the parent thread and inject it into the child worker thread before the task executes.

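```java
import io.opentelemetry.context.Context;
import io.opentelemetry.context.Scope;
import java.util.concurrent.Executor;

// Custom executor wrapper ensuring trace context propagation
public class ContextPropagatingExecutor implements Executor {
    private final Executor delegate;

    public ContextPropagatingExecutor(Executor delegate) {
        this.delegate = delegate;
    }

    @Override
    public void execute(Runnable command) {
        // Capture the context of the submitting thread at submission time
        Context parentContext = Context.current();

        delegate.execute(() -> {
            // Make the captured context current on the worker thread. Closing the
            // Scope restores the worker's previous context, so nothing leaks into
            // the next task that runs on this pooled thread.
            try (Scope ignored = parentContext.makeCurrent()) {
                command.run();
            }
        });
    }
}
```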

#Kafka Header Context Injection

When publishing messages to Kafka, we inject the W3C Trace Context headers (`traceparent` and `tracestate`) into the record's headers. On the consumer side, an interceptor extracts these headers and restores the span context before the message is processed, so spans on both sides of the queue belong to one continuous trace.
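The sketch below illustrates the shape of that round trip using only the JDK. It is not our interceptor code: `TraceparentHeaders` is a hypothetical helper, a plain `Map<String, byte[]>` stands in for the Kafka record headers, and only the version `00` layout of `traceparent` (`version-traceid-parentid-flags`) is handled. In practice the OpenTelemetry propagator does this work, including `tracestate` and the validity checks omitted here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical helper; a Map<String, byte[]> stands in for Kafka record headers.
class TraceparentHeaders {
    static final String HEADER = "traceparent";

    // Producer side: write "00-<trace-id>-<parent-span-id>-<flags>" into the headers.
    static void inject(Map<String, byte[]> headers, String traceId, String spanId, boolean sampled) {
        String value = "00-" + traceId + "-" + spanId + "-" + (sampled ? "01" : "00");
        headers.put(HEADER, value.getBytes(StandardCharsets.UTF_8));
    }

    // Consumer side: returns {traceId, parentSpanId, flags},
    // or null when the header is absent or malformed.
    static String[] extract(Map<String, byte[]> headers) {
        byte[] raw = headers.get(HEADER);
        if (raw == null) {
            return null;
        }
        String[] parts = new String(raw, StandardCharsets.UTF_8).split("-");
        if (parts.length != 4
                || !parts[0].equals("00")
                || !parts[1].matches("[0-9a-f]{32}")
                || !parts[2].matches("[0-9a-f]{16}")
                || !parts[3].matches("[0-9a-f]{2}")) {
            return null;
        }
        return new String[] {parts[1], parts[2], parts[3]};
    }
}
```

Because the consumer recovers the producer's trace ID and span ID, the span it starts can reference the producer span as its parent instead of opening an unrelated trace.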

#Taming Telemetry Volume: Tail-Based Sampling

Recording spans for 100% of successful HTTP transactions produces petabytes of telemetry data, leading to astronomical costs. We configured an OpenTelemetry Collector gateway to perform **tail-based sampling**: the collector buffers each trace's spans in memory and makes the keep-or-drop decision only once the whole trace can be evaluated. It drops 99% of successful transactions but preserves 100% of traces containing HTTP error statuses, slow-query exceptions, or explicit failure spans.
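The decision logic can be sketched in a few lines of plain Java. This is an illustration of the idea, not the Collector's implementation and not our configuration: `TailSamplerSketch`, `SpanRecord`, and the hash-based choice of which healthy traces to keep are all assumptions made for the example, and a real sampler also needs a timeout to decide when a trace is complete.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory sampler; not the OpenTelemetry Collector's implementation.
class TailSamplerSketch {
    // Minimal span record: only the fields this decision needs.
    static final class SpanRecord {
        final String traceId;
        final boolean error;

        SpanRecord(String traceId, boolean error) {
            this.traceId = traceId;
            this.error = error;
        }
    }

    private final Map<String, List<SpanRecord>> buffer = new HashMap<>();
    private final int keepPercent; // share of healthy traces to keep, 0..100

    TailSamplerSketch(int keepPercent) {
        if (keepPercent < 0 || keepPercent > 100) {
            throw new IllegalArgumentException("keepPercent must be between 0 and 100");
        }
        this.keepPercent = keepPercent;
    }

    // Buffer every incoming span under its trace ID; no decision is made yet.
    void onSpan(SpanRecord span) {
        buffer.computeIfAbsent(span.traceId, id -> new ArrayList<>()).add(span);
    }

    // Called once the trace is considered complete. Releases the buffered
    // spans and returns true if the trace should be exported.
    boolean decide(String traceId) {
        List<SpanRecord> spans = buffer.remove(traceId);
        if (spans == null) {
            return false; // nothing buffered for this trace
        }
        for (SpanRecord span : spans) {
            if (span.error) {
                return true; // always keep traces that contain a failure
            }
        }
        // Healthy trace: keep a deterministic fraction based on the trace ID hash.
        return Math.floorMod(traceId.hashCode(), 100) < keepPercent;
    }
}
```

With `new TailSamplerSketch(1)`, every trace containing an error span is exported and roughly 1% of healthy traces survive. The decision can only be made after buffering because an error may appear in the last span of a trace, which is what distinguishes tail-based from head-based sampling.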

Use tail-based sampling in your tracing collectors. Collecting 100% of failures and only a small fraction of successes protects your observability budget while maintaining maximum diagnostic value.
