Distributed Tracing: Implementing Context Propagation at Scale
How we tracked transaction paths across asynchronous barriers, queue boundaries, and thread pools
#The Blind Spot: Asynchronous Context Loss
In distributed microservices, a single user transaction spans multiple services connected by HTTP, Kafka queues, and asynchronous execution pools. If a request loses its trace identifier (TraceID) along the way, troubleshooting a failure becomes nearly impossible. ThreadLocal-based diagnostic tools work well in synchronous code, but a ThreadLocal value is bound to the thread that set it, so the context is lost as soon as execution shifts to a custom thread pool or crosses a message queue.
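
To make the failure mode concrete, here is a minimal sketch using only the JDK. The class name `ThreadLocalLossDemo` and the `TRACE_ID` field are hypothetical stand-ins for a ThreadLocal-backed diagnostic context, not part of our tracing stack.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical demo: a ThreadLocal value does not follow a task onto a pool thread.
class ThreadLocalLossDemo {
    // Stand-in for a ThreadLocal-backed diagnostic context (assumption for illustration).
    static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    // Sets the trace ID on the calling thread, then reads it from a pool thread.
    static String readFromWorker(String traceId) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            TRACE_ID.set(traceId);
            Callable<String> task = () -> TRACE_ID.get();
            // The worker thread has its own, empty ThreadLocal slot, so this yields null.
            return pool.submit(task).get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            TRACE_ID.remove();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("worker sees: " + readFromWorker("4bf92f3577b34da6a3ce929d0e0e4736"));
    }
}
```

Running `main` prints `worker sees: null`: the value set by the caller is invisible to the pool thread unless something carries it across explicitly.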

Figure: Distributed tracing context. TraceID propagation path spanning HTTP gateways, cross-thread boundaries, and Kafka message systems.
#Context Propagation across Thread Barriers
To bridge tracing context across Java thread pools, we wrapped all system executors in custom trace decorators. These decorators capture the OpenTelemetry context from the parent thread and inject it into the child worker thread before the task executes.
```java
import io.opentelemetry.context.Context;
import io.opentelemetry.context.Scope;
import java.util.concurrent.Executor;

// Custom executor wrapper ensuring trace context propagation
public class ContextPropagatingExecutor implements Executor {
    private final Executor delegate;

    public ContextPropagatingExecutor(Executor delegate) {
        this.delegate = delegate;
    }

    @Override
    public void execute(Runnable command) {
        // Capture the context of the submitting (parent) thread
        Context parentContext = Context.current();
        delegate.execute(() -> {
            // Make the captured context current on the worker thread while the task runs;
            // closing the Scope restores whatever context the worker had before
            try (Scope ignored = parentContext.makeCurrent()) {
                command.run();
            }
        });
    }
}
```
#Kafka Header Context Injection
When publishing messages to Kafka, we inject the OpenTelemetry trace headers (`traceparent` and `tracestate`) into the record's headers. On the consumer side, an interceptor extracts the headers and restores the remote span context, so the consumer's spans join the producer's trace and the trace stays continuous across the message queue boundary.
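
The round trip can be sketched with the JDK alone. This is an illustration under stated assumptions, not our interceptor code: a `Map<String, byte[]>` stands in for the Kafka record headers, the class and method names are hypothetical, and the value is formatted by hand following the W3C Trace Context `traceparent` layout (`version-traceid-parentid-flags`). In a real service the OpenTelemetry propagator API would do the formatting and parsing.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical sketch: write and read a W3C traceparent value in message headers.
class TraceparentHeaderSketch {
    static final String HEADER = "traceparent";

    // Producer side: encode the current trace ID, span ID, and sampled flag.
    static void inject(String traceId, String spanId, boolean sampled, Map<String, byte[]> headers) {
        String value = "00-" + traceId + "-" + spanId + "-" + (sampled ? "01" : "00");
        headers.put(HEADER, value.getBytes(StandardCharsets.UTF_8));
    }

    // Consumer side: returns {traceId, parentSpanId, flags}, or null when the
    // header is missing or does not have the expected shape.
    static String[] extract(Map<String, byte[]> headers) {
        byte[] raw = headers.get(HEADER);
        if (raw == null) {
            return null;
        }
        String[] parts = new String(raw, StandardCharsets.UTF_8).split("-");
        boolean wellFormed = parts.length == 4
                && parts[1].length() == 32   // trace ID: 32 hex characters
                && parts[2].length() == 16   // parent span ID: 16 hex characters
                && parts[3].length() == 2;   // flags: 2 hex characters
        return wellFormed ? new String[] {parts[1], parts[2], parts[3]} : null;
    }
}
```

The consumer would then start its own span with the extracted span ID as parent, which is what keeps both sides under the same trace ID.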
#Taming Telemetry Volume: Tail-Based Sampling
Retaining tracing spans for 100% of successful HTTP transactions produces petabytes of telemetry data, and the storage cost is prohibitive. We configured an OpenTelemetry Collector gateway to perform **tail-based sampling**. The collector buffers each trace's spans in memory and makes the keep-or-drop decision only after it has seen the whole trace. It drops 99% of successful transactions but preserves 100% of traces containing HTTP status errors, slow query exceptions, or explicit failure spans.
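
The decision rule can be sketched as a small function. This is an illustrative model of the policy described above, not the Collector's implementation: the `SpanRecord` type, the latency threshold parameter, and the hash-based baseline sampling are all assumptions made for the example.

```java
import java.util.List;

// Hypothetical model of a tail-based sampling decision over a fully buffered trace.
class TailSamplingSketch {
    // Minimal stand-in for a finished span (assumption for illustration).
    static final class SpanRecord {
        final boolean error;
        final long durationMillis;

        SpanRecord(boolean error, long durationMillis) {
            this.error = error;
            this.durationMillis = durationMillis;
        }
    }

    // Keeps every trace that contains an error or a slow span; otherwise keeps
    // roughly baselinePercent of traces, chosen deterministically by trace ID.
    static boolean keepTrace(String traceId, List<SpanRecord> spans,
                             long slowThresholdMillis, int baselinePercent) {
        for (SpanRecord span : spans) {
            if (span.error || span.durationMillis >= slowThresholdMillis) {
                return true;
            }
        }
        // Hashing the trace ID gives the same verdict for a trace on every evaluation.
        return Math.floorMod(traceId.hashCode(), 100) < baselinePercent;
    }
}
```

With `baselinePercent` set to 1, this mirrors the policy above: failing or slow traces are always kept, and about 1% of healthy traces survive as a baseline. The decision needs the complete trace, which is why the spans have to be buffered first.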