What this usually means
Correlation ID propagation fails when the context carrier (thread-local, HTTP header, or message envelope) is not explicitly forwarded across an execution boundary. The most common culprit is a thread pool: the default executor does not inherit the parent thread's ThreadLocal variables. Similarly, asynchronous frameworks (CompletableFuture, reactive streams, message listeners) each have their own context propagation mechanism. Without explicit instrumentation (e.g., MDC clear/put, OpenTelemetry context injection, or manual header forwarding), the correlation ID is lost. Another frequent cause is that the ID is generated at the ingress but never added to outgoing requests or messages — a simple oversight in middleware or client code.
The first ten minutes — establish facts before touching code.
1. Send a request with a unique, recognizable value in your correlation ID header (e.g., `X-Correlation-Id`) and grep the ingress logs for it; if it is missing after the first hop, check your API gateway or ingress controller.
2. Log the thread name next to the correlation ID on both sides of the suspect call; if the thread changes and the ID disappears, the boundary is a thread handoff. Temporarily adding `-Djava.util.concurrent.ForkJoinPool.common.parallelism=1` limits the common pool and can make the logs easier to follow, but it does not make worker threads inherit context.
3. Inspect MDC context after async execution: `logger.info("MDC: {}", MDC.getCopyOfContextMap());` inside the async task.
4. For HTTP, dump all incoming and outgoing headers using `curl -v` or a proxy like mitmproxy; verify the correlation ID header is present on the incoming request and forwarded on every outgoing request.
5. Check message queue listeners: print the message headers before processing; if the ID is in the headers but not in MDC, the listener middleware is not reading it.
6. In Spring-based services, check code that relies on `RequestContextHolder` or `HystrixRequestVariableDefault`: both are thread-bound and come back empty on a different thread unless the context is transferred explicitly.
The specific files, logs, configs, and dashboards that usually own this bug.
- Application logs at the boundary: search for 'correlation' or 'traceId' in the log lines immediately before and after the async call.
- Thread dump analysis: look for threads with an MDC or ThreadLocal that is empty after invocation.
- Source code of the async executor/service: check if the thread pool is wrapped with a context-aware executor (e.g., `ThreadPoolTaskExecutor` with `setTaskDecorator`).
- HTTP client configuration: check if the client (RestTemplate, WebClient, Feign) is configured to forward headers.
- Message broker configuration: examine the listener container factory for after-receive post-processors (`setAfterReceivePostProcessors`) or channel interceptors (`beforeHandle`) that read the header.
- OpenTelemetry SDK configuration: verify the `OTEL_PROPAGATORS` environment variable includes `tracecontext` and `baggage`.
- API gateway or reverse proxy (Kong, NGINX, Envoy): confirm the correlation ID header is allowed through by the proxy's header forwarding rules.
Practical causes, not theory. These are the things you will actually find.
- Thread pool executor not inheriting parent thread's ThreadLocal: missing custom `beforeExecute`/`afterExecute` or `TaskDecorator`
- Reactive frameworks (WebFlux, RxJava) using a different context (e.g., Reactor's `Context` instead of MDC) without bridging
- Message queue consumer not extracting correlation ID from message headers into MDC before processing
- HTTP client (e.g., Apache HttpClient) not configured to propagate headers from the current request context
- Missing or misconfigured OpenTelemetry propagator: only W3C Trace Context set but not Baggage
- Asynchronous event listeners (e.g., Spring `@Async`) not wrapping the task with a context-aware executor
- Custom thread pool created with `Executors.newFixedThreadPool` instead of a context-aware factory
Concrete fix directions. Pick the one that matches your root cause.
- Wrap the thread pool with a `TaskDecorator` that copies MDC context from the submitting thread to the executing thread
- For Spring `@Async`, provide a custom `AsyncConfigurer` that returns a `ThreadPoolTaskExecutor` with a decorator that transfers MDC and request attributes
- For reactive stacks, use `Hooks.enableAutomaticContextPropagation()` in Reactor (3.5.3 or later) or propagate the context manually with `contextWrite` (the older `subscriberContext` is deprecated)
- In message listeners, add a `MessagePostProcessor` or a `ChannelInterceptor` that reads the correlation ID header and sets it into MDC before the handler runs
- For HTTP clients, use an interceptor (e.g., `ClientHttpRequestInterceptor`) that copies the correlation ID from the current request's headers
- Standardize on OpenTelemetry: instrument all services with the SDK and set `OTEL_PROPAGATORS=tracecontext,baggage` on every service
- Implement a middleware filter that ensures every outgoing HTTP call includes the correlation ID from the incoming request
A fix you cannot prove is a guess. Close the loop.
- Run a synthetic transaction across 3+ services and check that all logs share the same correlation ID in your centralized logging system
- Add a test that asserts MDC contains the correlation ID inside an async task: `assertThat(MDC.get("correlationId")).isEqualTo(expectedId);`
- Use distributed tracing tools (Jaeger, Zipkin) to verify the trace spans are connected; a broken correlation ID will show as separate traces
- Deploy a canary with the fix and compare log correlation rates between the canary and baseline using a script that parses log lines
- Write an integration test that calls an async endpoint and verifies the response headers include the correlation ID
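The canary comparison above needs a number to compare. A minimal sketch of such a log-parsing helper follows; the class name `CorrelationRate` and the `correlationId=<value>` log layout are assumptions for illustration, so adjust the pattern to whatever your appender actually writes.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Sketch: share of log lines that carry a correlation ID (hypothetical log layout). */
final class CorrelationRate {
    // Assumption: tagged lines contain "correlationId=<non-empty value>".
    private static final Pattern HAS_ID = Pattern.compile("correlationId=[^\\s,\\]]+");

    /** Returns the fraction (0.0 to 1.0) of lines with a correlation ID; 0.0 if there are no lines. */
    static double rate(List<String> logLines) {
        if (logLines.isEmpty()) {
            return 0.0;
        }
        long tagged = logLines.stream()
                .filter(line -> HAS_ID.matcher(line).find())
                .count();
        return (double) tagged / logLines.size();
    }
}
```

Run it over the same time window for canary and baseline; a line with an empty `correlationId=` field counts as untagged.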
Things that make this bug worse or harder to find.
- Copying only the correlation ID but not other tracing context (e.g., span ID), which breaks full trace visualization
- Using `InheritableThreadLocal` blindly: pooled threads inherit from the thread that created them, not the thread that submits a task, so tasks see stale context that is never refreshed or cleared
- Forgetting to clear MDC after async task completion, which corrupts subsequent tasks on the same thread
- Propagating the correlation ID via a custom thread-local but not cleaning it up on exceptions, which leads to stale context
- Assuming all libraries automatically propagate context; for example, `RestTemplate` does not forward request headers unless configured
- Adding correlation ID propagation only at the edge (API gateway) but not in internal service-to-service calls
Lost Correlation IDs After Moving to Async Event Processing
Timeline
- 09:15 Deploy new order processing service with async event-driven architecture
- 09:30 SRE alerts on 'increase in orphaned logs': logs missing correlation IDs in production
- 09:45 Check Elasticsearch: 60% of order events have no correlation ID; all from the new async handler
- 10:00 Inspect async handler code: uses `@Async` with default executor
- 10:15 Add logging inside async method: `MDC.get("correlationId")` returns null
- 10:30 Check RabbitMQ listener: message headers contain correlation ID, but MDC not set before handler
- 10:45 Implement custom `TaskDecorator` for thread pool to copy MDC from submitting thread
- 11:00 Add `MessagePostProcessor` to set MDC from message headers before processing
- 11:30 Deploy fix to staging, verify all logs now have correlation ID
- 12:00 Promote to production; correlation rate back to 100%
We were migrating from a synchronous REST monolith to an event-driven microservice architecture. The new order processing service would receive an HTTP request, publish a message to RabbitMQ, and then an async handler would process it. Almost immediately after deployment, our logging aggregator showed a massive spike in untagged logs — entries without a correlation ID. We couldn't trace any order from start to finish.
I started by checking the RabbitMQ message headers using a temporary consumer that printed all headers. The correlation ID was there — it was being set by the producer. But inside the async handler, MDC.get('correlationId') returned null. The problem was clear: the message listener did not extract the header into the MDC, and the @Async task ran on a thread pool that didn't inherit the parent thread's MDC context.
The fix was two-fold. First, I replaced the default SimpleAsyncTaskExecutor with a ThreadPoolTaskExecutor that used a custom TaskDecorator to copy the MDC from the submitting thread. Second, I added a MessagePostProcessor in the RabbitMQ listener container factory that read the correlation ID from the message headers and set it into MDC before the handler executed. After deploying, all logs were correctly correlated. The key lesson: never assume context propagation across async boundaries — you must explicitly transfer it.
Root cause
Default Spring @Async executor does not inherit MDC context, and RabbitMQ listener did not extract correlation ID from message headers into MDC.
The fix
Custom TaskDecorator for thread pool MDC propagation and a MessagePostProcessor to set MDC from message headers.
The lesson
Always verify context propagation when introducing async processing; test by asserting MDC values inside async tasks and inspecting message headers.
MDC (Mapped Diagnostic Context) is implemented via ThreadLocal. When you submit a task to a thread pool, the task runs on a different thread that does not automatically inherit the submitting thread's ThreadLocal values. This is by design: ThreadLocal values are scoped to the thread that sets them. Java provides InheritableThreadLocal, but it inherits from the creating thread, not the submitting thread — and in a thread pool, threads are created once and reused, so using InheritableThreadLocal would give every task on that thread the same (stale) context.
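The correct fix is to copy the context at task submission time using a custom wrapper. In Java, this is done with a `TaskDecorator` (Spring) or with a custom `ThreadPoolExecutor` that wraps each task as it is submitted; the `beforeExecute`/`afterExecute` hooks run on the worker thread, so on their own they cannot see the submitter's context. Example: `executor.setTaskDecorator(runnable -> { Map<String, String> context = MDC.getCopyOfContextMap(); return () -> { if (context != null) { MDC.setContextMap(context); } try { runnable.run(); } finally { MDC.clear(); } }; });` The null check matters because `MDC.getCopyOfContextMap()` can return null when nothing has been set. Always clear the context after execution so stale values do not leak into the next task that runs on the same thread.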
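For pools created outside Spring, the same copy-at-submission idea can be sketched with the JDK alone. The example below is a sketch under assumptions: `CorrelationContext` is a hypothetical holder standing in for MDC, and `ContextCopyingExecutor` is a hypothetical name, not a library class. It overrides `execute`, which runs on the submitting thread, so the capture happens before the handoff.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for MDC: one correlation ID per thread. */
final class CorrelationContext {
    private static final ThreadLocal<String> ID = new ThreadLocal<>();

    static String get() { return ID.get(); }
    static void set(String id) { ID.set(id); }
    static void clear() { ID.remove(); }
}

/** Fixed-size pool that captures the submitter's correlation ID when a task is handed over. */
final class ContextCopyingExecutor extends ThreadPoolExecutor {
    ContextCopyingExecutor(int threads) {
        super(threads, threads, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
    }

    @Override
    public void execute(Runnable task) {
        // Runs on the submitting thread: capture its context now.
        String captured = CorrelationContext.get();
        super.execute(() -> {
            // Runs on the worker thread: restore, run, and always clean up.
            if (captured != null) {
                CorrelationContext.set(captured);
            }
            try {
                task.run();
            } finally {
                CorrelationContext.clear();
            }
        });
    }
}
```

Because `submit` delegates to `execute` in `ThreadPoolExecutor`, tasks submitted either way go through the wrapper.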
Reactive frameworks (Project Reactor, RxJava, Vert.x) do not rely on ThreadLocal because a single request can hop between threads. Instead, they provide a subscriber context that is propagated through the reactive chain. For example, in Reactor, you can store the correlation ID in `Context` using `contextWrite()` (the older `subscriberContext()` is deprecated) and retrieve it via `Mono.deferContextual(ctx -> ...)`. But if you have legacy code that relies on MDC, you need a bridge. Reactor provides `Hooks.enableAutomaticContextPropagation()` (since Reactor Core 3.5.3, used together with the Micrometer context-propagation library), which restores registered ThreadLocal values from the Context as operators run, but it's not always sufficient.
A common pattern is to use a filter that captures the correlation ID from the incoming request and puts it into Reactor's Context, then a custom operator that extracts it into MDC at the start of each reactive operation. Libraries such as `spring-cloud-sleuth` do this automatically, but if you roll your own, you must ensure the MDC is restored from the Context wherever the chain can change threads. A `publishOn` or `subscribeOn` moves execution to another thread, so MDC values set before the switch are not visible after it, even though the Reactor Context itself is still intact.
When a service receives a correlation ID in an HTTP header, it must forward that header to downstream services. This is not automatic. In Spring Boot, `RestTemplate` does not propagate headers unless you add an interceptor. Example: `restTemplate.getInterceptors().add((request, body, execution) -> { String id = MDC.get("correlationId"); if (id != null) { request.getHeaders().set("X-Correlation-Id", id); } return execution.execute(request, body); });` The null check avoids sending an empty header when no ID is in scope. Similarly, for WebClient, you need to register an `ExchangeFilterFunction` through the builder's `filter(...)` method.
A more robust approach is to use a servlet filter that stores the correlation ID in a request-scoped bean or a custom context holder, and then configure all HTTP clients to read from that holder. But beware of thread-safety: if you use a ThreadLocal, it won't work in reactive stacks. OpenTelemetry's `Context` solves this by being immutable and propagated through the instrumented libraries, but it requires proper instrumentation of all HTTP clients and servers.
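The "read from a holder when building the outgoing request" idea can be sketched with the JDK's own `java.net.http` client (Java 11+). This is an illustration under assumptions, not a replacement for the Spring interceptor: `OutgoingRequests` is a hypothetical helper, and its ThreadLocal holder stands in for MDC or a request-scoped bean.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Sketch: build outgoing requests that carry the current correlation ID. */
final class OutgoingRequests {
    static final String HEADER = "X-Correlation-Id";

    // Assumption: a simple thread-bound holder; it will not work in reactive stacks.
    private static final ThreadLocal<String> CURRENT_ID = new ThreadLocal<>();

    static void setCurrentId(String id) { CURRENT_ID.set(id); }
    static void clearCurrentId() { CURRENT_ID.remove(); }

    /** Builds a GET request and forwards the current correlation ID if one is set. */
    static HttpRequest get(String url) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(url)).GET();
        String id = CURRENT_ID.get();
        if (id != null) {
            builder.header(HEADER, id);
        }
        return builder.build();
    }
}
```

Routing every outgoing call through one factory like this keeps the forwarding rule in a single place, which is the main point of the holder approach.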
Message queues (RabbitMQ, Kafka, SQS) are another common breaking point. The correlation ID must be placed in the message headers by the producer. On the consumer side, the listener must extract it and set it into MDC before processing. Many frameworks (e.g., Spring Cloud Stream) do this automatically if configured, but custom listeners often miss it.
For Kafka, you can use a `ConsumerInterceptor` or a `RecordInterceptor` in Spring Kafka. For RabbitMQ, use a `MessagePostProcessor` registered on the listener container (for example via `setAfterReceivePostProcessors` on a `SimpleMessageListenerContainer`). The critical detail is to set the MDC before the handler method runs and clear it after, even on exceptions. A post-processor only runs before the handler, so the cleanup needs its own hook: a try-finally block in a decorator around the listener method ensures it.
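The set-before, clear-after rule is broker-independent and can be sketched without any messaging library. In the example below, `CorrelatedListener`, the header name, and the map-based headers are assumptions for illustration; in a real consumer the same wrapper shape would sit inside whichever interceptor or advice your framework offers.

```java
import java.util.Map;
import java.util.function.Consumer;

/** Sketch: run a message handler with the correlation ID from the headers in scope. */
final class CorrelatedListener {
    static final String HEADER = "X-Correlation-Id";

    // Assumption: a plain ThreadLocal standing in for MDC.
    static final ThreadLocal<String> CURRENT_ID = new ThreadLocal<>();

    /** Sets the ID from the headers, invokes the handler, then always clears the ID. */
    static void handle(Map<String, String> headers, String payload, Consumer<String> handler) {
        String id = headers.get(HEADER);
        if (id != null) {
            CURRENT_ID.set(id);
        }
        try {
            handler.accept(payload);
        } finally {
            // Runs on normal return and when the handler throws.
            CURRENT_ID.remove();
        }
    }
}
```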
OpenTelemetry (OTel) provides a vendor-agnostic way to propagate context across services. It defines a `Context` object that is immutable and propagated through the instrumentation API. OTel's SDK automatically handles thread pools, HTTP calls, and message queues if you use the provided instrumentation libraries. For example, `opentelemetry-java-instrumentation` includes agent or library instrumentation that wraps executors, HTTP clients, and messaging libraries to propagate the context.
To adopt OTel, you need to: (1) Add the OTel agent or dependency, (2) Configure propagators via `OTEL_PROPAGATORS` (e.g., `tracecontext,baggage`), (3) Ensure all services use the same propagator. The main advantage is that you don't need to write custom MDC propagation code. However, OTel's context is separate from MDC, so you still need to bridge it if you want logs to show the trace ID. The OTel Java instrumentation includes a Logback MDC integration that adds `trace_id` and `span_id` to log events, and an `OpenTelemetryAppender` for Logback; `LoggingSpanExporter`, by contrast, only prints finished spans and is a debugging aid rather than a logging bridge.
Frequently asked questions
Why does InheritableThreadLocal not solve the thread pool problem?
InheritableThreadLocal inherits context from the thread that creates the child thread. In a thread pool, threads are created once and reused, so every task on a given worker sees the context of whichever thread caused that worker to be created, not the context of the thread that submitted the task. The inherited value is also never cleared, so it lingers for the life of the worker. The correct approach is to copy context at task submission time and clear it after execution.
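The staleness is easy to reproduce with a single-thread pool. The sketch below uses only the JDK; `InheritableDemo` is a hypothetical name. The worker thread is created during the first submission and inherits the value present at that moment, so the second task does not see the updated value.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Sketch: shows that a reused pool thread keeps the value it inherited at creation. */
final class InheritableDemo {
    static final InheritableThreadLocal<String> ID = new InheritableThreadLocal<>();

    /** Returns what two consecutive tasks on the same worker thread observe. */
    static String[] run() throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Callable<String> read = ID::get;
        try {
            ID.set("request-A");
            // The worker thread is created by this call and inherits "request-A".
            String first = pool.submit(read).get();

            ID.set("request-B");
            // Same worker, reused: it still holds the value inherited at creation.
            String second = pool.submit(read).get();

            return new String[] {first, second};
        } finally {
            ID.remove();
            pool.shutdown();
        }
    }
}
```

Both elements of the returned array are `"request-A"`: the second task ran on behalf of request B but still reports request A.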
How do I propagate correlation ID in a reactive Spring WebFlux application?
In WebFlux, you cannot rely on ThreadLocal because the processing can switch threads. Instead, use Reactor's `Context`. In a `WebFilter`, write the correlation ID into the Context for the rest of the chain: `chain.filter(exchange).contextWrite(ctx -> ctx.put("correlationId", id))`, and read it downstream with `Mono.deferContextual(...)`. Storing the ID in `exchange.getAttributes()` alone does not put it into the Reactor Context. For MDC integration, use `Hooks.enableAutomaticContextPropagation()` or manually bridge via a custom operator that sets MDC at the start of each operation.
My correlation ID is present in the log of the first service but missing in the second — what should I check?
First, verify that the first service is sending the correlation ID in the outgoing HTTP request headers. Use a tool like `curl` or a proxy to capture headers. Second, check that the second service's web framework is reading the header and storing it in its context. For example, in Spring Boot, you might need a filter that calls `MDC.put("correlationId", request.getHeader("X-Correlation-Id"))`. Also check that the second service's configuration does not strip unknown headers.
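The "read the header or start a new ID" decision at the second service's ingress can be sketched independently of any web framework. `IngressCorrelation` and the map-based headers are assumptions for illustration; in a servlet filter the resolved value would be what you pass to `MDC.put`.

```java
import java.util.Map;
import java.util.UUID;

/** Sketch: pick the correlation ID for an incoming request. */
final class IngressCorrelation {
    static final String HEADER = "X-Correlation-Id";

    /** Reuses the caller's ID when present and non-blank; otherwise generates a new one. */
    static String resolve(Map<String, String> requestHeaders) {
        // HTTP header names are case-insensitive, so compare ignoring case.
        for (Map.Entry<String, String> header : requestHeaders.entrySet()) {
            String value = header.getValue();
            if (HEADER.equalsIgnoreCase(header.getKey()) && value != null && !value.isBlank()) {
                return value.trim();
            }
        }
        // No usable ID arrived: this service becomes the origin of a new one.
        return UUID.randomUUID().toString();
    }
}
```

If the second service logs a freshly generated ID for every request, the header is not arriving, which points back at the first service's client or at a proxy in between.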
Does OpenTelemetry automatically fix correlation ID propagation?
OpenTelemetry provides automatic context propagation for many libraries (HTTP, gRPC, messaging, thread pools) through its instrumentation. If you use the OTel Java agent or the library instrumentation, it will handle propagation across async boundaries. However, you still need to configure it properly (e.g., set propagators) and ensure all services are instrumented. Also, OTel's context is separate from MDC, so you need to bridge it for logging — OTel provides an appender for this.
How can I test correlation ID propagation in integration tests?
Write an integration test that sends a request with a specific correlation ID header to your service, then calls downstream services (real or mocked) and asserts that the downstream logs contain the same ID. You can also mock the MDC and verify that the expected key is set. For async scenarios, use `CountDownLatch` to wait for async tasks and then check the logs or MDC state. Tools like Testcontainers can help spin up real message queues for end-to-end testing.
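The `CountDownLatch` approach can be sketched as a small probe that runs on whatever executor the service uses and reports what a task there actually sees. `PropagationProbe` is a hypothetical helper, and its ThreadLocal is a stand-in for MDC; in a real test the probe body would read `MDC.get("correlationId")` instead.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

/** Sketch: observe the correlation ID from inside a task on a given executor. */
final class PropagationProbe {
    // Assumption: a plain ThreadLocal standing in for MDC.
    static final ThreadLocal<String> CURRENT_ID = new ThreadLocal<>();

    /** Sets the ID on the calling thread, runs a probe task, and returns what the task saw. */
    static String observeOn(Executor executor, String id) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(1);
        AtomicReference<String> observed = new AtomicReference<>();
        CURRENT_ID.set(id);
        try {
            executor.execute(() -> {
                observed.set(CURRENT_ID.get()); // what a log statement here would see
                done.countDown();
            });
            if (!done.await(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("async task did not finish in time");
            }
            return observed.get();
        } finally {
            CURRENT_ID.remove();
        }
    }
}
```

A test then asserts that the returned value equals the ID it passed in; on an executor without context copying the probe returns null, which is exactly the failure this guide describes.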