
Distributed Tracing Spans Not Connecting: A Debugging Guide

Spans that don't connect into a single trace waste your observability investment. This guide covers the real reasons (sampling gaps, missing context headers, clock drift, and agent misconfiguration) with concrete commands to verify each layer.

Advanced · Observability · 8 min read

What this usually means

Spans not connecting means the trace context (trace ID, span ID, parent span ID) is not being propagated across service boundaries. The most common cause is a failure in the tracing instrumentation: either the client library isn't injecting the correct headers, or the receiving service isn't extracting them. This can happen when services use different tracing libraries or header formats (e.g., an OpenTracing/Jaeger client emitting `uber-trace-id` next to an OpenTelemetry SDK expecting `traceparent`), when proxies or load balancers strip custom headers, or when sampling decisions are made independently per service. Another subtle cause is clock skew: backends generally link spans by ID rather than by timestamp, but if service clocks disagree, children can be drawn before their parents or shifted by the backend's skew adjustment, so a connected trace looks broken. Finally, misconfigured sampling rates can cause parent spans to be dropped while child spans are retained (or vice versa), leaving orphans.

( 01 ) Fast diagnosis

The first ten minutes — establish facts before touching code.

  1. Inspect trace context headers in request logs: `grep 'traceparent' /var/log/nginx/access.log` (this only matches if your `log_format` records the header, e.g. via `$http_traceparent`), or run `curl -sv http://service-a/endpoint 2>&1 | grep -i trace` to see if headers are present. `curl -v` prints headers on stderr, hence the `2>&1`.
  2. For a known failing request, capture the headers at each hop using `tcpdump -l -A port 80 | grep -E -i 'traceparent|tracestate|x-b3-traceid'`. This only works for plaintext HTTP; TLS or mTLS traffic will not show headers.
  3. Check tracing agent configuration on each service: look for `OTEL_TRACES_SAMPLER` or `JAEGER_SAMPLER_TYPE` env vars and ensure they match.
  4. Compare system clocks across hosts: `chronyc tracking` or `ntpq -p`; skew >100ms is problematic.
  5. Use the tracing backend's debug endpoint: e.g., Jaeger's `/api/traces/{trace-id}?raw=true` to see raw span data including parent span IDs.

( 02 ) Where to look

The specific files, logs, configs, and dashboards that usually own this bug.

  • Service mesh logs (Envoy/Linkerd): with Istio the sidecar runs inside each application pod, so read its container directly: `kubectl logs <pod> -c istio-proxy --tail=100 | grep -i trace`
  • Tracing agent logs: `/var/log/opentelemetry-collector.log` or `journalctl -u jaeger-agent -f`
  • Application logs with trace context: `grep 'trace.id=' /var/log/app.log`
  • Tracing backend (Jaeger/Zipkin) storage: query raw spans with `curl -s 'http://jaeger:16686/api/traces?service=my-service&limit=10&raw=true'`
  • Load balancer config (NGINX, HAProxy): check whether any `proxy_set_header` directive (or its HAProxy equivalent) overrides, clears, or omits headers like `X-B3-*` or `traceparent`.
  • OpenTelemetry Collector config: validate it with `otelcol validate --config=/etc/otel/config.yaml`, then review the receivers, batch processor, and exporters in the traces pipeline.
  • Kubernetes pod annotations: `kubectl get pod -o yaml | grep 'sidecar.istio.io/inject'` to confirm sidecar injection.

( 03 ) Common root causes

Practical causes, not theory. These are the things you will actually find.

  • Sampling mismatch: Service A samples at 1% and Service B samples at 100% with independent decisions, so most of B's spans point at a parent that A never exported.
  • Missing header propagation: A reverse proxy (NGINX, Envoy) strips `traceparent` or `x-b3-traceid` headers.
  • Clock skew >1 second between services, so child spans are drawn before their parents or shifted by the backend's skew adjustment and the trace looks disconnected.
  • Dual instrumentation: Service A uses OpenTracing with a Jaeger client (`uber-trace-id` header), Service B uses OpenTelemetry (`traceparent` header), and the context format isn't translated.
  • Async processing break: A message queue (Kafka, RabbitMQ) consumer doesn't propagate the tracing context from the message headers.
  • Agent misconfiguration: The tracing agent (e.g., Jaeger Agent) is sending spans to the wrong collector endpoint or the collector drops them due to rate limiting.
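
To make the dual-instrumentation cause concrete, here is a minimal sketch of how a Jaeger `uber-trace-id` value maps onto a W3C `traceparent` value. It assumes the Jaeger layout `{trace-id}:{span-id}:{parent-span-id}:{flags}` with hex fields that may omit leading zeros. The function name `uber_to_traceparent` is hypothetical and exists only for illustration; in a real service you would configure a propagator rather than convert headers by hand.

```python
def uber_to_traceparent(value):
    """Convert a Jaeger uber-trace-id value to a W3C traceparent value.

    Assumed input layout: {trace-id}:{span-id}:{parent-span-id}:{flags}.
    Raises ValueError if the value does not have exactly four fields.
    """
    trace_id, span_id, _parent_span_id, flags = value.split(":")
    # Jaeger flag bit 0x01 means "sampled"; traceparent uses the same bit.
    sampled = int(flags, 16) & 0x01
    # traceparent requires fixed-width lowercase hex: 32 chars and 16 chars.
    return "00-{}-{}-{:02x}".format(
        trace_id.lower().zfill(32), span_id.lower().zfill(16), sampled
    )
```

The two formats carry the same information, but neither side will recognize the other's header name, which is why an untranslated hop starts a new trace.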

( 04 ) Fix patterns

Concrete fix directions. Pick the one that matches your root cause.

  • Standardize on one tracing library across all services (prefer OpenTelemetry), and use the same version of the SDK.
  • Configure consistent sampling: set `OTEL_TRACES_SAMPLER=parentbased_traceidratio` with the same ratio across all services, so the decision is made once at the edge and honored downstream.
  • Fix header propagation in proxies: add `proxy_set_header X-B3-TraceId $http_x_b3_traceid;` and similar for all B3 headers in NGINX.
  • Synchronize clocks using NTP everywhere: install `chrony` and verify with `chronyc sources -v`.
  • For async messaging, ensure the tracing context is serialized into message headers and extracted on consumption. Use OpenTelemetry's `TextMapPropagator` to inject/extract.
  • Tune the SDK's batch span processor so spans are exported before its queue fills: e.g., `OTEL_BSP_SCHEDULE_DELAY=100` (milliseconds; the specification default is 5000) and, if needed, a larger `OTEL_BSP_MAX_QUEUE_SIZE`.
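
For the async messaging fix, the following is a hand-rolled sketch of what inject and extract do, not OpenTelemetry's API. It assumes Kafka-style message headers, modeled as a list of `(str, bytes)` tuples, and the W3C `traceparent` format. The function names `inject` and `extract` are illustrative; in a real service the SDK's propagator does this work.

```python
def inject(headers, trace_id, span_id, sampled):
    """Producer side: append the current span's context to the message headers."""
    flags = "01" if sampled else "00"
    value = "00-{}-{}-{}".format(trace_id, span_id, flags)
    headers.append(("traceparent", value.encode("ascii")))


def extract(headers):
    """Consumer side: return the parent context, or None if it is absent."""
    for key, value in headers:
        if key.lower() != "traceparent":
            continue
        parts = value.decode("ascii").split("-")
        if len(parts) != 4:
            return None  # malformed header: caller should start a new trace
        _version, trace_id, parent_id, flags = parts
        return {
            "trace_id": trace_id,
            "parent_id": parent_id,
            "sampled": bool(int(flags, 16) & 0x01),
        }
    return None
```

The consumer should start its span with the extracted `trace_id` and use `parent_id` as the parent span ID. If `extract` returns `None`, the consumer span becomes a new root, which is exactly the break described above.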

( 05 ) How to verify

A fix you cannot prove is a guess. Close the loop.

  • Send a test request and capture the trace ID at the entry point. Then query the tracing backend for that trace ID and confirm all spans appear in the waterfall.
  • Check the number of spans per trace: a full trace should have at least as many spans as services involved. Use `curl -s 'http://jaeger:16686/api/traces/{trace-id}' | jq '.data[0].spans | length'`.
  • Verify parent-child relationships: for each non-root span, ensure `references[0].refType` is `CHILD_OF`, that `references[0].traceID` matches the span's own `traceID`, and that `references[0].spanID` is the ID of a span present in the trace.
  • Run a chaos test that injects varying latency and verify traces remain connected under load.
  • Use the tracing backend's 'Trace Graph' or 'Dependencies' view to see if services are correctly linked.
  • Monitor the collector's 'dropped spans' metric: `kubectl exec -n observability otel-collector-0 -- curl -s http://localhost:8888/metrics | grep otelcol_exporter_enqueue_failed_spans`. If the collector image does not ship `curl`, use `kubectl port-forward` to reach port 8888 from your workstation instead.
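
The parent-child check above can be scripted. This is a minimal sketch that assumes the Jaeger query API's JSON shape, where each span carries a `spanID` and a `references` list whose entries have `refType` and `spanID`. The function names `find_orphans` and `count_roots` are hypothetical.

```python
def find_orphans(trace):
    """Return the spanIDs whose CHILD_OF parent is missing from the trace."""
    spans = trace["spans"]
    known_ids = {span["spanID"] for span in spans}
    orphans = []
    for span in spans:
        for ref in span.get("references") or []:
            if ref["refType"] == "CHILD_OF" and ref["spanID"] not in known_ids:
                orphans.append(span["spanID"])
                break
    return orphans


def count_roots(trace):
    """Count spans with no references; a healthy trace normally has one."""
    return sum(1 for span in trace["spans"] if not span.get("references"))
```

To use it, save the API response to a file, load it with `json.load`, and pass `response["data"][0]` to both functions. An empty orphan list together with a single root is a reasonable signal that the trace is fully connected.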

( 06 ) Mistakes to avoid

Things that make this bug worse or harder to find.

  • Don't blindly increase sampling to 100% as a fix; it multiplies ingestion and storage costs without addressing the root cause.
  • Don't assume all proxies forward custom headers; verify with `curl -v` after each hop.
  • Don't ignore clock skew because 'it's only a few milliseconds'; when the spans themselves last only a few milliseconds, even 10ms of skew can put a child ahead of its parent in the waterfall.
  • Don't mix different tracing SDKs without a propagator bridge (e.g., letting an OpenTelemetry SDK read and write Jaeger's header via `OTEL_PROPAGATORS=jaeger`).
  • Don't forget to propagate context in background tasks (e.g., async/await, threads) within the same service; in OpenTelemetry Java, for example, capture `Context.current()` before handing work off and make it current inside the task.
  • Don't rely solely on the tracing UI; always check raw span data via API for debugging.

( 07 ) War story

The Case of the Disconnected Microservices

Senior SRE · Kubernetes, Istio, Jaeger, OpenTelemetry Collector, NGINX Ingress, Go services

Timeline

  1. 14:00 Alert: trace connectivity drops below 80% for service 'order-api'.
  2. 14:02 Check Jaeger UI: many traces show only a single span for 'order-api', no downstream spans.
  3. 14:05 Query raw trace for a failing request: parent span ID missing on downstream spans.
  4. 14:10 Check ingress NGINX logs: B3 headers are present in incoming request.
  5. 14:15 Check Istio proxy logs: `x-b3-traceid` header is present, but `x-b3-spanid` is overwritten.
  6. 14:20 Discover Istio mTLS replaces certain headers due to Istio 1.10 bug with B3 propagation.
  7. 14:30 Apply Istio patch: set `meshConfig.defaultConfig.tracing.zipkin.address` and enable `tracing.sampling=100` temporarily.
  8. 14:35 Verify: new traces show full waterfall. Root cause: Istio sidecar was overriding span ID.

I was on call when the trace connectivity alert fired. For weeks we had been seeing orphan spans in Jaeger, but today it hit a threshold. I opened the Jaeger UI and immediately saw the problem: traces for 'order-api' had only one span each—no child spans from 'payment-service' or 'inventory-service'. The trace IDs were consistent, but the waterfall showed a single bar.

I started at the ingress. The NGINX logs showed the incoming request had `X-B3-TraceId` and `X-B3-SpanId` headers. Good. I followed the request into the Istio sidecar proxy logs and saw the headers were present. But when I looked at the downstream 'payment-service' proxy logs, the `X-B3-SpanId` was different from the one set by 'order-api'. That was the smoking gun: the span ID was being overwritten somewhere.

I dug into Istio's tracing configuration and found that Istio 1.10 had a known issue with B3 header propagation when mTLS was enabled—the sidecar would generate a new span ID instead of using the parent's. The fix was to set the `tracing.zipkin.address` in the mesh config and restart the proxies. After that, traces connected. I left the sampling at 100% for an hour to verify, then dropped it back to 1% once confirmed. The lesson: always test header propagation through every proxy, and keep Istio up to date.

Root cause

Istio sidecar proxy overwrote the B3 span ID header due to a known bug in Istio 1.10 when mTLS was enabled, causing parent-child span relationships to break.

The fix

Set `meshConfig.defaultConfig.tracing.zipkin.address` to the Jaeger collector endpoint and temporarily increased tracing sampling to 100% to verify propagation.

The lesson

Never assume a service mesh forwards all headers correctly; always verify with raw span dumps and proxy logs.

( 08 ) Header Propagation: The Critical Path

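Distributed tracing relies on context propagation via headers. The most common formats are W3C Trace Context (`traceparent`, `tracestate`) and Zipkin B3 (`x-b3-traceid`, `x-b3-spanid`, `x-b3-parentspanid`, `x-b3-sampled`). If any intermediate proxy, load balancer, or service mesh component strips or modifies these headers, the trace breaks. In Kubernetes, especially with Istio or Envoy, you must verify that the sidecar proxies are configured to propagate the headers, and that the application itself copies the incoming trace headers onto its outgoing requests, because the sidecar cannot correlate inbound and outbound calls on its own. Use `kubectl exec` to check the Envoy configuration from inside the sidecar: `curl -s http://localhost:15000/config_dump | jq '.configs[1].dynamic_listeners[0].active_state.listener.filter_chains[0].filters[0].typed_config'` and look for the `tracing` block on the HTTP connection manager. The array indexes in that path depend on the Envoy version and listener order, so treat them as an example and adjust until you reach the listener you care about.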

Another subtle issue: frameworks differ in their default header format (e.g., Spring Cloud Sleuth defaults to B3, while OpenTelemetry SDKs default to W3C Trace Context). If you mix them, you need a propagator that understands both. For example, OpenTelemetry can read and write B3 alongside W3C by setting `OTEL_PROPAGATORS=tracecontext,baggage,b3multi` (`b3multi` is the multi-header `X-B3-*` form; `b3` is the single-header form). Always run a simple curl test through each service: `curl -H "traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01" http://service-a/endpoint` and check the logs for the header value at the receiving end.
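
When comparing header values captured at different hops, it helps to parse them rather than compare by eye. The sketch below assumes the version-00 `traceparent` layout (`version-traceid-parentid-flags`, lowercase hex). The names `parse_traceparent` and `same_trace` are hypothetical helpers, not part of any tracing library.

```python
import re

# Layout of a version-00 traceparent value: 2-32-16-2 lowercase hex digits.
_TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)


def parse_traceparent(value):
    """Return the header fields as a dict, or None if the value is malformed."""
    match = _TRACEPARENT_RE.match(value.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Version ff and all-zero IDs are invalid.
    if fields["version"] == "ff":
        return None
    if set(fields["trace_id"]) == {"0"} or set(fields["parent_id"]) == {"0"}:
        return None
    fields["sampled"] = bool(int(fields["flags"], 16) & 0x01)
    return fields


def same_trace(header_a, header_b):
    """True if both headers are valid and carry the same trace ID."""
    a, b = parse_traceparent(header_a), parse_traceparent(header_b)
    return a is not None and b is not None and a["trace_id"] == b["trace_id"]
```

Across a healthy hop the trace ID stays the same while the parent ID changes, since each service reports its own span as the parent of the next call. A changed trace ID, or a missing header, points at the hop where propagation breaks.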

( 09 ) Sampling Inconsistency: The Orphan Span Problem

When services have independent sampling decisions, a parent span may be sampled out while a child span is retained, or vice versa. This creates orphaned spans: children whose parent never reaches the backend, so the trace shows up broken or partial. The solution is to use a consistent sampling strategy across services. The most robust approach is head-based sampling with a deterministic trace ID ratio, so the same trace ID always results in the same sampling decision, combined with downstream services honoring the `sampled` flag they receive. OpenTelemetry supports this with the `parentbased_traceidratio` sampler. Set `OTEL_TRACES_SAMPLER=parentbased_traceidratio` and `OTEL_TRACES_SAMPLER_ARG=0.1` on all services.
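
The idea behind a deterministic, parent-respecting decision can be sketched in a few lines. This is an illustration of the principle, not the exact algorithm any particular SDK uses. It assumes the trace ID is a hex string whose low 64 bits are uniformly distributed; both function names are hypothetical.

```python
def ratio_decision(trace_id_hex, ratio):
    """Deterministic decision: the same trace ID and ratio give the same answer."""
    # Treat the low 64 bits of the trace ID as a uniform value in [0, 2**64).
    low_bits = int(trace_id_hex, 16) & 0xFFFFFFFFFFFFFFFF
    return low_bits < int(ratio * 2**64)


def parent_based_decision(trace_id_hex, ratio, parent_sampled=None):
    """Follow the parent's sampled flag if there is one; otherwise use the ratio."""
    if parent_sampled is not None:
        return parent_sampled
    return ratio_decision(trace_id_hex, ratio)
```

Because the decision depends only on the trace ID and the ratio, two services with the same ratio reach the same verdict without talking to each other. The parent-based wrapper goes further: a downstream service with a different ratio still keeps the trace whole, because it defers to the flag set at the root.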

Another approach is tail-based sampling, where a collector node makes the decision after seeing all spans. This requires a long tail buffer and is more complex to set up. For most cases, head-based with consistent ratio is sufficient. Verify by querying the tracing backend for a sample trace ID and checking that the `sampled` flag is consistent across spans.

( 10 ) Clock Skew: When Timestamps Lie

Tracing backends use span timestamps to order spans and build the waterfall. If clocks are skewed, a child span can appear to start before its parent starts, and the backend may shift, flag, or misplace it, which makes a connected trace look broken. Well-synchronized hosts usually agree to within a few milliseconds, but with NTP drift, skew can exceed 100ms. Use `ntpq -p` (or `chronyc tracking`) to check offset and jitter. If you see an offset >50ms, fix it with `chronyc makestep`. In cloud environments, prefer the provider's local time source (e.g., the Amazon Time Sync Service at `169.254.169.123`).

If you cannot fix clock skew immediately, you can configure the tracing backend to allow a larger clock skew window. For example, in Jaeger, you can set `--query.max-clock-skew-adjustment=10s` to let the query service correct up to 10 seconds of skew, but this is a band-aid. Aim for clock offsets of a few milliseconds or less across all hosts.
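
To find out whether skew is visible in a specific trace, you can look for children that start before their parents. The sketch below assumes Jaeger-style span JSON, with `startTime` in microseconds and `references` entries carrying `refType` and `spanID`. The function name `skewed_children` and the `tolerance_us` parameter are hypothetical.

```python
def skewed_children(trace, tolerance_us=0):
    """Return (child_id, parent_id, lead_us) for children that start too early.

    lead_us is how many microseconds the child starts before its parent.
    """
    by_id = {span["spanID"]: span for span in trace["spans"]}
    findings = []
    for span in trace["spans"]:
        for ref in span.get("references") or []:
            parent = by_id.get(ref["spanID"])
            if ref["refType"] != "CHILD_OF" or parent is None:
                continue
            lead_us = parent["startTime"] - span["startTime"]
            if lead_us > tolerance_us:
                findings.append((span["spanID"], parent["spanID"], lead_us))
    return findings
```

Run it against raw span data (for example, a response fetched with `raw=true`) so that any skew adjustment applied by the backend does not hide the problem. A consistent lead for every span from one host is a strong hint that the host's clock is behind.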

( 11 ) Collector and Agent Configuration Pitfalls

The OpenTelemetry Collector or Jaeger Agent can drop spans due to misconfiguration. Common issues: exporter queues that fill up because the backend is slow or unreachable, a memory limiter that refuses data under pressure, or exporters whose failures go unnoticed in the logs. Check the collector's metrics endpoint: `curl -s http://localhost:8888/metrics | grep -E 'otelcol_exporter_enqueue_failed|otelcol_processor_dropped'`. If you see dropped spans, fix the failing exporter first, then increase the exporter queue size or tune the batch processor. For Jaeger Agent, check the reporter address (`--reporter.grpc.host-port`, typically port 14250; older releases used `--reporter.tchannel.host-port` on port 14267) and ensure the agent can reach the collector. Use `telnet jaeger-collector 14250` (or 14267 for TChannel) to verify connectivity.

Also check the sampling configuration: the Jaeger Agent serves sampling strategies to Jaeger clients, so a stale or unreachable strategy source changes what gets sampled. A client configured with `JAEGER_SAMPLER_TYPE=remote` polls for its strategy; for testing, take remote sampling out of the picture by setting `JAEGER_SAMPLER_TYPE=const` and `JAEGER_SAMPLER_PARAM=1` on the client.
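
If you scrape the collector's metrics endpoint regularly, a small parser makes drops easier to spot than `grep`. This sketch assumes Prometheus text exposition lines of the form `name{labels} value` without a trailing timestamp. The function name `dropped_span_counts` and the default substrings are illustrative choices, and metric names vary between collector versions.

```python
def dropped_span_counts(
    metrics_text,
    needles=("enqueue_failed_spans", "send_failed_spans", "dropped_spans"),
):
    """Sum, per metric name, the samples whose name contains any needle."""
    totals = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if any(needle in name for needle in needles):
            totals[name] = totals.get(name, 0.0) + float(value)
    return totals
```

Feed it the body returned by the metrics endpoint. Any non-zero total that keeps growing between scrapes means spans are being lost inside the pipeline rather than at the instrumentation layer.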

Frequently asked questions

What is the difference between trace ID and span ID, and why does it matter for connecting spans?

A trace ID identifies the entire request flow, while a span ID identifies a single unit of work within that flow. Each span also carries a parent span ID. For spans to connect, the parent span ID in a child span must match the span ID of its parent. If any service loses or modifies the parent span ID, the child span becomes an orphan. Always verify that the parent span ID is correctly propagated in headers.

How can I tell if my service mesh is dropping trace headers?

Enable debug logging for the proxy (e.g., Istio: `istioctl proxy-config log <pod> --level tracing:debug`) and look for header values in the logs. Alternatively, use `tcpdump` on the pod's network interface to capture raw HTTP headers. A quick way is to inject a known header via curl and check the response headers from downstream services for the expected trace context.

Should I use W3C Trace Context or B3 headers?

W3C Trace Context is the standard and recommended for new deployments. It is supported by OpenTelemetry and most modern tracing systems. B3 is older but still widely used, especially in Zipkin-influenced stacks. If you have a mixed environment, configure multiple propagators in OpenTelemetry: `OTEL_PROPAGATORS=tracecontext,baggage,b3` (use `b3multi` instead of `b3` if peers expect the multi-header `X-B3-*` form). Whichever you choose, consistency across all services is critical.

What does 'parentbased_traceidratio' sampler do exactly?

This sampler first checks whether the span has a parent. If it does, the sampler follows the parent's sampled flag; if the span is a root, it decides from the trace ID using a deterministic ratio check. As a result, once the root span is sampled, all child spans of that trace are also sampled, regardless of the service. This prevents orphan spans due to inconsistent sampling. Set `OTEL_TRACES_SAMPLER=parentbased_traceidratio` and `OTEL_TRACES_SAMPLER_ARG=0.1` for 10% sampling on all services.

Can clock skew really cause spans not to connect?

Yes, in the sense that a trace can look disconnected even when the IDs are correct. If a child span's start time precedes its parent's start time by more than a small tolerance, the backend may adjust, flag, or misplace it. In Jaeger, the query service corrects skew up to the limit set by `--query.max-clock-skew-adjustment`; the default differs between releases, so check it for your version. If clocks are off by more than that limit, spans will appear disconnected. Always synchronize clocks using NTP and monitor with `ntpq -p`.