What this usually means
The most common underlying cause is a mismatch between the OpenTelemetry SDK configuration and the collector/backend: either spans are being sampled out, the exporter is failing silently due to timeouts or queue overflow, or context propagation is broken (e.g., missing or incorrect W3C traceparent headers). In production, resource limits on the collector, network latency, or authentication errors often cause spans to be dropped before they reach the storage backend.
The first ten minutes — establish facts before touching code.
1. Check SDK logs: look for 'Failed to export spans' or 'exporter timeout'. Raise the log level to DEBUG if needed.
2. Verify sampling configuration: confirm the sampler is not set to 'always_off' or to a very low probability.
3. Send a simple curl request with a traceparent header and check that the header reaches the downstream service unchanged.
4. Inspect collector metrics: check otelcol_exporter_queue_size and otelcol_exporter_send_failed_spans (exact names vary slightly by collector version).
5. Validate endpoint URL and authentication: ensure the collector endpoint is reachable and TLS certs are valid.
6. Rule out export timeouts: temporarily raise OTEL_BSP_EXPORT_TIMEOUT (in milliseconds) and see whether the export errors stop.
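The collector metrics check can be scripted. The following is a minimal sketch, assuming the collector exposes Prometheus-format metrics at http://localhost:8888/metrics (the address used elsewhere in this guide); the function names are hypothetical and the watched metric prefixes may need adjusting for your collector version.

```python
import re
import urllib.request

# Metric name prefixes from the checklist; exact names and suffixes
# (for example a trailing "_total") vary by collector version.
WATCHED_PREFIXES = (
    "otelcol_exporter_queue_size",
    "otelcol_exporter_send_failed_spans",
)

# One sample line in the Prometheus text format: name, optional {labels}, value.
SAMPLE_RE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)")


def parse_watched_metrics(text):
    """Return {"name{labels}": value} for watched metrics in a /metrics body."""
    found = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, labels, value = match.groups()
        if name.startswith(WATCHED_PREFIXES):
            found[name + (labels or "")] = float(value)
    return found


def fetch_watched_metrics(url="http://localhost:8888/metrics"):
    """Fetch the collector's metrics page and extract the watched samples."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return parse_watched_metrics(response.read().decode("utf-8"))
```

Calling `fetch_watched_metrics()` twice a minute apart shows whether the queue is growing or the failure counter is climbing, which is usually more telling than a single reading.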
The specific files, logs, configs, and dashboards that usually own this bug.
- SDK logs (stdout/stderr) from the application process
- OpenTelemetry Collector logs (journalctl -u otelcol or collector container logs)
- Collector metrics endpoint (e.g., http://localhost:8888/metrics)
- Application code: tracer provider initialization and instrumentation setup
- HTTP traffic: capture headers with tcpdump to verify traceparent propagation
- Exporter configuration in collector config.yaml (endpoint, headers, tls)
- Backend UI (Jaeger/Tempo/DataDog) search with trace ID from SDK logs
Practical causes, not theory. These are the things you will actually find.
- Sampler misconfiguration: sampler set to 'always_off' or a ratio too low for low-traffic services
- Exporter timeout or queue full: default batch processor settings are too tight for high-throughput services
- Context propagation failure: W3C traceparent header is missing or overwritten by a proxy/load balancer
- Collector authentication or TLS mismatch: invalid API key or expired certificate
- Span attribute limits: spans dropped because attribute size exceeds collector's limit
- Resource exhaustion: CPU/memory pressure on collector causing spans to be dropped
Concrete fix directions. Pick the one that matches your root cause.
- Set sampler to 'always_on' temporarily for a specific service to verify sampling is the issue, then adjust the ratio.
- Give the batch span processor more headroom: raise OTEL_BSP_MAX_QUEUE_SIZE above its default of 2048 and, if exports time out, raise OTEL_BSP_EXPORT_TIMEOUT above its default of 30000 ms. OTEL_BSP_SCHEDULE_DELAY (5000 ms) and OTEL_BSP_MAX_EXPORT_BATCH_SIZE (512) already have these values by default in the specification.
- Ensure all services propagate the same traceparent header: use the OpenTelemetry W3C Trace Context propagator in every service (for example, W3CTraceContextPropagator in Java).
- Add a tail-based sampler in the collector for high-volume services.
- Configure the collector exporter to retry with backoff: set 'retry_on_failure.enabled: true' in the exporter section.
- Wrap the tracer provider initialization in a health check that emits a test span and logs the result.
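Before changing any of these settings, it helps to see what the process is actually running with. The sketch below audits the standard sampler and batch span processor environment variables; the function name and the 1% warning threshold are illustrative assumptions, and some SDKs are configured in code rather than through the environment, in which case this check sees only defaults.

```python
import os

# Defaults from the OpenTelemetry SDK environment variable specification.
BSP_DEFAULTS = {
    "OTEL_BSP_SCHEDULE_DELAY": 5000,        # ms between scheduled exports
    "OTEL_BSP_EXPORT_TIMEOUT": 30000,       # ms allowed per export call
    "OTEL_BSP_MAX_QUEUE_SIZE": 2048,        # spans buffered before dropping
    "OTEL_BSP_MAX_EXPORT_BATCH_SIZE": 512,  # spans per export request
}


def audit_tracing_env(env=None):
    """Return human-readable warnings for risky sampler/batch settings."""
    env = os.environ if env is None else env
    warnings = []

    sampler = env.get("OTEL_TRACES_SAMPLER", "parentbased_always_on")
    if sampler in ("always_off", "parentbased_always_off"):
        warnings.append(f"sampler is {sampler}: locally started traces are never sampled")
    elif sampler.endswith("traceidratio"):
        ratio = float(env.get("OTEL_TRACES_SAMPLER_ARG", "1.0"))
        if ratio < 0.01:  # illustrative threshold, not an official limit
            warnings.append(f"sampling ratio {ratio} keeps fewer than 1% of traces")

    values = {name: int(env.get(name, default)) for name, default in BSP_DEFAULTS.items()}
    if values["OTEL_BSP_MAX_EXPORT_BATCH_SIZE"] > values["OTEL_BSP_MAX_QUEUE_SIZE"]:
        warnings.append("OTEL_BSP_MAX_EXPORT_BATCH_SIZE exceeds OTEL_BSP_MAX_QUEUE_SIZE")
    return warnings
```

Running `audit_tracing_env()` inside the affected container (rather than on a developer machine) matters, because the values that count are the ones the deployed process inherits.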
A fix you cannot prove is a guess. Close the loop.
- Generate a test span manually using the SDK API and check if it appears in the backend.
- Use the collector's debug exporter to log all spans received: add 'debug' exporter and set verbosity to 'detailed'.
- Run a continuous ping with trace ID and verify the trace appears in the backend within seconds.
- Monitor collector metrics: successful export count should increase after fix.
- Check trace context propagation end-to-end by enabling 'propagation' debug logging in the SDK.
Things that make this bug worse or harder to find.
- Changing sampling rates without checking collector queue metrics first.
- Disabling TLS validation in production without understanding the security implications.
- Adding retries to the exporter without setting a maximum time to prevent infinite retries.
- Modifying the collector pipeline without restarting the collector.
- Assuming the default OpenTelemetry configuration works for production throughput.
- Forgetting to enable the W3C trace context propagator in multi-service environments.
Production Traces Disappear After Collector Upgrade
Timeline
- 09:15 Alert: 'No traces received for service payment-service in last 5 minutes'.
- 09:18 Check Tempo UI: no new traces for any service. SDK logs show 'Failed to export spans: rpc error: code = Unavailable'.
- 09:22 SSH into collector host: 'journalctl -u otel-collector -n 100' shows 'exporter is shutting down' repeated.
- 09:25 Check collector metrics: otelcol_exporter_queue_size is 10000 (full).
- 09:30 Realize the collector was upgraded overnight: new config has 'sending_queue: enabled: true, queue_size: 10000' but exporter timeout unchanged at 5s.
- 09:35 Increase exporter timeout to 30s and reduce queue size to 1000. Restart collector.
- 09:40 Traces start appearing. Metrics show successful exports.
- 09:45 Root cause confirmed: default batcher settings could not flush the queue within 5s, causing backpressure and drop.
I got paged at 9:15 AM that payment-service had zero traces for 5 minutes. At first I thought it was a sampling issue—maybe someone changed the ratio. But opening Tempo showed zero traces across all services, which meant it wasn't per-service config. The SDK logs were flooding with 'Failed to export spans: rpc error: code = Unavailable'. This pointed to the collector being the bottleneck.
I jumped on the collector host and checked the logs. The collector was logging 'exporter is shutting down' repeatedly. I hit the metrics endpoint and saw that the exporter queue was full (10,000). That's when I remembered the platform team had upgraded the collector overnight to v0.75 with a new default queue configuration. The queue size was now 10,000 but the exporter timeout was still the default 5 seconds. Under normal load, the queue would fill up and the exporter couldn't flush it fast enough, so spans were dropped.
I increased the exporter timeout to 30 seconds and reduced the queue size to 1000 (to match the old behavior). After restarting the collector, traces started flowing again within a minute. The lesson: never assume new defaults are safe for your throughput. Always baseline test collector changes with production traffic patterns. I also added a Prometheus alert on queue size to catch this earlier next time.
Root cause
Collector upgrade introduced a larger default sending queue without adjusting the exporter timeout, causing the queue to fill and spans to be dropped.
The fix
Set exporter timeout to 30s and queue size to 1000 in collector config, then restart.
The lesson
Always test collector configuration changes against production traffic patterns; monitor queue metrics proactively.
The OpenTelemetry SDK groups spans into batches using the BatchSpanProcessor. These batches are sent to the configured exporter (e.g., OTLP gRPC). The exporter then sends the data to the collector or backend. If an export fails with a retryable error (timeout, network error), the exporter retries according to its retry policy. If the retries are exhausted, the batch is dropped. Spans are also dropped before export when the processor's bounded queue is already full.
The collector itself has an internal pipeline: receivers -> processors -> exporters. Each exporter has its own queue. If the backend is slow, the queue fills up and the receiver starts rejecting spans. This is the most common bottleneck in high-throughput production setups.
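The queue behaviour described above is simple arithmetic: a bounded queue fills whenever items arrive faster than the exporter drains them. The sketch below is a back-of-the-envelope model, not the collector's implementation; the function names are hypothetical, and the units are whatever the queue counts (for the collector's sending queue that is often batches rather than individual spans).

```python
def seconds_until_queue_full(queue_size, inflow_per_s, outflow_per_s, start_depth=0):
    """Time until a bounded queue fills, or None if the exporter keeps up."""
    net_rate = inflow_per_s - outflow_per_s
    if net_rate <= 0:
        return None  # the queue drains at least as fast as it fills
    return (queue_size - start_depth) / net_rate


def drop_rate_when_full(inflow_per_s, outflow_per_s):
    """Items rejected per second once the queue has no free slots."""
    return max(0, inflow_per_s - outflow_per_s)
```

For example, a queue of 10000 receiving 1500 items per second while the backend accepts 1000 per second is full after 20 seconds and then rejects 500 items per second. A larger queue only delays that point; it does not fix a sustained gap between inflow and outflow.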
The default sampler in OpenTelemetry is 'parentbased_always_on' which samples all spans. However, many teams change this to a probabilistic sampler (e.g., 10%) to reduce volume. The problem: if a service has very low traffic, 10% sampling might produce no traces at all for hours. Always check the sampler configuration per service.
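To put a number on the low-traffic problem, the sketch below estimates how likely it is that a service produces no sampled traces at all. It assumes each root request is sampled independently with the configured ratio, which is a reasonable approximation for ratio-based samplers over random trace IDs; the function names are hypothetical.

```python
def probability_of_no_traces(requests, sampling_ratio):
    """Chance that none of `requests` independent requests is sampled."""
    return (1.0 - sampling_ratio) ** requests


def expected_traces(requests, sampling_ratio):
    """Average number of sampled traces out of `requests` requests."""
    return requests * sampling_ratio
```

With 10% sampling, a service that handles 20 requests in an hour expects 2 traces on average, yet `probability_of_no_traces(20, 0.10)` is about 0.12, so roughly one hour in eight shows nothing at all. An empty search result for such a service is not by itself evidence of a broken pipeline.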
Sampling can also happen at the collector level via the 'tail_sampling' processor. This processor can drop entire traces based on criteria like latency or error status. Misconfiguration here can cause traces to vanish even if the SDK sent them correctly.
For distributed traces to work, the W3C traceparent header must be passed between services. If a load balancer, API gateway, or a custom middleware strips or modifies this header, traces become orphaned (no parent). This is especially common when using non-HTTP transports (e.g., gRPC metadata, message queues).
To debug, capture HTTP headers with tcpdump: `sudo tcpdump -A -i any port 8080 | grep -i traceparent`. This only shows headers on unencrypted traffic, so for TLS connections log the incoming headers in the application instead. If the header is missing, check your propagator setup. In Go, you must explicitly set `otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(propagation.TraceContext{}, propagation.Baggage{}))`. Many forget this step.
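Once a header value has been captured, it is worth checking that it is well formed, since a malformed traceparent is discarded by the receiving service and a new trace is started. The following is a minimal sketch of a validator for the version 00 layout defined by the W3C Trace Context specification (`version-traceid-parentid-flags`); the function name is hypothetical, and the sketch deliberately rejects anything that is not exactly four lowercase-hex fields.

```python
import re

# version-traceid-parentid-flags, all lowercase hex, fixed field widths.
TRACEPARENT_RE = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header):
    """Return the parsed fields of a traceparent header, or None if invalid."""
    match = TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    if version == "ff":
        return None  # "ff" is reserved as an invalid version
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        # Lowest bit of the flags byte is the "sampled" flag.
        "sampled": bool(int(flags, 16) & 0x01),
    }
```

A header that parses but has `sampled` set to false is also informative: the upstream service decided not to sample, so a parent-based sampler downstream will drop the spans even though propagation works.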
When the collector runs out of memory or CPU, it may start dropping spans silently. Check the collector process with `top -p $(pgrep otelcol)` or use the collector's own metrics endpoint. Key metrics: `otelcol_process_memory_rss` and `otelcol_process_cpu_seconds` (exact names and suffixes vary by collector version).
Backpressure from the backend (e.g., Tempo is overloaded) can cause the collector's exporter to block. If the block exceeds the exporter timeout, spans are dropped. Mitigations: use a load-balanced collector fleet, enable 'sending_queue' with proper sizing, and set 'retry_on_failure' with exponential backoff.
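Exponential backoff with a time budget is easy to reason about once the schedule is written out. The sketch below computes the waits between retries for a given initial interval, multiplier, interval cap, and total budget. These parameters loosely correspond to the `initial_interval`, `max_interval`, and `max_elapsed_time` settings under the collector's `retry_on_failure`, but the function is a hypothetical illustration that ignores the random jitter real implementations add.

```python
def backoff_schedule(initial_s, multiplier, max_interval_s, max_elapsed_s):
    """Return the wait before each retry until the total budget is spent."""
    if initial_s <= 0 or multiplier < 1:
        raise ValueError("initial_s must be positive and multiplier at least 1")
    waits = []
    elapsed = 0.0
    interval = float(initial_s)
    # Stop as soon as the next wait would exceed the overall time budget.
    while elapsed + interval <= max_elapsed_s:
        waits.append(interval)
        elapsed += interval
        interval = min(interval * multiplier, float(max_interval_s))
    return waits
```

For instance, `backoff_schedule(1, 2, 8, 30)` yields waits of 1, 2, 4, 8, and 8 seconds, after which the data is given up on. The budget is what keeps a dead backend from pinning queue slots forever, so it should be sized against how long the sending queue can absorb traffic.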
Frequently asked questions
Why do traces appear in development but not in production?
Development environments often have lower traffic and different collector/backend configurations. In production, sampling rates may be lower, exporter timeouts may be hit, or network policies (firewalls, TLS) may block traffic. Also, production load balancers or proxies may strip trace context headers. Always test with a production-like setup.
How can I check if the SDK is actually sending spans?
Enable debug logging on the SDK: in SDKs that honor it, set the environment variable `OTEL_LOG_LEVEL=debug`. You should then see a log line for each export attempt (something like 'Exporting X spans', though the exact wording differs by SDK) along with any errors. Alternatively, add a console exporter to print spans to stdout temporarily.
What is the difference between head-based and tail-based sampling?
Head-based sampling decides at the root span whether to keep a trace (e.g., probabilistic sampler). Tail-based sampling evaluates the entire trace after all spans are collected, allowing decisions based on latency or errors. Head-based is simpler but can miss rare errors. Tail-based is more accurate but requires storing spans temporarily.
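The difference is easiest to see side by side. In the sketch below, the head-based decision uses only the trace ID, comparing its low 64 bits against a threshold derived from the ratio, which is one common way ratio samplers are built and not necessarily what your SDK does. The tail-based decision inspects finished spans. The span dictionaries with `error` and `duration_ms` keys are a hypothetical shape chosen for illustration.

```python
def head_sample(trace_id, ratio):
    """Decide from the trace ID alone, before any span has finished."""
    bound = round(ratio * (1 << 64))
    # Every service that sees the same trace ID reaches the same decision.
    return (trace_id & ((1 << 64) - 1)) < bound


def tail_sample(spans, latency_threshold_ms):
    """Decide after the trace is complete: keep errors and slow spans."""
    return any(
        span["error"] or span["duration_ms"] >= latency_threshold_ms
        for span in spans
    )
```

The head-based function needs no state, which is why it is cheap, but it cannot know that a trace will later contain an error. The tail-based function needs every span of the trace in one place first, which is the buffering cost mentioned above.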
My collector is dropping spans but I see no errors in the logs. Why?
The collector's log level may be set high enough (for example 'error') to hide warnings, or the exporter may be dropping spans because its queue is full. Check collector metrics such as `otelcol_exporter_enqueue_failed_spans` and `otelcol_exporter_send_failed_spans` (exact names vary by collector version). Also set the collector's log level to 'info' or 'debug' so that warnings are visible.