When you have a monolith, a single log line can tell you everything: request arrived, database query took 200ms, response sent. But in a microservices architecture, a single user request can fan out across 10, 20, or 50 services. Suddenly, that 200ms database query is hard to find because the evidence is scattered across the logs of many different services.
Distributed tracing solves this by giving each request a unique ID that travels with it across every service. You can then reconstruct the entire path of that request—every hop, every latency spike, every error—in one view.
How Trace Context Propagates
The core idea is simple: when a request enters your system, create a trace ID. Every service that touches that request passes the trace ID along via headers. The W3C Trace Context standard defines two headers: traceparent (contains a version, the trace ID, the parent span ID, and trace flags) and tracestate (for vendor-specific data).
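To make the header format concrete, here is a minimal sketch that parses and builds a version-00 traceparent value using only the Node.js standard library. The helper names parseTraceparent and formatTraceparent are made up for illustration; in a real service the OpenTelemetry propagator would normally do this for you.

```javascript
// Sketch: parse and build W3C traceparent values of the version-00 shape:
//   <version>-<trace-id: 32 hex>-<parent-id: 16 hex>-<flags: 2 hex>
// parseTraceparent / formatTraceparent are hypothetical helper names.
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header) {
  const match = TRACEPARENT_RE.exec(String(header).trim());
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  // Version ff and all-zero IDs are invalid per the spec.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  // Bit 0 of the flags byte is the "sampled" flag.
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

function formatTraceparent({ traceId, parentId, sampled }) {
  return `00-${traceId}-${parentId}-${sampled ? '01' : '00'}`;
}

const incoming = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01';
const parsed = parseTraceparent(incoming);
console.log(parsed.traceId, parsed.sampled);
```

When a service makes an outbound call, it keeps the trace ID, swaps in its own span ID as the parent ID, and forwards the flags unchanged.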
Here's the tricky part: every inter-service call must explicitly propagate these headers. If your HTTP client library doesn't automatically inject them, you have to do it manually. I've seen teams spend days debugging 'missing traces' only to find that a gRPC call wasn't forwarding the context.
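// Example: Manual propagation with OpenTelemetry in Node.js
const { trace, context, propagation, SpanStatusCode } = require('@opentelemetry/api');
const http = require('http');

function makeRequest(url) {
  const span = trace.getTracer('example').startSpan('makeRequest');
  // Attach the new span to the active context so its IDs are the ones injected.
  const ctx = trace.setSpan(context.active(), span);
  const headers = {};
  // Writes traceparent (and tracestate, if any) into the carrier object.
  // Without a registered SDK and propagator this call is a no-op.
  propagation.inject(ctx, headers);
  http.get(url, { headers }, (res) => {
    // ... handle response
    span.end();
  }).on('error', (err) => {
    span.recordException(err);
    span.setStatus({ code: SpanStatusCode.ERROR });
    span.end();
  });
}

A Real Incident: The 5-Second Checkout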
The Checkout Timeout
- 14:02 Pager alerts: checkout endpoint p99 latency jumps from 300ms to 5s.
- 14:05 Engineers check logs: every service looks normal individually.
- 14:10 Open a distributed trace for a failed checkout. Trace shows a 4.5s gap between 'order-service' and 'payment-gateway'.
- 14:12 Inspect the span for the payment call: it has a tag 'http.host' pointing to an old staging endpoint instead of production.
- 14:15 Fix the misconfigured environment variable in the deployment config. Deploy fix.
- 14:18 Latency drops back to 300ms. Incident resolved.
Lesson
Without distributed tracing, we would have wasted hours checking individual service logs. The trace immediately pointed to the call from order-service to the payment gateway, revealing a configuration error that was invisible in per-service metrics.
Sampling: You Can't Keep Every Trace
In a high-traffic system, storing every single trace is expensive. You need sampling. The simplest approach is head-based sampling: at the root of each trace, decide with probability p whether to keep it. For example, keep 1% of all traces. This is easy to implement but can miss rare errors.
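As a rough illustration of head-based sampling, the sketch below makes the keep/drop decision once, when the trace ID is created, by mapping the first 32 bits of the ID to a number in [0, 1) and comparing it with the rate. The function names are hypothetical, and this is not OpenTelemetry's actual sampler, only the same general idea.

```javascript
// Sketch of a head-based sampling decision made at the root of a trace.
// newTraceId, shouldSample and startTrace are hypothetical names.
const { randomBytes } = require('node:crypto');

function newTraceId() {
  return randomBytes(16).toString('hex'); // 32 hex characters
}

// Deterministic: the same trace ID always yields the same decision,
// so any service that re-evaluates it reaches the same answer.
function shouldSample(traceId, rate) {
  const bucket = parseInt(traceId.slice(0, 8), 16) / 0x100000000; // in [0, 1)
  return bucket < rate;
}

function startTrace(rate) {
  const traceId = newTraceId();
  // The decision would travel downstream in the traceparent flags.
  return { traceId, sampled: shouldSample(traceId, rate) };
}

let kept = 0;
const total = 100000;
for (let i = 0; i < total; i++) {
  if (startTrace(0.01).sampled) kept++;
}
console.log(`kept ${kept} of ${total} traces`);
```

Because the decision happens before anything is known about the request, a failing request has the same 1% chance of being kept as a healthy one.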
Tail-based sampling is more sophisticated: you collect all spans temporarily, then decide which traces to keep based on their properties (e.g., any span with an error, or latency above a threshold). This ensures you capture the interesting traces, but requires a centralized decision service and more infrastructure.
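The tail-based decision can be sketched in a few lines, assuming spans arrive as plain objects with traceId, status, startMs, and endMs fields. That shape, the 1000ms threshold, and the function names are all assumptions for illustration. A real collector also has to decide when a trace counts as complete, which this sketch leaves to the caller.

```javascript
// Sketch: buffer spans per trace, then decide once the trace is finished.
// The span shape and LATENCY_THRESHOLD_MS are assumptions for illustration.
const LATENCY_THRESHOLD_MS = 1000;

const buffers = new Map(); // traceId -> array of spans

function recordSpan(span) {
  if (!buffers.has(span.traceId)) buffers.set(span.traceId, []);
  buffers.get(span.traceId).push(span);
}

// Keep the trace if any span failed or the whole trace was slow.
function keepTrace(spans) {
  const hasError = spans.some((s) => s.status === 'error');
  const start = Math.min(...spans.map((s) => s.startMs));
  const end = Math.max(...spans.map((s) => s.endMs));
  return hasError || end - start > LATENCY_THRESHOLD_MS;
}

// Returns the buffered spans if the trace should be stored, otherwise null.
function finishTrace(traceId) {
  const spans = buffers.get(traceId) || [];
  buffers.delete(traceId);
  return spans.length > 0 && keepTrace(spans) ? spans : null;
}

recordSpan({ traceId: 't1', name: 'order-service', status: 'ok', startMs: 0, endMs: 120 });
recordSpan({ traceId: 't2', name: 'order-service', status: 'ok', startMs: 0, endMs: 4800 });
recordSpan({ traceId: 't2', name: 'payment-gateway', status: 'ok', startMs: 300, endMs: 4800 });

console.log(finishTrace('t1')); // null: fast and error-free, so dropped
console.log(finishTrace('t2').length); // 2: kept because it exceeded the threshold
```

The buffer is where the extra infrastructure cost comes from: every span of every trace has to be held somewhere until the decision is made.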
Head-based sampling with a fixed rate can miss critical traces. If you sample 1%, you keep only about 1 in 100 failing requests, whatever the error rate is. With an error rate of 0.1%, that works out to roughly one sampled error per 100,000 requests. Use adaptive sampling or tail-based sampling for high-reliability systems.
What Makes a Good Trace?
A trace is more than just a list of spans. Each span should carry meaningful attributes: service name, operation, duration, status code, and key metadata (e.g., user ID, region, request size). Be selective with high-cardinality values such as session IDs: attaching them to every span can inflate storage and indexing costs, depending on the backend.
Also, think about span granularity. Too coarse: you lose insight into which internal function caused the delay. Too fine: you overwhelm storage and the UI. A good rule is to create a span for every external call (HTTP, DB, message queue) and for any internal operation that takes more than 10ms.
A trace that doesn't tell you where the time went is just a pretty waterfall. Always add enough detail to answer: 'Why was this slow?'
Correlating Traces with Logs and Metrics
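Tracing is most powerful when combined with logs and metrics. The trace ID should appear in every log line for the request. Then, when you see a slow trace, you can jump to the exact logs for that request. Similarly, metrics like request duration can link to example traces through exemplars (a trace ID attached to an individual data point rather than added as a label, which would explode metric cardinality), so you can drill down from a latency spike to a concrete trace.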
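One way to get the trace ID into every log line without threading it through each function call is Node's AsyncLocalStorage. The sketch below is a hand-rolled illustration, not OpenTelemetry's own mechanism. The names requestContext, handleRequest, and formatLog are hypothetical, and the IDs are placeholder values.

```javascript
// Sketch: carry trace context per request and stamp it on every log line.
// requestContext, handleRequest and formatLog are hypothetical names.
const { AsyncLocalStorage } = require('node:async_hooks');

const requestContext = new AsyncLocalStorage();

function formatLog(level, message, fields = {}) {
  // Outside a request there is no store. JSON.stringify then drops the
  // undefined trace_id and span_id keys from the output.
  const ctx = requestContext.getStore() || {};
  return JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    message,
    trace_id: ctx.traceId,
    span_id: ctx.spanId,
    ...fields,
  });
}

// Everything called inside `work`, including async callbacks,
// sees the same trace context.
function handleRequest(traceId, spanId, work) {
  return requestContext.run({ traceId, spanId }, work);
}

const line = handleRequest('4bf92f3577b34da6a3ce929d0e0e4736', '00f067aa0ba902b7', () =>
  formatLog('error', 'Payment gateway timeout', { service: 'order-service' })
);
console.log(line);
```

With the IDs in place, searching your log store for a trace_id returns every line written while handling that request.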
OpenTelemetry's log correlation is straightforward: include trace_id and span_id in your log format. Most logging frameworks support this with a simple configuration change.
{
  "timestamp": "2025-03-15T14:02:05.123Z",
  "level": "error",
  "message": "Payment gateway timeout",
  "service": "order-service",
  "trace_id": "abc123def456",
  "span_id": "span789",
  "user_id": "user42"
}

Getting Started Without the Burnout
You don't need to instrument everything at once. Start with one critical path—say, the user login flow or the checkout flow. Instrument the entry service, then the next two downstream services. Once you see how traces look, expand.
Use OpenTelemetry's auto-instrumentation for common libraries (HTTP, gRPC, database drivers). It covers 90% of what you need. Only add manual instrumentation for business logic that's unique to your system.
And please, don't try to build a production-grade tracing backend on your first day. Use a managed service (Datadog, Honeycomb, Grafana Cloud) or a simple Jaeger all-in-one deployment to see traces quickly. You can optimize later.
Frequently asked questions
What's the difference between distributed tracing and logging?
Logging records events at a single point in time in one service. Distributed tracing correlates events across multiple services into one request lifecycle, showing time spent in each service and the causal order.
Do I need to instrument every single service?
Not necessarily. Start with services that handle user-facing requests or are part of critical paths. Even partial tracing gives you 80% of the value. You can gradually add more services.
How does sampling work in distributed tracing?
Sampling decides which traces to keep. Head-based sampling decides at the root of a trace, usually with a fixed probability. Tail-based sampling uses a centralized decision based on trace properties (e.g., errors or high latency) after the trace completes.
Is OpenTelemetry the only way to do distributed tracing?
No, but it's become the industry standard. Alternatives include Zipkin's Brave, Jaeger's own client libraries (now deprecated in favor of OpenTelemetry), or vendor-specific agents like Datadog's APM. OpenTelemetry is vendor-agnostic and integrates with many backends.