Debugging Distributed Systems: Microservices Tracing Example

It started with a support ticket: "Checkout is stuck for some users." No error message, no 500 status code. The frontend just hung on the final confirmation step. We had five microservices involved in checkout — order service, payment service, inventory service, notification service, and a queue-based async workflow that sent order confirmations.

The first thing I did was check the logs. All services reported "200 OK" for every request. No exceptions, no timeouts. But users were stuck. That's when I learned the first lesson: logs lie. Not intentionally, but they only tell you what your code explicitly records. If the bottleneck is in an infrastructure layer — like a message queue — your application logs will be silent.

The Setup: Async Workflows and Silent Failures

Our checkout flow worked like this: after payment was captured, the order service published a "order_confirmed" event to RabbitMQ. The notification service consumed that event and sent an email. The inventory service also consumed it to decrement stock. All of that was supposed to happen in under a second.

But users reported that after placing an order, they never got a confirmation email and the UI never transitioned to the success page. The payment went through — we could see the charge in Stripe — but the order status remained "pending" in our database.

warning

When payment succeeds but the order stays pending, you have an async processing problem. Never assume success downstream just because upstream succeeded.

First Attempt: Metrics

I pulled up our Prometheus dashboards. Order service latency looked fine — p99 under 200ms. Payment service similarly. But then I looked at the queue metrics: RabbitMQ had a queue depth of 0 for most queues, except one: "order_confirmation_retry". That queue had 1,200 messages waiting. And the consumer for that queue was not processing any messages.

The consumer was the notification service. It had crashed hours ago and not restarted. But why hadn't it restarted? Kubernetes should have restarted the pod. I checked the pod status: CrashLoopBackOff. The notification service was crashing on startup because of a missing environment variable — a configuration that had been deployed with a typo an hour before the incident started.

A single typo in a ConfigMap caused the notification service to crash on startup. The service never logged a clear error — it just panicked and exited.

# The broken config that caused the crash loop
apiVersion: v1
kind: ConfigMap
metadata:
  name: notification-service-config
data:
  SMTP_HOST: "smtp.example.com"
  SMTP_PORT: "587"
  SMTP_USER: "notifications@example.com"
  # Typo: missing underscore in SMTP_PASSWORD
  SMTP_PASS: "supersecret"  # This key is wrong, app expects SMTP_PASSWORD

We fixed the typo, the pod restarted, and the queue drained. Users got their emails. The issue was resolved in 10 minutes. But I wanted to understand why it took 45 minutes to diagnose. The answer: we had no tracing. We couldn't see that the order_confirmed event was published but never consumed. The order service logs showed "published event successfully", but there was no link to the consumer side.

What Traces Would Have Shown

If we had distributed tracing with OpenTelemetry, we would have seen a span for the entire checkout flow. The span from the order service would show "publish to queue" as a child operation. But the consumer span from the notification service would be missing entirely — an orphaned trace. That would immediately tell us the message was never consumed.

Even better, we could have set up a metric on trace completion: if a trace starts with a "checkout" root but never reaches the "send_email" span, alert. That's the kind of observability that prevents a 45-minute fire drill.

The Stuck Queue Incident

14:02Deploy of notification service v2.3.1 with typo in ConfigMap
14:03Notification service crashes, enters CrashLoopBackOff
14:05First user reports stuck checkout; support ticket created
14:20Engineer checks logs: all services return 200, no errors visible
14:35Engineer looks at RabbitMQ metrics, sees queue depth growing
14:40Kubernetes pod status checked: CrashLoopBackOff detected
14:45ConfigMap typo identified and fixed; pod restarts
14:50Queue drains, users receive emails, incident resolved

Lesson

Without distributed tracing, we wasted 30 minutes checking logs that showed nothing wrong. Queue metrics helped, but only after we thought to look at them. Traces would have immediately shown the missing consumer span.

Building a Debugging Toolkit for Distributed Systems

After that incident, we invested in three things that changed how we debug:

1. Distributed tracing with OpenTelemetry across all services, with a focus on async paths (message queues, background jobs).

2. Structured logging with correlation IDs that propagate via trace headers, so we can filter all logs for a single checkout flow.

3. Service-level indicators (SLIs) for each async path: queue depth, consumer lag, and span completion rate.

Instrumenting a RabbitMQ consumer with OpenTelemetry. The key is propagating trace context through message headers so the consumer span is correctly linked to the producer span.

// Example: Adding OpenTelemetry tracing to a RabbitMQ consumer in Node.js
const { trace, context } = require('@opentelemetry/api');
const amqp = require('amqplib');

async function consumeOrderConfirmed(channel) {
  const consumerTag = await channel.consume('order_confirmed', async (msg) => {
    // Extract the parent span context from the message headers
    const parentContext = opentelemetry.propagation.extract(context.active(), msg.properties.headers);
    
    // Start a new span as a child of the extracted context
    const tracer = trace.getTracer('notification-service');
    await tracer.startActiveSpan('process_order_confirmed', { parent: parentContext }, async (span) => {
      try {
        span.setAttribute('order_id', msg.content.toString());
        await sendEmail(msg.content.toString());
        channel.ack(msg);
        span.setStatus({ code: SpanStatusCode.OK });
      } catch (err) {
        span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
        channel.nack(msg, false, true); // requeue
      } finally {
        span.end();
      }
    });
  });
}

The Non-Obvious: Async Paths Are Where Bugs Go to Hide

Synchronous request-response paths are relatively easy to debug. You have a single HTTP request, a trace ID, and you can follow it from ingress to egress. But async paths — queues, pub/sub, scheduled jobs — break that linear flow. The producer and consumer live in different processes, possibly on different machines, with no shared context unless you explicitly propagate it.

And the worst part: failures in async paths often don't surface as errors. A message that fails to be consumed might be requeued, retried, and eventually dead-lettered. The producer sees "published successfully" and moves on. The consumer might log "processing failed, requeueing" but if you're not looking at that specific log, you miss it. The user, meanwhile, sees infinite loading.

lightbulb

Always instrument your message brokers. Track queue depth, consumer lag, dead-letter queue size, and — most importantly — the age of the oldest unprocessed message. That last metric catches issues where messages are being processed but too slowly (e.g., due to a performance regression).

The Debugging Loop: From Logs to Traces to Metrics

When a new incident comes in, I follow a sequence:

1. Check if there's an ongoing alert — metrics first. Look for anomalies in latency, error rate, throughput, and queue metrics.

2. If metrics show a problem, pull up traces for the affected endpoint or flow. Filter by error or high latency. Look for missing spans (orphaned traces) or unusually long child spans.

3. Use the trace ID to correlate logs across services. Even with structured logging, the trace ID is the glue that lets you see the full story.

4. If the issue is a silent failure (no errors, just stuck users), focus on async paths: is the message being produced? Is it being consumed? What's the queue depth? Is the consumer alive?

73%

of distributed system outages involve async processing paths (based on internal analysis of 200 incidents)

That stat comes from our own post-incident reviews. The majority of outages aren't from a service crashing — they're from a service silently not processing messages, or processing them too slowly, or processing them incorrectly without error. Distributed tracing is the only way to see that a message was produced but the consumer never picked it up.

What We Changed After the Incident

1Added OpenTelemetry auto-instrumentation to all services, including RabbitMQ and Redis clients.
2Created a dashboard for each async workflow showing: producer publish rate, consumer ack rate, queue depth, consumer lag, and dead-letter count.
3Set up an alert on 'consumer not acking any messages for 5 minutes' — this would have caught the crash loop immediately.
4Added a health check endpoint to each service that verifies it can connect to its message broker and database, so Kubernetes can detect issues before traffic is routed.
5Introduced a 'trace completeness' metric: for each root span type (e.g., checkout), we measure the percentage of traces that contain all expected child spans. If it drops below 95%, alert.

info

The trace completeness metric is the most valuable addition. It directly tells you when an async path breaks, even if the error is silent. We've caught three similar incidents since implementing it.

The cost of adding distributed tracing was not trivial — about two engineering weeks to instrument all services and set up the backend (we used Jaeger). But it paid for itself in the first week when it caught a similar queue issue before any user noticed.

If you're debugging distributed systems and you don't have distributed tracing, you're flying blind. Logs are useful, metrics are essential, but traces are the only tool that can connect a request across service boundaries. And if you're running async workflows, traces are non-negotiable.

Frequently asked questions

What is the hardest part of debugging distributed systems?

The hardest part is the lack of a single source of truth. In a monolith, you can often reproduce a bug by running the same code path. In distributed systems, the same request might take different routes depending on load, network timing, or state. You need distributed tracing, structured logging, and metrics to even see what happened.

Why can't I just use logs to debug microservices?

Logs are per-service and per-instance. Without a correlation ID, you can't link a log line from service A to the corresponding line in service B. Also, logs don't capture timing of async operations like queue waits. You might see 'message processed' but not the 5-second delay before processing.

How do I start adding distributed tracing to my system?

The easiest path is to use OpenTelemetry with auto-instrumentation for your language and frameworks (e.g., Python with Django, Java with Spring). It adds minimal overhead and can export traces to Jaeger, Zipkin, or a SaaS backend. Start by instrumenting your ingress controller and key services.

What are common pitfalls when reading traces?

One pitfall is assuming a trace is the full picture. Traces only show what's instrumented. If a service doesn't propagate the trace context, you get a broken trace. Also, sampling can miss rare errors. Always export 100% of error traces, and sample the rest.

Logs, Traces, and the Lies They Tell: Debugging a Stuck Queue in Production