Observability11 min read

Why Your Production Logs Are Lying to You

Logs tell you what the code reported. They almost never tell you what actually happened. Here is the gap, and how to close it.

loggingobservabilityproductionincidentsdebugging

Three years ago I ran a post-mortem on a payment outage where the logs showed a clean transaction stream. Zero errors. Zero warnings. Eight percent of payments had silently failed — refund requests started showing up in support tickets two days later.

The logs were not lying by omission. They were lying by design: the success log fired before the database commit. Every payment that failed between log emission and commit looked like a success in our log stream. We had built a system that confidently reported things that had not happened yet.

The five ways logs deceive you

  1. 1Logs report intent, not outcome — the write fires before the operation is durable
  2. 2Missing context makes individual log lines unattributable at scale
  3. 3Clock skew renders multi-service timelines incorrect
  4. 4Log sampling drops the exact events you need during high-traffic incidents
  5. 5Misconfigured log levels silently remove the signal you are relying on

Intent vs outcome: the most dangerous gap

Imagine this code pattern, which I have seen in production at companies that should know better:

You log 'order placed' and then you persist it to the database. If the database write fails after the log, your log shows a successful order that never existed. The fix is to log after confirmation — after the commit, after the ACK, after the external API returns 200.

This sounds obvious. Under deadline pressure, with a senior engineer saying 'just add a log line here so we can track it,' it gets done wrong. Audit your most critical log lines: do they fire before or after the thing they claim to record?

Wrong: log fires before persistence
logger.info('order.placed', { orderId });
await db.orders.insert({ orderId, ...payload }); // if this throws, log is false
Correct: log fires after confirmation
await db.orders.insert({ orderId, ...payload });
logger.info('order.placed', { orderId }); // only emits if insert succeeded

Missing context: the needle-in-a-haystack problem

A log line that says 'connection failed' tells you nothing in a system processing 50,000 requests per second. Which connection? Which user? Which tenant? Which service called it? Logs without a request ID, user ID, and trace ID are archaeology artifacts — you can find them but you cannot explain them in context.

The fix is straightforward: inject a correlation ID at the entry point (the HTTP handler, the queue consumer, the cron job), put it in thread-local or async context, and pull it into every log line in that request's lifetime. One hour of instrumentation work pays dividends for years.

The ghost tenant

  1. 09:14Support ticket: 'API returning 403 intermittently for some users'
  2. 09:31On-call checks logs: sees 403s but no pattern — no user ID, no tenant in log lines
  3. 10:45After 74 minutes: discovers logs have tenant only in debug level (disabled in prod)
  4. 10:52Temporary debug enable reveals all 403s are from one tenant with an expired token
  5. 11:05Fix: tenant ID added to all auth log lines permanently. Resolution: 111 minutes

Lesson

Debug-level context should be promoted to info for fields that identify who the request is for.

What a well-instrumented log line looks like

Structured log with full context
{
  "ts": "2025-06-08T14:32:07.441Z",
  "level": "info",
  "event": "payment.confirmed",
  "traceId": "4f3a2b1c",
  "requestId": "req_9xk2",
  "userId": "usr_8812",
  "tenantId": "t_acme",
  "paymentId": "pay_7731",
  "amountCents": 4999,
  "durationMs": 143
}
  • arrow_rightISO timestamp with millisecond precision — needed for cross-service ordering
  • arrow_righttraceId threads the full request through every downstream service
  • arrow_rightevent is a dot-namespaced verb, not free-form prose — queryable without regex
  • arrow_rightuserId and tenantId on every line — never debug-only
  • arrow_rightdurationMs on every external call — performance regression is observable without adding instrumentation later

Frequently asked questions

How do I make production logs more useful for debugging?

Every log line should answer: who (request ID / user / tenant), what (event name), when (ISO timestamp with milliseconds), and why it matters (structured fields, not prose). Add a trace ID so you can follow a request across services.

What is structured logging and why does it matter?

Structured logging emits JSON or key-value pairs rather than free-form strings. This lets you query across fields — 'show me all 500s for tenant X in the last hour' — without fragile regex on log prose.

Why do logs sometimes show success before a crash?

Because the log write happens before the operation completes or commits. A 'payment recorded' log line fires before the database write is durable. If the process dies between the log and the commit, you have a false success in the log stream.