Three years ago I ran a post-mortem on a payment outage where the logs showed a clean transaction stream. Zero errors. Zero warnings. Eight percent of payments had silently failed — refund requests started showing up in support tickets two days later.
The logs were not lying by omission. They were lying by design: the success log fired before the database commit. Every payment that failed between log emission and commit looked like a success in our log stream. We had built a system that confidently reported things that had not happened yet.
The five ways logs deceive you
- 1Logs report intent, not outcome — the write fires before the operation is durable
- 2Missing context makes individual log lines unattributable at scale
- 3Clock skew renders multi-service timelines incorrect
- 4Log sampling drops the exact events you need during high-traffic incidents
- 5Misconfigured log levels silently remove the signal you are relying on
Related debugging guides on Buglyst
Intent vs outcome: the most dangerous gap
Imagine this code pattern, which I have seen in production at companies that should know better:
You log 'order placed' and then you persist it to the database. If the database write fails after the log, your log shows a successful order that never existed. The fix is to log after confirmation — after the commit, after the ACK, after the external API returns 200.
This sounds obvious. Under deadline pressure, with a senior engineer saying 'just add a log line here so we can track it,' it gets done wrong. Audit your most critical log lines: do they fire before or after the thing they claim to record?
logger.info('order.placed', { orderId });
await db.orders.insert({ orderId, ...payload }); // if this throws, log is falseawait db.orders.insert({ orderId, ...payload });
logger.info('order.placed', { orderId }); // only emits if insert succeededMissing context: the needle-in-a-haystack problem
A log line that says 'connection failed' tells you nothing in a system processing 50,000 requests per second. Which connection? Which user? Which tenant? Which service called it? Logs without a request ID, user ID, and trace ID are archaeology artifacts — you can find them but you cannot explain them in context.
The fix is straightforward: inject a correlation ID at the entry point (the HTTP handler, the queue consumer, the cron job), put it in thread-local or async context, and pull it into every log line in that request's lifetime. One hour of instrumentation work pays dividends for years.
The ghost tenant
- 09:14Support ticket: 'API returning 403 intermittently for some users'
- 09:31On-call checks logs: sees 403s but no pattern — no user ID, no tenant in log lines
- 10:45After 74 minutes: discovers logs have tenant only in debug level (disabled in prod)
- 10:52Temporary debug enable reveals all 403s are from one tenant with an expired token
- 11:05Fix: tenant ID added to all auth log lines permanently. Resolution: 111 minutes
Lesson
Debug-level context should be promoted to info for fields that identify who the request is for.
What a well-instrumented log line looks like
{
"ts": "2025-06-08T14:32:07.441Z",
"level": "info",
"event": "payment.confirmed",
"traceId": "4f3a2b1c",
"requestId": "req_9xk2",
"userId": "usr_8812",
"tenantId": "t_acme",
"paymentId": "pay_7731",
"amountCents": 4999,
"durationMs": 143
}- arrow_rightISO timestamp with millisecond precision — needed for cross-service ordering
- arrow_righttraceId threads the full request through every downstream service
- arrow_rightevent is a dot-namespaced verb, not free-form prose — queryable without regex
- arrow_rightuserId and tenantId on every line — never debug-only
- arrow_rightdurationMs on every external call — performance regression is observable without adding instrumentation later
Frequently asked questions
How do I make production logs more useful for debugging?
Every log line should answer: who (request ID / user / tenant), what (event name), when (ISO timestamp with milliseconds), and why it matters (structured fields, not prose). Add a trace ID so you can follow a request across services.
What is structured logging and why does it matter?
Structured logging emits JSON or key-value pairs rather than free-form strings. This lets you query across fields — 'show me all 500s for tenant X in the last hour' — without fragile regex on log prose.
Why do logs sometimes show success before a crash?
Because the log write happens before the operation completes or commits. A 'payment recorded' log line fires before the database write is durable. If the process dies between the log and the commit, you have a false success in the log stream.