When Logs Lie: Observability Gaps and Silent Failures

Every engineer has been there: an alert fires, you SSH into the box, tail the logs, and see nothing wrong. No errors, no warnings. Just a stream of '200 OK' and 'INFO'. You restart the service, the alert goes away, and you close the ticket with 'transient issue'. But deep down you know — something happened, and the logs didn't tell you.

Logs are the most accessible observability tool we have. They're also the most deceptive. They give us a false sense of completeness. We treat them as a ground truth, but they're really just the output of whatever the developer thought to log at 3am. This post covers the concrete ways logs mislead us, with real incidents I've experienced or investigated.

The Ordering Mirage: When Logs Rearrange Reality

Logs from a single process are usually timestamped and sequential. But as soon as you involve multiple threads, async queues, or microservices, the ordering becomes a guess. I once debugged a production outage where a downstream service returned a 500, but the upstream service's log showed the request as 'completed successfully'. The timeline in the logs suggested the success happened before the failure — which was impossible.

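Warning: Timestamps from different machines are never perfectly synchronized. Even with NTP, clock skew of milliseconds to tens of milliseconds is common, and larger offsets are possible on loosely managed hosts. When a request crosses multiple services, its log lines may appear out of order. To reconstruct causality, group lines by correlation ID and order them by something that does not depend on wall-clock time, such as a per-request step counter or the parent/child span relationships from a trace.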
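To make the point concrete, here is a minimal sketch of ordering one request's events without trusting the clocks. It assumes every hop logs the correlation ID plus a hypothetical `step` counter that the caller increments and passes along; the `LogEvent` type, the field names, and the sample values are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEvent:
    correlation_id: str  # same value on every hop of one request
    step: int            # hypothetical counter, incremented at each hop
    timestamp: float     # wall-clock seconds as reported by the emitting host
    service: str
    message: str


def causal_order(events: list[LogEvent], correlation_id: str) -> list[LogEvent]:
    """Return one request's events ordered by step, ignoring wall-clock time."""
    related = [e for e in events if e.correlation_id == correlation_id]
    return sorted(related, key=lambda e: e.step)


events = [
    # The downstream clock runs behind, so its failure looks "earlier".
    LogEvent("req-1", 2, 100.020, "downstream", "returned 500"),
    LogEvent("req-1", 1, 100.050, "upstream", "calling downstream"),
    LogEvent("req-1", 3, 100.060, "upstream", "received 500"),
]

# Sorting by timestamp puts the 500 before the call that caused it.
by_clock = [e.message for e in sorted(events, key=lambda e: e.timestamp)]
# Sorting by step restores the order in which things actually happened.
by_step = [e.message for e in causal_order(events, "req-1")]
```

A single counter only works for linear call chains. Once a request fans out to several services in parallel, span IDs with parent references (as tracing systems record them) are the more robust choice.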

The Asynchronous Log Write Problem

A common async logging pattern. The first log line is emitted before the await — but if the await raises, the second log never runs. Yet the first log suggests the operation started fine.

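import logging

logger = logging.getLogger(__name__)

# payment_gateway is assumed to be an async client object defined elsewhere.
async def process_payment(user_id: str, amount: float):
    # Emitted before the call: proves the attempt started, nothing more.
    logger.info("Processing payment for user %s", user_id)
    # Actually send to payment gateway (async); this may raise or hang.
    result = await payment_gateway.charge(user_id, amount)
    # Only reached if the await returns normally.
    logger.info("Payment result: %s", result)
    return result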

The code above looks innocent. But imagine the await raises an exception that is caught elsewhere, or the payment gateway hangs. The first log says 'Processing payment' — but the operation never completes. If you only look at the logs, you see a start but no end. That gap is easy to miss in a firehose of logs.

The fix: always pair a start and end log with the same correlation ID, and set up a watch that alerts if a start log is not followed by an end log within a timeout.
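A rough sketch of that fix, under a few assumptions: the wrapper name `logged_operation`, the `op=`/`correlation_id=`/`outcome=` line format, and the `(kind, correlation_id, timestamp)` event tuples are all invented for illustration, and in practice the watchdog would read its events from your log store rather than a Python list.

```python
import asyncio
import logging
import uuid

logger = logging.getLogger("payments")


async def logged_operation(name, coro_fn, *args):
    """Await coro_fn(*args), logging a start line and a matching end line."""
    correlation_id = uuid.uuid4().hex
    logger.info("start op=%s correlation_id=%s", name, correlation_id)
    outcome = "aborted"  # stays this way on cancellation or other BaseException
    try:
        result = await coro_fn(*args)
        outcome = "success"
        return result
    except Exception:
        outcome = "failure"
        raise
    finally:
        # Runs on success, failure, and cancellation, but not while hanging.
        logger.info(
            "end op=%s correlation_id=%s outcome=%s", name, correlation_id, outcome
        )


def find_unfinished(events, now, timeout_s):
    """Return correlation IDs whose start is older than timeout_s with no end.

    events: iterable of (kind, correlation_id, timestamp) tuples,
    where kind is "start" or "end". Order of the tuples does not matter.
    """
    starts, ended = {}, set()
    for kind, cid, ts in events:
        if kind == "start":
            starts[cid] = ts
        elif kind == "end":
            ended.add(cid)
    return sorted(
        cid for cid, ts in starts.items()
        if cid not in ended and now - ts > timeout_s
    )
```

The wrapper covers the exception case, because the `finally` block always writes an end line. The hang case is exactly what it cannot cover, which is why the separate `find_unfinished` check exists: it is the only part that notices a start with nothing after it.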

23% of production incidents in a 2022 study involved misleading log data as a contributing factor.

The Silent Drop: When the Pipeline Eats Your Logs

Logs don't magically appear in your SIEM or Elasticsearch cluster. They travel through a pipeline: application -> stdout -> log shipper (e.g., Filebeat, Fluentd) -> buffer (e.g., Kafka) -> indexer. Every step can fail silently.

I worked on a team where we had a daily batch job that logged exactly 10,000 lines. One day, a dashboard showed only 9,500. No error logs. No alert. The Filebeat process had hit a memory limit and started dropping messages. The logs that made it through looked perfectly normal — we had no idea we were missing 5% of our data.

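Info: Add a sequence number to every log line. If you see a gap in the sequence for a given process, you know the pipeline dropped messages. This is simple to implement with one counter per process, logged alongside a process or host identifier so that sequences from different sources don't get mixed together.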
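Here is one way this could look with Python's standard `logging` module. The `SequenceFilter` class and `find_gaps` helper are names I made up for this sketch; `%(process)d` is the standard LogRecord attribute for the process ID, while `%(seq)d` only exists because the filter adds it.

```python
import itertools
import logging
import threading


class SequenceFilter(logging.Filter):
    """Stamp every record with a per-process, strictly increasing number."""

    def __init__(self):
        super().__init__()
        self._counter = itertools.count(1)
        self._lock = threading.Lock()

    def filter(self, record):
        with self._lock:
            record.seq = next(self._counter)
        return True  # never drop the record, only annotate it


def find_gaps(seqs):
    """Return (first_missing, last_missing) ranges in received sequence numbers."""
    ordered = sorted(set(seqs))
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > 1:
            gaps.append((prev + 1, cur - 1))
    return gaps


# Attach the filter to the handler so every line it writes carries a number.
handler = logging.StreamHandler()
handler.addFilter(SequenceFilter())
handler.setFormatter(
    logging.Formatter("%(process)d %(seq)d %(levelname)s %(message)s")
)
```

On the receiving side, running something like `find_gaps` per process ID over a time window turns "the dashboard looks a bit low" into a concrete list of missing ranges. The counter restarts when the process restarts, so gap detection should treat a new process ID as a new sequence.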

The Lying Status Code: HTTP 200 Does Not Mean Success

One of my favorite categories of misleading logs is the '200 OK' that is actually a failure. Here's a real scenario from a microservice I owned: an API endpoint returned 200 but the response body contained an error message. The logging framework was configured to log the status code and the path, but not the body. Every single call logged as successful, while users saw errors in the UI.

The 200 That Wasn't

14:32 Alert: customer signup success rate dropped to 30%
14:35 Check logs: all signup API calls return 200
14:40 Check response body: actual error is 'duplicate email' but status is still 200
15:00 Root cause: API gateway swallows error responses and returns 200 with error in body

Lesson

Never trust the HTTP status code alone. Log the response body (with sensitive fields redacted) or at least a hash of it. Better yet, use a response schema that includes an explicit success flag, and log that flag on every call.
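As a sketch of what logging an explicit outcome might look like, the function below derives success from the status code and the body together. It assumes a JSON body with a top-level `error` field, which is a hypothetical shape; substitute whatever your API actually returns.

```python
import json


def classify_outcome(status_code: int, body: str) -> dict:
    """Derive an explicit outcome from the status code and the body together."""
    error = None
    try:
        payload = json.loads(body)
    except ValueError:
        payload = None  # not JSON; fall back to the status code alone
    if isinstance(payload, dict):
        # "error" is a hypothetical field name used for illustration.
        error = payload.get("error")
    success = 200 <= status_code < 300 and not error
    return {"status": status_code, "success": success, "error": error}
```

Logging the returned dictionary instead of the bare status code means a call like `classify_outcome(200, '{"error": "duplicate email"}')` is recorded with `success` set to `False`, so the dashboard and the user finally agree.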

The I/O Buffer: When the Log Says Something That Never Happened

Most language runtimes buffer stdout, especially when it is redirected to a file or a pipe. If the process crashes immediately after a log call, the line may still be sitting in the in-process buffer and never reach the file. The code ran, but the log viewer has no record of it. Conversely, a log line can show up after a crash: data already handed to the kernel or held by the log shipper is still delivered after the process dies, and if the viewer orders lines by arrival time, it looks as though the process kept running after it died.

A simple demonstration of how buffered I/O can swallow log output. Always use auto-flush in production logging configurations.

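# Example: Python's print buffer behavior when stdout is a pipe.
# os._exit() stands in for a hard crash: it exits without flushing buffers.
# (sys.exit() would not show the problem, because a normal exit flushes stdout.)
$ python3 -c "import os; print('before', end=''); os._exit(1)" 2>&1 | cat
# Nothing output! The print was buffered and never flushed.

# With flush=True:
$ python3 -c "import os; print('before', flush=True, end=''); os._exit(1)" 2>&1 | cat
before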

How to Protect Yourself from Log Lies

1. Add correlation IDs to every log line and propagate them across all services (HTTP headers, message queues).
2. Log the full context: request ID, user ID, action, and outcome. Include a success/failure field explicitly.
3. Use structured logging (JSON) so you can query and alert on specific fields.
4. Monitor your log pipeline health: throughput, lag, dropped messages. Set alerts on anomalies.
5. Cross-check logs with metrics and traces. If metrics show high error rates but logs show none, you have a logging problem.
6. Test your logging: inject synthetic errors and verify they appear in the log aggregator within seconds.
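Items 1 to 3 can be combined in a small formatter. This is a minimal sketch using only the standard library; the `JsonFormatter` class and the four field names are my own choices for illustration, not a standard schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with explicit context fields."""

    # Context fields expected on every line; missing ones are logged as null
    # so their absence is visible instead of silent.
    FIELDS = ("correlation_id", "user_id", "action", "outcome")

    def format(self, record):
        entry = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        for field in self.FIELDS:
            entry[field] = getattr(record, field, None)
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
```

Callers supply the context through the standard `extra` argument, for example `logger.info("signup", extra={"correlation_id": "req-1", "action": "signup", "outcome": "failure"})`. Because `outcome` is its own field, an alert can count failures directly instead of parsing message text.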

Logs are not the source of truth. They are a noisy, delayed, lossy signal. Treat them as one signal among many.

The Observability Triad: Logs, Metrics, Traces

Relying on logs alone is like trying to debug a car with only a rearview mirror. Metrics give you the speed (throughput, latency, error rate). Traces give you the journey of a single request across services. Logs give you the detailed events. Each can lie, but they rarely lie in the same way.

When I see a discrepancy between logs and metrics, I start investigating the logging system itself. That's often where the real bug is.
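That comparison is easy to automate. A rough sketch, assuming you can pull an error count for the same time window from both your metrics system and your log store; the function name and the 5% default tolerance are arbitrary choices for illustration.

```python
def logging_gap(metric_errors: int, logged_errors: int, tolerance: float = 0.05) -> bool:
    """Return True when logs account for noticeably fewer errors than metrics."""
    if metric_errors == 0:
        return False  # nothing to cross-check in this window
    missing = metric_errors - logged_errors
    return missing / metric_errors > tolerance
```

In the signup incident described earlier, metrics showed most requests failing while the logs showed zero errors, so a check like this would have pointed at the logging layer within minutes.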

Frequently asked questions

Why do logs sometimes show a success but the operation actually failed?

Logs are often written before the actual I/O completes. For example, a log line like 'Payment processed' might be emitted before the payment gateway's response is fully committed. If the gateway then fails asynchronously, the log never records the failure. Log the outcome only after the operation is confirmed, and word any earlier line as an attempt ('Processing payment'), not as a result.

Can log aggregation tools like Elasticsearch or Loki lose logs?

Yes. Network partitions, buffer overflows, or misconfigured rate limits can cause logs to be dropped silently. I've seen cases where Logstash dropped logs due to a full disk, and the only hint was a subtle count mismatch in dashboards. Always add a health check that monitors the log pipeline's throughput and alerts on dips.

How do I detect when logs are lying?

Cross-validate logs with other signals: metrics (e.g., request latency, error rate) and traces (e.g., span status). If logs say 'all good' but error rate spikes, something is off. Also, implement log sampling and test your logging pipeline with synthetic events to ensure end-to-end delivery.

What's the best practice for logging in asynchronous systems?

Always include a unique correlation ID that is passed across async boundaries (queues, event streams). Log at the start and end of each async operation, and use structured logging so you can query by correlation ID. Consider using OpenTelemetry for automatic trace propagation.
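Within a single Python process, `contextvars` can carry the correlation ID across `await` points and into child tasks without threading it through every function signature. The sketch below is one possible arrangement; the logger name, the `CorrelationFilter` class, and the `handle_request`/`step` functions are hypothetical.

```python
import asyncio
import contextvars
import logging
import uuid

# Holds the current request's ID; "-" marks lines logged outside any request.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True


logger = logging.getLogger("worker")
logger.addFilter(CorrelationFilter())


async def step(name):
    logger.info("step %s", name)


async def handle_request():
    correlation_id.set(uuid.uuid4().hex)
    logger.info("request start")
    # Tasks copy the current context when created, so they see the same ID.
    await asyncio.gather(
        asyncio.create_task(step("a")),
        asyncio.create_task(step("b")),
    )
    logger.info("request end")
```

With a format string that includes `%(correlation_id)s`, all four lines from one `handle_request()` call share an ID. This only covers in-process boundaries: when work crosses a queue or an event stream, the ID still has to be written into the message headers by the producer and set again by the consumer.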
