Every production engineer has a log horror story. Mine happened at 3 AM on a Tuesday. A payment service started emitting 100 GB of logs per hour. The root cause? A `null` pointer in a customer name field that the logging framework tried to interpolate. The string concatenation in the log statement threw an exception, which was caught by a generic handler that logged the exception... which also tried to interpolate the same null. Infinite loop. Logs backed up, disk filled, service crashed.
That's when I learned that log statements are code. They can fail, they can rot, and they can bring down your system if you don't treat them defensively. Defensive logging is the practice of writing log output that is robust against the messiness of real production data: nulls, encoding issues, PII, deeply nested objects, and even logging infrastructure failures.
The Three Principles of Defensive Logging
1. Log output is untrusted input: sanitize and validate before emitting.
2. Every log line must be parseable and self-describing: use structured schemas.
3. Logging should never throw: handle all edge cases silently.
Principle 1: Sanitize Log Arguments Like User Input
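The most common source of log-induced crashes is string interpolation with null or non-serializable objects. In Java, `String.format("user %s logged in", user.getName())` throws a `NullPointerException` if `user` is null (a null returned by `getName()` is merely rendered as the string "null"). In Python, `f"user {user.name} logged in"` raises `AttributeError` if `user` is None. The fix: use logging frameworks that accept structured arguments and delay formatting until the log level is confirmed to be active. Lazy formatting alone is not enough, though: the arguments are still evaluated at the call site, so an expression like `user.name` needs its own null guard.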
import logging
import structlog

# BAD: the f-string is built even if DEBUG is disabled
logging.debug(f"User {user.name} action {action}")  # raises AttributeError if user is None

# GOOD: let the logger format lazily (only if DEBUG is enabled).
# user.name would still be evaluated at the call site, so guard the attribute access.
logging.debug("User %s action %s", getattr(user, "name", None), action)

# BETTER: use structured logging and pass the object itself
logger = structlog.get_logger()
logger.debug("user_action", user=user, action=action)  # user=None is logged as a value, never dereferenced

Even with lazy formatting, an argument whose `__str__` or `toString` method is broken still fails once the level is active. Depending on the framework, the exception either propagates to the caller or is swallowed together with the log line (the handlers in Python's standard `logging` module do the latter and print a traceback to stderr). Always wrap custom toString/__repr__ in try-except or use a serializer that guards against exceptions.
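One way to guard against a broken `__str__` is a small wrapper that performs the conversion inside a try-except. The sketch below uses only the standard library; `SafeRepr` and `Broken` are hypothetical names introduced for illustration, not part of any logging framework.

```python
import logging


class SafeRepr:
    """Wrap any value so that formatting it for a log line cannot raise."""

    def __init__(self, value):
        self._value = value

    def __str__(self):
        try:
            return str(self._value)
        except Exception as exc:  # a broken __str__/__repr__ must not escape
            return f"<unprintable {type(self._value).__name__}: {type(exc).__name__}>"

    __repr__ = __str__


class Broken:
    """Stand-in for a domain object whose __str__ is buggy."""

    def __str__(self):
        raise ValueError("boom")


logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("payments")

# Formatting is still lazy, but a bad argument can no longer make it fail.
log.debug("customer=%s", SafeRepr(Broken()))
log.debug("customer=%s", SafeRepr(None))
```

The wrapper costs one small allocation per argument and keeps the call site readable. A serializer hook in a structured logging library can apply the same guard centrally instead of at each call.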
PII Redaction at the Library Level
Post-processing PII redaction in your log aggregator is too late. By the time logs hit Elasticsearch or Datadog, the damage is done—PII may have been indexed and cached. You need to redact at the source, in the logging library itself. Most structured logging frameworks support custom serializers or filters. Wire them up to scan known PII patterns (emails, credit cards, SSNs) and replace them with hashes or masked strings.
const pino = require('pino');

const piiPatterns = [
  /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/g, // email
  /\b\d{3}-\d{2}-\d{4}\b/g, // SSN
  /\b(?:\d[ -]*?){13,16}\b/g // credit card
];

// Recursively mask PII in strings and nested objects.
// Note: this mutates the object it is given and does not guard against
// circular references, so pass it a fresh, shallow structure.
function redactPII(obj) {
  if (typeof obj === 'string') {
    return piiPatterns.reduce((s, re) => s.replace(re, '***'), obj);
  }
  if (obj && typeof obj === 'object') {
    for (const key of Object.keys(obj)) {
      obj[key] = redactPII(obj[key]);
    }
  }
  return obj;
}

// Use with pino logger
const logger = pino({
  serializers: {
    req: (req) => redactPII({ method: req.method, url: req.url }),
    err: pino.stdSerializers.err
  }
});

Context Propagation: The Glue for Debugging
Defensive logging isn't just about avoiding crashes; it's about making logs useful. The single most impactful practice is attaching request context (trace ID, user ID, session ID) to every log line. Without it, you're left grepping for correlation tokens in free-text messages. Use your language's context-local storage (e.g., `contextvars` in Python, which also works across async tasks where `threading.local` does not, or `AsyncLocalStorage` in Node.js) to propagate context automatically.
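In Python, this can be sketched with the standard-library `contextvars` module and a `logging.Filter` that stamps each record. The names `trace_id_var`, `TraceIdFilter`, and `handle_request` are assumptions made for this example, as is the log format.

```python
import contextvars
import logging

# Hypothetical context variable; the name and default are illustrative.
trace_id_var = contextvars.ContextVar("trace_id", default="unknown")


class TraceIdFilter(logging.Filter):
    """Copy the current trace ID onto every record passing through the handler."""

    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True  # never drop the record


handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
)
handler.addFilter(TraceIdFilter())

log = logging.getLogger("catalog")
log.addHandler(handler)
log.setLevel(logging.INFO)


def handle_request(trace_id):
    token = trace_id_var.set(trace_id)  # set once at the request boundary
    try:
        log.info("item fetched")  # call sites never pass the trace ID
    finally:
        trace_id_var.reset(token)  # do not leak context into the next request


handle_request("abc123")
```

Attaching the filter to the handler rather than to a single logger means records from any logger that reaches this handler get the field, so the formatter never hits a missing `trace_id`.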
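In Go, where request context is passed explicitly through `context.Context`, the same idea looks like this:

package main

import (
	"context"
	"log/slog"
)

// contextKey is a private type so keys cannot collide with other packages.
type contextKey string

const traceIDKey contextKey = "trace_id"

// WithTraceID returns a copy of ctx that carries the trace ID.
func WithTraceID(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, traceIDKey, id)
}

// GetTraceID returns the trace ID stored in ctx, or "unknown" if none is set.
func GetTraceID(ctx context.Context) string {
	if id, ok := ctx.Value(traceIDKey).(string); ok {
		return id
	}
	return "unknown"
}

// LogMiddleware returns a logger that stamps every line with the request's trace ID.
func LogMiddleware(ctx context.Context, logger *slog.Logger) *slog.Logger {
	return logger.With("trace_id", GetTraceID(ctx))
}

Log Health Monitoring: Your Canary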
Logs themselves need monitoring. Set up alerts for: missing required fields (e.g., no trace_id in 1% of logs), excessive error rate (more than 5% of total logs are ERROR level), log format drift (e.g., JSON parse failures in your log shipper), and log volume spikes (deviation from rolling average by 3 sigma). These are symptoms of defensive logging failures—and they often precede system outages.
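The thresholds above can be sketched as a small check that runs over one window of log statistics. This is a minimal illustration, assuming the counts are already aggregated by your pipeline; the dictionary keys, the function names, and the `no_logs` alert for an empty window are assumptions added for the example.

```python
from statistics import mean, stdev


def volume_spike(history, current, sigmas=3.0):
    """True if `current` deviates from the rolling mean by more than `sigmas` standard deviations."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu, sd = mean(history), stdev(history)
    if sd == 0:
        return current != mu
    return abs(current - mu) > sigmas * sd


def health_alerts(window, history):
    """Evaluate one window of log counts against the thresholds described above.

    `window` uses hypothetical keys: total, errors, missing_trace_id, parse_failures.
    """
    total = window["total"]
    if total == 0:
        return ["no_logs"]  # assumption: total silence is itself alert-worthy
    alerts = []
    if window["missing_trace_id"] / total >= 0.01:
        alerts.append("missing_trace_id")
    if window["errors"] / total > 0.05:
        alerts.append("error_rate")
    if window["parse_failures"] > 0:
        alerts.append("format_drift")
    if volume_spike(history, total):
        alerts.append("volume_spike")
    return alerts


history = [1000, 1040, 980, 1010, 990]  # lines per minute in previous windows
window = {"total": 5000, "errors": 40, "missing_trace_id": 75, "parse_failures": 0}
print(health_alerts(window, history))  # ['missing_trace_id', 'volume_spike']
```

In practice these checks usually live in the monitoring system's own query language; the point of the sketch is that each alert is a simple ratio or deviation test over counts the log shipper already has.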
The Silent Log Rot Incident
- 14:32 Deploy of catalog service v2.1.3 with new structured logging format
- 14:45 Log shipper starts reporting 5% parse errors; team dismisses as 'upgrade noise'
- 15:10 Parse errors reach 40%; logs for catalog service become unreadable
- 15:22 On-call investigates: new logging code uses a field named 'event' which is a reserved keyword in the log shipper's schema
- 15:30 Rollback to v2.1.2; parse errors drop to 0%
- 15:45 Post-mortem: no log health alerts were configured for format drift. The log rot had been silent for 40 minutes.
Lesson
Always validate new log schemas against your log pipeline's schema registry. Set up alerts for parse error rate > 0%.
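A pre-deploy check of this kind can be very small. The sketch below assumes the pipeline's constraints can be expressed as a set of required field names and a set of reserved ones; both sets and the function name are hypothetical, and a real setup would load them from the schema registry rather than hard-code them.

```python
# Assumed constraints; in practice these would come from the pipeline's schema registry.
REQUIRED_FIELDS = {"timestamp", "level", "message", "trace_id"}
RESERVED_FIELDS = {"event"}  # names the log shipper reserves for itself


def validate_record(record):
    """Return a list of schema problems for one structured log record (a dict)."""
    keys = set(record)
    problems = []
    for name in sorted(REQUIRED_FIELDS - keys):
        problems.append(f"missing required field: {name}")
    for name in sorted(RESERVED_FIELDS & keys):
        problems.append(f"reserved field name: {name}")
    return problems


# A sample line in the style of the new format from the incident above.
sample = {
    "timestamp": "2024-01-01T14:32:00Z",
    "level": "info",
    "message": "catalog updated",
    "event": "item_updated",
}
for problem in validate_record(sample):
    print(problem)
```

Running a check like this in CI against sample output from the new logging code would flag the reserved `event` field before the deploy rather than after the parse errors start.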
Property-Based Testing for Logs
Logging code is rarely tested. That's a mistake. Use property-based testing to generate random inputs and verify that your logging never throws, never produces invalid JSON, never contains raw PII, and always includes required fields. Tools like Hypothesis (Python) or fast-check (JavaScript) let you specify invariants and let the framework find counterexamples.
from hypothesis import given, strategies as st
from structlog.testing import capture_logs
import structlog

@given(
    name=st.text(),
    email=st.emails(),
    user_id=st.integers(min_value=0, max_value=2**31 - 1),
)
def test_log_output_is_safe(name, email, user_id):
    logger = structlog.get_logger()
    # capture_logs() yields a list of event dicts.
    # Any exception raised by the log call fails the test.
    with capture_logs() as cap:
        logger.info("user_event", name=name, email=email, user_id=user_id)
    record = cap[0]
    # capture_logs() disables the configured processors, so apply redaction explicitly.
    # redact_pii is a placeholder for your own redaction processor.
    redacted = redact_pii(None, "info", dict(record))
    assert email not in str(redacted)
    # Assert required fields present
    assert 'event' in record
    assert 'user_id' in record

Add property-based tests to your CI pipeline. They're especially effective at catching null, empty string, and unicode edge cases that manual tests miss.
Defensive logging is not about writing more logs—it's about writing logs that survive production. Sanitize, structure, propagate context, monitor health, and test. Your future on-call self will thank you.
Frequently asked questions
What is defensive logging?
Defensive logging is the practice of writing log statements that are robust against malformed data, encoding issues, PII leaks, and infrastructure failures. It treats log output as a production data path that must be validated and sanitized.
How do I prevent PII from leaking into logs?
Implement redaction at the logging framework level using a whitelist of safe fields. For JSON logs, use a custom serializer that strips or masks fields matching regex patterns for emails, SSNs, credit cards, etc. Test with a set of known PII samples.
Should I log in production at DEBUG level?
Only if you have dynamic log-level control per service and can automatically sample high-volume loggers. Otherwise, DEBUG logs can cause data rot and cost spikes. Prefer structured INFO logs with enough context to reconstruct the flow.
How do I test my logging code?
Use property-based testing (e.g., QuickCheck or Hypothesis) to generate random inputs and verify that log output always conforms to your schema, never contains raw PII, and never throws exceptions. Also write integration tests that capture log output and assert expected fields.