Structured logging in JSON sounds simple: instead of writing a string like 'User 123 logged in', you write a JSON object like {"event": "login", "user_id": 123}. But once you run that in production for a week, you hit a dozen edge cases that your initial schema didn't account for. I've seen teams burn hours on dashboards that silently stopped working because a field name changed or a nested object went missing.
This guide covers the concrete decisions you need to make — field naming conventions, schema evolution, trace context, error representation, and what to do when your log pipeline chokes on malformed JSON. I'll also share a story from my own team where a missing field cost us three hours of debugging a production incident.
The Minimum Viable Log Schema
Before you add any domain-specific fields, every log line should have at least these four fields:
- timestamp: RFC 3339 format (e.g., 2025-03-15T14:30:00.123Z). Use microseconds or nanoseconds if your system needs sub-millisecond resolution.
- level: one of DEBUG, INFO, WARN, ERROR, FATAL. Use strings, not numbers. Strings are human-readable and query engines can sort them.
- message: a human-readable summary of the event. This is the fallback when a developer greps logs without a query.
- service: the name of the service that emitted the log. This is critical when you aggregate logs from multiple microservices.
I also recommend adding a version field (e.g., "log_schema_version": 1) from day one. It costs a few bytes per log line and saves you when you need to migrate to a new schema later.
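A minimal sketch of what emitting such a line could look like, using only the Python standard library. The helper names (make_log_record, emit) and the LOG_SCHEMA_VERSION constant are illustrative assumptions, not part of any logging library:

```python
import json
import sys
from datetime import datetime, timezone

LOG_SCHEMA_VERSION = 1  # hypothetical constant; bump on incompatible schema changes


def make_log_record(level: str, message: str, service: str, **fields) -> dict:
    """Build a record with the four baseline fields plus a schema version."""
    record = {
        # RFC 3339 timestamp in UTC, millisecond precision, "Z" suffix.
        "timestamp": datetime.now(timezone.utc)
        .isoformat(timespec="milliseconds")
        .replace("+00:00", "Z"),
        "level": level,
        "message": message,
        "service": service,
        "log_schema_version": LOG_SCHEMA_VERSION,
    }
    # Domain-specific fields are added without overwriting the baseline ones.
    for key, value in fields.items():
        record.setdefault(key, value)
    return record


def emit(record: dict, stream=sys.stdout) -> None:
    """Write exactly one JSON object per line."""
    stream.write(json.dumps(record) + "\n")


emit(make_log_record("INFO", "user logged in", "user-service", event="login", user_id="u_42"))
```

Using setdefault for the extra fields is a deliberate choice in this sketch: a caller cannot accidentally replace timestamp or log_schema_version with a domain value.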
Field Naming Conventions That Survive Production
The most common argument I see is camelCase vs snake_case. I've worked with both, and snake_case wins because: (1) tools disagree on case sensitivity (Elasticsearch treats field names as case-sensitive, while BigQuery column names are case-insensitive), so all-lowercase snake_case names avoid case mismatches and stay readable in queries; (2) when you export logs to a data warehouse like BigQuery, snake_case is the column-naming convention; (3) it's consistent with the rest of the observability ecosystem (Prometheus metrics use snake_case).
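Whichever convention you pick, it only survives if something checks it. As a rough sketch, a recursive check for snake_case keys could look like this; the function name and the exact regular expression are my own assumptions about what counts as snake_case:

```python
import re

# Lowercase letters and digits, words separated by single underscores.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def non_snake_case_keys(record: dict, prefix: str = "") -> list:
    """Return dotted paths of every key that is not snake_case, nested ones included."""
    bad = []
    for key, value in record.items():
        path = f"{prefix}{key}"
        if not SNAKE_CASE.match(key):
            bad.append(path)
        if isinstance(value, dict):
            bad.extend(non_snake_case_keys(value, prefix=f"{path}."))
    return bad


print(non_snake_case_keys({"user_id": "u_42", "traceId": "abc", "error": {"errorType": "X"}}))
# ['traceId', 'error.errorType']
```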
Avoid dynamic keys. I once saw a system that logged event-specific data under the event name as a key: {"user_login": {"user_id": 123}}. This makes it impossible to write a query that aggregates across all events. Instead, use a flat structure with an event field: {"event": "user_login", "user_id": 123}.
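To make the difference concrete, here is a small sketch that converts the dynamic-key shape into the flat shape and then aggregates across events. The helper flatten_dynamic_event is hypothetical, and it assumes each legacy record has exactly one event-name key:

```python
from collections import Counter


def flatten_dynamic_event(record: dict) -> dict:
    """Turn {"user_login": {"user_id": 123}} into {"event": "user_login", "user_id": 123}."""
    if len(record) != 1:
        raise ValueError("expected exactly one event-name key")
    event_name, payload = next(iter(record.items()))
    if not isinstance(payload, dict):
        raise ValueError("expected the event payload to be an object")
    if "event" in payload:
        raise ValueError("payload already has an 'event' field")
    return {"event": event_name, **payload}


legacy = [
    {"user_login": {"user_id": 123}},
    {"user_logout": {"user_id": 123}},
    {"user_login": {"user_id": 456}},
]
flat = [flatten_dynamic_event(r) for r in legacy]

# With a fixed "event" field, one expression aggregates across all event types.
counts = Counter(r["event"] for r in flat)
print(counts["user_login"])  # 2
```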
Reserved Fields and Their Types
- timestamp: string (RFC 3339)
- level: string (one of DEBUG, INFO, WARN, ERROR, FATAL)
- message: string
- service: string
- trace_id: string (optional, but add if you use distributed tracing)
- span_id: string (optional)
- error: object (optional, see below)
- duration_ms: number (for request timing)
- user_id: string (if applicable)
{
  "timestamp": "2025-03-15T14:30:00.123Z",
  "level": "ERROR",
  "message": "Failed to connect to database",
  "service": "user-service",
  "trace_id": "abc123def456",
  "span_id": "span789",
  "error": {
    "message": "connection refused",
    "type": "ConnectionError",
    "stack": [
      "at Socket._onError (net.js:689:5)",
      "at emitErrorNT (internal/streams/destroy.js:106:8)"
    ]
  },
  "duration_ms": 2047,
  "user_id": "u_42"
}
The War Story: A Missing Field That Broke Our Dashboard
The Case of the Silent Dashboard
- 14:00 Deploy of payment-service v2.3.0 to staging
- 14:15 QA reports that the 'failed payments' dashboard shows zero errors, even though they triggered a failing payment
- 14:20 On-call engineer checks raw logs: errors are there, but the field name is 'error_message' instead of 'error.message'
- 14:30 Team discovers that a library update changed the log field from nested object to flat string
- 14:45 Hotfix deployed to restore the original field structure
- 15:00 Dashboard back to normal, but incident cost 3 hours of engineering time
Lesson
A simple schema inconsistency — flat string vs nested object — made the dashboard silently return zero results. If we had a schema validation step in CI that checked the log format of test runs, we would have caught this before deployment.
We now run a simple script in our CI pipeline that sends a sample log line to a mock aggregator and verifies the JSON structure matches a schema file. It's saved us from at least four similar regressions since then.
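The script itself is specific to our setup, but a stripped-down sketch of the idea, using only the standard library, might look like the following. The function name check_log_line and the sample payment values are assumptions for illustration; the checks mirror the reserved fields listed earlier and the exact regression from the incident:

```python
import json

REQUIRED_STRING_FIELDS = ("timestamp", "level", "message", "service")
ALLOWED_LEVELS = {"DEBUG", "INFO", "WARN", "ERROR", "FATAL"}


def check_log_line(line: str) -> list:
    """Return a list of problems found in one JSON log line (empty means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["log line is not a JSON object"]

    problems = []
    for field in REQUIRED_STRING_FIELDS:
        if not isinstance(record.get(field), str):
            problems.append(f"missing or non-string field: {field}")
    level = record.get("level")
    if isinstance(level, str) and level not in ALLOWED_LEVELS:
        problems.append(f"unexpected level: {level!r}")

    # The regression from the incident: flat error_message instead of error.message.
    if "error_message" in record:
        problems.append("flat 'error_message' found; expected nested error.message")
    if "error" in record:
        error = record["error"]
        if not isinstance(error, dict):
            problems.append("'error' must be an object")
        else:
            for field in ("message", "type"):
                if not isinstance(error.get(field), str):
                    problems.append(f"missing or non-string field: error.{field}")
    return problems


good = (
    '{"timestamp": "2025-03-15T14:30:00.123Z", "level": "ERROR", '
    '"message": "payment failed", "service": "payment-service", '
    '"error": {"message": "card declined", "type": "PaymentError"}}'
)
regressed = (
    '{"timestamp": "2025-03-15T14:30:00.123Z", "level": "ERROR", '
    '"message": "payment failed", "service": "payment-service", '
    '"error_message": "card declined"}'
)
print(check_log_line(good))       # []
print(check_log_line(regressed))  # ["flat 'error_message' found; expected nested error.message"]
```

In CI, the build step would feed the application's sample log output through this function and fail if any line returns a non-empty list.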
Error Representation: Flat vs Nested
This is the most common design debate I see. Some teams flatten error fields: {"error_message": "...", "error_type": "...", "error_stack": "..."}. Others nest them: {"error": {"message": "...", "type": "...", "stack": [...]}}.
I strongly prefer nested. It groups related fields together, which makes queries like error.type:ConnectionError possible without prefixing every field with error_. And when you export logs to a structured store, nested objects can be cast to a STRUCT type, while flat fields require a separate table or view.
If you use nested objects, ensure your log shipper (e.g., Fluentd, Logstash) supports deep nesting. Some shippers flatten all nested objects by default — you'll lose the structure. Check the configuration before you deploy.
Handling Sensitive Data: Redact at the Source
You should never log raw passwords, tokens, credit card numbers, or PII. But accidental logging happens. The safest approach is to redact at the source — in your application code — using a structured logging library that supports field-level redaction.
For example, in Python's structlog, you can define a processor that scrubs fields whose keys appear in a deny-list:

import structlog

# Keys whose values must never reach the log output.
SENSITIVE_KEYS = {"password", "token", "secret", "ssn"}

def redact_sensitive(logger, method_name, event_dict):
    """structlog processor: replace the value of any sensitive top-level key."""
    for key in SENSITIVE_KEYS:
        if key in event_dict:
            event_dict[key] = "***REDACTED***"
    return event_dict

structlog.configure(
    processors=[
        redact_sensitive,  # must run before the renderer
        structlog.processors.JSONRenderer(),
    ]
)

logger = structlog.get_logger()
logger.info("user login", user_id=42, password="supersecret")

Furthermore, add a CI check that scans log output for regex patterns matching common sensitive data (e.g., API keys, email addresses) and fails the build if found.
Schema Validation in CI
The incident I described above could have been prevented with a simple schema check. Here's what we do now:
1. Define a JSON Schema file in the repository (e.g., log_schema.json) that specifies required fields, types, and allowed values for level.
2. In CI, run the application with a special flag that logs a single line to stdout, then pipe that line through a schema validator (e.g., ajv for Node.js, jsonschema for Python).
3. If validation fails, the build fails. This catches missing fields, wrong types, and unexpected nesting.
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "required": ["timestamp", "level", "message", "service"],
  "properties": {
    "timestamp": { "type": "string", "pattern": "^\\d{4}-\\d{2}-\\d{2}T" },
    "level": { "type": "string", "enum": ["DEBUG", "INFO", "WARN", "ERROR", "FATAL"] },
    "message": { "type": "string" },
    "service": { "type": "string" },
    "trace_id": { "type": "string" },
    "error": {
      "type": "object",
      "properties": {
        "message": { "type": "string" },
        "type": { "type": "string" },
        "stack": { "type": "array", "items": { "type": "string" } }
      },
      "required": ["message", "type"]
    }
  }
}
What About Multi-Line Logs and Exceptions?
One JSON object per line is the standard (JSON Lines format). But what about stack traces that span multiple lines? If you inline them as a string with escaped newlines, you lose readability. I recommend storing the stack trace as an array of strings, each representing one line. This keeps each log line as a single JSON object and makes it easy to display the stack trace in a UI with proper formatting.
Example: "stack": ["Error: something broke", " at Object.<anonymous> (file.js:10:5)", ...].
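In Python, a sketch of this conversion could use the standard traceback module. The helper name error_to_dict is an assumption; the field names follow the error object used throughout this guide:

```python
import json
import traceback


def error_to_dict(exc: BaseException) -> dict:
    """Represent an exception as a nested error object with the stack as a list of lines."""
    formatted = traceback.format_exception(type(exc), exc, exc.__traceback__)
    # format_exception returns chunks that may each contain several lines;
    # split them so every array element is exactly one line of the trace.
    stack = [line for chunk in formatted for line in chunk.rstrip("\n").split("\n")]
    return {"message": str(exc), "type": type(exc).__name__, "stack": stack}


try:
    {}["missing"]
except KeyError as exc:
    line = json.dumps({"level": "ERROR", "message": "lookup failed", "error": error_to_dict(exc)})
    print(line)  # one JSON object on a single line, stack trace included
```

Because the trace is split into array elements before serialization, the stored strings contain no newlines, and a UI can render the array one element per row.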
A Note on Timestamp Precision
If your service processes thousands of requests per second, millisecond precision might not be enough. I've seen logs from the same request with identical timestamps because the clock resolution was too coarse. Use microseconds (six fractional digits) or nanoseconds if your runtime supports it. In Go, use time.RFC3339Nano. In Node.js, new Date().toISOString() only gives millisecond precision, which is not enough here; consider a library like 'microtime' or format the timestamp with a custom function.
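For a Python service, a sketch of microsecond-precision formatting with the standard library could look like this. The function name rfc3339_micros is hypothetical, and the actual resolution you get still depends on the operating system clock:

```python
from datetime import datetime, timezone
from typing import Optional


def rfc3339_micros(moment: Optional[datetime] = None) -> str:
    """Format a timestamp as RFC 3339 in UTC with six fractional digits."""
    # Note: a naive datetime passed in is interpreted as local time by astimezone().
    moment = moment or datetime.now(timezone.utc)
    utc = moment.astimezone(timezone.utc)
    return utc.isoformat(timespec="microseconds").replace("+00:00", "Z")


print(rfc3339_micros(datetime(2025, 3, 15, 14, 30, 0, 123456, tzinfo=timezone.utc)))
# 2025-03-15T14:30:00.123456Z
```

Python's datetime stops at microseconds. If you need nanoseconds, time.time_ns() returns an integer nanosecond count that you would have to format yourself.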
Final Recommendations
1. Start with the minimal schema and add fields only when you have a query that needs them.
2. Enforce snake_case, consistent types, and required fields via a schema registry or CI check.
3. Use nested objects for errors and other logical groups, but verify your log shipper doesn't flatten them.
4. Add trace_id and span_id to every log line if you use distributed tracing; it makes debugging journeys across services possible.
5. Redact sensitive data at the source, not in the log pipeline. Assume every log could be leaked.
6. Validate your log format in CI. It takes 10 minutes to set up and saves hours of debugging.
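For the trace_id and span_id recommendation, one possible approach in Python is to keep the IDs in context variables and merge them into every record just before it is serialized. This is only a sketch: the variable and function names are assumptions, and in a real service a tracing library would normally supply the IDs:

```python
import contextvars
import json

# Hypothetical context variables; a tracing library would normally own these values.
trace_id_var = contextvars.ContextVar("trace_id", default=None)
span_id_var = contextvars.ContextVar("span_id", default=None)


def with_trace_context(record: dict) -> dict:
    """Return a copy of the record with trace_id/span_id added when they are set."""
    enriched = dict(record)
    for name, var in (("trace_id", trace_id_var), ("span_id", span_id_var)):
        value = var.get()
        if value is not None:
            enriched.setdefault(name, value)
    return enriched


# At the start of a request, store the IDs taken from the incoming request.
trace_id_var.set("abc123def456")
span_id_var.set("span789")
print(json.dumps(with_trace_context({"level": "INFO", "message": "charge created"})))
# {"level": "INFO", "message": "charge created", "trace_id": "abc123def456", "span_id": "span789"}
```

Context variables are scoped per thread and per asyncio task, so concurrent requests do not see each other's IDs.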
of teams using structured logging report schema drift as their top pain point (source: internal survey, 2024)
Structured logging in JSON is not just about machine readability — it's about building a reliable observability pipeline. The decisions you make today (field names, nesting, types) will either make your future debugging effortless or painful. I've seen both sides, and I strongly recommend investing in a schema upfront. Your future self (and your on-call team) will thank you.
Frequently asked questions
What is the difference between structured logging and unstructured logging?
Unstructured logging is plain-text messages like 'User 123 logged in'. Structured logging outputs key-value pairs or JSON, e.g., {"event": "login", "user_id": 123, "timestamp": "..."}. Structured logs are machine-parseable, queryable, and much easier to aggregate and alert on.
Should I use camelCase or snake_case for JSON log field names?
snake_case. It is the prevailing convention across the observability ecosystem, and all-lowercase names avoid case mismatches in tools that treat field names as case-sensitive (such as Elasticsearch). More importantly, your query language (e.g., Lucene, LogQL) will be cleaner when field names are consistent across all services.
How do I handle errors and stack traces in structured logs?
Include a dedicated 'error' object with fields: message, type, stack (array of strings), and code if applicable. Avoid inlining the stack trace into the main message. Example: {"error": {"message": "connection refused", "type": "ConnectionError", "stack": ["at Socket._onError (net.js:...)"]}}.
What is the most common mistake teams make when adopting structured logging?
Not enforcing a schema. Teams start with ad-hoc fields, then after six months they have 50 different field names for the same concept (e.g., 'user_id', 'userId', 'uid', 'customerId'). This makes dashboards unreliable and queries painful. Use a schema registry or a shared logger configuration across services.