What this usually means
Production-only bugs happen because production differs from development in ways that matter. Rather than guessing, work through a checklist to systematically eliminate possible gaps. The most common gaps: data (production has values dev data never has), concurrency (production has real concurrent traffic), configuration (production has different env vars or feature flags), dependencies (production uses different versions or external services), and scale (production has more data, more traffic, or more users).
The first ten minutes: establish facts before touching code.
1. Is the bug consistent or intermittent? Consistent bugs are easier to isolate because a specific condition triggers them. Intermittent bugs suggest a race condition or resource contention.
2. Can you reproduce it in staging with production-like data? If it reproduces, the gap is in the data. If not, look at scale, concurrency, configuration, or dependencies.
3. Check the request that triggers the bug. What is different about it? Specific user? Specific input? Specific time?
4. Compare the full environment: OS, runtime version, database version, installed packages, available memory.
5. Enable verbose logging for the affected code path in production temporarily. Capture request payloads and responses.
6. Check recent changes: deployments, config updates, database migrations, third-party service updates.
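The environment comparison in the checklist above can be partly automated. The following is a minimal sketch, assuming a Python service; the function names `environment_snapshot` and `diff_snapshots` are hypothetical, and the snapshot covers only the OS, runtime version, and installed packages (database version and memory would need their own checks).

```python
import platform
import sys
from importlib import metadata


def environment_snapshot() -> dict:
    """Collect facts that commonly differ between environments."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with broken metadata
            packages[name.lower()] = dist.version
    return {
        "os": platform.platform(),
        "python": platform.python_version(),
        "executable": sys.executable,
        "packages": packages,
    }


def diff_snapshots(a: dict, b: dict, prefix: str = "") -> list[str]:
    """Return one line per key whose value differs between two snapshots."""
    lines = []
    for key in sorted(set(a) | set(b)):
        left, right = a.get(key), b.get(key)
        path = f"{prefix}{key}"
        if isinstance(left, dict) and isinstance(right, dict):
            # Recurse into nested sections such as "packages".
            lines.extend(diff_snapshots(left, right, prefix=f"{path}."))
        elif left != right:
            lines.append(f"{path}: {left!r} != {right!r}")
    return lines
```

One way to use it: run `environment_snapshot()` on a production host and a staging host, save both results as JSON, and pass the two loaded dictionaries to `diff_snapshots`. An empty list means the captured facts match.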
The specific files, logs, configs, and dashboards that usually own this bug.
- Production error tracking: full stack trace, request context, user context
- Production logs: verbose logging for the affected code path
- Production database: a snapshot or read replica of the data involved in the bug
- Production environment configuration: all env vars, feature flags, secrets
- Production infrastructure: load balancer, CDN, firewall, network topology
- Recent change history: deployments, config changes, dependency updates
- Production monitoring: CPU, memory, database connections, error rates
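For the environment configuration item in the list above, a small diff of the two environments' variables is often faster than reading them side by side. This is a sketch under stated assumptions: both environments can be exported as `KEY=VALUE` text, the helper names `parse_env` and `diff_env` are hypothetical, and the key pattern used to hide secret values is only a rough heuristic.

```python
import re

# Heuristic, not exhaustive: keys matching this never have values printed.
SECRET_PATTERN = re.compile(r"(SECRET|TOKEN|PASSWORD|KEY)", re.IGNORECASE)


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        result[key.strip()] = value.strip()
    return result


def diff_env(prod: dict[str, str], staging: dict[str, str]) -> list[str]:
    """Describe every variable that is missing or different in one environment."""
    report = []
    for key in sorted(set(prod) | set(staging)):
        if key not in staging:
            report.append(f"{key}: only in production")
        elif key not in prod:
            report.append(f"{key}: only in staging")
        elif prod[key] != staging[key]:
            if SECRET_PATTERN.search(key):
                report.append(f"{key}: differs (values hidden)")
            else:
                report.append(
                    f"{key}: production={prod[key]!r} staging={staging[key]!r}"
                )
    return report
```

Variables that appear in only one environment are worth checking first, since a missing feature flag usually falls back to a default that nobody tested.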
Practical causes, not theory. These are the things you will actually find.
- Production data contains values (null, empty, very long, special characters) that dev data does not
- Concurrent requests cause a race condition that never happens with single-user local testing
- Production environment variable or feature flag differs from staging
- Production runs a different database version, runtime version, or operating system
- Third-party API behaves differently or is rate-limited in production
- Load balancer or CDN modifies requests or responses in production
- Production has less memory or CPU, causing timeouts or OOM behaviour
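The race-condition cause listed above is easy to miss because a single local request never interleaves with another. The sketch below illustrates the pattern with a hypothetical `Inventory` class: a check followed by an update, with a short sleep standing in for the database or network latency that opens the window in production. All names here are illustrative assumptions, not part of any real system.

```python
import threading
import time


class Inventory:
    """Hypothetical stock counter used to illustrate a check-then-act race."""

    def __init__(self, stock: int) -> None:
        self.stock = stock
        self._lock = threading.Lock()

    def reserve_unsafe(self) -> bool:
        # Check and update are separate steps, so several threads can
        # all pass the check before any of them decrements.
        if self.stock > 0:
            time.sleep(0.001)  # stands in for I/O latency between the steps
            self.stock -= 1
            return True
        return False

    def reserve_safe(self) -> bool:
        # Holding the lock makes the check and the update one atomic step.
        with self._lock:
            if self.stock > 0:
                time.sleep(0.001)
                self.stock -= 1
                return True
            return False


def run_concurrently(func, workers: int) -> int:
    """Call func from `workers` threads at once; return how many calls succeeded."""
    barrier = threading.Barrier(workers)
    results = []

    def worker() -> None:
        barrier.wait()  # release all threads together to maximise overlap
        results.append(func())

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sum(results)
```

With one item in stock and several workers, `reserve_safe` lets exactly one reservation through. `reserve_unsafe` will typically let more than one through and drive the stock negative, although the exact count varies from run to run, which is the same reason the production bug looks intermittent.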
Concrete fix directions. Pick the one that matches your root cause.
- Clone the production data (anonymised) to a staging environment and attempt to reproduce the bug
- Add detailed request logging in production: log input, output, and key decision points
- Create a production-like load test to reproduce race conditions
- Use feature flags to enable verbose debugging for a subset of production traffic
- Add a correlation ID to every request so you can trace the full journey through logs
- Set up shadow traffic or a canary deployment to test fixes on a small percentage of production traffic
A fix you cannot prove is a guess. Close the loop.
- The bug is reproducible in a staging environment with production-like data.
- Logs show the exact input and output for the failing request, and the root cause is identified.
- The fix is deployed to a canary or small percentage of traffic first and error rates drop.
- After the full production deploy, error rates remain at zero for the affected endpoint.
- A regression test is added that covers the specific production scenario.
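A regression test for the production scenario can be as simple as a table of the inputs that broke. The sketch below assumes a hypothetical `display_name` function and a made-up 64-character limit. The point is the shape of the test: each null, empty, very long, or special-character value seen in production becomes one row.

```python
import unittest

MAX_LEN = 64  # assumed display limit, for illustration only


def display_name(record: dict) -> str:
    """Hypothetical function under test: tolerate missing and messy names."""
    raw = record.get("name")
    if raw is None:
        return "(unknown)"
    cleaned = " ".join(str(raw).split())  # collapse whitespace and newlines
    if not cleaned:
        return "(unknown)"
    return cleaned[:MAX_LEN]


# Each row is (input captured from the failing scenario, expected result).
PRODUCTION_SHAPED_INPUTS = [
    ({"name": None}, "(unknown)"),
    ({}, "(unknown)"),
    ({"name": ""}, "(unknown)"),
    ({"name": "   "}, "(unknown)"),
    ({"name": "x" * 10_000}, "x" * MAX_LEN),
    ({"name": "Zoë  O'Brien\n"}, "Zoë O'Brien"),
]


class DisplayNameRegressionTest(unittest.TestCase):
    def test_production_shaped_inputs(self):
        for record, expected in PRODUCTION_SHAPED_INPUTS:
            # subTest reports every failing row instead of stopping at the first.
            with self.subTest(record=record):
                self.assertEqual(display_name(record), expected)
```

Run it with `python -m unittest` from the directory containing the file. When the next production-only value turns up, adding a row is cheaper than writing a new test.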
Things that make this bug worse or harder to find.
- Trying to fix the bug without reproducing it first
- Not capturing enough context in production error logs
- Assuming the bug is a code issue when it is a data or configuration issue
- Deploying a fix directly to all production traffic without testing on a subset first
- Not adding a regression test for the specific production scenario