LEARN · DEBUGGING GUIDE

Production-only bug: a systematic debugging checklist

A bug exists only in production. No amount of local testing reproduces it. You need a systematic approach to find the gap between your machine and the production environment.

Intermediate · Works locally, fails in production

What this usually means

Production-only bugs happen because production differs from development in ways that matter. Rather than guessing, work through a checklist to systematically eliminate possible gaps. The most common gaps: data (production has values dev data never has), concurrency (production has real concurrent traffic), configuration (production has different env vars or feature flags), dependencies (production uses different versions or external services), and scale (production has more data, more traffic, or more users).

( 01 ) Fast diagnosis

The first ten minutes: establish facts before touching code.

  1. Is the bug consistent or intermittent? Consistent bugs are easier to trace because a specific condition triggers them. Intermittent bugs suggest a race condition or resource contention.
  2. Can you reproduce it in staging with production-like data? If you can, the gap is in the data or scale. If you cannot, look at configuration, infrastructure, and concurrency instead.
  3. Check the request that triggers the bug. What is different about it? Specific user? Specific input? Specific time?
  4. Compare the full environment between your machine and production: OS, runtime version, database version, installed packages, available memory.
  5. Temporarily enable verbose logging for the affected code path in production. Capture request payloads and responses.
  6. Check recent changes: deployments, config updates, database migrations, third-party service updates.
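
The environment comparison in the checklist above is easier to do mechanically than by eye. The following is a minimal sketch, assuming the service runs on Python and that you can run the same script on both machines. The function names `environment_snapshot` and `diff_snapshots` are made up for this illustration, and the snapshot covers only the OS, runtime, and installed packages, not the database version or available memory.

```python
import platform
import sys
from importlib import metadata


def environment_snapshot() -> dict:
    """Collect facts about this environment that commonly differ between machines."""
    packages = {}
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name") if meta is not None else None
        if name:  # skip broken installs that report no name
            packages[name.lower()] = meta.get("Version")
    return {
        "os": platform.platform(),
        "python": platform.python_version(),
        "executable": sys.executable,
        "packages": packages,
    }


def diff_snapshots(local: dict, remote: dict) -> dict:
    """Return only the keys whose values differ, as (local, remote) pairs."""
    differences = {}
    for key in sorted(set(local) | set(remote)):
        a, b = local.get(key), remote.get(key)
        if isinstance(a, dict) and isinstance(b, dict):
            nested = diff_snapshots(a, b)  # recurse into the package map
            if nested:
                differences[key] = nested
        elif a != b:
            differences[key] = (a, b)
    return differences
```

One way to use it: dump `environment_snapshot()` to JSON on each side, load both files on your machine, and pass them to `diff_snapshots`. An empty result means these particular facts match, so the gap is somewhere else on the checklist.
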
( 02 ) Where to look

The specific files, logs, configs, and dashboards that usually own this bug.

  • Production error tracking: full stack trace, request context, user context
  • Production logs: verbose logging for the affected code path
  • Production database: a snapshot or read replica of the data involved in the bug
  • Production environment configuration: all env vars, feature flags, secrets
  • Production infrastructure: load balancer, CDN, firewall, network topology
  • Recent change history: deployments, config changes, dependency updates
  • Production monitoring: CPU, memory, database connections, error rates
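
Environment configuration is one of the places above where a diff finds the gap faster than reading. The sketch below compares two sets of variables and reports what is missing or different. It assumes you have already exported both environments into plain dictionaries. `diff_config` and `SECRET_MARKERS` are names invented for this example, and the marker list is a guess you would adapt to your own naming scheme.

```python
import hashlib

# Assumption: variable names containing these fragments hold secrets
# and must never be printed in clear text.
SECRET_MARKERS = ("SECRET", "TOKEN", "PASSWORD", "KEY")


def _display(name: str, value: str) -> str:
    """Show plain values as-is; show secrets as a short hash so they stay comparable."""
    if any(marker in name.upper() for marker in SECRET_MARKERS):
        return "sha256:" + hashlib.sha256(value.encode("utf-8")).hexdigest()[:8]
    return value


def diff_config(staging: dict, production: dict) -> list:
    """List every variable that is missing on one side or has a different value."""
    report = []
    for name in sorted(set(staging) | set(production)):
        if name not in production:
            report.append(f"{name}: only in staging")
        elif name not in staging:
            report.append(f"{name}: only in production")
        elif staging[name] != production[name]:
            report.append(
                f"{name}: staging={_display(name, staging[name])} "
                f"production={_display(name, production[name])}"
            )
    return report
```

Hashing the secret values rather than hiding them entirely is a deliberate choice here: you can still see that two tokens differ without either one appearing in a terminal or a ticket.
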
( 03 ) Common root causes

Practical causes, not theory. These are the things you will actually find.

  • Production data contains values (null, empty, very long, special characters) that dev data does not
  • Concurrent requests cause a race condition that never happens with single-user local testing
  • A production environment variable or feature flag differs from staging
  • Production runs a different database version, runtime version, or operating system
  • A third-party API behaves differently or is rate-limited in production
  • A load balancer or CDN modifies requests or responses in production
  • Production has less memory or CPU, causing timeouts or out-of-memory (OOM) failures
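
The first cause in the list, data that dev never sees, can be checked directly against a production snapshot. Here is a rough sketch that flags the kinds of values named above: null, empty, very long, and special characters. The function names, the 255-character default, and the sample record are all assumptions made for illustration.

```python
import unicodedata


def audit_value(value: object, max_length: int = 255) -> list:
    """Return the reasons a value might behave differently from tidy dev data."""
    if value is None:
        return ["null"]
    if not isinstance(value, str):
        return []  # this sketch only inspects strings
    findings = []
    if value == "":
        findings.append("empty")
    elif value.strip() == "":
        findings.append("whitespace only")
    if len(value) > max_length:
        findings.append(f"longer than {max_length} characters")
    # Unicode categories starting with "C" cover control and format characters.
    if any(unicodedata.category(ch).startswith("C") for ch in value):
        findings.append("control characters")
    if not value.isascii():
        findings.append("non-ASCII characters")
    return findings


def audit_record(record: dict, max_length: int = 255) -> dict:
    """Map each suspicious field name to the list of reasons it was flagged."""
    flagged = {}
    for field, value in record.items():
        findings = audit_value(value, max_length)
        if findings:
            flagged[field] = findings
    return flagged
```

Running `audit_record` over the rows involved in a failing request gives a short list of fields worth copying into a test fixture. A flagged value is not necessarily wrong; it is only a value your dev data probably never exercised.
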
( 04 ) Fix patterns

Concrete fix directions. Pick the one that matches your root cause.

  • Clone the production data (anonymised) to a staging environment and attempt to reproduce the bug
  • Add detailed request logging in production: log input, output, and key decision points
  • Create a production-like load test to reproduce race conditions
  • Use feature flags to enable verbose debugging for a subset of production traffic
  • Add a correlation ID to every request so you can trace the full journey through logs
  • Set up shadow traffic or a canary deployment to test fixes on a small percentage of production traffic
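
Two of the patterns above, correlation IDs and verbose debugging for a subset of traffic, combine naturally. The sketch below uses only the Python standard library (`logging` and `contextvars`). The logger name `app.checkout`, the handler function, and the percentage-based sampling rule are hypothetical stand-ins for whatever framework and feature-flag system you actually use.

```python
import contextvars
import hashlib
import logging
import uuid
from typing import Optional

# Holds the ID of the request being handled; "-" means "outside a request".
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationIdFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


def verbose_enabled(request_id: str, percent: int) -> bool:
    """Deterministically select roughly `percent`% of requests for verbose logging."""
    bucket = int(hashlib.sha256(request_id.encode("utf-8")).hexdigest(), 16) % 100
    return bucket < percent


def build_logger(stream=None) -> logging.Logger:
    """Create the logger once at startup; calling this twice would duplicate output."""
    logger = logging.getLogger("app.checkout")  # hypothetical logger name
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s [%(correlation_id)s] %(message)s")
    )
    handler.addFilter(CorrelationIdFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    return logger


def handle_request(logger: logging.Logger, payload: dict,
                   incoming_id: Optional[str] = None,
                   verbose_percent: int = 5) -> dict:
    """Hypothetical handler: tag every log line from this request with one ID."""
    request_id = incoming_id or uuid.uuid4().hex
    token = correlation_id.set(request_id)
    try:
        verbose = verbose_enabled(request_id, verbose_percent)
        logger.info("request received")
        if verbose:
            logger.debug("input=%r", payload)
        result = {"ok": True}  # placeholder for the real work
        if verbose:
            logger.debug("output=%r", result)
        return result
    finally:
        correlation_id.reset(token)  # do not leak the ID into the next request
```

Because the sampling hashes the request ID instead of calling a random number generator, the same request is either fully verbose or fully quiet across every service that shares the rule, which keeps a traced journey complete.
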
( 05 ) How to verify

A fix you cannot prove is a guess. Close the loop.

  • The bug is reproducible in a staging environment with production-like data.
  • Logs show the exact input and output for the failing request, and the root cause is identified.
  • The fix is deployed to a canary or a small percentage of traffic first, and error rates drop.
  • After the full production deploy, error rates for the affected endpoint remain at zero.
  • A regression test is added that covers the specific production scenario.
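
The last item, a regression test for the specific production scenario, usually means pinning the exact inputs that failed. The following sketch shows the shape of such a test with `unittest`. The function `display_name` and every input value are invented for illustration; in practice you would paste in the anonymised values taken from your own logs.

```python
import unittest
from typing import Optional


def display_name(first: Optional[str], last: Optional[str]) -> str:
    """Hypothetical function under test: join name parts, tolerating missing ones."""
    parts = [p.strip() for p in (first, last) if p and p.strip()]
    return " ".join(parts) if parts else "Unknown user"


class DisplayNameRegressionTest(unittest.TestCase):
    """Pins inputs shaped like the ones that only production data contains."""

    # Illustrative values only: null, empty, padded, non-ASCII, and very long.
    PRODUCTION_CASES = [
        ((None, "Okafor"), "Okafor"),
        (("", ""), "Unknown user"),
        (("  Ana ", "Lima"), "Ana Lima"),
        (("Zo\u00eb", None), "Zo\u00eb"),
        (("A" * 500, "B"), "A" * 500 + " B"),
    ]

    def test_production_inputs(self):
        for args, expected in self.PRODUCTION_CASES:
            # subTest reports each failing input separately instead of stopping at the first.
            with self.subTest(args=args):
                self.assertEqual(display_name(*args), expected)
```

Keeping the cases in one table makes it cheap to append the next production surprise as a single line, which is what stops the same class of bug from returning.
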
( 06 ) Mistakes to avoid

Things that make this bug worse or harder to find.

  • Trying to fix the bug without reproducing it first
  • Not capturing enough context in production error logs
  • Assuming the bug is a code issue when it is a data or configuration issue
  • Deploying a fix directly to all production traffic without testing on a subset first
  • Not adding a regression test for the specific production scenario