LEARN · DEBUGGING GUIDE

API returns 500 only in production: how to debug it

The endpoint works fine in local dev. Hit it in production — 500 Internal Server Error. No useful error message. The logs show a stack trace you have never seen before.

Intermediate · Works locally, fails in production

What this usually means

A 500 usually means the server hit an unhandled exception. If it only happens in production, something in the production environment triggers a code path your local environment never reaches. Common triggers are production-only data (a user record with a null field your local data never has), production-only configuration (a feature flag that changes behaviour), a production-only integration (a third-party API that behaves differently), or a production-only load pattern (a race condition that only appears under concurrent requests).
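
As a minimal illustration of the production-only data case, the sketch below uses a hypothetical `build_greeting` handler and made-up user records; none of these names come from a real codebase. The seed-style record works, while a record with a null `display_name` raises the kind of exception that a web framework would typically turn into a 500.

```python
def build_greeting(user: dict) -> str:
    # Assumes display_name is always a string; local seed data satisfies this.
    return "Hello, " + user["display_name"].strip()


seed_user = {"id": 1, "display_name": " Ada "}  # typical local seed record
prod_user = {"id": 2, "display_name": None}     # hypothetical production record

print(build_greeting(seed_user))

try:
    build_greeting(prod_user)
except AttributeError as exc:
    # Left uncaught inside a request handler, this would surface as a 500.
    print("fails on production-shaped data:", exc)
```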

( 01 ) Fast diagnosis

The first ten minutes: establish facts before touching code.

  1. Get the full stack trace from production logs. Do not guess: the stack trace tells you exactly which line threw.
  2. Check what is different about the failing request. Is it a specific user, a specific input value, a specific time of day?
  3. Add error tracking (Sentry, Datadog, or a simple try/catch with full context logging) if production errors are not logged with enough detail.
  4. Check if the error correlates with a recent deployment. Was it working before and broke after the last release?
  5. Look for null/undefined access in the code path. Production data often has missing or unexpected values that local seed data does not.
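
The "simple try/catch with full context logging" option from the steps above could look roughly like this. It is a framework-agnostic sketch: the `request` dictionary shape, the `api.errors` logger name, and the helper names `error_context` and `call_with_logging` are all assumptions made for illustration.

```python
import json
import logging
import traceback
from datetime import datetime, timezone

logger = logging.getLogger("api.errors")  # hypothetical logger name


def error_context(exc: BaseException, request: dict) -> dict:
    """Build a JSON-serialisable record describing a failed request."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "path": request.get("path"),
        "user_id": request.get("user_id"),
        "payload": request.get("payload"),
        "error_type": type(exc).__name__,
        "error_message": str(exc),
        "stack_trace": traceback.format_exception(type(exc), exc, exc.__traceback__),
    }


def call_with_logging(handler, request: dict):
    """Run a handler; on failure, log the full context and re-raise."""
    try:
        return handler(request)
    except Exception as exc:
        # default=str keeps logging from failing on non-JSON values.
        logger.error(json.dumps(error_context(exc, request), default=str))
        raise
```

Re-raising keeps the failure visible to whatever turns it into a 500, so the wrapper adds information without hiding the bug. Payloads can contain sensitive fields, so a real version would likely redact them before logging.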

( 02 ) Where to look

The specific files, logs, configs, and dashboards that usually own this bug.

  • Production error logs: full stack trace with line numbers
  • Error tracking tool (Sentry, Datadog, Bugsnag): request context, user data, breadcrumbs
  • The specific request payload that triggered the 500
  • Recent deployment diff: what changed?
  • Database: the specific record being accessed during the error
  • External API integrations: are they reachable from production?
  • Server resource usage: memory, CPU, disk at the time of the error
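
For the reachability question in the list above, one quick check is a plain TCP connection attempt run from a production host. The sketch below uses only the standard library; `can_connect` is a hypothetical helper name, and the host and port you pass in are whatever your integration actually uses.

```python
import socket


def can_connect(host: str, port: int, timeout_seconds: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

This only shows that the network path is open. It says nothing about TLS, credentials, or whether the API returns the response shape your code expects, so treat a `True` result as a first filter rather than proof the integration works.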

( 03 ) Common root causes

Practical causes, not theory. These are the things you will actually find.

  • Null or undefined value in a production record for a field that local test data always populates
  • Production database has a different schema, constraints, or data types
  • Third-party API returns an unexpected response format in production
  • Production environment has stricter security settings that block a request
  • Race condition that only appears under concurrent production traffic
  • Memory limit reached in production but not locally
  • A dependency behaves differently in production mode (minification, tree-shaking, env-specific code)
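
The race-condition cause is the hardest of these to picture, so here is a small sketch of it. The `Inventory` class and `run_concurrently` helper are hypothetical; they stand in for any shared state that concurrent request handlers check and then modify.

```python
import threading


class Inventory:
    """Hypothetical shared state touched by concurrent request handlers."""

    def __init__(self, stock: int) -> None:
        self.stock = stock
        self._lock = threading.Lock()

    def reserve_unsafe(self) -> bool:
        # Check-then-act without a lock: two threads can both pass the check
        # before either decrements, so stock can go negative under load.
        if self.stock > 0:
            self.stock -= 1
            return True
        return False

    def reserve(self) -> bool:
        # Holding the lock makes the check and the decrement one atomic step.
        with self._lock:
            if self.stock > 0:
                self.stock -= 1
                return True
            return False


def run_concurrently(func, workers: int) -> list:
    """Call func from several threads and collect the results."""
    results = []
    threads = [
        threading.Thread(target=lambda: results.append(func()))
        for _ in range(workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

With a single local request at a time, `reserve_unsafe` never misbehaves, which is why this class of bug tends to show up only in production. An in-process lock like this one only helps within a single process; if the service runs as several processes or hosts, the same guarantee usually has to come from the database, for example through an atomic update or a row lock.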

( 04 ) Fix patterns

Concrete fix directions. Pick the one that matches your root cause.

  • Add structured error logging that captures the full request context (input, user ID, timestamp) with every 500
  • Add input validation that fails fast with a 400 instead of letting bad data cause a 500 downstream
  • Use an error tracking service to aggregate production errors and see patterns across requests
  • Create a staging environment with production-like data to reproduce the error before deploying fixes
  • Add a global error handler that catches unhandled exceptions and returns a safe error response with a correlation ID
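
Two of the patterns above, fail-fast validation and a global error handler with a correlation ID, can be sketched together. This is framework-agnostic and every name in it (`ValidationError`, `validate_order`, `create_order`, `handle`, the payload fields) is hypothetical; a real service would plug the same idea into its framework's own error-handling hook.

```python
import logging
import uuid

logger = logging.getLogger("api")  # hypothetical logger name


class ValidationError(Exception):
    """Raised when the request payload is malformed (a client error)."""


def validate_order(payload: dict) -> None:
    # Fail fast on bad input instead of letting it crash deeper in the stack.
    quantity = payload.get("quantity")
    if isinstance(quantity, bool) or not isinstance(quantity, int) or quantity <= 0:
        raise ValidationError("quantity must be a positive integer")


def create_order(payload: dict) -> dict:
    validate_order(payload)
    return {"total_cents": payload["quantity"] * payload["unit_price_cents"]}


def handle(handler, payload: dict) -> tuple:
    """Wrap a handler so every outcome maps to a (status, body) pair."""
    try:
        return 200, handler(payload)
    except ValidationError as exc:
        return 400, {"error": str(exc)}
    except Exception:
        correlation_id = str(uuid.uuid4())
        # logger.exception records the stack trace alongside the ID.
        logger.exception("unhandled error correlation_id=%s", correlation_id)
        return 500, {"error": "internal error", "correlation_id": correlation_id}
```

The client sees only a generic message and the correlation ID, while the log entry with the same ID holds the stack trace. That keeps internal details out of responses and still lets you find the exact failure when a user reports it.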

( 05 ) How to verify

A fix you cannot prove is a guess. Close the loop.

  • Reproduce the error in staging with production-like data and the same request payload.
  • Deploy the fix and monitor the error rate; it should drop to zero for that specific error.
  • Add a regression test that covers the edge case (null value, missing field, unexpected response).
  • Check that the error tracking tool shows the fix resolved the issue across all instances.
  • Run a load test against the endpoint to ensure the fix holds under production traffic levels.
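
A regression test for the null and missing-field edge cases might look like the sketch below. `display_name_or_default` is a hypothetical fixed function standing in for whatever code path actually threw; the point is that each production-shaped input gets its own named test.

```python
import unittest


def display_name_or_default(user: dict) -> str:
    # Hypothetical fixed function: tolerate a missing or null name.
    name = user.get("display_name") or ""
    return name.strip() or "Anonymous"


class DisplayNameRegressionTest(unittest.TestCase):
    def test_null_display_name(self):
        self.assertEqual(display_name_or_default({"display_name": None}), "Anonymous")

    def test_missing_display_name(self):
        self.assertEqual(display_name_or_default({}), "Anonymous")

    def test_normal_display_name(self):
        self.assertEqual(display_name_or_default({"display_name": " Ada "}), "Ada")
```

Saved as a test module, this runs with `python -m unittest`. Ideally the null and missing-field tests are written first and seen to fail against the old code, which confirms they actually reproduce the production error.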

( 06 ) Mistakes to avoid

Things that make this bug worse or harder to find.

  • Deploying a fix without understanding the root cause because 'it works locally'
  • Catching all errors with a blanket try/catch and returning 200, which hides bugs
  • Not logging enough context with production errors: a stack trace without request data is hard to debug
  • Assuming production data looks like local seed data
  • Not setting up error alerting: you should know about 500s before users report them
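
On the alerting point, monitoring and error tracking tools normally provide alert rules, so you would rarely write this yourself. Purely to illustrate the idea, the sketch below flags when the number of 5xx responses in a sliding time window reaches a threshold; the `ErrorRateAlert` class and its parameters are hypothetical.

```python
from collections import deque


class ErrorRateAlert:
    """Flag when too many 5xx responses occur within a sliding time window."""

    def __init__(self, threshold: int, window_seconds: float) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, status: int, now: float) -> bool:
        """Record one response; return True if the alert should fire."""
        if status >= 500:
            self._timestamps.append(now)
        # Drop errors that have aged out of the window.
        while self._timestamps and now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold
```

Passing `now` in explicitly (for example from `time.time()`) keeps the logic easy to test with fixed timestamps.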