Observability · 14 min read

Debugging Production Issues Without a Debugger: Approaches That Work

Attaching an interactive debugger in production is usually impossible. Here’s how I gather signal, reproduce issues, and restore service using other techniques.

Tags: production-debugging, observability, incident-response, logs, tracing

If you support a service that matters, the first production failure will teach you how little a debugger helps. Interactive stepping and variable inspection are luxuries of development and staging; you rarely get them in prod at scale, under load, or under privacy constraints. You need a different set of instincts, and better tooling, to diagnose and fix live issues.

I’ve spent the last decade chasing production ghosts across JVMs, microservices, and the occasional mystery container. In this post, I’ll cover patterns I reach for when I can’t halt the world for a breakpoint.

Why No Debugger? The Real Limitations

A debugger isn’t unavailable only for technical reasons. In high-availability setups, pausing a thread can cause cascading failures, missed SLAs, or even compliance breaches. Remote debugging ports are a security risk. Managed runtimes such as AWS Lambda and Cloud Run make attaching one impractical.

Further, in distributed systems, the error you’re chasing might span services on different hosts or in different clouds, and it might surface at unpredictable times. You can’t attach a debugger to all of it, nor can you step through user-triggered concurrency bugs.

Log-First, Log-Smart

Logs are your first source of truth, but only if you’ve done the work up front. I advocate for structured, event-focused logs with contextual fields: request IDs, user IDs, region, and timing. Avoid relying on log grep gymnastics. When something breaks, see if you can trace input to output across systems with unique IDs.
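As a rough illustration of structured, event-focused logging, the sketch below uses only Python's standard library to emit one JSON object per event with a request ID attached. The logger name, the field names (`user_id`, `region`, `duration_ms`), and the IDs are hypothetical; the same shape carries over to Logback, zap, or whatever your stack uses.

```python
import contextvars
import io
import json
import logging

# Hypothetical context variable holding the ID of the request being handled.
request_id_var = contextvars.ContextVar("request_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    # Hypothetical contextual fields that callers may pass via `extra=`.
    CONTEXT_FIELDS = ("user_id", "region", "duration_ms")

    def format(self, record):
        event = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
            "request_id": request_id_var.get(),
        }
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):
                event[field] = getattr(record, field)
        return json.dumps(event)


def build_logger(stream):
    """Return a logger that writes structured events to `stream`."""
    logger = logging.getLogger("checkout")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


stream = io.StringIO()  # Stands in for stdout or a log shipper.
log = build_logger(stream)
request_id_var.set("req-123")  # Normally set once per request by middleware.
log.info(
    "payment_authorized",
    extra={"user_id": "u-42", "region": "eu-west-1", "duration_ms": 87},
)
print(stream.getvalue(), end="")
```

Because every line is a self-describing object, filtering by `request_id` becomes a query rather than a regular expression.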

When logs are inadequate, reach for dynamic log level changes if your logging library supports adjusting levels at runtime (Logback and zap both do). If not, shipping a targeted log patch can be faster than burning cycles on guesswork.

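Dynamically increase log verbosity on a Kubernetes pod
# `export` inside an exec shell changes only that shell, never the running process,
# so first raise the level in the config the app reads (for example, a mounted ConfigMap).
# Then ask the process to reload. Assumes myapp re-reads its config on SIGHUP (app-specific)
# and that the image ships pgrep.
kubectl exec prod-app-7g8h -- sh -c 'kill -HUP "$(pgrep -o myapp)"'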
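For the receiving side of a `kill -HUP`, a minimal Python sketch could look like the following. It assumes the desired level is stored in a small file that the process re-reads on SIGHUP; the file location, the logger name, and the reload-on-SIGHUP convention are all assumptions for illustration, not something a particular framework provides.

```python
import logging
import signal
import tempfile
from pathlib import Path

logger = logging.getLogger("myapp")
logger.setLevel(logging.INFO)

# Hypothetical location; in Kubernetes this could be a file from a mounted ConfigMap.
LEVEL_FILE = Path(tempfile.gettempdir()) / "myapp-log-level"


def reload_log_level(signum=None, frame=None):
    """Re-read the desired level name (e.g. "DEBUG") and apply it to the logger."""
    try:
        name = LEVEL_FILE.read_text().strip().upper()
    except OSError:
        return  # No override file: keep the current level.
    level = logging.getLevelName(name)
    if isinstance(level, int):  # Unknown names come back as a string; ignore them.
        logger.setLevel(level)


# SIGHUP only exists on Unix-like systems.
if hasattr(signal, "SIGHUP"):
    signal.signal(signal.SIGHUP, reload_log_level)
```

Writing `DEBUG` into the file and sending SIGHUP raises verbosity without a restart; writing `INFO` and signalling again restores it once the investigation is over.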

Distributed Tracing: Correlate, Don’t Guess

Distributed tracing (OpenTelemetry, Jaeger, Zipkin) lets you reconstruct requests as they hop across services. Traces can expose bottlenecks, pinpoint failing downstreams, and show timing patterns that logs alone might obscure.

Adopt tracing as a first-class citizen—not as a postmortem patch. Without trace IDs, you’re piecing together a puzzle with missing pieces.
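To make concrete what "carrying a trace ID" means, here is a hand-rolled Python sketch of propagating a W3C Trace Context `traceparent` header from an inbound request to a downstream call. In a real service an OpenTelemetry SDK would normally do this for you; the function names below are hypothetical and exist only to show which parts of the header stay constant and which change per hop.

```python
import re
import secrets

# W3C Trace Context layout: version-traceid-parentid-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent():
    """Start a new trace at the edge of the system (sampled flag set)."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(incoming):
    """Keep the caller's trace ID and mint a fresh span ID for the outgoing call."""
    match = TRACEPARENT_RE.match(incoming or "")
    if match is None:
        # Missing or malformed header: start a new trace rather than fail the request.
        return new_traceparent()
    trace_id, _parent_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def outgoing_headers(incoming_headers):
    """Build headers for a downstream call from the inbound request's headers."""
    return {"traceparent": child_traceparent(incoming_headers.get("traceparent"))}


inbound = {"traceparent": new_traceparent()}
outbound = outgoing_headers(inbound)
print(inbound["traceparent"])
print(outbound["traceparent"])
```

The two printed headers share the 32-character trace ID and differ in the 16-character span ID, which is what lets a tracing backend stitch the hops into one request.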

Correlating logs and traces by request ID cut our incident triage time by 60% during a critical outage last year.

Incident War Story: The Hanging Checkout

E-commerce Checkout Stalls on Black Friday

  1. 09:13 PagerDuty alert for checkout latency spike
  2. 09:16 No exceptions in application logs; latency observed only in a single region
  3. 09:22 Dynamic logging enabled for payment service
  4. 09:27 Distributed trace reveals downstream call to inventory service stuck on mutex lock
  5. 09:32 Hotfix disables problematic code path; orders resume flowing

Lesson

Even without a live debugger, correlating logs and traces exposed a lock contention issue that never appeared in error logs. Targeted dynamic logging and cross-service tracing located the problem, and the hotfix followed from there.
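Lock contention produces no exception, so it only shows up if something measures the wait. The Python sketch below is a hypothetical example of the kind of targeted log patch that can surface it: a wrapper that times lock acquisition and logs when the wait crosses a threshold. The threshold, the lock name, and the simulated holder thread are invented for illustration and are not taken from the incident above.

```python
import logging
import threading
import time
from contextlib import contextmanager

log = logging.getLogger("inventory")

# Hypothetical threshold: waits longer than this are worth a log line.
SLOW_LOCK_WAIT_SECONDS = 0.05


@contextmanager
def timed_lock(lock, name):
    """Acquire `lock`, logging a warning if the wait exceeded the threshold."""
    start = time.monotonic()
    lock.acquire()
    waited = time.monotonic() - start
    try:
        if waited > SLOW_LOCK_WAIT_SECONDS:
            log.warning("slow_lock_acquire lock=%s waited_ms=%.0f", name, waited * 1000)
        yield waited
    finally:
        lock.release()


stock_lock = threading.Lock()


def hold_lock(started):
    """Simulate another request holding the lock for a while."""
    with stock_lock:
        started.set()
        time.sleep(0.2)


started = threading.Event()
holder = threading.Thread(target=hold_lock, args=(started,))
holder.start()
started.wait()  # Make sure the other thread owns the lock first.
with timed_lock(stock_lock, "stock") as waited:
    observed_wait = waited
holder.join()
```

A log line like this turns "the request is slow" into "the request waited on this specific lock", which is the signal that error logs alone never carry.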

Production-Safe Tactics: Profiling and Runtime Introspection

  • Use eBPF-based profilers (e.g., Pixie, Parca) for low-overhead sampling in prod.
  • Heap dumps (JVM) and goroutine dumps (Go): capture during anomalies, not preemptively.
  • On JVM: `jstack <pid>` for a thread dump; on Go: `curl 'http://localhost:6060/debug/pprof/goroutine?debug=2'` (assumes the service exposes net/http/pprof on port 6060).
  • Be mindful: always test these methods in staging first to check for impact.
Warning: Never blindly enable expensive instrumentation in production. Profiling and dump tools can degrade performance or exhaust resources if misused.
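For runtimes without a `jstack` equivalent on hand, an in-process thread dump is often only a few lines. The Python sketch below collects the current stack of every live thread as text; the worker thread and its name are hypothetical stand-ins for a stuck request handler. Note that `sys._current_frames()` is a CPython implementation detail, so treat this as a sketch rather than a portable API.

```python
import sys
import threading
import traceback


def dump_threads():
    """Return the current stack of every live thread as text."""
    names = {t.ident: t.name for t in threading.enumerate()}
    sections = []
    for ident, frame in sys._current_frames().items():
        header = f"Thread {names.get(ident, 'unknown')} (id={ident})"
        stack = "".join(traceback.format_stack(frame))
        sections.append(f"{header}\n{stack}")
    return "\n".join(sections)


def blocked_worker(ready, release):
    ready.set()
    release.wait()  # Stands in for a thread stuck waiting on something.


ready = threading.Event()
release = threading.Event()
worker = threading.Thread(
    target=blocked_worker, args=(ready, release), name="worker-1"
)
worker.start()
ready.wait()  # Ensure the worker is inside blocked_worker before dumping.
report = dump_threads()
release.set()
worker.join()
print(report)
```

The standard library's `faulthandler.register` can wire a similar dump to a Unix signal, so it can be triggered from outside the process in the same way as a log level reload.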

  1. Start with correlating logs for anomalous patterns.
  2. Escalate to distributed tracing to map out request lifecycles.
  3. Introduce dynamic logging or targeted profiling as needed.
  4. Always communicate with your team about live changes.

Frequently asked questions

Why can't I use a debugger in production?

Attaching a debugger can halt critical processes and create security risks, and it is often technically impossible because of sandboxing, scale, or performance constraints.

How can I get more context if logs aren’t enough?

Add targeted log lines, use distributed tracing, or leverage runtime profiling tools that introduce minimal overhead (e.g., eBPF-based profilers, pprof).

What’s the safest way to add diagnostics to live systems?

Prefer dynamic log level changes, temporary feature flags, or hot-reloadable configs that don’t require full deploys or service restarts.

How do I avoid making problems worse while investigating?

Coordinate with incident responders, communicate changes, and validate the impact of each observability tweak on a small subset of nodes before rolling it out to all of prod.