I've been woken up by PagerDuty at 3:14 AM more times than I can count. But the incident that still makes my palms sweat happened at 2:47 PM on a Tuesday. My VP of Engineering was in the on-call rotation that week. He pinged me three seconds after the alert fired: 'What's happening?'
The service was a Redis-backed session store for our checkout flow. Every request was returning 500. The blast radius was the entire east coast. My heart rate hit 130 before I could finish typing 'looking into it'.
That incident taught me that the biggest obstacle to debugging in production isn't technical — it's psychological. When your boss is watching and customers are tweeting, your brain wants to pattern-match to the first plausible cause and run with it. That's exactly how you make the outage longer.
The Redis cluster that wasn't really down
- 14:47 Alert: Session store error rate > 5% for checkout-east
- 14:48 VP pings. I open Elasticsearch and see 'connection refused' from Redis cluster nodes
- 14:49 I SSH into a Redis node. `redis-cli ping` returns PONG. Node is fine.
- 14:51 Check client-side connection pool config. All max_connections set to 200. But cluster has 6 nodes, so we should have 1200 connections. Actual open connections: 12 (each app server is pooling 2).
- 14:53 Realize the connection pool library was misconfigured to use a single endpoint instead of the cluster discovery endpoint. A recent deploy changed the env var from 'CLUSTER_DISCOVERY=http://...' to 'REDIS_URI=redis://...', which bypassed cluster mode.
- 14:57 Revert the deploy config. Connections re-establish. Error rate drops to zero.
Lesson
The failure symptom looked exactly like a downed Redis cluster. But the nodes were healthy. The real bug was a config change that made the client library treat a cluster as a single node, exhausting the one node's connection limit. Had I restarted the 'down' nodes, I would have only made it worse.
The first 90 seconds: stop, breathe, structure
When an alert fires, your sympathetic nervous system dumps cortisol and adrenaline. Fine for running from a bear. Terrible for debugging a distributed system. The first thing I do now is a physical ritual: I put my hands flat on the desk and take three slow breaths. This is not a meditation technique — it's a deliberate pause to let the prefrontal cortex come back online.
Then I open a blank document and write down three things: the exact time of the alert, the last known good state (from metrics or a deploy timestamp), and the blast radius. I do this before I run a single command. It forces me to think about scope before cause.
Pre-write an incident response template as a shell script or a Markdown file in your dotfiles. When the alert fires, run `incident new` and it creates a doc with placeholders for time, affected service, error rate, recent deploys, and hypotheses. I use a simple bash function that opens a Google Doc with pre-filled headers.
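Here is a minimal sketch of what an `incident new` helper could look like. It is not the Google Doc function described above: it writes a local Markdown file instead, and the function name, the `INCIDENT_DIR` variable, the file naming, and the headers are all assumptions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical `incident` helper. INCIDENT_DIR and the doc layout are
# assumptions; adapt them to wherever your team keeps incident notes.
incident() {
  if [ "${1:-}" != "new" ]; then
    echo "usage: incident new [service]" >&2
    return 2
  fi
  local service="${2:-unknown-service}"
  local dir="${INCIDENT_DIR:-$HOME/incidents}"
  local stamp
  stamp="$(date -u +%Y%m%dT%H%M%SZ)"
  local doc="$dir/$stamp-$service.md"
  mkdir -p "$dir" || return 1
  # Pre-fill the placeholders so the first minute is spent on scope, not formatting.
  cat > "$doc" <<EOF
# Incident: $service
Alert time (UTC): $(date -u +%Y-%m-%dT%H:%M:%SZ)
Last known good state:
Blast radius:
Error rate:
Recent deploys:

## Hypotheses
1.
2.
3.
EOF
  # Print the path so the caller can open it in an editor of their choice.
  echo "$doc"
}
```

Sourced from your dotfiles, `incident new checkout-east` prints the path of the new doc, which you can hand to your editor or paste into the incident channel.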
Write down three hypotheses before you run any command
The single most effective technique I've adopted from the world of rational debugging is hypothesis-driven investigation. Before I type `curl` or `tail -f`, I write down three possible root causes. Not the most likely one — three, ranked by likelihood. And I force myself to include at least one that seems improbable.
In the Redis incident, my initial list was: 1) Redis cluster nodes are down, 2) Network partition between app and Redis, 3) Client library misconfiguration. I almost skipped #3 because 'we haven't changed the library in months'. But because I wrote it down, I checked the env var. That was the bug.
#!/bin/bash
# incident-hypotheses.sh: run this after creating the doc
# INCIDENT_ID comes from the environment; falls back to "unknown" if unset
cat <<EOF | tee hypotheses.txt
# Incident: ${INCIDENT_ID:-unknown}
# Time: $(date -u)
Hypothesis 1: [most likely]
Evidence for:
Evidence against:
Test command:
Hypothesis 2: [second likely]
Evidence for:
Evidence against:
Test command:
Hypothesis 3: [wild card]
Evidence for:
Evidence against:
Test command:
EOF
The signal-to-noise problem: what to look at first
When you're under pressure, every tool screams at you. Grafana shows red. Logs are full of stack traces. PagerDuty is still alerting. The temptation is to click on the brightest fire. But that's a trap: the loudest alert is often a symptom, not the cause.
I use a decision tree. First, check if the problem is global or partial (split traffic, read replicas, canary). Second, check the deploy timeline: 90% of production incidents are caused by a recent change. Third, check resource exhaustion (CPU, memory, connections, file descriptors). Fourth, check whether the upstream dependencies are healthy. Only after those four checks do I dive into application logs.
1. Is the problem affecting all users or a subset? (e.g., only users on a specific availability zone, or only write requests?)
2. What changed in the last hour? (deploy, config push, DNS change, database migration)
3. Are we hitting a resource limit? (connection pool, throttling, disk space, inode exhaustion)
4. Is the upstream dependency healthy? (check health endpoints, not just error rates)
The rollback decision: when to stop debugging
There's a moment in every incident where you have to decide: keep debugging or roll back. The worst engineers I've worked with treat rollback as failure. The best treat it as a feature. If you don't have a clear hypothesis within two minutes, roll back the most recent change. You can always investigate the root cause after traffic is healthy.
In the Redis incident, the fix was a config revert. But I wasted three minutes SSHing into Redis nodes because I was afraid to say 'I don't know yet, let me check the config'. The VP didn't care about the root cause at 14:48 — he cared about the checkout flow working again.
Do not push a speculative fix under pressure. 'It might be a memory leak, let me increase the heap size' is how you get a second incident. If you don't have evidence, roll back. A rollback returns you to a known-good state. A speculative heap increase puts you in a state nobody has tested.
After the fire: the postmortem that actually prevents recurrence
Once the incident is resolved, the real work begins. Most postmortems are exercises in blame avoidance. The useful ones ask: 'Why didn't our monitoring catch this earlier?' and 'What made the debugging path longer than it should have been?'
For the Redis incident, the answer to the first question was: we monitored node health but not connection pool utilization. We added a metric for 'connections per node' and an alert when any node exceeds 80% of its max. For the second question: the env var change wasn't in the deploy diff because it was in a separate config repo. We now require all config changes to go through the same review pipeline as code changes.
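As a rough illustration of the "connections per node" check, the sketch below parses the `connected_clients` field from `redis-cli INFO clients` and compares it against a maximum you pass in. The function names, the way the maximum is supplied, and the output format are assumptions; in practice this would feed your metrics system rather than print to a terminal.

```shell
#!/usr/bin/env bash
# conn_utilization MAX_CLIENTS
# Reads `redis-cli INFO clients` output on stdin and prints the integer
# percentage of MAX_CLIENTS currently in use. Exits 1 if the field is missing.
conn_utilization() {
  local max="$1"
  awk -F: -v max="$max" '
    $1 == "connected_clients" { gsub(/\r/, "", $2); used = $2; found = 1 }
    END {
      if (!found || max <= 0) exit 1
      printf "%d\n", (used * 100) / max
    }'
}

# check_node HOST PORT MAX_CLIENTS [THRESHOLD_PERCENT]
# Hypothetical wrapper: returns 1 when a node exceeds the threshold (default 80).
check_node() {
  local host="$1" port="$2" max="$3" threshold="${4:-80}"
  local pct
  pct="$(redis-cli -h "$host" -p "$port" INFO clients | conn_utilization "$max")" || return 2
  if [ "$pct" -gt "$threshold" ]; then
    echo "ALERT $host:$port at ${pct}% of max connections"
    return 1
  fi
  echo "OK $host:$port at ${pct}% of max connections"
}
```

Keeping the parsing separate from the `redis-cli` call means the threshold logic can be tested with canned `INFO` output, without a live cluster.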
of production incidents are caused by configuration changes, not code bugs (per ACM SIGOPS study, 2019)
The 5-whys that exposed our real vulnerability
We ran a 5-whys session the next day. It went like this:
Why did users get 500s? → Redis connection refused.
Why was Redis refusing connections? → Connection pool was exhausted on one node.
Why was the pool exhausted? → The client library wasn't using cluster discovery.
Why wasn't it using cluster discovery? → The config variable was changed from CLUSTER_DISCOVERY to REDIS_URI in a config push.
Why didn't the config change get caught? → Config changes weren't included in the deploy diff review.
The fix wasn't just reverting the config — it was adding config changes to the code review process. That's the kind of systemic improvement that prevents the same class of incident from happening again.
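One way to make a change like this visible in review is to diff the variable names between the old and new config, so a swap such as CLUSTER_DISCOVERY for REDIS_URI shows up as a removed key and an added key. This is a sketch under the assumption that config lives in plain `KEY=VALUE` files; `env_keys` and `config_key_diff` are hypothetical helper names, not part of any existing tool.

```shell
#!/usr/bin/env bash
# env_keys FILE
# Prints the sorted, unique variable names from a KEY=VALUE file.
env_keys() {
  grep -E '^[A-Za-z_][A-Za-z0-9_]*=' "$1" | cut -d= -f1 | LC_ALL=C sort -u
}

# config_key_diff OLD_FILE NEW_FILE
# Prints "- KEY" for names that were removed and "+ KEY" for names that were added.
config_key_diff() {
  local old_keys new_keys
  old_keys="$(mktemp)" || return 2
  new_keys="$(mktemp)" || return 2
  env_keys "$1" > "$old_keys"
  env_keys "$2" > "$new_keys"
  LC_ALL=C comm -23 "$old_keys" "$new_keys" | sed 's/^/- /'   # only in OLD
  LC_ALL=C comm -13 "$old_keys" "$new_keys" | sed 's/^/+ /'   # only in NEW
  rm -f "$old_keys" "$new_keys"
}
```

Posting this output on the review is a cheap signal: a renamed or replaced connection variable deserves a second pair of eyes even when the code diff is empty.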
Your reputation isn't built on never having incidents. It's built on how methodically you handle them when they happen.
Practical takeaways for your next incident
- Build an incident response script that creates a structured doc with timestamps, hypotheses, and commands. Test it in a drill every quarter.
- When you get an alert, write down three hypotheses before running any command. Include at least one unlikely one.
- Use a decision tree: global vs partial, recent change, resource exhaustion, upstream dependency. In that order.
- If you don't have a clear hypothesis in two minutes, roll back the most recent change. No exceptions.
- After the incident, run a 5-whys focused on the detection gap. Add monitoring for the missing signal.
- Practice your incident response in a game day. It's the only way to make the muscle memory automatic.
The VP never mentioned the incident again. But the next time I was on call, he didn't ping me. He waited for my status update. That trust came from seeing that I had a process — not from seeing me fix the problem fast.
Debugging under pressure is not about being smart. It's about being systematic when your brain wants to be fast. The systems you build will fail. The question is whether you fail with them or rise above the noise.
Frequently asked questions
What should I do first when I get an alert for a production incident?
Stop typing. Take three deep breaths. Then open a fresh terminal and run your pre-written incident response script, which should gather timestamps, recent deploys, error rates, and affected services into a shared doc. Do not start debugging until you know the blast radius.
How do I avoid jumping to conclusions under pressure?
Write down three possible root causes before you run any diagnostic command. Force yourself to include at least one that seems unlikely. This breaks the confirmation-bias loop. I keep a sticky note on my monitor: 'What else could it be?'
Is it okay to roll back immediately instead of debugging?
Yes — if the blast radius is growing and you don't have a clear hypothesis within 2 minutes, roll back. The goal is to restore service, not to win a debugging contest. You can investigate the root cause after traffic is healthy.
How do I communicate with my manager during an active incident?
Send a single message with the status, the action you're taking, and the ETA for next update. Example: 'Investigating elevated 500s on checkout. Rolling back deploy v1.2.3. Next update in 3 min.' No emojis, no apologies.