Engineering process · 9 min read

Debugging vs. Firefighting: Why Treating Production Incidents as Debugging Sessions Fails

When production goes down, your brain wants to debug. That instinct costs you hours. Here's why firefighting requires a fundamentally different approach, and the specific process I use to switch modes.

Tags: debugging, firefighting, incident response, production outages, SRE, engineering process

I've watched smart engineers burn thirty minutes on a production outage because they treated it like a debugging session. They pulled logs, ran queries, traced code paths — all the right debugging moves. But the site stayed down. The incident timer kept ticking. And the root cause? A bad config push that a rollback would have fixed in two minutes.

Debugging and firefighting are not the same skill. Confusing the two is one of the most expensive mistakes an engineer can make. Let me show you the difference, why it matters, and the exact process I use to keep myself (and my team) in firefighting mode when production is on fire.

Debugging Is a Solo Sport with Unlimited Time

Debugging is what you do when a unit test fails, or a feature doesn't behave as expected. You own the problem. You dig into the code, add print statements, step through with a debugger. You might spend an hour hunting a null pointer. And that's fine — no users are waiting.

The debugging mindset is: understand the root cause before applying a fix. It's methodical, narrow, and personal.

Firefighting Is a Team Sport with a Clock

Firefighting is what you do when your payment processing pipeline is returning 503s to customers. The goal is not to find the root cause; the goal is to restore service. Root cause comes later. Firefighting is broad, fast, and collaborative.

The firefighting mindset is: mitigate first. Roll back, toggle a feature flag, ramp down traffic. Then debug.

Warning: if you find yourself alone, deep in Grafana dashboards, more than five minutes into an outage without having communicated a mitigation plan, you are debugging, not firefighting. Stop.

The Five-Minute Rule

I enforce a hard five-minute rule on every incident I lead. From the moment I acknowledge the alert, I have five minutes to identify a clear cause. If I'm still not sure when the timer runs out, I roll back the most recent deploy or revert the most recent feature flag change. If that doesn't restore service, I escalate. That's it.

This rule forces the firefighting mindset. It prevents the sunk-cost trap of 'I'm almost there, just one more query.'

A simple checklist to switch from debug mode to firefighting mode.
# My personal incident checklist (printed, stuck to monitor)
# 1. Acknowledge alert in PagerDuty
# 2. Post in #incident channel: "Investigating X alert"
# 3. Start 5-minute timer
# 4. Check deploy dashboard for last 30 min
# 5. Check feature flag dashboard for recent toggles
# 6. If no obvious cause at 5 min -> roll back last deploy
# 7. If rollback fails -> escalate
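
The decision part of the checklist (steps 3 to 7) can be sketched in code. This is a minimal illustration, not a real PagerDuty or deploy-dashboard integration: the `Change` record, the `next_action` function, and the returned strings are all hypothetical names chosen for this example, and treating "nothing recent to undo" as an escalation is an assumption of the sketch.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

RULE_WINDOW = timedelta(minutes=5)        # the five-minute rule
CHANGE_LOOKBACK = timedelta(minutes=30)   # checklist steps 4-5: last 30 minutes


@dataclass(frozen=True)
class Change:
    """A recent deploy or feature flag toggle (hypothetical record shape)."""
    kind: str      # "deploy" or "flag"
    name: str
    at: datetime


def most_recent_change(changes: List[Change], acked_at: datetime,
                       now: datetime) -> Optional[Change]:
    """Newest change between 30 minutes before the ack and now, or None."""
    recent = [c for c in changes if acked_at - CHANGE_LOOKBACK <= c.at <= now]
    return max(recent, key=lambda c: c.at, default=None)


def next_action(acked_at: datetime, now: datetime, changes: List[Change],
                cause_identified: bool = False,
                rollback_failed: bool = False) -> str:
    """Return the next step for the incident lead as a plain string."""
    if rollback_failed:
        return "escalate"                      # checklist step 7
    if cause_identified:
        return "mitigate the identified cause"
    if now - acked_at < RULE_WINDOW:
        return "keep investigating"            # timer still running
    target = most_recent_change(changes, acked_at, now)
    if target is None:
        return "escalate"                      # assumption: nothing recent to undo
    verb = "roll back" if target.kind == "deploy" else "revert flag"
    return f"{verb} {target.name}"             # checklist step 6


# Example: alert acknowledged at 14:03, one deploy at 13:55 (date is arbitrary).
acked = datetime(2024, 1, 1, 14, 3)
changes = [Change("deploy", "consumer", datetime(2024, 1, 1, 13, 55))]
print(next_action(acked, datetime(2024, 1, 1, 14, 6), changes))  # keep investigating
print(next_action(acked, datetime(2024, 1, 1, 14, 8), changes))  # roll back consumer
```

The point of writing it down this way is that the timer, not the engineer's confidence, decides when investigation ends.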

A War Story: The Case of the Silent Queue

  1. 14:02 Alert: Order processing lag spiking to 30 seconds
  2. 14:03 Engineer A posts in incident channel, starts looking at Redis queue
  3. 14:08 Engineer A finds a stalled consumer, begins debugging the consumer code
  4. 14:15 Second alert: orders failing. Engineer A still debugging consumer logic
  5. 14:18 Engineer B joins, asks 'Did we deploy anything recently?'
  6. 14:19 Engineer B checks deploy dashboard: new consumer version deployed at 13:55
  7. 14:21 Engineer B rolls back consumer to previous version. Queue starts draining.
  8. 14:25 Service restored. Root cause: a null pointer in new consumer code.

Lesson

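Engineer A spent 16 minutes (14:03 to 14:19) investigating the queue and debugging the consumer code instead of checking recent changes. The actual rollback took about four minutes to restore service (14:21 to 14:25), so a rollback at 14:03 would have ended the incident around 14:07. Even applied mechanically, the five-minute rule would have forced a rollback at 14:08, 13 minutes earlier than it actually happened.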
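
The arithmetic behind the lesson can be checked directly from the timeline. The sketch below takes the timestamps at face value and assumes a rollback under the five-minute rule would have taken as long as the real one did (14:21 to 14:25); that assumption is mine, and the variable names are only illustrative.

```python
from datetime import datetime, timedelta


def at(hhmm: str) -> datetime:
    """Parse an HH:MM timestamp from the timeline (the date is irrelevant here)."""
    return datetime.strptime(hhmm, "%H:%M")


def minutes(delta: timedelta) -> int:
    """Whole minutes in a timedelta."""
    return int(delta.total_seconds() // 60)


alert = at("14:02")             # first alert fires
acked = at("14:03")             # Engineer A posts in the incident channel
rollback_started = at("14:21")  # Engineer B rolls back the consumer
restored = at("14:25")          # service restored

rollback_duration = restored - rollback_started
actual_outage = restored - alert

# Five-minute rule: roll back five minutes after the acknowledgement.
rule_rollback = acked + timedelta(minutes=5)
rule_restored = rule_rollback + rollback_duration   # assumes same rollback duration
saved = restored - rule_restored

print(f"rollback took {minutes(rollback_duration)} min")           # rollback took 4 min
print(f"alert to restore: {minutes(actual_outage)} min")           # alert to restore: 23 min
print(f"rule-based restore at {rule_restored:%H:%M}")              # rule-based restore at 14:12
print(f"outage shortened by {minutes(saved)} min")                 # outage shortened by 13 min
```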

The Structural Differences

  • Debugging: single owner. Firefighting: incident commander + multiple roles.
  • Debugging: you can go down rabbit holes. Firefighting: you must stay broad and escalate.
  • Debugging: fix = understand + change code. Firefighting: fix = mitigate (rollback, throttle, failover).
  • Debugging: no communication overhead. Firefighting: constant status updates in a public channel.
  • Debugging: post-mortem optional. Firefighting: mandatory blameless post-incident review.

How to Train Your Team to Firefight

The best way to teach firefighting is through game days. Simulate an incident with a fake alert and a rollback script. Force engineers to follow the five-minute rule. Make them practice the handoff.
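
One way to make a game day measurable is to log what each participant did and score it against the rule afterwards. The sketch below is an assumption about how such a log could look: the `DrillEvent` record, the action labels ("ack", "post", "rollback", "escalate"), and the `score_drill` function are hypothetical, not part of any tool.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DrillEvent:
    minute: float   # minutes since the fake alert fired
    action: str     # hypothetical labels: "ack", "post", "rollback", "escalate"


def score_drill(events: List[DrillEvent], limit_minutes: float = 5.0) -> Dict[str, bool]:
    """Check a drill log against the five-minute rule."""
    first: Dict[str, float] = {}
    for event in sorted(events, key=lambda e: e.minute):
        first.setdefault(event.action, event.minute)   # keep earliest time per action

    ack = first.get("ack")
    # Earliest decisive action: a rollback or an escalation.
    acted = min((first[a] for a in ("rollback", "escalate") if a in first), default=None)

    return {
        "acknowledged": ack is not None,
        "posted_in_channel": "post" in first,
        "acted_within_limit": (
            ack is not None and acted is not None and acted - ack <= limit_minutes
        ),
    }


# A participant who acknowledges, posts, and rolls back four minutes after the ack.
good_run = [DrillEvent(1, "ack"), DrillEvent(1.5, "post"), DrillEvent(5, "rollback")]
print(score_drill(good_run))
```

A run where the rollback comes 18 minutes after the acknowledgement, with no channel post, fails two of the three checks, which gives the debrief something concrete to discuss.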

I also recommend adding a 'firefighting mode' label to your on-call rotation. When you're on call, you're not debugging — you're firefighting. Debugging happens after the incident is mitigated.

Debugging is for post-incident analysis. Firefighting is for incident response. Mix them up and you'll have both a longer outage and a confusing post-mortem.

The Handoff Protocol

When you rotate someone into an incident, you need a clean handoff. The outgoing person should summarize the current state, what's been tried, the mitigation plan, and any pending investigations. This should be written in the incident channel, not spoken. Spoken handoffs lose context.

I use a template: 'Handoff to @engineerX. Current status: 503s on /checkout. Mitigation: rolled back deploy v1.2.3 at 14:30. Pending: investigating if the bad config is still cached. Next step: flush CDN cache.'
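
If the team posts handoffs through a bot or script, the template can be enforced in code so that no field is skipped. This is a small sketch under that assumption; the `Handoff` class and its field names are hypothetical and simply mirror the template above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Handoff:
    to: str           # handle of the incoming engineer, without the "@"
    status: str       # current state of the incident
    mitigation: str   # what has been done to mitigate
    pending: str      # investigations still open
    next_step: str    # the next concrete action

    def render(self) -> str:
        """Build the written handoff message, refusing to post an incomplete one."""
        missing = [name for name, value in vars(self).items() if not value.strip()]
        if missing:
            raise ValueError(f"handoff is missing: {', '.join(missing)}")
        return (
            f"Handoff to @{self.to}. "
            f"Current status: {self.status}. "
            f"Mitigation: {self.mitigation}. "
            f"Pending: {self.pending}. "
            f"Next step: {self.next_step}."
        )


message = Handoff(
    to="engineerX",
    status="503s on /checkout",
    mitigation="rolled back deploy v1.2.3 at 14:30",
    pending="investigating if the bad config is still cached",
    next_step="flush CDN cache",
).render()
print(message)
```

Raising on an empty field is a deliberate choice here: an incomplete handoff is exactly the context loss the written protocol is meant to prevent.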

Post-Incident: Two Separate Reviews

After the incident, most teams do a single post-mortem that covers both the response and the root cause. I think that's a mistake. The response review should be separate from the root cause analysis.

The response review asks: Did we follow the firefighting process? Did we mitigate quickly? Were handoffs clean? The root cause review asks: What broke? How do we prevent it? Mixing them leads to debates about whether the response was good enough, which distracts from finding the actual bug.

68% reduction in mean time to mitigate (MTTM) after adopting a structured incident response process with a five-minute rule and handoff protocol, based on my team's data over 12 months.
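
For teams that want to track the same metric, MTTM is just the average time from acknowledgement to mitigation, and the reduction is the relative change between two periods. The sketch below shows the calculation only. The durations in it are made-up placeholders for illustration, not my team's data, and the function names are hypothetical.

```python
from statistics import mean
from typing import Sequence


def mttm(minutes_to_mitigate: Sequence[float]) -> float:
    """Mean time to mitigate, in minutes, over a set of incidents."""
    if not minutes_to_mitigate:
        raise ValueError("need at least one incident")
    return mean(minutes_to_mitigate)


def reduction_pct(before: Sequence[float], after: Sequence[float]) -> float:
    """Percentage drop in MTTM from the 'before' period to the 'after' period."""
    baseline = mttm(before)
    return (baseline - mttm(after)) / baseline * 100


# Placeholder durations (minutes from ack to mitigation); NOT real incident data.
before_process = [30, 50, 40]
after_process = [10, 20, 15]

print(f"MTTM before: {mttm(before_process):.1f} min")                        # MTTM before: 40.0 min
print(f"MTTM after: {mttm(after_process):.1f} min")                          # MTTM after: 15.0 min
print(f"Reduction: {reduction_pct(before_process, after_process):.1f}%")     # Reduction: 62.5%
```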

Debugging is a craft. Firefighting is a discipline. You need both, but know when to use which. Next time you get paged, stop. Take a breath. Look at the checklist. And for the love of uptime, don't start debugging until the fire is out.

Frequently asked questions

What's the difference between debugging and firefighting?

Debugging is a methodical, single-person exploration to find a root cause. Firefighting is a rapid, team-based response to stop a production incident and restore service. The goals, pace, and collaboration patterns are completely different.

Why is it dangerous to debug during an incident?

Debugging encourages tunnel vision and solo ownership. In an outage, you need broad awareness, quick mitigation (like a rollback or a feature flag toggle), and constant communication. Debugging alone delays all three, which extends the outage.

What should I do first when a production incident starts?

Stop and read the alert. Broadcast that you're investigating in the incident channel. Then apply the five-minute rule: if you don't see an obvious cause in five minutes, roll back the most recent change or call for backup.

How do I transition from debugging to firefighting mindset?

Use a checklist. I have a printed card: '1. Acknowledge alert. 2. Post in #incident. 3. Start timer. 4. Check recent deploys/feature flags. 5. At 5 min, roll back, then escalate if needed.' The physical act of following a list switches your brain out of debug mode.