Incident Runbook Guide: Write Runbooks That Work

Every on-call engineer has been there: 2 AM, PagerDuty screams, you open the wiki page titled 'Production Runbook' and find a 12-page document last updated in 2019. It tells you to 'check the logs' and 'verify the database is healthy.' No actual commands. No expected output. You might as well be reading a fortune cookie.

I've responded to hundreds of incidents across three companies, and the difference between a good runbook and a bad one is the difference between a 15-minute mean time to resolution (MTTR) and a 90-minute fire drill. This article covers how to write runbooks that actually reduce MTTR and don't embarrass you when the VP of Engineering is on the bridge.

The Anatomy of an Actionable Runbook

A runbook is a checklist for a specific failure mode. It must be executable by someone who has never seen the service before. Here's the structure I use:

- Title and symptom (e.g., 'High Error Rate on /api/checkout')
- Severity indicators (e.g., 'If error rate > 5% for 5 minutes, page SRE')
- Prerequisites (e.g., 'Requires kubectl access to production cluster')
- Step-by-step diagnosis with exact commands and expected output
- Mitigation steps with rollback commands
- Verification steps to confirm the fix worked
- Escalation path if steps fail
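
One way to keep that structure from eroding is to lint for it. The sketch below assumes each runbook is a text or Markdown file whose headings start with '#' and contain the names in `REQUIRED_SECTIONS`. Those section names and the `missing_sections` helper are illustrative choices, not an established convention.

```python
# Hypothetical section names; adjust them to match your own runbook template.
REQUIRED_SECTIONS = [
    "Symptom",
    "Severity",
    "Prerequisites",
    "Diagnosis",
    "Mitigation",
    "Verification",
    "Escalation",
]


def missing_sections(runbook_text: str) -> list[str]:
    """Return the required section names that never appear in a heading.

    A heading is assumed to be any line that starts with '#'.
    """
    headings = [
        line.lstrip("#").strip().lower()
        for line in runbook_text.splitlines()
        if line.startswith("#")
    ]
    return [
        name
        for name in REQUIRED_SECTIONS
        if not any(name.lower() in heading for heading in headings)
    ]


example = """# High Error Rate on /api/checkout
## Symptom
## Severity indicators
## Prerequisites
## Diagnosis
## Mitigation
## Verification
"""
print(missing_sections(example))  # the example has no escalation heading
```

A check like this could run in CI on every pull request that touches a runbook, so a missing escalation path is caught at review time rather than at 2 AM.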

A Real Example: PostgreSQL Replication Lag Runbook

Here's an excerpt from a runbook I wrote for a service that used PostgreSQL streaming replication. The symptom was 'replica lag > 30 seconds during peak traffic.'

Example runbook snippet for PostgreSQL replication lag. Note the exact commands and the warning about data loss.

# Check replication lag on replica host
psql -h replica-1.example.com -U repl_user -d postgres -c "
  SELECT pg_last_wal_receive_lsn(),
         pg_last_wal_replay_lsn(),
         pg_last_xact_replay_timestamp(),
         now() - pg_last_xact_replay_timestamp() AS replication_lag;
"

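# Expected output: replication_lag should be < 30 seconds
# Note: on an idle primary this value grows even when the replica is caught up.
# If the two LSNs above are equal, no received WAL is waiting to be replayed.
# If lag > 30s, proceed to mitigation below.

# Mitigation: promote replica to primary (ONLY if primary is confirmed dead)
# WARNING: this will cause data loss if primary is still accepting writes
# Run on the replica host as the OS user that owns the data directory.
pg_ctl promote -D /var/lib/postgresql/data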

# Verify new primary is accepting writes
psql -h new-primary.example.com -U app_user -d myapp -c "INSERT INTO healthchecks(ts) VALUES (now());"

Warning

Never include a step like 'fix the issue by restarting the pod' without specifying which pod, how to identify it, and the exact kubectl command. Vague steps are worse than no steps—they waste time.

Common Pitfalls I've Seen (and Caused)

1. The 'general' runbook: A single runbook covering 'any database issue.' It's too vague. Split by symptom: replication lag, connection pool exhaustion, slow queries, etc.
2. Missing failure modes: The runbook says 'run command X' but doesn't say what to do if command X fails (e.g., host unreachable, permission denied). Always include a fallback.
3. Assuming tool availability: 'Run the health check script in /opt/scripts/health.sh', but the script was deleted six months ago. Keep scripts in the same repo as the runbook.
4. Stale contact info: The escalation path points to someone who left the company. Use PagerDuty schedules or team aliases, not individual names.
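
The second pitfall, a step with no fallback, can also be addressed in any helper scripts that ship with the runbook. The following is a minimal sketch, assuming each step has one primary command and one fallback command. `run_with_fallback` is a hypothetical helper, and the two commands in the demo are stand-ins so the example runs anywhere.

```python
import subprocess
import sys


def run_with_fallback(primary: list[str], fallback: list[str], timeout: int = 10) -> str:
    """Run the primary command; if it fails or times out, run the fallback.

    Returns "primary" or "fallback" to say which one succeeded, and raises
    RuntimeError if both fail so the engineer knows to escalate.
    """
    for label, command in (("primary", primary), ("fallback", fallback)):
        try:
            result = subprocess.run(command, capture_output=True, timeout=timeout)
        except (OSError, subprocess.TimeoutExpired):
            continue  # command missing or hung: try the next option
        if result.returncode == 0:
            return label
    raise RuntimeError("primary and fallback both failed; escalate")


# Stand-ins for real commands: the first exits non-zero, the second succeeds.
failing = [sys.executable, "-c", "raise SystemExit(1)"]
working = [sys.executable, "-c", "print('logs found')"]
print(run_with_fallback(failing, working))
```

In a real runbook the primary might be an SSH command and the fallback a kubectl command; the point is that the failure branch is written down before the incident, not discovered during it.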

The Night the Runbook Pointed to a Dead Host

01:23 PagerDuty alerts: 'Checkout service error rate > 10%'
01:25 Engineer opens runbook. Step 1: 'SSH to checkout-01 to check logs'
01:26 SSH fails: host unreachable. Runbook has no fallback.
01:40 Engineer discovers from another team that checkout-01 was decommissioned two weeks ago.
01:55 Engineer finds logs on the new host via Kubernetes. Runs mitigation manually.
02:10 Issue resolved. MTTR: 47 minutes. It could have been 15 if the runbook had been current.

Lesson

Runbooks must be verified against the live environment. Schedule a quarterly 'runbook drill' where an engineer follows each runbook from start to finish and fixes mismatches.
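
Part of that verification can be automated between drills. The sketch below scans a runbook for hostnames and reports the ones that no longer resolve. It assumes hosts are written as fully qualified names under `example.com` (the timeline above uses the short name checkout-01, so the pattern would need adapting), and `unresolvable_hosts` and `fake_resolve` are hypothetical names. A decommissioned host can still have a DNS record, so this catches only one class of staleness.

```python
import re
import socket
from typing import Callable

# Matches host-like tokens under example.com; swap in your own domain suffix.
HOST_PATTERN = re.compile(r"\b[a-z0-9-]+(?:\.[a-z0-9-]+)*\.example\.com\b")


def unresolvable_hosts(
    runbook_text: str,
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> list[str]:
    """Return hostnames mentioned in the runbook that no longer resolve."""
    dead = []
    for host in sorted(set(HOST_PATTERN.findall(runbook_text))):
        try:
            resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            dead.append(host)
    return dead


def fake_resolve(host: str) -> str:
    """Stand-in resolver so the example runs without DNS access."""
    if host.startswith("checkout-01."):
        raise socket.gaierror("name not known")
    return "10.0.0.1"


text = "ssh checkout-01.example.com\npsql -h replica-1.example.com -U repl_user"
print(unresolvable_hosts(text, resolve=fake_resolve))
```

Passing the resolver as a parameter keeps the check testable offline; in scheduled use the default `socket.gethostbyname` would do the real lookup.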

Testing Runbooks: The Chaos Engineering Approach

You wouldn't deploy code without testing it. Why would you deploy a runbook without testing it? At my last company, we ran a monthly 'runbook review' during the on-call handoff. The incoming on-call engineer would pick one runbook and execute it in a staging environment. If any step failed or was ambiguous, we fixed it immediately.

Better yet, introduce chaos experiments that deliberately trigger the conditions in a runbook. For example, use a tool like Chaos Mesh to inject a network partition on a replica, then have an engineer follow the 'replica lag' runbook. You'll find gaps fast.

If your runbook isn't tested in a live environment, it's not a runbook—it's a suggestion.

Metadata That Keeps Runbooks Fresh

Every runbook should have a header block with metadata. I use YAML front matter in Markdown files stored in the service repo. Here's the template:

Runbook metadata header. The 'last_verified' field is critical — it tells the on-call engineer how stale the information might be.

---
title: "High Error Rate on /api/checkout"
owner: "checkout-team"
last_verified: "2024-10-15"
verified_by: "jdoe"
severity: "critical"
alert_condition: "error_rate > 5% for 5 minutes"
prerequisites:
  - "kubectl access to prod cluster"
  - "read-only Postgres credentials"
---
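
A header like this is only useful if something reads it. As a rough sketch, the function below pulls `last_verified` out of the front matter and reports the runbook's age in days. It uses a simple pattern match rather than a full YAML parser, and `days_since_verified` and the 90-day `MAX_AGE_DAYS` threshold are assumptions chosen to match a quarterly review cadence.

```python
import re
from datetime import date

MAX_AGE_DAYS = 90  # assumed threshold, matching a quarterly verification cadence


def days_since_verified(runbook_text: str, today: date) -> int:
    """Return how many days ago the runbook was last verified.

    Reads only the `last_verified: "YYYY-MM-DD"` line; this is a simple
    pattern match, not a full YAML parser.
    """
    match = re.search(
        r'^last_verified:\s*"?(\d{4}-\d{2}-\d{2})"?\s*$',
        runbook_text,
        flags=re.MULTILINE,
    )
    if match is None:
        raise ValueError("runbook has no last_verified field")
    return (today - date.fromisoformat(match.group(1))).days


header = '''---
title: "High Error Rate on /api/checkout"
last_verified: "2024-10-15"
---
'''
age = days_since_verified(header, today=date(2025, 2, 1))
print(age, "days since verification; stale:", age > MAX_AGE_DAYS)
```

Taking `today` as an argument instead of calling `date.today()` inside the function keeps the result reproducible in tests.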

Where to Store Runbooks

Don't put runbooks in a wiki. Wikis are write-only memories. Put them in the same repository as the service code, under a `runbooks/` directory. This way, changing a runbook requires a pull request, just like changing code. The on-call engineer can open the repo and find the runbook quickly. Bonus: you can link runbooks directly from monitoring alerts using a URL template.

At my current company, every PagerDuty alert includes a link to the corresponding runbook in GitHub. The link is generated from the alert metadata. When an engineer clicks the alert, they land on the exact runbook they need.
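
The link generation can be as simple as string formatting. The sketch below is one possible shape, not a description of any PagerDuty or GitHub feature: the `example-org` organization, the `main` branch, and the `service` and `runbook` alert labels are all hypothetical.

```python
from urllib.parse import quote

# Hypothetical template; the organization, branch, and directory layout are
# placeholders, not a GitHub or PagerDuty convention.
RUNBOOK_URL_TEMPLATE = (
    "https://github.com/example-org/{repo}/blob/main/runbooks/{runbook}.md"
)


def runbook_url(alert_labels: dict[str, str]) -> str:
    """Build a runbook link from alert metadata.

    Assumes each alert carries `service` (the repository name) and
    `runbook` (the file name without extension) labels.
    """
    return RUNBOOK_URL_TEMPLATE.format(
        repo=quote(alert_labels["service"], safe=""),
        runbook=quote(alert_labels["runbook"], safe=""),
    )


print(runbook_url({"service": "checkout", "runbook": "high-error-rate"}))
```

Percent-encoding the label values keeps a stray slash or space in alert metadata from producing a link to the wrong path.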

60% reduction in MTTR after implementing tested, version-controlled runbooks (internal data, 2023)

What About Incident Playbooks?

Runbooks are for specific symptoms. Playbooks are for broader incident categories—like 'database incident' or 'deployment failure.' A playbook might say '1. Check runbooks for common database failure modes. 2. If none apply, escalate to DBA.' Playbooks are useful for new on-call engineers who need a decision tree. But don't confuse the two: a runbook without a specific command is a playbook. Both have their place, but this article focuses on runbooks because they're the ones that actually get executed during an incident.

Final Checklist Before You Call a Runbook 'Done'

- Every step has an exact command or click path (e.g., 'kubectl -n production get pods -l app=checkout').
- Every command includes expected output or a success indicator.
- There is a 'what if this step fails' section for each critical step.
- All hostnames, IPs, and credentials are current (or indicate where to find them).
- The runbook has been tested by someone other than the author in the last 90 days.
- The runbook is linked from the monitoring alert that triggers it.

Tip

Start with your top 5 alerts. Write a runbook for each. Test them. Iterate. You'll save hours of downtime in the next incident.

Frequently asked questions

How often should incident runbooks be updated?

Update a runbook whenever the corresponding service changes—deployment, config, dependency. At minimum, review and verify each runbook every quarter. Add a 'last verified' field with the date and engineer name.

What's the difference between a runbook and a playbook?

A runbook is a specific, step-by-step procedure for a known failure mode (e.g., 'PostgreSQL replica lag exceeds 30s'). A playbook is a broader strategy or set of principles for handling classes of incidents (e.g., 'Database incident response playbook'). Runbooks are tactical; playbooks are strategic.

Where should runbooks be stored?

Store runbooks in the same repository as the service code, in a `runbooks/` directory. This keeps them close to the engineers who maintain the service and allows pull requests for changes. Wikis get stale fast.

How long should a runbook be?

Aim for one page or less. If it's longer than a page, split it into multiple runbooks for specific symptoms. Engineers under stress should not need to scroll or search.
