The best runbook in the world is the one that gets read at 2am. The second-best is the one that doesn't need to be read because the alert already told you what to do.
Five sections, in order
- Symptom — what the on-call is seeing right now. Specific log lines, specific error codes.
- Blast radius — who is affected and how badly. "All triage halts" vs. "a single user can't get a reply" route to different urgency.
- Quick triage — the first 3 things to check, with the commands to run.
- Common fix — if the symptom matches a known cause, the exact fix. If not, the escalation path.
- Postmortem hook — a line to add to the incident doc, even if it's a quick fix. Future you and future teammates need the breadcrumb.
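One way to keep the five sections in order, and to catch a runbook that is missing one, is to treat the runbook as structured data. The sketch below is a minimal illustration, not a prescribed format. The `Runbook` and `TriageStep` classes, the `validate` and `render` methods, and every value in the example (service name, error code, commands, channel name) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class TriageStep:
    check: str    # what the on-call should look at
    command: str  # the command to run for that check


@dataclass
class Runbook:
    symptom: str
    blast_radius: str
    quick_triage: list[TriageStep]
    common_fix: str
    escalation: str
    postmortem_hook: str

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means nothing is missing."""
        problems = []
        for name in ("symptom", "blast_radius", "common_fix",
                     "escalation", "postmortem_hook"):
            if not getattr(self, name).strip():
                problems.append(f"{name} is empty")
        if not self.quick_triage:
            problems.append("quick_triage has no steps")
        elif len(self.quick_triage) > 3:
            # The section is "the first 3 things to check", so cap it there.
            problems.append("quick_triage has more than 3 steps")
        for step in self.quick_triage:
            if not step.command.strip():
                problems.append(f"triage step '{step.check}' has no command")
        return problems

    def render(self) -> str:
        """Render the sections as plain text, in the order listed above."""
        lines = [
            f"Symptom: {self.symptom}",
            f"Blast radius: {self.blast_radius}",
            "Quick triage:",
        ]
        for i, step in enumerate(self.quick_triage, start=1):
            lines.append(f"  {i}. {step.check}: {step.command}")
        lines.append(f"Common fix: {self.common_fix}")
        lines.append(f"Escalation: {self.escalation}")
        lines.append(f"Postmortem hook: {self.postmortem_hook}")
        return "\n".join(lines)


# Hypothetical example; all names and commands are placeholders.
example = Runbook(
    symptom="Worker log repeats 'queue stalled' with error code E_TIMEOUT",
    blast_radius="All triage halts until the worker recovers",
    quick_triage=[
        TriageStep("Is the worker running?", "systemctl status triage-worker"),
        TriageStep("Recent errors", "journalctl -u triage-worker --since '15 min ago'"),
        TriageStep("Disk space", "df -h"),
    ],
    common_fix="Restart the worker: systemctl restart triage-worker",
    escalation="If the restart does not clear it, page the secondary on-call",
    postmortem_hook="Add one line to the incident doc: time, symptom, fix applied",
)

if __name__ == "__main__":
    assert example.validate() == []
    print(example.render())
```

Running `validate()` in a pre-merge check is one possible use: a runbook with an empty section or a triage step without a command would be flagged before anyone has to rely on it at 2am.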
Knowledge check
1. What is the highest-leverage section of a 2am runbook?