Operational runbooks that work under stress

The preceding chapters of this guide each produce concrete deliverables: contact lists, dependency maps, authority packs, compliance calendars, exit playbooks, employee registries. Each is valuable on its own. Together, they represent the raw material of recovery. But in a crisis, the gap between “we have this documented somewhere” and “we can act on this right now” is the gap between survival and paralysis.

Why documentation usually fails

Most companies that invest in preparedness documentation discover, during an actual incident, that their documentation does not work. Not because it is inaccurate, but because it was written for normal operating conditions, and an incident removes exactly those conditions.

Documentation fails when it assumes connectivity. Runbooks stored in Notion, Confluence, or Google Docs are inaccessible when those platforms are down. A recovery procedure that requires logging into the very system you are trying to recover is not a recovery procedure.
Documentation fails when it assumes identity. Internal wikis gated behind SSO become unreachable when the identity provider fails. Documentation protected by the same authentication layer as the systems it describes cannot help you recover those systems.
Documentation fails when it assumes expertise. Procedures written by the person who built the system often omit steps that feel obvious to them. Under stress, the person executing the runbook may not be the person who wrote it. Clarity beats brevity.
Documentation fails when it is stale. A runbook written eighteen months ago for an infrastructure that has since been reorganised is worse than no runbook at all: it provides false confidence and sends recovery teams down dead ends.
Documentation fails when no one owns it. Deliverables produced during a preparedness sprint and never revisited decay silently. Vendor lists become outdated. Contact numbers change. Access procedures are invalidated by infrastructure changes. Without a named owner and a review cadence, documentation is a snapshot of a past state, not a current resource.

The operational runbook must be designed to survive all of these failure modes simultaneously.
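The last two failure modes, staleness and missing ownership, are the easiest to check mechanically. The sketch below is a minimal illustration in Python, not a prescribed tool: the Runbook record, its field names, and the audit function are all hypothetical, and it assumes each runbook carries a named owner, a last-review date, and an agreed review cadence in days.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Runbook:
    # All fields are illustrative assumptions, not a format defined by this guide.
    title: str
    owner: Optional[str]      # named owner, or None if nobody owns it
    last_reviewed: date       # date of the most recent review
    review_every_days: int    # agreed review cadence in days


def audit(runbooks, today):
    """Return (title, problem) pairs for runbooks that are unowned or overdue."""
    findings = []
    for rb in runbooks:
        if not rb.owner:
            findings.append((rb.title, "no named owner"))
        due = rb.last_reviewed + timedelta(days=rb.review_every_days)
        if today > due:
            findings.append((rb.title, f"review overdue since {due.isoformat()}"))
    return findings


if __name__ == "__main__":
    inventory = [
        Runbook("First-hour playbook", "A. Owner", date(2024, 1, 1), 90),
        Runbook("Vendor contact list", None, date(2024, 5, 1), 90),
    ]
    for title, problem in audit(inventory, date(2024, 6, 1)):
        print(f"{title}: {problem}")
```

A check like this only flags decay; it does not fix it. Its output is useful only if it reaches a person with the authority to assign an owner or schedule the review, and if the inventory itself is kept somewhere that survives the connectivity and identity failures described above.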

Tiered runbooks, not one monolith

The whitepaper covers scenarios with very different time horizons. A Cloudflare-style cascading outage typically resolves in hours. Sanctions may last months. A splinternet scenario is effectively permanent. A single runbook cannot serve all three. Attempting to build one produces a document so long that no one reads it, or so generic that it provides no actionable guidance.

Structure the operational runbook around three tiers, each activating the next if the situation does not resolve.

Tier 1: The first-hour playbook

The first hour determines whether the organisation responds or freezes. At this stage, no one knows whether the incident is a transient outage, a provider failure, or the beginning of a prolonged disruption. The first-hour playbook does not attempt to diagnose the root cause. It establishes coordination, confirms the extent of the problem, and activates decision authority.

The first-hour playbook should fit on a single printed page and contain exactly three things: