Runbook (Incident Response Playbook)

Operational document describing the actions to carry out when handling a given type of incident.

A runbook (sometimes called a playbook) is an operational document that describes, step by step, the actions to take in a specific operational situation: typically an incident, but also maintenance, a deployment, or a recurring procedure. Runbooks turn tribal knowledge into durable organizational assets, reduce MTTR (mean time to recovery), and let less experienced on-call engineers handle incidents effectively.

Types of runbooks:
(1) Incident response runbooks: "Database CPU > 90%", "Kafka consumer lag > 10000", "Payment service 5xx > 5%". Used when the matching alert fires; the alert carries a link to the runbook for quick access.
(2) Maintenance runbooks: "Rotate certificate", "Scale cluster up for Black Friday", "Database vacuum full", "Apply OS patches".
(3) Deployment runbooks: "Deploy new release version X.Y.Z", "Rollback procedure".
(4) Disaster recovery runbooks: "Failover to secondary region", "Restore from backup".
(5) Onboarding runbooks: "How to access production", "Set up local dev environment".

Typical structure of an incident response runbook:
(1) **Title** — "High CPU on production database" — clear, searchable.
(2) **Severity** — SEV1/SEV2/SEV3 (incident priority).
(3) **Symptoms** — "Application latency p99 > 2s, database CPU > 90% sustained 10min+".
(4) **Pre-conditions / Prerequisites** — "AWS access role X, kubectl context Y, on-call escalation contact Z".
(5) **Triage steps** — diagnostic queries, logs to check, dashboards to view.
(6) **Mitigation steps** — explicit commands to mitigate (scale up RDS instance class, kill long-running query, failover, restart pod).
(7) **Verification** — how to confirm mitigation worked (metrics returning normal, error rate dropping).
(8) **Rollback steps** — if mitigation made things worse.
(9) **Escalation** — who to page if cannot resolve in N minutes.
(10) **Post-incident** — what to document, ticket templates, post-mortem template link.
(11) **References** — relevant dashboards, related runbooks, architecture diagrams, recent incident reports.
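
The eleven sections above can be treated as a checklist. The following is a minimal sketch, assuming runbooks are modelled in memory as a Python dataclass; the `Runbook` class, its field names, and the `missing_sections` helper are hypothetical and only illustrate how a team might flag incomplete runbooks (for example in CI).

```python
from dataclasses import dataclass, field, fields


@dataclass
class Runbook:
    """Hypothetical model with one field per section of the structure above."""
    title: str
    severity: str  # e.g. "SEV1", "SEV2", "SEV3"
    symptoms: list[str] = field(default_factory=list)
    prerequisites: list[str] = field(default_factory=list)
    triage_steps: list[str] = field(default_factory=list)
    mitigation_steps: list[str] = field(default_factory=list)
    verification: list[str] = field(default_factory=list)
    rollback_steps: list[str] = field(default_factory=list)
    escalation: list[str] = field(default_factory=list)
    post_incident: list[str] = field(default_factory=list)
    references: list[str] = field(default_factory=list)


def missing_sections(runbook: Runbook) -> list[str]:
    """Return the names of empty sections, in declaration order."""
    return [f.name for f in fields(runbook) if not getattr(runbook, f.name)]


# A draft that only covers symptoms and mitigation so far.
draft = Runbook(
    title="High CPU on production database",
    severity="SEV2",
    symptoms=["Application latency p99 > 2s",
              "Database CPU > 90% sustained 10min+"],
    mitigation_steps=["Scale up RDS instance class",
                      "Kill long-running query"],
)
print(missing_sections(draft))
```

A check like this does not judge the quality of a section, only its presence, so it complements rather than replaces the fire drills and reviews described under best practices.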

Best practices:
(1) Write runbooks AS YOU LEARN — every post-mortem action item should produce or update a runbook. Don't try to write all upfront.
(2) Test runbooks regularly — chaos game days, fire drills, new hire onboarding (newcomer tries runbook unaided, identifies gaps).
(3) Keep runbooks WITH the code/infrastructure — in same Git repo as service, not in separate wiki that drifts out of date.
(4) Link runbooks to alerts — every alert in PagerDuty/Opsgenie should have a runbook URL field pre-populated.
(5) Make commands copy-pastable — exact commands, not pseudo-code, with placeholders clearly marked.
(6) Include safety nets — "WARNING: this command will restart database", confirmation prompts, dry-run modes.
(7) Date and version runbooks — last updated, last validated, owner.
(8) Search-friendly titles + tagging — "database", "performance", "redis".
(9) Use diagrams when helpful (architecture, decision trees).
(10) Automate when possible — runbook automation tools (Rundeck, AWS Systems Manager Automation, Azure Automation, GitHub Actions on-demand workflows) turn manual steps into one-click execution with audit trail.
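
Practices (5), (6) and (10) can be combined in a small script. The sketch below is an assumption-laden illustration, not the API of any tool named above: `render` and `run_step` are hypothetical helpers that fill clearly marked `${NAME}` placeholders, default to dry-run mode, ask for confirmation before destructive steps, and append each outcome to an audit list. The deployment and namespace names are made up.

```python
import shlex
import string
import subprocess
from datetime import datetime, timezone


def render(template: str, values: dict[str, str]) -> str:
    """Fill ${NAME} placeholders with shell-quoted values.

    Raises KeyError if a placeholder has no value, so a half-filled
    command is never executed.
    """
    quoted = {name: shlex.quote(value) for name, value in values.items()}
    return string.Template(template).substitute(quoted)


def run_step(template, values, *, dry_run=True, destructive=False,
             confirm=input, audit_log=None):
    """Render one runbook command, optionally execute it, record the outcome."""
    command = render(template, values)
    if dry_run:
        status = "dry-run"  # nothing is executed
    elif destructive and confirm(
            f"WARNING: about to run {command!r}. Type 'yes' to continue: ") != "yes":
        status = "aborted"  # operator declined
    else:
        result = subprocess.run(shlex.split(command),
                                capture_output=True, text=True)
        status = f"exit {result.returncode}"
    entry = {
        "time": datetime.now(timezone.utc).isoformat(),
        "command": command,
        "status": status,
    }
    if audit_log is not None:
        audit_log.append(entry)
    return entry


# Dry run (the default): shows the exact command without executing it.
audit = []
step = run_step(
    "kubectl rollout restart deployment/${DEPLOYMENT} -n ${NAMESPACE}",
    {"DEPLOYMENT": "payment-service", "NAMESPACE": "prod"},
    audit_log=audit,
)
print(step["status"], "->", step["command"])
```

Passing `dry_run=False, destructive=True` would prompt the operator before executing. Making dry-run the default is a deliberate choice here: the safe path requires no extra flags, and the risky path requires two explicit ones.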

Tooling: (1) Notion, Confluence, GitHub Wiki, GitLab Wiki to host runbooks; (2) Backstage TechDocs (open-sourced by Spotify) to keep runbooks alongside the service catalog; (3) PagerDuty Runbook Automation (formerly Rundeck); (4) AWS Systems Manager Automation, Azure Automation runbooks; (5) FireHydrant runbook integrations during incidents; (6) incident.io runbook automation. Skills: practical SRE, ITIL4-HVIT.
