Postmortem (Post-incident Review)

Analyse retrospective d'un incident pour comprendre causes et prévenir récurrence.

Un Postmortem (aussi appelé Post-Incident Review ou PIR, Retrospective, After-Action Review) est un document et processus analysant un incident significatif après sa résolution pour comprendre les causes (root causes et contributing factors), documenter timeline, identifier améliorations, et générer action items concrets pour prévenir récurrence. Pratique cardinale de la culture SRE depuis Google SRE Book.

Principle clé : BLAMELESS — focus sur systems and processes, pas sur blaming individuals. Hindsight bias est tentant ("obviously they should have known") mais counter-productive : people make best decisions with information available at the time. Blame culture discourages reporting incidents, hides truth, prevents learning. Etsy, Google, Netflix, Spotify pionniers du blameless postmortem.

Quand écrire un postmortem :
- Tout SEV1/SEV2 (high impact users)
- Near-misses notables (could have been bad)
- Pattern d'incidents répétés similaires
- Customer-facing incidents > 1h downtime ou major data issue
- Security incidents (always)
- Failed deployments avec rollback
- Pas pour every minor alert (alert fatigue avoidance)

Structure typique :
(1) **Title + Date** — "Payment service 30 minutes outage on 2024-09-15".
(2) **Status** — Draft / Under Review / Published / Action items in progress / Complete.
(3) **Authors + Reviewers + Approvers** — incident commander writes initial draft, peer reviewers, leadership approves publication.
(4) **Executive summary** — 2-3 paragraphs : what happened, impact, fix, prevention.
(5) **Impact** — quantified : users affected (50k), duration (32 min), revenue lost (~\$45k), severity (SEV1), services impacted.
(6) **Timeline** — minute-by-minute UTC timeline : 14:23 deploy started, 14:31 first alert fired, 14:34 incident declared, 14:47 root cause identified, 15:02 fix deployed, 15:08 verified resolved. Include all relevant actions, decisions, dead ends.
(7) **Root cause(s)** — technical analysis : what code change/config/dependency broke ? Use 5 Whys ou similar techniques (don't stop at first technical cause — ask why systems allowed it).
(8) **Contributing factors** — gaps in testing, monitoring blindspots, missing runbook, organizational pressure, knowledge gaps.
(9) **What went well** — fast detection, clear escalation, good war room, rollback worked. (Important pour reinforce positive behaviors.)
(10) **What went wrong** — slow detection, runbook outdated, unclear ownership, communication breakdown.
(11) **Where we got lucky** — could have been worse (data corruption avoided, traffic was low). Identifies fragility.
(12) **Action items** — specific, assigned, dated. Categorize : prevent (fix root cause), detect (better monitoring), respond (better runbook/automation), reduce blast radius (better resilience). Track in ticketing system, don't let drop.
(13) **Lessons learned** — broader insights for org.
(14) **References** — Slack thread, dashboards, related incidents, code commits, runbooks, status page updates.

Process :
(1) Schedule postmortem meeting within 1-5 days of incident (memory fresh, not exhausted).
(2) Author writes draft 24-48h after incident — facts, timeline.
(3) Distribute for async review — collect comments, corrections.
(4) Meeting : walkthrough, discuss action items, assignments. 60-90 min usually.
(5) Publish broadly (entire engineering org or company depending sensitivity).
(6) Track action items to completion (RACI, sprint planning).
(7) Periodic review of past postmortems for trends ("we keep having database issues").

Tools : Confluence, Notion, GitHub PRs, Jellyfish, Incident.io (auto-generates postmortem from incident channel), FireHydrant, Rootly, Howie (Stripe internal), PagerDuty Postmortem feature.

Metrics : postmortem quality (peer review feedback), action item completion rate (% done in 90 days), reduction in incident recurrence. Compétences SRE pratiques, ITIL4-HVIT.

Préparez vos certifications IT gratuitement