SRE Lifecycle / Practices

SRE (Site Reliability Engineering) practices covering the life cycle of a service.

The SRE Lifecycle groups the operational SRE practices that cover a service's life cycle, from design to deprecation. Inspired by the Google SRE Book and the SRE Workbook, it is the framework SRE teams use to ensure reliability without killing velocity.

Main phases:
(1) Design Review: a production-readiness review (PRR) covering architecture, capacity planning, built-in observability, defined SLOs, prepared runbooks, and a ready on-call rotation. A service does not go to production until it passes the PRR.
(2) Launch: onboard the service onto the standard infrastructure, with dashboards, alerts, and on-call in place.
(3) Operate: day-to-day operations, including alert tuning, capacity adjustments, dependency monitoring, and dashboard reviews.
(4) Improve: SLO violations trigger reliability work such as chaos engineering, fault injection, load testing, and hardening.
(5) Sunset: deprecate the service properly by migrating users, monitoring the traffic decline, and decommissioning cleanly.
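
The PRR gate in phase (1) can be thought of as a checklist that blocks launch while any item is missing. The following is a minimal sketch under that assumption; the class name, field names, and helper functions are hypothetical and only mirror the items listed for the Design Review, not any standard tool.

```python
from dataclasses import dataclass, fields


@dataclass
class ReadinessReview:
    """Hypothetical PRR checklist; one flag per item named in the Design Review phase."""
    architecture_reviewed: bool = False
    capacity_plan: bool = False
    observability_built_in: bool = False
    slos_defined: bool = False
    runbooks_prepared: bool = False
    on_call_rotation_ready: bool = False


def missing_items(review: ReadinessReview) -> list[str]:
    """Return the names of checklist items that are not yet satisfied."""
    return [f.name for f in fields(review) if not getattr(review, f.name)]


def may_launch(review: ReadinessReview) -> bool:
    """A service may go to production only when nothing is missing."""
    return not missing_items(review)


if __name__ == "__main__":
    review = ReadinessReview(
        architecture_reviewed=True,
        capacity_plan=True,
        observability_built_in=True,
        slos_defined=True,
        on_call_rotation_ready=True,
    )
    print(missing_items(review))  # runbooks are still missing
    print(may_launch(review))
```

Keeping the checklist as data rather than prose makes it straightforward to surface the missing items in a launch ticket or a service catalog entry.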

Cross-cutting practices: (1) Toil reduction, i.e. automating manual, repetitive operational work (target: less than 50% of time spent on toil); (2) Error budgets that govern release velocity; (3) Blameless postmortems; (4) Game days and chaos engineering; (5) A healthy on-call rotation; (6) Quarterly capacity planning; (7) Production-readiness reviews; (8) A service catalog (Backstage).
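
Practices (1) and (2) both reduce to simple arithmetic. The sketch below assumes a request-based availability SLO and a policy of freezing releases once the budget is spent; the function names and the sample numbers are illustrative assumptions, not taken from the SRE Book.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is violated).

    Assumes a request-based SLO: slo_target=0.999 allows 0.1% of requests to fail.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


def release_allowed(slo_target: float, total_requests: int, failed_requests: int) -> bool:
    """Assumed policy: feature releases continue only while budget remains."""
    return error_budget_remaining(slo_target, total_requests, failed_requests) > 0


def toil_within_target(toil_hours: float, total_hours: float, limit: float = 0.5) -> bool:
    """Check the 'less than 50% of time on toil' target for one person or team."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours < limit


if __name__ == "__main__":
    # 99.9% SLO over 1,000,000 requests allows about 1,000 failures.
    print(round(error_budget_remaining(0.999, 1_000_000, 250), 3))
    print(release_allowed(0.999, 1_000_000, 1_200))
    print(toil_within_target(toil_hours=14, total_hours=40))
```

With these sample numbers, 250 failures leave roughly three quarters of the budget, while 1,200 failures overspend it and would pause releases under the assumed policy.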

Maturity ladder: team owns ops → SRE consults → embedded SRE → standalone SRE team → platform SRE. Related skills: DOP-C02, ITIL4-HVIT, practical SRE.
