MTTR (Mean Time To Recover/Repair/Resolve)

MTTR (Mean Time To Recover, Mean Time To Repair, Mean Time To Resolve, ou Mean Time To Respond — meanings vary) est une métrique opérationnelle mesurant le temps moyen écoulé entre la détection d'un incident et sa résolution complète. C'est l'une des 4 DORA metrics clés (Deployment Frequency, Lead Time for Changes, Change Failure Rate, MTTR/Time to Restore) utilisées pour évaluer la performance des équipes DevOps.

DORA performance tiers (State of DevOps Report 2023) :
- Elite : < 1 hour
- High : < 1 day
- Medium : 1 day - 1 week
- Low : 1 week - 1 month

Four MTTR variants distincts (source de confusion fréquente) :
(1) Mean Time To Detect (MTTD) — temps entre incident occurrence et detection (alert fired). Driven par observability quality.
(2) Mean Time To Respond (MTTR1) — temps entre detection et acknowledgment par humain (incident commander pickup).
(3) Mean Time To Repair / Resolve (MTTR2) — temps entre detection et restoration service (incident closed).
(4) Mean Time Between Failures (MTBF) — moyenne entre 2 incidents (reliability).

Décomposition utile dans post-mortem : Incident timestamps (T0 = service degradation, T1 = first alert fired, T2 = ack par on-call, T3 = root cause identified, T4 = mitigation deployed, T5 = service fully restored, T6 = customer impact ended). Métriques : detection time (T1-T0), response time (T2-T1), mitigation time (T5-T2), total customer impact (T6-T0).

Factors améliorant MTTR :
(1) Observability quality — golden signals dashboards (latency, errors, traffic, saturation), structured logging, distributed tracing, anomaly detection. Cannot fix what you can't see.
(2) Alerting maturity — alerts actionable (no alert fatigue), routing correct via PagerDuty/Opsgenie, on-call rotation healthy, runbooks attachés.
(3) Incident response maturity — incident commander role clear, severity levels defined (SEV1-4), communication channels (Slack #incidents, status page), war room procedures.
(4) Automation — auto-remediation playbooks (auto-restart services, auto-scale, auto-rollback bad deploys), runbooks automation, Kubernetes self-healing.
(5) Architecture resilience — circuit breakers, retries with backoff, bulkheads isolation, graceful degradation, redundancy multi-AZ/region.
(6) Deployment safety — canary, feature flags for instant kill switches, fast rollback.
(7) Post-incident learning — blameless post-mortems, action items tracking, sharing across orgs.

MTTR pitfalls / metrics gaming : (1) median > mean often more meaningful (outliers skew mean) ; (2) MTTR can be gamed by closing tickets early then reopening (define clear closure criteria) ; (3) excluding planned maintenance ; (4) low MTTR with high incident frequency = treating symptoms not causes ; (5) MTTR alone insufficient — combine with MTBF, change failure rate, severity-weighted incidents.

Incident management tools : PagerDuty, Opsgenie, FireHydrant, Incident.io, Rootly, Statuspage (Atlassian), Better Stack, Datadog Incident Management. SRE practices and Google SRE Book are seminal references.

Préparez vos certifications IT gratuitement