MTTD (Mean Time To Detect)

Average time between the start of an incident and its detection by monitoring.

MTTD (Mean Time To Detect) is the average time between the start of an incident and its detection, either by monitoring/alerting systems or by a human. It is the first component of total MTTR: a system cannot be repaired until it has been detected as broken.

Calculation: MTTD = sum(T_detection - T_incident_start) / number of incidents.
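
The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Incident` record, its field names, and the sample timestamps are all hypothetical and stand in for whatever an incident tracker actually exports.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    started_at: datetime   # T_incident_start
    detected_at: datetime  # T_detection


def mean_time_to_detect(incidents):
    """Return sum(T_detection - T_incident_start) / number of incidents."""
    if not incidents:
        raise ValueError("MTTD is undefined for zero incidents")
    total = sum((i.detected_at - i.started_at for i in incidents), timedelta())
    return total / len(incidents)


# Made-up sample data: detection delays of 4, 20 and 6 minutes.
incidents = [
    Incident(datetime(2024, 1, 10, 9, 0), datetime(2024, 1, 10, 9, 4)),
    Incident(datetime(2024, 2, 3, 14, 30), datetime(2024, 2, 3, 14, 50)),
    Incident(datetime(2024, 3, 21, 22, 15), datetime(2024, 3, 21, 22, 21)),
]

print(mean_time_to_detect(incidents))  # 0:10:00
```

Note that the result depends entirely on how `started_at` is established; in practice it is often reconstructed after the fact from logs or metrics.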

Typical MTTD by maturity level:
- Excellent: < 5 minutes (synthetic monitoring, ML anomaly detection, dense alerting on golden signals)
- Good: 5-30 minutes (standard metric-based alerts)
- Poor: > 1 hour (frequently detected via user reports or social media: "down detector", "is X down?" Twitter posts)
- Worst: detected only via customer support tickets or during the post-incident review

Factors that determine MTTD:
(1) Monitoring coverage: are all critical services instrumented? Are the golden signals (latency, errors, traffic, saturation) covered? Are business metrics (orders/min, signups/hour) covered? Tools: Datadog, New Relic, Dynatrace, Grafana Cloud, Prometheus + Alertmanager, CloudWatch, Azure Monitor, Splunk.
(2) Alerting rule quality: are alerts symptom-based or cause-based? Are thresholds appropriate (few false positives, so no alert fatigue)? Are alerts SLO-based burn rate alerts or static thresholds? See the "Alerting on SLOs" pattern from Google's Site Reliability Workbook.
(3) Synthetic monitoring: proactive probes from external locations (CloudWatch Synthetics, Datadog Synthetics, Pingdom, UptimeRobot, Better Stack) detect issues even before real users are affected.
(4) Anomaly detection: ML-based detection (CloudWatch Anomaly Detection, Datadog Watchdog, Dynatrace Davis AI, Honeycomb BubbleUp) catches unusual patterns missed by static thresholds.
(5) Log analytics: error rate monitoring on logs (Splunk Enterprise Security, Datadog Logs, Elastic, Sumo Logic) surfaces patterns that do not appear in metrics.
(6) Distributed tracing: slow request detection across microservices (Jaeger, Tempo, Honeycomb, Datadog APM, X-Ray).
(7) End-user monitoring (RUM, Real User Monitoring): frontend errors and slowness (Sentry, Datadog RUM, Dynatrace RUM, FullStory).
(8) Status page subscriptions / SaaS monitoring: detect when downstream dependencies fail.
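
To make factor (3) concrete, here is a minimal sketch of a synthetic probe using only the Python standard library. It is not how any of the commercial products listed above work internally; the function names, the 2-second latency budget, and the "status below 400 is healthy" rule are all assumptions chosen for illustration. The fetcher is injectable so the health logic can be exercised without network access.

```python
import time
import urllib.error
import urllib.request


def http_fetch(url, timeout=5.0):
    """Return the HTTP status code, or None if the request failed entirely."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx/5xx).
        return exc.code
    except (urllib.error.URLError, OSError):
        # DNS failure, connection refused, timeout, etc.
        return None


def probe(url, fetch=http_fetch, max_latency_s=2.0):
    """Run one synthetic check and report status, latency and health."""
    start = time.monotonic()
    status = fetch(url)
    latency = time.monotonic() - start
    healthy = (
        status is not None
        and 200 <= status < 400
        and latency <= max_latency_s
    )
    return {"url": url, "status": status, "latency_s": latency, "healthy": healthy}


# Usage with a fake fetcher (no network needed); the URL is a placeholder.
result = probe("https://example.invalid/health", fetch=lambda url: 503)
print(result["healthy"])  # False
```

A real deployment would run such a probe on a schedule from several locations and feed the results into the alerting pipeline; that scheduling is deliberately left out here.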

MTTD pitfalls:
(1) Measuring from the wrong timestamps: counting alert acknowledgment as detection, or starting the clock later than the actual incident start (or first user impact), distorts the measurement.
(2) Alert fatigue: too many false positives lead engineers to ignore alerts, slowing real detection. Cure: symptom-based alerting, alert tuning, error budgets.
(3) Underreporting: incidents detected very late are not always counted in MTTD if no formal incident was declared.
(4) Blast radius bias: global incidents are detected fast (everyone screams), regional or partial incidents slowly.
(5) Cold paths (rarely exercised features) are detected only when used, which makes synthetic monitoring crucial.
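
One way to expose the blast radius and cold path biases described above is to break MTTD down by incident scope instead of reporting a single average. The sketch below assumes each incident has already been labelled with a scope; the labels and the numbers are invented for illustration only.

```python
from collections import defaultdict
from statistics import mean


def mttd_by_scope(incidents):
    """Group (scope, minutes_to_detect) pairs and average each group."""
    buckets = defaultdict(list)
    for scope, minutes in incidents:
        buckets[scope].append(minutes)
    return {scope: mean(values) for scope, values in buckets.items()}


# Made-up sample: (scope label, minutes from incident start to detection).
sample = [
    ("global", 3),
    ("global", 5),
    ("regional", 40),
    ("regional", 20),
    ("cold_path", 180),
]

overall = mean(minutes for _, minutes in sample)
per_scope = mttd_by_scope(sample)

print(overall)    # 49.6
print(per_scope)  # {'global': 4, 'regional': 30, 'cold_path': 180}
```

In this invented data set the overall figure hides the fact that global incidents are caught in minutes while the cold path incident took three hours.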

Strategies for reducing MTTD:
(1) SLO-based alerting (multi-burn-rate alerts): fast-burn alerts catch incidents in minutes;
(2) Synthetic checks every 1-5 minutes from multiple locations;
(3) Anomaly detection on key business metrics (orders, signups, payments);
(4) Distributed tracing for latency outliers;
(5) Push-based notifications (rather than polling) where possible;
(6) Cross-service dependency monitoring (canaries for upstream services);
(7) Customer feedback loops (in-app feedback, support ticket trends).
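
Strategy (1) can be illustrated with a small burn rate calculation. Burn rate is the observed error ratio divided by the error budget (1 - SLO target); a multi-window alert fires only when both a long and a short window exceed the threshold, so it stops firing soon after recovery. The function names are hypothetical, and the 14.4 threshold is the value commonly cited for a 1-hour/5-minute window pair on a 30-day SLO; treat it as an assumption to tune, not a fixed rule.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are occurring."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("SLO target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_error_ratio, short_window_error_ratio,
                slo_target, threshold):
    """Fire only if both the long and the short window burn too fast."""
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )


SLO = 0.999            # assumed availability target
FAST_BURN = 14.4       # assumed threshold for the 1h / 5m window pair

# 2% errors in both windows: burn rate is about 20, so page.
print(should_page(0.02, 0.02, SLO, FAST_BURN))    # True

# Long window still elevated, short window recovered: no page.
print(should_page(0.02, 0.0005, SLO, FAST_BURN))  # False
```

In a real setup the two error ratios would come from the metrics backend (for example as queries over 1-hour and 5-minute ranges) rather than being passed in by hand.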

The MTTD vs MTTR distinction matters in post-mortems: if detection was fast but resolution slow, focus on response and runbooks; if detection was slow, focus on monitoring and alerting improvements.
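
For a single incident, this split can be quantified as the share of total incident time spent undetected. The sketch below is one possible way to do it; the function name and the 0.5 cutoff used to pick a focus area are arbitrary assumptions, and the timestamps are invented.

```python
from datetime import datetime


def detection_share(started_at, detected_at, resolved_at):
    """Fraction of the incident's total duration that passed before detection."""
    total = (resolved_at - started_at).total_seconds()
    if total <= 0:
        raise ValueError("resolved_at must be after started_at")
    return (detected_at - started_at).total_seconds() / total


# Made-up incident: 45 minutes undetected, then 15 minutes to resolve.
share = detection_share(
    started_at=datetime(2024, 5, 2, 10, 0),
    detected_at=datetime(2024, 5, 2, 10, 45),
    resolved_at=datetime(2024, 5, 2, 11, 0),
)

# Arbitrary cutoff: above 0.5, detection dominated the incident.
focus = "monitoring/alerting" if share > 0.5 else "response/runbooks"
print(share, focus)  # 0.75 monitoring/alerting
```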
