SLO Burn Rate

The rate at which a service consumes its error budget; the basis of modern SRE alerting.

SLO burn rate is the speed at which a service consumes its error budget, relative to the rate that would use up exactly the whole budget over the SLO window. Formula: burn rate = observed error rate / (1 - SLO target). A burn rate of 1 means the service is consuming its budget at exactly the planned rate, so the budget lasts the full SLO window; a burn rate above 1 means the budget is being consumed faster than planned.

Example for a 99.9% SLO (error budget = 0.1%) over 30 days:
- Observed error rate = 0.1% → burn rate = 1 → on track, the budget lasts 30 days
- Error rate = 1% → burn rate = 10 → budget consumed 10x faster → exhausted in 3 days
- Error rate = 10% → burn rate = 100 → entire 30-day budget gone in 7 h 12 min
- Error rate = 50% → burn rate = 500 → budget consumed in about 86 minutes
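The arithmetic behind these examples can be sketched in a few lines of Python. The function names `burn_rate` and `hours_to_exhaustion` are illustrative, not part of any library, and the sketch assumes the error rate stays constant over the whole window.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Burn rate = observed error rate / error budget, with budget = 1 - SLO target."""
    return error_rate / (1.0 - slo_target)


def hours_to_exhaustion(rate: float, slo_window_days: float = 30.0) -> float:
    """Hours until the whole budget is gone, assuming the burn rate stays constant."""
    if rate <= 0:
        return float("inf")  # no errors: the budget is never exhausted
    return slo_window_days * 24.0 / rate


# Reproduce the 99.9% SLO examples (error rates given as fractions).
for error_rate in (0.001, 0.01, 0.10, 0.50):
    br = burn_rate(error_rate, slo_target=0.999)
    print(f"error rate {error_rate:.1%}: burn rate {br:.0f}, "
          f"budget lasts {hours_to_exhaustion(br):.1f} h")
```

A burn rate of 100 gives 720 h / 100 = 7.2 h, which is the 7 h 12 min quoted in the list.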

Multi-window, multi-burn-rate alerting (Google SRE Workbook, ch. 5) is the gold standard for modern alerting:

(1) Fast burn alert (page): burn rate >= 14.4 over 1 hour AND burn rate >= 14.4 over 5 minutes (the short window confirms the problem is still ongoing). At that rate, 2% of the 30-day budget is consumed in 1 hour. Action: page the incident commander.

(2) Medium burn alert (ticket, no page): burn rate >= 6 over 6 hours AND >= 6 over 30 minutes. At that rate, 5% of the budget is consumed in 6 hours. Action: create a ticket for review.

(3) Slow burn alert (low urgency): burn rate >= 3 over 24 hours AND >= 3 over 2 hours. At that rate, 10% of the budget is consumed in 24 hours. Action: review at standup.
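The three tiers can be expressed as a small decision function. This is a minimal sketch, assuming burn rates have already been computed per window and are passed in as a dictionary keyed by window name; `TIERS` and `classify` are hypothetical names chosen for illustration.

```python
from typing import Mapping, Optional

# (severity, long window, short window, burn-rate threshold), most urgent first.
TIERS = (
    ("page", "1h", "5m", 14.4),
    ("ticket", "6h", "30m", 6.0),
    ("low_urgency", "24h", "2h", 3.0),
)


def classify(burn_rates: Mapping[str, float]) -> Optional[str]:
    """Return the most urgent tier whose long AND short windows both reach the threshold."""
    for severity, long_w, short_w, threshold in TIERS:
        long_ok = burn_rates.get(long_w, 0.0) >= threshold
        short_ok = burn_rates.get(short_w, 0.0) >= threshold
        if long_ok and short_ok:
            return severity
    return None  # no tier is firing


# Ongoing outage: both the 1h and 5m windows are hot, so this pages.
print(classify({"1h": 20.0, "5m": 18.0}))
# Recovered incident: the 1h window is still high but the 5m window has cooled down.
print(classify({"1h": 20.0, "5m": 0.5}))
```

The second call shows why the short window matters: once the incident is over, the long window alone would keep the alert firing for up to an hour.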

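The thresholds (14.4, 6, 3) come straight from the math: the burn rate needed to consume X% of the budget in a given time is X% × SLO period / alert window. For example, 2% in 1 hour on a 30-day (720 h) SLO gives 0.02 × 720 / 1 = 14.4. The Google SRE Workbook publishes a table of such values; its recommended table lists 14.4 (2% in 1 hour) and 6 (5% in 6 hours), both as pages, plus a burn rate of 1 over 3 days (10% of the budget) as a ticket. The 3x over 24 hours tier used here follows the same formula.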
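As a worked check, the three thresholds can be recomputed from the budget fraction and the alert window. The sketch below assumes a 30-day SLO period; `burn_rate_threshold` is an illustrative name.

```python
def burn_rate_threshold(budget_fraction: float,
                        alert_window_hours: float,
                        slo_period_days: float = 30.0) -> float:
    """Burn rate that consumes `budget_fraction` of the budget within the alert window."""
    slo_period_hours = slo_period_days * 24.0
    return budget_fraction * slo_period_hours / alert_window_hours


# (budget fraction, long alert window in hours) for the three tiers in the text.
for fraction, window_h in ((0.02, 1.0), (0.05, 6.0), (0.10, 24.0)):
    threshold = burn_rate_threshold(fraction, window_h)
    print(f"{fraction:.0%} of budget in {window_h:g} h -> burn rate {threshold:.1f}")
```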

Advantages of multi-burn-rate alerts over static threshold alerts:
(1) Less alert noise: an alert fires only when errors consume the budget at a significant rate, not on a transient spike (5 errors out of 100k requests is 0.005%, which is noise).
(2) Real problems surface early: the fast burn alert fires well before the budget is fully exhausted, leaving time to respond.
(3) Auto-tuned to the SLO: if the SLO changes from 99.9% to 99.95%, the burn-rate thresholds stay the same and the underlying error-rate thresholds adjust automatically, because they are derived from the formula.
(4) Symptom-based: alerts fire on user-facing impact (SLI degradation) rather than on causes (high CPU, full disk), keeping the focus on what matters to customers.
(5) Multiple windows prevent false alarms: the short window (5 min) confirms the incident is still ongoing before anyone is paged.
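Point (3) can be made concrete: the burn-rate threshold is fixed, and the error rate that triggers it scales with the error budget. A minimal sketch, with `error_rate_threshold` as an illustrative name:

```python
def error_rate_threshold(burn_rate: float, slo_target: float) -> float:
    """Error rate (as a fraction) at which the given burn rate is reached."""
    return burn_rate * (1.0 - slo_target)


# Same 14.4x fast-burn threshold, two different SLO targets.
for slo in (0.999, 0.9995):
    rate = error_rate_threshold(14.4, slo)
    print(f"SLO {slo:.2%}: fast-burn alert fires at an error rate of {rate:.2%}")
```

Tightening the SLO from 99.9% to 99.95% halves the error budget, so the fast-burn alert fires at 0.72% errors instead of 1.44% without any rule being edited by hand.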

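Prometheus implementation example:
```promql
# Fast burn for a 99.9% availability SLO (budget 0.001): 1h window, confirmed by 5m
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
  /
  sum(rate(http_requests_total[1h]))
) / 0.001 >= 14.4
and
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) / 0.001 >= 14.4
```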
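The same query shape can be generated for every tier rather than written by hand, which is roughly what rule generators automate. The sketch below only builds PromQL strings; it assumes the `http_requests_total` metric and `status` label from the example above, and `burn_rate_expr`, `alert_expr`, and `TIERS` are hypothetical helper names.

```python
# (severity, long window, short window, burn-rate threshold) as described in the text.
TIERS = (
    ("page", "1h", "5m", 14.4),
    ("ticket", "6h", "30m", 6.0),
    ("low_urgency", "24h", "2h", 3.0),
)


def burn_rate_expr(window: str, slo_target: float) -> str:
    """PromQL expression for the burn rate over one window."""
    budget = 1.0 - slo_target
    return (
        f'(sum(rate(http_requests_total{{status=~"5.."}}[{window}])) '
        f'/ sum(rate(http_requests_total[{window}]))) / {budget:g}'
    )


def alert_expr(long_w: str, short_w: str, threshold: float, slo_target: float) -> str:
    """Both the long and the short window must reach the threshold."""
    return (
        f"{burn_rate_expr(long_w, slo_target)} >= {threshold:g} "
        f"and {burn_rate_expr(short_w, slo_target)} >= {threshold:g}"
    )


for severity, long_w, short_w, threshold in TIERS:
    print(f"# {severity}")
    print(alert_expr(long_w, short_w, threshold, slo_target=0.999))
```

In a real setup these expressions would normally be backed by recording rules so that the long-window ratios are not recomputed on every evaluation.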

Tools: (1) Pyrra (open source), which generates Prometheus rules from SLO YAML definitions; (2) Sloth (open source), which similarly generates recording rules and alerts; (3) Nobl9 (commercial SLO platform); (4) Datadog SLOs, with native burn rate alerts; (5) Grafana SLO; (6) Google Cloud Monitoring SLO monitoring, with built-in multi-burn-rate alerts; (7) Honeycomb BubbleUp + SLOs.

Best practices: (1) define SLIs and SLOs from user journeys, not infrastructure metrics; (2) start with conservative SLOs, then adjust them based on what is achievable and on business commitments; (3) use multi-burn-rate alerts (page, ticket, low urgency); (4) review SLOs quarterly with stakeholders; (5) link error budget exhaustion to engineering priorities (reliability work). Reference: "Implementing Service Level Objectives" by Alex Hidalgo.
