Canary Deployment

Progressive rollout of a new version to a subset of users.

Canary deployment is a progressive release strategy in which a new version (v2) is first exposed to a small percentage of users or traffic (typically 1-5%), monitored closely, and then gradually expanded (10%, 25%, 50%, 100%) as long as the metrics stay healthy. If a degradation is detected, traffic is rolled back automatically. The name comes from the canaries that miners carried underground, whose death signaled the presence of toxic gas: the new version is the "canary" that surfaces problems before all users are affected.

Detailed workflow: (1) v1 runs in production and serves 100% of traffic; (2) deploy v2 alongside v1 (n new pods/instances); (3) route 5% of traffic to v2 through load balancer weighted routing or a service mesh; (4) monitor the golden signals (latency, errors, traffic, saturation), business metrics (conversion, revenue), and the exception rate; (5) if every indicator stays healthy for the observation period (15-60 min), move to the next percentage; (6) repeat steps 4-5 until v2 serves 100%; (7) decommission v1; (8) if a degradation is detected at any stage, roll back automatically (route 100% of traffic back to v1).
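The promote-or-roll-back loop in this workflow can be sketched in Python as follows. This is a minimal illustration, not the implementation of any particular tool: `run_canary`, `set_canary_weight`, and `is_healthy` are hypothetical names, the two callbacks stand in for real routing and monitoring calls, and the observation period is assumed to happen inside `is_healthy`.

```python
from typing import Callable, Sequence


def run_canary(
    stages: Sequence[int],
    set_canary_weight: Callable[[int], None],
    is_healthy: Callable[[int], bool],
) -> bool:
    """Walk through the traffic stages and roll back on the first unhealthy check.

    Returns True if v2 reached the last stage, False if traffic was rolled back.
    """
    for percent in stages:
        set_canary_weight(percent)   # steps 3 and 5: shift this share of traffic to v2
        if not is_healthy(percent):  # step 4: observe metrics at this stage
            set_canary_weight(0)     # step 8: route 100% of traffic back to v1
            return False
    return True                      # step 7 (decommissioning v1) is left to the caller


if __name__ == "__main__":
    applied = []
    # Simulated health check (assumption): v2 starts failing once it gets 25% of traffic.
    promoted = run_canary([5, 10, 25, 50, 100], applied.append, lambda p: p < 25)
    print(promoted, applied)
```

Passing the routing and health checks as callbacks keeps the stage logic independent of the load balancer or service mesh actually used.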

Advantages over Blue/Green: (1) limited blast radius (a faulty v2 affects 5% of users instead of 100%); (2) real feedback from production traffic (Blue/Green typically relies on synthetic smoke tests before the switch); (3) infrastructure cost grows incrementally (Blue/Green needs roughly 2x capacity); (4) detection of subtle issues (high-percentile latency, edge-case errors) that pre-production testing rarely reveals.

Routing mechanisms: (1) Layer 7 load balancers with weighted target groups (ALB on AWS, Application Gateway on Azure, Cloud Load Balancing on GCP); (2) service meshes (Istio VirtualService weighted routing, Linkerd traffic split, Consul, AWS App Mesh); (3) API gateway routing (Kong, Apigee); (4) feature flags (LaunchDarkly, Split.io, Unleash, ConfigCat), which allow fine-grained per-user routing based on attributes; (5) Cloudflare Workers / CloudFront Functions, which run the canary at the edge based on cookies or headers.
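Per-user routing of the kind feature flags provide is often built on deterministic hashing, so that a given user always lands on the same version. The sketch below illustrates that general idea and is not the algorithm of any product listed above; `in_canary`, the `salt` value, and the 10,000-bucket granularity are assumptions chosen for the example.

```python
import hashlib


def in_canary(user_id: str, percent: float, salt: str = "canary-v2") -> bool:
    """Return True if this user falls in the canary cohort for the given percentage.

    The user ID is hashed into one of 10,000 buckets; users whose bucket is below
    the threshold see v2. The same user always maps to the same bucket, and raising
    `percent` only adds users to the cohort, it never removes any.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100                        # 5% -> buckets 0..499


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(20_000)]
    share = sum(in_canary(u, 5) for u in users) / len(users)
    print(f"share of users routed to v2: {share:.2%}")
```

Changing the salt for each release reshuffles the cohort, so the same users are not always the first to receive new versions.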

Metric-based auto-promotion tools: (1) Flagger (CNCF; integrates with Istio, Linkerd, App Mesh, NGINX, Gloo Edge, Contour, Traefik), which analyzes metrics from Prometheus/Datadog/CloudWatch and promotes or rolls back automatically; (2) Argo Rollouts (similar, native to the Argo ecosystem); (3) Spinnaker Kayenta (Netflix automated canary analysis), which statistically compares v1 and v2 metrics and issues a Pass/Fail/Marginal judgment; (4) Harness Continuous Verification; (5) AWS CodeDeploy with CloudWatch alarms as triggers.
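To make the idea of a Pass/Fail/Marginal judgment concrete, here is a deliberately simplified Python sketch. It is not Kayenta's algorithm (real canary analysis uses statistical tests over time series); it only compares one aggregated value per metric. The function name `judge`, the 10% tolerance, and the score thresholds are assumptions for illustration, and every metric is assumed to be "lower is better".

```python
from enum import Enum
from typing import Mapping


class Judgment(Enum):
    PASS = "Pass"
    MARGINAL = "Marginal"
    FAIL = "Fail"


def judge(
    baseline: Mapping[str, float],
    canary: Mapping[str, float],
    tolerance: float = 0.10,
    pass_score: float = 0.95,
    marginal_score: float = 0.75,
) -> Judgment:
    """Score the canary as the fraction of shared metrics within tolerance of baseline."""
    shared = baseline.keys() & canary.keys()
    if not shared:
        raise ValueError("no metric is present in both baseline and canary")
    # A metric passes if the canary is at most `tolerance` worse than the baseline.
    passed = sum(1 for name in shared if canary[name] <= baseline[name] * (1 + tolerance))
    score = passed / len(shared)
    if score >= pass_score:
        return Judgment.PASS
    if score >= marginal_score:
        return Judgment.MARGINAL
    return Judgment.FAIL


if __name__ == "__main__":
    v1 = {"error_rate": 0.010, "p99_ms": 200.0, "cpu": 0.50, "memory": 0.60}
    v2 = {"error_rate": 0.010, "p99_ms": 320.0, "cpu": 0.50, "memory": 0.60}
    print(judge(v1, v2).value)  # one of four metrics regressed
```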

Best practices: (1) start small (1%), since even 5% is too much for a product with 100M users; (2) choose the cohort deliberately, either a random subset or a specific group (employees first, free tier first, low-revenue regions first); (3) define rollback criteria explicitly (error rate > X%, p99 latency > Y ms, business metric drop > Z%); (4) allow enough soak time at each stage, because data needs time to become statistically significant; (5) monitor tail latencies (p95, p99), not just averages; (6) use feature flags as an instant kill switch that needs no redeploy; (7) document the deployment and post-deployment validation procedures.
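Practices (3) and (5) can be combined in a small check like the one below, written as a sketch under stated assumptions: `RollbackCriteria`, `percentile`, and `should_rollback` are hypothetical names, the thresholds are placeholders for the X and Y values above, and the percentile uses the simple nearest-rank method rather than whatever a given monitoring system computes.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RollbackCriteria:
    max_error_rate: float       # e.g. 0.01 means "roll back above 1% errors"
    max_p99_latency_ms: float   # e.g. 500.0


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile of an empty sample is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def should_rollback(
    latencies_ms: Sequence[float],
    errors: int,
    requests: int,
    criteria: RollbackCriteria,
) -> bool:
    """Return True if the error rate or the p99 latency breaches its threshold."""
    if requests == 0 or not latencies_ms:
        return False  # assumption: no data yet means no verdict, keep observing
    error_rate = errors / requests
    p99 = percentile(latencies_ms, 99)
    return error_rate > criteria.max_error_rate or p99 > criteria.max_p99_latency_ms


if __name__ == "__main__":
    # 98 fast requests and 2 very slow ones: the mean (138 ms) looks acceptable,
    # but the p99 exposes the slow tail.
    samples = [100.0] * 98 + [2000.0] * 2
    limits = RollbackCriteria(max_error_rate=0.01, max_p99_latency_ms=500.0)
    print(sum(samples) / len(samples), percentile(samples, 99))
    print(should_rollback(samples, errors=0, requests=100, criteria=limits))
```

The sample data shows why averages alone are misleading: the mean stays well under the 500 ms limit while the p99 does not.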

Canary vs A/B testing: a canary is a deployment safety mechanism (v2 has the same intended functionality and is simply rolled out gradually); A/B testing is feature experimentation (different variants are tested for business impact, and both may be kept indefinitely). The two are often combined: a feature-flagged canary deployment, plus an A/B test of the new feature within the canary cohort. Exam skills: DOP-C02, AZ-400.
