intermediate By StackPractices

Canary Deployment — Gradual Rollouts with Safety Controls

A practical guide to canary deployments: traffic splitting strategies, automated promotion, rollback triggers, and safely rolling out new versions to a subset of users.

Overview

Canary deployment releases a new version to a small subset of users first, then gradually increases traffic while monitoring for issues. It combines the safety of controlled exposure with the speed of continuous deployment, catching problems before they impact all users.

This guide covers traffic splitting, health metrics, automated promotion, and rollback strategies.

When to Use

  • You want to reduce risk when deploying new features
  • Your service has enough traffic to get meaningful metrics from 1-5% of users
  • You need to validate performance under real load before full rollout
  • You want to A/B test behavior alongside infrastructure changes
  • A gradual shift of traffic, which can be reversed at any step, is preferable to an instant switch (blue-green)

Core Concepts

| Concept | Description |
|---|---|
| Canary Group | Initial subset of users receiving the new version |
| Traffic Split | Percentage of requests routed to canary vs baseline |
| Promotion | Increasing canary traffic percentage after validation |
| Rollback | Reducing canary traffic to zero if issues detected |
| Bake Time | Minimum observation period before next promotion step |
| Metric Threshold | Automated criteria for promotion or rollback |

Traffic Splitting Strategies

| Strategy | How It Works | Best For |
|---|---|---|
| Random percentage | Randomly split X% of requests | Stateless APIs |
| User-based | Route specific users/groups consistently | Session-aware apps |
| Geographic | Route by region or data center | Multi-region deployments |
| Header-based | Route by request header (internal, beta) | Testing with specific clients |
| Progressive | Start at 1%, double every N minutes | High-traffic services |
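The user-based strategy depends on assigning each user to the same version on every request. One common way to do that is to hash a stable identifier into a fixed bucket and compare the bucket to the current canary percentage. The sketch below is a minimal illustration, not the mechanism of any particular router; the function names and the `salt` value are assumptions made for this example.

```python
import hashlib


def canary_bucket(user_id: str, salt: str = "myapp-canary") -> float:
    """Map a user ID to a stable bucket between 0.00 and 99.99."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer and reduce it to
    # one of 10,000 buckets, giving 0.01% granularity.
    value = int.from_bytes(digest[:8], "big")
    return (value % 10_000) / 100


def routes_to_canary(user_id: str, canary_percentage: float) -> bool:
    """Return True if this user falls inside the current canary percentage."""
    return canary_bucket(user_id) < canary_percentage


if __name__ == "__main__":
    for pct in (1, 10, 50):
        print(pct, routes_to_canary("user-42", pct))
```

Because a user's bucket never changes, raising the percentage only adds users to the canary group. Nobody who has already seen the new version is moved back to the old one during promotion.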

Step-by-Step Canary Deployment

1. Define Canary Criteria

Set clear, measurable thresholds before deploying:

# Example: Canary analysis configuration
canary:
  stages:
    - name: "1% canary"
      traffic_percentage: 1
      bake_time_minutes: 15
      thresholds:
        error_rate: "< 0.1%"
        latency_p95: "< 200ms"
        cpu_utilization: "< 70%"
    - name: "10% canary"
      traffic_percentage: 10
      bake_time_minutes: 30
      thresholds:
        error_rate: "< 0.1%"
        latency_p95: "< 200ms"
    - name: "50% canary"
      traffic_percentage: 50
      bake_time_minutes: 30
      thresholds:
        error_rate: "< 0.1%"
        latency_p95: "< 200ms"
    - name: "100% rollout"
      traffic_percentage: 100

Key metrics to monitor:

  • Technical: Error rate, latency (p50/p95/p99), throughput, CPU, memory
  • Business: Conversion rate, cart abandonment, login success, payment completion
  • Custom: Feature-specific KPIs relevant to the change being deployed
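Threshold strings such as "< 0.1%" or "< 200ms" in the stage configuration have to be turned into comparisons before they can gate a promotion. The following sketch shows one way to evaluate them against observed values. The expression grammar (an operator, a number, and an optional `%` or `ms` unit) and the function names are assumptions for illustration; real tools define their own formats.

```python
import operator
import re

_OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}
# Operator, number, optional unit. The unit is only documentation here:
# observed values are expected in the same unit as the threshold.
_PATTERN = re.compile(r"^\s*(<=|>=|<|>)\s*([0-9]*\.?[0-9]+)\s*(%|ms)?\s*$")


def check_threshold(expression: str, observed: float) -> bool:
    """Return True if the observed value satisfies the threshold expression."""
    match = _PATTERN.match(expression)
    if match is None:
        raise ValueError(f"Unsupported threshold expression: {expression!r}")
    op, limit, _unit = match.groups()
    return _OPS[op](observed, float(limit))


def failed_thresholds(thresholds: dict, observed: dict) -> list:
    """List the metric names that are missing or violate their threshold."""
    failures = []
    for name, expression in thresholds.items():
        if name not in observed or not check_threshold(expression, observed[name]):
            failures.append(name)
    return failures


if __name__ == "__main__":
    stage = {"error_rate": "< 0.1%", "latency_p95": "< 200ms", "cpu_utilization": "< 70%"}
    metrics = {"error_rate": 0.05, "latency_p95": 250, "cpu_utilization": 65}
    print(failed_thresholds(stage, metrics))
```

Treating a missing metric as a failure is a deliberate choice in this sketch: a canary that reports no data should not be promoted by default.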

2. Deploy the Canary

Route a small percentage of traffic to the new version:

# Example: Istio virtual service for canary
# Assumes a DestinationRule for host myapp that defines the stable and canary subsets
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp-canary
spec:
  hosts:
    - myapp.example.com
  http:
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: myapp
            subset: canary
          weight: 100
    - route:
        - destination:
            host: myapp
            subset: stable
          weight: 99
        - destination:
            host: myapp
            subset: canary
          weight: 1

# Example: NGINX weighted upstream
upstream myapp {
    server stable.internal:8080 weight=99;
    server canary.internal:8080 weight=1;
}

# Example: Kubernetes with Flagger
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  service:
    port: 80
    targetPort: 8080
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500
        interval: 1m

3. Monitor and Validate

Watch canary metrics against baseline:

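# Example: Automated canary analysis script
# Assumption: metrics come from Prometheus scraping Istio's standard metrics.
# PROMETHEUS_URL and both queries are illustrative; adapt them to your setup.
import time

import requests

PROMETHEUS_URL = "http://prometheus.monitoring:9090"  # hypothetical address
MIN_ERROR_RATE = 0.1  # percent; floor so a 0% baseline does not fail every canary

def query_prometheus(promql):
    """Run an instant query and return the first sample as a float (0.0 if empty)."""
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def get_error_rate(version):
    """Percentage of 5xx responses over the last 5 minutes."""
    selector = f'destination_version="{version}"'
    errors = f'sum(rate(istio_requests_total{{{selector},response_code=~"5.."}}[5m]))'
    total = f'sum(rate(istio_requests_total{{{selector}}}[5m]))'
    return query_prometheus(f"100 * {errors} / {total}")

def get_p95_latency(version):
    """p95 request latency in milliseconds over the last 5 minutes."""
    selector = f'destination_version="{version}"'
    return query_prometheus(
        "histogram_quantile(0.95, sum by (le) "
        f"(rate(istio_request_duration_milliseconds_bucket{{{selector}}}[5m])))"
    )

def analyze_canary(baseline_version, canary_version, duration_minutes=15):
    end_time = time.time() + (duration_minutes * 60)

    while time.time() < end_time:
        # Fetch metrics from monitoring system
        baseline_errors = get_error_rate(baseline_version)
        canary_errors = get_error_rate(canary_version)
        baseline_latency = get_p95_latency(baseline_version)
        canary_latency = get_p95_latency(canary_version)

        # Compare against the baseline and roll back on the first breach
        if canary_errors > max(baseline_errors * 1.5, MIN_ERROR_RATE):
            return "ROLLBACK", f"Error rate too high: {canary_errors:.2f}%"
        if canary_latency > baseline_latency * 1.2:
            return "ROLLBACK", f"Latency regression: {canary_latency:.0f}ms"

        time.sleep(60)

    return "PROMOTE", "All thresholds passed"

result, reason = analyze_canary("v1.2.3", "v1.3.0")
print(f"Decision: {result} - {reason}")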

Monitoring checklist:

  • Compare canary metrics to baseline, not just absolute values
  • Look for error rate spikes, latency regressions, and resource exhaustion
  • Monitor business metrics (revenue, conversion) alongside technical metrics
  • Set up alerts for canary-specific issues
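When comparing the canary to the baseline, a raw ratio can mislead at low traffic: three errors in a thousand requests may be noise. One hedged way to account for sample size is a two-proportion z-test on error counts, sketched below. The function names and the default cutoff of 2.33 (roughly a one-sided 1% significance level) are assumptions for illustration, and the normal approximation is rough when error counts are very small.

```python
import math


def error_rate_z_score(baseline_errors: int, baseline_requests: int,
                       canary_errors: int, canary_requests: int) -> float:
    """Z-score for the difference between canary and baseline error rates."""
    p_base = baseline_errors / baseline_requests
    p_canary = canary_errors / canary_requests
    # Pooled error rate under the hypothesis that both versions behave the same.
    pooled = (baseline_errors + canary_errors) / (baseline_requests + canary_requests)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / baseline_requests + 1 / canary_requests))
    if std_err == 0:
        return 0.0
    return (p_canary - p_base) / std_err


def canary_is_worse(baseline_errors: int, baseline_requests: int,
                    canary_errors: int, canary_requests: int,
                    z_threshold: float = 2.33) -> bool:
    """Return True if the canary error rate is significantly above baseline."""
    z = error_rate_z_score(baseline_errors, baseline_requests, canary_errors, canary_requests)
    return z > z_threshold


if __name__ == "__main__":
    # Same 0.3% canary error rate, different sample sizes.
    print(canary_is_worse(100, 100_000, 3, 1_000))
    print(canary_is_worse(100, 100_000, 30, 10_000))
```

In the usage example both canaries show a 0.3% error rate against a 0.1% baseline, but only the one with ten times more requests crosses the cutoff. That is the practical reason a low-traffic canary needs a longer bake time.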

4. Promote or Rollback

Based on analysis, either increase traffic or revert:

#!/bin/bash
# Example: Automated promotion script
set -euo pipefail

CANARY_WEIGHT=$1

# Update traffic split first. VirtualService is a custom resource, so kubectl
# needs --type merge. A merge patch replaces the whole http list: include the
# x-canary header route here as well if it should survive promotion.
kubectl patch virtualservice myapp-canary --type merge -p \
  '{"spec":{"http":[{"route":[{"destination":{"host":"myapp","subset":"stable"},"weight":'$((100 - CANARY_WEIGHT))'},
  {"destination":{"host":"myapp","subset":"canary"},"weight":'$CANARY_WEIGHT'}]}]}}'

echo "Traffic updated: $CANARY_WEIGHT% canary"

# Only remove the old version once all traffic is on the canary
if [ "$CANARY_WEIGHT" -eq 100 ]; then
  echo "Canary fully promoted. Removing old version."
  kubectl scale deployment myapp-stable --replicas=0
fi

#!/bin/bash
# Example: Instant rollback
set -euo pipefail
echo "Rolling back canary..."

# Set canary weight to 0
kubectl patch virtualservice myapp-canary --type merge -p \
  '{"spec":{"http":[{"route":[{"destination":{"host":"myapp","subset":"stable"},"weight":100},
  {"destination":{"host":"myapp","subset":"canary"},"weight":0}]}]}}'

# Scale canary to zero
kubectl scale deployment myapp-canary --replicas=0

echo "Rollback complete. All traffic on stable."

Promotion best practices:

  • Never skip bake time, even if metrics look good
  • Increase traffic in stages (1% → 5% → 10% → 25% → 50% → 100%)
  • Require manual approval for stages above 50%
  • Keep the old version scaled up until the canary is serving 100% of traffic and has passed its final bake time
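The practices above can be read as a small decision function: roll back on unhealthy metrics, wait until bake time has elapsed, and pause for approval before any stage above 50%. The sketch below encodes that logic under stated assumptions. The `Decision` type, the action names, and the `decide` signature are hypothetical and not taken from any tool.

```python
from dataclasses import dataclass

# Stage percentages from the list above.
STAGES = [1, 5, 10, 25, 50, 100]


@dataclass
class Decision:
    action: str             # "promote", "wait", "await_approval", "rollback", or "done"
    target_percentage: int  # canary traffic percentage after applying the action


def decide(current: int, minutes_in_stage: float, bake_time_minutes: float,
           metrics_healthy: bool, approved: bool = False) -> Decision:
    """Choose the next rollout action for a canary at `current` percent."""
    if not metrics_healthy:
        return Decision("rollback", 0)
    if current == STAGES[-1]:
        return Decision("done", current)
    if minutes_in_stage < bake_time_minutes:
        # Bake time is never skipped, even when metrics look good.
        return Decision("wait", current)
    target = STAGES[STAGES.index(current) + 1]
    if target > 50 and not approved:
        return Decision("await_approval", current)
    return Decision("promote", target)


if __name__ == "__main__":
    print(decide(current=1, minutes_in_stage=15, bake_time_minutes=15, metrics_healthy=True))
    print(decide(current=50, minutes_in_stage=30, bake_time_minutes=30, metrics_healthy=True))
```

Checking health before anything else means a regression triggers rollback even in the middle of a bake period. `STAGES.index` raises `ValueError` for a percentage that is not a defined stage, which surfaces configuration drift early.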

Automated Canary Analysis Tools

| Tool | Platform | Key Features |
|---|---|---|
| Flagger | Kubernetes | Automated canary, A/B testing, progressive delivery |
| Spinnaker | Multi-cloud | Pipeline-driven canary with metric analysis |
| Argo Rollouts | Kubernetes | Blue-green, canary, and analysis templates |
| AWS App Mesh | AWS | Traffic shifting with CloudWatch metrics |
| Google Cloud Traffic Director | GCP | Percentage-based traffic splitting |

Best Practices

  • Start small. A 1% canary limits user impact while still surfacing most regressions, provided it receives enough traffic to produce meaningful metrics.
  • Use meaningful metrics. Business metrics often detect issues that technical metrics miss.
  • Keep sessions sticky. Route the same user to the same version to avoid inconsistency.
  • Have an instant rollback. Reverting a canary should take seconds, not minutes.
  • Practice the rollback. Test your rollback procedure before you need it.
  • Document every canary. Note what changed, what was observed, and the final decision.

Common Mistakes

  • Rushing promotion. Skipping bake time because “it looks fine” leads to incidents.
  • Monitoring only technical metrics. A feature bug may not show in error rates but will affect conversions.
  • Inconsistent routing. Users bouncing between versions creates confusion and bugs.
  • Forgetting database compatibility. Both versions must work with the current schema.
  • Not scaling canary properly. Under-provisioned canaries fail under load, causing false rollbacks.
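The last mistake above can be reduced with a rough capacity estimate before each promotion step. The sketch below sizes the canary from total request rate, traffic percentage, and per-replica capacity. The function name, the 25% headroom, and the minimum of two replicas are assumptions for illustration; real capacity numbers should come from load tests.

```python
import math


def canary_replicas(total_rps: float, canary_percentage: float,
                    rps_per_replica: float, headroom: float = 1.25,
                    min_replicas: int = 2) -> int:
    """Estimate how many canary replicas are needed at a given traffic share."""
    canary_rps = total_rps * canary_percentage / 100
    # Add headroom so the canary is not running at its limit during analysis.
    needed = math.ceil(canary_rps * headroom / rps_per_replica)
    return max(needed, min_replicas)


if __name__ == "__main__":
    for pct in (1, 10, 50):
        print(pct, canary_replicas(total_rps=2000, canary_percentage=pct, rps_per_replica=100))
```

Scaling the canary before shifting traffic, rather than after, keeps resource exhaustion from being mistaken for a regression in the new version.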

Variants

  • Shadow canary: Send duplicate traffic to the canary and discard its responses (no user-facing risk as long as writes and other side effects are isolated, but it doubles load)
  • Dark launch: Deploy to production but hide behind feature flags
  • Geographic canary: Roll out region by region (US-East first, then Europe, then Asia)
  • Time-based canary: Route internal users during business hours, then external users after validation

FAQ

Q: What percentage should I start with for a canary?
A: Start with 1% for high-traffic services and 5-10% for lower-traffic ones. The goal is enough traffic for statistically significant metrics.

Q: How long should each canary stage last?
A: At least 10-15 minutes per stage for high-traffic services. For low-traffic services, extend to 30-60 minutes to gather enough data.

Q: What is the difference between canary and A/B testing?
A: A canary release checks that a new version is healthy and free of regressions. An A/B test compares user behavior and feature effectiveness between variants. The two can be combined.

Q: Should I use canary for every deployment?
A: For critical services, yes. For internal tools or low-risk changes, direct deployment may be acceptable.
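The answers on starting percentage and stage duration both come down to how many requests the canary sees. A back-of-the-envelope estimate is sketched below. The target of 10,000 canary requests per stage is an assumed figure for illustration, not a recommendation from this guide; pick a number that matches the error rates you need to detect.

```python
import math


def bake_minutes_needed(total_rps: float, canary_percentage: float,
                        min_requests: int = 10_000) -> int:
    """Minutes until the canary has served at least `min_requests` requests."""
    canary_rps = total_rps * canary_percentage / 100
    if canary_rps <= 0:
        raise ValueError("The canary must receive some traffic")
    return math.ceil(min_requests / (canary_rps * 60))


if __name__ == "__main__":
    # High-traffic service at 1% versus low-traffic service at 5%.
    print(bake_minutes_needed(total_rps=2000, canary_percentage=1))
    print(bake_minutes_needed(total_rps=50, canary_percentage=5))
```

With these inputs the high-traffic service reaches the target in 9 minutes, while the low-traffic service needs 67 minutes even at a higher percentage. That gap is why lower-traffic services need a larger starting percentage, longer stages, or both.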

Conclusion

Canary deployment is one of the safest ways to release software at scale. By exposing changes to a small, controlled audience first, you catch issues early, minimize blast radius, and build confidence in every release. Combine automated metric analysis with gradual promotion, and keep a tested rollback path ready at every stage.