SP StackPractices
intermediate

Blue-Green and Canary Deployments

A practical guide to deployment strategies: blue-green, canary, rolling, and feature flags. Minimize risk and rollback time when releasing to production.

Topics: devops

Introduction

Deploying to production is risky. A bad deployment can take down your service, corrupt data, or degrade user experience for hours. Deployment strategies exist to reduce this risk by controlling how new code reaches users and how quickly you can revert if things go wrong.

Deployment Strategies Compared

| Strategy | Risk Level | Rollback Time | Complexity | Best For |
|---|---|---|---|---|
| Recreate | High | Slow (redeploy) | Low | Dev/test environments only |
| Rolling | Medium | Medium (stop rolling) | Low | Simple stateless services |
| Blue-Green | Low | Instant (switch traffic) | Medium | When instant rollback is critical |
| Canary | Very Low | Fast (shift traffic back) | High | High-risk changes, gradual rollouts |
| Feature Flags | Minimal | Instant (toggle off) | Medium | Decoupling deploy from release |

Rolling Deployment

Replace old instances gradually with new ones.

Phase 1: [Old] [Old] [Old] [Old] [Old]
Phase 2: [New] [Old] [Old] [Old] [Old]
Phase 3: [New] [New] [Old] [Old] [Old]
Phase 4: [New] [New] [New] [New] [New]
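
# Kubernetes rolling update: set the new image, then wait for the rollout to finish
kubectl set image deployment/api api=myapp:v2.4.1
kubectl rollout status deployment/api
# If the new version misbehaves, revert to the previous revision
kubectl rollout undo deployment/api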

Trade-off: During rollout, old and new versions coexist. If v2 breaks a data contract, v1 instances may fail when reading v2-written data.
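
To make the mixed-version window concrete, here is a minimal Python simulation of a rolling update that halts at the first failed health check. It is a sketch only: the `rolling_update` function and the `is_healthy` probe are hypothetical names used for illustration, not part of Kubernetes or any other tool.

```python
from typing import Callable, List


def rolling_update(
    fleet: List[str],
    new_version: str,
    is_healthy: Callable[[int, str], bool],
) -> List[str]:
    """Replace instances one at a time; stop at the first failed health check.

    `is_healthy(index, version)` is a hypothetical probe supplied by the caller.
    """
    result = list(fleet)
    for index in range(len(result)):
        previous = result[index]
        result[index] = new_version
        if not is_healthy(index, new_version):
            # Put the old version back on this instance and halt the rollout.
            result[index] = previous
            break
    return result


# All probes pass: the whole fleet moves to the new version.
done = rolling_update(["v2.4.0"] * 5, "v2.4.1", lambda i, v: True)

# The third instance fails its probe: the rollout halts with a mixed fleet.
halted = rolling_update(["v2.4.0"] * 5, "v2.4.1", lambda i, v: i != 2)
print(done)
print(halted)
```

In the halted case two instances already run v2.4.1 while three still run v2.4.0, which is exactly the state in which a broken data contract causes trouble.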

Blue-Green Deployment

Maintain two identical environments. One is live (blue), one is idle (green). Deploy to green, test, then switch traffic instantly.

Before:  Users → [Load Balancer] → [Blue: v2.4.0]
                                    [Green: v2.4.0 idle]

After:   Users → [Load Balancer] → [Blue: v2.4.0 idle]
                                    [Green: v2.4.1 live]
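
# Terraform example: blue-green with AWS ALB target groups
# Switch traffic by changing the target group the listener forwards to (aws_lb.main and both target groups are assumed to be defined elsewhere)
resource "aws_lb_listener" "http" {
  load_balancer_arn = aws_lb.main.arn
  port              = 80
  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.green.arn # the live environment
  }
}

# Instant rollback: point target_group_arn back to aws_lb_target_group.blue.arn and apply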

Trade-off: Doubles infrastructure cost while both environments are running. Database schema changes need careful handling, because both versions must work against the same schema.
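
The switch itself is just a pointer flip, which the following Python sketch tries to show. `BlueGreenRouter` is a hypothetical toy stand-in for a load balancer, and the `smoke_test` callable is an assumed hook for whatever pre-switch testing you run against the idle environment.

```python
from typing import Callable, Dict


class BlueGreenRouter:
    """Toy stand-in for a load balancer that forwards to one environment."""

    def __init__(self, versions: Dict[str, str], live: str = "blue") -> None:
        self.versions = dict(versions)
        self.live = live

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def deploy_to_idle(self, version: str) -> None:
        # New code only ever lands on the environment that serves no traffic.
        self.versions[self.idle] = version

    def switch(self, smoke_test: Callable[[str], bool]) -> bool:
        """Flip traffic to the idle environment only if its smoke test passes."""
        candidate = self.idle
        if not smoke_test(self.versions[candidate]):
            return False
        self.live = candidate
        return True

    def rollback(self) -> None:
        # Rollback is the same pointer flip in the other direction.
        self.live = self.idle


router = BlueGreenRouter({"blue": "v2.4.0", "green": "v2.4.0"})
router.deploy_to_idle("v2.4.1")
switched = router.switch(lambda version: version == "v2.4.1")
print(router.live, router.versions[router.live])
router.rollback()
print(router.live, router.versions[router.live])
```

Because the previous version stays deployed on the other environment, rolling back needs no redeploy, only the reverse flip.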

Database Considerations

| Change Type | Blue-Green Compatible? |
|---|---|
| Add column (nullable) | Yes: old code ignores it |
| Add column (non-nullable) | No: old code cannot insert without it |
| Rename column | No: old code references old name |
| Drop column | No: old code may still read it |
| Add index | Yes: both versions benefit |

Rule: Blue-green requires backward-compatible database changes. Use the expand-contract pattern: add the new column (expand), deploy the new code, then remove the old column (contract) once nothing depends on it.
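
The first two rows of the table can be checked with a small experiment. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `orders` and `orders_strict` tables and the `old_code_insert` helper are hypothetical, and other databases may report the failure differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL NOT NULL)")


def old_code_insert(table: str, total: float) -> None:
    # Old code only knows about the `total` column.
    conn.execute(f"INSERT INTO {table} (total) VALUES (?)", (total,))


# Expand with a nullable column: the old INSERT still succeeds.
conn.execute("ALTER TABLE orders ADD COLUMN currency TEXT")
old_code_insert("orders", 19.99)
nullable_row = conn.execute("SELECT total, currency FROM orders").fetchone()

# A table whose new column is NOT NULL with no default rejects the old INSERT.
conn.execute(
    "CREATE TABLE orders_strict ("
    "id INTEGER PRIMARY KEY, total REAL NOT NULL, currency TEXT NOT NULL)"
)
try:
    old_code_insert("orders_strict", 19.99)
    strict_insert_ok = True
except sqlite3.IntegrityError:
    strict_insert_ok = False

print(nullable_row, strict_insert_ok)
```

The old insert leaves `currency` as NULL in the first table and is rejected by the second, which is why the non-nullable variant is not safe while old code is still running.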

Canary Deployment

Route a small percentage of traffic to the new version, monitor metrics, then gradually increase.

Step 1: 1%  → [Canary v2.4.1], 99% → [Stable v2.4.0]
Step 2: 5%  → [Canary v2.4.1], 95% → [Stable v2.4.0]
Step 3: 25% → [Canary v2.4.1], 75% → [Stable v2.4.0]
Step 4: 100% → [Canary v2.4.1 becomes stable]
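
One common way to split traffic by percentage is to hash a stable identifier into a bucket, so the same user keeps seeing the same version as the percentage grows. The Python sketch below illustrates that idea under stated assumptions: `bucket` and `route` are hypothetical helpers, and real load balancers or service meshes often weight per request rather than per user.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user to a stable bucket in [0, 10000) using a hash of the ID."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 10_000


def route(user_id: str, canary_percent: float) -> str:
    """Return "canary" for users whose bucket falls inside the canary share."""
    return "canary" if bucket(user_id) < canary_percent * 100 else "stable"


users = [f"user-{n}" for n in range(10_000)]
for percent in (1, 5, 25, 100):
    share = sum(route(u, percent) == "canary" for u in users) / len(users)
    print(f"target {percent}% -> observed {share:.1%}")
```

Because the bucket never changes, any user routed to the canary at 5% stays on it at 25%, which keeps their experience consistent across steps.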

# Kubernetes with Flagger (automated canary)
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: api
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  service:
    port: 80
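  analysis:
    interval: 30s      # how often the metric checks run
    threshold: 5       # failed checks allowed before rollback
    maxWeight: 50      # maximum percentage of traffic sent to the canary
    stepWeight: 10     # percentage added after each passing check
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99      # percent of requests that must succeed
      - name: request-duration
        thresholdRange:
          max: 500     # milliseconds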

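Abort criteria: If the success rate falls below the minimum or latency exceeds the maximum, the check fails. Once the number of failed checks reaches the configured threshold, Flagger automatically routes all traffic back to the stable version (0% canary).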
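
The progression logic can be sketched in a few lines of Python. This is a simplified model loosely based on the `stepWeight`, `maxWeight`, and `threshold` settings above, not Flagger's actual implementation; `run_canary` is a hypothetical function and each boolean in `checks` stands for one analysis interval.

```python
from typing import Iterable, List, Tuple


def run_canary(
    checks: Iterable[bool],
    step_weight: int = 10,
    max_weight: int = 50,
    threshold: int = 5,
) -> Tuple[str, List[int]]:
    """Advance the canary weight on each passing check and roll back after
    `threshold` failed checks. Returns (outcome, weight history)."""
    weight, failures, history = 0, 0, []
    for passed in checks:
        if passed:
            weight = min(weight + step_weight, max_weight)
        else:
            failures += 1
            if failures >= threshold:
                history.append(0)  # all traffic returns to the stable version
                return "rolled_back", history
        history.append(weight)
        if weight >= max_weight:
            return "promoted", history
    return "in_progress", history


print(run_canary([True] * 5))
print(run_canary([True, True] + [False] * 5))
```

In the second call the weight holds at 20% while checks fail, then drops to 0 on the fifth failure.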

Feature Flags (Decoupling Deploy from Release)

Deploy code to production but keep it hidden. Enable for specific users when ready.

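# LaunchDarkly-style feature flag (client, new_checkout, and old_checkout are defined elsewhere)
def handle_checkout(request, user_context):
    # The third argument is the default returned if the flag cannot be evaluated
    if client.variation("new-checkout-flow", user_context, False):
        return new_checkout.handle(request)
    return old_checkout.handle(request)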

| Use Feature Flags For | Do NOT Use Feature Flags For |
|---|---|
| New UI features | Security fixes (should not be toggleable) |
| A/B tests | Critical bug patches |
| Gradual feature rollouts | Data migration code |
| Kill switches for risky features | |
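
For readers without a flag service, the core mechanics fit in a small in-memory class. The following Python sketch is an illustration only: `FeatureFlag` and its fields are hypothetical and do not mirror LaunchDarkly's or any other vendor's API.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class FeatureFlag:
    """In-memory flag: an allowlist of users plus a global kill switch."""

    name: str
    enabled: bool = False   # kill switch: False hides the feature for everyone
    allowed_users: Set[str] = field(default_factory=set)
    everyone: bool = False  # True releases the feature to all users

    def is_on(self, user_id: str) -> bool:
        if not self.enabled:
            return False
        return self.everyone or user_id in self.allowed_users


flag = FeatureFlag("new-checkout-flow", enabled=True, allowed_users={"alice"})


def checkout(user_id: str) -> str:
    # The new path is deployed but only reachable when the flag is on.
    return "new checkout" if flag.is_on(user_id) else "old checkout"


print(checkout("alice"), "|", checkout("bob"))
flag.enabled = False  # kill switch: takes effect without a deploy
print(checkout("alice"))
```

Flipping `enabled` changes behavior at runtime, which is what separates a release from a deploy.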

Metrics to Watch During Deployment

| Metric | Canary Threshold | Action If Breached |
|---|---|---|
| Error rate | < 0.1% | Rollback canary |
| Latency p99 | < baseline + 20% | Rollback canary |
| Throughput | No drop > 10% | Rollback canary |
| Custom business metric | No drop | Rollback canary |
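
The first three rows of the table translate directly into a comparison between a canary snapshot and a baseline snapshot. The Python sketch below assumes you already collect these numbers somewhere; `Snapshot` and `breaches` are hypothetical names, and the custom business metric is left out because it depends on the product.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    error_rate: float      # fraction of failed requests, e.g. 0.001 = 0.1%
    latency_p99_ms: float
    throughput_rps: float


def breaches(canary: Snapshot, baseline: Snapshot) -> List[str]:
    """Return the names of the table thresholds that the canary violates."""
    found = []
    if canary.error_rate >= 0.001:
        found.append("error_rate")
    if canary.latency_p99_ms >= baseline.latency_p99_ms * 1.20:
        found.append("latency_p99")
    if canary.throughput_rps < baseline.throughput_rps * 0.90:
        found.append("throughput")
    return found


baseline = Snapshot(error_rate=0.0002, latency_p99_ms=400, throughput_rps=1000)
healthy = Snapshot(error_rate=0.0003, latency_p99_ms=430, throughput_rps=980)
degraded = Snapshot(error_rate=0.004, latency_p99_ms=520, throughput_rps=850)

print(breaches(healthy, baseline))
print(breaches(degraded, baseline))
```

Any non-empty result would trigger the rollback action listed in the table.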

Best Practices

  • Automate rollback — a human pressing a button at 3 AM is unreliable
  • Use synthetic traffic — hit the canary with automated tests before real users
  • Keep deployments small: smaller changes are easier to debug and faster to roll back
  • One change at a time — do not combine a deploy with a database migration and a config change
  • Test rollback — a rollback you have never practiced is a gamble

Common Mistakes

  • Deploying on Friday afternoon — you will be debugging all weekend
  • Not having automated rollback: manual rollbacks are slower and more error-prone under pressure
  • Combining multiple changes in one deploy — when it breaks, you do not know which change caused it
  • Ignoring canary metrics because “the tests passed” — production traffic is the only real test
  • Forgetting database schema compatibility in blue-green — old and new code must coexist during the switch

Frequently Asked Questions

Should every deploy use canary?

No. Low-risk changes (dependency updates, typo fixes) can use rolling deploys. Reserve canary for user-facing features, risky refactors, and changes that touch critical paths (payments, authentication).

How long should a canary run?

Until you have statistical confidence. For high-traffic services, 15-30 minutes may suffice. For low-traffic services, hours or a full business cycle may be needed. Use error budgets and SLOs to define “done.”
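
A rough way to estimate a lower bound on canary duration is to ask how many error-free requests are needed before you can claim, at a given confidence, that the true error rate is below your threshold. The Python sketch below uses the zero-failure binomial bound for that; the function names and traffic figures are illustrative assumptions, and it covers error rate only, not latency or business metrics.

```python
import math


def requests_needed(max_error_rate: float, confidence: float = 0.95) -> int:
    """Error-free requests needed before the true error rate is below
    `max_error_rate` with the given confidence (binomial, zero failures)."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - max_error_rate))


def minutes_needed(total_rps: float, canary_fraction: float,
                   max_error_rate: float) -> float:
    """Minutes of canary traffic required to collect that many requests."""
    canary_rps = total_rps * canary_fraction
    return requests_needed(max_error_rate) / canary_rps / 60


# Threshold of 0.1% errors, canary receiving 5% of traffic.
print(requests_needed(0.001))
print(round(minutes_needed(1000, 0.05, 0.001), 1))  # high-traffic service
print(round(minutes_needed(2, 0.05, 0.001), 1))     # low-traffic service
```

With these assumed numbers, roughly 3,000 clean requests are required. A service at 1,000 requests per second collects them in about a minute, while one at 2 requests per second needs more than eight hours, which is why low-traffic services need much longer canaries.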

What if the database schema needs to change?

Use the expand-contract pattern. Step 1: deploy the schema change (add the new column, keep the old one). Step 2: deploy code that writes to both columns. Step 3: backfill existing rows. Step 4: deploy code that reads from and writes to the new column only. Step 5: drop the old column. This takes multiple deploys, but each step is backward compatible, so there is no downtime.
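
The sequence can be walked through with Python's built-in `sqlite3` module. This is a sketch under assumptions: the `users` table, the `fullname` to `display_name` rename, and the helper functions are hypothetical, and in practice each step would ship as its own deploy.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
db.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")  # pre-existing row

# Step 1 (expand): add the new column and keep the old one.
db.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


# Step 2: new code writes both columns, so old readers keep working.
def create_user(name: str) -> None:
    db.execute(
        "INSERT INTO users (fullname, display_name) VALUES (?, ?)", (name, name)
    )


create_user("Grace Hopper")

# Step 3: backfill rows written before the dual-write deploy.
db.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")


# Step 4: readers switch to the new column only.
def list_names() -> list:
    rows = db.execute("SELECT display_name FROM users ORDER BY id")
    return [row[0] for row in rows]


print(list_names())

# Step 5 (contract) would drop `fullname` once nothing reads or writes it;
# it is left out here because DROP COLUMN support varies by database version.
```

After the backfill, every row has a value in the new column, so switching readers in step 4 does not expose any gaps.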