intermediate By StackPractices

Zero-Downtime Deployment Checklist

A checklist to ensure production deployments complete without service interruptions using safe rollout patterns.

Note: This guide follows English-language naming conventions and terminology standards common in international development teams. Examples use English identifiers and comments to maximize compatibility across codebases and tooling.

Overview

Zero-downtime deployments update production services without interrupting users. This checklist helps teams verify that health checks, traffic routing, database migrations, and rollback plans are in place before and during a release.

When to Use

  • Releasing a new version of a user-facing service.
  • Deploying schema or data migrations that affect multiple instances.
  • Changing infrastructure that could impact availability.
  • Introducing a new rollout strategy like canary or blue-green.
  • Preparing for a high-traffic event where stability matters most.

Prerequisites

  • A deployment pipeline with automated build, test, and publish stages.
  • Health check endpoints that represent real application readiness.
  • Load balancer, ingress, or traffic controller that supports gradual rollout.
  • Database migration strategy that is backward compatible.
  • Rollback plan with known good artifact and data state.
  • Monitoring and alerting for error rate, latency, and business metrics.
  • A communication plan for stakeholders and customers.

Solution

Checklist

1. Pre-Deployment Readiness

  • Deployment change is approved and documented.
  • Code is merged and artifact is built and tagged.
  • Automated unit, integration, and contract tests pass.
  • Database migrations are reviewed for backward compatibility.
  • Feature flags are configured for safe enablement.
  • Capacity and scaling limits are sufficient for expected traffic.
  • Monitoring dashboards and alerts are active.
  • On-call rotation is aware of the deployment window.
  • Rollback steps are documented and tested in a non-production environment.
  • Customer-facing communication is prepared if needed.

2. Health Check Configuration

| Check | Endpoint | Success Criteria | Failure Action |
|---|---|---|---|
| Liveness | /health/live | HTTP 200 | Restart container |
| Readiness | /health/ready | HTTP 200 and dependencies up | Stop traffic routing |
| Startup | /health/startup | HTTP 200 | Delay rollout |
| Dependency | /health/deps | Database, cache, queue reachable | Alert and halt |
| Business | /health/business | Critical flow returns expected value | Page on-call |
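
A readiness check that reports "HTTP 200 and dependencies up" can be sketched as a small aggregator over dependency probes. This is a minimal illustration, not a prescribed implementation: the `readiness` function and the probe names are hypothetical, and wiring the result into an actual `/health/ready` handler depends on the web framework in use.

```python
from typing import Callable, Dict, Tuple

# Hypothetical probe type: returns True when the dependency is reachable.
DependencyCheck = Callable[[], bool]


def readiness(checks: Dict[str, DependencyCheck]) -> Tuple[int, Dict[str, str]]:
    """Return an HTTP-style status code and a per-dependency report."""
    report: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            report[name] = "up" if check() else "down"
        except Exception as exc:  # a probe that raises counts as down
            report[name] = f"down ({type(exc).__name__})"
    # 200 only when every dependency is up; 503 tells the router to stop traffic.
    status = 200 if all(state == "up" for state in report.values()) else 503
    return status, report
```

Returning the per-dependency report alongside the status code makes a failed readiness check easier to diagnose during a rollout.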

3. Rollout Strategy Selection

| Strategy | Use Case | Risk Level | Rollback Speed |
|---|---|---|---|
| Rolling update | Stateless services, low risk | Low | Medium (terminate new pods) |
| Blue-green | Stateful sessions, predictable releases | Medium | Fast (switch traffic back) |
| Canary | High risk, measurable metrics | Medium | Fast (drain canary) |
| Feature flag | Gradual user exposure | Low | Instant (toggle off) |
| A/B deployment | Validate user behavior | Medium | Fast (re-route traffic) |
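
As one illustration of gradual user exposure through a feature flag, the sketch below assigns each user to a stable bucket and enables the flag for buckets under the rollout percentage. It assumes a hash-based bucketing scheme; the function name `is_enabled` and the flag name in the test are hypothetical, and real feature-flag services may use a different scheme.

```python
import hashlib


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Return True when the user falls inside the flag's rollout percentage."""
    # Hash flag and user together so each flag gets an independent bucketing.
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return bucket < rollout_percent
```

Because the bucket is derived from the user ID, a user who sees the feature at 10% still sees it at 50%, and setting the percentage to 0 acts as the instant toggle-off.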

4. Deployment Execution Steps

| Step | Action | Verification |
|---|---|---|
| 1 | Deploy to staging and run smoke tests | Staging tests pass |
| 2 | Deploy canary or small subset | Health checks pass, error rate stable |
| 3 | Monitor key metrics for canary duration | Latency, error rate, business metrics within baseline |
| 4 | Increase traffic percentage gradually | Each stage passes health and metric checks |
| 5 | Complete rollout to 100% | All instances healthy and serving traffic |
| 6 | Validate production endpoints | Smoke tests and critical user flows pass |
| 7 | Keep old version available for rollback | Retain for defined rollback window |
| 8 | Confirm rollback window has passed | Remove old version or update artifact baseline |
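
Steps 2 through 5 amount to a loop with a gate at each traffic stage. The sketch below shows that control flow only; `set_traffic_percent` and `stage_is_healthy` are hypothetical callbacks standing in for the traffic controller and the metric checks, and in practice the health gate would wait for the canary duration before answering.

```python
from typing import Callable, Iterable


def progressive_rollout(
    stages: Iterable[int],
    set_traffic_percent: Callable[[int], None],
    stage_is_healthy: Callable[[int], bool],
) -> bool:
    """Shift traffic stage by stage; revert to 0% on the first failed gate."""
    for percent in stages:
        set_traffic_percent(percent)
        if not stage_is_healthy(percent):
            set_traffic_percent(0)  # send all traffic back to the old version
            return False
    return True
```

A call such as `progressive_rollout([5, 25, 50, 100], router.set_weight, gate.check)` would stop at the first unhealthy stage, which only works if the old version is still running (step 7).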

5. Database Migration Safety

  • Migrations are additive and backward compatible with the previous version.
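  • Old code can read and write the new schema without errors, so a code rollback is safe after the migration has run.
  • New code can read the old schema, in case the migration is reverted or has not yet been applied.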
  • Indexes are created concurrently where supported.
  • Large migrations are split into smaller batches.
  • Data backfill or migration jobs are idempotent.
  • Rollback script or compensating operation is available.
  • Database changes are tested in staging with production-like data.
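
The batching and idempotency points above can be sketched with SQLite from the Python standard library. The `users` table, the `display_name` column, and the batch size are hypothetical; a production database would need its own locking and batch-size considerations.

```python
import sqlite3


def backfill_display_name(conn: sqlite3.Connection, batch_size: int = 2) -> int:
    """Fill users.display_name from users.name in small batches.

    Only rows where display_name IS NULL are touched, so re-running the job
    after a crash or a retry leaves already-migrated rows unchanged.
    """
    total = 0
    while True:
        cur = conn.execute(
            "UPDATE users SET display_name = name "
            "WHERE id IN (SELECT id FROM users WHERE display_name IS NULL "
            "ORDER BY id LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # one short transaction per batch keeps lock time low
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


def demo() -> tuple:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("ada",), ("lin",), ("sam",)])
    # Additive change: a nullable column that the previous code version can ignore.
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
    first = backfill_display_name(conn)
    second = backfill_display_name(conn)  # second run finds nothing to do
    return first, second
```

In this sketch the first run updates all three rows across two batches and the second run updates none, which is the property that makes the job safe to retry.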

6. Rollback Triggers

| Trigger | Threshold | Action |
|---|---|---|
| Error rate spike | > 0.5% for 2 minutes | Pause rollout and investigate |
| Latency increase | p99 > baseline + 30% for 5 minutes | Roll back traffic |
| Business metric drop | Conversion rate drops > 5% | Roll back immediately |
| Health check failure | > 10% failing | Roll back immediately |
| Critical alert | Any P1 incident | Roll back and page on-call |
| Canary timeout | Canary stage exceeds duration without passing | Roll back canary |
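
The thresholds above can be encoded as an automated gate. The sketch below is one possible encoding: the `Metrics` fields and the action names are hypothetical, the values are assumed to be already aggregated over the 2-minute and 5-minute windows, and the canary timeout is left out because it depends on the rollout controller's own clock.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    error_rate_pct: float             # sustained over the last 2 minutes
    p99_latency_ms: float             # sustained over the last 5 minutes
    baseline_p99_ms: float
    conversion_drop_pct: float        # drop relative to baseline
    failing_health_checks_pct: float
    p1_incident_open: bool = False


def decide(m: Metrics) -> str:
    """Map metrics to an action, checking the most severe trigger first."""
    if m.p1_incident_open:
        return "rollback_and_page"
    if m.failing_health_checks_pct > 10 or m.conversion_drop_pct > 5:
        return "rollback_immediately"
    if m.p99_latency_ms > m.baseline_p99_ms * 1.30:
        return "rollback_traffic"
    if m.error_rate_pct > 0.5:
        return "pause_and_investigate"
    return "continue"
```

Ordering the checks by severity means that when several triggers fire at once, the strongest action wins.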

7. Post-Deployment Validation

  • Application logs show no unexpected errors.
  • Error rate and latency are within baseline.
  • Business metrics are stable or improving.
  • All feature flags are in the intended state.
  • Old resources are cleaned up after the rollback window.
  • Deployment summary is shared with the team.
  • Any issues are logged in the issue tracker with owners.

Explanation

Zero-downtime deployments rely on three things: safe rollout mechanics, reliable health signals, and fast rollback. A checklist ensures that each release considers traffic routing, data compatibility, and observability before any user is exposed. Combining this discipline with automation reduces the risk of production incidents and improves release confidence.

Variants

  • Kubernetes rolling update checklist: Focus on readiness probes, max surge, max unavailable, and pod disruption budgets.
  • Blue-green deployment checklist: Focus on traffic switch, database compatibility, and version retention.
  • Canary deployment checklist: Focus on metric thresholds, progressive traffic weights, and automated rollback gates.
  • Serverless deployment checklist: Focus on function versioning, alias routing, and API Gateway stage management.
  • Database-heavy deployment checklist: Focus on schema compatibility, migration order, and rollback scripts.
  • Mobile or client deployment checklist: Focus on staged rollout, forced update handling, and API compatibility.

Best Practices

  • Keep deployments small and frequent to reduce risk.
  • Make database changes backward compatible with both old and new code.
  • Use health checks that verify real dependencies, not just process liveness.
  • Automate rollback based on metrics, not just manual decision gates.
  • Monitor business metrics, not just technical metrics.
  • Maintain a known-good baseline artifact for fast rollback.
  • Practice rollbacks in staging or during game days.
  • Document deployment decisions and outcomes for future reviews.

Common Mistakes

  • Skipping health checks or using trivial HTTP 200 checks.
  • Deploying database changes that are not backward compatible.
  • Rolling out 100% traffic before validating metrics.
  • Not having a rollback plan before starting the deployment.
  • Ignoring increased latency in favor of error rate alone.
  • Cleaning up old versions too early.
  • Deploying during peak traffic without capacity planning.

FAQs

What is the difference between rolling and canary deployment?

A rolling update replaces old instances incrementally, one or a few at a time, until the whole fleet runs the new version. A canary deploys the new version to a small subset first, validates metrics against a baseline, and only then gradually increases traffic to it.

How do we make database changes safe for zero downtime?

Use additive changes first (add columns, tables, indexes), deploy code that reads both old and new schema, then remove old schema in a later release. This is often called the expand-contract pattern.
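
During the transition phase of expand-contract, application code has to tolerate both schema shapes. A minimal sketch of such a dual read follows, assuming a hypothetical old column `name` and new column `display_name`, with each row represented as a mapping.

```python
from typing import Mapping, Optional


def read_display_name(row: Mapping[str, Optional[str]]) -> Optional[str]:
    """Prefer the new column, fall back to the old one during the transition."""
    # The key is absent before the expand step and None before the backfill.
    new_value = row.get("display_name")
    if new_value is not None:
        return new_value
    return row.get("name")
```

Once every row has been backfilled and no running version reads `name`, the fallback and the old column can be removed in the contract step.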

When should we roll back immediately?

Roll back when health checks fail broadly, error rate spikes, critical business metrics drop, or a P1 alert fires. Faster rollback saves user trust and revenue.