Skip to content
SP StackPractices
advanced By StackPractices

Data Migration Runbook Template

A runbook template for safely migrating data between systems including pre-migration checks, rollback procedures, and post-migration validation.

Note: This guide follows English-language naming conventions and terminology standards common in international development teams. Examples use English identifiers and comments to maximize compatibility across codebases and tooling.

Overview

Data migrations are among the riskiest operations in software engineering. Unlike code deployments, data migrations cannot be rolled back with a simple kubectl rollout undo. A failed migration can corrupt production data, violate compliance requirements, and cause extended outages. This runbook template structures the migration into verifiable phases: preparation, dry run, execution, validation, and rollback.

When to Use

Use this resource when:

  • Moving data between database versions or engines (MySQL 5.7 to 8.0, PostgreSQL to Aurora)
  • Migrating from monolith to microservices (database per service)
  • Consolidating multiple data sources into a data warehouse
  • Executing large-scale schema changes that require data transformation
  • Migrating between cloud providers (AWS RDS to GCP Cloud SQL)

Prerequisites

Before starting:

  • Full backup of source and target systems completed and verified
  • Migration script tested on a dataset with production-like volume
  • Downtime window approved by stakeholders (if applicable)
  • Rollback plan documented and tested
  • Monitoring and alerting configured for both source and target

Solution

# Data Migration Runbook: `<Migration Name>`

## 1. Pre-Migration Checklist

### Source System
```bash
# Verify backup integrity
pg_dump -h source.db.internal -U admin mydb | gzip > /backups/pre-migration.sql.gz
gunzip -t /backups/pre-migration.sql.gz

# Record baseline metrics
psql -h source.db.internal -c "SELECT pg_size_pretty(pg_database_size('mydb'));"
psql -h source.db.internal -c "SELECT COUNT(*) FROM orders;"
psql -h source.db.internal -c "SELECT MAX(updated_at) FROM orders;"
MetricValueNotes
Database size______
Table row counts______
Latest update timestamp______
Active connections______
Replication lag______

Target System

  • Target schema created and matches source structure
  • Target indexes built and validated
  • Target storage capacity > 2x expected data size
  • Network connectivity verified between source and target
  • Target performance baseline established

Application

  • Feature flags configured for dual-write or read-after-write
  • Application code deployed that supports both old and new systems
  • Monitoring dashboards updated with target system metrics

2. Migration Strategy Selection

StrategyDowntimeComplexityUse Case
Big BangMinutes to hoursLowSmall datasets (< 100GB), simple schema
Incremental / BatchNear-zeroMediumLarge datasets, can tolerate eventual consistency
Dual WriteZeroHighLive systems requiring 100% availability
CDC (Change Data Capture)Near-zeroHighContinuous replication, minimal downtime

Decision Record

Selected strategy: ______

Justification: ______

3. Dry Run Execution

# Run migration on a copy of production data
# Do NOT connect to production systems

cp /backups/pre-migration.sql.gz /tmp/dry-run.sql.gz
gunzip /tmp/dry-run.sql.gz

# Execute migration script
psql -h target-staging.db.internal -f /tmp/dry-run.sql

# Validate dry run
./scripts/validate-migration.sh \
  --source source-staging.db.internal \
  --target target-staging.db.internal
Dry Run ResultStatus
Duration______
Rows migrated______
Errors encountered______
Validation passed[ ]

Decision Gate: Only proceed to production if dry run completed without errors and validation passed.

4. Production Migration Execution

Step 4a: Final Backup

# Create point-in-time backup immediately before migration
aws rds create-db-snapshot \
  --db-instance-identifier source-db \
  --db-snapshot-identifier pre-migration-$(date +%Y%m%d-%H%M%S)

Step 4b: Stop Writes (if using Big Bang)

# Set application to read-only
curl -X POST http://app.internal/admin/maintenance-mode

# Verify no active writes
psql -h source.db.internal -c "SELECT COUNT(*) FROM pg_stat_activity WHERE state = 'active';"

Step 4c: Execute Migration

# Log migration start time
MIGRATION_START=$(date -u +%Y-%m-%dT%H:%M:%SZ)
echo "Migration started: $MIGRATION_START"

# Execute migration
psql -h target.db.internal -f migration-script.sql 2>&1 | tee migration.log

# Log migration end time
MIGRATION_END=$(date -u +%Y-%m-%dT%H:%M:%SZ)
echo "Migration ended: $MIGRATION_END"

Step 4d: Resume Writes (if applicable)

# Verify target is healthy before switching writes
curl -X POST http://app.internal/admin/target-health-check

# Switch application to target
curl -X POST http://app.internal/admin/switch-datastore \
  -H "Content-Type: application/json" \
  -d '{"target": "new-database"}'

# Resume normal operations
curl -X POST http://app.internal/admin/normal-mode

5. Post-Migration Validation

Row Count Verification

-- Compare row counts for all major tables
SELECT 'source_orders' as table_name, COUNT(*) as row_count FROM source.orders
UNION ALL
SELECT 'target_orders', COUNT(*) FROM target.orders
UNION ALL
SELECT 'source_users', COUNT(*) FROM source.users
UNION ALL
SELECT 'target_users', COUNT(*) FROM target.users;

Data Integrity Checks

-- Checksum comparison for critical tables
SELECT 'source', SUM(CHECKSUM(id, amount, created_at)) FROM source.payments
UNION ALL
SELECT 'target', SUM(CHECKSUM(id, amount, created_at)) FROM target.payments;

-- Verify no NULL values in required columns
SELECT COUNT(*) FROM target.orders WHERE customer_id IS NULL;
SELECT COUNT(*) FROM target.orders WHERE created_at IS NULL;

Application Smoke Tests

# Critical user flows
./scripts/smoke-test.sh --environment=production

# Performance baseline comparison
./scripts/performance-test.sh --target=new-db --baseline=old-db
Validation CheckSourceTargetMatchTime
Total row count____________[ ]______
Table-level counts____________[ ]______
Checksum for payments____________[ ]______
NULL constraint checksN/A______[ ]______
Smoke tests passN/A______[ ]______
Performance within 10%____________[ ]______

6. Rollback Procedure

Trigger Conditions

Rollback if ANY of the following occur:

  • Error rate > 1% after migration
  • Data integrity check fails
  • Performance degradation > 50%
  • Customer-facing feature broken

Rollback Steps

# 1. Stop writes to target immediately
curl -X POST http://app.internal/admin/maintenance-mode

# 2. Switch application back to source
curl -X POST http://app.internal/admin/switch-datastore \
  -d '{"target": "source-database"}'

# 3. Resume operations on source
curl -X POST http://app.internal/admin/normal-mode

# 4. DO NOT DELETE target data until root cause is resolved
# 5. Document all findings for postmortem
Rollback StepStatusTime
Maintenance mode activated[ ]______
Source restored as primary[ ]______
Application switched[ ]______
Smoke tests passed on source[ ]______
Target data preserved[ ]______

7. Post-Migration Actions

  • Monitor target system for 24 hours minimum
  • Compare error rates between pre and post migration
  • Validate backup of target system
  • Update runbook with actual duration and issues encountered
  • Schedule cleanup of source data (after 30-day retention)
  • Document lessons learned
  • Close incident channel when stable

## Explanation

The runbook separates **preparation** (backups, dry runs) from **execution** (the actual migration) and **validation** (data integrity checks). The critical insight is the **decision gate** after the dry run — never run an untested migration in production. The rollback procedure is designed to be fast (minutes, not hours) because data issues compound over time.

## Variants

| Context | Approach | Notes |
|---------|----------|-------|
| Database version upgrade | `pg_dumpall` / `pg_upgrade` | Test on identical OS and PostgreSQL versions |
| Cloud provider migration | AWS DMS / GCP Database Migration Service | Built-in validation, but monitor replication lag |
| Microservices extraction | Dual-write pattern | Complex, but zero downtime; requires application changes |
| Data warehouse ETL | Batch loads with Airflow | Schedule during low-traffic windows |
| NoSQL to SQL | Custom transformation scripts | Schema design is the hardest part; test queries thoroughly |

## Best Practices

1. **Always run a dry run** on production-scale data in an isolated environment
2. **Never modify the source** during migration — read-only access prevents accidental corruption
3. **Validate incrementally** — check row counts per table, not just totals
4. **Preserve both systems** until validation is complete and stable
5. **Document actual vs. estimated duration** — improves future planning

## Common Mistakes

1. **Not testing with production data volume** — small datasets hide performance issues
2. **Modifying source data during migration** — creates inconsistency that cannot be reconciled
3. **Skipping rollback rehearsal** — discovers rollback doesn't work when it's needed most
4. **Deleting source data too early** — validation may reveal issues hours after migration
5. **Not monitoring application behavior** — database migration success != application success

## Frequently Asked Questions

### How do I handle very large migrations (TB+)?

Use an incremental approach: migrate historical data in batches during low-traffic periods, then use CDC for the final delta. Tools like AWS DMS, Debezium, or custom batch scripts work well. Plan for days or weeks, not hours.

### What if source and target schemas differ?

Document the transformation in the migration script and validate every transformed field. Common issues: timezone conversions, character encodings, enum values, and nullable columns. Test edge cases in the dry run.

### How long should I keep source data after migration?

Minimum 30 days for most systems. For compliance-regulated data, follow your retention policy (often 90 days or longer). Keep until you're confident the migration is stable and all downstream consumers have verified their integrations.