Data Migration Runbook Template

Q: How do I handle very large migrations (TB+)?

Use an incremental approach: migrate historical data in batches during low-traffic periods, then use CDC for the final delta. Tools like AWS DMS, Debezium, or custom batch scripts work well. Plan for days or weeks, not hours.

Q: What if source and target schemas differ?

Document the transformation in the migration script and validate every transformed field. Common issues: timezone conversions, character encodings, enum values, and nullable columns. Test edge cases in the dry run.

Q: How long should I keep source data after migration?

Minimum 30 days for most systems. For compliance-regulated data, follow your retention policy (often 90 days or longer). Keep until you're confident the migration is stable and all downstream consumers have verified their integrations.

Note: This guide follows English-language naming conventions and terminology standards common in international development teams. Examples use English identifiers and comments to maximize compatibility across codebases and tooling.

Overview

Data migrations are among the riskiest operations in software engineering. Unlike code deployments, data migrations cannot be rolled back with a simple kubectl rollout undo. A failed migration can corrupt production data, violate compliance requirements, and cause extended outages. This runbook template structures the migration into verifiable phases: preparation, dry run, execution, validation, and rollback.

When to Use

Use this resource when:

Moving data between database versions or engines (MySQL 5.7 to 8.0, PostgreSQL to Aurora)
Migrating from monolith to microservices (database per service)
Consolidating multiple data sources into a data warehouse
Executing large-scale schema changes that require data transformation
Migrating between cloud providers (AWS RDS to GCP Cloud SQL)

Prerequisites

Before starting:

Full backup of source and target systems completed and verified
Migration script tested on a dataset with production-like volume
Downtime window approved by stakeholders (if applicable)
Rollback plan documented and tested
Monitoring and alerting configured for both source and target

Solution

# Data Migration Runbook: `<Migration Name>`

## 1. Pre-Migration Checklist

### Source System
```bash
# Verify backup integrity
pg_dump -h source.db.internal -U admin mydb | gzip > /backups/pre-migration.sql.gz
gunzip -t /backups/pre-migration.sql.gz

# Record baseline metrics
psql -h source.db.internal -c "SELECT pg_size_pretty(pg_database_size('mydb'));"
psql -h source.db.internal -c "SELECT COUNT(*) FROM orders;"
psql -h source.db.internal -c "SELECT MAX(updated_at) FROM orders;"

Metric	Value	Notes
Database size	______
Table row counts	______
Latest update timestamp	______
Active connections	______
Replication lag	______

Target System

Target schema created and matches source structure
Target indexes built and validated
Target storage capacity > 2x expected data size
Network connectivity verified between source and target
Target performance baseline established

Application

Feature flags configured for dual-write or read-after-write
Application code deployed that supports both old and new systems
Monitoring dashboards updated with target system metrics

2. Migration Strategy Selection

Strategy	Downtime	Complexity	Use Case
Big Bang	Minutes to hours	Low	Small datasets (< 100GB), simple schema
Incremental / Batch	Near-zero	Medium	Large datasets, can tolerate eventual consistency
Dual Write	Zero	High	Live systems requiring 100% availability
CDC (Change Data Capture)	Near-zero	High	Continuous replication, minimal downtime

Decision Record

Selected strategy: ______

Justification: ______

3. Dry Run Execution

# Run migration on a copy of production data
# Do NOT connect to production systems

cp /backups/pre-migration.sql.gz /tmp/dry-run.sql.gz
gunzip /tmp/dry-run.sql.gz

# Execute migration script
psql -h target-staging.db.internal -f /tmp/dry-run.sql

# Validate dry run
./scripts/validate-migration.sh \
  --source source-staging.db.internal \
  --target target-staging.db.internal

Dry Run Result	Status
Duration	______
Rows migrated	______
Errors encountered	______
Validation passed	[ ]

Decision Gate: Only proceed to production if dry run completed without errors and validation passed.

4. Production Migration Execution

Step 4a: Final Backup

# Create point-in-time backup immediately before migration
aws rds create-db-snapshot \
  --db-instance-identifier source-db \
  --db-snapshot-identifier pre-migration-$(date +%Y%m%d-%H%M%S)

Step 4b: Stop Writes (if using Big Bang)

# Set application to read-only
curl -X POST http://app.internal/admin/maintenance-mode

# Verify no active writes
psql -h source.db.internal -c "SELECT COUNT(*) FROM pg_stat_activity WHERE state = 'active';"

Step 4c: Execute Migration

# Log migration start time
MIGRATION_START=$(date -u +%Y-%m-%dT%H:%M:%SZ)
echo "Migration started: $MIGRATION_START"

# Execute migration
psql -h target.db.internal -f migration-script.sql 2>&1 | tee migration.log

# Log migration end time
MIGRATION_END=$(date -u +%Y-%m-%dT%H:%M:%SZ)
echo "Migration ended: $MIGRATION_END"

Step 4d: Resume Writes (if applicable)

# Verify target is healthy before switching writes
curl -X POST http://app.internal/admin/target-health-check

# Switch application to target
curl -X POST http://app.internal/admin/switch-datastore \
  -H "Content-Type: application/json" \
  -d '{"target": "new-database"}'

# Resume normal operations
curl -X POST http://app.internal/admin/normal-mode

5. Post-Migration Validation

Row Count Verification

-- Compare row counts for all major tables
SELECT 'source_orders' as table_name, COUNT(*) as row_count FROM source.orders
UNION ALL
SELECT 'target_orders', COUNT(*) FROM target.orders
UNION ALL
SELECT 'source_users', COUNT(*) FROM source.users
UNION ALL
SELECT 'target_users', COUNT(*) FROM target.users;

Data Integrity Checks

-- Checksum comparison for critical tables
SELECT 'source', SUM(CHECKSUM(id, amount, created_at)) FROM source.payments
UNION ALL
SELECT 'target', SUM(CHECKSUM(id, amount, created_at)) FROM target.payments;

-- Verify no NULL values in required columns
SELECT COUNT(*) FROM target.orders WHERE customer_id IS NULL;
SELECT COUNT(*) FROM target.orders WHERE created_at IS NULL;

Application Smoke Tests

# Critical user flows
./scripts/smoke-test.sh --environment=production

# Performance baseline comparison
./scripts/performance-test.sh --target=new-db --baseline=old-db

Validation Check	Source	Target	Match	Time
Total row count	______	______	[ ]	______
Table-level counts	______	______	[ ]	______
Checksum for payments	______	______	[ ]	______
NULL constraint checks	N/A	______	[ ]	______
Smoke tests pass	N/A	______	[ ]	______
Performance within 10%	______	______	[ ]	______

6. Rollback Procedure

Trigger Conditions

Rollback if ANY of the following occur:

Error rate > 1% after migration
Data integrity check fails
Performance degradation > 50%
Customer-facing feature broken

Rollback Steps

# 1. Stop writes to target immediately
curl -X POST http://app.internal/admin/maintenance-mode

# 2. Switch application back to source
curl -X POST http://app.internal/admin/switch-datastore \
  -d '{"target": "source-database"}'

# 3. Resume operations on source
curl -X POST http://app.internal/admin/normal-mode

# 4. DO NOT DELETE target data until root cause is resolved
# 5. Document all findings for postmortem

Rollback Step	Status	Time
Maintenance mode activated	[ ]	______
Source restored as primary	[ ]	______
Application switched	[ ]	______
Smoke tests passed on source	[ ]	______
Target data preserved	[ ]	______

7. Post-Migration Actions

Monitor target system for 24 hours minimum
Compare error rates between pre and post migration
Validate backup of target system
Update runbook with actual duration and issues encountered
Schedule cleanup of source data (after 30-day retention)
Document lessons learned
Close incident channel when stable


## Explanation

The runbook separates **preparation** (backups, dry runs) from **execution** (the actual migration) and **validation** (data integrity checks). The critical insight is the **decision gate** after the dry run — never run an untested migration in production. The rollback procedure is designed to be fast (minutes, not hours) because data issues compound over time.

## Variants

| Context | Approach | Notes |
|---------|----------|-------|
| Database version upgrade | `pg_dumpall` / `pg_upgrade` | Test on identical OS and PostgreSQL versions |
| Cloud provider migration | AWS DMS / GCP Database Migration Service | Built-in validation, but monitor replication lag |
| Microservices extraction | Dual-write pattern | Complex, but zero downtime; requires application changes |
| Data warehouse ETL | Batch loads with Airflow | Schedule during low-traffic windows |
| NoSQL to SQL | Custom transformation scripts | Schema design is the hardest part; test queries thoroughly |

## Best Practices

1. **Always run a dry run** on production-scale data in an isolated environment
2. **Never modify the source** during migration — read-only access prevents accidental corruption
3. **Validate incrementally** — check row counts per table, not just totals
4. **Preserve both systems** until validation is complete and stable
5. **Document actual vs. estimated duration** — improves future planning

## Common Mistakes

1. **Not testing with production data volume** — small datasets hide performance issues
2. **Modifying source data during migration** — creates inconsistency that cannot be reconciled
3. **Skipping rollback rehearsal** — discovers rollback doesn't work when it's needed most
4. **Deleting source data too early** — validation may reveal issues hours after migration
5. **Not monitoring application behavior** — database migration success != application success

## Frequently Asked Questions

### How do I handle very large migrations (TB+)?

Use an incremental approach: migrate historical data in batches during low-traffic periods, then use CDC for the final delta. Tools like AWS DMS, Debezium, or custom batch scripts work well. Plan for days or weeks, not hours.

### What if source and target schemas differ?

Document the transformation in the migration script and validate every transformed field. Common issues: timezone conversions, character encodings, enum values, and nullable columns. Test edge cases in the dry run.

### How long should I keep source data after migration?

Minimum 30 days for most systems. For compliance-regulated data, follow your retention policy (often 90 days or longer). Keep until you're confident the migration is stable and all downstream consumers have verified their integrations.