Data Migration — Zero-Downtime Strategies That Work
A practical guide to data migration: planning, dual-write patterns, backfill strategies, schema evolution, validation, and rollback procedures for moving data without service interruption.
Note: This guide follows English-language naming conventions and terminology standards common in international development teams. Examples use English identifiers and comments to maximize compatibility across codebases and tooling.
Overview
Data migration is the process of moving data from one system, schema, or format to another. Unlike code deployments, data migrations are irreversible: once data is transformed or moved, rolling back requires another migration. Poorly executed migrations cause data loss, corruption, or extended downtime.
This guide covers proven patterns for migrating data safely, including dual writes, backfills, schema evolution, and validation strategies.
When to Use
- You are moving from one database to another (MySQL → PostgreSQL, on-prem → cloud)
- You are restructuring tables or normalizing/denormalizing data
- You are introducing a new data store (adding Elasticsearch, Redis, or a data warehouse)
- You are sharding an existing database
- You are migrating from a legacy system to a modern platform
- You need to split or merge services with their own datastores
Core Concepts
| Concept | Description |
|---|---|
| Dual Write | Writing to both old and new systems simultaneously |
| Backfill | Populating a new datastore with historical data |
| Shadow Read | Reading from the new system and comparing with the old |
| Cutover | Switching reads and writes from old to new system |
| Rollback Window | The time during which you can revert without data loss |
| Idempotency | Running the same migration twice produces the same result |
Migration Strategies
Choose the right approach based on risk tolerance and system constraints:
| Strategy | Downtime | Risk | Best For |
|---|---|---|---|
| Dual-write + backfill | None | Low | New datastore, schema changes |
| Expand-contract (column) | None | Low | Adding/removing columns |
| Snapshot + CDC | Brief | Medium | Database engine migrations |
| Blue/green with migration | Brief | Medium | Major schema restructures |
| Stop-the-world | Hours | High | Small databases, maintenance windows |
| Strangler fig | None | Low | Gradual legacy system migration |
Step-by-Step Zero-Downtime Migration
1. Plan the Migration
Document every step before touching production data:
## Migration Plan: Users Table Normalization
**Goal:** Split `users` table into `users` + `user_profiles`
**Timeline:** 3 weeks
**Rollback window:** 48 hours after cutover
### Phase 1: Schema Changes (Week 1)
- [ ] Add `user_profiles` table
- [ ] Add foreign key `users.profile_id`
- [ ] Deploy application code that dual-writes to both tables
- [ ] Verify writes are succeeding to both tables
### Phase 2: Backfill (Week 1-2)
- [ ] Run backfill script in batches (1000 rows/batch)
- [ ] Monitor script progress and error rate
- [ ] Verify backfill completeness with row counts and checksums
### Phase 3: Shadow Reads (Week 2)
- [ ] Enable reading from `user_profiles` in parallel
- [ ] Compare results: old vs new (log mismatches)
- [ ] Fix any data discrepancies
### Phase 4: Cutover (Week 3)
- [ ] Switch reads to `user_profiles`
- [ ] Monitor error rates for 24 hours
- [ ] Remove dual-write code
- [ ] Drop old columns (after rollback window)
### Validation Checklist
- [ ] Row count matches: `SELECT COUNT(*) FROM users` == `SELECT COUNT(*) FROM user_profiles`
- [ ] Sample data comparison: 100 random users compared field-by-field
- [ ] Application integration tests pass
- [ ] Performance tests pass (new queries are fast enough)
### Rollback Plan
- [ ] If issues within 48h: revert read path to old schema
- [ ] If issues after 48h: write forward-fix migration (no rollback possible)
2. Implement Dual Writes
Write to both old and new systems during the transition:
# Example: Dual-write during migration
class UserRepository:
def __init__(self, old_db, new_db):
self.old_db = old_db
self.new_db = new_db
def create_user(self, user_data):
# Write to old system (source of truth during migration)
user_id = self.old_db.users.insert(user_data)
# Write to new system (best effort, log failures)
try:
self.new_db.user_profiles.insert({
'user_id': user_id,
'display_name': user_data['name'],
'bio': user_data.get('bio', ''),
'created_at': user_data['created_at']
})
except Exception as e:
logger.error("Dual write failed", extra={
'user_id': user_id,
'error': str(e)
})
# Do NOT fail the request — old system is still source of truth
return user_id
def get_user(self, user_id):
# During shadow read phase: read from new, fallback to old
try:
profile = self.new_db.user_profiles.find_by_user_id(user_id)
if profile:
return self._convert_profile_to_user(profile)
except Exception:
pass
return self.old_db.users.find_by_id(user_id)
# Example: Backfill script with batching and resumability
import time
class BackfillUsers:
def __init__(self, old_db, new_db):
self.old_db = old_db
self.new_db = new_db
self.batch_size = 1000
self.checkpoint_table = 'migration_checkpoints'
def run(self):
last_id = self._get_checkpoint()
while True:
batch = self.old_db.users.find_after_id(last_id, limit=self.batch_size)
if not batch:
break
for user in batch:
self._migrate_user(user)
last_id = batch[-1]['id']
self._save_checkpoint(last_id)
# Throttle to avoid overwhelming the database
time.sleep(0.1)
def _migrate_user(self, user):
"""Idempotent user migration."""
# Upsert ensures idempotency
self.new_db.user_profiles.upsert(
{'user_id': user['id']},
{
'display_name': user['name'],
'bio': user.get('bio', ''),
'created_at': user['created_at']
}
)
def _get_checkpoint(self):
row = self.old_db.execute(
f"SELECT last_id FROM {self.checkpoint_table} WHERE migration = 'users_to_profiles'"
)
return row['last_id'] if row else 0
def _save_checkpoint(self, last_id):
self.old_db.execute(f"""
INSERT INTO {self.checkpoint_table} (migration, last_id)
VALUES ('users_to_profiles', %s)
ON CONFLICT (migration) DO UPDATE SET last_id = EXCLUDED.last_id
""", (last_id,))
3. Validate Data Integrity
Never assume a migration succeeded. Verify everything:
# Example: Validation after backfill
class MigrationValidator:
def __init__(self, old_db, new_db):
self.old_db = old_db
self.new_db = new_db
def validate_counts(self):
"""Verify row counts match."""
old_count = self.old_db.execute("SELECT COUNT(*) as c FROM users")['c']
new_count = self.new_db.execute("SELECT COUNT(*) as c FROM user_profiles")['c']
assert old_count == new_count, f"Count mismatch: {old_count} != {new_count}"
print(f"Row counts match: {old_count}")
def validate_sample(self, sample_size=1000):
"""Compare random samples field-by-field."""
users = self.old_db.execute(f"""
SELECT * FROM users
ORDER BY RANDOM()
LIMIT {sample_size}
""")
mismatches = 0
for user in users:
profile = self.new_db.user_profiles.find_by_user_id(user['id'])
if not profile:
print(f"Missing profile for user {user['id']}")
mismatches += 1
continue
# Field-by-field comparison
if user['name'] != profile['display_name']:
print(f"Name mismatch: user={user['id']}")
mismatches += 1
assert mismatches == 0, f"Found {mismatches} mismatches in sample"
print(f"Sample validation passed ({sample_size} rows)")
def validate_checksums(self):
"""Compare aggregate checksums."""
old_checksum = self.old_db.execute("""
SELECT MD5(string_agg(name || bio, ',' ORDER BY id)) as checksum
FROM users
""")
new_checksum = self.new_db.execute("""
SELECT MD5(string_agg(display_name || bio, ',' ORDER BY user_id)) as checksum
FROM user_profiles
""")
assert old_checksum == new_checksum, "Checksum mismatch!"
print("Checksum validation passed")
4. Execute the Cutover
Switch traffic from old to new system:
## Cutover Checklist
### Before Cutover
- [ ] Backfill is 100% complete
- [ ] Validation passed (counts, samples, checksums)
- [ ] Shadow reads show <0.1% mismatch rate
- [ ] New system performance is acceptable under load
- [ ] Rollback procedure is documented and tested
- [ ] Team is on standby during cutover window
### During Cutover
1. **Pause non-critical writes** (optional, reduces risk)
2. **Enable feature flag** to route reads to new system
3. **Monitor error rates for 15 minutes**
4. **If errors spike:** disable feature flag (instant rollback)
5. **If stable:** proceed to write cutover
6. **Enable writes to new system**
7. **Monitor for 1 hour**
### After Cutover
- [ ] Error rates within normal range for 24 hours
- [ ] New system handling 100% of traffic
- [ ] Old system still receiving dual writes (for safety)
- [ ] Rollback window countdown started (48 hours)
# Example: Feature flag-based cutover
class UserService:
def __init__(self, config):
self.use_new_schema = config.get('use_new_user_schema', False)
def get_user(self, user_id):
if self.use_new_schema:
return self._get_from_new_schema(user_id)
return self._get_from_old_schema(user_id)
def create_user(self, user_data):
# Always dual-write during migration
old_id = self._create_in_old(user_data)
self._create_in_new(user_data, old_id)
return old_id
Schema Evolution Patterns
Evolve schemas without breaking existing code:
1. Expand-Contract for Columns
-- Step 1: Add new column (nullable)
ALTER TABLE users ADD COLUMN email_normalized VARCHAR(255);
-- Step 2: Backfill new column
UPDATE users SET email_normalized = LOWER(email) WHERE email_normalized IS NULL;
-- Step 3: Deploy code that writes to both columns
-- Application code: set email_normalized on every insert/update
-- Step 4: Make new column non-nullable, add constraint
ALTER TABLE users ALTER COLUMN email_normalized SET NOT NULL;
-- Step 5: Deploy code that reads from new column
-- Step 6: Drop old column (after rollback window)
ALTER TABLE users DROP COLUMN email;
2. Table Splitting
-- Step 1: Create new table
CREATE TABLE user_profiles (
user_id BIGINT PRIMARY KEY REFERENCES users(id),
bio TEXT,
preferences JSONB,
created_at TIMESTAMP DEFAULT NOW()
);
-- Step 2: Dual-write trigger
CREATE OR REPLACE FUNCTION sync_user_profile() RETURNS trigger AS $$
BEGIN
INSERT INTO user_profiles (user_id, bio, preferences, created_at)
VALUES (NEW.id, NEW.bio, NEW.preferences, NEW.created_at)
ON CONFLICT (user_id) DO UPDATE SET
bio = EXCLUDED.bio,
preferences = EXCLUDED.preferences;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER users_profile_sync
AFTER INSERT OR UPDATE ON users
FOR EACH ROW EXECUTE FUNCTION sync_user_profile();
-- Step 3: Backfill
INSERT INTO user_profiles (user_id, bio, preferences, created_at)
SELECT id, bio, preferences, created_at FROM users
ON CONFLICT (user_id) DO NOTHING;
-- Step 4: Cutover reads to user_profiles
-- Step 5: Remove columns from users table (after rollback window)
Best Practices
- Always test migrations on a copy of production data. Staging data rarely matches production volume or edge cases.
- Make migrations idempotent. If a script crashes at row 500,000, restarting it should not create duplicates.
- Throttle backfills. Running at maximum speed starves production queries. Use rate limiting.
- Validate with more than row counts. Compare checksums, sample random rows, and run integration tests.
- Never drop old data immediately. Keep the rollback window open (24-72 hours minimum).
- Monitor during the entire process. Set up dashboards specifically for the migration.
- Communicate broadly. Data migrations affect every team that touches the database.
Common Mistakes
- No rollback plan. Once you delete old columns, rolling back requires another complex migration.
- Running migrations during peak hours. Schedule backfills during low-traffic windows.
- Forgetting about foreign keys. Migrating a parent table without updating child table references breaks constraints.
- No validation. Assuming the migration worked because it finished without errors.
- Deleting data too early. The “expand-contract” pattern exists because rollbacks are necessary.
- Underestimating duration. A migration that takes 2 hours in staging may take 20 hours in production.
Variants
- Online schema change tools:
pt-online-schema-change(Percona),gh-ost(GitHub) for MySQL;pg_repackfor PostgreSQL - CDC-based migration: Debezium captures changes and streams them to the new system in real time
- Dump and restore:
pg_dump/pg_restorefor smaller databases with maintenance windows - Cloud migration services: AWS DMS, Azure Database Migration Service, Google Database Migration Service
FAQ
Q: How long should I keep the old schema after cutover? At least 48 hours for low-risk migrations, up to 2 weeks for high-risk changes. The longer the window, the safer you are.
Q: What if my migration fails halfway through? If the migration is idempotent, restart it. If not, restore from backup and retry. This is why checkpoints and batching are critical.
Q: How do I migrate without dual-write support? Use Change Data Capture (Debezium) or snapshot + incremental sync. These require more infrastructure but work without application changes.
Q: Can I migrate a database while it is under heavy load?
Yes, but throttle the backfill. Use pg_sleep between batches, run during off-peak hours, and monitor replication lag.
Conclusion
Data migration is not a single event but a process: plan, dual-write, backfill, validate, shadow-read, cutover, and clean up. By following structured patterns and never skipping validation, you move data safely while keeping systems online.