intermediate
Database Migration Runbook Template
A database migration runbook template for executing schema changes safely with rollback procedures, verification steps, and communication plans.
Topics: databases
Database Migration Runbook Template
Use this template to execute database schema changes without downtime or data loss.
Template
# Database Migration Runbook: [Migration Name]
## Overview
| Field | Value |
|-------|-------|
| **Migration ID** | [timestamp or sequential number] |
| **Author** | [name] |
| **Reviewed by** | [name] |
| **Databases affected** | [list] |
| **Estimated duration** | [minutes / hours] |
| **Risk level** | [Low / Medium / High] |
## Pre-Migration Checklist
- [ ] Schema change reviewed by senior engineer
- [ ] Migration script tested on copy of production data
- [ ] Rollback script tested and timed
- [ ] Backups verified (last successful backup < 24 hours)
- [ ] Maintenance window scheduled (if needed)
- [ ] On-call notified
- [ ] Monitoring dashboards bookmarked
## Migration Steps
### Step 1: [Action]
```sql
-- Example: add nullable column
ALTER TABLE orders ADD COLUMN tracking_url VARCHAR(500) NULL;
Step 2: [Action]
-- Example: create index concurrently
CREATE INDEX CONCURRENTLY idx_orders_tracking ON orders(tracking_url);
Step 3: [Action]
-- Example: backfill data
UPDATE orders SET tracking_url = 'https://...' WHERE shipped_at IS NOT NULL;
Verification
| Check | Query | Expected Result |
|---|---|---|
| Schema applied | \d orders | Column tracking_url exists |
| Index created | \di idx_orders_tracking | Index is valid |
| No locks held | pg_locks | No long-running locks |
| App health | Dashboard | Error rate < baseline |
Rollback Procedure
-- Step 1: drop index
DROP INDEX CONCURRENTLY IF EXISTS idx_orders_tracking;
-- Step 2: drop column
ALTER TABLE orders DROP COLUMN IF EXISTS tracking_url;
| Rollback Step | Time to Complete | Verification |
|---|---|---|
| Drop index | < 1 minute | Query plan reverts |
| Drop column | < 1 minute | Schema matches pre-migration |
Post-Migration
- Application error rate normal
- Latency within baseline
- Replication lag acceptable
- On-call handoff notes updated
- Runbook archived with actual duration
Communication
| Audience | Timing | Message |
|---|---|---|
| Engineering | Before | Maintenance window announced |
| On-call | During | Real-time status updates |
| Stakeholders | After | All-clear + any issues encountered |
## Migration Safety Rules
| Rule | Why | Exception |
|------|-----|-----------|
| **Add columns as nullable** | Existing rows need a value | Supply default in same transaction |
| **Create indexes concurrently** | Avoids table locks | Not available on all databases |
| **Backfill in batches** | Prevents lock escalation | Small tables (< 1M rows) |
| **Test rollback first** | Rollback you have never practiced is a guess | None |
| **Run during low traffic** | Reduces blast radius | Emergency fixes |
## Best Practices
- **Use expand-contract for breaking changes** — add new schema, deploy code, remove old schema in separate migrations
- **Batch large updates** — `UPDATE ... WHERE id BETWEEN 1 AND 10000` in a loop, with sleeps
- **Monitor replication lag** — large DDL can block replication; pause if lag exceeds thresholds
- **Keep migrations idempotent** — `IF NOT EXISTS` and `IF EXISTS` let you re-run safely
- **Document actual duration** — future estimates improve when you track reality
## Common Mistakes
- Running untested migrations in production — test on a copy with realistic data size
- Forgetting to use `CONCURRENTLY` — locks the table for writes, causing outages
- Large transactions without batching — a single `UPDATE` on 100M rows will lock and rollback slowly
- No rollback plan — "we will figure it out" is not a plan
- Migrating during peak traffic — even safe migrations add load; schedule off-peak
## Frequently Asked Questions
### Should I use a migration tool or raw SQL?
Use a tool (Flyway, Liquibase, Django migrations, Rails migrations). Tools track applied migrations, enforce ordering, and provide rollback hooks. Raw SQL scripts require manual tracking and are error-prone.
### How do I handle a failed migration in production?
Stop immediately. Do not apply subsequent migrations. Assess whether to rollback or fix forward. Rollback if data integrity is at risk. Fix forward if the fix is a small, well-understood script. Always have the rollback script ready before you start.
### Can I run migrations in a transaction?
Yes, for DDL-safe databases (PostgreSQL). For MySQL, DDL is implicitly committed, so transactions do not protect you. Know your database's behavior before you plan the migration strategy.