Level: Advanced | By StackPractices

# Disaster Recovery Test Plan

A template for planning and executing disaster recovery tests including failover validation, data integrity checks, and recovery time measurement.

Note: This guide follows English-language naming conventions and terminology standards common in international development teams. Examples use English identifiers and comments to maximize compatibility across codebases and tooling.

## Overview

Disaster recovery plans that have never been tested are merely wishful thinking. A real disaster exposes gaps in documentation, missing dependencies, and unrealistic assumptions about recovery times. This test plan provides a structured approach to validating your DR procedures through controlled failover exercises, data integrity verification, and RTO/RPO measurement.

## When to Use

Use this resource when:

- Annual compliance requirements mandate DR testing (SOC 2, ISO 27001)
- You have recently changed infrastructure or disaster recovery procedures
- A previous incident revealed gaps in your DR capabilities
- You need to validate RTO and RPO commitments to stakeholders

## Prerequisites

Before starting:

- DR runbooks reviewed and updated within the last 90 days
- Test environment available (isolated from production)
- Backup snapshots confirmed restorable within the last 7 days
- Stakeholders notified of test window and expected impact
- Rollback plan documented in case the test goes wrong

## Solution

# Disaster Recovery Test Plan: `<Service Name>`

## 1. Test Objectives

| Objective | Target | Measurement |
|-----------|--------|-------------|
| Recovery Time Objective (RTO) | < 4 hours | Time from disaster declaration to service availability, confirmed by passing validation |
| Recovery Point Objective (RPO) | < 15 minutes | Time between the last recovered transaction and disaster declaration |
| Data Integrity | 100% | Share of recovered transactions verified against source |
| Communication | < 30 minutes | Time from test start until all stakeholders are notified |
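
As a rough illustration of how the RTO and RPO rows might be scored after a test, the sketch below derives both values from three recorded timestamps. The function name `evaluate_objectives`, its parameters, and the sample timestamps are assumptions made for this example; only the two targets come from the table.

```python
from datetime import datetime, timedelta

# Targets taken from the objectives table.
RTO_TARGET = timedelta(hours=4)
RPO_TARGET = timedelta(minutes=15)


def evaluate_objectives(declared_at, service_available_at, last_recovered_tx_at):
    """Return measured RTO/RPO and whether each is under its target.

    All three arguments are hypothetical inputs recorded by the test lead:
    - declared_at: when the disaster was declared
    - service_available_at: when the recovered service passed validation
    - last_recovered_tx_at: newest transaction present after recovery
    """
    rto = service_available_at - declared_at
    rpo = declared_at - last_recovered_tx_at
    return {
        "rto": rto,
        "rto_pass": rto < RTO_TARGET,
        "rpo": rpo,
        "rpo_pass": rpo < RPO_TARGET,
    }


if __name__ == "__main__":
    # Made-up timestamps, for illustration only.
    result = evaluate_objectives(
        declared_at=datetime(2024, 1, 15, 10, 0),
        service_available_at=datetime(2024, 1, 15, 12, 45),
        last_recovered_tx_at=datetime(2024, 1, 15, 9, 52),
    )
    for key, value in result.items():
        print(f"{key}: {value}")
```

Recording the three timestamps during the test and computing the result afterwards keeps the pass/fail decision independent of anyone's recollection of how long recovery took.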

## 2. Scope & Assumptions

### In Scope
- Database failover and restoration from backup
- Application redeployment to DR region
- DNS cutover and traffic routing
- Smoke test validation of core user flows

### Out of Scope
- Third-party service dependencies (assumed available)
- Physical data center failures (cloud-based test)
- Complete network isolation scenarios

### Assumptions
- Latest backup snapshot is valid and complete
- DR region has sufficient capacity
- Network connectivity between regions is functional
- Team members are available during the test window

## 3. Test Scenarios

### Scenario A: Primary Region Complete Failure

**Trigger:** Simulated complete unavailability of the primary AWS region.

**Steps:**
1. **Declare disaster** (T+0:00): Lead engineer announces the DR test in the incident channel
2. **Restore database** (T+0:05): Restore from the latest snapshot in the DR region
3. **Deploy applications** (T+1:30): Deploy the application stack to the DR Kubernetes cluster
4. **Update DNS** (T+2:00): Cut traffic over to the DR load balancer
5. **Validate** (T+2:30): Run smoke tests and verify data integrity
6. **Measure RTO**: Record the time from declaration to passing validation

### Scenario B: Database Corruption

**Trigger:** Primary database has corrupt data requiring point-in-time recovery.

**Steps:**
1. Identify corruption point from monitoring alerts
2. Restore database to 5 minutes before corruption timestamp
3. Verify transactional integrity with checksum queries
4. Promote restored instance to primary
5. Redirect application traffic
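
Step 2 involves a small piece of timestamp arithmetic that is easy to get wrong under pressure, especially across time zones. The sketch below computes the restore target from the corruption timestamp. The function name `restore_target` and the sample timestamp are assumptions for illustration; the 5-minute margin comes from the step itself.

```python
from datetime import datetime, timedelta, timezone

# Safety margin from step 2 of Scenario B.
SAFETY_MARGIN = timedelta(minutes=5)


def restore_target(corruption_detected_at, margin=SAFETY_MARGIN):
    """Return the point-in-time recovery target as a UTC timestamp."""
    if corruption_detected_at.tzinfo is None:
        # A naive timestamp could be read as local time and restore the wrong moment.
        raise ValueError("corruption_detected_at must be timezone-aware")
    return (corruption_detected_at - margin).astimezone(timezone.utc)


if __name__ == "__main__":
    # Made-up corruption timestamp, for illustration only.
    corrupted_at = datetime(2024, 1, 15, 14, 32, tzinfo=timezone.utc)
    print(restore_target(corrupted_at).isoformat())
```

Writes made between the restore target and the corruption point are lost unless they are replayed, so the margin counts against the RPO.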

## 4. Test Execution

### Pre-Test Checklist

| Item | Status | Owner |
|------|--------|-------|
| Backup snapshot created | [ ] | DBA |
| DR environment provisioned | [ ] | Platform |
| Runbooks printed / accessible offline | [ ] | SRE |
| Stakeholders notified | [ ] | Incident Lead |
| Rollback plan confirmed | [ ] | SRE |
| Monitoring dashboards ready | [ ] | Observability |

### Execution Timeline

| Time | Activity | Expected Result |
|------|----------|-----------------|
| T+0:00 | Declare test start | Incident channel notified |
| T+0:05 | Initiate database restore | RDS restore job started |
| T+1:00 | Database available | Connection test passes |
| T+1:30 | Deploy application | All pods healthy |
| T+2:00 | DNS cutover | Traffic hitting DR region |
| T+2:30 | Smoke tests | 100% pass rate |
| T+3:00 | Data integrity check | Transaction count matches source |
| T+4:00 | RTO measurement | Record actual vs target |
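
One way to turn this timeline into a deviation report is to compare the planned offsets with the offsets actually observed. The sketch below is a minimal version of that idea: the planned values are copied from the table, while the function name `deviations` and the observed values are made up for illustration.

```python
from datetime import timedelta

# Planned offsets after T+0, copied from the execution timeline.
PLANNED = {
    "Declare test start": timedelta(0),
    "Initiate database restore": timedelta(minutes=5),
    "Database available": timedelta(hours=1),
    "Deploy application": timedelta(hours=1, minutes=30),
    "DNS cutover": timedelta(hours=2),
    "Smoke tests": timedelta(hours=2, minutes=30),
    "Data integrity check": timedelta(hours=3),
}


def deviations(actual):
    """Return (activity, planned, actual, delay) rows for activities with a recorded time.

    `actual` maps an activity name to its observed offset from T+0.
    A positive delay means the step happened later than planned.
    """
    rows = []
    for activity, planned in PLANNED.items():
        if activity in actual:
            rows.append((activity, planned, actual[activity], actual[activity] - planned))
    return rows


if __name__ == "__main__":
    observed = {  # made-up numbers, for illustration only
        "Declare test start": timedelta(0),
        "Initiate database restore": timedelta(minutes=12),
        "Database available": timedelta(hours=1, minutes=40),
    }
    for activity, planned, actual, delay in deviations(observed):
        print(f"{activity}: planned T+{planned}, actual T+{actual}, delay {delay}")
```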

### Rollback Procedure

If any critical step fails:
1. Pause test immediately
2. Revert DNS to primary region
3. Do NOT delete DR resources until post-test review
4. Escalate to engineering manager
5. Schedule follow-up test after fixes

## 5. Data Integrity Verification

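```sql
-- Note: CHECKSUM(*) is SQL Server (T-SQL) syntax; substitute your engine's hash function.
-- dr_orders and dr_payments are assumed to be the restored DR copies of each table.

-- Transaction count comparison
SELECT 'primary' AS source, COUNT(*) AS tx_count FROM orders
UNION ALL
SELECT 'dr', COUNT(*) FROM dr_orders;

-- Row-level checksum comparison (cast to BIGINT so the sum cannot overflow INT)
SELECT 'primary' AS source, SUM(CAST(CHECKSUM(*) AS BIGINT)) AS row_checksum FROM payments
UNION ALL
SELECT 'dr', SUM(CAST(CHECKSUM(*) AS BIGINT)) FROM dr_payments;

-- Latest transaction timestamp on each side
SELECT 'primary' AS source, MAX(created_at) AS latest_tx FROM orders
UNION ALL
SELECT 'dr', MAX(created_at) FROM dr_orders;
```

| Check | Primary | DR | Match |
|-------|---------|----|-------|
| Transaction count | ______ | ______ | [ ] |
| Row-level checksum | ______ | ______ | [ ] |
| Latest timestamp | ______ | ______ | [ ] |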

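## 6. Post-Test Report

### Results Summary

| Metric | Target | Actual | Status |
|--------|--------|--------|--------|
| RTO | < 4 hours | ______ | Pass / Fail |
| RPO | < 15 minutes | ______ | Pass / Fail |
| Data integrity | 100% | ______ | Pass / Fail |
| Smoke tests | 100% | ______ | Pass / Fail |

### Issues Found

| Issue | Severity | Owner | Fix Deadline |
|-------|----------|-------|--------------|
| ______ | Critical / High / Medium / Low | ______ | ______ |

### Lessons Learned

- What worked well:
- What took longer than expected:
- What failed unexpectedly:
- What should be automated:

### Sign-Off

| Role | Name | Date | Signature |
|------|------|------|-----------|
| Test Lead | ______ | ______ | ______ |
| Engineering Manager | ______ | ______ | ______ |
| Compliance (if required) | ______ | ______ | ______ |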

## Explanation

The plan structures DR testing into **declared objectives** (RTO/RPO targets), **controlled scenarios** (specific failure modes), and **measurable results** (pass/fail criteria). The pre-test checklist prevents tests from failing due to missing prerequisites rather than actual DR gaps. The rollback procedure acknowledges that tests can fail and protects production from test-induced outages.

## Variants

| Context | Approach | Notes |
|---------|----------|-------|
| Annual compliance test | Full scenario with auditors | Document everything, maintain sign-offs |
| Quarterly internal drill | Abbreviated scenario, no auditor | Focus on team coordination and timing |
| After infrastructure change | Targeted scenario for changed component | Validate only what changed |
| Game day / chaos engineering | Unannounced test | Most realistic, requires mature automation |

## Best Practices

1. **Schedule tests during business hours** — the people who need to learn from them must participate
2. **Measure, don't estimate** — actual RTO is almost always longer than the documented estimate
3. **Test from backup, not from live replication** — validates your backup restoration, not just failover
4. **Document every deviation** — even minor timing differences indicate process gaps
5. **Automate smoke tests** — manual verification is too slow during a real disaster
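
For the last practice, a smoke-test runner can be very small. The sketch below uses only the Python standard library. The URLs in `CHECKS` are hypothetical placeholders, the function names are invented for this example, and treating "HTTP 200" as a pass is an assumption that real core-flow checks would likely need to extend.

```python
import urllib.error
import urllib.request

# Hypothetical endpoints; replace with your service's core user flows.
CHECKS = {
    "health": "https://dr.example.com/healthz",
    "login page": "https://dr.example.com/login",
}


def http_status(url, timeout=5):
    """Return the HTTP status code for a GET request, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered, but with an error status
    except OSError:
        return None  # DNS failure, refused connection, timeout, etc.


def run_smoke_tests(checks, fetch=http_status):
    """Run every check and return {name: passed}; a check passes on HTTP 200."""
    return {name: fetch(url) == 200 for name, url in checks.items()}


if __name__ == "__main__":
    # Offline demonstration with a stub fetcher; omit `fetch` to hit the real URLs.
    def stub(url):
        return 200 if url.endswith("/healthz") else 503

    for name, passed in run_smoke_tests(CHECKS, fetch=stub).items():
        print(f"{'PASS' if passed else 'FAIL'}  {name}")
```

Passing the fetcher as a parameter lets the runner itself be tested without network access, which matters because the smoke tests are part of the recovery path being validated.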

## Common Mistakes

1. **Testing only failover, not restoration from backup** — replication may be healthy while backups are corrupt
2. **Not testing with realistic data volumes** — restoring 1TB takes longer than restoring 1GB
3. **Skipping the rollback procedure test** — getting back to normal is part of DR
4. **Testing during low-traffic periods only** — doesn't validate capacity assumptions
5. **Not updating runbooks after test findings** — the same gaps appear in next year's test
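
To make the second mistake concrete, a back-of-the-envelope estimate shows how restore time scales with data volume. The sketch below assumes a constant sustained throughput, which real restores rarely achieve. The function name and the 100 MB/s figure are illustrative assumptions, not measurements of any particular service.

```python
def estimated_restore_hours(data_gb, throughput_mb_per_s):
    """Estimate restore duration in hours for a data volume and sustained throughput."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = (data_gb * 1000) / throughput_mb_per_s  # 1 GB = 1000 MB (decimal units)
    return seconds / 3600


if __name__ == "__main__":
    # Illustrative throughput only; measure your own during a test restore.
    for size_gb in (1, 1000):
        hours = estimated_restore_hours(size_gb, throughput_mb_per_s=100)
        print(f"{size_gb} GB at 100 MB/s: {hours:.2f} hours")
```

At the assumed 100 MB/s, 1 GB restores in about 10 seconds while 1 TB takes roughly 2.8 hours, which would consume most of a 4-hour RTO before any application is deployed.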

## Frequently Asked Questions

### How often should we run DR tests?

At minimum, test annually for compliance and quarterly for internal validation, and retest after every infrastructure change that affects recovery paths. Mature organizations test monthly with automated chaos engineering.

### What if the DR test fails catastrophically?

That's valuable information. The test has revealed that your DR plan doesn't work — better to discover this in a controlled test than during a real disaster. Pause the test, restore production, fix the issues, and reschedule.

### Do we need to notify customers about DR tests?

Only if the test impacts customer-facing services. Internal-only tests need no external notification. Customer-impacting tests should be scheduled during maintenance windows with advance notice.