Chaos Engineering
Build resilient systems by intentionally injecting failures and observing how your distributed services respond and recover.
Note: This guide follows English-language naming conventions and terminology standards common in international development teams. Examples use English identifiers and comments to maximize compatibility across codebases and tooling.
Overview
Chaos engineering is the discipline of experimenting on distributed systems to build confidence in their resilience. By intentionally injecting failures — killing instances, injecting latency, corrupting packets — teams discover weaknesses before customers do. Netflix pioneered this with Chaos Monkey; today, tools like Litmus, Gremlin, and AWS Fault Injection Simulator make it accessible to any team.
When to Use
Use this resource when:
- Operating distributed systems where failures are inevitable
- Preparing for disaster recovery drills and game days
- Validating auto-scaling, failover, and self-healing mechanisms
- Building confidence before high-traffic events (launches, Black Friday)
Solution
Kubernetes Pod Chaos (Litmus)
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-experiment
spec:
  appinfo:
    appns: 'production'
    applabel: 'app=payment-service'
    appkind: 'deployment'
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            - name: CHAOS_INTERVAL
              value: '10'
            - name: FORCE
              value: 'false'
Network Latency Injection (tc + Bash)
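#!/bin/bash
# Add 500ms latency (50ms jitter, normal distribution) to egress traffic on eth0.
# Requires root privileges.
set -euo pipefail

echo "Injecting 500ms latency for 60 seconds..."
tc qdisc add dev eth0 root netem delay 500ms 50ms distribution normal
# Always remove the netem qdisc on exit, even if the script is interrupted
trap 'echo "Removing latency..."; tc qdisc del dev eth0 root || true' EXIT

# Verify with ping while the latency is still active
ping -c 5 api.example.com || true
sleep 60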
AWS Fault Injection Simulator (Python)
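import boto3

# Assumes AWS credentials and a default region are already configured
fis = boto3.client('fis')

# Start an experiment from an existing template (the ID below is a placeholder)
response = fis.start_experiment(
    experimentTemplateId='EXT-12345678',
    tags={'Environment': 'staging'}
)
print(f"Experiment started: {response['experiment']['id']}")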
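Starting an experiment returns immediately, so a caller usually has to wait for the outcome. The following is a minimal sketch of polling until the experiment reaches a terminal state and stopping it if a deadline passes. It assumes the boto3 FIS client's `get_experiment` and `stop_experiment` calls; the helper name `wait_for_experiment`, the timeout values, and the exact set of terminal status strings are assumptions to check against the FIS documentation.

```python
import time

# Assumed terminal values of experiment['state']['status'] in the FIS API.
TERMINAL_STATES = {"completed", "stopped", "failed", "cancelled"}


def wait_for_experiment(fis, experiment_id, timeout_s=600, poll_s=10,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll an FIS experiment until it finishes; stop it if the deadline passes.

    `fis` is expected to be a boto3 FIS client (or any object exposing
    get_experiment/stop_experiment with the same keyword arguments).
    Returns the last observed status, or "stopping" if we requested a stop.
    """
    deadline = clock() + timeout_s
    while True:
        response = fis.get_experiment(id=experiment_id)
        status = response["experiment"]["state"]["status"]
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            # Manual abort: ask FIS to stop the experiment.
            fis.stop_experiment(id=experiment_id)
            return "stopping"
        sleep(poll_s)
```

With a real client this could be called as `wait_for_experiment(boto3.client('fis'), response['experiment']['id'])`. The `sleep` and `clock` parameters exist only so the loop can be exercised without waiting in real time.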
Explanation
Five chaos experiment types:
- Infrastructure: Kill VMs, terminate containers, detach volumes
- Network: Inject latency, drop packets, partition zones
- Application: Throw exceptions, return 503s, trigger memory leaks
- State: Fill disks, corrupt databases, expire certificates
- Dependency: Make downstream APIs timeout or return errors
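Application and dependency faults from the list above can be approximated in code without any external tool. The sketch below is one possible approach, not a feature of any of the tools named in this guide: a decorator that adds latency and raises an error for a configurable fraction of calls. The names `chaos`, `InjectedFault`, and `fetch_price` are hypothetical.

```python
import functools
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real dependency error during an experiment."""


def chaos(failure_rate=0.0, latency_s=0.0, enabled=lambda: True,
          rng=random.random, sleep=time.sleep):
    """Wrap a function so calls are delayed and/or fail with a given probability."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if enabled():
                if latency_s > 0:
                    sleep(latency_s)  # simulate a slow dependency
                if rng() < failure_rate:
                    raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@chaos(failure_rate=0.2, latency_s=0.05)
def fetch_price(sku):
    # Stand-in for a downstream API call.
    return {"sku": sku, "price": 9.99}
```

The `enabled` callable is the kill switch: wiring it to a feature flag lets the fault be turned off without a redeploy.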
The blast radius principle:
- Start in staging, then move to production with minimal traffic
- Always have an abort button (automatic rollback on SLO violation)
- Run during business hours when the team is available
- Measure against SLOs, not just “does it crash”
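One way to keep the blast radius small is to select only a fixed percentage of targets for an experiment. The sketch below assumes targets have stable string identifiers and uses hash-based bucketing, which is an illustrative choice rather than something the tools above prescribe. The function name `in_blast_radius` and the pod names are hypothetical.

```python
import hashlib


def in_blast_radius(target_id, experiment_name, percent):
    """Return True if this target falls inside the experiment's blast radius.

    Hashing (experiment_name, target_id) gives a stable bucket in [0, 100),
    so the same targets are selected on every call, and raising `percent`
    only ever adds targets to the selection.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{experiment_name}:{target_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000 / 100  # 0.00 .. 99.99
    return bucket < percent


pods = [f"payment-service-{i}" for i in range(1000)]
selected = [p for p in pods if in_blast_radius(p, "pod-delete-experiment", 5)]
print(f"{len(selected)} of {len(pods)} pods selected")
```

Because the bucket depends on the experiment name, different experiments hit different subsets, which avoids repeatedly stressing the same targets.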
Variants
| Tool | Platform | Experiment Types |
|---|---|---|
| Chaos Monkey | Spinnaker-managed deployments (originated at Netflix on AWS) | Instance termination |
| Litmus | Kubernetes | Pod, network, disk, stress |
| Gremlin | Multi-cloud | CPU, memory, network, state |
| AWS FIS | AWS | EC2, ECS, EKS, RDS failures |
| Toxiproxy | Any | Network latency, timeouts |
Best Practices
- Define steady state first: Know your normal error rate, latency, and throughput
- Hypothesis-driven: “If we kill the primary database, failover completes in <30s”
- Automate rollback: Stop experiments automatically if error rate exceeds 1%
- Run game days: Quarterly scheduled chaos events with the whole team
- Document findings: Every experiment produces a runbook update or architecture fix
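The first three practices (steady state, hypothesis, automated rollback) fit into a single control loop. The following is a minimal sketch under stated assumptions: `inject`, `rollback`, and `error_rate` are caller-supplied callables (for example, wrappers around a chaos tool and a metrics query), and the 1% threshold mirrors the example above. None of these names come from a specific library.

```python
import time


def run_experiment(inject, rollback, error_rate, max_error_rate=0.01,
                   duration_s=60, poll_s=5, sleep=time.sleep):
    """Run one chaos experiment with an automatic abort.

    inject / rollback: callables that start and undo the fault.
    error_rate: callable returning the current error rate as a fraction
                (0.01 means 1%).
    Returns a dict describing what happened.
    """
    baseline = error_rate()
    if baseline > max_error_rate:
        # Not in steady state; injecting now would make the result meaningless.
        return {"ran": False, "aborted": False, "baseline": baseline, "peak": baseline}

    peak = baseline
    aborted = False
    inject()
    try:
        elapsed = 0
        while elapsed < duration_s:
            sleep(poll_s)
            elapsed += poll_s
            current = error_rate()
            peak = max(peak, current)
            if current > max_error_rate:
                aborted = True  # hypothesis violated: stop early
                break
    finally:
        rollback()  # always undo the fault, including on exceptions
    return {"ran": True, "aborted": aborted, "baseline": baseline, "peak": peak}
```

A result with `aborted` set to `True` means the hypothesis did not hold, and the recorded `peak` is a natural input for the runbook update or architecture fix.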
Common Mistakes
- Chaos without monitoring: You can’t observe effects if dashboards are incomplete
- Production first: Never run chaos in production before proving it safe in staging
- No rollback plan: Experiments that can’t be stopped quickly become outages
- Testing only failures: Also test recovery (does auto-healing actually heal?)
- Ignoring blast radius: One experiment shouldn’t affect all customers
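To avoid testing only failures, the time to recover can be measured explicitly and compared with the hypothesis (for example, a failover target of under 30 seconds). This is a small sketch that assumes a caller-supplied `is_healthy` check, such as an HTTP health probe; the function name and defaults are hypothetical.

```python
import time


def measure_recovery(is_healthy, timeout_s=30, poll_s=1,
                     sleep=time.sleep, clock=time.monotonic):
    """Return seconds until is_healthy() is True, or None if the deadline passes."""
    start = clock()
    while True:
        if is_healthy():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(poll_s)
```

Calling this right after the fault is injected gives a recovery time to record alongside the experiment. A return value of `None` means auto-healing did not complete within the deadline.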
Frequently Asked Questions
Q: Is chaos engineering just breaking things randomly?
A: No. It’s hypothesis-driven experimentation with measured outcomes and automatic safety guards.
Q: How do I convince leadership to allow production chaos?
A: Start with staging, show findings, quantify prevented outages. Frame it as proactive insurance.
Q: What’s the difference between chaos engineering and load testing?
A: Load testing checks behavior under high traffic. Chaos engineering checks behavior under failures.