Chaos Engineering — Principles, Tools, and Safe Experiments
A practical guide to chaos engineering: build resilient systems by intentionally injecting failures. Learn the five principles, Litmus, Gremlin, and Chaos Mesh.
Overview
Chaos engineering is the discipline of experimenting on a system to build confidence in its capability to withstand turbulent conditions. Instead of waiting for failures to occur in production, you intentionally inject them (pod kills, network latency, CPU exhaustion, disk fill) to validate that your system degrades gracefully and recovers automatically. The practice originated at Netflix with Chaos Monkey and has since evolved into a structured practice with principles, tools, and safety guardrails.
When to Use
- Your system claims to be “highly available” but has never been tested under failure
- You want to validate autoscaling, failover, and circuit breakers
- You need to discover unknown dependencies and single points of failure
- Incident response runbooks exist but are untested
- You are running on Kubernetes and want to validate pod resilience
The Five Principles of Chaos Engineering
- Build a hypothesis around steady-state behavior — define normal metrics (error rate < 0.1%, p99 latency < 200ms)
- Vary real-world events — inject failures that actually happen: network partitions, disk failures, dependency outages
- Run experiments in production — staging rarely matches production topology and load
- Automate experiments to run continuously — manual game days are valuable but not scalable
- Minimize blast radius — start small (one pod, one AZ), abort if SLOs are breached
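The first principle can be made concrete with a small check. The following is a minimal sketch, assuming you already collect request latencies and error counts over a measurement window; the names `SteadyState`, `p99`, and `is_steady` are hypothetical, and the thresholds are the example values from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    """Thresholds that define 'normal' (example values from the list above)."""
    max_error_rate: float = 0.001  # error rate < 0.1%
    max_p99_ms: float = 200.0      # p99 latency < 200 ms


def p99(latencies_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    n = len(ordered)
    rank = (99 * n + 99) // 100    # integer ceil(0.99 * n), 1-based rank
    return ordered[rank - 1]


def is_steady(latencies_ms, errors, total, limits=SteadyState()):
    """Return True when both metrics are inside the steady-state bounds."""
    error_rate = errors / total
    return (error_rate < limits.max_error_rate
            and p99(latencies_ms) < limits.max_p99_ms)
```

Running the same check before and during an experiment gives you a yes/no answer to "did the system leave its steady state?" instead of a judgment call on a dashboard.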
Experiment Design
┌─────────────────┐
│ 1. Steady state │ ← Define normal via metrics
│ 2. Hypothesis │ ← "If X fails, Y autoscales in < 60s"
│ 3. Inject fault │ ← Kill pod, add latency, fill disk
│ 4. Observe │ ← Compare actual vs hypothesis
│ 5. Rollback │ ← Abort if blast radius exceeds bounds
│ 6. Learn │ ← Fix weaknesses, automate fix
└─────────────────┘
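The loop in the diagram can be sketched in a few lines of Python. This is an illustrative outline, not the API of any chaos tool: `Experiment` and `run` are hypothetical names, and the three callables stand in for whatever your metrics system and fault injector provide.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    hypothesis: str                   # step 2: what you expect to happen
    steady_state: Callable[[], bool]  # step 1: True while metrics look normal
    inject: Callable[[], None]        # step 3: start the fault
    rollback: Callable[[], None]      # step 5: undo the fault


def run(exp: Experiment) -> str:
    """Run one experiment and report whether the hypothesis held."""
    if not exp.steady_state():
        return "skipped"              # never inject into an unhealthy system
    exp.inject()
    try:
        held = exp.steady_state()     # step 4: observe under fault
    finally:
        exp.rollback()                # always undo the fault, even on errors
    return "hypothesis held" if held else "weakness found"
```

Placing the rollback in a `finally` block means the fault is removed even if the observation step raises, which keeps the blast radius bounded when the tooling itself misbehaves.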
Chaos Mesh Example (Kubernetes)
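apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-api
  namespace: chaos-testing
spec:
  action: pod-kill        # kill the selected pod outright
  mode: one               # pick a single matching pod at random
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api
  duration: 30s
  # Chaos Mesh 1.x syntax; 2.x replaces this field with a separate Schedule resource
  scheduler:
    cron: "@every 10m"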
LitmusChaos Example
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: api-pod-delete
  namespace: litmus
spec:
  appinfo:
    appns: production
    applabel: "app=api"
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # Total length of the experiment, in seconds
            - name: TOTAL_CHAOS_DURATION
              value: "30"
            # Seconds between successive pod deletions
            - name: CHAOS_INTERVAL
              value: "10"
            # "false" requests graceful deletion instead of a forced kill
            - name: FORCE
              value: "false"
Common Experiment Types
| Experiment | Validates | Tool |
|---|---|---|
| Pod kill | Kubernetes rescheduling, readiness probes | Chaos Mesh, Litmus |
| Network latency | Timeout handling, circuit breakers | Chaos Mesh, Gremlin |
| CPU/memory stress | Autoscaling triggers, resource limits | stress-ng, Gremlin |
| Disk fill | Log rotation, storage alerts | Litmus, Gremlin |
| Zone outage | Multi-AZ failover | AWS FIS, Gremlin |
Safety Guardrails
- Abort conditions — auto-stop experiment if error rate > 1% or p99 > 500ms
- Time-bound — limit experiment duration (30s, 5m, not indefinite)
- Small scope — one pod → one deployment → one namespace → one AZ
- Business hours — run experiments when engineers are available
- Clear communication — announce experiments to avoid incident duplication
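The first two guardrails (abort conditions and a time bound) can be combined in a small watchdog. The sketch below is an assumption about how you might wire this up yourself, not the interface of Chaos Mesh, Litmus, or Gremlin: `should_abort` and `watch` are hypothetical names, `read_metrics` is assumed to return an `(error_rate, p99_ms)` pair, and `stop_fault` is assumed to remove the injected fault.

```python
import time


def should_abort(error_rate, p99_ms, max_error_rate=0.01, max_p99_ms=500.0):
    """Abort if error rate > 1% or p99 > 500 ms (thresholds from the list above)."""
    return error_rate > max_error_rate or p99_ms > max_p99_ms


def watch(read_metrics, stop_fault, max_seconds=30.0, poll_seconds=1.0,
          clock=time.monotonic, sleep=time.sleep):
    """Poll metrics until the time bound expires or an abort condition trips."""
    deadline = clock() + max_seconds
    try:
        while clock() < deadline:
            error_rate, p99_ms = read_metrics()
            if should_abort(error_rate, p99_ms):
                return "aborted"
            sleep(poll_seconds)
        return "completed"
    finally:
        stop_fault()  # the fault is removed on abort, timeout, or error
```

The `clock` and `sleep` parameters exist only so the loop can be exercised in tests without waiting in real time.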
Common Mistakes
- No steady-state definition — you cannot detect degradation if you do not know what normal looks like
- Blast radius too large — starting with a full region outage can cause real customer impact
- No abort mechanism — experiments must auto-terminate if SLOs are breached
- Blaming individuals for failures found — chaos engineering finds system weaknesses, not human errors
- Running experiments without runbooks — if the experiment finds a bug, you need a remediation plan
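One way to avoid the "blast radius too large" mistake is to widen scope only after the previous scope has passed. The sketch below encodes the pod, deployment, namespace, AZ progression from the guardrails section; `SCOPES` and `next_scope` are hypothetical names, and dropping back to the smallest scope after a failed run is an assumed policy, not a rule from any tool.

```python
# Ordered from smallest to largest blast radius.
SCOPES = ["one pod", "one deployment", "one namespace", "one AZ"]


def next_scope(current, last_run_passed):
    """Widen by one step after a pass; return to the smallest scope after a failure."""
    if not last_run_passed:
        return SCOPES[0]
    index = SCOPES.index(current)
    return SCOPES[min(index + 1, len(SCOPES) - 1)]
```

For example, `next_scope("one pod", True)` moves on to `"one deployment"`, while any failed run sends the next experiment back to `"one pod"` until the weakness is fixed.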
FAQ
Is chaos engineering safe for production? Yes, if done with guardrails. Start with the smallest possible blast radius and define abort conditions before you inject anything. The risk of an untested system failing in production is often higher than the risk of a controlled experiment.
What is the difference between chaos engineering and testing? Testing validates that code behaves correctly under known conditions. Chaos engineering validates that the system as a whole keeps working acceptably under real-world failure conditions, including ones you did not anticipate.
Do I need Kubernetes to do chaos engineering? No. Gremlin supports VMs, containers, and serverless. AWS Fault Injection Service (FIS, formerly Fault Injection Simulator) works with EC2 and RDS. Kubernetes just makes pod-level experiments easier.