Chaos Engineering

Q: Is chaos engineering just breaking things randomly?

No. It's hypothesis-driven experimentation with measured outcomes and automatic safety guards.

Q: How do I convince leadership to allow production chaos?

Start with staging, show findings, quantify prevented outages. Frame it as proactive insurance.

Q: What's the difference between chaos engineering and load testing?

Load testing checks behavior under high traffic. Chaos engineering checks behavior under failures.

Overview

Chaos engineering is the discipline of experimenting on distributed systems to build confidence in their resilience. By intentionally injecting failures — killing instances, injecting latency, corrupting packets — teams discover weaknesses before customers do. Netflix pioneered this with Chaos Monkey; today, tools like Litmus, Gremlin, and AWS Fault Injection Simulator make it accessible to any team.

When to Use

Use this resource when:

Operating distributed systems where failures are inevitable
Preparing for disaster recovery drills and game days
Validating auto-scaling, failover, and self-healing mechanisms
Building confidence before high-traffic events (launches, Black Friday)

Solution

Kubernetes Pod Chaos (Litmus)

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-experiment
spec:
  appinfo:
    appns: 'production'
    applabel: 'app=payment-service'
    appkind: 'deployment'
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            - name: CHAOS_INTERVAL
              value: '10'
            - name: FORCE
              value: 'false'

Network Latency Injection (tc + Bash)

#!/bin/bash
# Add 500ms latency to egress traffic on eth0

echo "Injecting 500ms latency for 60 seconds..."
tc qdisc add dev eth0 root netem delay 500ms 50ms distribution normal

sleep 60

echo "Removing latency..."
tc qdisc del dev eth0 root

# Verify with ping
ping -c 5 api.example.com

AWS Fault Injection Simulator (Python)

import boto3

fis = boto3.client('fis')

response = fis.start_experiment(
    experimentTemplateId='EXT-12345678',
    tags={'Environment': 'staging'}
)

print(f"Experiment started: {response['experiment']['id']}")

Explanation

Five chaos experiment types:

Infrastructure: Kill VMs, terminate containers, detach volumes
Network: Inject latency, drop packets, partition zones
Application: Throw exceptions, return 503s, trigger memory leaks
State: Fill disks, corrupt databases, expire certificates
Dependency: Make downstream APIs timeout or return errors

The blast radius principle:

Start in staging, then move to production with minimal traffic
Always have an abort button (automatic rollback on SLO violation)
Run during business hours when the team is available
Measure against SLOs, not just “does it crash”

Variants

Tool	Platform	Experiment Types
Chaos Monkey	AWS/Netflix	Instance termination
Litmus	Kubernetes	Pod, network, disk, stress
Gremlin	Multi-cloud	CPU, memory, network, state
AWS FIS	AWS	EC2, ECS, EKS, RDS failures
Toxiproxy	Any	Network latency, timeouts

Best Practices

Define steady state first: Know your normal error rate, latency, and throughput
Hypothesis-driven: “If we kill the primary database, failover completes in <30s”
Automate rollback: Stop experiments automatically if error rate exceeds 1%
Run game days: Quarterly scheduled chaos events with the whole team
Document findings: Every experiment produces a runbook update or architecture fix

Common Mistakes

Chaos without monitoring: You can’t observe effects if dashboards are incomplete
Production first: Never run chaos in production before proving it safe in staging
No rollback plan: Experiments that can’t be stopped quickly become outages
Testing only failures: Also test recovery (does auto-healing actually heal?)
Ignoring blast radius: One experiment shouldn’t affect all customers

Frequently Asked Questions

Q: Is chaos engineering just breaking things randomly? A: No. It’s hypothesis-driven experimentation with measured outcomes and automatic safety guards.

Q: How do I convince leadership to allow production chaos? A: Start with staging, show findings, quantify prevented outages. Frame it as proactive insurance.

Q: What’s the difference between chaos engineering and load testing? A: Load testing checks behavior under high traffic. Chaos engineering checks behavior under failures.

Overview

When to Use

Solution

Kubernetes Pod Chaos (Litmus)

Network Latency Injection (tc + Bash)

AWS Fault Injection Simulator (Python)

Explanation

Variants

Best Practices

Common Mistakes

Frequently Asked Questions

CI/CD Pipeline Guide

API Status Page Template

Bug Report Template

Capacity Planning Template

Changelog Template

Overview

When to Use

Solution

Kubernetes Pod Chaos (Litmus)

Network Latency Injection (tc + Bash)

AWS Fault Injection Simulator (Python)

Explanation

Variants

Best Practices

Common Mistakes

Frequently Asked Questions

Related Resources

CI/CD Pipeline Guide

API Status Page Template

Bug Report Template

Capacity Planning Template

Changelog Template