← Week 1: Circuit Breakers & Bulkheads

Day 6: Chaos Engineering

Phase 4 · Jul 27, 2026

Agenda (2–3 hours)

  • Read (45 min): "Principles of Chaos Engineering" (principlesofchaos.org), which grew out of Netflix's practice; "Chaos Engineering" O'Reilly book Chapter 1; AWS Fault Injection Service (FIS, formerly Fault Injection Simulator) documentation
  • Study (45 min): Design a chaos experiment for the echo server from Phase 2 Day 7; define steady-state, hypothesis, and blast radius
  • Practice (45 min): Use AWS FIS (or tc netem locally) to inject 100ms latency and 5% packet loss; observe how the service responds
  • Challenge (30 min): What is the difference between chaos engineering and load testing? When would you use each?

Chaos Engineering Principles

From principlesofchaos.org:

  1. Define steady state: measurable normal behavior (error rate, latency P99)
  2. Hypothesize: "the system will maintain steady state when X fails"
  3. Introduce variables: latency, CPU pressure, network partition, pod kill
  4. Try to disprove the hypothesis: compare the metrics during the fault against steady state; if the experiment reveals a gap, fix it before a real failure exposes it

Key principle: run experiments in production, starting with a small blast radius. Staging rarely reproduces the traffic patterns that expose real failure modes.
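
Steady state is only useful if it can be computed from real request data. A minimal sketch in POSIX shell, assuming a hypothetical log file with one request per line in the form "<http_status> <latency_ms>"; the function names and the log format are illustrative and not taken from any particular tool:

```sh
#!/bin/sh
# Assumed input: one request per line, "<http_status> <latency_ms>".

# error_rate FILE -> percentage of requests that returned a 5xx status
error_rate() {
    awk '$1 >= 500 { errors++ }
         END {
             if (NR == 0) { print "0.00" }
             else { printf "%.2f\n", 100 * errors / NR }
         }' "$1"
}

# p99 FILE -> 99th-percentile latency in ms (nearest-rank method)
p99() {
    sort -n -k2,2 "$1" |
        awk '{ lat[NR] = $2 }
             END {
                 if (NR == 0) exit 1
                 rank = int((NR * 99 + 99) / 100)   # ceil(0.99 * NR)
                 print lat[rank]
             }'
}
```

Run both functions over a window of normal traffic to record the baseline, then over the window in which the fault was active, and compare the two.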

Experiment Design

Steady state:
  - Error rate < 0.1%
  - P99 latency < 200ms

Hypothesis:
  - If we kill one of three backend pods, error rate stays < 0.5% and P99 stays < 500ms (a degraded but acceptable band around the steady state above)

Blast radius:
  - One pod out of three (33% of capacity)
  - Run during business hours with an on-call engineer available

Method:
  - kubectl delete pod backend-2
  - Observe metrics for 5 minutes
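  - Recover: no manual restore is needed because the ReplicaSet creates a replacement pod; confirm with kubectl get pods that it is Ready before closing the experiment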
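
The hypothesis above can be turned into a pass/fail check so the verdict does not depend on someone eyeballing a dashboard. A minimal sketch in POSIX shell, using the 0.5% and 500ms limits from the hypothesis; the function names are hypothetical, and the two input numbers are assumed to come from whatever monitoring system records the metrics during the 5-minute observation window:

```sh
#!/bin/sh

# below VALUE LIMIT -> succeeds when VALUE < LIMIT (awk handles decimals)
below() {
    awk -v v="$1" -v limit="$2" 'BEGIN { exit !(v < limit) }'
}

# check_hypothesis ERROR_RATE_PCT P99_MS
# Limits come from the hypothesis: error rate < 0.5%, P99 < 500ms.
check_hypothesis() {
    if below "$1" 0.5 && below "$2" 500; then
        echo "hypothesis held"
    else
        echo "hypothesis disproved"
        return 1
    fi
}
```

For example, check_hypothesis 0.3 420 succeeds, while check_hypothesis 0.7 420 fails and returns a non-zero status that a wrapper script can use to page someone or open a ticket.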

Network Fault Injection with tc netem

# Add 100ms latency + 10ms jitter to all traffic on eth0
tc qdisc add dev eth0 root netem delay 100ms 10ms

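# Keep the latency and add 5% packet loss
# ("change" overwrites the netem options, so the delay must be repeated)
tc qdisc change dev eth0 root netem delay 100ms 10ms loss 5%

# Keep both and add 1% packet corruption
tc qdisc change dev eth0 root netem delay 100ms 10ms loss 5% corrupt 1%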

# Remove all faults
tc qdisc del dev eth0 root

AWS FIS (Fault Injection Service, formerly Fault Injection Simulator) provides a managed alternative: inject faults via an API call, define stop conditions that halt the experiment, and monitor the run through CloudWatch.
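
When injecting faults by hand with tc, the main risk is forgetting to remove them. A minimal sketch of a wrapper that applies a netem fault, runs one command, and removes the fault afterwards. The with_netem function and the TC and DEV variables are hypothetical names introduced here; setting TC=echo gives a dry run that prints the tc arguments instead of executing them. The real commands need root, and this sketch does not clean up if the script is killed mid-run:

```sh
#!/bin/sh
TC="${TC:-tc}"        # set TC=echo for a dry run without root
DEV="${DEV:-eth0}"    # interface to degrade

# with_netem "<netem options>" COMMAND [ARGS...]
with_netem() {
    opts="$1"
    shift
    # $opts is left unquoted on purpose so it splits into separate arguments
    $TC qdisc add dev "$DEV" root netem $opts || return 1
    "$@"                          # run the probe while the fault is active
    status=$?
    $TC qdisc del dev "$DEV" root # always remove the fault afterwards
    return $status
}
```

A typical call would be with_netem "delay 100ms 10ms loss 5%" followed by whatever client exercises the service, for example a curl against the echo server's address.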

AWS Fault Injection Service (FIS)

{
  "description": "Kill one ECS task",
  "targets": {
    "tasks": { "resourceType": "aws:ecs:task",
               "selectionMode": "PERCENT(33)" }
  },
  "actions": {
    "kill-tasks": {
      "actionId": "aws:ecs:stop-task",
      "targets": { "Tasks": "tasks" }
    }
  },
  "stopConditions": [
    { "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:...:alarm:high-error-rate" }
  ]
}

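stopConditions: automatically halt the experiment if a CloudWatch alarm fires. This is your safety net. The template above is abridged: a complete FIS template also needs a roleArn for the IAM role that FIS assumes, and the target needs resource tags, ARNs, or parameters to identify which tasks are eligible. Use "selectionMode": "COUNT(1)" if exactly one task should be stopped regardless of how many match.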
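
Once a template exists, an experiment can be started and watched from the AWS CLI. A minimal sketch, assuming the AWS CLI is installed with credentials configured and the template has already been created; the function names are hypothetical, while the aws fis subcommands and the experiment.state.status field follow the FIS API:

```sh
#!/bin/sh

# start_experiment TEMPLATE_ID -> prints the new experiment's ID
start_experiment() {
    aws fis start-experiment \
        --experiment-template-id "$1" \
        --query 'experiment.id' --output text
}

# experiment_status EXPERIMENT_ID -> prints the current status string
experiment_status() {
    aws fis get-experiment --id "$1" \
        --query 'experiment.state.status' --output text
}

# wait_for_experiment EXPERIMENT_ID [POLL_SECONDS]
# Succeeds only if the experiment completes; a run halted by a stop
# condition ends as "stopped" and is reported as a failure here.
wait_for_experiment() {
    while :; do
        status=$(experiment_status "$1") || return 2
        case "$status" in
            completed) echo "completed"; return 0 ;;
            pending|initiating|running|stopping) sleep "${2:-10}" ;;
            *) echo "$status"; return 1 ;;
        esac
    done
}
```

A run would look like id=$(start_experiment "$TEMPLATE_ID") followed by wait_for_experiment "$id", where TEMPLATE_ID is the ID returned when the template was created.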

Key Takeaways

  • Chaos engineering is structured hypothesis-driven experimentation, not random destruction
  • Define steady state first; the experiment is only meaningful if you know what "normal" looks like
  • Start small: one instance, one failure, controlled blast radius
  • Automate with stop conditions: if an alarm fires, the experiment halts automatically

Tomorrow: Phase 4 Week 1 Challenge — Rust service with circuit breaker + bulkhead.