← Week 1: Circuit Breakers & Bulkheads

Day 1: Failure Modes — Cascading Failures and Retry Storms

Phase 4 · Jul 22, 2026

Agenda (2–3 hours)

  • Read (45 min): Nygard "Release It!" Chapter 3 (stability patterns overview); AWS "Avoiding fallback in distributed systems" blog post
  • Study (45 min): Map each failure mode to a pattern that addresses it; which failure mode leads to which anti-pattern?
  • Practice (45 min): Draw a dependency graph for a realistic service; identify the highest-risk failure paths
  • Challenge (30 min): Given a service with 99.9% uptime and 10 dependencies each with 99.9% uptime, what is the expected uptime of the service if any dependency failure causes a service failure?

The Danger of Dependencies

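Availability compounds multiplicatively when every dependency is required and failures are independent:

Service uptime = product of all dependency uptimes

With 10 dependencies at 99.9% each:
= 0.999^10 ≈ 0.990 = 99.0%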

With 50 dependencies:
= 0.999^50 ≈ 95.1%

Even "three nines" (99.9%) dependencies leave your service well short of three nines once you have many of them.
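
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch under the same assumptions (independent failures, every dependency required); the function names are illustrative and not taken from the course material.

```python
def composite_availability(dependency_availabilities):
    """Availability of a service that fails whenever any one dependency fails.

    Assumes failures are independent and every dependency is required.
    """
    result = 1.0
    for availability in dependency_availabilities:
        result *= availability
    return result


def downtime_hours_per_year(availability, hours_per_year=365 * 24):
    """Expected unavailable hours per year at the given availability."""
    return (1.0 - availability) * hours_per_year


ten = composite_availability([0.999] * 10)
fifty = composite_availability([0.999] * 50)
print(f"10 deps: {ten:.4f}, about {downtime_hours_per_year(ten):.0f} h/year down")
print(f"50 deps: {fifty:.4f}, about {downtime_hours_per_year(fifty):.0f} h/year down")
```

Converting the percentage into hours of downtime per year tends to make the cost of each added dependency easier to judge.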

Cascading Failure Anatomy

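A slow dependency is often worse than a down dependency, because a down dependency fails fast while a slow one holds the caller's resources until the timeout: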

  1. Service B slows down (high latency, not errors)
  2. Service A's requests to B queue up waiting for responses
  3. A's thread pool / connection pool fills up
  4. A starts returning errors or slowing down
  5. Service A's callers (C, D) queue up
  6. Cascading failure spreads through the system

Key insight: thread pools and connection pools are finite. A slow dependency can exhaust them even at low request rates.
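
One way to see why low request rates are enough is Little's Law: the average number of requests in flight equals arrival rate times latency. The sketch below uses made-up numbers (a pool of 50, 20 requests per second, latency rising from 50 ms to 5 s); all of them are assumptions for illustration only.

```python
def in_flight_requests(requests_per_second, latency_seconds):
    """Average number of concurrent requests (Little's Law: L = lambda * W)."""
    return requests_per_second * latency_seconds


POOL_SIZE = 50   # hypothetical thread/connection pool size in service A
RATE = 20        # hypothetical requests per second from A to B

healthy = in_flight_requests(RATE, 0.05)   # B answers in 50 ms
degraded = in_flight_requests(RATE, 5.0)   # B answers in 5 s

print(f"healthy:  {healthy:.0f} in flight, pool of {POOL_SIZE}")
print(f"degraded: {degraded:.0f} in flight, pool of {POOL_SIZE}")
```

The request rate is the same in both cases; only the latency changed, and that alone pushes demand past the pool size. A timeout on calls to B caps the latency term and therefore the number of slots B can occupy.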

Retry Storms

Without coordination, retries amplify load on a degraded service:

  1. Service B hits capacity limit
  2. All callers start retrying immediately
  3. Retries arrive at B simultaneously → 3–10× amplification
  4. B, already struggling, gets crushed by retry load
  5. B fully fails; callers retry harder
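
One plausible way to arrive at multipliers in this range: if every layer in a call chain makes up to a fixed number of attempts per incoming call, the attempts multiply across layers. The helper below is a hypothetical worst-case bound, not a measurement.

```python
def worst_case_amplification(attempts_per_layer, layers):
    """Upper bound on calls reaching the bottom service per user request.

    Each layer makes up to `attempts_per_layer` attempts (the first try plus
    retries) for every call it receives, so attempts multiply across layers.
    """
    return attempts_per_layer ** layers


for layers in (1, 2, 3):
    factor = worst_case_amplification(3, layers)
    print(f"{layers} retrying layer(s), 3 attempts each: up to {factor}x load")
```

This is why retrying at a single layer of the stack, rather than at every layer, is a common recommendation.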

Solutions:

  • Exponential backoff with jitter (prevents synchronized retries)
  • Circuit breaker (stops retrying entirely when failure rate is high)
  • Retry budgets (limit what fraction of total requests can be retries)
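
The first and third solutions can be sketched together. The code below is a minimal illustration, not a production client: `backoff_delay`, `RetryBudget`, and `call_with_retries` are hypothetical names, the "full jitter" variant (uniform between zero and the exponential cap) is one of several jitter strategies, and the 10% budget is an arbitrary assumption.

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=10.0, rng=random):
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


class RetryBudget:
    """Allow retries only while they stay under `ratio` of requests seen."""

    def __init__(self, ratio=0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_spend(self):
        """Reserve one retry if the budget allows it; return False otherwise."""
        if self.retries + 1 > self.ratio * self.requests:
            return False
        self.retries += 1
        return True


def call_with_retries(operation, budget, max_attempts=4, sleep=time.sleep):
    """Call `operation`, retrying on any exception with jittered backoff."""
    budget.record_request()
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            last_attempt = attempt == max_attempts - 1
            # Give up when attempts are exhausted or the budget is spent.
            if last_attempt or not budget.try_spend():
                raise
            sleep(backoff_delay(attempt))
```

The jitter spreads retries out in time so callers do not hit B in lockstep, while the budget bounds the total extra load regardless of how many individual calls fail. A real budget would usually be computed over a sliding time window rather than over lifetime counters as it is here.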

Thundering Herd

Variant of retry storm: triggered by a cache expiry or service restart.

Cache expires → all requests miss cache → all hit DB simultaneously
DB overloaded → slow responses → all callers time out and retry

Solutions:

  • Jittered TTL expiry (don't expire all keys at the same time)
  • Cache stampede prevention (probabilistic early re-fetch)
  • Lock-based refresh (one goroutine refreshes; others wait)
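
The three solutions above can be sketched as follows. The note speaks of goroutines; the same ideas are shown here with Python threads. All names (`jittered_ttl`, `should_refresh_early`, `SingleFlightCache`) and parameter values are assumptions for illustration, and the early-refresh rule follows the commonly described "probabilistic early expiration" formula rather than any specific library.

```python
import math
import random
import threading
import time


def jittered_ttl(base_ttl, spread=0.1, rng=random):
    """Return base_ttl scaled by a random factor in [1 - spread, 1 + spread]."""
    return base_ttl * rng.uniform(1.0 - spread, 1.0 + spread)


def should_refresh_early(now, expiry, recompute_seconds, beta=1.0, rng=random):
    """Probabilistic early refresh: more likely as `expiry` approaches."""
    # 1 - random() lies in (0, 1], so the logarithm is always defined.
    early_by = -recompute_seconds * beta * math.log(1.0 - rng.random())
    return now + early_by >= expiry


class SingleFlightCache:
    """On a miss, one thread per key loads the value; the others wait for it."""

    def __init__(self, loader, base_ttl=60.0):
        self._loader = loader
        self._base_ttl = base_ttl
        self._entries = {}        # key -> (value, expiry timestamp)
        self._key_locks = {}      # key -> lock guarding that key's refresh
        self._guard = threading.Lock()

    def _fresh(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return entry
        return None

    def get(self, key):
        entry = self._fresh(key)
        if entry is not None:
            return entry[0]
        with self._guard:
            lock = self._key_locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have refreshed while we waited.
            entry = self._fresh(key)
            if entry is not None:
                return entry[0]
            value = self._loader(key)
            expiry = time.monotonic() + jittered_ttl(self._base_ttl)
            self._entries[key] = (value, expiry)
            return value
```

With this shape, a burst of concurrent misses on the same key results in a single call to the loader, and the jittered expiry keeps keys written at the same moment from all expiring together. The per-key lock table is never pruned in this sketch, which a real cache would need to handle.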

Key Takeaways

  • Availability is multiplicative: 10 dependencies at 99.9% each give 0.999^10 ≈ 99.0% overall
  • Slow dependencies are more dangerous than down ones: they hold threads and connections until resource pools are exhausted
  • Retry storms amplify load on degraded services; jitter + circuit breakers prevent them
  • Thundering herd is a specific retry storm triggered by synchronized state changes

Tomorrow: circuit breaker pattern — the primary defense against cascading failures.