← Week 1: Circuit Breakers & Bulkheads

Day 2: Circuit Breaker Pattern

Phase 4 · Jul 23, 2026

Agenda (2–3 hours)

  • Read (45 min): Nygard "Release It!" Chapter 5 (circuit breaker); Michael Nygard's "Stability Patterns" blog series
  • Study (45 min): Draw the state machine with all transitions and triggering conditions; design the metrics needed to drive the state transitions
  • Practice (45 min): Implement a threshold-based circuit breaker; manually test every transition between the three states
  • Challenge (30 min): What is the difference between a "count-based" circuit breaker and a "time-window-based" one? Which is better for bursty traffic?

The Three States

               failure_rate > threshold
  [CLOSED] ───────────────────────────────→ [OPEN]
     ↑                                       │  ↑
     │ probe succeeds          after timeout │  │ probe fails
     │                                       ↓  │
     └────────────────────────────────── [HALF-OPEN]

  • Closed: pass all requests; track failures
  • Open: reject all requests immediately (fast-fail)
  • Half-open: let a limited number of probe requests through (one, in the simplest case) and reject the rest; close on success, reopen on failure
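
The three states and their transitions can be sketched as a small Python class. This is a minimal illustration, not a production implementation: the names `CircuitBreaker`, `CircuitOpenError`, and the injectable `clock` parameter are assumptions made for this example, it allows a single probe in half-open, and it is not thread-safe.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the dependency."""


class CircuitBreaker:
    # All names and defaults here are illustrative assumptions.
    def __init__(self, threshold=0.5, window=20, open_seconds=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.window = window
        self.open_seconds = open_seconds
        self.clock = clock              # injectable so tests can control time
        self.state = State.CLOSED
        self.results = []               # recent outcomes while closed; True = failure
        self.opened_at = 0.0
        self.probe_in_flight = False

    def call(self, fn):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.open_seconds:
                raise CircuitOpenError("open: failing fast")
            self.state = State.HALF_OPEN        # OPEN -> HALF-OPEN after timeout
        if self.state is State.HALF_OPEN:
            if self.probe_in_flight:
                raise CircuitOpenError("half-open: probe already in flight")
            self.probe_in_flight = True
        try:
            value = fn()
        except Exception:
            self._record(failed=True)
            raise
        self._record(failed=False)
        return value

    def _record(self, failed):
        if self.state is State.HALF_OPEN:
            self.probe_in_flight = False
            if failed:
                self._trip()                    # HALF-OPEN -> OPEN
            else:
                self.state = State.CLOSED       # HALF-OPEN -> CLOSED
                self.results = []
            return
        # CLOSED: keep only the last `window` outcomes.
        self.results = (self.results + [failed])[-self.window:]
        full = len(self.results) == self.window
        if full and sum(self.results) / self.window > self.threshold:
            self._trip()                        # CLOSED -> OPEN

    def _trip(self):
        self.state = State.OPEN
        self.opened_at = self.clock()
```

Passing a fake `clock` (for example `lambda: now[0]`) lets each transition be exercised by hand without waiting for the real open timeout.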

Failure Rate Window

Simple count-based (last N requests):

  • Maintain a circular buffer of results
  • Count failures in the last N attempts
  • Trip if failures/N > threshold
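
A minimal sketch of the count-based window, assuming a `deque` with `maxlen` as the circular buffer. The class name `CountWindow` and the choice to wait for a full buffer before tripping are assumptions for this example.

```python
from collections import deque


class CountWindow:
    """Failure rate over the last `size` calls (count-based window)."""

    def __init__(self, size=20, threshold=0.5):
        self.size = size
        self.threshold = threshold
        self.results = deque(maxlen=size)   # True = failure; oldest entry drops automatically
        self.failures = 0                   # running count, so no rescan per request

    def record(self, failed):
        if len(self.results) == self.size and self.results[0]:
            self.failures -= 1              # the entry about to be evicted was a failure
        self.results.append(failed)
        self.failures += int(failed)

    def should_trip(self):
        # Wait for a full buffer so a single early failure cannot trip the breaker.
        return len(self.results) == self.size and self.failures / self.size > self.threshold
```

Keeping a running failure count makes each `record` call O(1); the trade-off is that old failures stay in the buffer until N newer requests arrive, however long that takes.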

Time-window-based (sliding window over T seconds):

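  • Maintain a deque of (timestamp, result) pairs; appending in arrival order keeps it sorted by time
  • Evict entries older than T seconds on each request
  • Trip if failures/total in the window > threshold
  • Better for bursty traffic: a burst of failures 10 minutes ago no longer affects the circuit now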
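
A minimal sketch of the time-window variant. The class name `TimeWindow`, the `min_requests` guard, and the injectable `clock` are assumptions for this example.

```python
import time
from collections import deque


class TimeWindow:
    """Failure rate over the last `seconds` seconds (sliding time window)."""

    def __init__(self, seconds=10.0, threshold=0.5, min_requests=5, clock=time.monotonic):
        self.seconds = seconds
        self.threshold = threshold
        self.min_requests = min_requests
        self.clock = clock
        self.events = deque()               # (timestamp, failed), oldest first

    def _evict(self, now):
        # Entries arrive in time order, so stale ones are always at the left end.
        while self.events and now - self.events[0][0] > self.seconds:
            self.events.popleft()

    def record(self, failed):
        now = self.clock()
        self._evict(now)
        self.events.append((now, failed))

    def should_trip(self):
        self._evict(self.clock())
        total = len(self.events)
        if total < self.min_requests:
            return False                    # too little traffic to judge
        failures = sum(1 for _, failed in self.events if failed)
        return failures / total > self.threshold
```

Because eviction depends on elapsed time rather than on request count, an old burst ages out even when no new traffic arrives.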

Configuration Tradeoffs

Parameter             | Too low                                | Too high
Failure threshold     | Opens too easily on noise              | Slow to protect downstream
Open duration         | Doesn't allow recovery                 | Rejects valid traffic too long
Half-open probe count | Premature close under partial recovery | Slow recovery
Minimum request count | Trips on first failure                 | Ignores low-volume degradation

Rule of thumb:

  • Threshold: 50% failures in last 20 requests
  • Open duration: 30s
  • Half-open probes: 1 request
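
The rule-of-thumb values can be captured in a small validated config object. This is a sketch: the class and field names are assumptions, and `min_requests` defaults to the window size only because the rule of thumb gives no value for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BreakerConfig:
    failure_threshold: float = 0.5   # trip when the failure rate exceeds this
    window_size: int = 20            # "last 20 requests"
    open_seconds: float = 30.0       # how long to stay open before probing
    half_open_probes: int = 1        # probe requests allowed in half-open
    min_requests: int = 20           # assumption: not given by the rule of thumb

    def __post_init__(self):
        if not 0.0 < self.failure_threshold < 1.0:
            raise ValueError("failure_threshold must be strictly between 0 and 1")
        if self.window_size < 1 or self.half_open_probes < 1:
            raise ValueError("window_size and half_open_probes must be at least 1")
        if self.open_seconds <= 0:
            raise ValueError("open_seconds must be positive")
        if not 1 <= self.min_requests <= self.window_size:
            raise ValueError("min_requests must be between 1 and window_size")

    def failures_to_trip(self):
        """Smallest failure count in a full window whose rate strictly exceeds the threshold."""
        n = self.window_size
        return next(k for k in range(n + 1) if k / n > self.failure_threshold)
```

With a strict `>` comparison, the defaults trip on the 11th failure in a window of 20, not the 10th; whether the comparison should be `>` or `>=` is a design choice worth stating explicitly.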

Metrics to Expose

# Counters
circuit_breaker_requests_total{state="closed",result="success"}
circuit_breaker_requests_total{state="closed",result="failure"}
circuit_breaker_requests_total{state="open",result="rejected"}

# Gauge
circuit_breaker_state{name="payments"} 0 # 0=closed, 1=open, 2=half-open

# Histogram
circuit_breaker_call_duration_seconds{state="closed"}

Alert on: circuit_breaker_state == 1 persisting for > 2 minutes.
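
As a rough illustration of how the counters and the state gauge above could be tracked in-process, here is a dependency-free sketch that renders them in Prometheus text format. The `BreakerMetrics` class is hypothetical; a real service would more likely use a Prometheus client library, and the histogram is omitted for brevity.

```python
from collections import Counter

# Gauge encoding taken from the metric comment: 0=closed, 1=open, 2=half-open.
STATE_VALUES = {"closed": 0, "open": 1, "half-open": 2}


class BreakerMetrics:
    def __init__(self, name):
        self.name = name
        self.requests = Counter()       # (state, result) -> count
        self.state = "closed"

    def observe(self, state, result):
        """Count one request by breaker state and outcome."""
        self.requests[(state, result)] += 1

    def set_state(self, state):
        if state not in STATE_VALUES:
            raise ValueError(f"unknown state: {state}")
        self.state = state

    def render(self):
        """Return the current values as Prometheus text-format sample lines."""
        lines = [
            f'circuit_breaker_requests_total{{state="{s}",result="{r}"}} {n}'
            for (s, r), n in sorted(self.requests.items())
        ]
        lines.append(
            f'circuit_breaker_state{{name="{self.name}"}} {STATE_VALUES[self.state]}'
        )
        return "\n".join(lines)
```

Calling `observe` from the breaker's call path and `set_state` from each transition keeps the gauge and the counters consistent with the state machine.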

Key Takeaways

  • A circuit breaker is a state machine: Closed → Open → Half-Open → Closed, or back to Open if the probe fails
  • Time-window-based tracking is more robust than count-based for variable traffic
  • The open timeout is crucial: too short → flapping; too long → unnecessary outage
  • Always expose circuit breaker state as a metric and alert on it

Tomorrow: Implementing circuit breakers in Rust — from scratch and with libraries.