← Week 3: Testing & Deployment

Day 16: Failure Injection

Phase 7 · Oct 8, 2026

Agenda (2–3 hours)

  • Design (30 min): List all failure scenarios from the architecture review (Day 2); prioritize by likelihood and impact
  • Implement (60 min): Write fault injection points — DynamoDB latency injection, SQS receive failure, worker crash simulation
  • Test (90 min): Run each failure scenario; verify recovery behavior matches the design; record actual recovery time

Failure Scenarios

Scenario               Injection method                   Expected behavior
Worker crash mid-task  Kill container                     SQS visibility timeout expires; task re-delivered; re-claimed
DynamoDB throttle      AWS Fault Injection Service        Retry with backoff; task delayed but not lost
SQS receive fails      Drop messages at proxy             Tasks stuck PENDING; queue depth metric alert fires
API service crash      Kill 1 of 2 tasks                  ALB routes to healthy task; in-flight gRPC fails with UNAVAILABLE
AZ failure             Block all traffic to AZ-A subnets  ALB routes to AZ-B; ECS launches replacements in AZ-B
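
The "retry with backoff" expectation in the DynamoDB throttle row can be sketched as follows. This is a minimal, std-only illustration: the names backoff_delay and retry_with_backoff are hypothetical, the sleep is blocking rather than async, and no jitter is applied. It is not the project's retry code, and the AWS SDKs ship their own retry behavior that may already cover throttling.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at `max`.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // Saturate instead of overflowing when `attempt` is large.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).unwrap_or(max).min(max)
}

/// Calls `op` until it succeeds or `max_attempts` calls have failed.
/// `op` is always called at least once. Sleeps between failed attempts.
pub fn retry_with_backoff<T, E>(
    max_attempts: u32,
    base: Duration,
    max: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0u32;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(e) if attempt + 1 >= max_attempts => return Err(e),
            Err(_) => {
                std::thread::sleep(backoff_delay(attempt, base, max));
                attempt += 1;
            }
        }
    }
}
```

With a 100 ms base and a 5 s cap, the delays run 100 ms, 200 ms, 400 ms, and so on until they reach the cap. The task is delayed but the call eventually succeeds or reports the final error, which is the behavior the test should verify.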

Fault Injection Code

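// Feature-flag controlled latency injection (env var: FAULT_DYNAMO_LATENCY_MS)
use std::time::Duration;

pub struct FaultInjector {
    /// Extra latency added before each DynamoDB call; None disables it.
    dynamo_latency: Option<Duration>,
    /// Fraction of DynamoDB calls that should fail, in the range 0.0..=1.0.
    dynamo_error_rate: f32,
}

impl FaultInjector {
    /// Sleeps for the configured latency, if any.
    pub async fn maybe_delay(&self) {
        if let Some(d) = self.dynamo_latency {
            tokio::time::sleep(d).await;
        }
    }

    /// Returns true when the caller should simulate a DynamoDB error.
    pub fn maybe_error(&self) -> bool {
        self.dynamo_error_rate > 0.0
            && rand::random::<f32>() < self.dynamo_error_rate
    }
}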

Wrap every DynamoDB call: await injector.maybe_delay() before the call, and return a simulated error instead of making the call whenever injector.maybe_error() returns true.
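
The injector's fields have to be populated from somewhere. One way to do it from environment variables is sketched below, assuming unset, unparsable, or zero values mean "no fault". FAULT_DYNAMO_LATENCY_MS comes from the comment above; the FaultConfig type, its methods, and the FAULT_DYNAMO_ERROR_RATE variable are hypothetical names introduced for illustration.

```rust
use std::time::Duration;

/// Parsed fault settings (hypothetical type, mirrors the injector's two fields).
#[derive(Debug, Clone, PartialEq)]
pub struct FaultConfig {
    pub dynamo_latency: Option<Duration>,
    pub dynamo_error_rate: f32,
}

impl FaultConfig {
    /// Builds a config from raw string values.
    /// Missing, unparsable, or zero latency disables latency injection;
    /// the error rate is clamped to 0.0..=1.0 and defaults to 0.0.
    pub fn from_values(latency_ms: Option<&str>, error_rate: Option<&str>) -> Self {
        let dynamo_latency = latency_ms
            .and_then(|s| s.trim().parse::<u64>().ok())
            .filter(|ms| *ms > 0)
            .map(Duration::from_millis);
        let dynamo_error_rate = error_rate
            .and_then(|s| s.trim().parse::<f32>().ok())
            .filter(|r| r.is_finite()) // reject NaN and infinities
            .map(|r| r.clamp(0.0, 1.0))
            .unwrap_or(0.0);
        Self { dynamo_latency, dynamo_error_rate }
    }

    /// Reads the process environment.
    /// FAULT_DYNAMO_ERROR_RATE is an assumed variable name, not from the source.
    pub fn from_env() -> Self {
        let latency = std::env::var("FAULT_DYNAMO_LATENCY_MS").ok();
        let rate = std::env::var("FAULT_DYNAMO_ERROR_RATE").ok();
        Self::from_values(latency.as_deref(), rate.as_deref())
    }
}
```

Keeping the parsing in from_values, separate from the environment lookup, lets the defaults be unit-tested without mutating process-wide state. Defaulting to "disabled" means a service started without the variables behaves normally.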

Worker Crash Recovery Test

Timeline:
 0s  Worker claims task t-123; visibility timeout = 30s
 5s  Worker container killed (simulate crash)
30s  SQS visibility timeout expires
31s  SQS re-delivers message
32s  A different worker attempts to claim t-123 (conditional update requires status=PENDING)

Problem: after the first claim the task's status is PROCESSING, not PENDING, so the second worker's conditional update fails and the task is never re-claimed.
Fix: either reset status to PENDING when the visibility timeout expires, or use the SQS receive count (ApproximateReceiveCount) to detect re-delivery and allow a re-claim from the PROCESSING state.
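
The receive-count option can be sketched as an in-memory model of the claim check. Everything here is illustrative: Task, Status, ClaimError, and try_claim are hypothetical names. The lease_expires_at field is an added assumption, so that a re-delivered message cannot steal a task from a worker that is still alive. In the real system the same rule would live in the DynamoDB conditional update rather than in Rust code.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Pending,
    Processing,
    Done,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Task {
    pub status: Status,
    pub owner: Option<String>,
    /// Time (in seconds) after which a PROCESSING claim is considered abandoned.
    pub lease_expires_at: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ClaimError {
    AlreadyDone,
    LeaseHeld,
}

/// Claims `task` for `worker` if it is PENDING, or if it is PROCESSING,
/// the message was re-delivered (receive_count > 1), and the lease has expired.
pub fn try_claim(
    task: &mut Task,
    worker: &str,
    receive_count: u32,
    now: u64,
    lease_secs: u64,
) -> Result<(), ClaimError> {
    let claimable = match task.status {
        Status::Pending => true,
        Status::Processing => receive_count > 1 && now >= task.lease_expires_at,
        Status::Done => return Err(ClaimError::AlreadyDone),
    };
    if !claimable {
        return Err(ClaimError::LeaseHeld);
    }
    task.status = Status::Processing;
    task.owner = Some(worker.to_string());
    task.lease_expires_at = now + lease_secs;
    Ok(())
}
```

Replaying the timeline above against this model: the first worker claims at 0s with a 30s lease, a re-delivered attempt at 10s is rejected, and the attempt at 32s with a receive count of 2 succeeds. A finished task is never re-claimed, which keeps re-delivery from repeating completed work.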

Key Takeaways

  • The worker crash scenario reveals a state machine gap — re-delivery from PROCESSING state requires careful handling
  • AWS FIS (Fault Injection Service) can inject real DynamoDB throttles in staging without code changes
  • Every failure scenario should have a documented expected recovery time and verification method
  • Chaos testing pays off most when it exposes recovery behavior that is actually wrong, which is common in first implementations
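
Recording an actual recovery time for each scenario can be as simple as polling a health probe after injecting the fault. The helper below is a rough sketch under that assumption: measure_recovery is a hypothetical name, and the probe closure stands in for whatever check fits the scenario (a status lookup for the task, a health endpoint, a queue depth reading).

```rust
use std::time::{Duration, Instant};

/// Polls `is_healthy` until it returns true and reports how long that took.
/// Returns None if the system has not recovered by `timeout`.
pub fn measure_recovery(
    mut is_healthy: impl FnMut() -> bool,
    poll_interval: Duration,
    timeout: Duration,
) -> Option<Duration> {
    let start = Instant::now();
    loop {
        if is_healthy() {
            return Some(start.elapsed());
        }
        if start.elapsed() >= timeout {
            return None;
        }
        std::thread::sleep(poll_interval);
    }
}
```

The measured value is only as precise as the poll interval, so the interval should be small relative to the expected recovery time. A None result means the scenario exceeded its documented recovery budget and should be treated as a failed test.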

Tomorrow: security hardening — secrets management, network policies, and container security.