Day 16: Failure Injection

Scenario	Injection method	Expected behavior
Worker crash mid-task	Kill container	SQS visibility timeout expires; task re-delivered; re-claimed
DynamoDB throttle	AWS Fault Injection Service	Retry with backoff; task delayed but not lost
SQS receive fails	Drop messages at proxy	Tasks stuck PENDING; queue depth metric alert fires
API service crash	Kill 1 of 2 ECS tasks	ALB routes to the healthy task; in-flight gRPC calls fail with UNAVAILABLE
AZ failure	Block all traffic to AZ-A subnets	ALB routes to AZ-B; ECS launches replacements in AZ-B
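
The "retry with backoff" behavior expected in the DynamoDB throttle row could look roughly like the sketch below. It is a minimal illustration, not the project's actual worker code: ThrottledError and retry_with_backoff are hypothetical names, and the delay values are placeholders. With boto3, throttling would more likely surface as a botocore ClientError (for example with code ProvisionedThroughputExceededException), and the SDK already performs some retries of its own.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a DynamoDB throttling error."""


def retry_with_backoff(op, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep, rng=random.random):
    """Call op() until it succeeds or max_attempts throttled calls occur."""
    for attempt in range(max_attempts):
        try:
            return op()
        except ThrottledError:
            if attempt == max_attempts - 1:
                # Give up. If the SQS message is not deleted, it becomes
                # visible again later, so the task is delayed but not lost.
                raise
            # Full jitter: sleep a random fraction of the capped
            # exponential delay (base_delay * 2^attempt).
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(cap * rng())
```

The sleep and rng parameters exist only so a test can replace the real clock and random source; a worker would call retry_with_backoff(lambda: write_task_status(...)) where write_task_status is whatever DynamoDB call it makes.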

Worker Crash Recovery Test

Timeline:
 0s  Worker claims task t-123; visibility timeout = 30s
 5s  Worker container killed (simulate crash)
30s  SQS visibility timeout expires
31s  SQS re-delivers message
32s  A different worker attempts to claim t-123 (conditional update requires status=PENDING)

Problem: after the first claim, the task's status is PROCESSING, not PENDING, so the second worker's conditional update at 32s fails and t-123 stays stuck in PROCESSING.
Fix: either reset status to PENDING when the visibility timeout expires, or use the SQS receive count to detect re-delivery and allow a re-claim from the PROCESSING state.
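
The failing claim and the receive-count fix can be modeled in memory as follows. This is a sketch under assumptions: Task and claim are hypothetical names, the in-memory object stands in for the DynamoDB item, and receive_count is assumed to come from the SQS message's ApproximateReceiveCount attribute. In a real table the check would be a conditional update rather than a Python if.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    task_id: str
    status: str = "PENDING"
    owner: Optional[str] = None


def claim(task: Task, worker: str, receive_count: int,
          allow_redelivery: bool) -> bool:
    """Try to claim a task; return True if this worker now owns it."""
    # Original condition: only PENDING tasks can be claimed.
    can_claim = task.status == "PENDING"
    # Fix: a message received more than once was re-delivered, so the
    # previous claimant is presumed dead and PROCESSING may be re-claimed.
    if allow_redelivery and task.status == "PROCESSING" and receive_count > 1:
        can_claim = True
    if not can_claim:
        return False
    task.status = "PROCESSING"
    task.owner = worker
    return True


task = Task("t-123")
claim(task, "worker-a", receive_count=1, allow_redelivery=False)   # 0s: claimed
stuck = claim(task, "worker-b", receive_count=2, allow_redelivery=False)
fixed = claim(task, "worker-b", receive_count=2, allow_redelivery=True)
print(stuck, fixed, task.owner)
```

One caveat with this approach: a receive count above 1 also occurs when the first worker is merely slow rather than crashed, so the visibility timeout should comfortably exceed the expected task duration, or the worker should extend it while running.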

Phase 7 · Oct 8, 2026

Agenda (2–3 hours)

Failure Scenarios

Fault Injection Code

Worker Crash Recovery Test

Key Takeaways