← Week 1: Design & Architecture

Day 2: System Design

Phase 7 · Sep 24, 2026

Agenda (2–3 hours)

  • Design (90 min): Full system design — services, data flows, failure modes, and how each design choice connects to Phase 1–6 concepts
  • Review (60 min): Stress-test the design against the non-functional requirements; identify bottlenecks
  • Document (30 min): Write the architecture decision records (ADRs) for the 3 most significant choices

System Architecture

Client
  ↓ gRPC (tonic, mTLS)
API Service (ECS Fargate, 2 tasks)
  ├── DynamoDB: write task event (EventType=SUBMITTED)
  └── SQS FIFO: enqueue task message (dedup on idempotency key)

SQS FIFO Queue
  ↓ Lambda ESM or Worker poll
Worker Pool (ECS Fargate, 4 tasks)
  ├── Claim task: DynamoDB conditional update (status PENDING→PROCESSING)
  ├── Execute task payload
  ├── DynamoDB: write event (COMPLETED or FAILED)
  └── SQS: delete message (explicit ack)

DLQ (SQS)
  └── DLQ Processor (Lambda): move to DynamoDB DEAD status, alert
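
The claim step is a compare-and-set: the status moves from PENDING to PROCESSING only if it is still PENDING. Below is a minimal in-memory sketch of that behaviour. TaskTable, try_claim, and ConditionalCheckFailed are hypothetical names standing in for a DynamoDB conditional update; nothing here calls AWS.

```python
class ConditionalCheckFailed(Exception):
    """Raised when the stored status is not the expected one."""


class TaskTable:
    """In-memory stand-in for the task status item (assumption, not a real client)."""

    def __init__(self):
        self._status = {}

    def put_pending(self, task_id):
        self._status[task_id] = "PENDING"

    def transition(self, task_id, expected, new):
        # Compare-and-set: the write succeeds only if the current status matches.
        if self._status.get(task_id) != expected:
            raise ConditionalCheckFailed(task_id)
        self._status[task_id] = new

    def status(self, task_id):
        return self._status[task_id]


def try_claim(table, task_id):
    """Return True if this worker won the claim, False if the task is already claimed."""
    try:
        table.transition(task_id, "PENDING", "PROCESSING")
        return True
    except ConditionalCheckFailed:
        return False


table = TaskTable()
table.put_pending("task-1")
first = try_claim(table, "task-1")   # first delivery wins the claim
second = try_claim(table, "task-1")  # a duplicate delivery is rejected
```

Against a real table the same guard would presumably be a ConditionExpression on the update, with the losing worker catching the conditional-check failure and skipping the task.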

Key Architecture Decisions

ADR-1: SQS FIFO for task ordering

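  • Why: SQS FIFO guarantees ordering within a message group and deduplicates sends within a 5-minute window; a standard queue offers only best-effort ordering and could reorder high-priority tasks
  • Trade-off: limited throughput. Without high-throughput mode, a FIFO queue handles 300 API calls/s per action (3,000 msg/s with batches of 10), and messages within one group are delivered strictly in order, so parallelism is bounded by the number of message groups; mitigate by sharding message groups on task_type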
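
As a rough illustration of ADR-1, the sketch below builds the parameters for a FIFO send, using task_type as the message group and the idempotency key as the deduplication ID. The parameter names are the ones the SQS SendMessage API uses; build_send_params, the task fields, and the queue URL are assumptions made for this example.

```python
import json


def build_send_params(queue_url, task):
    """Build SendMessage parameters for a FIFO queue (sketch, not the course's code)."""
    return {
        "QueueUrl": queue_url,
        "MessageBody": json.dumps(task, sort_keys=True),
        # Ordering is guaranteed only within a group, so each task_type is its own shard.
        "MessageGroupId": task["task_type"],
        # A repeated send with the same ID inside the deduplication window is dropped.
        "MessageDeduplicationId": task["idempotency_key"],
    }


# Hypothetical queue URL and task payload.
params = build_send_params(
    "https://sqs.us-east-1.amazonaws.com/123456789012/tasks.fifo",
    {"task_type": "resize_image", "idempotency_key": "req-42", "payload": {"w": 128}},
)
```

With boto3, a dictionary like this could be passed as sqs_client.send_message(**params). Grouping by task_type alone caps parallelism at one in-order stream per type; a finer shard key trades ordering scope for throughput.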

ADR-2: DynamoDB event store for task state

  • Why: event sourcing provides durable audit trail and supports replay/projection
  • Trade-off: current status requires reading all events (or maintaining a projection)
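
To make the ADR-2 trade-off concrete, here is a small sketch of a projection that folds an event list into the current status per task. TaskEvent, its sequence field, and project_status are hypothetical; the event types are the ones named in the architecture above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskEvent:
    task_id: str
    sequence: int     # assumed per-task, monotonically increasing
    event_type: str   # e.g. SUBMITTED, COMPLETED, FAILED


def project_status(events):
    """Replay events in order; the latest event type per task is its current status."""
    status = {}
    for event in sorted(events, key=lambda e: (e.task_id, e.sequence)):
        status[event.task_id] = event.event_type
    return status


events = [
    TaskEvent("task-1", 2, "COMPLETED"),
    TaskEvent("task-1", 1, "SUBMITTED"),
    TaskEvent("task-2", 1, "SUBMITTED"),
]
current = project_status(events)
```

Replaying every event on each read is the cost noted in the trade-off; a maintained projection would apply the same fold incrementally as events are written.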

ADR-3: SQS visibility timeout as distributed lock

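  • Why: simplest way to give one worker exclusive access to a message for a bounded window; no separate lock table needed
  • Trade-off: delivery is still at-least-once, so the worker must heartbeat (extend the timeout) for tasks that run longer than the timeout (> 30s); a crash leaves the task invisible until the timeout expires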
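
The lease behaviour in ADR-3 can be modelled with a fake clock. In this sketch, LeasedMessage is a hypothetical in-memory model, times are plain seconds, and heartbeat mimics extending the visibility timeout, where the new timeout counts from the moment of the call.

```python
class LeasedMessage:
    """In-memory model of one in-flight message (assumption, not an SQS client)."""

    def __init__(self, body, now, visibility_timeout):
        self.body = body
        self.invisible_until = now + visibility_timeout

    def is_visible(self, now):
        # Once the lease lapses, the queue may hand the message to another worker.
        return now >= self.invisible_until

    def heartbeat(self, now, visibility_timeout):
        # Simplification: an expired lease cannot be extended in this model.
        if self.is_visible(now):
            raise RuntimeError("lease already expired")
        # The new timeout is measured from the time of the call, not the old deadline.
        self.invisible_until = now + visibility_timeout


msg = LeasedMessage("task-1", now=0, visibility_timeout=30)
msg.heartbeat(now=20, visibility_timeout=30)  # lease now runs until t=50
hidden_at_40 = not msg.is_visible(40)         # still leased thanks to the heartbeat
redelivered_at_50 = msg.is_visible(50)        # worker crashed, no further heartbeats
```

A real worker would run the heartbeat on a timer at some fraction of the timeout, so that one missed beat does not immediately lose the lease.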

Failure Mode Analysis

Failure               | Detection                                       | Recovery
Worker crash mid-task | Visibility timeout expires                      | SQS re-delivers; worker re-claims with idempotency check
DynamoDB throttle     | Retry with backoff; SQS message stays in flight | Provisioned capacity + auto-scaling
API service down      | ALB health check; ECS replaces task             | In-flight gRPC fails; client retries with idempotency key
SQS queue full        | Queue depth metric alert                        | DLQ absorbs; scale workers
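
For the DynamoDB throttle row, "retry with backoff" could look like the sketch below: exponential backoff with full jitter and a capped delay. Throttled, backoff_ceiling, and call_with_backoff are hypothetical names, and the base and cap values are illustrative assumptions rather than tuned settings.

```python
import random
import time


class Throttled(Exception):
    """Hypothetical stand-in for a DynamoDB throttling error."""


def backoff_ceiling(attempt, base=0.1, cap=5.0):
    """Upper bound of the delay for a given attempt: exponential growth, capped."""
    return min(cap, base * (2 ** attempt))


def call_with_backoff(operation, max_attempts=5, sleep=time.sleep, rng=random.random):
    """Retry operation on Throttled, sleeping a random fraction of the ceiling."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Throttled:
            if attempt == max_attempts - 1:
                raise  # give up; the caller decides what happens to the message
            sleep(rng() * backoff_ceiling(attempt))


calls = {"n": 0}


def flaky_write():
    # Fails twice with Throttled, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise Throttled()
    return "ok"


slept = []  # record delays instead of actually sleeping
result = call_with_backoff(flaky_write, sleep=slept.append)
```

If the retries are exhausted while the SQS message is still in flight, the message becomes visible again after its timeout and another delivery attempt follows, which is why the claim step needs the idempotency check.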

Key Takeaways

  • The visibility timeout acts as a lease — the worker must heartbeat or the task is re-queued
  • Event sourcing decouples writes (events) from reads (projections), but adds read complexity
  • Each architectural choice maps to a pattern from Phases 1–6; reference the relevant day file

Tomorrow: API design — protobuf definitions for the gRPC service.