← Week 1: Design & Architecture

Day 5: Observability Plan

Phase 7 · Sep 27, 2026

Agenda (2–3 hours)

  • Design (60 min): Define the full observability plan — spans, metrics, logs, and SLOs for the project
  • Review (60 min): Walk through each failure mode from Day 2; verify each is detectable from the planned signals
  • Document (60 min): Write the runbook for the top 3 alert scenarios

Traces

| Service | Spans to emit |
| --- | --- |
| API Service | grpc.server.* (per RPC); dynamodb.PutItem (event); dynamodb.UpdateItem (idempotency); sqs.SendMessage |
| Worker | sqs.ReceiveMessage; dynamodb.UpdateItem (claim); task execution span; dynamodb.PutItem (event) |
| DLQ Processor | sqs.ReceiveMessage; dynamodb.UpdateItem (status→DEAD) |

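Trace context is propagated from the API Service to the Worker and DLQ Processor via an SQS message attribute named traceparent (the W3C Trace Context header format).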
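To make the propagation step concrete, here is a minimal stdlib-only sketch of writing and reading a traceparent value in the SQS message-attribute shape. The helper names (new_traceparent, to_message_attributes, extract_trace_context) are hypothetical; a real service would more likely rely on its tracing SDK's propagator than hand-roll this.

```python
import re
import secrets

# W3C Trace Context layout: version "00", 16-byte trace id,
# 8-byte parent span id, 1-byte flags, all lowercase hex.
_TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def new_traceparent(sampled: bool = True) -> str:
    """Build a traceparent value for a fresh trace (hypothetical helper)."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def to_message_attributes(traceparent: str) -> dict:
    """Producer side: shape the value as an SQS MessageAttributes entry."""
    return {"traceparent": {"DataType": "String", "StringValue": traceparent}}


def extract_trace_context(message: dict):
    """Consumer side: return (trace_id, parent_span_id, sampled) or None."""
    attr = message.get("MessageAttributes", {}).get("traceparent")
    if not attr:
        return None  # no context: the worker would have to start a new trace
    match = _TRACEPARENT_RE.fullmatch(attr.get("StringValue", ""))
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None  # all-zero ids are invalid per the spec
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)
```

Note that SQS only returns message attributes the consumer asks for, so the worker's ReceiveMessage call has to request traceparent (or all attributes) through MessageAttributeNames; otherwise the worker-side spans cannot be attached to the API request's trace.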

Metrics

| Metric | Type | Labels |
| --- | --- | --- |
| tasks_submitted_total | Counter | task_type, priority |
| task_processing_duration_seconds | Histogram | task_type, status |
| task_retries_total | Counter | task_type |
| tasks_dead_total | Counter | task_type |
| sqs_queue_depth | Gauge | queue |
| grpc_request_duration_seconds | Histogram | method, status |
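The two histogram metrics matter because latency percentiles are not stored directly; they are estimated from cumulative bucket counts. The sketch below shows one simplified way to do that estimate, using linear interpolation inside the bucket in the style of Prometheus's histogram_quantile. The bucket bounds and counts in the usage note are made up for illustration.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with (math.inf, total).
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Rank falls in the overflow bucket: the best we can
                # report is the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly within the bucket.
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = upper, count
    return prev_bound


# Illustrative data only: 1000 requests, cumulative counts per bound (seconds).
example_buckets = [(0.025, 600), (0.05, 900), (0.1, 985), (0.25, 998), (math.inf, 1000)]
p99 = histogram_quantile(0.99, example_buckets)
```

With this sample data the P99 lands in the 0.1 to 0.25 s bucket (roughly 0.158 s), so a 100 ms latency target would be breached. The estimate is only as precise as the bucket layout, which is an argument for placing a bucket boundary exactly at each latency target.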

SLOs and Alerts

| SLO | Target | Alert |
| --- | --- | --- |
| API availability | 99.9% | HighErrorRate: gRPC error rate > 1% for 5m |
| Submit latency | P99 < 100ms | HighSubmitLatency: P99 > 100ms for 10m |
| Queue drain | Processing starts < 5s after enqueue | QueueDepthHigh: depth > 1000 for 5m |
| DLQ rate | < 0.1% of submitted tasks | HighDLQRate: dead tasks > 0.1% of submitted over 1h |
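The relationship between a target and its alert threshold is easier to judge with the error-budget arithmetic written out. The sketch below assumes a 30-day SLO window, which the plan does not specify, and the function names are hypothetical.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of total outage the SLO tolerates over the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    return observed_error_ratio / (1.0 - slo_target)


def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Days until the budget is gone if the burn rate holds steady."""
    return window_days / rate


def dlq_rate_breached(dead: int, submitted: int) -> bool:
    """True when dead tasks exceed 0.1% of submitted tasks."""
    # Integer comparison avoids floating-point edge cases at the threshold.
    return submitted > 0 and dead * 1000 > submitted


budget = error_budget_minutes(0.999)  # about 43.2 minutes per 30 days
rate = burn_rate(0.01, 0.999)         # about 10x
runway = days_to_exhaustion(rate)     # about 3 days
```

Under that assumed window, a sustained 1% error rate consumes the 99.9% availability budget about ten times faster than allowed, so HighErrorRate fires while roughly three days of budget remain at that pace.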

Runbook for QueueDepthHigh:

  1. Check worker ECS service desired count vs running count
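  2. Check task_processing_duration_seconds: are tasks slow?
  3. Check DynamoDB ConsumedWriteCapacityUnits against provisioned capacity, along with WriteThrottleEvents: is the table throttling?
  4. Scale the worker service (substitute your own cluster and service names): aws ecs update-service --cluster <cluster> --service <worker-service> --desired-count 8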
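The runbook steps can also be written down as a small decision function, which makes the order of checks explicit and easy to test. This is one possible reading of the runbook, not a prescribed implementation: the QueueSnapshot fields, the finding labels, and the 2x-baseline threshold for "slow" are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class QueueSnapshot:
    """Values an on-call engineer would read off dashboards (hypothetical shape)."""
    desired_workers: int
    running_workers: int
    p99_processing_seconds: float
    baseline_p99_seconds: float
    write_throttle_events: int


def triage_queue_depth(s: QueueSnapshot, slow_factor: float = 2.0) -> list[str]:
    """Return findings in runbook order; suggest scaling if nothing else explains it."""
    findings = []
    # Step 1: are fewer workers running than the service asked for?
    if s.running_workers < s.desired_workers:
        findings.append("workers-below-desired")
    # Step 2: are tasks taking much longer than usual? (threshold is assumed)
    if s.p99_processing_seconds > slow_factor * s.baseline_p99_seconds:
        findings.append("tasks-slow")
    # Step 3: is DynamoDB rejecting writes?
    if s.write_throttle_events > 0:
        findings.append("dynamodb-throttled")
    # Step 4: healthy workers at normal speed just need more capacity.
    if not findings:
        findings.append("scale-out")
    return findings
```

Returning every finding rather than stopping at the first one reflects that the causes can overlap; for example, DynamoDB throttling usually shows up as slow tasks too, and adding workers in that case may make the throttling worse.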

Key Takeaways

  • Every failure mode from the architecture review should map to at least one alert
  • Runbooks written at design time (not post-incident) lead to faster incident resolution
  • SLOs frame the observability plan: what to alert on is derived from what the SLO protects
  • Trace context in SQS message attributes is the critical link: without it, worker-side spans cannot be joined to the trace of the API request that submitted the task

Tomorrow: deployment architecture — ECS services, VPC, ALB, and CI/CD pipeline.