← Week 3: Testing & Deployment

Day 19: Production Readiness Review

Phase 7 · Oct 11, 2026

Agenda (2–3 hours)

  • Review (120 min): Full production readiness checklist — reliability, security, observability, operations
  • Fix (60 min): Address any items that are not ticked

Reliability Checklist

  • [ ] SLOs defined and documented (availability 99.9%, submit P99 < 100ms)
  • [ ] All failure modes from architecture review have alert coverage
  • [ ] Graceful shutdown: workers drain in-flight tasks on SIGTERM
  • [ ] Idempotency: all write operations are safe to retry
  • [ ] DLQ exists for all SQS queues; DLQ processor deployed
  • [ ] Load tested at 500 req/s; SLOs met
  • [ ] Failure injection tested; worker crash recovery verified
  • [ ] DynamoDB: on-demand capacity, or provisioned capacity with auto scaling enabled
  • [ ] ECS service multi-AZ: tasks spread across AZ-A and AZ-B
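
The graceful-shutdown item can be sketched roughly as follows. This is a minimal illustration, not the course's implementation: it assumes a single-threaded worker, uses an in-process queue.Queue as a stand-in for SQS polling, and the names Worker, request_stop, and install_sigterm_handler are hypothetical.

```python
import queue
import signal
import threading


class Worker:
    """Single-threaded worker that finishes its current task before exiting."""

    def __init__(self, tasks, handler):
        self._tasks = tasks            # stand-in for an SQS poller (assumption)
        self._handler = handler        # callable that processes one task
        self._stopping = threading.Event()
        self.completed = []

    def request_stop(self, signum=None, frame=None):
        # Signature matches what signal.signal() passes to a handler.
        self._stopping.set()

    def run(self):
        # The flag is checked only between tasks, so a task that has
        # started always runs to completion (the "drain" step).
        while not self._stopping.is_set():
            try:
                task = self._tasks.get(timeout=0.1)
            except queue.Empty:
                continue
            self._handler(task)
            self.completed.append(task)


def install_sigterm_handler(worker):
    # Must be called from the main thread, before worker.run().
    signal.signal(signal.SIGTERM, worker.request_stop)
```

Tasks that were never picked up stay in the queue for another worker. For this to work on ECS, the longest task has to finish within the container's stop timeout, after which the container is killed.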

Security Checklist

  • [ ] No secrets in task definition environment variables; inject them from Secrets Manager
  • [ ] IAM task roles: least privilege verified with simulate-principal-policy
  • [ ] VPC: API and Worker in private subnets; only the ALB in public subnets
  • [ ] Security groups: tasks only accept traffic from ALB SG (not 0.0.0.0/0)
  • [ ] mTLS between gRPC clients and API service
  • [ ] Container: read-only root filesystem; non-root user; drop ALL capabilities
  • [ ] ECR: image scanning on push; no latest tag in production task definitions
  • [ ] VPC flow logs enabled for post-incident network forensics
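
Several of the container and ECR items above can be checked mechanically against the task definition JSON. The sketch below is one possible lint, not an official tool: lint_task_definition and its helpers are hypothetical names, the secret detection is a crude name-based heuristic, and an unset user field is treated as root even though the image itself may set a non-root user.

```python
SENSITIVE_HINTS = ("PASSWORD", "SECRET", "TOKEN", "API_KEY", "PRIVATE_KEY")


def image_is_pinned(image):
    """True if the image reference uses a digest or a tag other than 'latest'."""
    if "@" in image:                       # pinned by digest
        return True
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:            # no tag means 'latest' implicitly
        return False
    return last_segment.rsplit(":", 1)[1] != "latest"


def runs_as_root(user):
    """True if the 'user' field is unset or names root / UID 0."""
    if not user:
        return True                        # assumption: unset is treated as root
    return user.split(":", 1)[0] in ("0", "root")


def lint_task_definition(task_def):
    """Return a list of human-readable findings; empty means all checks passed."""
    findings = []
    for c in task_def.get("containerDefinitions", []):
        name = c.get("name", "<unnamed>")
        if not image_is_pinned(c.get("image", "")):
            findings.append(f"{name}: image uses 'latest' or no tag")
        if not c.get("readonlyRootFilesystem", False):
            findings.append(f"{name}: root filesystem is writable")
        if runs_as_root(c.get("user")):
            findings.append(f"{name}: user is root or unset")
        dropped = c.get("linuxParameters", {}).get("capabilities", {}).get("drop", [])
        if "ALL" not in dropped:
            findings.append(f"{name}: capabilities.drop does not include ALL")
        for env in c.get("environment", []):
            # Heuristic only: flags plain env vars whose name suggests a secret.
            if any(h in env.get("name", "").upper() for h in SENSITIVE_HINTS):
                findings.append(f"{name}: env var {env['name']} looks like a secret")
    return findings
```

Entries under the container's secrets key are deliberately not flagged, since those are resolved from Secrets Manager at launch. A check like this could run in the PR gate against the rendered task definition.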

Observability Checklist

  • [ ] Traces: all three services emit spans to OTel Collector → Tempo
  • [ ] trace_id present in every log line
  • [ ] Metrics: four golden signals + SQS queue depth + DLQ depth
  • [ ] SLO recording rules deployed; error budget panel in Grafana
  • [ ] Alerts: HighErrorRate, HighLatencyP99, FastBudgetBurn, QueueDepthHigh, ScrapeTargetDown
  • [ ] Runbooks written for top 3 alert scenarios
  • [ ] Alert routing to correct team/channel in Alertmanager
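
To make the error budget panel and the FastBudgetBurn alert concrete, the arithmetic behind them can be sketched as below. The 99.9% target comes from the reliability checklist; the 30-day window and the function names are assumptions for illustration, and the actual recording rules would express the same ratios in PromQL.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of full outage the SLO tolerates over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(error_ratio, slo):
    """How many times faster than 'exactly on budget' errors are accruing."""
    return error_ratio / (1.0 - slo)


def hours_to_exhaustion(error_ratio, slo, window_days=30):
    """Hours until the whole window's budget is gone at the current error ratio."""
    rate = burn_rate(error_ratio, slo)
    if rate == 0:
        return float("inf")
    return window_days * 24 / rate
```

With a 99.9% SLO the 30-day budget is about 43.2 minutes. An error ratio of 1.44% corresponds to a burn rate of about 14.4, which would exhaust the budget in roughly 50 hours; 14.4 is a commonly used fast-burn threshold (2% of a 30-day budget spent in one hour), not a value this checklist prescribes.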

Operations Checklist

  • [ ] CI/CD pipeline: PR gate + rolling deploy on merge to main
  • [ ] Rollback procedure documented and tested
  • [ ] Log retention: CloudWatch log group retention set (not unlimited)
  • [ ] Cost estimate documented: ECS, DynamoDB, SQS, ALB, CloudWatch
  • [ ] On-call rotation defined; Day 1 primary named
  • [ ] Post-deploy smoke test: submit a task, wait for completion, verify
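
The post-deploy smoke test (submit, wait, verify) can be sketched as a polling loop with a deadline. Everything service-specific here is an assumption: submit and get_status are hypothetical callables wrapping the real API client, and the status strings "COMPLETED" and "FAILED" are placeholders for whatever the service actually returns.

```python
import time


def smoke_test(submit, get_status, timeout_s=60.0, poll_s=1.0,
               clock=time.monotonic, sleep=time.sleep):
    """Submit one task and poll until it completes, fails, or times out."""
    task_id = submit()
    deadline = clock() + timeout_s
    while True:
        status = get_status(task_id)
        if status == "COMPLETED":
            return task_id
        if status == "FAILED":
            raise RuntimeError(f"task {task_id} failed")
        if clock() >= deadline:
            raise TimeoutError(f"task {task_id} still {status} after {timeout_s}s")
        sleep(poll_s)
```

Because the function raises on failure or timeout, a pipeline step that calls it exits non-zero and can trigger the documented rollback. The clock and sleep parameters exist only so the loop can be exercised without real waiting.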

Tomorrow: retrospective — what did we build and what did we learn?