← Week 2: Metrics & Alerting

Day 14: Challenge — Metrics and SLO Dashboard

Phase 6 · Sep 15, 2026

Challenge Overview

Build a full metrics + alerting stack for the task service:

  • Prometheus scraping the service /metrics endpoint
  • Four golden signals dashboard in Grafana (latency, traffic, errors, saturation)
  • SLO recording rules and error budget panel
  • Prometheus alerting rules for fast burn and slow burn, routed through Alertmanager
  • absent() alert for missing scrape data
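
As a starting point for the first item, here is a minimal sketch of a Prometheus configuration that scrapes the service and loads rule files. The host names, ports, and the rules path are assumptions for illustration; only the job name task-svc and the /metrics path come from this challenge.

```yaml
# prometheus.yml (sketch; addresses and paths below are assumed, not prescribed)
global:
  scrape_interval: 15s        # how often targets are scraped
  evaluation_interval: 15s    # how often recording/alerting rules are evaluated

rule_files:
  - rules/*.yml               # hypothetical location for SLO and alert rules

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]   # assumed Alertmanager host:port

scrape_configs:
  - job_name: task-svc        # becomes the job="task-svc" label on every series
    metrics_path: /metrics    # the Prometheus default, spelled out for clarity
    static_configs:
      - targets: ["task-svc:8080"]         # assumed service host:port
```

The job_name matters beyond scraping: every alert that filters on job="task-svc" depends on it matching exactly.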

Golden Signal Dashboard

┌──────────────────────────────────────────────────────┐
│  $service: task-svc               [Last 1h] [Auto]   │
├────────────────┬────────────────┬────────────────────┤
│ Traffic        │ Error Rate     │ P95 Latency        │
│ 142 req/s      │ 0.3%           │ 87ms               │
│ [sparkline]    │ [sparkline]    │ [sparkline]        │
├────────────────┴────────────────┴────────────────────┤
│ Error Budget (30d SLO: 99.9%)                        │
│ Remaining: 68%  Burn rate: 0.8x  ETA: ∞              │
└──────────────────────────────────────────────────────┘
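
The error budget panel can be backed by recording rules along these lines. This is a sketch under assumptions: the counter name http_requests_total, its code label, and the recorded series names are hypothetical and should be swapped for whatever the task service actually exports. The constant 0.001 is the allowed error ratio for a 99.9% SLO (1 - 0.999).

```yaml
# rules/slo.yml (sketch; metric and rule names are assumptions)
groups:
  - name: task-svc-slo
    rules:
      # Fraction of requests that returned 5xx over the last hour.
      - record: job:http_request_errors:ratio_rate1h
        expr: |
          sum by (job) (rate(http_requests_total{job="task-svc", code=~"5.."}[1h]))
          /
          sum by (job) (rate(http_requests_total{job="task-svc"}[1h]))

      # Same ratio over the full 30-day SLO window.
      - record: job:http_request_errors:ratio_rate30d
        expr: |
          sum by (job) (rate(http_requests_total{job="task-svc", code=~"5.."}[30d]))
          /
          sum by (job) (rate(http_requests_total{job="task-svc"}[30d]))

      # Burn rate: observed error ratio divided by the allowed ratio.
      # 1x means the budget lasts exactly the 30-day window.
      - record: job:slo_burn_rate:ratio_rate1h
        expr: job:http_request_errors:ratio_rate1h / 0.001

      # Share of the 30-day error budget still unspent (1 = untouched, 0 = exhausted).
      - record: job:slo_error_budget_remaining:ratio
        expr: 1 - (job:http_request_errors:ratio_rate30d / 0.001)
```

As a sanity check on the arithmetic, a 30-day error ratio of 0.00032 gives 1 - 0.00032 / 0.001 = 0.68, which is the 68% remaining shown in the mock panel. A 30d range query over a raw counter can be expensive, so evaluating this group at a longer interval may be worth considering.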

Alert Rules Checklist

# Required alerts:
- alert: HighErrorRate       # > 1% for 5m
- alert: HighLatencyP99      # > 500ms for 10m
- alert: FastBudgetBurn      # 14x burn rate, 1h window
- alert: SlowBudgetBurn      # 2x burn rate, 6h window
- alert: ScrapeTargetDown    # absent(up{job="task-svc"}) for 5m
- alert: QueueDepthHigh      # SQS queue depth > 1000 for 10m
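
Several of the checklist entries could be written as Prometheus alerting rules roughly as follows. Treat this as a sketch: http_requests_total, its code label, http_request_duration_seconds_bucket, and the severity label values are assumptions, while the thresholds and durations are taken from the checklist.

```yaml
# rules/alerts.yml (sketch; metric names and severity labels are assumptions)
groups:
  - name: task-svc-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{job="task-svc", code=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="task-svc"}[5m]))
          > 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "task-svc 5xx ratio above 1% for 5 minutes"

      - alert: HighLatencyP99
        expr: |
          histogram_quantile(
            0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket{job="task-svc"}[5m]))
          ) > 0.5
        for: 10m
        labels:
          severity: ticket

      - alert: FastBudgetBurn
        # A 14x burn on a 99.9% SLO means an error ratio above 14 * 0.001.
        expr: |
          sum(rate(http_requests_total{job="task-svc", code=~"5.."}[1h]))
          /
          sum(rate(http_requests_total{job="task-svc"}[1h]))
          > (14 * 0.001)
        labels:
          severity: page

      - alert: ScrapeTargetDown
        expr: absent(up{job="task-svc"})
        for: 5m
        labels:
          severity: page
```

SlowBudgetBurn follows the FastBudgetBurn pattern with a 6h range and a 2x multiplier. Note that absent(up{job="task-svc"}) only fires when the up series is missing altogether; a target that is still configured but failing its scrapes reports up == 0 instead, so a separate check on that condition may be useful.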

Test by:

  1. Returning 500s from a test handler for 6+ minutes (longer than the 5m threshold on HighErrorRate)
  2. Verifying HighErrorRate fires and reaches Alertmanager
  3. Verifying Slack/PagerDuty notification received
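
For step 3 to succeed, Alertmanager needs a route and receivers. The fragment below is one possible shape, assuming alerts carry a severity label and that pages go to PagerDuty while everything else goes to Slack. The receiver names, channel, and credentials are placeholders.

```yaml
# alertmanager.yml (sketch; receiver names, channel, and secrets are placeholders)
route:
  receiver: slack-default          # fallback for anything not matched below
  group_by: [alertname, job]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="page"          # assumed label set by the alerting rules
      receiver: pagerduty-oncall

receivers:
  - name: slack-default
    slack_configs:
      - api_url: "<slack-webhook-url>"       # placeholder, supply your own
        channel: "#task-svc-alerts"          # hypothetical channel
        send_resolved: true
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: "<pagerduty-routing-key>"   # placeholder
```

During the test, the alert should move from pending to firing on the Prometheus Alerts page once the 5m window has elapsed, and only then appear in Alertmanager.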

Week 2 Recap

Topic                  Key Insight
Prometheus data model  Counter, Gauge, Histogram; pull scrape
PromQL                 rate(), histogram_quantile(), sum by
Grafana                Template variables; deploy annotations
Alertmanager           Routing, inhibition, silences
SLOs                   Error budget; multi-window burn rate
CloudWatch             EMF for Lambda; DatapointsToAlarm for tolerance

Next week: Log Aggregation, covering structured logging, CloudWatch Logs, OpenSearch, and Fluent Bit.