← Week 2: Metrics & Alerting

Day 11: Alertmanager

Phase 6 · Sep 12, 2026

Agenda (2–3 hours)

  • Read (45 min): Prometheus alerting rules documentation; Alertmanager routing and inhibition documentation; PagerDuty integration
  • Study (45 min): What is the difference between a pending and a firing alert? Why do most alert rules set a for: duration such as 5m, even though Prometheus treats it as optional?
  • Practice (45 min): Write alert rules for: error rate > 1% for 5 minutes; P99 latency > 500ms for 10 minutes; no scrape data for 5 minutes (absent)
  • Challenge (30 min): A service has 3 replicas; one goes down. Error rate spikes for 30 seconds then recovers as traffic re-routes. Design alert thresholds that page on sustained failure but not transient spikes

Prometheus Alert Rules

# task-service-alerts.yml
groups:
  - name: task-service
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{service="task-svc",status=~"5.."}[5m]))
          / sum(rate(http_requests_total{service="task-svc"}[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "Error rate {{ $value | humanizePercentage }} for task-svc"
          runbook: "https://wiki/runbooks/task-svc-errors"

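for: 5m means the expression must stay true for 5 consecutive minutes before the alert fires. Until then the alert is pending, and it drops back to inactive if the expression stops being true. This filters out brief spikes and reduces flapping.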
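
The other two Practice rules, plus one possible starting point for the Challenge, might look like the sketch below. The histogram metric http_request_duration_seconds_bucket, the file name, the alert names, and the severities are assumptions rather than part of the original rule file; substitute whatever the service actually exports. The last rule shortens the rate window to 2m. The reasoning, assuming steady traffic: a 30-second spike in which one replica in three fails averages out to about 3.3% over a 5m window, which can hold the original expression above 1% for roughly five minutes, right at the edge of for: 5m. In a 2m window the same spike ages out after roughly two and a half minutes, so the pending alert resets before it fires.

```yaml
# task-service-alerts-practice.yml (hypothetical file name)
groups:
  - name: task-service-practice
    rules:
      # P99 latency above 500ms for 10 minutes.
      # Assumes a Prometheus histogram recorded in seconds (assumed metric name).
      - alert: HighP99Latency
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket{service="task-svc"}[5m]))
          ) > 0.5
        for: 10m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "P99 latency {{ $value | humanizeDuration }} for task-svc"

      # No http_requests_total series for task-svc at all for 5 minutes.
      - alert: TaskServiceMetricsAbsent
        expr: absent(http_requests_total{service="task-svc"})
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "No http_requests_total series found for task-svc"

      # Challenge sketch: a 2m window lets a 30s spike age out quickly,
      # while for: 5m still requires the failure to be sustained.
      - alert: SustainedHighErrorRate
        expr: |
          sum(rate(http_requests_total{service="task-svc",status=~"5.."}[2m]))
          / sum(rate(http_requests_total{service="task-svc"}[2m])) > 0.01
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "Sustained error rate {{ $value | humanizePercentage }} for task-svc"
```

The window and for: values here are starting points to tune against real traffic, not recommended constants. A rules file like this can be syntax-checked with promtool check rules before it is loaded.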

Alertmanager Routing

# alertmanager.yml
route:
  group_by: [alertname, service]
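  group_wait: 30s       # wait 30s before the first notification for a new group, to batch alerts
  group_interval: 5m    # wait 5m before notifying about alerts added to an already-notified group
  repeat_interval: 4h   # re-send the notification if the group is still firing after 4h
  receiver: pagerduty-critical   # default receiver when no child route matches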

  routes:
    - match:
        severity: warning
      receiver: slack-warnings
    - match:
        team: backend
      receiver: pagerduty-backend

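receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      # Placeholder: Alertmanager does not expand environment variables,
      # so substitute the real key when this file is rendered at deploy time.
      - routing_key: $PAGERDUTY_KEY
  # slack-warnings and pagerduty-backend must also be defined here;
  # Alertmanager rejects a config that routes to an undefined receiver.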
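
Route order matters in the config above: child routes are checked top to bottom and, unless a route sets continue: true, matching stops at the first hit. A warning alert from the backend team therefore goes to Slack only and never reaches pagerduty-backend. The fragment below is a sketch of the same routes written with the matchers syntax, assuming Alertmanager 0.22 or newer, where matchers supersedes match. It is a routing fragment only; the three receivers still need their own definitions.

```yaml
# alertmanager.yml (routing fragment; assumes Alertmanager 0.22 or newer)
route:
  group_by: [alertname, service]
  receiver: pagerduty-critical      # fallback when no child route matches
  routes:
    # Checked first: every warning goes to Slack, whatever the team.
    - matchers:
        - severity = "warning"
      receiver: slack-warnings
      # continue: true              # uncomment to keep evaluating the routes below as well
    # Checked second: remaining backend alerts page the backend rotation.
    - matchers:
        - team = "backend"
      receiver: pagerduty-backend
```

With this ordering, HighErrorRate (severity critical, team backend) skips the first route and is delivered to pagerduty-backend, while alerts that match neither route fall back to pagerduty-critical.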

Inhibition and Silences

Inhibition suppresses notifications for lower-priority alerts while a higher-priority alert is firing:

inhibit_rules:
  - source_match:
      alertname: ClusterDown
    target_match:
      severity: warning
    equal: [cluster]

While ClusterDown is firing, notifications for warning alerts that carry the same cluster label value are suppressed.
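
A more general variant of the same idea is sketched below: any critical alert mutes the warning-level alert for the same problem. It uses the source_matchers and target_matchers fields, assuming Alertmanager 0.22 or newer, and it assumes every alert carries alertname, cluster, and service labels; adjust the equal list to the labels actually in use.

```yaml
# alertmanager.yml (inhibition fragment; label names are assumptions)
inhibit_rules:
  - source_matchers:
      - severity = "critical"       # the alert that does the muting
    target_matchers:
      - severity = "warning"        # the alerts that get muted
    # Only inhibit when source and target describe the same thing.
    equal: [alertname, cluster, service]
```

The equal list is what keeps a critical alert in one cluster from hiding warnings in an unrelated one.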

Silences temporarily mute matching alerts, for example during planned maintenance:

amtool silence add alertname=HighErrorRate \
  --duration=2h \
  --comment="Planned DynamoDB migration"

Key Takeaways

  • for: 5m reduces flapping: the expression must be continuously true for 5 minutes (pending) before the alert fires and pages
  • Alertmanager groups related alerts and routes by label (severity, team)
  • Inhibition rules prevent alert storms when a root cause already has a page
  • Silence during maintenance windows to prevent noise that erodes on-call trust

Tomorrow: SLIs, SLOs, and error budgets — operationalizing reliability targets.