← Week 2: Metrics & Alerting

Day 12: SLIs, SLOs, and Error Budgets

Phase 6 · Sep 13, 2026

Agenda (2–3 hours)

  • Read (45 min): Google SRE book, Chapter 4 (Service Level Objectives); SLO specification; Sloth (SLO generator for Prometheus)
  • Study (45 min): Define SLIs and SLOs for the task management service. What is the error budget for 99.9% availability over 30 days?
  • Practice (45 min): Implement SLO-based Prometheus recording rules; build a burn rate alert using the multi-window method
  • Challenge (30 min): The error budget for a service is 50% consumed in the first week of a 30-day window. At this burn rate, when does the budget run out? What actions should the team take?

SLI → SLO → SLA

SLI (Service Level Indicator): a quantitative measure of service behavior.
availability_sli = good_requests / total_requests

SLO (Service Level Objective): target value for an SLI.
availability_slo = 99.9% over 30 days

SLA (Service Level Agreement): contractual commitment (external-facing).

Error budget = 1 − SLO: the fraction of requests allowed to fail within the 30-day window.
99.9% SLO → 0.1% budget → 43.2 minutes of full downtime allowed per 30 days (30 × 24 × 60 × 0.001)
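
The arithmetic above can be checked with a small helper. This is a minimal sketch, not part of the course material: the function names error_budget_minutes and allowed_bad_requests are hypothetical, and the time-based figure assumes every minute of the window carries equal weight.

```python
import math


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full downtime that an SLO permits over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def allowed_bad_requests(slo: float, total_requests: int) -> int:
    """Whole number of failed requests the budget permits for a given volume."""
    # Round first to absorb floating-point noise, then floor to stay within budget.
    return math.floor(round((1.0 - slo) * total_requests, 6))


if __name__ == "__main__":
    print(round(error_budget_minutes(0.999), 1))   # minutes per 30 days
    print(allowed_bad_requests(0.999, 1_000_000))  # failed requests per million
```

The request-based form is usually the one that matches the availability SLI defined above, since that SLI counts requests rather than minutes.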

Recording Rules for SLO Tracking

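# prometheus-rules.yml
groups:
  - name: slo-task-service
    interval: 1m
    rules:
      # Total request rate (requests/second) over the last 5 minutes
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total{job="task-svc"}[5m]))

      # Rate of server errors (5xx responses) over the last 5 minutes
      - record: job:http_errors:rate5m
        expr: sum(rate(http_requests_total{job="task-svc",status=~"5.."}[5m]))

      # Availability SLI: fraction of requests that did not return 5xx
      - record: job:availability:ratio_rate5m
        expr: |
          1 - (job:http_errors:rate5m / job:http_requests:rate5m)

      # Longer windows referenced by the burn rate alerts
      - record: job:availability:ratio_rate1h
        expr: |
          1 - (sum(rate(http_requests_total{job="task-svc",status=~"5.."}[1h]))
               / sum(rate(http_requests_total{job="task-svc"}[1h])))
      - record: job:availability:ratio_rate6h
        expr: |
          1 - (sum(rate(http_requests_total{job="task-svc",status=~"5.."}[6h]))
               / sum(rate(http_requests_total{job="task-svc"}[6h])))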
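
To see what the availability ratio rule computes, the same calculation can be sketched outside Prometheus from two samples of each counter. This is an illustration under simplifying assumptions: the function name availability_ratio is hypothetical, counters are assumed not to reset inside the window, and the extrapolation that rate() performs is ignored (it cancels out in the ratio when both counters cover the same window).

```python
from typing import Optional


def availability_ratio(
    total_start: float,
    total_end: float,
    errors_start: float,
    errors_end: float,
) -> Optional[float]:
    """Availability over a window, from counter values at its start and end."""
    total_delta = total_end - total_start
    error_delta = errors_end - errors_start
    if total_delta <= 0:
        # No traffic in the window: the ratio is undefined, so report no value.
        return None
    return 1.0 - error_delta / total_delta


if __name__ == "__main__":
    # 10,000 requests in the window, 10 of them returned 5xx.
    print(availability_ratio(1_000, 11_000, 5, 15))
```

Returning None for an empty window is a design choice of this sketch; it keeps a quiet period from being read as either perfect or zero availability.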

Multi-Window Burn Rate Alert

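# Burn rate = observed error rate / budgeted error rate (0.001 for a 99.9% SLO).
# Assumes job:availability:ratio_rate1h and job:availability:ratio_rate6h are
# recorded the same way as job:availability:ratio_rate5m.

# Fast burn: 14x rate over 1h (uses about 1.9% of the 30-day budget in 1 hour)
- alert: FastErrorBudgetBurn
  expr: |
    job:availability:ratio_rate1h < (1 - 14 * 0.001)
    and
    job:availability:ratio_rate5m < (1 - 14 * 0.001)
  for: 2m
  labels:
    severity: critical

# Slow burn: 2x rate over 6h (uses about 1.7% of the 30-day budget in 6 hours)
- alert: SlowErrorBudgetBurn
  expr: job:availability:ratio_rate6h < (1 - 2 * 0.001)
  for: 15m
  labels:
    severity: warning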
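
The numbers behind a burn rate alert follow from three small formulas, sketched here as a sanity check. The function names are hypothetical, and the sketch assumes a 30-day SLO window and a burn rate that stays constant.

```python
SLO_WINDOW_HOURS = 30 * 24  # 30-day SLO window, in hours


def budget_consumed(burn_rate: float, alert_window_hours: float,
                    slo_window_hours: float = SLO_WINDOW_HOURS) -> float:
    """Fraction of the error budget spent if burn_rate holds for the alert window."""
    return burn_rate * alert_window_hours / slo_window_hours


def availability_threshold(burn_rate: float, slo: float = 0.999) -> float:
    """Availability below which the service is burning budget faster than burn_rate."""
    return 1.0 - burn_rate * (1.0 - slo)


def hours_to_exhaustion(burn_rate: float,
                        slo_window_hours: float = SLO_WINDOW_HOURS) -> float:
    """Hours until a full budget is gone at a constant burn rate."""
    if burn_rate <= 0:
        raise ValueError("burn_rate must be positive")
    return slo_window_hours / burn_rate


if __name__ == "__main__":
    print(f"{budget_consumed(14, 1):.1%}")       # fast burn: share of budget per hour
    print(f"{budget_consumed(2, 6):.1%}")        # slow burn: share of budget per 6 hours
    print(f"{availability_threshold(14):.3f}")   # threshold used in the fast burn expr
    print(hours_to_exhaustion(2) / 24, "days")   # full budget at a sustained 2x burn
```

A burn rate of 1 spends the budget exactly at the end of the window, so any sustained rate above 1 exhausts it early; a 2x burn, for example, empties a full 30-day budget in 15 days.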

Key Takeaways

  • SLOs set the reliability target; error budgets quantify the allowed failures
  • Multi-window burn rate alerts fire early on fast burns and catch slow degradation
  • When budget is exhausted, freeze feature work and invest in reliability
  • SLOs enable rational trade-offs: a feature that costs 10% of the error budget is worth discussing

Tomorrow: Phase 6 Week 2 challenge — build a full metrics + alerting stack for the task service.