← Week 2: Metrics & Alerting

Day 9: PromQL

Phase 6 · Sep 10, 2026


Agenda (2–3 hours)

  • Read (45 min): PromQL documentation — selectors, functions, operators, range vectors
  • Study (45 min): Derive the four golden signals (latency, traffic, errors, saturation) in PromQL from the raw http_requests_total and request_duration_seconds metrics, plus node_cpu_seconds_total for saturation
  • Practice (45 min): Write PromQL queries for: request rate, error rate, P95 latency, and CPU saturation; verify results in the Prometheus Expression Browser
  • Challenge (30 min): rate() requires at least two data points in the range window. What happens to the result at service startup, and how do you handle it in alerting rules?

PromQL Fundamentals

An instant vector holds one sample per matching series, taken at the query's evaluation time:

http_requests_total{status="200", method="GET"}

A range vector holds each matching series' samples over a time window (used as input to functions):

http_requests_total[5m]

rate() returns the average per-second rate of increase of a counter over the window, and accounts for counter resets:

rate(http_requests_total[5m])

irate() computes the rate from the last two data points in the window only (more responsive, so better suited to graphs of fast-moving counters than to alerts):

irate(http_requests_total[1m])
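
To make the difference between rate() and irate() concrete, here is a minimal Python sketch of the arithmetic behind each. It is a simplification, not Prometheus's implementation: the real rate() also extrapolates toward the edges of the window, which this sketch skips. The names simple_rate and simple_irate and the sample data are hypothetical.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_rate(samples: List[Sample]) -> Optional[float]:
    """Average per-second increase across the window, reset-aware.

    Simplified: the real rate() also extrapolates to the window edges.
    """
    if len(samples) < 2:
        return None  # fewer than two points in the window: no result
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter reset; count the new value from zero.
        increase += cur - prev if cur >= prev else cur
    return increase / (samples[-1][0] - samples[0][0])


def simple_irate(samples: List[Sample]) -> Optional[float]:
    """Per-second increase between the last two samples only."""
    if len(samples) < 2:
        return None
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    delta = v1 - v0 if v1 >= v0 else v1
    return delta / (t1 - t0)


# Hypothetical counter scraped every 15 s, with a burst in the final interval.
window = [(0.0, 100.0), (15.0, 115.0), (30.0, 130.0), (45.0, 145.0), (60.0, 220.0)]
print(simple_rate(window))   # 2.0 (120 requests over 60 s)
print(simple_irate(window))  # 5.0 (75 requests over the last 15 s)
```

The burst shows up at full strength in the irate-style result and is smoothed in the rate-style result. Both functions return nothing for a window with fewer than two samples, which is the situation the agenda's challenge question asks about.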

Four Golden Signals

# Traffic (req/s)
sum(rate(http_requests_total[5m]))

# Error rate (%)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100

# P95 latency (seconds)
histogram_quantile(0.95,
  sum by (le) (rate(request_duration_seconds_bucket[5m]))
)

# Saturation (CPU usage %)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
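
The P95 latency query above relies on histogram_quantile() estimating a percentile from cumulative bucket counts. The Python sketch below illustrates the core idea under simplifying assumptions: classic cumulative buckets, linear interpolation inside the matching bucket, and a lowest bucket that starts at 0. The function name estimate_quantile and the bucket values are hypothetical, and edge cases that Prometheus handles are omitted.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with the +Inf bucket,
    mirroring the `le` label on a classic Prometheus histogram.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only supports 0 < q <= 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical per-bucket rates, already summed by `le` across instances.
buckets = [(0.1, 50.0), (0.25, 80.0), (0.5, 94.0), (1.0, 99.0), (math.inf, 100.0)]
p95 = estimate_quantile(0.95, buckets)
print(round(p95, 3))  # about 0.6 s: one fifth of the way through the 0.5 to 1.0 bucket
```

Because the result is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies you care about.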

Aggregation

Aggregate across instances, collapsing the pod label (and any other label not listed) and keeping only method and status:

sum by (method, status) (rate(http_requests_total[5m]))

Top 5 slowest endpoints by P99 latency:

topk(5,
  histogram_quantile(0.99,
    sum by (route, le) (rate(request_duration_seconds_bucket[5m]))
  )
)

sum without (pod) (...) is the inverse of by: it aggregates away only the pod label and keeps every other label (useful when pod names change).
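
The grouping behaviour of by and without can be sketched in a few lines of Python. This is an illustration of the label handling only, not how Prometheus evaluates queries; the helper names sum_by and sum_without and the sample series are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Tuple

Labels = Dict[str, str]
Series = Tuple[Labels, float]
GroupKey = FrozenSet[Tuple[str, str]]


def sum_by(series: Iterable[Series], keep: List[str]) -> Dict[GroupKey, float]:
    """Like `sum by (keep...)`: group on the listed labels only."""
    out: Dict[GroupKey, float] = defaultdict(float)
    for labels, value in series:
        key = frozenset((k, v) for k, v in labels.items() if k in keep)
        out[key] += value
    return dict(out)


def sum_without(series: Iterable[Series], drop: List[str]) -> Dict[GroupKey, float]:
    """Like `sum without (drop...)`: group on every label except those listed."""
    out: Dict[GroupKey, float] = defaultdict(float)
    for labels, value in series:
        key = frozenset((k, v) for k, v in labels.items() if k not in drop)
        out[key] += value
    return dict(out)


# Hypothetical per-pod request rates.
series = [
    ({"method": "GET", "status": "200", "pod": "api-1"}, 4.0),
    ({"method": "GET", "status": "200", "pod": "api-2"}, 6.0),
    ({"method": "POST", "status": "500", "pod": "api-1"}, 1.0),
]

for key, value in sum_without(series, ["pod"]).items():
    print(dict(sorted(key)), value)
# {'method': 'GET', 'status': '200'} 10.0
# {'method': 'POST', 'status': '500'} 1.0
```

With these labels, sum_by(series, ["method", "status"]) yields the same groups. The two diverge as soon as a series carries an extra label: without keeps it, by drops it.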

Key Takeaways

  • rate() over range vectors produces per-second rates; use a window that spans several scrape intervals, such as [5m], for stable alerting
  • histogram_quantile() estimates percentiles server-side by interpolating within histogram buckets
  • Aggregate with sum by (...) to collapse instance-level metrics into service-level signals
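  • For multi-instance percentiles such as P99, sum the bucket rates by le (plus any labels you want to keep) first, then apply histogram_quantile() to the result; averaging per-instance percentiles does not give a meaningful service-level value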

Tomorrow: Grafana dashboards — panels, variables, and alert rules.