Day 19: Log-Based Alerting

Signal	Use for alerting	Alert latency
Metrics	High-frequency patterns (error rate, latency)	~10s
Logs	Specific error messages, panics, security events	30–120s
Traces	Individual trace IDs are not directly alertable	N/A

Grafana Loki (alternative)

Loki indexes only the labels of each log stream, not the full log text. Query it with LogQL:

# Count of error logs per 5 minutes
sum(count_over_time({service="task-svc", level="error"}[5m]))

# Extract latency field and compute P95
quantile_over_time(0.95,
  {service="task-svc"} | json | duration_ms > 0 | unwrap duration_ms [5m]
) by (route)
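
To make the P95 pipeline above easier to reason about, here is a minimal Python sketch that emulates its stages on a handful of log lines already pulled from one 5-minute window. The function name p95_by_route and the sample lines are hypothetical, and the sketch assumes the quantile is linearly interpolated between ranks, as Prometheus-style quantile functions do. It drops lines that are not JSON objects, which is a simplification of how Loki handles parse errors.

```python
import json
import math


def p95_by_route(lines, q=0.95):
    """Emulate `| json | duration_ms > 0 | unwrap duration_ms` plus a per-route quantile."""
    samples = {}
    for line in lines:
        try:
            record = json.loads(line)              # `| json` stage
        except json.JSONDecodeError:
            continue                               # simplification: skip non-JSON lines
        if not isinstance(record, dict):
            continue
        duration = record.get("duration_ms")
        if not isinstance(duration, (int, float)) or duration <= 0:
            continue                               # `| duration_ms > 0` filter
        route = record.get("route", "")            # grouping key for `by (route)`
        samples.setdefault(route, []).append(float(duration))

    result = {}
    for route, values in samples.items():
        values.sort()
        rank = q * (len(values) - 1)               # assumed: linear interpolation between ranks
        low, high = math.floor(rank), math.ceil(rank)
        result[route] = values[low] + (values[high] - values[low]) * (rank - low)
    return result


# Hypothetical sample lines for illustration only.
window = [
    '{"route": "/tasks", "duration_ms": 100}',
    '{"route": "/tasks", "duration_ms": 200}',
    '{"route": "/tasks", "duration_ms": 300}',
    '{"route": "/tasks", "duration_ms": 400}',
    '{"route": "/tasks", "duration_ms": 500}',
    '{"route": "/tasks", "duration_ms": 0}',
    'not json at all',
]
print(p95_by_route(window))
```

The zero-duration line and the non-JSON line contribute no sample, so only five values feed the quantile for /tasks.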

Loki alert rule:

- alert: PanicDetected
  expr: count_over_time({service="task-svc"} |= "panicked" [1m]) > 0
  for: 0m     # alert immediately
  annotations:
    summary: "Panic detected in task-svc"
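
Before loading a rule like this, it can help to run its expression as a one-off query and check what it returns. The sketch below only builds the request URL for Loki's instant-query HTTP endpoint (/loki/api/v1/query); the helper name instant_query_url is hypothetical, and the base URL http://localhost:3100 is an assumption about a local setup.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, expr):
    """Build a URL for Loki's instant-query endpoint with the LogQL expression URL-encoded."""
    return f"{base_url.rstrip('/')}/loki/api/v1/query?{urlencode({'query': expr})}"


# Same expression as the PanicDetected rule above.
PANIC_EXPR = 'count_over_time({service="task-svc"} |= "panicked" [1m]) > 0'

# Assumed local Loki address; replace with the address of your own instance.
url = instant_query_url("http://localhost:3100", PANIC_EXPR)
print(url)
```

Fetching the printed URL with curl or a browser should show whether the expression currently returns any series. An empty result means the alert would not fire at this moment.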

Phase 6 · Sep 20, 2026

Agenda (2–3 hours)

When to Alert on Logs vs Metrics

CloudWatch Metric Filter → Alarm Chain

Grafana Loki (alternative)

Key Takeaways