← Week 3: Log Aggregation & Analysis

Day 19: Log-Based Alerting

Phase 6 · Sep 20, 2026

Agenda (2–3 hours)

  • Read (45 min): CloudWatch Logs metric filters and alarms; OpenSearch alerting plugin; Grafana Loki LogQL alerting
  • Study (45 min): When is a log-based alert preferable to a metrics-based alert? Design a log alert for a panic detected in application logs
  • Practice (45 min): Create a CloudWatch metric filter that counts level=ERROR events; wire it to a CloudWatch alarm; trigger it with a test log event
  • Challenge (30 min): Log-based alerts have higher latency than metric alerts because of log ingestion lag. Quantify the end-to-end latency from log emission to PagerDuty notification for CloudWatch Logs

When to Alert on Logs vs Metrics

Signal   | Use for alerting                                   | Latency
Metrics  | High-frequency patterns (error rate, latency)      | ~10s
Logs     | Specific error messages, panics, security events   | 30–120s
Traces   | Specific trace IDs are not directly alertable      | N/A

Alert on logs when: the event is rare, has a specific error code or message, or lacks a corresponding metric.

CloudWatch Metric Filter → Alarm Chain

App stdout → ECS awslogs driver → CloudWatch Log Stream
  → Metric Filter (pattern: { $.level = "ERROR" }) → ErrorCount metric
    → CloudWatch Alarm (ErrorCount > 0 for 1 min)
      → SNS Topic → PagerDuty / Slack
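
The chain above can be sketched as two API requests. This is a minimal sketch, not a definitive setup: the log group name, metric namespace, filter and alarm names, and the SNS topic ARN are all placeholders invented for illustration. The parameter names follow the boto3 `put_metric_filter` and `put_metric_alarm` calls; the clients are passed in so the request payloads can be inspected without AWS access.

```python
"""Sketch: request parameters for the metric filter -> alarm chain."""

LOG_GROUP = "/ecs/task-svc"      # hypothetical log group name
NAMESPACE = "TaskSvc/Logs"       # hypothetical metric namespace
# Placeholder ARN; substitute the topic that feeds PagerDuty / Slack.
SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:oncall-pages"


def metric_filter_params() -> dict:
    """Arguments for put_metric_filter: count JSON events with level=ERROR."""
    return {
        "logGroupName": LOG_GROUP,
        "filterName": "error-count",
        "filterPattern": '{ $.level = "ERROR" }',
        "metricTransformations": [{
            "metricName": "ErrorCount",
            "metricNamespace": NAMESPACE,
            "metricValue": "1",  # each matching log event adds 1
        }],
    }


def alarm_params() -> dict:
    """Arguments for put_metric_alarm: ErrorCount > 0 over one 1-minute period."""
    return {
        "AlarmName": "task-svc-error-logs",
        "Namespace": NAMESPACE,
        "MetricName": "ErrorCount",
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        # No matching events means no datapoints; treat that as healthy.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [SNS_TOPIC_ARN],
    }


def create_chain(logs_client, cloudwatch_client) -> None:
    """Create the filter first so the alarm has a metric to watch."""
    logs_client.put_metric_filter(**metric_filter_params())
    cloudwatch_client.put_metric_alarm(**alarm_params())
```

With boto3 installed and credentials configured, this would be invoked as `create_chain(boto3.client("logs"), boto3.client("cloudwatch"))`. Setting `TreatMissingData` to `notBreaching` is a design choice for sparse metrics: the filter publishes nothing when no event matches, and without it the alarm would sit in an insufficient-data state.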

End-to-end latency:
  Log emitted → CloudWatch ingestion: 5–15s
  Metric filter evaluation: 1-min period
  Alarm evaluation period: 1 min
  SNS delivery: 1–5s
  Total: ~2–3 minutes worst case
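
The budget above can be checked with a few lines of arithmetic. This sketch uses the stage figures listed here; the 0-second lower bounds for the two 1-minute periods are an assumption (an event landing right at the end of a period), not a measured value.

```python
# (stage, best-case seconds, worst-case seconds)
STAGES = [
    ("log emitted -> CloudWatch ingestion", 5, 15),
    ("metric filter 1-min period", 0, 60),   # lower bound is an assumption
    ("alarm evaluation period", 0, 60),      # lower bound is an assumption
    ("SNS delivery", 1, 5),
]


def latency_bounds(stages):
    """Sum the per-stage best and worst cases into end-to-end bounds."""
    best = sum(lo for _, lo, _ in stages)
    worst = sum(hi for _, _, hi in stages)
    return best, worst


best, worst = latency_bounds(STAGES)
print(f"best ~{best}s, worst ~{worst}s ({worst / 60:.1f} min)")
# prints: best ~6s, worst ~140s (2.3 min)
```

The worst case of 140 seconds is dominated by the two 1-minute periods, which is why shortening ingestion or SNS delivery barely moves the total.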

Grafana Loki (alternative)

Loki indexes only the labels on each log stream, not the full log text. Query it with LogQL:

# Count of error logs over the last 5 minutes
sum(count_over_time({service="task-svc", level="error"}[5m]))

# Extract latency field and compute P95
quantile_over_time(0.95,
  {service="task-svc"} | json | duration_ms > 0 | unwrap duration_ms [5m]
) by (route)
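
To run a query like the ones above outside Grafana, it can be sent to Loki's HTTP instant-query endpoint (`/loki/api/v1/query`). The sketch below only builds the request URL; the base URL `http://localhost:3100` is an assumed local Loki instance, and the helper name is invented for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# The error-count query from this section.
ERROR_COUNT_QUERY = 'sum(count_over_time({service="task-svc", level="error"}[5m]))'


def instant_query_url(base_url: str, logql: str) -> str:
    """Build a GET URL for Loki's instant-query endpoint, URL-encoding the LogQL."""
    return f"{base_url.rstrip('/')}/loki/api/v1/query?{urlencode({'query': logql})}"


url = instant_query_url("http://localhost:3100", ERROR_COUNT_QUERY)
# Round-trip check: decoding the URL yields the original query unchanged.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == ERROR_COUNT_QUERY)  # prints: True
```

Encoding matters here because LogQL is full of braces, quotes, and pipes that are not safe in a raw URL.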

Loki alert rule:

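groups:
  - name: task-svc-log-alerts   # group name is illustrative
    rules:
      - alert: PanicDetected
        # Fires when at least one line containing "panicked" arrived in the last minute
        expr: count_over_time({service="task-svc"} |= "panicked" [1m]) > 0
        for: 0m     # alert immediately
        annotations:
          summary: "Panic detected in task-svc"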

Key Takeaways

  • Log-based alerts are best for rare, semantically specific events (panics, auth failures)
  • CloudWatch Logs → metric filter pipeline has ~2–3 minute end-to-end latency; account for this in SLA expectations
  • Loki's LogQL supports structured field extraction and alerting without the cost of full-text indexing
  • Always pair log alerts with runbook links — log-based pages often require more investigation than metric pages

Tomorrow: correlating traces, metrics, and logs — unified observability.