← Week 3: Log Aggregation & Analysis

Day 20: Correlating Traces, Metrics, and Logs

Phase 6 · Sep 21, 2026

Agenda (2–3 hours)

  • Read (45 min): Grafana Explore correlations documentation; OpenTelemetry correlation between signals; AWS X-Ray + CloudWatch integration
  • Study (45 min): A P99 spike appears in Grafana. Walk through the investigation path: metric → trace → log. What fields must be present in each signal to enable the jump?
  • Practice (45 min): Configure Grafana data source correlations between Prometheus, Tempo, and Loki; verify that clicking a trace ID in Tempo opens the correlated log lines in Loki
  • Challenge (30 min): An error appears in logs but no corresponding error span exists in the trace. What are 3 possible causes? How do you detect each?

The Investigation Path

1. Metric alert fires: P99 latency > 500ms for task-svc

2. Grafana: metric → trace
   - Click on the spike in the time series panel
   - Grafana correlates: find Tempo traces where service=task-svc AND duration>500ms
   - Open the slowest trace in the flame graph

3. Trace → log
   - Identify the slow span: db:query (450ms)
   - Click trace_id on the span → Loki query: {service="task-svc"} | json | trace_id="abc123"
   - Read the debug log lines captured during that query

4. Root cause: DynamoDB partition throttle (RetryAttempts=3 visible in logs)
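
Both jumps in steps 2 and 3 amount to building a query from fields that the signals share. The sketch below shows that query construction using only the standard library. The function names are hypothetical, the TraceQL form assumes the service name is stored as the `service.name` resource attribute in Tempo, and both helpers assume the inputs contain no double quotes. The LogQL form is the one shown in step 3.

```rust
/// Build a TraceQL query for traces of `service` slower than `min_ms` milliseconds.
/// Assumption: Tempo stores the service name as the `service.name` resource attribute.
pub fn tempo_query_for_slow_traces(service: &str, min_ms: u64) -> String {
    format!(
        r#"{{ resource.service.name = "{}" && duration > {}ms }}"#,
        service, min_ms
    )
}

/// Build the LogQL query from step 3: select the service's log stream,
/// parse each line as JSON, and keep only lines with a matching trace_id.
/// Assumption: log lines are JSON objects with a top-level `trace_id` field.
pub fn loki_query_for_trace(service: &str, trace_id: &str) -> String {
    format!(
        r#"{{service="{}"}} | json | trace_id="{}""#,
        service, trace_id
    )
}
```

Calling `loki_query_for_trace("task-svc", "abc123")` yields the query from step 3. A real integration would let Grafana's data source configuration do this substitution rather than building strings by hand.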

Required Correlation Fields

Signal    Required field                        Purpose
Metrics   service label                         Group traces and logs by the same service
Traces    trace_id (32-char hex, 16 bytes)      Link spans to log lines
Logs      trace_id field (from span context)    Jump from log to trace
Logs      service field                         Consistent with metric label

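with_current_span(true) in tracing-subscriber's JSON formatter includes the current span's name and fields in every log event emitted inside that span. When the span records trace_id and span_id as fields, each log line carries them, and that shared key is what enables log-trace correlation.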
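
A quick way to see whether a log record can take part in correlation is to check it against the fields listed above. The following is a minimal sketch under stated assumptions: `LogRecord` and `correlation_problems` are hypothetical names, and the ID format follows W3C Trace Context (a trace ID is 32 lowercase hex characters, a span ID is 16, and neither may be all zeros).

```rust
/// True if `s` is exactly `len` lowercase hex characters and not all zeros.
fn is_valid_hex_id(s: &str, len: usize) -> bool {
    s.len() == len
        && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
        && s.bytes().any(|b| b != b'0')
}

/// W3C Trace Context trace ID: 16 bytes rendered as 32 hex characters.
pub fn is_valid_trace_id(s: &str) -> bool {
    is_valid_hex_id(s, 32)
}

/// W3C Trace Context span ID: 8 bytes rendered as 16 hex characters.
pub fn is_valid_span_id(s: &str) -> bool {
    is_valid_hex_id(s, 16)
}

/// Hypothetical view of the correlation fields parsed from one log line.
pub struct LogRecord<'a> {
    pub service: Option<&'a str>,
    pub trace_id: Option<&'a str>,
}

/// List the reasons this log line could not be joined to metrics or traces.
/// An empty result means both correlation keys are present and well formed.
pub fn correlation_problems(log: &LogRecord, metric_service: &str) -> Vec<&'static str> {
    let mut problems = Vec::new();
    match log.service {
        None => problems.push("missing service field"),
        Some(s) if s != metric_service => problems.push("service does not match metric label"),
        Some(_) => {}
    }
    match log.trace_id {
        None => problems.push("missing trace_id field"),
        Some(t) if !is_valid_trace_id(t) => problems.push("malformed trace_id"),
        Some(_) => {}
    }
    problems
}
```

Running a check like this over a sample of log lines can surface events emitted outside any span (no trace_id) or services whose log field differs from the metric label.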

Grafana Exemplars

Exemplars link metric data points to specific trace IDs:

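// Requires the `metrics`, `tracing`, `opentelemetry`, and `tracing-opentelemetry` crates.
use metrics::histogram;
use opentelemetry::trace::TraceContextExt; // provides .span() on Context
use tracing_opentelemetry::OpenTelemetrySpanExt; // provides .context() on tracing::Span

// Read the trace ID of the current span as a hex string.
let trace_id = tracing::Span::current()
    .context()
    .span()
    .span_context()
    .trace_id()
    .to_string();

// Record the observation with the trace ID attached. The `metrics` macro attaches
// it as a label; whether it is exported as an exemplar depends on the exporter.
histogram!("request_duration_seconds", "trace_id" => trace_id).record(elapsed.as_secs_f64());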

In Grafana: hover on a latency spike → exemplar tooltip → click trace ID → opens Tempo.
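
On the wire, an exemplar is extra data appended to a metric sample. The sketch below formats one histogram bucket sample with an exemplar in the OpenMetrics text format and reads the trace ID back out, which is roughly the information Grafana needs for the tooltip link. The function names are hypothetical, and the parser is a simplification that assumes `trace_id` is the first exemplar label.

```rust
/// Format one histogram bucket sample with an exemplar in OpenMetrics text format:
///   <name>_bucket{le="<bound>"} <count> # {trace_id="<id>"} <observed value>
pub fn bucket_line_with_exemplar(
    name: &str,
    le: &str,
    count: u64,
    trace_id: &str,
    observed: f64,
) -> String {
    format!(
        r#"{}_bucket{{le="{}"}} {} # {{trace_id="{}"}} {}"#,
        name, le, count, trace_id, observed
    )
}

/// Extract the trace ID from a sample line, or None if the line has no exemplar.
/// Simplification: expects `trace_id` to be the first label in the exemplar.
pub fn exemplar_trace_id(line: &str) -> Option<&str> {
    let marker = r#"# {trace_id=""#;
    let start = line.find(marker)? + marker.len();
    let end = line[start..].find('"')? + start;
    Some(&line[start..end])
}
```

The part after `#` is the exemplar: its label set names the trace, and the trailing number is the observed value that landed in that bucket.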

Key Takeaways

  • Correlation requires a shared key (trace_id, service) present in all three signals
  • with_current_span(true) adds the current span's fields to each log event, including trace_id and span_id when they are recorded on the span
  • Exemplars attach trace IDs to metric data points, enabling one-click metric → trace navigation
  • The investigation path: alert → metric spike → correlated trace → correlated log lines

Tomorrow: Phase 6 Week 3 Challenge — unified observability for the full task service stack.