← Week 1: Distributed Tracing

Day 3: Jaeger and Grafana Tempo

Phase 6 · Sep 4, 2026

Agenda (2–3 hours)

  • Read (45 min): Jaeger architecture documentation; Grafana Tempo documentation; TraceQL query language
  • Study (45 min): What storage backends does Jaeger support? How does Tempo's object-storage design differ from Jaeger's?
  • Practice (45 min): Run Jaeger all-in-one via Docker; generate traces from the app; find the slowest span using the Jaeger UI search
  • Challenge (30 min): Write a TraceQL query to find all traces where db.query duration exceeds 100ms and the root service is task-service

Jaeger Architecture

Application → OTLP → OTel Collector → Jaeger Collector
                                            └── Storage (Cassandra / Elasticsearch / in-memory)
                                                  └── Jaeger Query API
                                                        └── Jaeger UI

All-in-one binary (dev): runs all components in a single process.
Production: separate collector and query services + external storage.

# Ports: 16686 = Jaeger UI, 4317 = OTLP gRPC
# (comments cannot follow a trailing backslash, so they live up here)
docker run -d --name jaeger \
  -p 16686:16686 \
  -p 4317:4317 \
  jaegertracing/all-in-one:latest
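
For the practice task, the slowest span can also be found programmatically instead of through the UI search. The sketch below assumes the JSON shape returned by the `/api/traces` endpoint that the Jaeger UI itself calls on port 16686; that endpoint is internal rather than a stable public API, so treat the field names as assumptions to verify against your Jaeger version. `fetch_traces` and `slowest_span` are hypothetical helper names.

```python
import json
import urllib.request
from urllib.parse import urlencode


def fetch_traces(service, base_url="http://localhost:16686", limit=20):
    """Fetch recent traces for one service from a local Jaeger all-in-one.

    Assumption: the UI's internal /api/traces endpoint is reachable and
    returns {"data": [trace, ...]}.
    """
    query = urlencode({"service": service, "limit": limit})
    with urllib.request.urlopen(f"{base_url}/api/traces?{query}", timeout=5) as resp:
        return json.load(resp)


def slowest_span(payload):
    """Return (service, operation, duration_ms) for the longest span, or None."""
    best_span, best_processes = None, {}
    for trace in payload.get("data", []):
        for span in trace.get("spans", []):
            if best_span is None or span["duration"] > best_span["duration"]:
                best_span, best_processes = span, trace.get("processes", {})
    if best_span is None:
        return None
    process = best_processes.get(best_span.get("processID"), {})
    service = process.get("serviceName", "unknown")
    # Jaeger reports span durations in microseconds.
    return service, best_span["operationName"], best_span["duration"] / 1000.0
```

With the container running and the app sending traces, `slowest_span(fetch_traces("task-service"))` would return the longest span. That is usually a root span, so excluding spans with no parent reference is a natural next refinement.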

Grafana Tempo

Object-storage-based trace backend — stores traces in S3/GCS, no index database required.

# tempo.yaml
storage:
  trace:
    backend: s3
    s3:
      bucket: my-traces-bucket
      region: us-east-1

TraceQL — query language for trace search:

{ span.db.table = "tasks" && duration > 100ms }
| select(span.db.query, rootName, traceDuration)

Tempo integrates with Grafana: link from a log line or metric to the corresponding trace.
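
Outside Grafana, a TraceQL query like the one above can be sent to Tempo's HTTP search endpoint. The sketch below only builds the request URL. It assumes Tempo is listening on `localhost:3200` and that the query goes in the `q` parameter of `/api/search`; check both against your deployment. `tempo_search_url` is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def tempo_search_url(traceql, base_url="http://localhost:3200", limit=20):
    """Build a Tempo search URL carrying a TraceQL query in the `q` parameter."""
    # urlencode percent-escapes the braces, quotes, and ampersands in the query.
    return f"{base_url}/api/search?{urlencode({'q': traceql, 'limit': limit})}"


query = '{ span.db.table = "tasks" && duration > 100ms }'
url = tempo_search_url(query)

# Round-trip check: decoding the URL gives back the original query text.
decoded = parse_qs(urlsplit(url).query)
print(url)
print(decoded["q"][0])
```

Escaping matters here because a raw `&&` in a query string would otherwise be read as a parameter separator.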

Reading a Flame Graph

api-gateway          [====================================] 200ms
  auth-service       [==]                                   13ms
  task-service           [==============================]  170ms
    db:query                    [==========]               60ms  ← hot
    cache:get                              [=]              2ms
    serialize                               [====]         20ms
  • Width = duration (time-proportional)
  • Children are contained within parent span time
  • Gaps between children = time spent in the parent span itself (its own work, or calls that are not instrumented)
  • Identify: longest child, largest gap, sequential vs parallel calls
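
The "gaps" rule can be made concrete by computing each parent's self time: its duration minus the time covered by at least one child. The sketch below uses the durations from the diagram above. The start offsets are assumptions chosen to match the picture, since the diagram gives only durations. `self_time` is a hypothetical helper name.

```python
def self_time(parent, children):
    """Parent duration minus time covered by at least one child span.

    Spans are (start_ms, end_ms) tuples. Overlapping (parallel) children are
    merged first so shared time is not subtracted twice.
    """
    p_start, p_end = parent
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(children):
        # Clip each child to the parent's window.
        start, end = max(start, p_start), min(end, p_end)
        if start >= end:
            continue
        if cur_end is None or start > cur_end:
            # Disjoint from the current merged interval: close it, open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlaps the current merged interval: extend it.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return (p_end - p_start) - covered


# Durations from the diagram; start offsets are assumed.
api_gateway = (0, 200)
auth_service = (5, 18)      # 13 ms
task_service = (25, 195)    # 170 ms
db_query = (60, 120)        # 60 ms
cache_get = (125, 127)      # 2 ms
serialize = (130, 150)      # 20 ms

gateway_self = self_time(api_gateway, [auth_service, task_service])
task_self = self_time(task_service, [db_query, cache_get, serialize])
print(gateway_self, task_self)  # 17 88
```

Under these assumed offsets, task-service spends 88 ms outside its child spans, more than the 60 ms in `db:query`. That is why the largest gap is worth checking alongside the longest child.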

Key Takeaways

  • Jaeger uses Elasticsearch/Cassandra for indexed span search; Tempo uses object storage
  • TraceQL enables structured queries across span attributes and durations
  • Flame graphs reveal sequential vs parallel work and pinpoint latency bottlenecks
  • Tempo + Grafana links traces from dashboards and logs for unified investigation

Tomorrow: sampling strategies — head-based, tail-based, and adaptive sampling.