← Week 1: Distributed Tracing

Day 4: Sampling Strategies

Phase 6 · Sep 5, 2026

Agenda (2–3 hours)

  • Read (45 min): OpenTelemetry sampling documentation; Jaeger adaptive sampling documentation; tail-based sampling in the OTel Collector
  • Study (45 min): A service handles 10,000 req/s. At 100% sampling, storage costs $500/month. Design a sampling strategy that captures all errors and 1% of successful requests
  • Practice (45 min): Configure parent-based sampling at 10% in the OTel SDK; add a rule to always sample spans with error = true
  • Challenge (30 min): Tail-based sampling requires buffering the full trace before the sampling decision. What happens when a tail sampler's buffer fills up? Design an overflow policy

Why Sample?

At 10,000 req/s with 5 spans/trace:

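  • 100% sampling: 50,000 spans/s ≈ 4.3 billion spans/day → ~4 TB/day, assuming roughly 1 KB per stored span
  • 1% sampling: 500 spans/s → ~40 GB/day
  • Smart sampling: capture all errors + 1% success → ~50 GB/day, assuming errors are a fraction of a percent of traffic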
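
The span rates above can be checked with a few lines of arithmetic. This is a minimal sketch; the function names are hypothetical, and the 0.25% error rate in the usage note is an assumption chosen only for illustration, not a figure from the exercise.

```rust
/// Spans per second that reach storage when every error trace is kept
/// and only `success_ratio` of the successful traces are kept.
/// `error_rate` is the fraction of requests that fail (an assumed input).
pub fn kept_spans_per_sec(
    req_per_sec: f64,
    spans_per_trace: f64,
    error_rate: f64,
    success_ratio: f64,
) -> f64 {
    let total_spans = req_per_sec * spans_per_trace;
    // All spans that belong to failed requests are kept.
    let error_spans = total_spans * error_rate;
    // Only a fraction of the spans from successful requests are kept.
    let success_spans = total_spans * (1.0 - error_rate) * success_ratio;
    error_spans + success_spans
}

/// Converts a per-second span rate into spans per day.
pub fn spans_per_day(spans_per_sec: f64) -> f64 {
    spans_per_sec * 86_400.0
}
```

With 10,000 req/s, 5 spans/trace, an assumed 0.25% error rate, and 1% of successes kept, `kept_spans_per_sec(10_000.0, 5.0, 0.0025, 0.01)` gives 623.75 spans/s, about 1.25% of the unsampled volume. Multiply by your own measured bytes per span to estimate storage.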

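Sampling does not affect metrics recorded through the metrics API (counters increment on every request, whatever the sampling decision); it only reduces trace storage. Metrics derived from spans after sampling are the exception: they see only the sampled traffic.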

Head-Based Sampling

Decision made when the root span starts, before the outcome is known, and propagated to all downstream services through the sampled flag in the trace context.

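use opentelemetry_sdk::trace::{Config, Sampler};

// Sampler is an enum in opentelemetry_sdk, so the ratio sampler is wrapped
// in the ParentBased variant rather than built with ::new().
// Root spans are sampled at 1%; child spans follow the parent's decision.
let sampler = Sampler::ParentBased(Box::new(Sampler::TraceIdRatioBased(0.01)));

// Config::with_sampler is the older API; depending on the SDK version, the
// sampler may need to be set on the tracer provider builder instead.
let config = Config::default().with_sampler(sampler);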

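ParentBased: if a parent span exists, follow its decision (sampled parent, sampled child; unsampled parent, unsampled child). Root spans fall through to the wrapped sampler. This keeps the decision consistent across services.
TraceIdRatioBased: deterministic, so the same trace ID always produces the same decision.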
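
The two rules can be sketched without the SDK. The code below is an illustration only, not the SDK's actual implementation: `ParentContext`, `ratio_decision`, and `parent_based_decision` are hypothetical names, and the comparison of the trace ID's low bits against a threshold is one possible deterministic scheme.

```rust
/// Hypothetical stand-in for a propagated parent span context;
/// only the sampled flag matters for this sketch.
#[derive(Clone, Copy)]
pub struct ParentContext {
    pub sampled: bool,
}

/// Ratio decision that depends only on the trace ID, so every service
/// that sees the same trace ID computes the same answer.
pub fn ratio_decision(trace_id: u128, ratio: f64) -> bool {
    let ratio = ratio.clamp(0.0, 1.0);
    // Take the low 64 bits of the trace ID and drop one bit,
    // leaving a value in the range [0, 2^63).
    let value = (trace_id as u64) >> 1;
    // Scale the ratio into the same range to get the cut-off.
    let upper_bound = (ratio * (1u64 << 63) as f64) as u64;
    value < upper_bound
}

/// Parent-based rule: follow the parent's decision when there is one,
/// otherwise (root span) delegate to the ratio decision.
pub fn parent_based_decision(
    parent: Option<ParentContext>,
    trace_id: u128,
    ratio: f64,
) -> bool {
    match parent {
        Some(p) => p.sampled,
        None => ratio_decision(trace_id, ratio),
    }
}
```

Because the child never re-rolls the decision, a trace is either recorded in every service it touches or in none of them, which avoids partial traces.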

Tail-Based Sampling (OTel Collector)

Decision made after the trace's spans have been buffered for decision_wait, so it can depend on the outcome. A trace is kept if any policy below matches:

# otel-collector.yaml
processors:
  tail_sampling:
    decision_wait: 10s          # buffer time to collect full trace
    num_traces: 100000          # max buffered traces
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 500 }
      - name: probabilistic
        type: probabilistic
        probabilistic: { sampling_percentage: 1 }
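
The effect of the three policies can be sketched as a single keep-or-drop function over a buffered trace. This is a simplified illustration, not the Collector's code: `SpanSummary`, `trace_duration_ms`, and `keep_trace` are hypothetical names, and the modulo check is a simple stand-in for the probabilistic policy's real selection logic.

```rust
/// Minimal view of a buffered span (hypothetical type for this sketch).
pub struct SpanSummary {
    pub is_error: bool,
    pub start_ms: u64,
    pub end_ms: u64,
}

/// Mirrors threshold_ms in the config above.
const LATENCY_THRESHOLD_MS: u64 = 500;

/// Trace duration: earliest span start to latest span end.
pub fn trace_duration_ms(spans: &[SpanSummary]) -> u64 {
    let start = spans.iter().map(|s| s.start_ms).min();
    let end = spans.iter().map(|s| s.end_ms).max();
    match (start, end) {
        (Some(s), Some(e)) => e.saturating_sub(s),
        _ => 0, // empty trace
    }
}

/// Keep the trace if any policy matches: error, slow, or the 1% sample.
pub fn keep_trace(trace_id: u128, spans: &[SpanSummary]) -> bool {
    let has_error = spans.iter().any(|s| s.is_error);
    let is_slow = trace_duration_ms(spans) > LATENCY_THRESHOLD_MS;
    // Stand-in for the probabilistic policy: roughly 1 in 100 trace IDs.
    let in_one_percent = trace_id % 100 == 0;
    has_error || is_slow || in_one_percent
}
```

The function needs every span of the trace, which is why the collector has to hold traces in memory for decision_wait before it can decide.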

Key Takeaways

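  • Head-based sampling is simple and cheap, but the decision is made before the outcome is known, so it can't filter on errors or latency
  • Tail-based sampling buffers the full trace before deciding; it keeps errors and slow traces as long as their spans arrive within decision_wait and the buffer (num_traces) has room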
  • ParentBased ensures consistent sampling across the whole call graph
  • Always sample errors at 100%; use ratio sampling (1–10%) for healthy traffic

Tomorrow: context propagation — W3C TraceContext and Baggage across service boundaries.