← Week 3: Service Mesh & mTLS

Day 20: Observability in Service Mesh

Phase 3 · Jul 20, 2026

Agenda (2–3 hours)

  • Read (45 min): Istio observability documentation; Kiali documentation (service graph visualization)
  • Study (45 min): What does Envoy report per-request that a service can't see without instrumentation? What does Envoy NOT see?
  • Practice (45 min): Enable Prometheus + Jaeger + Kiali addons in Istio; observe the service graph while sending traffic; find a latency anomaly
  • Challenge (30 min): What is the difference between "black-box" observability (Envoy) and "white-box" observability (application-level traces)? When do you need both?

What Envoy Reports Automatically

Every sidecar automatically emits:

Metrics (Prometheus format):

  • Request counts, response codes, and latency histograms per source-destination pair (success rate and latency percentiles are derived from these in Prometheus)
  • Connection pool sizes, upstream health, circuit breaker state

Access logs (plain text by default, JSON if configured; must be enabled in the mesh config):

  • Per-request: method, path, response code, latency, bytes in/out, upstream cluster, trace IDs

Traces (Zipkin/Jaeger B3 format):

  • Span for each hop through the mesh
  • Envoy generates and attaches B3 headers at each hop, but services must copy the headers they receive onto their outbound calls
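
As an illustration of the metrics above, here is a minimal sketch of how a per-pair success rate can be derived from request counters. It assumes samples shaped like Istio's standard `istio_requests_total` metric (labels `source_workload`, `destination_workload`, `response_code`). The `RequestSample` struct, the `success_rates` function, and the "non-5xx counts as success" rule are assumptions made for this example, not part of Istio.

```rust
use std::collections::BTreeMap;

/// One counter sample, reduced to the labels used here. The field names
/// mirror labels on Istio's `istio_requests_total` metric; the struct
/// itself is hypothetical.
#[derive(Debug, Clone)]
pub struct RequestSample {
    pub source_workload: String,
    pub destination_workload: String,
    pub response_code: u16,
    pub count: u64,
}

/// Returns the success rate (share of non-5xx responses, 0.0..=1.0)
/// for each (source, destination) pair.
pub fn success_rates(samples: &[RequestSample]) -> BTreeMap<(String, String), f64> {
    // (successful, total) request counts per pair.
    let mut totals: BTreeMap<(String, String), (u64, u64)> = BTreeMap::new();
    for s in samples {
        let key = (s.source_workload.clone(), s.destination_workload.clone());
        let entry = totals.entry(key).or_insert((0, 0));
        if s.response_code < 500 {
            entry.0 += s.count;
        }
        entry.1 += s.count;
    }
    totals
        .into_iter()
        // Pairs with no traffic have no meaningful rate.
        .filter(|(_, (_, total))| *total > 0)
        .map(|(pair, (ok, total))| (pair, ok as f64 / total as f64))
        .collect()
}
```

In a real mesh this aggregation would normally be a Prometheus query rather than application code; the sketch only shows what "success rate per source-destination pair" means in terms of the raw counters.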

Distributed Tracing in Istio

Envoy creates a trace span for each request. For a trace to stay connected across hops, the trace ID must carry over unchanged while each hop gets a new span ID:

Service A receives:           Service A forwards to B:
  x-b3-traceid: abc123          x-b3-traceid: abc123  ← same!
  x-b3-spanid: 111              x-b3-spanid: 222      ← new child span
  x-b3-parentspanid: (none)     x-b3-parentspanid: 111

Services must propagate B3 headers (or W3C TraceContext headers). If a service calls B without forwarding them, its sidecar cannot link the outbound call to the inbound request, so the call starts a new trace and B's spans appear orphaned.

In Rust with tonic/Axum: use OpenTelemetry's TraceContextPropagator to extract/inject W3C TraceContext headers. B3 headers need a separate B3 propagator.
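
Without any tracing library, the forwarding rule can be sketched as a plain header copy. This is an illustrative, std-only sketch rather than the OpenTelemetry approach: `TRACE_HEADERS` and `trace_headers_to_forward` are hypothetical names, and headers are modelled as a simple `HashMap` instead of a real HTTP type. The header list follows the set Istio's tracing documentation asks applications to forward. The application copies values unchanged; the sidecar is then expected to mint the child span ID shown in the diagram above.

```rust
use std::collections::HashMap;

/// Trace-related headers to carry from an inbound request to outbound
/// calls: Envoy's request ID, the B3 multi-header set, the single-header
/// `b3` form, and the W3C Trace Context pair.
pub const TRACE_HEADERS: [&str; 9] = [
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "b3",
    "traceparent",
    "tracestate",
];

/// Copies the trace headers found on an inbound request into a map that
/// can be attached to an outbound call. Names are matched
/// case-insensitively and emitted in lowercase; values are unchanged.
pub fn trace_headers_to_forward(inbound: &HashMap<String, String>) -> HashMap<String, String> {
    inbound
        .iter()
        .map(|(name, value)| (name.to_ascii_lowercase(), value.clone()))
        .filter(|(name, _)| TRACE_HEADERS.contains(&name.as_str()))
        .collect()
}
```

In an Axum or tonic service the same copy would typically live in a middleware layer so that individual handlers cannot forget it.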

Kiali: Service Graph

Kiali provides a real-time service dependency graph built from Envoy metrics:

[frontend] → [cart] → [inventory]
          ↘ [payment] → [fraud-check]

Shows:

  • Error rates and latency per edge (red = degraded)
  • Circuit breaker state per node
  • Traffic distribution across versions (for canaries)
  • Missing mTLS (yellow edges)

Invaluable for debugging cascading failures: "why is frontend slow?" → follow the red edges.
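
"Follow the red edges" can be sketched as a small graph walk. This is an illustration of the debugging habit, not how Kiali is implemented: the `Edge` struct, the `follow_degraded_edges` function, and the error-rate threshold are all assumptions made for the example.

```rust
use std::collections::HashMap;

/// A directed edge in the service graph with its observed error rate
/// (0.0..=1.0). Hypothetical stand-in for what the graph view renders.
#[derive(Debug, Clone)]
pub struct Edge {
    pub from: String,
    pub to: String,
    pub error_rate: f64,
}

/// Starting at `start`, repeatedly follows the worst outgoing edge whose
/// error rate exceeds `threshold` and returns the services visited.
/// The last element is the deepest degraded dependency found.
pub fn follow_degraded_edges(edges: &[Edge], start: &str, threshold: f64) -> Vec<String> {
    // Index outgoing edges by source service.
    let mut outgoing: HashMap<&str, Vec<&Edge>> = HashMap::new();
    for e in edges {
        outgoing.entry(e.from.as_str()).or_default().push(e);
    }

    let mut path = vec![start.to_string()];
    let mut current = start;
    loop {
        let worst = outgoing
            .get(current)
            .into_iter()
            .flatten()
            .filter(|e| e.error_rate > threshold)
            // Skip services already on the path so cycles terminate.
            .filter(|e| !path.contains(&e.to))
            .max_by(|a, b| a.error_rate.total_cmp(&b.error_rate));
        match worst {
            Some(e) => {
                path.push(e.to.clone());
                current = e.to.as_str();
            }
            None => return path,
        }
    }
}
```

Applied to the example graph above with made-up error rates (say 12% on frontend → payment and 30% on payment → fraud-check, with the cart branch healthy), the walk starting at frontend ends at fraud-check, which is where to look first.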

Black-Box vs White-Box Observability

Black-box (Envoy/mesh): network-level visibility

  • Request counts, latency, errors per service boundary
  • No visibility inside a service: can't see which SQL query is slow

White-box (application): code-level visibility

  • Which handler, which query, which lock
  • Business-level context (user ID, order ID, request type)

Both are needed:

  • Black-box for: "service A calls service B with 5% errors"
  • White-box for: "the /api/checkout handler is slow because of a missing DB index"
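
To make the white-box side concrete, here is a minimal hand-rolled sketch of application-level timing. It is a stand-in for a real tracing library, not a recommendation to write one: `HandlerTrace`, its methods, and the `order_id` field are hypothetical names chosen for the example.

```rust
use std::time::{Duration, Instant};

/// Named steps inside one handler invocation with their durations, plus
/// business context that the mesh cannot see.
#[derive(Debug)]
pub struct HandlerTrace {
    pub order_id: String,
    pub steps: Vec<(String, Duration)>,
}

impl HandlerTrace {
    pub fn new(order_id: &str) -> Self {
        Self { order_id: order_id.to_string(), steps: Vec::new() }
    }

    /// Runs `work`, records how long it took under `name`, and returns
    /// its result.
    pub fn step<T>(&mut self, name: &str, work: impl FnOnce() -> T) -> T {
        let started = Instant::now();
        let result = work();
        self.steps.push((name.to_string(), started.elapsed()));
        result
    }

    /// The slowest recorded step, if any.
    pub fn slowest(&self) -> Option<&(String, Duration)> {
        self.steps.iter().max_by_key(|(_, elapsed)| *elapsed)
    }
}
```

A checkout handler could wrap its database call in `trace.step("load_cart", || ...)` and its pricing logic in another step. The sidecar would still report only the total request latency, while `slowest()` points at the step responsible.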

Key Takeaways

  • Envoy reports per-request metrics, access logs, and trace spans automatically
  • Services must propagate trace headers for cross-service traces to be complete
  • Kiali provides a real-time visual service graph built from Envoy metrics
  • Black-box (mesh) + white-box (application) observability together give complete visibility

Tomorrow: Phase 3 Challenge — Istio VirtualService with mTLS and traffic shifting.