← Week 1: Circuit Breakers & Bulkheads

Day 5: Timeout Hierarchy

Phase 4 · Jul 26, 2026

Agenda (2–3 hours)

  • Read (45 min): Google SRE Book Chapter 22 ("Addressing Cascading Failures"); Amazon Builders' Library article "Timeouts, retries, and backoff with jitter"
  • Study (45 min): For a three-tier service call (frontend → backend → DB), design timeout values so that each outer tier's timeout is larger than the inner tier's
  • Practice (45 min): Implement deadline propagation in a mock three-tier service using tokio::time::timeout at each level; verify the outer timeout fires correctly when the inner service is slow
  • Challenge (30 min): What is a "deadline" vs a "timeout"? Why does gRPC use deadlines instead of timeouts?

Types of Timeouts

Connection timeout: time to establish a TCP connection

  • Should be short (1–5s): if you can't connect in 5s, the host is likely down
  • Default in many HTTP clients: 30s or infinity (dangerous)

Request timeout: time from first byte sent to last byte received

  • Depends on the operation: read (100ms), write (1s), report generation (30s)

Socket idle timeout: close connection after N seconds of inactivity

  • Prevents leaked connections from holding resources

Overall deadline: maximum wall-clock time from the perspective of the end user

  • Propagate down the call chain; each inner service uses the remaining time
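
The connection and socket timeouts above can be set explicitly with nothing but the standard library. The following is a minimal sketch, not a recommended client: SocketTimeouts and connect_with_timeouts are hypothetical names, and the default values are illustrative picks from the ranges in the list. Note that the read and write timeouts bound each individual blocking call, so they do not replace a request timeout or an overall deadline.

```rust
use std::io;
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Hypothetical bundle of socket-level timeouts; the values are illustrative.
#[derive(Debug, Clone, Copy)]
pub struct SocketTimeouts {
    /// Upper bound on establishing the TCP connection.
    pub connect: Duration,
    /// Upper bound on any single blocking read.
    pub read: Duration,
    /// Upper bound on any single blocking write.
    pub write: Duration,
}

impl Default for SocketTimeouts {
    fn default() -> Self {
        Self {
            connect: Duration::from_secs(2),
            read: Duration::from_millis(100),
            write: Duration::from_secs(1),
        }
    }
}

/// Opens a TCP connection with every timeout set explicitly.
pub fn connect_with_timeouts(addr: SocketAddr, t: SocketTimeouts) -> io::Result<TcpStream> {
    // Returns an error instead of hanging if the host does not answer in time.
    let stream = TcpStream::connect_timeout(&addr, t.connect)?;
    // Passing `None` would mean "block forever", so always pass `Some(..)`.
    stream.set_read_timeout(Some(t.read))?;
    stream.set_write_timeout(Some(t.write))?;
    Ok(stream)
}
```

Calling connect_with_timeouts(addr, SocketTimeouts::default()) returns a stream whose reads and writes fail with an I/O error rather than blocking indefinitely.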

Timeout Hierarchy Example

User → Frontend (timeout: 10s)
  → Backend (timeout: 8s)  ← frontend gives backend 8s, keeps 2s buffer
    → DB (timeout: 5s)     ← backend gives DB 5s, keeps 3s buffer
      → Storage (timeout: 3s) ← DB gives storage 3s

Each layer leaves a buffer for its own processing overhead. If the DB call takes 6s (over its 5s budget), the backend's timeout fires at 5s, the backend returns an error right away, and the frontend sees the request finish in roughly 5s, still well within the 8s it gave the backend.

Without this hierarchy (for example, with the same timeout at every layer), a slow DB makes every layer time out at the same moment, so no caller has time left to return a clean error or a fallback.
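
The hierarchy rule can be checked mechanically. The sketch below uses the four values from the example; Layer, example_chain, and buffers are hypothetical names introduced for illustration. It verifies that each inner timeout is strictly shorter than its caller's and reports the buffer each caller keeps.

```rust
use std::time::Duration;

/// One layer in the call chain, listed from outermost to innermost.
#[derive(Debug, Clone, Copy)]
pub struct Layer {
    pub name: &'static str,
    pub timeout: Duration,
}

/// The example hierarchy from the text.
pub fn example_chain() -> Vec<Layer> {
    vec![
        Layer { name: "frontend", timeout: Duration::from_secs(10) },
        Layer { name: "backend", timeout: Duration::from_secs(8) },
        Layer { name: "db", timeout: Duration::from_secs(5) },
        Layer { name: "storage", timeout: Duration::from_secs(3) },
    ]
}

/// Returns the buffer each caller keeps for itself, or an error if some
/// inner timeout is not strictly shorter than its caller's.
pub fn buffers(chain: &[Layer]) -> Result<Vec<(&'static str, Duration)>, String> {
    let mut out = Vec::new();
    for pair in chain.windows(2) {
        let (outer, inner) = (pair[0], pair[1]);
        match outer.timeout.checked_sub(inner.timeout) {
            // A positive difference is the caller's own processing buffer.
            Some(buffer) if !buffer.is_zero() => out.push((outer.name, buffer)),
            // Equal or larger inner timeout: the caller cannot react in time.
            _ => {
                return Err(format!(
                    "{} ({:?}) must be shorter than {} ({:?})",
                    inner.name, inner.timeout, outer.name, outer.timeout
                ))
            }
        }
    }
    Ok(out)
}
```

For the example chain this yields buffers of 2s for the frontend, 3s for the backend, and 2s for the DB; a chain with equal timeouts at two adjacent layers is rejected.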

Deadline vs Timeout

Timeout: a duration measured from the start of this operation

timeout(Duration::from_secs(5), operation()).await?;

Deadline: absolute point in time when the operation must complete

// timeout_at takes a tokio::time::Instant, not a std::time::Instant
let deadline = tokio::time::Instant::now() + Duration::from_secs(5);
timeout_at(deadline, operation()).await?;

gRPC uses deadlines because they propagate naturally: pass the same deadline to downstream calls. The remaining time shrinks automatically as the call proceeds through the chain.

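Timeouts don't propagate: a fresh 5s timeout at each hop ignores the time already spent upstream, so downstream work can keep running after the original caller has given up. With N sequential calls, the worst-case total is 5 × N seconds, not 5.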
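
The difference can be shown with plain arithmetic on std::time types. This is a sketch under stated assumptions: Deadline and worst_case_per_hop are hypothetical helpers, and the current time is passed in explicitly so the behavior is easy to reason about.

```rust
use std::time::{Duration, Instant};

/// Worst-case total when each of `hops` sequential calls starts a fresh timeout.
pub fn worst_case_per_hop(per_hop: Duration, hops: u32) -> Duration {
    per_hop * hops
}

/// An absolute deadline that every hop in the chain shares.
#[derive(Debug, Clone, Copy)]
pub struct Deadline(Instant);

impl Deadline {
    /// Fixes the deadline once, at the entry point.
    pub fn after(start: Instant, budget: Duration) -> Self {
        Deadline(start + budget)
    }

    /// Time left at `now`, or `None` once the deadline has been reached.
    pub fn remaining(&self, now: Instant) -> Option<Duration> {
        self.0
            .checked_duration_since(now)
            .filter(|left| !left.is_zero())
    }
}
```

With a 5s budget, a hop that starts 2s after the entry point sees 3s remaining, and a hop that starts at or after the 5s mark sees None and can fail immediately. Three sequential hops with independent 5s timeouts, by contrast, can take up to 15s.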

Implementing Deadline Propagation in Rust

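use std::sync::Arc;
use std::time::Instant;

// Request, Response, AppError, and DownstreamService are application types
// defined elsewhere; call() is assumed to return Result<Response, AppError>.
async fn handle_request(
    deadline: Instant,
    downstream: Arc<dyn DownstreamService>,
) -> Result<Response, AppError> {
    // Fail fast if the caller's deadline has already passed.
    let remaining = deadline.saturating_duration_since(Instant::now());
    if remaining.is_zero() {
        return Err(AppError::DeadlineExceeded);
    }
    // tokio has its own Instant type, so convert the std deadline.
    // To propagate further, the deadline must also travel with the request.
    tokio::time::timeout_at(
        tokio::time::Instant::from_std(deadline),
        downstream.call(Request::new()),
    )
    .await
    // The outer Err means the timer fired; `?` unwraps it and leaves
    // the downstream's own Result as the return value.
    .map_err(|_| AppError::DeadlineExceeded)?
}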
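
The handler above enforces the deadline locally. To cross a process boundary, the remaining time has to travel with the outgoing request, because an Instant is only meaningful on the machine that created it. gRPC does this with the grpc-timeout header: a value of at most 8 ASCII digits followed by a unit (H, M, S, m, u, n). The sketch below shows one way to encode and decode that value; the function names are hypothetical, and a real service would normally let its gRPC library handle this.

```rust
use std::time::{Duration, Instant};

/// Largest value the header can carry (at most 8 digits).
const MAX_TIMEOUT_VALUE: u128 = 99_999_999;

/// Encodes the remaining time in milliseconds, e.g. "5000m".
/// Sub-millisecond remainders are truncated, which errs on the strict side.
pub fn encode_grpc_timeout(remaining: Duration) -> String {
    let ms = remaining.as_millis().min(MAX_TIMEOUT_VALUE);
    format!("{}m", ms)
}

/// Parses a grpc-timeout value back into a duration.
pub fn decode_grpc_timeout(value: &str) -> Option<Duration> {
    // 1 to 8 digits plus one unit character.
    if value.len() < 2 || value.len() > 9 || !value.is_ascii() {
        return None;
    }
    let (digits, unit) = value.split_at(value.len() - 1);
    if !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    let n: u64 = digits.parse().ok()?;
    match unit {
        "H" => Some(Duration::from_secs(n * 3600)),
        "M" => Some(Duration::from_secs(n * 60)),
        "S" => Some(Duration::from_secs(n)),
        "m" => Some(Duration::from_millis(n)),
        "u" => Some(Duration::from_micros(n)),
        "n" => Some(Duration::from_nanos(n)),
        _ => None,
    }
}

/// Re-anchors the received timeout to the local clock of the receiving hop.
pub fn local_deadline(now: Instant, header: &str) -> Option<Instant> {
    decode_grpc_timeout(header).map(|left| now + left)
}
```

A hop with 5s remaining would send "5000m"; the receiver turns that back into a local deadline and passes it to its own handler, so the budget keeps shrinking along the chain.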

Key Takeaways

  • Design timeouts as a hierarchy: each layer's budget is smaller than the caller's
  • Use deadlines (absolute time) for propagation; use a timeout (duration) only at the entry point, where it is converted into a deadline
  • Always set connection timeouts explicitly — default infinity is a resource leak waiting to happen
  • gRPC carries the deadline on the wire in the grpc-timeout header; in tonic, set it on an outgoing call with Request::set_timeout

Tomorrow: chaos engineering — deliberately injecting failures to test resilience.