← Week 2: Distributed Transactions

Day 10: Sagas — Implementation and Failure Handling

Phase 4 · Jul 31, 2026

← Week 2: Distributed Transactions

Agenda (2–3 hours)

  • Read (45 min): Caitie McCaffrey "Sagas" talk (YouTube); eventuate.io saga orchestration documentation
  • Study (45 min): Design the state machine for a saga execution — what states does each saga step have?
  • Practice (45 min): Implement a simple saga orchestrator in Rust: execute steps sequentially, run compensations on failure, persist saga state to a HashMap (simulate DB)
  • Challenge (30 min): What happens if the saga orchestrator itself crashes mid-saga? How do you make the orchestrator crash-safe?
← Week 2: Distributed Transactions

Saga Step State Machine

Each saga step:

[PENDING] → [RUNNING] → [COMPLETED]
                    ↘ [FAILED] → [COMPENSATING] → [COMPENSATED]
                              ↘ [COMPENSATION_FAILED] (alarm: manual intervention)

Saga overall:

[STARTED] → [COMPLETED]      (all steps succeeded)
          → [COMPENSATED]    (failure, all compensations succeeded)
          → [FAILED]         (compensation also failed: manual intervention)
← Week 2: Distributed Transactions

Saga Orchestrator in Rust

struct SagaOrchestrator<S: SagaStore> {
    steps: Vec<Box<dyn SagaStep>>,
    store: S,
}

#[async_trait]
trait SagaStep: Send + Sync {
    async fn execute(&self, ctx: &SagaContext) -> Result<StepOutput, StepError>;
    async fn compensate(&self, ctx: &SagaContext) -> Result<(), StepError>;
    fn name(&self) -> &str;
}

impl<S: SagaStore> SagaOrchestrator<S> {
    async fn run(&self, saga_id: Uuid, input: Value) -> Result<SagaOutput, SagaError> {
        let mut ctx = SagaContext::new(saga_id, input);
        for (i, step) in self.steps.iter().enumerate() {
            self.store.record_step_start(saga_id, step.name()).await?;
            match step.execute(&ctx).await {
                Ok(output) => { ctx.add_output(step.name(), output); }
                Err(e) => { return self.compensate_from(saga_id, i, &ctx).await; }
            }
        }
        Ok(ctx.into_output())
    }
}
← Week 2: Distributed Transactions

Crash Safety: Saga Log

The saga orchestrator must be crash-safe. Each state transition writes to a log before taking effect:

saga_log table:
  saga_id     UUID
  step_name   TEXT
  state       ENUM (started, completed, failed, compensating, compensated)
  output      JSONB
  created_at  TIMESTAMPTZ

On orchestrator restart: read the saga log, find any in-progress sagas, resume from the last recorded state.

This is the same pattern as write-ahead logging (WAL) in databases.

← Week 2: Distributed Transactions

Idempotency of Saga Steps

Steps will be retried on failure. They must be idempotent:

async fn reserve_inventory(&self, ctx: &SagaContext) -> Result<StepOutput, StepError> {
    let reservation_id = ctx.saga_id.to_string(); // deterministic key

    // Check if already reserved (idempotency check)
    if let Some(existing) = inventory_db.find_reservation(&reservation_id).await? {
        return Ok(StepOutput::from(existing));
    }

    // Create reservation
    let reservation = inventory_db
        .create_reservation(&reservation_id, ctx.input.item_id)
        .await?;
    Ok(StepOutput::from(reservation))
}
← Week 2: Distributed Transactions

Key Takeaways

  • Saga orchestrators must persist state before each transition (write-ahead log pattern)
  • On restart, resume in-progress sagas from the last recorded state
  • Every step must be idempotent — retries on failure will call it again
  • COMPENSATION_FAILED requires manual intervention — alert on it immediately

Tomorrow: ACID vs BASE, isolation levels, and MVCC.