← Week 2: Consensus Algorithms

Day 10: Raft Deep Dive — Snapshots and Membership Changes

Phase 1 · May 29, 2026

← Week 2: Consensus Algorithms

Agenda (2–3 hours)

  • Read (45 min): Ongaro & Ousterhout §6–7 (snapshots, cluster membership); Ongaro's PhD thesis Chapter 4 on reconfiguration
  • Study (45 min): Trace the snapshot installation flow; work through the single-server joint consensus transition
  • Practice (45 min): Draw the state machine for a node catching up from snapshot + subsequent log entries
  • Challenge (30 min): Why is removing a leader during a membership change dangerous? How does etcd handle it?
← Week 2: Consensus Algorithms

Log Compaction: The Problem

The Raft log grows without bound. After enough entries, a new or recovering node would take forever to replay the entire log.

Solution: snapshotting. Each node independently:

  1. Serializes its state machine to a snapshot at some log index
  2. Discards all log entries up to that index
  3. Retains the snapshot's last-included index and term

This is independent per node — no coordination required for snapshotting.

← Week 2: Consensus Algorithms

Installing Snapshots on Lagging Followers

If a follower is so far behind that the leader no longer has the log entries it needs:

  1. Leader sends InstallSnapshot RPC with snapshot data
  2. Follower replaces its state machine with the snapshot
  3. Follower resumes log replication from the snapshot's last index

The snapshot is chunked for large state machines. The follower must atomically install the snapshot to avoid partial states.

← Week 2: Consensus Algorithms

Cluster Membership Changes

Adding or removing nodes is dangerous: a naive single-step change can create two disjoint majorities.

Example: 3-node cluster → 5-node cluster. If the change isn't atomic:

  • Old quorum (2/3) and new quorum (3/5) might overlap in only 1 node
  • Two separate leaders could be elected simultaneously

Raft's solution: joint consensus (or single-server changes):

Single-server change (simpler):

  • Add or remove exactly one server at a time
  • The new configuration's quorum always overlaps with the old configuration's quorum
  • Commit the config change log entry before switching to the new config
← Week 2: Consensus Algorithms

Read Linearizability

A naive Raft read (just returning state machine value) is unsafe: a newly-elected leader's state machine might be stale before it commits a no-op entry.

Linearizable reads require:

  1. Leader must have committed a no-op entry in the current term (confirms it's the real leader)
  2. Leader checks it hasn't been superseded before responding (ReadIndex or lease-based)

etcd uses ReadIndex: leader records current commit index, waits for heartbeat confirming leadership, then returns the read once state machine catches up to that index.

← Week 2: Consensus Algorithms

Key Takeaways

  • Snapshots allow log compaction without coordination — each node snapshots independently
  • InstallSnapshot RPC handles nodes that have fallen too far behind for log catch-up
  • Membership changes must be done one server at a time to avoid split-brain
  • Linearizable reads require extra steps beyond just reading the state machine
  • Real Raft implementations (etcd, TiKV) add leader leases, pre-vote, and read indices on top of the base protocol

Tomorrow: Byzantine fault tolerance — what if nodes lie?