← Week 2: Consensus Algorithms

Day 13: Consensus in Production

Phase 1 · Jun 1, 2026

← Week 2: Consensus Algorithms

Agenda (2–3 hours)

  • Read (45 min): Corbett et al. "Spanner: Google's Globally Distributed Database" (OSDI 2012) §1–3; CockroachDB architecture blog post
  • Study (45 min): How Spanner uses TrueTime; how CockroachDB maps SQL to Raft groups
  • Practice (45 min): Set up a local CockroachDB 3-node cluster; observe range leadership and rebalancing
  • Challenge (30 min): What makes Spanner's external consistency stronger than CockroachDB's? What does it cost?
← Week 2: Consensus Algorithms

Consensus as a Building Block

Standalone consensus (Paxos/Raft) is a primitive — production databases build entire systems on top.

The key abstraction: replicated state machine

  • Apply consensus to a sequence of commands
  • Every replica executes commands in the same order
  • The result: any node can serve reads from a consistent state
← Week 2: Consensus Algorithms

CockroachDB

  • SQL database with serializable isolation and horizontal sharding
  • Data is divided into ranges (default 512MB), each replicated as an independent Raft group
  • A range has one Raft leader (leaseholder); reads and writes go to the leaseholder
  • Range splits and merges rebalance data across nodes

Key challenge: distributed transactions across ranges require a cross-range transaction protocol (similar to 2PC, but using Raft for coordinator durability).

← Week 2: Consensus Algorithms

Google Spanner

  • Globally distributed; external consistency (linearizability across datacenters)
  • Uses TrueTime API: each TT.now() returns an interval [earliest, latest]
    • Atomic clocks + GPS receivers in each datacenter
    • Max interval width ≈ 7ms
  • Commit wait: transactions delay commit by the TrueTime uncertainty interval, ensuring that by the time the client sees the commit, real time has advanced past the transaction's timestamp
  • This gives Spanner external consistency without requiring global clock synchronization
← Week 2: Consensus Algorithms

TiKV (PingCAP)

  • Distributed key-value store used by TiDB
  • Each region (shard) runs an independent Raft group
  • Multi-Raft: thousands of Raft groups per cluster; the routing layer (PD) manages region assignment
  • Uses Percolator (Google 2010) for cross-region distributed transactions: optimistic 2PC with Raft-backed lock service
← Week 2: Consensus Algorithms

Key Observations from Production

  1. Raft groups per shard — one Raft group per shard enables horizontal scaling
  2. Leaseholder optimization — reads go directly to the leaseholder, avoiding full quorum round-trips
  3. Clock accuracy matters — Spanner's external consistency relies on bounded clock uncertainty; most systems use optimistic locking instead
  4. Reconfiguration is complex — adding/removing nodes to a Raft group must be done carefully to avoid availability gaps
← Week 2: Consensus Algorithms

Key Takeaways

  • Production databases layer SQL, transactions, and sharding on top of Raft
  • CockroachDB: multi-Raft with range-based sharding; serializable SQL
  • Spanner: TrueTime-based external consistency; the gold standard for geo-distributed ACID
  • TiKV: Multi-Raft + Percolator transactions; powers TiDB
  • Consensus is a solved problem; the hard part is everything built on top of it

Tomorrow: Challenge — implement Raft leader election in Rust.