← Week 3: Distributed Data Structures

Day 17: Merkle Trees

Phase 1 · Jun 5, 2026

← Week 3: Distributed Data Structures

Agenda (2–3 hours)

  • Read (45 min): Merkle's original patent; Dynamo paper §4.7 (anti-entropy); Bitcoin developer guide on the Merkle tree in block headers
  • Study (45 min): Trace a two-replica anti-entropy sync using Merkle tree comparison
  • Practice (45 min): Implement a Merkle tree over a set of key-value pairs in Rust (build bottom-up, expose root hash and proof generation)
  • Challenge (30 min): How does Git's object graph use a Merkle DAG? How does IPFS extend this to a content-addressed distributed filesystem?
← Week 3: Distributed Data Structures

What Is a Merkle Tree?

A hash tree where:

  • Leaves = hash(data chunk)
  • Internal nodes = hash(left_child || right_child)
  • Root = single hash representing all data

Property: any change to any data chunk changes the root hash — and the root hash is the only thing you need to compare to detect divergence.

← Week 3: Distributed Data Structures

Anti-Entropy with Merkle Trees

Amazon Dynamo's synchronization between replicas:

  1. Each replica maintains a Merkle tree over its keyspace
  2. To sync replica A and B: compare root hashes
  3. If equal → in sync (no I/O needed)
  4. If different → recurse into left and right subtrees
  5. Find the differing leaf → sync only those keys

Efficiency: O(log n) messages to identify differences in n key-value pairs, rather than comparing all keys.

Used by: Dynamo, Cassandra (repair mechanism), Riak

← Week 3: Distributed Data Structures

Merkle Proofs

A Merkle proof is a compact proof that a value is part of the tree:

  1. Client has root hash R
  2. Server wants to prove that leaf L is in the tree
  3. Server provides: L's hash + hashes of siblings along the path to root
  4. Client recomputes: hash(L, sibling1), hash(result, sibling2), ... verify against R

Proof size: O(log n) hashes.

Applications:

  • Bitcoin/Ethereum SPV (simplified payment verification): verify a transaction is in a block without downloading the whole blockchain
  • Certificate Transparency (RFC 9162): prove a certificate is in the CT log
← Week 3: Distributed Data Structures

Git's Merkle DAG

Git stores all content as a Merkle DAG (directed acyclic graph):

  • Blobs (file contents), Trees (directories), Commits — all addressed by SHA-1 of content
  • Commit hash changes if any file in the tree changes
  • Immutable history: you cannot change a past commit without changing all subsequent commits

IPFS extends this: content-addressed blocks, distributed across nodes — the hash IS the address.

← Week 3: Distributed Data Structures

Key Takeaways

  • Merkle trees provide compact, efficient comparison and verification of large datasets
  • Anti-entropy: find divergent data in O(log n) messages instead of O(n)
  • Merkle proofs enable compact membership proofs for clients with only the root hash
  • Used in Dynamo, Cassandra, Bitcoin, Certificate Transparency, Git, IPFS

Tomorrow: LSM trees — how modern storage engines achieve high write throughput.