← Week 3: Distributed Data Structures

Day 18: LSM Trees

Phase 1 · Jun 6, 2026

← Week 3: Distributed Data Structures

Agenda (2–3 hours)

  • Read (45 min): O'Neil et al. "The Log-Structured Merge-Tree" (1996) §1–3; RocksDB wiki: compaction overview
  • Study (45 min): Trace a write, memtable flush, and L0→L1 compaction; calculate write/space amplification for leveled compaction
  • Practice (45 min): Sketch the read path: when does a read need to check how many SSTables?
  • Challenge (30 min): Compare LSM trees to B-trees for read-heavy vs write-heavy workloads; when would you choose each?
← Week 3: Distributed Data Structures

Why LSM Trees?

B-trees are optimized for reads: O(log n) lookups and updates in-place.
But random writes require seeking to arbitrary disk positions — terrible for HDDs and not ideal for SSDs either.

LSM-tree insight: batch and sort writes before writing to disk sequentially. Sacrifice some read performance for dramatically higher write throughput.

← Week 3: Distributed Data Structures

LSM Write Path

  1. MemTable: writes go to an in-memory sorted structure (Red-Black tree or skip list)
  2. WAL (Write-Ahead Log): concurrent write to WAL for crash recovery
  3. Flush: when MemTable reaches threshold, flush to immutable SSTable (Sorted String Table) on disk (Level 0)
  4. Compaction: background process merges and sorts SSTables across levels

Sequential writes: both WAL and SSTable writes are sequential → very high write throughput.

← Week 3: Distributed Data Structures

Compaction Strategies

Leveled compaction (LevelDB, RocksDB default):

  • L0: 4 SSTables (unsorted)
  • L1: 10MB, keys don't overlap between files
  • L2: 100MB, Ln+1 is 10× larger
  • Merge L0 → L1 on threshold; maintain no-overlap invariant within each level
  • Write amplification: ~10×; Space amplification: ~1.1×; Good for reads

Size-tiered compaction (Cassandra default):

  • Merge N similar-sized SSTables into one larger SSTable
  • Write amplification: ~3–5×; Space amplification: ~2×; Better write throughput

FIFO compaction: keep newest SSTables; drop oldest. Good for time-series data.

← Week 3: Distributed Data Structures

Read Path and Bloom Filters

Reading from an LSM tree is more expensive than writing:

  1. Check MemTable
  2. Check L0 SSTables (may need to check all)
  3. Check L1, L2, ... (only one file per level due to non-overlapping invariant)

Without optimization, this is O(levels) disk reads per miss.

Bloom filters: each SSTable has a Bloom filter. Before reading an SSTable block, check the filter. "Not present" → skip entirely. Reduces average reads to ~1 disk I/O for most workups.

← Week 3: Distributed Data Structures

Key Takeaways

  • LSM trees achieve high write throughput by converting random writes to sequential writes
  • The cost is read amplification (check multiple SSTables) and write amplification (compaction re-writes data)
  • Bloom filters per SSTable dramatically reduce unnecessary disk reads
  • RocksDB (Facebook) is the most widely used LSM engine: used by CockroachDB, TiKV, Cassandra, MongoDB WiredTiger option
  • DynamoDB's storage layer is likely an internal B-tree + log hybrid; Cassandra uses LSM

Tomorrow: skip lists — the probabilistic data structure powering the LSM MemTable.