← Week 3: Distributed Data Structures

Day 15: Bloom Filters

Phase 1 · Jun 3, 2026

← Week 3: Distributed Data Structures

Agenda (2–3 hours)

  • Read (45 min): Bloom "Space/Time Trade-offs in Hash Coding with Allowable Errors" (1970); Broder & Mitzenmacher "Network Applications of Bloom Filters" (2004) §1–3
  • Study (45 min): Derive the optimal hash function count formula; calculate false positive rates for practical sizes
  • Practice (45 min): Implement a Bloom filter in Rust using the bit-vec crate
  • Challenge (30 min): Where does CockroachDB use Bloom filters? How does RocksDB use them to reduce I/O?
← Week 3: Distributed Data Structures

What Is a Bloom Filter?

A space-efficient probabilistic data structure for set membership testing.

  • True: "This element is definitely NOT in the set"
  • True: "This element is PROBABLY in the set" (false positives possible)
  • Never: false negatives

Operations: insert(x), query(x) — no deletion in the basic form.

← Week 3: Distributed Data Structures

Mechanics

A bit array of length m, and k independent hash functions:

Insert(x):

  • Compute h1(x), h2(x), ..., hk(x) mod m
  • Set those k bits to 1

Query(x):

  • Compute the same k indices
  • Return true iff ALL those bits are 1

False positive: x was never inserted, but all k hash positions happen to be set by other elements.

← Week 3: Distributed Data Structures

Sizing a Bloom Filter

Given n expected insertions and desired false positive rate p:

m = -n * ln(p) / (ln(2))^2   bits
k = (m/n) * ln(2)             hash functions

Practical example:

  • n = 1,000,000 URLs, p = 0.01 (1% FPR)
  • m ≈ 9,585,059 bits ≈ 1.14 MB (vs ~20MB for a hash set)
  • k ≈ 7 hash functions
← Week 3: Distributed Data Structures

Applications in Distributed Systems

LSM-tree SSTable lookups (RocksDB, LevelDB, Cassandra):

  • Before reading a block from disk, check a Bloom filter
  • Bloom filter says "not present" → skip disk I/O entirely
  • Reduces read amplification dramatically (from O(levels) to O(1) for missing keys)

Distributed caches (prevent cache penetration):

  • Bloom filter at the cache entry point blocks requests for keys known not to exist
  • Prevents stampeding the DB with requests for absent keys

Web crawlers (Googlebot):

  • "Have we seen this URL before?" — a few MB of Bloom filter vs GB of URL set
← Week 3: Distributed Data Structures

Key Takeaways

  • Bloom filters trade perfect accuracy for dramatically reduced space
  • No false negatives; false positives are bounded and tunable
  • Widely used in LSM storage engines to avoid unnecessary disk reads
  • Counting Bloom filters support deletion; Cuckoo filters offer better space efficiency

Tomorrow: consistent hashing — distributing keys across dynamically changing nodes.