Applications in Distributed Systems
LSM-tree SSTable lookups (RocksDB, LevelDB, Cassandra):
- Before reading a block from disk, check a Bloom filter
- Bloom filter says "not present" → skip disk I/O entirely
- Reduces read amplification dramatically for missing keys: from one disk read per level, O(levels), to close to zero, since only false positives still hit disk
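
The read path above can be sketched roughly as follows. This is a minimal illustration, not how RocksDB, LevelDB, or Cassandra implement it: `BloomFilter`, `SSTable`, and `lookup` are hypothetical names, a dict stands in for the disk, and the choice of 10 bits per key with 7 hash functions is an assumed setting that gives a false-positive rate of about 1%.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: num_bits bits, num_hashes probes via double hashing."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step, never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


class SSTable:
    """Hypothetical stand-in for an on-disk table; a dict plays the disk."""

    def __init__(self, entries: dict):
        self._disk = dict(entries)
        self.filter = BloomFilter(num_bits=max(8, 10 * len(entries)), num_hashes=7)
        for key in entries:
            self.filter.add(key)
        self.disk_reads = 0

    def get(self, key: str):
        if not self.filter.might_contain(key):
            return None            # definitely absent: no disk read
        self.disk_reads += 1       # present, or a false positive
        return self._disk.get(key)


def lookup(levels, key):
    """Search tables newest-first, as an LSM read path would."""
    for table in levels:
        value = table.get(key)
        if value is not None:
            return value
    return None


levels = [SSTable({f"k{lvl}-{i}": i for i in range(1000)}) for lvl in range(5)]
found = lookup(levels, "k3-42")        # present in the fourth table
for i in range(200):
    lookup(levels, f"missing-{i}")     # absent from every table
total_reads = sum(t.disk_reads for t in levels)
```

Without the filters, the 200 missing-key lookups alone would cost 1000 simulated disk reads (5 tables each); with them, `total_reads` should be a small fraction of that, driven only by false positives and the one real hit.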
Distributed caches (prevent cache penetration):
- A Bloom filter of all existing keys sits at the cache entry point and rejects requests for keys that are definitely absent
- Keeps floods of requests for absent keys, which always miss the cache, from reaching the DB
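
A rough sketch of such a guard, under stated assumptions: `TinyBloom` and `GuardedCache` are hypothetical names, plain dicts stand in for the cache and the database, and the filter size (65,536 bits, 5 hashes) is an arbitrary illustrative choice.

```python
import hashlib


class TinyBloom:
    """Compact Bloom filter backed by a Python int used as a bitset."""

    def __init__(self, num_bits=1 << 16, num_hashes=5):
        self.num_bits, self.num_hashes, self.bits = num_bits, num_hashes, 0

    def _positions(self, key):
        # One salted hash per probe; the salt is the probe index.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(f"{i}:{key}".encode("utf-8"),
                                     digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def __contains__(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class GuardedCache:
    """Read-through cache that consults a Bloom filter of existing keys first."""

    def __init__(self, database: dict):
        self.database = database     # hypothetical backing store
        self.cache = {}
        self.known_keys = TinyBloom()
        for key in database:
            self.known_keys.add(key)
        self.db_queries = 0

    def get(self, key):
        if key not in self.known_keys:
            return None              # definitely absent: never reaches the DB
        if key in self.cache:
            return self.cache[key]
        self.db_queries += 1         # cache miss, or a filter false positive
        value = self.database.get(key)
        if value is not None:
            self.cache[key] = value
        return value


store = GuardedCache({f"user:{i}": {"id": i} for i in range(1000)})
for i in range(10_000):
    store.get(f"user:missing-{i}")   # simulated flood of absent keys
hit = store.get("user:7")            # real key: one DB query, then cached
```

In a real deployment the filter would also need to be updated whenever a key is inserted into the database, and a standard Bloom filter cannot remove deleted keys, so it has to be rebuilt periodically or replaced with a variant that supports deletion.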
Web crawlers (Googlebot):
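- "Have we seen this URL before?": a Bloom filter needs roughly 10 bits per URL at a 1% false-positive rate, versus tens of bytes per URL for an exact set; the cost is that a false positive makes the crawler skip a URL it has not actually visited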
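
The memory comparison can be checked with the standard sizing formulas for an optimally configured Bloom filter, m = -n ln p / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The sketch below is illustrative only: `bloom_size` is a hypothetical helper, and the average URL length of 80 bytes is an assumption, not a measured figure.

```python
import math


def bloom_size(num_items: int, false_positive_rate: float):
    """Return (bits, hashes) for an optimally sized Bloom filter."""
    bits = math.ceil(-num_items * math.log(false_positive_rate)
                     / (math.log(2) ** 2))
    hashes = max(1, round(bits / num_items * math.log(2)))
    return bits, hashes


AVG_URL_BYTES = 80  # assumed average URL length, for comparison only

for num_urls in (1_000_000, 100_000_000):
    bits, hashes = bloom_size(num_urls, 0.01)
    filter_mb = bits / 8 / 1e6
    exact_mb = num_urls * AVG_URL_BYTES / 1e6
    print(f"{num_urls:>11,} URLs: filter ~{filter_mb:,.1f} MB "
          f"({hashes} hashes) vs exact set ~{exact_mb:,.0f} MB")
```

At a 1% false-positive rate this works out to a little under 10 bits per URL, so one million URLs fit in roughly 1.2 MB; the raw strings alone would take about 80 MB under the assumed length, before any hash-set overhead.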