A Bloom Filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set.
It is unique because it can give two types of answers:
- "Possibly in set": The element might be there, but it could be a "False Positive."
- "Definitely not in set": Guaranteed to be correct. There are zero False Negatives.
The Core Concept: Bit Arrays and Hashing
A Bloom Filter consists of:
- A Bit Array of size $m$, initialized to 0.
- $k$ different Hash Functions, each mapping an element to one of the $m$ array positions.
To add an element: Run it through all $k$ hash functions and set the bits at those positions to 1. To query an element: Run it through all $k$ hash functions. If all bits at those positions are 1, the element is "possibly in set." If any bit is 0, the element is "definitely not in set."
Bloom Filter Implementation in Java (Simple)
import java.util.BitSet;
import java.util.function.Function;
public class BloomFilter<T> {
private BitSet bitSet;
private int bitSetSize;
private int numHashFunctions;
private Function<T, Integer>[] hashFunctions;
public BloomFilter(int m, int k) {
this.bitSetSize = m;
this.numHashFunctions = k;
this.bitSet = new BitSet(m);
}
public void add(T element) {
for (int i = 0; i < numHashFunctions; i++) {
int hash = Math.abs((element.hashCode() + i * 31) % bitSetSize);
bitSet.set(hash);
}
}
public boolean mightContain(T element) {
for (int i = 0; i < numHashFunctions; i++) {
int hash = Math.abs((element.hashCode() + i * 31) % bitSetSize);
if (!bitSet.get(hash)) {
return false; // Definitely not there
}
}
return true; // Might be there
}
}
Why use Bloom Filters?
| Feature | Hash Set | Bloom Filter |
|---|---|---|
| Space Complexity | $O(N)$ - stores actual data | $O(M)$ - very small bit array |
| Accuracy | 100% | Probabilistic (False Positives) |
| Deletion | Supported | Not supported (Standard version) |
| Performance | Fast | Extremely Fast |
Real-World Applications
- Database Caching: Google BigTable and Apache Cassandra use Bloom Filters to avoid looking up non-existent rows on disk (preventing expensive I/O).
- Web Browsers: Google Chrome once used Bloom Filters to check if a URL was a known malicious site before performing a full server-side check.
- Networking: Routers use them to track blocked IP addresses or cache keys without using much memory.
Summary
Bloom Filters are a masterclass in trading accuracy for extreme efficiency. By accepting a small, controllable error rate, you can handle massive datasets that would never fit into a standard Hash Map. Understanding this trade-off is a key skill for senior backend engineers and system architects.
