HyperLogLog at Scale
Counting unique users (cardinality) across billions of events is a classic trade-off problem: do you want 100% accuracy at the cost of gigabytes of memory, or ~99% accuracy in a few kilobytes?
1. The Cardinality Problem
To track unique users exactly, you'd usually use a set of user IDs. But storing 1 billion unique IDs in an in-memory set would take gigabytes of RAM.
2. The Solution: Probabilistic Counting
HyperLogLog (HLL) estimates cardinality using the properties of uniformly distributed hash values: it hashes every input and tracks the longest run of leading zeros it has observed.
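As a minimal illustration of that core observation (a toy sketch, not Redis's implementation; the SHA-256-derived 64-bit hash is just an assumption for the demo):

```python
import hashlib

def leading_zeros(item: str, bits: int = 64) -> int:
    """Count leading zero bits in a 64-bit hash of the item."""
    h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
    # bit_length() tells us where the highest 1-bit sits.
    return bits - h.bit_length()

# Intuition check: across many distinct items, the longest observed
# zero-run grows roughly like log2(number of distinct items).
max_run = max(leading_zeros(f"user-{i}") for i in range(100_000))
print(max_run)  # typically in the mid/high teens for 100k distinct items
```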
3. Real-world Efficiency
Redis's implementation of HLL uses a maximum of 12KB of memory to estimate cardinality for billions of unique items with an error rate of ~0.81%.
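The commands are PFADD, PFCOUNT, and PFMERGE. A minimal sketch using the redis-py client, assuming a Redis server on localhost (the key name is illustrative):

```python
import redis  # assumes the redis-py package and a local Redis server

r = redis.Redis(host="localhost", port=6379)

# PFADD inserts items into the sketch; PFCOUNT returns the estimate.
r.pfadd("unique:visitors:2024-06-01", "user-1", "user-2", "user-3")
r.pfadd("unique:visitors:2024-06-01", "user-2")  # duplicate, no effect

print(r.pfcount("unique:visitors:2024-06-01"))  # ~3
```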
4. Why leading zeros work
If hashes are uniformly distributed, a hash starting with k leading zero bits occurs with probability 2^-(k+1), so long zero-runs are rare.
The maximum observed zero-prefix length therefore correlates with set cardinality: observing a longest run of k suggests roughly 2^k distinct elements.
A single maximum is a noisy estimator, so HLL improves accuracy by splitting observations across many registers and combining them with harmonic-mean style estimation.
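A toy version of the full estimator makes this concrete. It is the textbook algorithm without small- or large-range bias corrections; the register count and hash choice mirror common defaults but are assumptions here:

```python
import hashlib

P = 14                             # 2^14 = 16384 registers, Redis-style sizing
M = 1 << P
ALPHA = 0.7213 / (1 + 1.079 / M)   # standard HLL bias constant for large M

registers = [0] * M

def add(item: str) -> None:
    h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
    idx = h >> (64 - P)                      # first P bits pick the register
    rest = h & ((1 << (64 - P)) - 1)         # remaining 50 bits
    rank = (64 - P) - rest.bit_length() + 1  # position of the first 1-bit
    registers[idx] = max(registers[idx], rank)

def estimate() -> float:
    # Harmonic mean of 2^register values, scaled by ALPHA * M^2.
    return ALPHA * M * M / sum(2.0 ** -r for r in registers)

for i in range(1_000_000):
    add(f"user-{i}")
print(round(estimate()))  # within ~1% of 1,000,000 in typical runs
```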
5. Operational use cases
HLL is ideal for:
- daily/monthly active users
- unique visitors by campaign
- unique IP estimation for abuse detection
- approximate unique keys in streaming pipelines
It is not ideal when exact billing/audit correctness is required.
6. Mergeability advantage
A major benefit is composability:
- compute HLL sketches per shard/region/window
- merge sketches centrally
- estimate global cardinality without raw event transfer
This is powerful for large analytics systems where raw data movement is expensive.
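Merging works because each register stores a maximum, and the maximum over a union is the maximum of the per-part maxima. A minimal sketch, reusing the toy register layout above:

```python
def merge(a: list[int], b: list[int]) -> list[int]:
    """Register-wise max: the merged sketch is exactly the sketch of the union."""
    return [max(x, y) for x, y in zip(a, b)]
```

In Redis, PFMERGE performs the same register-wise max server-side.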
7. Accuracy expectations
Typical Redis HLL error is ~0.81% with a fixed memory footprint.
For many product analytics questions, this is a strong trade-off:
- tiny memory
- stable throughput
- bounded relative error
For compliance-grade numbers, use exact dedupe pipelines instead.
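The ~0.81% figure follows from HLL's standard error of roughly 1.04/sqrt(m) for m registers. With Redis's 2^14 = 16384 registers:

```python
import math

m = 16384                      # Redis uses 2^14 registers
rel_error = 1.04 / math.sqrt(m)
print(f"{rel_error:.4%}")      # 0.8125% -> the ~0.81% quoted above
```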
8. Common pitfalls
- low-quality hash function causing bias
- using HLL for tiny cardinalities where exact set is cheap
- mixing incompatible sketch formats
- treating estimate as exact count in financial reporting
Always document "approximate" in downstream dashboards.
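One mitigation for the tiny-cardinality pitfall is a hybrid counter: keep an exact set until it crosses a size threshold, then fall back to the sketch. A hypothetical sketch of the pattern (the threshold and register sizing are illustrative choices):

```python
import hashlib
from typing import Optional

P = 14
M = 1 << P
ALPHA = 0.7213 / (1 + 1.079 / M)

class HybridCounter:
    """Exact set while small, HLL registers once it grows."""

    def __init__(self, threshold: int = 10_000):
        self.threshold = threshold
        self.exact: Optional[set] = set()
        self.registers = [0] * M  # kept warm from the start, so hand-off is free

    def add(self, item: str) -> None:
        h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        idx = h >> (64 - P)
        rest = h & ((1 << (64 - P)) - 1)
        self.registers[idx] = max(self.registers[idx],
                                  (64 - P) - rest.bit_length() + 1)
        if self.exact is not None:
            self.exact.add(item)
            if len(self.exact) > self.threshold:
                self.exact = None  # stop paying for exact storage

    def count(self) -> float:
        if self.exact is not None:
            return len(self.exact)  # exact while the set is cheap
        return ALPHA * M * M / sum(2.0 ** -r for r in self.registers)
```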
9. Design pattern in streaming stacks
Practical pattern:
- event consumers hash entity IDs
- update per-tenant/per-window HLL sketch
- periodically snapshot sketch to object storage
- merge for global views in batch layer
This supports near-real-time unique metrics at low infrastructure cost.
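A minimal shape of that pattern using Redis-backed sketches (the key scheme and function names are hypothetical, and the snapshot-to-object-storage step is omitted):

```python
import redis  # assumes redis-py and a Redis server reachable by consumers

r = redis.Redis(host="localhost", port=6379)

def handle_event(tenant_id: str, user_id: str, window: str) -> None:
    # One sketch per tenant per time window; the key scheme is illustrative.
    r.pfadd(f"hll:{tenant_id}:{window}", user_id)

def global_uniques(tenant_ids: list[str], window: str) -> int:
    # Merge per-tenant sketches into a scratch key, then read the estimate.
    keys = [f"hll:{t}:{window}" for t in tenant_ids]
    r.pfmerge(f"hll:_global:{window}", *keys)
    return r.pfcount(f"hll:_global:{window}")
```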
