HyperLogLog at Scale

Counting unique users (cardinality) for billions of events is a classic trade-off problem. Do you want 100% accuracy (massive memory) or high speed with 99% accuracy (tiny memory)?

1. The Cardinality Problem

To track unique users, you'd usually use a . But storing 1 billion unique in a would take gigabytes of RAM.

2. The Solution: Probabilistic Counting

HyperLogLog (HLL) estimates cardinality using the properties of random binary strings. It counts the number of leading zeros in the hash of every input.

3. Real-world Efficiency

Redis's implementation of HLL uses a maximum of 12KB of memory to estimate cardinality for billions of unique items with an error rate of ~0.81%.

4. Why leading zeros work

If hashes are uniformly distributed, seeing a hash with many leading zeros is rare.
The maximum observed zero-prefix length correlates with set cardinality.
HLL improves accuracy by splitting observations across many registers and combining them with harmonic-mean style estimation.

5. Operational use cases

HLL is ideal for:

daily/monthly active users
unique visitors by campaign
unique IP estimation for abuse detection
approximate unique keys in streaming pipelines

It is not ideal when exact billing/audit correctness is required.

6. Mergeability advantage

A major benefit is composability:

compute HLL sketches per shard/region/window
merge sketches centrally
estimate global cardinality without raw event transfer

This is powerful for large analytics systems where raw data movement is expensive.

7. Accuracy expectations

Typical Redis HLL error is ~0.81% with fixed memory footprint.
For many product analytics questions, this is a strong trade-off:

tiny memory
stable throughput
bounded relative error

For compliance-grade numbers, use exact dedupe pipelines instead.

8. Common pitfalls

low-quality hash function causing bias
using HLL for tiny cardinalities where exact set is cheap
mixing incompatible sketch formats
treating estimate as exact count in financial reporting

Always document "approximate" in downstream dashboards.

9. Design pattern in streaming stacks

Practical pattern:

event consumers hash entity IDs
update per-tenant/per-window HLL sketch
periodically snapshot sketch to object storage
merge for global views in batch layer

This supports near-real-time unique metrics at low infrastructure cost.

Sachin Sarawgi

Engineering Manager and backend engineer with 10+ years building distributed systems across fintech, enterprise SaaS, and startups. CodeSprintPro is where I write practical guides on system design, Java, Kafka, databases, AI infrastructure, and production reliability.

HyperLogLog at Scale: Billion-Cardinality Estimation