System Design: Solving the Top K Problem
The "Top K" problem (or Heavy Hitters) is about finding the K most frequent items in a massive stream of data. For example:
- YouTube: The top 10 trending videos in the last hour.
- Twitter: The top 50 trending hashtags globally.
- E-commerce: The top-selling products across all categories.
1. Core Requirements
- Real-time: Results must be updated almost instantly.
- High Volume: Handling millions of events per second.
- Accuracy vs. Efficiency: At scale, 100% accuracy is often too expensive; we need efficient probabilistic solutions.
2. Naive Approach: Hash Map
Maintain a Hash Map of item_id -> count.
- Problem: If you have billions of items, the hash map won't fit in RAM. If you store it on disk, it's too slow for real-time updates.
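As a baseline, the naive approach can be sketched with Python's built-in `Counter` (a minimal illustration that is perfectly fine when the distinct-item set fits in memory):

```python
from collections import Counter

def top_k_naive(stream, k):
    """Count every item exactly, then return the k items with the largest counts."""
    counts = Counter(stream)  # item_id -> exact count, held entirely in RAM
    return counts.most_common(k)

# Example: a small event stream of video IDs (hypothetical data)
events = ["v1", "v2", "v1", "v3", "v1", "v2"]
print(top_k_naive(events, 2))  # [('v1', 3), ('v2', 2)]
```

This is exact and simple, which is precisely why it is the right mental baseline: everything that follows trades some of this accuracy for bounded memory.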
3. The Scalable Solution: Count-Min Sketch
The Count-Min Sketch is a probabilistic data structure that estimates the frequency of items in a stream using a constant amount of memory.
- How it works:
  - An array of W columns and D rows is created, all initialized to 0, and D different hash functions are chosen (one per row).
  - When an item arrives, it is hashed by each function to find its position in each row, and that counter is incremented.
- Query: To find the frequency of an item, hash it again and take the minimum value from its positions across the D rows.
- Benefit: It uses fixed memory and provides an "upper bound" estimate with a very small error margin (hash collisions can only inflate a counter, never decrease it, so the minimum is the least-contaminated estimate).
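The mechanics above can be sketched in Python. This is a minimal illustration, not a production implementation: the width/depth values and the salted-MD5 trick for deriving D "independent" hash functions are assumptions made for brevity.

```python
import hashlib

class CountMinSketch:
    """Minimal Count-Min Sketch: D rows of W counters."""

    def __init__(self, width=1000, depth=5):
        self.width = width   # W columns per row
        self.depth = depth   # D rows, one hash function each
        self.table = [[0] * width for _ in range(depth)]

    def _positions(self, item):
        # Derive D positions by salting one hash with the row index
        # (an illustrative shortcut for D independent hash functions).
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item):
        for row, col in self._positions(item):
            self.table[row][col] += 1

    def estimate(self, item):
        # The minimum across rows is the least-inflated counter,
        # so the estimate is an upper bound on the true count.
        return min(self.table[row][col] for row, col in self._positions(item))

sketch = CountMinSketch()
for video in ["v1", "v1", "v2"]:
    sketch.add(video)
print(sketch.estimate("v1"))  # at least 2; may be higher if rows collide
```

Note that memory is fixed at W × D counters regardless of how many distinct items the stream contains, which is the whole point.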
4. Distributed Architecture
To handle millions of events per second:
- Ingestion: Events land in Apache Kafka.
- Aggregation: Apache Flink or Spark Streaming workers process partitions of the stream.
- Local Top K: Each worker maintains its own local Count-Min Sketch and Top K list.
- Global Top K: A central aggregator merges the local lists from all workers to produce the final global Top K.
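The aggregation step can be sketched as follows. This is a simplified illustration: in practice workers report periodically rather than once, and the aggregator must account for the fact that an item just below the local cutoff on every worker could be missed globally (local lists are usually kept larger than K to mitigate this).

```python
import heapq
from collections import Counter

def merge_top_k(local_lists, k):
    """Merge per-worker (item, count) lists into a single global Top K."""
    totals = Counter()
    for partial in local_lists:          # one list per worker
        for item, count in partial:
            totals[item] += count        # sum counts for items seen on several workers
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])

# Hypothetical local Top K lists from two Flink workers
worker_a = [("v1", 120), ("v2", 90)]
worker_b = [("v2", 60), ("v3", 50)]
print(merge_top_k([worker_a, worker_b], 2))  # [('v2', 150), ('v1', 120)]
```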
5. Time-Series Windowing
Trending topics change over time. We use Sliding Windows (e.g., 60-minute window sliding every 5 minutes).
- Optimization: Use a Lossy Counting algorithm or a time-decayed counter where older events contribute less to the total score.
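A time-decayed counter can be sketched like this. It is an illustrative exponential-decay variant (the half-life parameter is an arbitrary assumption); Lossy Counting proper is a different bucket-based algorithm.

```python
import math

class DecayedCounter:
    """Scores decay exponentially with age, so older events count for less."""

    def __init__(self, half_life=3600.0):
        # After `half_life` seconds, an event's contribution halves.
        self.decay = math.log(2) / half_life
        self.scores = {}  # item -> (score, last_update_time)

    def add(self, item, now, amount=1.0):
        score, last = self.scores.get(item, (0.0, now))
        # Decay the stored score to the present, then add the new event.
        score *= math.exp(-self.decay * (now - last))
        self.scores[item] = (score + amount, now)

    def score(self, item, now):
        score, last = self.scores.get(item, (0.0, now))
        return score * math.exp(-self.decay * (now - last))
```

With this scheme nothing ever needs to be subtracted when a window slides; a stale item's score simply fades toward zero and falls out of the Top K on its own.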
6. Storage
- Real-time: The current Top K list is stored in Redis for instant access by the front-end API.
- Historical: Full logs are stored in a data lake like Amazon S3 for long-term auditing and precise offline analysis.
Summary
The Top K problem is a classic example of the trade-off between Accuracy and Scale. By using probabilistic data structures like Count-Min Sketch and a distributed stream processing engine, you can track global trends in real time without overwhelming your infrastructure.
