System DesignAdvancedarticle

System Design: Solving the Top K Problem (Heavy Hitters)

How does YouTube track trending videos or Twitter find trending hashtags in real-time? Learn about the Top K problem, Count-Min Sketch, and heavy hitters at scale.

Sachin SarawgiApril 20, 20263 min read3 minute lesson

System Design: Solving the Top K Problem

The "Top K" problem (or Heavy Hitters) is about finding the $ most frequent items in a massive stream of data. For example:

  • YouTube: The top 10 trending videos in the last hour.
  • Twitter: The top 50 trending hashtags globally.
  • E-commerce: The top selling products across all categories.

1. Core Requirements

  • Real-time: Results must be updated almost instantly.
  • High Volume: Handling millions of events per second.
  • Accuracy vs. Efficiency: At scale, 100% accuracy is often too expensive; we need efficient probabilistic solutions.

2. Naive Approach: Hash Map

Maintain a Hash Map of item_id -> count.

  • Problem: If you have billions of items, the hash map won't fit in RAM. If you store it on disk, it's too slow for real-time updates.

3. The Scalable Solution: Count-Min Sketch

The Count-Min Sketch is a probabilistic data structure that estimates the frequency of items in a stream using a constant amount of memory.

  • How it works:
    1. An array of W columns and D rows is created, all initialized to 0.
    2. D different hash functions are chosen.
    3. When an item arrives, it is hashed by each function to find its position in each row, and that counter is incremented.
  • Query: To find the frequency of an item, you hash it again and take the minimum value from its positions in the D rows.
  • Benefit: It uses fixed memory and provides an "upper bound" estimate with a very small error margin.

4. Distributed Architecture

To handle millions of events per second:

  1. Ingestion: Events land in Apache Kafka.
  2. Aggregation: Apache Flink or Spark Streaming workers process partitions of the stream.
  3. Local Top K: Each worker maintains its own local Count-Min Sketch and Top K list.
  4. Global Top K: A central aggregator merges the local lists from all workers to produce the final global Top K.

5. Time-Series Windowing

Trending topics change over time. We use Sliding Windows (e.g., 60-minute window sliding every 5 minutes).

  • Optimization: Use a Lossy Counting algorithm or a time-decayed counter where older events contribute less to the total score.

6. Storage

  • Real-time: The current Top K list is stored in Redis for instant access by the front-end API.
  • Historical: Full logs are stored in a data lake like Amazon S3 for long-term auditing and precise offline analysis.

Summary

The Top K problem is a classic example of the trade-off between Accuracy and Scale. By using probabilistic data structures like Count-Min Sketch and a distributed stream processing engine, you can track global trends in real-time without overwhelming your infrastructure.

📚

Recommended Resources

Designing Data-Intensive ApplicationsBest Seller

The definitive guide to building scalable, reliable distributed systems by Martin Kleppmann.

View on Amazon
Kafka: The Definitive GuideEditor's Pick

Real-time data and stream processing by Confluent engineers.

View on Amazon
Apache Kafka Series on Udemy

Hands-on Kafka course covering producers, consumers, Kafka Streams, and Connect.

View Course

Practical engineering notes

Get the next backend guide in your inbox

One useful note when a new deep dive is published: system design tradeoffs, Java production lessons, Kafka debugging, database patterns, and AI infrastructure.

No spam. Just practical notes you can use at work.

Sachin Sarawgi

Written by

Sachin Sarawgi

Engineering Manager and backend engineer with 10+ years building distributed systems across fintech, enterprise SaaS, and startups. CodeSprintPro is where I write practical guides on system design, Java, Kafka, databases, AI infrastructure, and production reliability.

Keep Learning

Move through the archive without losing the thread.

Related Articles

More deep dives chosen from shared tags, category overlap, and reading difficulty.

More in System Design

Category-based suggestions if you want to stay in the same domain.