
System Design: Designing an Ad Click Aggregator

How do Google and Facebook aggregate billions of ad clicks for billing? A technical deep dive into Write-Heavy Scaling, Exactly-Once Processing, and Real-Time Aggregation.

Sachin Sarawgi · April 20, 2026 · 3 min read


Ad click aggregation is a massive-scale data problem. When billions of users click on ads across the web, those clicks must be aggregated, deduplicated, and stored for both real-time analytics (advertiser dashboards) and accurate billing.

1. Core Requirements

  • High Throughput: Handling billions of clicks per day; 1 billion clicks/day averages roughly 12,000 per second, and peak traffic runs several times higher.
  • Accuracy: Billing requires exactly-once processing. We cannot charge an advertiser twice for the same click (Deduplication).
  • Latency: Real-time dashboards should update within seconds.
  • Resilience: No click should ever be lost, even if a data center goes down.

2. The Data Path

  1. Click Event: A user clicks an ad. The browser sends a request to our Click Tracking Server.
  2. Raw Log Ingestion: The Tracking Server immediately pushes the raw click event into Apache Kafka.
    • Why Kafka? It acts as a high-performance buffer and persistent log.
  3. Aggregation Engine: Apache Flink or Spark Streaming consumes from Kafka.
    • Deduplication: Uses a Redis cache or a stateful Flink map to filter out duplicate clicks (based on click_id and user_id).
    • Windowing: Clicks are aggregated in 1-minute tumbling windows.
  4. Storage:
    • Real-time: Aggregated counts are stored in Cassandra or Redis for the advertiser dashboard.
    • Historical: Raw clicks are stored in Amazon S3 (Parquet) for long-term auditing and fraud detection.
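The dedup-and-window step above can be sketched in a few lines. This is a toy stand-in for the Flink job, not real Flink code: the `seen` set plays the role of the stateful dedup operator (or Redis), `counts` plays the role of windowed state, and the field names (`click_id`, `ad_id`, `ts`) are illustrative.

```python
from collections import defaultdict

def aggregate_clicks(events, window_seconds=60):
    """Dedupe clicks by click_id, then count per (ad_id, tumbling window)."""
    seen = set()                # stand-in for the stateful dedup operator
    counts = defaultdict(int)   # stand-in for windowed aggregate state
    for e in events:
        if e["click_id"] in seen:          # duplicate delivery: drop it
            continue
        seen.add(e["click_id"])
        # Tumbling window: round the timestamp down to the window start.
        window_start = e["ts"] - e["ts"] % window_seconds
        counts[(e["ad_id"], window_start)] += 1
    return dict(counts)

events = [
    {"click_id": "c1", "ad_id": "ad123", "ts": 100},
    {"click_id": "c1", "ad_id": "ad123", "ts": 101},  # retry, deduped
    {"click_id": "c2", "ad_id": "ad123", "ts": 110},
    {"click_id": "c3", "ad_id": "ad123", "ts": 200},  # next 1-min window
]
print(aggregate_clicks(events))
# {('ad123', 60): 2, ('ad123', 180): 1}
```

In the real pipeline, each window's counts are flushed downstream when the window closes rather than held forever, and the dedup state is expired with a TTL so it does not grow without bound.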

3. Dealing with "Exactly-Once" Semantics

Billing systems cannot tolerate duplicates.

  • Kafka Idempotency: Producers are configured with enable.idempotence=true, so broker-side retries cannot write the same record twice.
  • Checkpointing: Flink takes distributed snapshots (checkpoints) of operator state together with Kafka offsets, so a failed worker resumes from the last consistent snapshot instead of reprocessing or skipping events. For end-to-end exactly-once, the sink must also be transactional or the writes idempotent, so events replayed after a failure are not double-counted.
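The intuition behind checkpoint-based recovery can be shown with a toy model (heavily simplified: one worker, an in-memory list standing in for the durable Kafka log, and no snapshot barriers or transactional sinks). The key invariant is that the offset and the aggregate state are persisted together, atomically:

```python
# The durable Kafka log: (ad_id, click_count) events.
log = [("ad123", 1)] * 10

def run(start_offset, state, crash_at=None):
    """Process the log from start_offset, checkpointing every 3 events.
    Returns (offset, state): the last checkpoint if we 'crash', or the
    final position if we finish."""
    checkpoint = (start_offset, dict(state))
    for offset in range(start_offset, len(log)):
        if offset == crash_at:
            return checkpoint            # in-flight (uncheckpointed) work is lost
        ad, n = log[offset]
        state[ad] = state.get(ad, 0) + n
        if (offset + 1) % 3 == 0:        # atomic snapshot of offset + state
            checkpoint = (offset + 1, dict(state))
    return (len(log), state)

# Crash after the checkpoint at offset 6; resume replays offsets 6..9
# exactly once, so the final total is still correct.
ckpt_offset, ckpt_state = run(0, {}, crash_at=7)
final_offset, totals = run(ckpt_offset, ckpt_state)
print(totals)
# {'ad123': 10}
```

Because recovery rewinds to a state that matches the offset exactly, no click is counted twice and none is skipped, which is the property the billing path needs.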

4. Scaling the Write Volume

The biggest bottleneck is the write volume to the database.

  • Pre-aggregation: Never write every single click to the database. Aggregate them in RAM (in Flink) and write only the summary (e.g., "Ad 123 got 500 clicks in the last minute") once to the database.
  • Sharding: Shard the database by ad_id to distribute the aggregation load across multiple nodes.
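Both ideas can be sketched together: aggregate a window's clicks in memory, then route one summary row per ad to a shard chosen by hashing ad_id. Everything here is illustrative; `crc32` is just a stable stand-in for whatever partitioner the real database uses.

```python
import zlib
from collections import Counter

NUM_SHARDS = 4   # illustrative; real clusters size this by write volume

def shard_for(ad_id, num_shards=NUM_SHARDS):
    """Stable hash routing: every aggregate for one ad lands on one shard."""
    return zlib.crc32(ad_id.encode()) % num_shards

def flush_window(click_stream):
    """Pre-aggregate one window's raw clicks in RAM, then emit a single
    summary row per ad, grouped by destination shard."""
    per_ad = Counter(click_stream)           # in-memory aggregation
    writes = {}                              # shard -> [(ad_id, count)]
    for ad_id, count in per_ad.items():
        writes.setdefault(shard_for(ad_id), []).append((ad_id, count))
    return writes

# 7 raw clicks collapse into 2 database writes (one summary row per ad):
stream = ["ad1", "ad1", "ad2", "ad1", "ad2", "ad1", "ad1"]
print(flush_window(stream))
```

The write amplification win is the point: at billions of clicks per day, collapsing each (ad, minute) pair into one row turns millions of writes into thousands.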

5. Fraud Detection

Ad fraud (bots clicking ads) is a major concern.

  • Real-time Filter: Use ML models or rule-based filters (e.g., "more than 10 clicks from same IP in 1 second") to flag and filter fraudulent clicks before they reach the billing layer.
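The rule quoted above ("more than 10 clicks from the same IP in 1 second") maps directly onto a sliding-window rate check. A minimal sketch, keeping one deque of recent timestamps per IP (thresholds and class name are illustrative):

```python
from collections import defaultdict, deque

class IpRateFilter:
    """Rule-based fraud filter: flag an IP that produces more than
    `limit` clicks within any sliding `window` seconds."""

    def __init__(self, limit=10, window=1.0):
        self.limit, self.window = limit, window
        self.recent = defaultdict(deque)   # ip -> click timestamps in window

    def is_fraud(self, ip, ts):
        q = self.recent[ip]
        while q and ts - q[0] > self.window:   # evict clicks outside window
            q.popleft()
        q.append(ts)
        return len(q) > self.limit

f = IpRateFilter(limit=3, window=1.0)
verdicts = [f.is_fraud("1.2.3.4", t) for t in [0.0, 0.2, 0.4, 0.6, 2.0]]
print(verdicts)   # 4th click within 1s exceeds limit=3; the later one is clean
# [False, False, False, True, False]
```

In production this state would live in the stream processor (or Redis) rather than one process's heap, and flagged clicks would be routed to a quarantine topic for the ML models to score instead of being silently dropped.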

Summary

The engineering of an ad click aggregator is a battle between write throughput and data integrity. By using Kafka for ingestion and a stateful stream processor like Flink for pre-aggregation and deduplication, you can build a system that processes billions of events per day with exactly-once accuracy and dashboards that update within seconds.

📚 Recommended Resources

  • Designing Data-Intensive Applications (Best Seller): The definitive guide to building scalable, reliable distributed systems by Martin Kleppmann.
  • Kafka: The Definitive Guide (Editor's Pick): Real-time data and stream processing by Confluent engineers.
  • Apache Kafka Series on Udemy: Hands-on Kafka course covering producers, consumers, Kafka Streams, and Connect.


Written by Sachin Sarawgi

Engineering Manager and backend engineer with 10+ years building distributed systems across fintech, enterprise SaaS, and startups. CodeSprintPro is where I write practical guides on system design, Java, Kafka, databases, AI infrastructure, and production reliability.
