System Design: Designing an Ad Click Aggregator
Ad click aggregation is a data problem at massive scale. When billions of users click on ads across the web, those clicks must be aggregated, deduplicated, and stored for both real-time analytics (advertiser dashboards) and accurate billing.
1. Core Requirements
- High Throughput: Handling billions of clicks per day (tens of thousands per second).
- Accuracy: Billing requires exactly-once processing. We cannot charge an advertiser twice for the same click (Deduplication).
- Latency: Real-time dashboards should update within seconds.
- Resilience: No click should ever be lost, even if a data center goes down.
2. The Data Path
- Click Event: A user clicks an ad. The browser sends a request to our Click Tracking Server.
- Raw Log Ingestion: The Tracking Server immediately pushes the raw click event into Apache Kafka.
- Why Kafka? It acts as a high-performance buffer and persistent log.
- Aggregation Engine: Apache Flink or Spark Streaming consumes from Kafka.
- Deduplication: Uses a Redis cache or a stateful Flink map to filter out duplicate clicks (keyed on click_id and user_id).
- Windowing: Clicks are aggregated in 1-minute tumbling windows.
- Storage:
- Real-time: Aggregated counts are stored in Cassandra or Redis for the advertiser dashboard.
- Historical: Raw clicks are stored in Amazon S3 (Parquet) for long-term auditing and fraud detection.
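The dedup-then-window step above can be sketched in a few lines. This is a minimal in-memory illustration, not production code: in the real pipeline the seen-set lives in Redis or Flink keyed state, and the field names (click_id, user_id, ad_id, ts) are assumed for the example.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # 1-minute tumbling windows

def aggregate_clicks(events):
    """Deduplicate clicks, then count them per (ad_id, window)."""
    seen = set()               # stands in for Redis / Flink keyed state
    counts = defaultdict(int)  # (ad_id, window_start) -> click count
    for event in events:
        key = (event["click_id"], event["user_id"])
        if key in seen:        # duplicate click: drop it
            continue
        seen.add(key)
        # Tumbling window: align the timestamp down to the window boundary.
        window_start = event["ts"] - (event["ts"] % WINDOW_SECONDS)
        counts[(event["ad_id"], window_start)] += 1
    return dict(counts)

clicks = [
    {"click_id": "c1", "user_id": "u1", "ad_id": 123, "ts": 1000},
    {"click_id": "c1", "user_id": "u1", "ad_id": 123, "ts": 1001},  # duplicate
    {"click_id": "c2", "user_id": "u2", "ad_id": 123, "ts": 1005},
    {"click_id": "c3", "user_id": "u1", "ad_id": 456, "ts": 1070},  # next window
]
print(aggregate_clicks(clicks))
```

Note that an unbounded seen-set would grow forever; real deployments expire dedup state with a TTL (e.g., Redis key expiry or Flink state TTL) matched to the window of possible retries.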
3. Dealing with "Exactly-Once" Semantics
Billing systems cannot tolerate duplicates.
- Kafka Idempotency: Producers are configured with enable.idempotence=true.
- Checkpointing: Flink uses distributed snapshots (checkpoints) to ensure that if a worker fails, it resumes from the exact point in the log where it left off, so no event is missed or processed twice.
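The producer side of this guarantee is mostly configuration. The sketch below shows the standard Kafka producer settings involved (the key names are real Kafka client configuration; the dict is illustrative and would be passed to whichever client library you use):

```python
# Producer settings for idempotent delivery into Kafka (illustrative).
producer_config = {
    "enable.idempotence": True,  # broker dedupes retried sends per producer/partition
    "acks": "all",               # wait for all in-sync replicas before acking
    "retries": 2147483647,       # retry aggressively; idempotence keeps retries safe
    "max.in.flight.requests.per.connection": 5,  # the maximum allowed with idempotence
}
```

Idempotence protects against duplicates introduced by producer retries; end-to-end exactly-once still depends on the consumer side (Flink checkpoints plus transactional or idempotent sinks).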
4. Scaling the Write Volume
The biggest bottleneck is the write volume to the database.
- Pre-aggregation: Never write every single click to the database. Aggregate them in RAM (in Flink) and write only the summary (e.g., "Ad 123 got 500 clicks in the last minute") once to the database.
- Sharding: Shard the database by ad_id to distribute the aggregation load across multiple nodes.
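Pre-aggregation can be sketched as a small buffer that absorbs clicks in memory and emits one summary row per ad per window. This is a simplified stand-in for Flink windowed state; the PreAggregator name and sink callable are illustrative.

```python
from collections import defaultdict

class PreAggregator:
    """Buffers per-ad click counts in RAM and flushes one summary row
    per ad per window, instead of one database write per click."""
    def __init__(self, sink):
        self.sink = sink                 # e.g. a Cassandra writer; here any callable
        self.counts = defaultdict(int)

    def on_click(self, ad_id):
        self.counts[ad_id] += 1          # O(1) in-memory update, no DB write

    def flush(self, window_start):
        for ad_id, n in self.counts.items():
            self.sink(ad_id, window_start, n)  # one write per ad, not per click
        self.counts.clear()

rows = []
agg = PreAggregator(lambda ad, w, n: rows.append((ad, w, n)))
for _ in range(500):
    agg.on_click(123)                    # 500 raw clicks...
agg.flush(window_start=1700000000)       # ...become a single summary row
```

Sharding then amounts to routing each summary row to a node by something like hash(ad_id) % num_nodes, so no single node aggregates every ad.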
5. Fraud Detection
Ad fraud (bots clicking ads) is a major concern.
- Real-time Filter: Use ML models or rule-based filters (e.g., "more than 10 clicks from same IP in 1 second") to flag and filter fraudulent clicks before they reach the billing layer.
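The IP-rate rule quoted above maps naturally onto a sliding-window counter per IP. A minimal sketch, with the class name and thresholds chosen for illustration:

```python
from collections import defaultdict, deque

class IpRateRule:
    """Flags a click as fraudulent if its source IP has produced more
    than `limit` clicks within the last `window_seconds`."""
    def __init__(self, limit=10, window_seconds=1.0):
        self.limit = limit
        self.window = window_seconds
        self.recent = defaultdict(deque)  # ip -> timestamps of recent clicks

    def is_fraud(self, ip, ts):
        q = self.recent[ip]
        while q and ts - q[0] > self.window:  # evict clicks outside the window
            q.popleft()
        q.append(ts)
        return len(q) > self.limit

rule = IpRateRule(limit=10, window_seconds=1.0)
# 12 clicks from one IP in ~0.11 seconds: the 11th and 12th are flagged.
flags = [rule.is_fraud("203.0.113.7", t / 100) for t in range(12)]
```

In the streaming pipeline, this state would again live in Flink keyed state (keyed by IP) rather than a local dict, and flagged clicks would be diverted before the billing aggregation.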
Summary
The engineering of an ad click aggregator is a battle between Write Throughput and Data Integrity. By using Kafka for ingestion and a stateful stream processor like Flink for pre-aggregation and deduplication, you can build a system that processes billions of events per day with exactly-once accuracy and dashboards that update within seconds.
