Real-time analytics dashboards power everything from ride-sharing surge pricing to video-streaming trending lists and real-time ad click monitoring. These platforms ingest millions of continuous events per second, run rolling time-window aggregations, and push the resulting metrics down to thousands of active client web browsers in under a second.
Building such a system is a major engineering challenge. You cannot simply write raw events to a standard relational database and query them on-demand; the sheer write load will lock tables, and analytical group-by queries over millions of rows will choke database CPUs. Instead, you must build a streaming architecture that separates the write-heavy event ingestion from the read-heavy aggregate visualizations.
This system design guide details the architectural blueprint for a highly scalable, low-latency real-time analytics system capable of ingesting 1,000,000 events per second and broadcasting updates to 50,000 active clients in real time.
System Requirements
To build a production-grade real-time analytics platform, we divide our system requirements into functional, non-functional, and scale specifications.
Functional Requirements
- High-Volume Ingestion: Ingest raw event streams (such as user clickstreams, system metrics, and transaction logs) from multiple client and server sources.
- Windowed Aggregations: Support calculations (such as count, sum, average, and unique counts) over tumbling (non-overlapping) and sliding (overlapping) time windows.
- Real-Time Push Visualizations: Push updated window metrics to client dashboards in real time with sub-second propagation latency.
- Durable Historical Storage: Store raw and aggregated data for long-term historical analysis and custom period queries.
- Metric Filtering & Grouping: Allow dashboards to filter and group metrics dynamically by dimensions (such as country, device, and campaign ID).
Non-Functional Requirements
- High Ingest Availability: Ingest paths must be highly available and buffered to absorb sudden traffic spikes (e.g., during major sales or advertising campaigns).
- Sub-Second Dashboard Latency: The delay between an event occurring and its representation on the dashboard must be less than 1 second.
- Fault-Tolerant Processing: Aggregations must be calculated with exactly-once processing guarantees, even if stream processing nodes restart.
- Low Query Latency: Historical queries for custom time ranges on the dashboard must return in less than 500 milliseconds.
Scale Assumptions
- Ingestion Throughput: 1,000,000 events per second.
- Dashboard Audience: 50,000 concurrent active users viewing the real-time dashboard.
API Design and Service Contracts
The analytics system exposes HTTPS REST endpoints for ingestion, HTTP/gRPC for querying historical metrics, and WebSocket connections for real-time update delivery.
1. Ingest Event (POST /v1/events)
Invoked by client SDKs and backend agents to stream event telemetry.
Request Payload:
{
"eventId": "evt_uuid_00981a",
"eventType": "ad_click",
"timestamp": 1780821960,
"dimensions": {
"campaignId": "camp_5541",
"country": "US",
"deviceType": "MOBILE",
"userId": "usr_9981a2"
}
}
Response Payload (202 Accepted):
{
"status": "ACCEPTED",
"receivedAt": "2026-06-07T11:26:00.102Z"
}
2. Query Historical Metrics (GET /v1/metrics)
Used by dashboards to retrieve past trend data during page loads.
Query Parameters:
metric: "ad_clicks"campaignId: "camp_5541"start: "2026-06-07T10:00:00Z"end: "2026-06-07T11:00:00Z"resolution: "1m"
Response Payload (200 OK):
{
"metric": "ad_clicks",
"dimensions": { "campaignId": "camp_5541" },
"dataPoints": [
{ "time": "2026-06-07T10:00:00Z", "value": 1205 },
{ "time": "2026-06-07T10:01:00Z", "value": 1450 }
]
}
3. WebSocket Real-Time Subscription (ws://stream.analytics.com/v1/live)
Clients open a persistent connection to receive real-time metric updates.
Subscription Request (Client to Server):
{
"action": "subscribe",
"topic": "metrics:ad_clicks:campaignId:camp_5541"
}
Metric Update Broadcast (Server to Client):
{
"topic": "metrics:ad_clicks:campaignId:camp_5541",
"windowStart": "2026-06-07T11:25:00Z",
"windowEnd": "2026-06-07T11:26:00Z",
"value": 15890,
"timestamp": "2026-06-07T11:26:00.005Z"
}
High-Level Architecture
The architecture separates the write-heavy streaming path from the read-heavy query and real-time push pathways.
The Client SDK sends events to the Collector Service. The collector acts as a lightweight buffer proxy, writing events to partitioned Kafka Topics. The Stream Processor (Apache Flink) consumes raw events from Kafka, runs windowed calculations, and updates the ClickHouse Columnar Database.
For real-time updates, Flink writes aggregates to a Redis Pub/Sub broker. WebSocket Gateway Server nodes subscribe to Redis channels and push metrics down to Client Browsers.
Streaming Ingestion and Aggregation Pipeline
This sequence diagram tracks how telemetry events travel through Kafka and Flink before being saved to ClickHouse and routed to Redis.
sequenceDiagram
autonumber
participant SDK as Client SDK
participant Coll as Event Collector
participant Kafka as Apache Kafka Buffer
participant Flink as Flink Stream Processor
participant Click as ClickHouse OLAP DB
participant Redis as Redis Pub/Sub Broker
SDK->>Coll: POST /v1/events (raw click event)
Coll->>Kafka: Write Event to Partitioned Topic
Coll-->>SDK: 202 Accepted
Flink->>Kafka: Pull Batch Event Stream
Flink->>Flink: Aggregate count inside 60s tumbling window
Flink->>Click: Batch Insert Aggregated metrics
Flink->>Redis: Publish Aggregate: {clicks: 15890}
WebSocket Gateway Fan-Out Topology
This diagram illustrates how metric updates are fanned out from the stream processor to thousands of connected browser dashboards.
flowchart TD
Flink[Flink Processor] -->|1. Publish window result| Redis[Redis Pub/Sub Cluster]
subgraph Gateways [WebSocket Gateway Cluster]
WS1[WebSocket Server Node 1]
WS2[WebSocket Server Node 2]
WS3[WebSocket Server Node 3]
end
Redis -->|2. Fan-out events| WS1 & WS2 & WS3
WS1 -->|3. Push updates| BrowserA[User Browser A]
WS2 -->|3. Push updates| BrowserB[User Browser B]
WS3 -->|3. Push updates| BrowserC[User Browser C]
Low-Level Design and Schema
For high-speed analytical queries, we utilize ClickHouse, a column-oriented OLAP database. This design defines our raw event log and pre-aggregated rollup schema.
ClickHouse Raw Events Table
-- Tracks raw clickstream logs
CREATE TABLE raw_clickstream_events (
event_id UUID,
event_type LowCardinality(String),
event_timestamp DateTime64(3, 'UTC'),
campaign_id String,
country LowCardinality(String),
device_type LowCardinality(String),
user_id String,
created_at DateTime DEFAULT now()
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(event_timestamp)
ORDER BY (campaign_id, event_type, event_timestamp);
ClickHouse Hourly Aggregations Table
-- Stores pre-aggregated metrics for fast dashboard queries
CREATE TABLE hourly_analytics_metrics (
metric_hour DateTime,
event_type LowCardinality(String),
campaign_id String,
country LowCardinality(String),
click_count UInt64,
unique_users AggregateFunction(uniq, String)
) ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(metric_hour)
ORDER BY (campaign_id, event_type, metric_hour, country);
Schema Rationale & Storage Optimization
- Columnar Database Engine: Standard databases store data in row format. If you execute
SELECT SUM(clicks)on a table of 100 columns, a row database reads all 100 columns from disk, creating massive I/O. ClickHouse (a column store) only reads the specific metric column, yielding 10x to 100x faster execution speeds. SummingMergeTreeEngine: ClickHouse automatically aggregates rows with the same sorting keys during background merge passes. This reduces storage footprints and guarantees that historical dashboard queries execute in milliseconds.LowCardinality(String): Dimensions likecountryanddevice_typehave a small set of unique values. Representing them as dictionary-encoded values reduces storage size and accelerates filtering scans.
Scaling Challenges and Capacity Estimation
A platform ingesting 1,000,000 events per second requires significant compute, network, and storage capacity.
1. Ingress Network Bandwidth
-
Assumptions:
- Event Ingress Rate ($R$) = $1,000,000$ events/second
- Average Raw Event Size ($S$) = $200$ bytes
-
Calculations: $$\text{Ingress Volume} = R \times S = 1,000,000 \times 200\text{ bytes} = 200,000,000\text{ B/sec} = 200\text{ MB/sec}$$ $$\text{Required Network Speed} = 200\text{ MB/sec} \times 8 = 1.6\text{ Gbps}$$
To manage this traffic, the ingress layer uses a cluster of 20 Event Collectors behind a Layer-4 Load Balancer. The load balancer splits requests evenly among the collectors, limiting network ingestion on each node to a manageable 80 Mbps.
2. Daily Database Storage Footprint
- Calculations: $$\text{Daily Event Count} = 1,000,000\text{ events/sec} \times 86,400\text{ seconds/day} = 86,400,000,000\text{ events/day}$$ $$\text{Raw Storage Size} = 86,400,000,000 \times 200\text{ bytes} = 17,280,000,000,000\text{ bytes} = 17.28\text{ TB/day}$$
Over a year, this accumulates 6.3 PB of raw data. To make this manageable:
- We utilize ClickHouse's built-in compression (e.g., LZ4 or ZSTD), which reduces the footprint by 5x to $\approx 3.45\text{ TB/day}$.
- We write raw events to a Materialized View that outputs aggregated hourly summaries to the
hourly_analytics_metricstable. - Raw events are expired and deleted after $7$ days using ClickHouse TTL definitions, while pre-aggregated metrics are retained indefinitely.
3. WebSocket Broadcast Egress Bandwidth
-
Assumptions:
- Concurrent active dashboards ($C$) = $50,000$
- Metric update frequency = $1$ update/second per dashboard
- Broadcast payload size = $100$ bytes
-
Calculations: $$\text{Total Updates Broadcasted} = 50,000\text{ clients} \times 1\text{ update/sec} = 50,000\text{ updates/sec}$$ $$\text{Egress Bandwidth} = 50,000 \times 100\text{ bytes} = 5,000,000\text{ B/sec} = 5\text{ MB/sec} \approx 40\text{ Mbps}$$
This egress rate is easily handled. If the update payload size increases to 1 KB or if the update frequency increases (e.g., broadcasting multiple metric topics), the egress bandwidth can scale to 400 Mbps. We distribute this load across a cluster of WebSocket gateways to prevent single-node interface saturation.
Failure Scenarios and Resilience
Real-time streaming platforms must maintain accuracy and availability even during partial outages.
1. Stream Processor Backpressure Cascade
If the ClickHouse database slows down due to a heavy query, Flink cannot write window aggregates fast enough.
- The Threat: Flink's internal queues fill up, triggering backpressure. The ingestion pipeline backs up to Kafka, and if buffers overflow, events are lost.
- Resilience Design:
- We configure Kafka with a $72$-hour retention window, allowing it to act as an elastic buffer.
- Flink uses credit-based flow control to propagate backpressure back to Kafka consumers.
- The collector service writes to Kafka asynchronously and remains unaffected by database slowdowns. Once ClickHouse recovers, Flink catches up by reading the buffered events from Kafka at maximum speed.
2. Stream Processor Node Crashes
A Flink processing node crashes while computing a sliding window aggregate.
- The Threat: The state of in-progress windows is lost, leading to inaccurate metrics when the node restarts.
- Resilience Design:
- Flink is configured with RocksDB State Store and periodic checkpointing (e.g., every 10 seconds).
- Checkpoints are saved to an external, distributed filesystem (such as HDFS or Amazon S3).
- When a node crashes, the cluster manager restarts the task, restores the state from the last successful checkpoint, and replays the Kafka events since that checkpoint, achieving exactly-once processing guarantees.
3. Late-Arriving Events in Sliding Windows
Due to network delays, a mobile client sends click events 2 minutes after they actually occurred.
- The Threat: If the window is already closed, these late events will be ignored, resulting undercounted metrics.
- Resilience Design:
- We use Flink Watermarks to manage late-arriving data. A watermark represents the progression of event time.
- We allow a configurable lateness window (e.g., 30 seconds). Events arriving within this window are processed.
- If an event arrives after the watermark, it is routed to a "dead-letter" Kafka topic for asynchronous correction or logged to ClickHouse as a late adjustment.
4. WebSocket Gateway Host Outages
A WebSocket gateway node hosting 10,000 active client connections crashes.
- The Threat: 10,000 clients lose their streams simultaneously, causing dashboards to display stale data.
- Resilience Design:
- The client dashboard SDK implements automated reconnection logic.
- Upon disconnection, clients wait for a randomized delay (to prevent a thundering herd on the remaining servers) and attempt to connect to the load balancer again.
- The load balancer routes the connections to the remaining healthy servers in the gateway pool.
Architectural Trade-offs
Choosing the components and data structures for real-time analytics requires balancing data accuracy against infrastructure costs.
Trade-off 1: Tumbling Windows vs. Sliding Windows
Tumbling windows group events in fixed, non-overlapping intervals (e.g., 9:00 to 9:05). Sliding windows group events in overlapping intervals (e.g., the last 5 minutes, moving forward every 10 seconds).
| Aspect | Tumbling Windows | Sliding Windows |
|---|---|---|
| Compute Overhead | Low (Each event is processed once) | High (Each event falls into multiple overlapping windows) |
| Memory State size | Low (State is cleared when the window closes) | High (Must store events for the duration of the slide) |
| User Experience | Steppy (Dashboard metrics update in block chunks) | Smooth (Metrics drift smoothly over time) |
Trade-off 2: Columnar ClickHouse vs. Key-Value InfluxDB
We compare ClickHouse (a general-purpose columnar database) with InfluxDB (a specialized key-value time-series database).
| Aspect | Columnar ClickHouse | InfluxDB |
|---|---|---|
| Aggregation Speed | Extremely High on massive datasets | High on simple queries; struggles on large scans |
| Write Compression | High (Column structure optimizes compression) | Medium (Index overhead limits compression) |
| Metadata Tag cardinality | High (Can handle millions of unique tags) | Low (High cardinality tags crash the index) |
| SQL Support | High (Supports standard SQL queries and joins) | Low (Uses custom query languages) |
Staff Engineer Perspective
Operating real-time analytics at scale requires managing streaming states and network egress.
Verbal Script
Interviewer: "How would you design a real-time analytics system that can calculate the count of unique users in the last 15 minutes, with a 10-second update interval, at a scale of 1,000,000 events per second?"
Candidate: "Calculating exact unique counts (cardinality) over sliding windows at this scale is a hard problem because we cannot store the list of user IDs in memory. Doing so would consume terabytes of RAM and crash the stream processor. Instead, we use HyperLogLog (HLL) sketches to estimate cardinality.
First, Flink consumes events from Kafka and groups them into 10-second tumbling windows. For each window, Flink maps user IDs to an HLL sketch (which consumes only a few kilobytes of state).
Second, we write these 10-second HLL sketches to ClickHouse. To query the unique user count for the last 15 minutes, the dashboard runs a query that aggregates the 90 HLL sketches from that window using ClickHouse's HLL merge function. This estimates unique counts with less than 1% error, consumes minimal CPU, and returns results in milliseconds."
Interviewer: "What happens if a Flink node experiences a garbage collection pause of 10 seconds, causing it to fall behind Kafka?"
Candidate: "When Flink experiences a pause, it falls behind the Kafka partition offset. Once the pause ends, the Flink coordinator detects the lag and increases consumer read speeds, consuming buffered events at maximum throughput.
To handle the burst of events, Flink allocates dynamic buffer sizes and uses credit-based flow control. If the lag continues, Flink allocates more task slots to the affected partitions to balance the load, catching up to the real-time stream without dropping events."
Interviewer: "How would you prevent a slow network connection on one client from slowing down the entire WebSocket gateway server?"
Candidate: "This is a classic Slow Consumer problem. If a client has a poor connection, the TCP socket buffer on the gateway fills up, and the gateway's write buffer blocks, consuming memory and thread resources.
To protect the gateway, we isolate client connection channels. The gateway writes updates to an internal ring buffer for each client. If a client's buffer fills up (e.g., it holds more than 50 unsent updates), the gateway drops the connection, releases the socket, and frees the resources, ensuring a slow client does not impact other users on the server."
Interviewer: "What is your strategy for backfilling historical data if the Flink aggregation code is updated with a new business rule?"
Candidate: "We use a Kappa Architecture with dual-reading paths.
When the business logic changes, we deploy the new Flink job side-by-side with the active job. We configure the new job to start reading from the beginning of the Kafka retention window (e.g., 7 days ago).
The new job writes its output to a temporary ClickHouse table. Once the new job catches up to the real-time stream, we swap the active table name alias in ClickHouse to point to the new table, and terminate the old Flink job. This backfills historical data with zero downtime and no database lock contention."