Lesson 101 of 105 12 minFlagship

System Design: Real-Time Chat Application at Scale

Design a real-time chat system like WhatsApp or Slack handling 1 billion messages per day. Covers WebSocket connection management, message delivery guarantees, presence detection, and storage.

Reading Mode

Hide the curriculum rail and keep the lesson centered for focused reading.

Key Takeaways

  • **WebSocket Persistent State:** Managing persistent, bidirectional TCP connections on stateful gateway clusters mapped globally using Redis.
  • **Cassandra Message Storage:** Designing monthly conversation partition buckets to prevent data clustering skew.
  • **Bulkhead Group Delivery:** Isolating large group chat message fan-out pipelines from latency-sensitive 1:1 message delivery lanes.
Recommended Prerequisites
System Design Interview Framework

Premium outcome

From vague architecture answers to staff-level trade-off thinking.

Backend engineers preparing for senior, staff, and architecture rounds.

What you unlock

  • A reusable system design answer framework for ambiguous prompts
  • Clear language for consistency, scaling, and reliability trade-offs
  • Case-study depth across feeds, payments, storage, and messaging systems

Mental Model

Designing a real-time chat system (comparable to WhatsApp, Slack, or Telegram) at a scale of 500 Million daily active users requires shifting from traditional stateless request-response models to a highly coordinated, stateful, event-driven architecture. Because message delivery must execute in less than 100ms globally, the system maintains millions of open, bidirectional WebSocket connections on stateful Gateway nodes. The primary architectural challenge is routing: when User A sends a message to User B, the system must dynamically locate which physical Gateway node holds User B's active TCP socket, route the packet across a pub/sub backplane, log the transaction durably in wide-column storage, and update real-time presence heartbeats.


Requirements and System Goals

To engineer an instant messaging platform of global scale, we must define quantitative performance and capacity boundaries.

1. Functional Requirements

  • Bidirectional 1:1 and Group Chats: Deliver real-time text messages, reactions, and attachments. Group chats must support up to 1,000 members.
  • Message Delivery States: Expose instant delivery receipt handshakes (Sent $\rightarrow$ Delivered $\rightarrow$ Read).
  • Real-time User Presence Status: Display online/offline states and "last seen" timestamps.
  • Durable Message History: Provide paginated, search-optimized message history queries spanning up to 5 years.

2. Non-Functional Requirements & Scale Budgets

  • Ultra-Low Latency Delivery SLA: Deliver messages end-to-end to online clients in less than 100ms.
  • Massive Concurrent Scale: Support 500 Million Daily Active Users (DAU). At an average of 1 Billion messages/day, the system handles an average of 11,600 messages/sec, with peak burst throughput targeting 35,000 messages/sec.
  • At-Least-Once Delivery Guarantee: Ensure that no message is silently dropped. Client-side deduplication keys are utilized to handle network reconnect retries.
  • Time-Ordered Consistency: Guarantee strict sequence ordering for all messages sent within an individual conversation.

API Interfaces and Service Contracts

Real-time chat platforms rely on binary or JSON WebSocket frame protocols for chat events, alongside standard REST APIs for session setup.

1. Send Message WebSocket Frame Contract

When a client sends a message over an active WebSocket socket, it packs the payload with a client-generated idempotency key.

WebSocket Frame Payload:

{
  "event": "send_message",
  "client_msg_id": "msg_8a2b3c4d-9988-7766-5544-33221100aabb",
  "conversation_id": "conv_9918237462",
  "sender_id": "usr_alpha_123",
  "content": "Let us finalize the system design plan.",
  "content_type": "text",
  "sent_timestamp_ms": 1780416350000
}

2. Delivery Receipt Handshake Frame

When the recipient's device receives the message, it returns an asynchronous acknowledgment frame back to the server, which is routed back to the original sender to trigger the "double checkmark" icon.

WebSocket Frame Payload:

{
  "event": "message_receipt",
  "message_id": "srv_msg_982736451",
  "conversation_id": "conv_9918237462",
  "recipient_id": "usr_beta_456",
  "sender_id": "usr_alpha_123",
  "status": "DELIVERED", // SENT, DELIVERED, READ
  "receipt_timestamp_ms": 1780416350050
}

High-Level Design and Visualizations

Decoupling stateful TCP connection gateways from stateless routing logic, event queues, and analytical stores is critical to scaling chat platforms without connection drops.

1. High-Level Stateful Gateway Architecture

This diagram outlines how stateful WebSocket servers maintain open sockets, using a global Redis presence registry to route messages across different nodes.

graph TD
    subgraph Client Tier
        ClientA[User A Client] -->|1. Active WebSocket| GW1[WebSocket Gateway Node 1]
        ClientB[User B Client] -->|1. Active WebSocket| GW2[WebSocket Gateway Node 2]
    end

    subgraph Presence Directory
        GW1 -->|2. Register Connection| RedisPresence[(Redis Presence Cluster)]
        GW2 -->|2. Register Connection| RedisPresence
    end

    subgraph Event Pipeline
        GW1 -->|3. Publish Outgoing Event| Kafka[Kafka Chat Event Queue]
    end

    subgraph Service and Storage Tier
        Kafka -->|4. Consume & Route| DeliverySvc[Message Delivery Service]
        Kafka -->|4. Persist Message| StorageSvc[Cassandra Storage Service]
        
        StorageSvc --> MessageStore[(Cassandra Cluster)]
        DeliverySvc -->|5. Query Host Gateway| RedisPresence
        DeliverySvc -->|6. Inter-Node gRPC Forward| GW2
    end

2. End-to-End Chat Ingestion and Receipt Flow

The sequence diagram below displays how a message is routed between users on different gateways, including fallback channels when users are offline.

sequenceDiagram
    autonumber
    participant A as User A Client
    participant GW1 as WebSocket Gateway 1
    participant Kafka as Kafka Event Bus
    participant Deliv as Message Delivery Service
    participant Redis as Redis Registry
    participant GW2 as WebSocket Gateway 2
    participant B as User B Client

    A->>GW1: Stream message frame (client_msg_id)
    GW1->>GW1: Verify authentication & check duplicates
    GW1->>Kafka: Publish event (conversationId partition key)
    GW1-->>A: Send server ACK (message received at edge)
    
    Deliv->>Kafka: Pull message event batch
    Deliv->>Redis: Query: Where is User B connected?
    
    rect rgb(240, 255, 240)
        Note over Deliv, B: Case A: Recipient Online (On Gateway 2)
        Redis-->>Deliv: Return Gateway_2 IP Address
        Deliv->>GW2: Forward message via gRPC
        GW2->>B: Stream WebSocket frame
        B-->>GW2: Return ACK (DELIVERED)
        GW2->>Kafka: Publish RECEIPT event
        Deliv->>Kafka: Pull RECEIPT event
        Deliv->>Redis: Query: Where is User A connected?
        Redis-->>Deliv: Return Gateway_1 IP Address
        Deliv->>GW1: Forward RECEIPT via gRPC
        GW1->>A: Stream WebSocket receipt frame (Double Check!)
    end

    rect rgb(255, 240, 240)
        Note over Deliv, B: Case B: Recipient Offline
        Redis-->>Deliv: Return null (No active socket)
        Deliv->>Deliv: Route to Offline Push Queue (FCM / APNs)
    end

Low-Level Design and Schema Strategies

To support years of chat history with high write volumes, the system uses wide-column persistence, and fast heartbeats are stored in memory.

1. Message History Storage Schema (Cassandra)

We utilize Apache Cassandra for storing message history. Cassandra's wide-column, log-structured merge-tree architecture is optimized for high write volumes and time-ordered range queries.

  • Partitioning Strategy: We partition by (conversation_id, time_bucket). The time_bucket represents a monthly slice (e.g. 2026-06) to ensure that a highly active conversation partition does not grow greater than 100MB, preventing read latency degradation.
  • Clustering Strategy: We cluster by message_id (a TIMEUUID) in descending order to fetch the latest messages instantly.
-- Core Message Storage Table
CREATE TABLE conversation_messages (
    conversation_id UUID,
    time_bucket INT, -- e.g. 202606 for June 2026
    message_id TIMEUUID, -- Unique, embeds timestamp, sortable
    sender_id UUID,
    content TEXT,
    content_type VARCHAR(16), -- 'text', 'image', 'reaction'
    media_url TEXT,
    is_deleted BOOLEAN,
    PRIMARY KEY ((conversation_id, time_bucket), message_id)
) WITH CLUSTERING ORDER BY (message_id DESC)
  AND compaction = {
      'class': 'TimeWindowCompactionStrategy',
      'compaction_window_unit': 'DAYS',
      'compaction_window_size': 7
  };

-- Indices for user conversation listings
CREATE TABLE user_conversations (
    user_id UUID,
    conversation_id UUID,
    last_message_id TIMEUUID,
    last_message_preview TEXT,
    unread_count INT,
    PRIMARY KEY (user_id, conversation_id)
);

2. Redis Presence and Session Mapping Structures

The stateless delivery engines lookup active sockets using Redis hashes and sorted sets for heartbeats.

-- Redis Key: presence:user_session:<user_id>
-- Redis Hash DataType
HSET presence:user_session:usr_beta_456 active_gateway "gw_node_2.corp.internal"
HSET presence:user_session:usr_beta_456 client_ip "192.168.1.50"
HSET presence:user_session:usr_beta_456 last_active 1780416350000

-- Redis TTL Key for Active Presence Status
-- Expired automatically if heartbeat is missed
SET presence:status:usr_beta_456 "online" EX 30

Scaling and Operational Challenges

1. Group Message Fan-Out Complexity Math

When a message is sent in a group chat, the system must deliver it to all members. This is the Fan-Out pipeline.

  • The Mathematics of Fan-Out:
    • Suppose User A sends a message to a group containing $N = 1,000$ members.
    • The Delivery Service must fetch the members list, query Redis $N$ times to find each member's gateway host, and execute $N$ network forwards.
    • Peak Burst Scale: If 1,000 groups each send 1 message per second: $$\text{Fan-Out writes} = 1,000 \times 1,000 = 1,000,000 \text{ operations/sec}$$
    • A single un-isolated delivery queue will instantly saturate, delaying 1:1 messages by minutes.
  • The Bulkhead Solution:
    • We isolate delivery queues using Bulkheads. 1:1 messages are routed through high-priority delivery queues.
    • Group messages are routed through low-priority, partitioned queues.
    • Inter-Gateway Coalescing: Instead of making $N$ separate network calls to Gateway Node 2 for 50 members who happen to be connected to Gateway 2, the delivery service groups the recipients by gateway host. It executes exactly one gRPC payload containing the 50 recipient IDs to Gateway Node 2, reducing global network overhead from $O(N)$ to $O(G)$ where $G$ is the number of active gateway servers.

2. Redis Idempotency Deduplication

At-least-once message delivery means clients will retry message uploads under poor network conditions, causing duplicates.

  • The Deduplication Strategy:
    • Before saving to Cassandra, the API Gateway parses the client-generated client_msg_id.
    • We store a deduplication key in Redis: dedup:conv_id:client_msg_id with a value of srv_msg_id.
    • The Lifecycle: We set a 24-hour TTL on this key.
    • If a retried client_msg_id arrives within 24 hours, the gateway blocks duplicate database writes and instantly returns the existing srv_msg_id ACK, ensuring a clean user interface.

Real-Time Messaging Systems Trade-offs

Choosing a messaging architecture requires selecting between delivery latency, database durability, and ordering guarantees.

Architectural Dimension Log-Based Event Brokers (Kafka) Memory-Based Pub/Sub (Redis) Push-Based Queues (RabbitMQ)
Message Ordering Strict (Guaranteed within partition via key) None (No native partition ordering) Fair (Depends on consumer pool scale)
Durability Tier Extreme (Written to persistent disk blocks) Low (Primarily stored in volatile RAM) Medium (Buffered on local host disk)
Throughput Capacity Ultra-High (Peak throughput > 1M writes/sec) High (RAM bound scale) Medium (Slower due to routing logic)
System Complexity High (Requires managing partitions and offset state) Very Low (Simple HSET and publish/subscribe) Medium (Requires AMQP topology setups)
Best Use Case Core persistent message stream pipelines. Volatile, high-frequency presence heartbeat broadcasts. Enterprise business task routing, single-consumer queues.

Failure Modes and Fault Tolerance Strategies

1. WebSocket Gateway Server Crashes

If a stateful Gateway Node holding 50,000 active connections crashes due to hardware failure, 50,000 TCP sockets are dropped instantly.

  • The Fault Tolerance Blueprint:
    • The clients detect connection loss (via TCP keepalives or missed application-level pings).
    • Clients enter an Exponential Backoff Reconnect Loop with random mathematical jitter to prevent a thundering herd stampede from crashing the remaining healthy gateway nodes.
    • Upon reconnecting to a new gateway node (e.g. Gateway Node 5), the client sends its last received message_id.
    • Gateway 5 registers the new connection mapping in Redis, fetches missed offline messages from Cassandra, and streams them to the client, guaranteeing zero lost messages.

2. Network Partition Client-Reconnect Stampedes

When a localized cellular carrier experiences an outage and recovers, millions of clients attempt to reconnect and authenticate simultaneously, saturating the auth microservice's CPU.

  • The Mitigation Strategy:
    • The Gateway Nodes execute Pre-Auth Handshake Tokens.
    • When a client first logs in, it receives a short-lived token.
    • During reconnects, the client passes this token directly to the WebSocket gateway handshake.
    • The gateway validates the token locally in memory using cryptographic signatures (JWT verification) without hitting the database or authorization service, shielding the core system from CPU starvation.

Staff Engineer Perspective


Production Readiness Checklist

Before launching your real-time chat application to millions of active users, verify:

  • Kernel Tunings Applied: Confirm file descriptor limits (ulimit -n) are set to at least 1,000,000 on all gateway nodes.
  • TimeWindowCompaction Active: Verify Cassandra tables use TimeWindowCompactionStrategy to optimize time-based compactions.
  • Presence Heartbeat TTLs: Configure client heartbeats to send every 20 seconds, with Redis status TTL set to 30 seconds.
  • Bulkhead Ingest Partitioning: Ensure group message fan-out pipelines utilize isolated thread pools from 1:1 message routes.


Verbal Script

Interviewer: "How would you design a highly scalable real-time chat application like WhatsApp or Slack handling 1 Billion messages per day, and how do you handle message routing and delivery guarantees?"

Candidate: "To design a real-time chat system capable of handling 500 Million daily active users and 1 Billion messages per day, I would build an architecture optimized for stateful connections and event-driven decoupled pipelines. The core constraint is persistent connectivity: we deploy a stateful WebSocket Gateway cluster where each node maintains up to 50,000 open TCP sockets.

When User A sends a message to User B over their WebSocket socket, the gateway intercepts the frame. The gateway immediately publishes the message event to a Kafka cluster, partitioned by conversation_id. Partitioning by conversation_id is crucial because it guarantees that all messages sent within the same conversation are ordered sequentially on the same broker.

The message is consumed by two independent services in parallel. First, a Storage Service writes the message durably to Apache Cassandra. We partition the Cassandra table by (conversation_id, time_bucket) where the time_bucket represents a monthly slice. This capping strategy prevents any single partition from growing greater than 100MB, maintaining fast read performance.

Second, the Message Delivery Service handles routing. It queries a global Redis Presence Cluster to locate where User B is connected.

If User B is online on a different server, the delivery service forwards the message to that specific gateway node via gRPC. The gateway streams the frame over User B's open socket. When User B's client receives the message, it returns an asynchronous acknowledgment frame, which is routed back to User A, updating their UI with a 'delivered' checkmark.

If User B is offline, the service routes the message to an offline push queue, triggering an FCM or APNs push notification.

To scale group chats up to 1,000 members without bottlenecking 1:1 messages, I would implement two strategies. First, we use Bulkhead isolation, routing group pipelines through dedicated lower-priority queues. Second, we implement Inter-Gateway Coalescing: when delivering a group message, we group the recipients by their active gateway hosts. Instead of making $N$ individual network calls, the delivery service executes exactly one bulk gRPC payload to each target gateway node, reducing inter-node network traffic from $O(N)$ down to $O(G)$ where $G$ is the number of active gateway servers.

Finally, we guarantee at-least-once delivery and handle client reconnect retries by implementing a Redis-based idempotency layer. The gateway caches client-generated message IDs with a 24-hour TTL, blocking duplicate writes at the edge and ensuring absolute data consistency."

Want to track your progress?

Sign in to save your progress, track completed lessons, and pick up where you left off.