System Design: Real-Time Chat Application at Scale

Real-time chat systems are among the most architecturally interesting distributed systems. They require persistent connections at massive scale, exactly-once message delivery guarantees, presence detection across millions of users, and message history queries that span years. This article designs a system comparable to WhatsApp or Slack.

Requirements

Before designing anything, you need to quantify the problem precisely. Requirements drive every architectural decision that follows — the choice of database, the number of Kafka partitions, the size of the connection pool. Notice that the numbers below aren't arbitrary; each one is derived from real-world usage patterns and drives specific design choices.

Functional:

1:1 messaging and group chats (up to 1000 members)
Real-time delivery (< 100ms end-to-end for online users)
Message delivery receipts (sent → delivered → read)
Message history (searchable, up to 5 years)
User presence (online/offline, last seen)
Message types: text, images, files, reactions

Non-Functional:

500M daily active users, 1B messages/day (~11,600 msg/sec average, 3x peak = 35,000/sec)
Message delivery guarantee: at-least-once (deduplication client-side)
Message ordering: within a conversation, strict order
Storage: ~100 bytes/message × 1B/day × 365 days × 5 years = 180TB

High-Level Architecture

The architecture is driven by one fundamental constraint: with 500M daily active users and real-time delivery requirements, you cannot use traditional request-response HTTP. Each online user needs a persistent, low-latency connection. The diagram below shows how the system is organized around this constraint.

Mobile/Web Client
    │
    │ WebSocket (persistent connection)
    ▼
┌──────────────────────────────────────────────────────────┐
│                   WebSocket Gateway Cluster              │
│   (stateful: each server holds N open WebSocket conns)  │
│   ws-1: [user_A, user_B, user_C, ...]                    │
│   ws-2: [user_D, user_E, user_F, ...]                    │
└────────────────────┬─────────────────────────────────────┘
                     │ Publish messages to Kafka
                     ▼
┌──────────────────────────────────────────────────────────┐
│          Kafka (message bus, ordered per partition)      │
│   Topic: chat-messages, partitioned by conversation_id   │
└────────────────────┬─────────────────────────────────────┘
                     │
         ┌───────────┼───────────────┐
         ▼           ▼               ▼
   Message Service  Storage Service  Notification Service
   (delivery,       (Cassandra)      (APNs, FCM for
    routing)                          offline users)

Redis: User → WebSocket server mapping (presence registry)

The WebSocket gateways are stateful — each server maintains thousands of open connections in memory. This statefulness creates the core routing challenge: when User A sends to User B, the system must find which server holds User B's connection. Redis serves as the global directory that maps user IDs to server IDs.

WebSocket Connection Management

The key challenge: when User A sends to User B, which WebSocket server holds User B's connection?

Your WebSocket gateway is like an airport's gate assignment system. Each gate (server) handles a set of flights (connections), and the central directory (Redis) tells everyone which gate a given flight is at. When a new connection lands, you register it. When it leaves, you deregister it. The code below handles the full connection lifecycle.

// WebSocket Gateway — each server handles 50,000-100,000 persistent connections
@Component
public class ChatWebSocketHandler extends TextWebSocketHandler {

    @Autowired
    private UserConnectionRegistry connectionRegistry;

    @Autowired
    private KafkaTemplate<String, ChatMessage> kafkaTemplate;

    // In-process connection map: userId → WebSocket session
    private final ConcurrentHashMap<String, WebSocketSession> localConnections =
        new ConcurrentHashMap<>();

    @Override
    public void afterConnectionEstablished(WebSocketSession session) {
        String userId = extractUserId(session);  // From JWT in handshake header

        localConnections.put(userId, session);

        // Register in Redis: userId → this server's ID
        connectionRegistry.register(userId, getServerId());

        // Send buffered messages (messages received while user was offline)
        sendOfflineMessages(userId, session);
    }

    @Override
    protected void handleTextMessage(WebSocketSession session, TextMessage message) {
        String userId = extractUserId(session);
        ChatMessage chatMessage = parseMessage(message, userId);

        // Publish to Kafka (partitioned by conversationId for ordering)
        kafkaTemplate.send(
            "chat-messages",
            chatMessage.getConversationId(),  // partition key — ensures ordering
        chatMessage
        );

        // Ack to sender immediately
        sendAck(session, chatMessage.getClientMessageId());
    }

    @Override
    public void afterConnectionClosed(WebSocketSession session, CloseStatus status) {
        String userId = extractUserId(session);
        localConnections.remove(userId);
        connectionRegistry.unregister(userId);
        updateLastSeen(userId);
    }

    // Deliver message to a local user (called by Message Delivery Service)
    public boolean deliverLocally(String userId, ChatMessage message) {
        WebSocketSession session = localConnections.get(userId);
        if (session == null || !session.isOpen()) return false;

        try {
            session.sendMessage(new TextMessage(serialize(message)));
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}

Notice that when a message arrives from a client, it's immediately published to Kafka rather than routed directly to the recipient. This decoupling is intentional — Kafka provides durability (the message is persisted even if the delivery service crashes), ordering (messages in the same conversation stay in order), and fan-out (multiple consumers like storage and notification services process the same event).

Message Delivery Service

With messages flowing through Kafka, the delivery service consumes them and routes each one to the right WebSocket server. This service handles the three cases that arise in any real deployment: the recipient is online on this server, on a different server, or offline entirely.

// Kafka consumer: route messages to the right WebSocket server
@Service
public class MessageDeliveryService {

    @Autowired
    private UserConnectionRegistry connectionRegistry;  // Redis

    @Autowired
    private Map<String, ChatWebSocketHandler> wsHandlers;  // Local + remote stubs

    @Autowired
    private OfflineMessageStore offlineStore;

    @KafkaListener(
        topics = "chat-messages",
        concurrency = "12"  // One thread per Kafka partition
    )
    public void deliver(ChatMessage message) {
        List<String> recipients = getRecipients(message);  // From conversation members

        for (String recipientId : recipients) {
            String serverIdForRecipient = connectionRegistry.getServer(recipientId);

            if (serverIdForRecipient == null) {
                // User is offline: store for later delivery + send push notification
                offlineStore.store(recipientId, message);
                pushNotificationService.send(recipientId, message);
            } else if (serverIdForRecipient.equals(getServerId())) {
                // User is on THIS server: deliver directly
                boolean delivered = localWsHandler.deliverLocally(recipientId, message);
                if (!delivered) {
                    // Session closed between check and delivery
                    offlineStore.store(recipientId, message);
                }
            } else {
                // User is on a DIFFERENT server: route via internal HTTP/gRPC
                deliverToServer(serverIdForRecipient, recipientId, message);
            }
        }

        // Store in Cassandra (message history)
        messageStore.save(message);

        // Update delivery status
        updateDeliveryStatus(message.getId(), recipients, DeliveryStatus.DELIVERED);
    }
}

The concurrency = "12" setting means 12 threads consume from 12 Kafka partitions in parallel — matching consumer threads to partitions is the correct way to maximize Kafka throughput. The fallback to offline storage when a session has closed between the Redis check and the delivery attempt is an important edge case — without it, messages would silently drop during the brief window of a connection closing.

User Presence with Redis

Presence detection is the feature users take for granted but which requires careful engineering. The approach below uses Redis TTL as the mechanism — a heartbeat keeps the key alive, and natural expiry signals the user has gone offline. This is simpler and more scalable than maintaining a connection registry with explicit disconnect events, which can be missed during network failures.

@Service
public class PresenceService {

    @Autowired
    private StringRedisTemplate redis;

    private static final String ONLINE_KEY_PREFIX = "presence:online:";
    private static final Duration ONLINE_TTL = Duration.ofSeconds(30);

    // Called by WebSocket heartbeat every 20 seconds
    public void heartbeat(String userId) {
        redis.opsForValue().set(
            ONLINE_KEY_PREFIX + userId,
            Instant.now().toString(),
            ONLINE_TTL
        );
    }

    public boolean isOnline(String userId) {
        return redis.hasKey(ONLINE_KEY_PREFIX + userId);
    }

    // Batch check: are these 100 users online?
    public Map<String, Boolean> getBulkPresence(List<String> userIds) {
        List<String> keys = userIds.stream()
            .map(id -> ONLINE_KEY_PREFIX + id).toList();

        List<String> values = redis.opsForValue().multiGet(keys);

        Map<String, Boolean> result = new HashMap<>();
        for (int i = 0; i < userIds.size(); i++) {
            result.put(userIds.get(i), values.get(i) != null);
        }
        return result;
    }

    // Last seen: stored in Redis when user goes offline
    public void markOffline(String userId) {
        redis.delete(ONLINE_KEY_PREFIX + userId);
        redis.opsForValue().set(
            "presence:lastseen:" + userId,
            Instant.now().toString()
        );
    }
}

The getBulkPresence method using multiGet is a critical optimization for group chats — instead of making 100 Redis round trips to check 100 members' presence, you make one. In a 1000-member group chat, the difference between single and batch presence checks is the difference between acceptable and unusable latency.

Message Storage: Cassandra

Messages require high write throughput and time-ordered reads per conversation.

Relational databases aren't a good fit for chat message storage at this scale. The access pattern is extremely predictable — you always fetch the latest N messages for a specific conversation — and you need write throughput far beyond what a single PostgreSQL instance can handle. Cassandra's data model is designed exactly for this: partition by a natural key (conversation ID + time bucket) and cluster by time within each partition.

-- Cassandra schema (optimized for "get last 50 messages in conversation X")
CREATE TABLE messages (
    conversation_id  UUID,
    bucket           INT,          -- Time bucket (year-month): limits partition size
    message_id       TIMEUUID,     -- TimeUUID: unique + ordered by time
    sender_id        UUID,
    content          TEXT,
    message_type     TEXT,         -- text, image, file, reaction
    metadata         MAP<TEXT, TEXT>,
    deleted_at       TIMESTAMP,    -- Soft delete

    PRIMARY KEY ((conversation_id, bucket), message_id)
) WITH CLUSTERING ORDER BY (message_id DESC)
  AND compaction = {'class': 'TimeWindowCompactionStrategy',
                    'compaction_window_unit': 'DAYS',
                    'compaction_window_size': 7};

-- Query: latest 50 messages in conversation
SELECT * FROM messages
WHERE conversation_id = ? AND bucket = ?
ORDER BY message_id DESC
LIMIT 50;

-- Bucket calculation: ensures no partition grows unbounded
-- bucket = year * 100 + month (e.g., 202502 for Feb 2025)
-- Active conversations: query current bucket + previous if needed

The bucket column is the key design insight here — without it, a very active conversation could accumulate millions of rows in a single Cassandra partition, which degrades read and compaction performance. By bucketing per month, you cap each partition at roughly one month's worth of messages per conversation. TimeWindowCompactionStrategy then efficiently compacts data within time windows, which matches the write pattern perfectly.

Message Deduplication

At-least-once delivery means duplicates are possible. Handle client-side:

Because the system guarantees at-least-once delivery (not exactly-once), the same message can arrive at a client more than once — for example, during a network reconnect where the client re-requests messages from the last received position. The deduplication service below prevents duplicates from appearing in the UI by tracking which client-generated IDs have already been processed.

// Each message has a client-generated idempotency key
// Client generates: clientMessageId = UUID.randomUUID()
// Server stores: (conversationId, clientMessageId) → messageId

@Service
public class MessageDeduplicationService {

    @Autowired
    private RedisTemplate<String, String> redis;

    private static final Duration DEDUP_TTL = Duration.ofHours(24);

    public Optional<String> isDuplicate(String conversationId, String clientMessageId) {
        String key = "dedup:" + conversationId + ":" + clientMessageId;
        String existingMessageId = redis.opsForValue().get(key);
        return Optional.ofNullable(existingMessageId);
    }

    public void markProcessed(String conversationId, String clientMessageId, String messageId) {
        String key = "dedup:" + conversationId + ":" + clientMessageId;
        redis.opsForValue().set(key, messageId, DEDUP_TTL);
    }
}

The 24-hour TTL is a deliberate trade-off — it covers the window in which a duplicate is likely to arrive (network retries, reconnects), while preventing unlimited Redis growth. Messages older than 24 hours will simply be re-stored if re-delivered, which is acceptable because the client can deduplicate by message_id in its local database.

Scalability Analysis

With the components understood, here's how each one scales to meet the requirements. These numbers give you a concrete target for capacity planning.

Component         Scale           Technology
─────────────────────────────────────────────────────────
WebSocket GW      50K conn/server  Netty/Spring WebFlux
                  → 10K servers for 500M users
                  → Use consistent hashing to route users

Kafka             35K msg/sec peak  30 partitions (1K/partition)
                  Retention: 24h for real-time delivery

Cassandra         35K writes/sec    10 nodes, RF=3
                  100TB/year        Use TTL for 5-year retention

Redis (presence)  500M keys         Redis Cluster, 6 shards
                  30s TTL → natural expiry

Message routing   gRPC between GW   P2P mesh for intra-cluster
                  servers           pub/sub for cross-cluster

CDN               Images/files      S3 + CloudFront
                  Pre-signed URLs   Direct upload from client

Guarantees and Trade-offs

Every distributed system design involves deliberate trade-offs. The choices below reflect the reality that chat applications prioritize availability — users should always be able to send messages, even under network partitions — over strict consistency guarantees that would require coordination overhead.

Message ordering:
  Within a conversation: guaranteed (Kafka partitioned by conversationId,
                                     Cassandra TIMEUUID ordering)
  Across conversations: best-effort

Delivery guarantee:
  Online users: at-least-once (deduplicated client-side)
  Offline users: at-least-once (stored + retried)
  No guarantee of exactly-once end-to-end (intentional trade-off for performance)

Consistency:
  Message storage: eventual (Cassandra RF=3, quorum reads)
  Delivery status: eventual (Redis, replicated)
  Group membership: strong (PostgreSQL for group metadata)

CAP theorem position: AP (availability + partition tolerance)
  During network partition: accept writes (store in Kafka),
  deliver when partition heals → may deliver out of order
  Trade-off: always available vs always consistent

The hardest part of chat system design isn't the message passing — it's the edge cases. What happens when a user is on 3 devices? (deliver to all) What if a conversation has 1000 members and all are online? (fan-out at scale requires message fan-out service) What if a server crashes mid-delivery? (Kafka offset management + at-least-once). Design the happy path first, then systematically find failure modes.

System Design: Real-Time Chat Application at Scale

Requirements

High-Level Architecture

WebSocket Connection Management

Message Delivery Service

User Presence with Redis

Message Storage: Cassandra

Message Deduplication

Scalability Analysis

Guarantees and Trade-offs

Recommended Resources

Related Articles

Building Production Observability with OpenTelemetry and Grafana Stack

Event Sourcing and CQRS in Production: Beyond the Theory

gRPC vs REST vs GraphQL: Choosing the Right API Protocol