System Design

System Design: Designing a Distributed Message Queue (Kafka Architecture)

How does a message queue handle millions of events per second with 100% durability? A technical deep dive into Log Segments, Consumer Groups, and Zero-Copy.

Sachin Sarawgi·April 20, 2026·3 min read
#system-design#kafka#message-queue#distributed-systems#streaming#scalability

System Design: Designing a Distributed Message Queue

A Distributed Message Queue is the backbone of modern asynchronous architecture. It allows services to communicate without being tightly coupled. While many use Apache Kafka, understanding how to design one from scratch is a common and challenging system design topic.

1. Core Requirements

  • High Throughput: Handling millions of messages per second.
  • Durability: Messages must not be lost even if a broker crashes.
  • Scalability: Horizontal scaling by adding more brokers.
  • Ordering: Ensuring messages are consumed in the order they were produced (at least within a partition).

2. The Storage Model: Append-Only Log

Traditional databases use B-Trees. For a message queue, a B-Tree is too slow because of random I/O.

  • The Solution: Use an Append-Only Log. Every new message is simply appended to the end of a file on disk.
  • Efficiency: Sequential writes are significantly faster than random writes.
  • Immutability: Once written, a message never changes. This simplifies replication and caching.

3. Scalability: Partitions

A single log file cannot scale across multiple servers.

  • The Solution: Split a "Topic" into multiple Partitions.
  • Distribution: Each partition can live on a different broker.
  • Throughput: Multiple consumers can read from different partitions in parallel, increasing the overall throughput of the topic.

4. The Consumer Group Model

How do you ensure that multiple instances of a service don't process the same message twice?

  • Offset Management: Each consumer group maintains an "Offset" (a pointer) for each partition. It tracks which message the group has processed so far.
  • Rebalancing: If a new consumer joins the group, the system redistributes the partitions among the available consumers.

5. High Availability: Replication

To ensure data isn't lost, each partition is replicated across N brokers.

  • Leader/Follower: One broker is the Leader for a partition (handles all reads/writes), and others are Followers (replicate data).
  • ISR (In-Sync Replicas): A write is only considered successful when it has been replicated to all members of the ISR.

6. Performance Optimization: Zero-Copy

Moving data from disk to the network usually involves multiple copies between kernel space and application space.

  • Zero-Copy: The system uses the sendfile system call to move data directly from the OS Page Cache to the Network Card (NIC), bypassing the application entirely. This drastically reduces CPU and memory usage.

Summary

Building a message queue is about optimizing for Sequential I/O. By treating the queue as a distributed append-only log and leveraging Zero-Copy for delivery, you can build a system that powers the real-time data needs of the world's largest companies.

📚

Recommended Resources

Designing Data-Intensive ApplicationsBest Seller

The definitive guide to building scalable, reliable distributed systems by Martin Kleppmann.

View on Amazon
Kafka: The Definitive GuideEditor's Pick

Real-time data and stream processing by Confluent engineers.

View on Amazon
Apache Kafka Series on Udemy

Hands-on Kafka course covering producers, consumers, Kafka Streams, and Connect.

View Course

Practical engineering notes

Get the next backend guide in your inbox

One useful note when a new deep dive is published: system design tradeoffs, Java production lessons, Kafka debugging, database patterns, and AI infrastructure.

No spam. Just practical notes you can use at work.

Sachin Sarawgi

Written by

Sachin Sarawgi

Engineering Manager and backend engineer with 10+ years building distributed systems across fintech, enterprise SaaS, and startups. CodeSprintPro is where I write practical guides on system design, Java, Kafka, databases, AI infrastructure, and production reliability.

Found this useful? Share it: