Apache Kafka is the de facto standard for event streaming in distributed systems, but most developers treat it as a black box — a durable message queue with a fancy name. Understanding Kafka's internals unlocks its true potential: predictable performance at scale, reliable exactly-once processing, and horizontal scalability without coordination overhead.
This article goes deep on partitions, offsets, consumer groups, and replication — with production-grade Java examples.
Why Kafka Is Not a Message Queue
Traditional message queues like RabbitMQ deliver messages to consumers and delete them after acknowledgment. Kafka's fundamental design is different: it is a distributed, partitioned, replicated commit log.
Traditional Queue: Kafka Log:
Producer → [Queue] → Consumer Producer → [Partition Log]
(deleted after ACK) offset 0: event
offset 1: event
offset 2: event ← Consumer A reads here
offset 3: event ← Consumer B reads here
(retained for configurable time)
This distinction matters enormously. With Kafka:
- Multiple consumer groups can independently read the same data at their own pace
- Reprocessing is trivial — reset the offset and replay
- Time travel is possible — query data from any point in history
- Throughput is predictable — sequential disk writes are fast and consistent
Partition Anatomy
Every Kafka topic is divided into one or more partitions. A partition is an ordered, immutable sequence of records — a physical append-only log file on disk.
Topic: "order-events" (4 partitions, replication factor 3)
Partition 0: [ev0][ev1][ev2][ev3][ev4]... → Leader: Broker 1
Replicas: Broker 2, Broker 3
Partition 1: [ev0][ev1][ev2]... → Leader: Broker 2
Replicas: Broker 1, Broker 3
Partition 2: [ev0][ev1][ev2][ev3]... → Leader: Broker 3
Replicas: Broker 1, Broker 2
Partition 3: [ev0][ev1]... → Leader: Broker 1
Replicas: Broker 2, Broker 3
Key properties:
- Ordering is guaranteed within a partition, not across partitions
- Parallelism scales with partition count — more partitions = more consumers
- Messages are routed to partitions by key (default: round-robin if no key)
Partition Key Selection
Your partition key determines which events land in the same partition. Events in the same partition are guaranteed to be processed in order.
// All events for the same orderId go to the same partition
// This ensures order-placed, order-paid, order-shipped are processed in sequence
ProducerRecord<String, OrderEvent> record = new ProducerRecord<>(
"order-events",
orderId, // partition key — hash(orderId) % numPartitions
orderEvent
);
producer.send(record);
// Bad key choice: random UUID or timestamp — destroys ordering
// Good key choices: userId, orderId, deviceId, sessionId
The In-Sync Replica (ISR) Set
Kafka uses leader-based replication. Each partition has one leader and N-1 followers (replicas). All reads and writes go through the leader.
The In-Sync Replica (ISR) set is the subset of replicas that are fully caught up with the leader. A replica falls out of ISR if it lags more than replica.lag.time.max.ms (default: 30 seconds).
Partition Leader (Broker 1): offset 150
ISR = {Broker 1, Broker 2, Broker 3} ← All caught up
Scenario: Broker 3 network hiccup, lags by 45 seconds
ISR = {Broker 1, Broker 2} ← Broker 3 removed from ISR
Scenario: Broker 1 crashes
New leader elected from ISR: Broker 2
ISR = {Broker 2} ← Only Broker 2 was in-sync
Producer Configuration: Durability vs Throughput
The acks setting controls when the producer considers a write successful:
Properties producerProps = new Properties();
producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092");
producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, JsonSerializer.class.getName());
// acks=0: Fire and forget — fastest, data loss possible
// acks=1: Leader ACK — default, leader crash before replication = data loss
// acks=all (or -1): All ISR ACK — safest, use for critical data
producerProps.put(ProducerConfig.ACKS_CONFIG, "all");
// Prevent duplicate messages on retry
producerProps.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
// Max in-flight requests per connection (must be 1 for ordering with retries, unless idempotent)
producerProps.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 5); // safe with idempotence
// Batching: wait up to 20ms for batch to fill before sending
producerProps.put(ProducerConfig.LINGER_MS_CONFIG, 20);
producerProps.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536); // 64KB batch
// Compression reduces network IO by 5-7x for JSON
producerProps.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
Throughput numbers (rough benchmarks on commodity hardware):
acks=0: ~1M records/secacks=1: ~500K records/secacks=all+min.insync.replicas=2: ~200K records/sec
The tradeoff is explicit: more durability = lower throughput.
Offsets and Consumer Position
Every record in a partition has an offset — a monotonically increasing integer starting at 0. Offsets are Kafka's way of tracking consumer position.
Properties consumerProps = new Properties();
consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processor-v1");
consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, JsonDeserializer.class.getName());
// auto.offset.reset: what to do when no committed offset exists
// "earliest": read from beginning (replay all history)
// "latest": read only new messages (default)
consumerProps.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
// Disable auto-commit: commit manually after processing
consumerProps.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
KafkaConsumer<String, OrderEvent> consumer = new KafkaConsumer<>(consumerProps);
consumer.subscribe(List.of("order-events"));
try {
while (true) {
ConsumerRecords<String, OrderEvent> records = consumer.poll(Duration.ofMillis(100));
for (ConsumerRecord<String, OrderEvent> record : records) {
try {
processOrder(record.value());
// Only commit AFTER successful processing
// This prevents losing events on consumer crash
} catch (Exception e) {
// Dead-letter queue or retry logic here
log.error("Failed to process {}, offset {}", record.key(), record.offset(), e);
}
}
// Synchronous commit: blocks until broker confirms
// Use commitAsync() for higher throughput if at-least-once is acceptable
consumer.commitSync();
}
} finally {
consumer.close();
}
Auto-Commit vs Manual Commit
| Auto-Commit | Manual Commit | |
|---|---|---|
| Config | enable.auto.commit=true |
enable.auto.commit=false |
| Commits every | auto.commit.interval.ms (5s default) |
After you call commitSync()/commitAsync() |
| Risk | Commits before processing = message loss | Your responsibility |
| Use case | Low-stakes analytics | Financial transactions, critical processing |
Consumer Groups: Horizontal Scaling
A consumer group is a set of consumers that share the work of consuming a topic. Kafka assigns each partition to exactly one consumer in the group.
Topic: "order-events" with 6 partitions
Consumer Group: "order-processor" (3 consumers)
Consumer 1 → Partition 0, Partition 1
Consumer 2 → Partition 2, Partition 3
Consumer 3 → Partition 4, Partition 5
If Consumer 2 crashes:
Consumer 1 → Partition 0, Partition 1, Partition 2
Consumer 3 → Partition 3, Partition 4, Partition 5
(Kafka triggers rebalance within session.timeout.ms)
Scaling rules:
- More consumers than partitions = some consumers are idle (wasted resources)
- More partitions than consumers = each consumer handles multiple partitions
- Max parallelism = partition count
The implication of the last rule is important: if you have 4 partitions and deploy 8 consumer instances, 4 of them will sit idle doing nothing. Kafka assigns one consumer per partition within a group — it doesn't split a single partition across consumers. This is why you should provision partitions generously at topic creation time. Kafka does not allow reducing partition count, and increasing it later can change the key-to-partition mapping, breaking ordering guarantees for existing keys.
Scaling out is simple: start another consumer instance with the same group.id. Kafka automatically triggers a rebalance and redistributes partitions across all active consumers. You can go from 3 to 6 consumers handling a 6-partition topic with zero configuration changes.
// Scale by starting more consumer instances with the same group.id
// Each instance handles different partitions automatically — no config changes needed
// To check current partition assignments and consumer lag:
// bin/kafka-consumer-groups.sh --bootstrap-server broker1:9092 \
// --describe --group order-processor
//
// Output shows:
// GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG
// order-processor order-events 0 1024 1050 26 ← 26 unprocessed
// order-processor order-events 1 876 876 0 ← fully caught up
A lag of 26 on partition 0 means the consumer is 26 messages behind the producer. A lag of 0 means the consumer is keeping up in real time. Growing lag is the first sign that you need more consumers or faster processing logic.
Exactly-Once Semantics
Kafka 0.11+ supports exactly-once processing through idempotent producers and transactions.
producerProps.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
producerProps.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "order-processor-1"); // unique per producer
KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps);
producer.initTransactions();
KafkaConsumer<String, OrderEvent> consumer = // ... configured as above
try {
ConsumerRecords<String, OrderEvent> records = consumer.poll(Duration.ofMillis(100));
producer.beginTransaction();
try {
for (ConsumerRecord<String, OrderEvent> record : records) {
OrderResult result = processOrder(record.value());
// Produce result to output topic
producer.send(new ProducerRecord<>("order-results", record.key(), result.toJson()));
}
// Atomically commit offsets and produce — either both happen or neither
Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
records.partitions().forEach(tp -> {
long lastOffset = records.records(tp).get(records.records(tp).size() - 1).offset();
offsets.put(tp, new OffsetAndMetadata(lastOffset + 1));
});
producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
producer.commitTransaction();
} catch (Exception e) {
producer.abortTransaction();
throw e;
}
}
Monitoring Consumer Lag
Consumer lag is the most critical Kafka operational metric. Lag = leader offset − consumer committed offset. It tells you how far behind the consumer is from the latest data the producer has written.
A lag of zero means the consumer is processing events in real time. A lag of 10,000 means there are 10,000 unprocessed events sitting in the partition waiting to be consumed. If that number is growing, your consumers cannot keep up with the incoming load. If it's stable, consumers are processing as fast as events arrive. If it's shrinking, consumers are catching up after a backlog.
Why does lag grow? The two most common reasons are: (1) processing logic got slower — perhaps a downstream database query that used to take 5ms now takes 500ms; or (2) producer throughput increased — a marketing email triggered a spike in user activity that doubled event volume. Monitoring lag gives you early warning before either of these causes a user-visible delay.
// Programmatic lag check — useful for building custom alerting
AdminClient adminClient = AdminClient.create(Map.of(
AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"
));
Map<TopicPartition, OffsetAndMetadata> committed = adminClient
.listConsumerGroupOffsets("order-processor")
.partitionsToOffsetAndMetadata()
.get();
Map<TopicPartition, Long> endOffsets = consumer.endOffsets(committed.keySet());
long totalLag = 0;
for (Map.Entry<TopicPartition, Long> entry : endOffsets.entrySet()) {
OffsetAndMetadata committedOffset = committed.get(entry.getKey());
long lag = entry.getValue() - (committedOffset != null ? committedOffset.offset() : 0);
totalLag += lag;
log.info("Partition {}: lag = {}", entry.getKey(), lag);
}
// Alert threshold: if totalLag > maxAcceptableLag, trigger scale-out or alert
Set your alert threshold based on your latency SLA. If your system must process events within 30 seconds, and your consumers handle 100 events per second per instance, a lag of 3,000 means you're 30 seconds from breaching the SLA. Alert at half that — 1,500 — to give yourself time to add consumers before users notice.
For most teams, exposing kafka_consumer_records_lag_max via Micrometer to Prometheus and alerting in Grafana is the right setup. The programmatic approach above is useful for building auto-scaling logic — dynamically adding consumer instances when lag exceeds a threshold and removing them when it returns to baseline.
Production Kafka is not complicated — it becomes complicated when teams skip understanding these fundamentals. The partition model, offset management, and ISR mechanics explain almost every production incident involving Kafka. Build these mental models first, then instrument them.
Key Takeaways for Production
- Partition count is permanent — choose based on your parallelism needs, not current load (6-12 partitions is usually sufficient to start)
acks=all+min.insync.replicas=2for any data you cannot afford to lose- Manual offset commit for business-critical processing; auto-commit for analytics
- Consumer lag is your early warning system — monitor it obsessively
- Idempotent producers are free (negligible overhead) — always enable them
- Keys matter — wrong key = wrong ordering = subtle bugs under load