System Design: Designing a Distributed Logging System
In a microservices architecture with thousands of containers, logs are scattered everywhere. You need a centralized system that can ingest terabytes of log data every day, store it cost-effectively, and allow engineers to search it in near real-time.
1. Core Requirements
- High Throughput: Ingesting millions of log lines per second.
- Searchability: Full-text search across logs (errors, request IDs).
- Retention: Keeping "hot" logs for 7 days and "cold" logs for 1 year.
- Resilience: If the logging system is slow, it should not crash the main application.
2. The Ingestion Pipeline (The ELK Model)
The de facto industry standard for centralized logging is the ELK Stack (Elasticsearch, Logstash, Kibana), typically fronted by a message queue.
Phase 1: The Collector (Filebeat/Fluentd)
A lightweight agent (deployed as a Kubernetes DaemonSet) runs on every server/container. It monitors log files and pushes them to the next stage.
- Why? If the network or the downstream pipeline is unavailable, logs are buffered locally on disk instead of being lost.
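The local-buffering behavior can be sketched in a few lines. This is a minimal illustration, not Filebeat's actual implementation; `ship`, `drain`, and the spool path are hypothetical names, and `send` stands in for whatever network call forwards the line.

```python
import os

# Hypothetical spool file used as the durable local buffer.
SPOOL = "/tmp/collector.spool"

def ship(line: str, send) -> bool:
    """Try to forward a log line; spool it to local disk on failure."""
    try:
        send(line)
        return True
    except OSError:
        with open(SPOOL, "a") as f:   # buffer on disk, survive restarts
            f.write(line + "\n")
        return False

def drain(send) -> int:
    """Replay spooled lines once the network recovers."""
    if not os.path.exists(SPOOL):
        return 0
    with open(SPOOL) as f:
        lines = f.read().splitlines()
    for line in lines:
        send(line)
    os.remove(SPOOL)
    return len(lines)
```

Real agents add back-pressure, at-least-once delivery tracking (Filebeat's registry), and bounded spool sizes, but the core idea is the same: the disk, not the application, absorbs network outages.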
Phase 2: The Buffer (Apache Kafka)
You should never push logs directly to your database.
- The Problem: A spike in application traffic will create a spike in logs, which could overwhelm your search engine.
- The Solution: Use Kafka as a buffer. The collectors push to Kafka, and the indexing workers consume at a steady, sustainable rate.
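A toy model makes the smoothing effect concrete. Here an in-memory deque stands in for the Kafka topic; the names and the tick-based simulation are illustrative only.

```python
from collections import deque

def run(bursts: dict, batch_size: int, ticks: int):
    """Simulate spiky producers feeding a fixed-rate indexing consumer.

    bursts maps tick -> list of log events arriving at that tick.
    The consumer indexes at most batch_size events per tick, so the
    search engine never sees the producer's traffic spike.
    """
    queue = deque()                # stand-in for the Kafka topic
    indexed_per_tick = []
    for t in range(ticks):
        queue.extend(bursts.get(t, []))   # spiky write traffic lands here
        n = min(batch_size, len(queue))   # steady, sustainable consume rate
        for _ in range(n):
            queue.popleft()
        indexed_per_tick.append(n)
    return indexed_per_tick, len(queue)
```

A burst of 10 events with a consumer rate of 3 per tick is indexed as 3, 3, 3, 1: the spike becomes a backlog in the queue rather than load on the database.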
Phase 3: The Transformer (Logstash)
Logstash (or a custom Flink job) pulls logs from Kafka, parses them (JSON, Grok), and enriches them (adding region_id or user_metadata).
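The parse-and-enrich step can be sketched as a single function. This is a simplified stand-in for a Logstash pipeline: the regex plays the role of a Grok pattern, and the field names are assumptions for illustration.

```python
import json
import re

# Stand-in for a Grok pattern matching lines like "ERROR disk full".
PLAIN = re.compile(r"^(?P<level>\w+)\s+(?P<message>.*)$")

def transform(raw: str, region_id: str) -> dict:
    """Parse a raw log line (JSON or plain text) and enrich it."""
    try:
        event = json.loads(raw)           # structured logs pass through
    except json.JSONDecodeError:
        m = PLAIN.match(raw)              # fall back to pattern matching
        event = m.groupdict() if m else {"message": raw}
    event["region_id"] = region_id        # enrichment: add deploy metadata
    return event
```

In production the enrichment data (region, service owner, user metadata) usually comes from a lookup table or environment tags rather than a function argument.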
3. Storage: Elasticsearch
Elasticsearch is the "Search Engine" of the logging world.
- Time-based Indexing: Create a new index every day (e.g., logs-2024-04-20). This makes deleting old data as simple as deleting an index.
- Sharding: Distribute the index across multiple nodes to handle the write volume.
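The index-per-day scheme is easy to sketch. The naming convention matches the example above; the retention helper is an illustration of the policy logic, not Elasticsearch's ILM API.

```python
from datetime import date, timedelta

def index_name(day: date) -> str:
    """One index per day, named by date (e.g., logs-2024-04-20)."""
    return f"logs-{day:%Y-%m-%d}"

def expired(indices: list, today: date, retention_days: int) -> list:
    """Indices older than the retention window, ready to delete wholesale."""
    cutoff = index_name(today - timedelta(days=retention_days))
    # zero-padded dates sort lexicographically == chronologically
    return [name for name in indices if name < cutoff]
```

Because a whole day's logs live in one index, retention is a handful of cheap index deletions instead of millions of per-document deletes.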
4. Cost Optimization: Tiered Storage
Log volume grows with traffic, and storing everything on expensive SSDs quickly becomes prohibitively costly.
- Hot Tier (SSDs): Last 24-48 hours of logs. High-speed searching.
- Warm Tier (HDDs): Last 7 days of logs. Slower, but cheaper.
- Cold Tier (S3): Logs older than 7 days. Compressed and archived for compliance.
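The tiering policy above reduces to a simple age check. A minimal sketch, assuming the age thresholds from the list (real deployments drive this with lifecycle-management rules rather than application code):

```python
from datetime import date

def tier_for(log_day: date, today: date) -> str:
    """Pick a storage tier from the age of a day's logs."""
    age = (today - log_day).days
    if age <= 2:
        return "hot"   # SSD: last 24-48 hours, high-speed search
    if age <= 7:
        return "warm"  # HDD: last 7 days, slower but cheaper
    return "cold"      # S3: compressed archive for compliance
```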
5. Avoiding the "Feedback Loop"
The logging system should never log its own logs to the same pipeline. If a logging error occurs, it could create an infinite loop that crashes the entire infrastructure.
Summary
Building a logging system at scale is a Data Engineering challenge. By using Kafka as a buffer and Elasticsearch with time-based indexing and tiered storage, you can build a platform that provides deep visibility into your systems without breaking the bank.
