System Design: Designing a Distributed Logging System (TB/Day Scale)

How do you ingest and query terabytes of logs every day? A deep dive into the ELK stack architecture, buffering with Kafka, and sharding for search.

Sachin Sarawgi | April 20, 2026 | 3 min read

In a microservices architecture with thousands of containers, logs are scattered everywhere. You need a centralized system that can ingest terabytes of log data every day, store it cost-effectively, and allow engineers to search it in near real-time.

1. Core Requirements

  • High Throughput: Ingesting millions of log lines per second.
  • Searchability: Full-text search across logs (errors, request IDs).
  • Retention: Keeping logs searchable for 7 days and archiving older ("cold") logs for 1 year.
  • Resilience: If the logging system is slow, it should not crash the main application.
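
The resilience requirement can be sketched on the application side as a bounded, non-blocking log queue. This is a minimal illustration using Python's standard `logging` module, not part of the ELK stack itself; the names `DroppingQueueHandler` and `build_logger` and the buffer size are hypothetical.

```python
import logging
import logging.handlers
import queue


class DroppingQueueHandler(logging.handlers.QueueHandler):
    """Enqueue records without blocking; count and drop them when the queue is full."""

    def __init__(self, log_queue):
        super().__init__(log_queue)
        self.dropped = 0

    def enqueue(self, record):
        try:
            self.queue.put_nowait(record)
        except queue.Full:
            # Losing a log line is preferable to stalling a request thread.
            self.dropped += 1


def build_logger(name, max_buffered=10_000):
    """Return (logger, handler); the bounded queue caps memory used by pending logs."""
    log_queue = queue.Queue(maxsize=max_buffered)
    handler = DroppingQueueHandler(log_queue)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False
    return logger, handler
```

In a real service something still has to drain the queue, for example a `logging.handlers.QueueListener` writing to the local file that the collector tails. The `dropped` counter is worth exporting as a metric so that silent log loss is visible.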

2. The Ingestion Pipeline (The ELK Model)

A widely used reference architecture for centralized logging is the ELK Stack (Elasticsearch, Logstash, Kibana).

Phase 1: The Collector (Filebeat/Fluentd)

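A lightweight agent runs on every server or node (in Kubernetes, typically as a DaemonSet). It tails the log files written by the applications on that host and forwards them to the next stage.

  • Why? The application only writes to local disk, so it never waits on the logging backend. If the network or the next stage is down, the logs stay on disk and the agent resumes shipping once the connection recovers.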
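
The "resume after an outage" behavior comes down to tracking how far into each file the agent has shipped. The sketch below shows that idea with a checkpoint file; it is a simplified illustration, not the actual implementation of Filebeat or Fluentd, and the function names and checkpoint format are assumptions.

```python
import json
import os


def read_new_lines(log_path, checkpoint_path):
    """Return complete lines appended since the last checkpoint, plus the new offset."""
    offset = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            offset = json.load(f).get("offset", 0)

    lines = []
    with open(log_path, "rb") as f:
        f.seek(offset)
        for raw in f:
            if not raw.endswith(b"\n"):
                break  # partial line still being written; pick it up next time
            lines.append(raw.decode("utf-8", errors="replace").rstrip("\n"))
            offset += len(raw)
    return lines, offset


def commit_offset(checkpoint_path, offset):
    """Persist the offset; call this only after the batch was shipped successfully."""
    tmp_path = checkpoint_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"offset": offset}, f)
    os.replace(tmp_path, checkpoint_path)  # atomic rename
```

Because the offset is committed only after a successful send, a failed send means the same lines are read again on the next attempt. That gives at-least-once delivery, so downstream stages should tolerate occasional duplicates. Log rotation is deliberately left out of this sketch.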

Phase 2: The Buffer (Apache Kafka)

At this scale, avoid pushing logs directly from the collectors into your search cluster.

  • The Problem: A spike in application traffic will create a spike in logs, which could overwhelm your search engine.
  • The Solution: Use Kafka as a buffer. The collectors push to Kafka, and the indexing workers consume at a steady, sustainable rate.
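
A toy simulation makes the buffering effect concrete. It contains no Kafka client code; it only models a queue between a bursty producer and a fixed-rate consumer, and the traffic numbers are made up for illustration.

```python
def simulate_backlog(arrivals_per_sec, consume_per_sec):
    """Return the buffer backlog after each second for a fixed-rate consumer."""
    backlog = 0
    history = []
    for arrived in arrivals_per_sec:
        backlog += arrived
        backlog -= min(backlog, consume_per_sec)
        history.append(backlog)
    return history


# Hypothetical traffic: steady 50k lines/s with a 3-second spike to 200k lines/s.
traffic = [50_000] * 3 + [200_000] * 3 + [50_000] * 10
history = simulate_backlog(traffic, consume_per_sec=100_000)
peak = max(history)  # 300_000 lines waiting in the buffer at the end of the spike
```

The indexers never see more than 100k lines/s, and the backlog drains back to zero six seconds after the spike ends. The cost is latency: during the spike, logs reach the search cluster a few seconds late instead of overloading it.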

Phase 3: The Transformer (Logstash)

Logstash (or a custom Flink job) pulls logs from Kafka, parses them (JSON, Grok), and enriches them (adding region_id or user_metadata).
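
The parse-and-enrich step can be sketched in plain Python. This is not Logstash configuration; the regular expression stands in for a Grok pattern, and the plain-text line format and function names are assumptions made for the example. Only `region_id` comes from the text above.

```python
import json
import re

# Hypothetical plain-text format: "2024-04-20T10:15:30Z ERROR payment-service message..."
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+(?P<service>\S+)\s+(?P<message>.*)$"
)


def parse_line(raw):
    """Parse a JSON or plain-text log line into a dict; keep unparseable lines as-is."""
    raw = raw.strip()
    if raw.startswith("{"):
        try:
            parsed = json.loads(raw)
            if isinstance(parsed, dict):
                return parsed
        except json.JSONDecodeError:
            pass  # fall through to the plain-text pattern
    match = LINE_PATTERN.match(raw)
    if match:
        return match.groupdict()
    # Never drop a line just because it failed to parse; flag it instead.
    return {"message": raw, "parse_error": True}


def enrich(event, region_id):
    """Return a copy of the event with deployment metadata attached."""
    enriched = dict(event)
    enriched["region_id"] = region_id
    return enriched
```

Keeping unparseable lines with a `parse_error` flag means a malformed log is still searchable as raw text, which is usually when you need it most.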

3. Storage: Elasticsearch

Elasticsearch is the search engine in this pipeline: it indexes the parsed logs so that full-text queries return quickly.

  • Time-based Indexing: Create a new index every day (e.g., logs-2024-04-20). This makes deleting old data as simple as deleting an index.
  • Sharding: Distribute the index across multiple nodes to handle the write volume.
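
The two ideas above can be sketched as a few helper functions: naming the daily index, finding indices that have aged out, and estimating a shard count. The helper names are hypothetical, and the 50 GB target shard size is an assumption based on commonly cited guidance; the right value depends on your hardware and query load.

```python
import math
from datetime import date, timedelta


def index_name(day, prefix="logs"):
    """Daily index name in the format used above, e.g. logs-2024-04-20."""
    return f"{prefix}-{day.isoformat()}"


def expired_indices(existing, today, retention_days, prefix="logs"):
    """Return index names whose day is older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in existing:
        day = date.fromisoformat(name[len(prefix) + 1:])
        if day < cutoff:
            expired.append(name)
    return expired


def primary_shards(daily_gb, target_shard_gb=50):
    """Estimate primary shards so each one stays near the target size (an assumption)."""
    return max(1, math.ceil(daily_gb / target_shard_gb))
```

For example, at 2 TB of indexed data per day (2,000 GB), `primary_shards(2000)` gives 40 primary shards for that day's index. Dropping a whole index is far cheaper than deleting individual documents, which is why retention is handled per index.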

4. Cost Optimization: Tiered Storage

Log volume grows quickly as traffic and the number of services increase. Storing everything on SSDs is prohibitively expensive.

  • Hot Tier (SSDs): Last 24-48 hours of logs. High-speed searching.
  • Warm Tier (HDDs): Logs from 2 to 7 days old. Slower, but cheaper.
  • Cold Tier (S3): Logs older than 7 days. Compressed and archived for compliance until the 1-year retention limit.
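
The tiering rule is a function of index age. The sketch below encodes the boundaries from the tiers above plus the 1-year retention requirement; the constant and function names are hypothetical. In Elasticsearch this kind of policy is normally expressed through index lifecycle management rather than application code.

```python
from datetime import date

# Boundaries taken from the tiers above; tune them to your own query patterns.
HOT_MAX_AGE_DAYS = 2      # last 24-48 hours
WARM_MAX_AGE_DAYS = 7     # up to 7 days old
COLD_MAX_AGE_DAYS = 365   # 1-year retention requirement


def tier_for(index_day, today):
    """Return the storage tier for a daily index, or "delete" past retention."""
    age_days = (today - index_day).days
    if age_days < HOT_MAX_AGE_DAYS:
        return "hot"
    if age_days <= WARM_MAX_AGE_DAYS:
        return "warm"
    if age_days <= COLD_MAX_AGE_DAYS:
        return "cold"
    return "delete"
```

Because each index covers exactly one day, moving data between tiers is a whole-index operation, which keeps the policy simple.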

5. Avoiding the "Feedback Loop"

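The logging system should never send its own logs through the same pipeline. If shipping a log fails and that failure is itself logged into the pipeline, the error log can also fail to ship and produce another error log. The resulting loop can saturate the pipeline and take down the components that depend on it.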
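
One way to enforce this separation inside a Python process is a filter on the shipping handler plus a dedicated logger for the pipeline's own components. This is a minimal sketch; the `logpipeline` logger namespace and the function names are assumptions.

```python
import logging
import sys


class ExcludePipelineLogs(logging.Filter):
    """Drop records emitted by the logging pipeline's own components."""

    def __init__(self, pipeline_prefix="logpipeline"):
        super().__init__()
        self.pipeline_prefix = pipeline_prefix

    def filter(self, record):
        name = record.name
        own = name == self.pipeline_prefix or name.startswith(self.pipeline_prefix + ".")
        return not own


def attach_shipping_handler(shipping_handler, pipeline_prefix="logpipeline"):
    """Wire application logs to the shipper and pipeline logs to stderr only."""
    shipping_handler.addFilter(ExcludePipelineLogs(pipeline_prefix))
    logging.getLogger().addHandler(shipping_handler)

    pipeline_logger = logging.getLogger(pipeline_prefix)
    pipeline_logger.addHandler(logging.StreamHandler(sys.stderr))
    pipeline_logger.propagate = False  # records stop here and never reach the shipper
    return pipeline_logger
```

The filter and `propagate = False` are redundant on purpose: either one alone breaks the loop, so a later configuration change is less likely to reintroduce it. The pipeline's own health is better tracked through a separate channel such as metrics or a small independent log sink.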

Summary

Building a logging system at scale is a Data Engineering challenge. By using Kafka as a buffer and Elasticsearch with time-based indexing and tiered storage, you can build a platform that provides deep visibility into your systems without breaking the bank.

Written by

Sachin Sarawgi

Engineering Manager and backend engineer with 10+ years building distributed systems across fintech, enterprise SaaS, and startups. CodeSprintPro is where I write practical guides on system design, Java, Kafka, databases, AI infrastructure, and production reliability.
