
The 'Small Files' Problem in Data Lakes: Why Your Kafka Sink is Slow

Why does writing data from Kafka to S3 slow down over time? Learn about the 'Small Files' problem and how to implement a compaction strategy for your Data Lake.

Sachin Sarawgi · April 20, 2026 · 2 min read

The 'Small Files' Problem: The Data Lake Killer

Streaming data from Kafka into a Data Lake (like Amazon S3 or Azure Blob Storage) seems simple. However, if you write data as soon as it arrives, you will quickly hit the Small Files Problem.

1. What is the Problem?

Distributed storage systems and query engines (like Athena, Presto, or Spark) are optimized for large, sequential reads.

  • The Pitfall: If you have 10 million files that are each 10KB, the overhead of opening each file and reading its metadata (listing, seeking) can easily dominate your query time, even though the data itself is only about 100GB (see the back-of-the-envelope sketch after this list).
  • The Symptom: Your S3-based queries that used to take seconds now take minutes, and your cloud bill for "ListBucket" and "GetObject" requests is skyrocketing.
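To make the overhead concrete, here is a quick back-of-the-envelope sketch. The 10KB and 10 million figures come from the pitfall above; the 256MB target is an assumed value inside the compaction range discussed later in this article:

```python
# Same data, counted two ways: as tiny files vs. as compacted files.
SMALL_FILE_BYTES = 10 * 1024                      # 10 KB per file
NUM_SMALL_FILES = 10_000_000                      # 10 million files
TOTAL_BYTES = SMALL_FILE_BYTES * NUM_SMALL_FILES  # roughly 95 GiB of actual data

TARGET_FILE_BYTES = 256 * 1024 * 1024             # 256 MB, inside the 128-512 MB sweet spot
compacted_files = max(1, TOTAL_BYTES // TARGET_FILE_BYTES)

print(f"Total data:      {TOTAL_BYTES / 1024**3:.0f} GiB")
print(f"Tiny files:      {NUM_SMALL_FILES:,} objects to list and open")
print(f"Compacted files: {compacted_files:,} objects for the same data")
```

The same ~95 GiB fits in a few hundred properly sized files instead of 10 million, which is why the metadata overhead, not the data volume, is what hurts.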

2. Why does Kafka cause this?

Kafka is a real-time system. If you use a Kafka Connect S3 Sink with a small flush.size or rotate.interval.ms, it will create a new file every few seconds. Over a day, a single Kafka topic can generate thousands of tiny files.
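As an illustration, here is a minimal sketch of registering the Confluent S3 sink connector through the Kafka Connect REST API. The worker URL, topic, bucket, and values are placeholders for the example; the point is that flush.size and rotate.interval.ms decide when a new S3 object is cut, so small values here mean many small files:

```python
import json
import requests

# Hypothetical Connect worker; adjust for your environment.
CONNECT_URL = "http://localhost:8083/connectors"

connector = {
    "name": "events-s3-sink",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "events",
        "s3.bucket.name": "my-data-lake-landing",   # placeholder bucket
        "s3.region": "us-east-1",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        # The two settings that control file size: a file is committed when
        # this many records have accumulated...
        "flush.size": "100000",
        # ...or when this much time has passed, whichever comes first.
        # Setting these too low produces a new tiny file every few seconds.
        "rotate.interval.ms": "600000",
    },
}

resp = requests.post(CONNECT_URL, json=connector, timeout=10)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))
```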

3. The Solution: Compaction (The Bin-Packing Pattern)

To keep your Data Lake healthy, you must implement a Compaction Strategy.

  1. The Landing Zone: Write raw, tiny files into a "temporary" prefix in S3.
  2. The Compactor: Run a background process (e.g., an AWS Glue job or a Spark job) that reads these tiny files and merges them into large, 128MB to 512MB Parquet files (see the sketch after this list).
  3. The Gold Zone: Move the compacted files to your final table location.
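Below is a minimal PySpark sketch of such a compactor, under the assumption that the landing zone holds Parquet files in S3; the bucket name, prefixes, and 256MB target are placeholders for the example:

```python
import boto3
from pyspark.sql import SparkSession

# Hypothetical locations; substitute your own bucket and prefixes.
BUCKET = "my-data-lake"
LANDING_PREFIX = "landing/events/"
GOLD_PATH = "s3a://my-data-lake/gold/events/"
TARGET_FILE_BYTES = 256 * 1024 * 1024   # aim for ~256 MB output files

# 1. Measure how much data is sitting in the landing zone.
s3 = boto3.client("s3")
total_bytes = 0
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=LANDING_PREFIX):
    total_bytes += sum(obj["Size"] for obj in page.get("Contents", []))

num_output_files = max(1, total_bytes // TARGET_FILE_BYTES)

# 2. Read the tiny files and rewrite them as a few large Parquet files.
spark = SparkSession.builder.appName("small-files-compactor").getOrCreate()
df = spark.read.parquet(f"s3a://{BUCKET}/{LANDING_PREFIX}")

(
    df.repartition(int(num_output_files))   # bin-pack into ~256 MB chunks
      .write
      .mode("append")
      .parquet(GOLD_PATH)
)
```

After a successful run, the landing files that were just compacted can be archived or deleted so the next run starts from a clean slate.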

4. Using Partitioning Efficiently

Partition your data by time (e.g., /year=2024/month=04/day=20/).

  • Benefit: When you run a query for a specific day, the engine only has to scan the files in that specific folder, skipping terabytes of irrelevant data.
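As an illustration, here is a minimal PySpark sketch that writes data in that layout. The DataFrame, the event_time column, and the paths are assumptions for the example:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("partitioned-write").getOrCreate()

# Assume an events table with an event_time timestamp column.
events = spark.read.parquet("s3a://my-data-lake/gold/events_unpartitioned/")

(
    events
    .withColumn("year", F.date_format("event_time", "yyyy"))
    .withColumn("month", F.date_format("event_time", "MM"))
    .withColumn("day", F.date_format("event_time", "dd"))
    .write
    .partitionBy("year", "month", "day")   # produces /year=2024/month=04/day=20/ style paths
    .mode("append")
    .parquet("s3a://my-data-lake/gold/events/")
)
```

A query that filters on year, month, and day then scans only the matching prefix (partition pruning), which is exactly the benefit described above.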

5. Metadata Storage (The Hive Metastore)

Use a tool like the AWS Glue Data Catalog to keep track of where your files are and what schema they use. This allows you to update your metadata once compaction is done, so queries always point at the compacted files rather than the raw landing zone.
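For example, once a compaction run lands new files, a small boto3 call can ask Athena to rediscover partitions in the Glue Data Catalog. The database, table, and result location below are placeholders:

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Hypothetical database, table, and query-result location.
response = athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE events",   # re-scan S3 and register any new partitions
    QueryExecutionContext={"Database": "data_lake"},
    ResultConfiguration={"OutputLocation": "s3://my-data-lake/athena-results/"},
)
print("Started partition refresh:", response["QueryExecutionId"])
```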

Summary

Building a scalable Data Lake requires moving from real-time "streaming" writes to batch-oriented storage. By implementing a robust compaction process, partitioning sensibly, and choosing the right file size, you can keep query performance fast and predictable even as your data grows to petabytes.



Written by Sachin Sarawgi

Engineering Manager and backend engineer with 10+ years building distributed systems across fintech, enterprise SaaS, and startups. CodeSprintPro is where I write practical guides on system design, Java, Kafka, databases, AI infrastructure, and production reliability.
