The 'Small Files' Problem: The Data Lake Killer
Streaming data from Kafka into a Data Lake built on object storage (Amazon S3, Azure Blob Storage) seems simple. However, if you write each batch of data to storage as soon as it arrives, you will quickly hit the Small Files Problem.
1. What is the Problem?
Distributed storage systems and query engines (like Athena, Presto, or Spark) are optimized for large, sequential reads.
- The Pitfall: If you have 10 million files of 10KB each, the per-file overhead of listing, opening, and seeking can dominate your query time, dwarfing the time spent actually reading data (a back-of-envelope calculation follows this list).
- The Symptom: S3-based queries that used to take seconds now take minutes, and your cloud bill for "ListBucket" and "GetObject" requests skyrockets.
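A quick back-of-envelope calculation makes this concrete. The sketch below compares 100 GB stored as 10KB objects versus 512MB objects; the per-object latency and scan throughput are illustrative assumptions, not benchmarks:

```python
# Back-of-envelope: 100 GB of data stored as 10 KB objects vs. 512 MB objects.
# The 20 ms per-object overhead and 500 MB/s scan rate are illustrative
# assumptions; real numbers depend on the engine, region, and parallelism.
TOTAL_BYTES = 100 * 1024**3          # 100 GB of raw data
PER_OBJECT_OVERHEAD_S = 0.020        # assumed request/open/seek cost per object
SCAN_THROUGHPUT_BPS = 500 * 1024**2  # assumed aggregate scan throughput

for object_size in (10 * 1024, 512 * 1024**2):   # 10 KB vs. 512 MB objects
    num_objects = TOTAL_BYTES // object_size
    overhead_s = num_objects * PER_OBJECT_OVERHEAD_S
    scan_s = TOTAL_BYTES / SCAN_THROUGHPUT_BPS
    print(f"{num_objects:>10,} objects -> "
          f"request overhead {overhead_s:>9,.0f} s, raw scan {scan_s:,.0f} s")
```

Parallelism shrinks both totals, but the ratio is the point: with tiny objects the engine spends almost all of its time issuing requests instead of reading data.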
2. Why does Kafka cause this?
Kafka is a real-time system. If you use a Kafka Connect S3 Sink with a small flush.size or a short rotate.interval.ms, it will commit a new file every few seconds for each topic partition. Over a day, a single Kafka topic can generate thousands of tiny files.
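To make that concrete, here is a sketch of the relevant knobs on Confluent's S3 sink connector, submitted through the Kafka Connect REST API. The connector name, topic, bucket, and Connect host are placeholders, and the tiny flush.size is deliberately the "bad" configuration described above:

```python
import json
import requests  # assumes a Kafka Connect worker is reachable at CONNECT_URL

CONNECT_URL = "http://localhost:8083"  # placeholder Connect worker

# Sketch of a Confluent S3 sink connector config. With a small flush.size
# (or a short rotate.interval.ms) every flush becomes its own S3 object,
# one per topic partition -- this is where the tiny files come from.
connector = {
    "name": "orders-s3-sink",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "orders",
        "s3.bucket.name": "my-data-lake",   # placeholder bucket
        "s3.region": "us-east-1",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "flush.size": "100",            # tiny: a new object every 100 records
        "rotate.interval.ms": "10000",  # ...or at least every 10 seconds
    },
}

resp = requests.post(f"{CONNECT_URL}/connectors", json=connector, timeout=30)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))
```

Raising flush.size and using the rotation interval only as a time-based backstop reduces the file count, but on its own it just trades latency for file size; the compaction pattern below is what keeps the lake healthy.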
3. The Solution: Compaction (The Bin-Packing Pattern)
To keep your Data Lake healthy, you must implement a Compaction Strategy.
- The Landing Zone: Write raw, tiny files into a "temporary" prefix in S3.
- The Compactor: Run a background process (e.g., an AWS Glue job or a Spark job) that reads these tiny files and merges them into large, 128 MB to 512 MB Parquet files (a sketch follows this list).
- The Gold Zone: Move the compacted files to your final table location.
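Here is a minimal sketch of the compactor as a PySpark job. The bucket, prefixes, and output file count are placeholders, and it assumes a runtime (EMR, Glue, Databricks) that can read s3:// paths; a production job would size the output from the actual input volume and clean up the landing prefix afterwards:

```python
from pyspark.sql import SparkSession

# Minimal compaction sketch: read the tiny landing files for one day,
# squeeze them into a handful of large files, and write Parquet to the
# gold zone. Paths and the output file count are illustrative.
spark = SparkSession.builder.appName("small-file-compactor").getOrCreate()

landing = "s3://my-data-lake/landing/orders/day=2024-04-20/"
gold = "s3://my-data-lake/gold/orders/day=2024-04-20/"
num_output_files = 8  # aim for roughly 128-512 MB per output file

(spark.read.json(landing)           # raw events exactly as the sink wrote them
      .repartition(num_output_files)
      .write.mode("overwrite")
      .parquet(gold))
```

repartition() shuffles the data but yields evenly sized files; coalesce() avoids the shuffle at the cost of potentially skewed file sizes.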
4. Using Partitioning Efficiently
Partition your data by time (e.g., /year=2024/month=04/day=20/).
- Benefit: When you run a query for a specific day, the engine only has to scan the files in that specific folder, skipping terabytes of irrelevant data.
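In Spark this is a single partitionBy on the write. The sketch below derives the partition columns from an assumed event_time timestamp column; the paths and column names are illustrative:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("partitioned-writer").getOrCreate()

# Derive year/month/day from an assumed event_time column, then write
# partitioned so the layout becomes .../year=2024/month=4/day=20/...
events = (spark.read.json("s3://my-data-lake/landing/orders/")
               .withColumn("year", F.year("event_time"))
               .withColumn("month", F.month("event_time"))
               .withColumn("day", F.dayofmonth("event_time")))

(events.write.mode("append")
       .partitionBy("year", "month", "day")
       .parquet("s3://my-data-lake/gold/orders/"))

# A filter on the partition columns lets the engine skip every other prefix.
one_day = (spark.read.parquet("s3://my-data-lake/gold/orders/")
                .where("year = 2024 AND month = 4 AND day = 20"))
print(one_day.count())
```

Because the filter matches the partition columns, the engine prunes at the directory level and never even lists the prefixes for other days.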
5. Metadata Storage (The Hive Metastore)
Use a tool like the AWS Glue Data Catalog to keep track of where your files live and which schema they use. Update the catalog as soon as compaction finishes, so queries always hit the large, compacted files rather than the tiny landing files.
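As one example, a compaction job can register the freshly compacted day as a partition in the Glue Data Catalog via boto3. The database, table, and location names below are placeholders, and a Glue crawler or Athena's ALTER TABLE ADD PARTITION can accomplish the same thing:

```python
import boto3

# After compaction, point the catalog at the day's gold-zone prefix so
# queries immediately see the large Parquet files. Names are placeholders.
glue = boto3.client("glue")

database, table = "analytics", "orders"
day_location = "s3://my-data-lake/gold/orders/year=2024/month=04/day=20/"

# Reuse the table's storage descriptor (format, SerDe, columns) for the partition.
sd = glue.get_table(DatabaseName=database, Name=table)["Table"]["StorageDescriptor"]
sd["Location"] = day_location

glue.create_partition(
    DatabaseName=database,
    TableName=table,
    PartitionInput={
        "Values": ["2024", "04", "20"],  # must match the table's partition key order
        "StorageDescriptor": sd,
    },
)
```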
Summary
Building a scalable Data Lake requires moving from real-time "streaming writes" to "batch-oriented storage": land the raw stream, then compact it. With a robust compaction process, time-based partitioning, and sensible target file sizes, query performance stays fast and predictable even as your data grows to petabytes.
