System Design: Designing a Distributed File System (HDFS)
A Distributed File System (like HDFS or GFS) is designed to store massive datasets across a cluster of commodity servers. It handles the complexity of breaking large files into chunks, replicating those chunks, and managing hardware failures without interrupting access.
1. Core Requirements
- Massive Storage: Storing petabytes of data.
- Reliability: Data must survive disk/node failures.
- Sequential Access: Optimized for large, streaming sequential reads (write-once, read-many workloads), not low-latency random access.
- Throughput: High bandwidth for batch processing (MapReduce/Spark).
2. High-Level Architecture
- NameNode (Metadata Server): Maintains the filesystem namespace (the directory tree), the mapping from each file to its chunks, and the current location of every chunk replica. It's the "brain" of the cluster.
- DataNodes (Storage Servers): The physical machines where the actual file chunks live. They serve chunk reads and writes directly to clients (the metadata split is sketched below).
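In essence, the NameNode holds two maps: file path → ordered chunk list, and chunk → replica locations. Here is a minimal sketch of that shape (all names hypothetical; real HDFS also keeps a full namespace tree and a persistent edit log):

```python
from dataclasses import dataclass, field

@dataclass
class NameNodeMetadata:
    # "/logs/2024/app.log" -> ["chunk-17", "chunk-18"] (order matters)
    file_to_chunks: dict[str, list[str]] = field(default_factory=dict)
    # "chunk-17" -> {"datanode-3", "datanode-9", "datanode-12"}
    chunk_locations: dict[str, set[str]] = field(default_factory=dict)

    def locate(self, path: str) -> list[tuple[str, set[str]]]:
        """Answer a client's open(): each chunk and the nodes holding it."""
        return [(c, self.chunk_locations[c]) for c in self.file_to_chunks[path]]
```

Note that the NameNode only answers where data lives; clients then stream chunk bytes directly from the DataNodes, so bulk data never flows through the metadata server.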
3. Data Chunking & Replication
- Chunking: Files are broken into fixed-size chunks (e.g., 128 MB, the HDFS default). This size is intentionally large: a 1 TB file becomes roughly 8,000 chunk entries in the NameNode instead of millions, and large chunks keep reads sequential.
- Replication: Each chunk is replicated (usually 3 times). For fault tolerance, replicas are spread across different racks (or availability zones) so that a single rack failure cannot destroy every copy of a chunk; a simplified placement policy is sketched below.
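The sketch below is loosely modeled on HDFS's default policy: first replica on the writer's rack, the other two together on one remote rack. Node and rack names are hypothetical, and real implementations also weigh node health, load, and free disk space:

```python
import random

def place_replicas(nodes_by_rack: dict[str, list[str]], writer_rack: str) -> list[str]:
    """Simplified 3-way rack-aware placement: one replica on the writer's
    rack, two on a single remote rack. Assumes >= 2 racks and >= 2 nodes
    per rack; a real policy would also check health, load, and capacity."""
    first = random.choice(nodes_by_rack[writer_rack])
    remote_rack = random.choice([r for r in nodes_by_rack if r != writer_rack])
    second, third = random.sample(nodes_by_rack[remote_rack], 2)
    return [first, second, third]

# Example: three racks of two DataNodes each (all names made up).
racks = {"rack-a": ["dn-1", "dn-2"],
         "rack-b": ["dn-3", "dn-4"],
         "rack-c": ["dn-5", "dn-6"]}
print(place_replicas(racks, writer_rack="rack-a"))  # e.g. ['dn-2', 'dn-5', 'dn-6']
```

Keeping the second and third replicas on the same remote rack trades a little fault isolation for less cross-rack write traffic, which is the trade-off HDFS's default policy makes.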
4. The NameNode Bottleneck
The NameNode keeps all metadata in RAM so lookups stay fast, which makes its heap the scaling limit: with billions of files and chunks (a common rule of thumb is roughly 150 bytes of heap per file or chunk object), a single NameNode eventually runs out of memory.
- Federation: Partition the file system into multiple namespaces (e.g., by directory subtree), each managed by its own independent NameNode; DataNodes can store chunks for any of them. A routing sketch follows this list.
- Caching: Frequently accessed metadata can be cached closer to clients, but the NameNode remains the authoritative source of truth.
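One way to picture federation is a client-side mount table that routes each path prefix to the NameNode owning that namespace, in the spirit of HDFS's ViewFs (hostnames below are hypothetical):

```python
# Hypothetical mount table: each namespace prefix is owned by one NameNode.
MOUNT_TABLE = {
    "/logs": "namenode-logs.example.com",
    "/user": "namenode-user.example.com",
    "/tmp":  "namenode-tmp.example.com",
}

def resolve_namenode(path: str) -> str:
    """Longest-prefix match from a client path to the owning NameNode."""
    matches = [p for p in MOUNT_TABLE if path == p or path.startswith(p + "/")]
    if not matches:
        raise KeyError(f"no NameNode mounted for {path}")
    return MOUNT_TABLE[max(matches, key=len)]

print(resolve_namenode("/logs/2024/app.log"))  # namenode-logs.example.com
```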
5. Heartbeats and Failure Recovery
- Heartbeats: DataNodes send periodic heartbeats to the NameNode (every 3 seconds by default in HDFS) to confirm they are alive.
- Recovery: If heartbeats from a DataNode stop for longer than a timeout (HDFS waits on the order of ten minutes before declaring a node dead, so a transient glitch doesn't trigger a re-replication storm), the NameNode marks every chunk that node held as "under-replicated" and schedules new copies on healthy nodes. A sketch of that detection logic follows.
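A rough sketch of that detection loop (intervals and names are illustrative, not HDFS's exact configuration):

```python
import time

DEAD_NODE_TIMEOUT = 600.0  # seconds without a heartbeat; illustrative value

last_heartbeat: dict[str, float] = {}      # DataNode id -> last heartbeat time
chunk_locations: dict[str, set[str]] = {}  # chunk id -> nodes holding a replica

def record_heartbeat(node: str) -> None:
    last_heartbeat[node] = time.monotonic()

def under_replicated(replication: int = 3) -> dict[str, int]:
    """Declare silent nodes dead, then count missing replicas per chunk.
    The NameNode would turn this map into re-replication commands."""
    now = time.monotonic()
    dead = {n for n, t in last_heartbeat.items() if now - t > DEAD_NODE_TIMEOUT}
    return {chunk: replication - len(holders - dead)
            for chunk, holders in chunk_locations.items()
            if len(holders - dead) < replication}
```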
Summary
Building a distributed file system is fundamentally about abstraction. By separating metadata management (the NameNode) from raw storage (the DataNodes) and using large block sizes, you can build a storage engine that scales to thousands of nodes and petabytes of data while appearing to users as a single, simple directory tree.
