Mental Model
Scaling storage to petabytes across thousands of commodity servers requires decoupling control operations from high-throughput data streams. A master-worker distributed file system isolates metadata organization onto a highly available, memory-optimized NameNode cluster while streaming raw, block-fragmented data payloads directly to and from rack-aware, autonomous DataNodes.
Requirements and System Goals
To design a distributed file system capable of driving high-throughput analytics pipelines (e.g. MapReduce, Spark, or Large Language Model training ingestion), we establish clear capacity limits and latency bounds.
1. Functional Requirements
- Hierarchical File System Namespace: Support standard directory structures, file creation, deletion, rename, and listing.
- Large Block-Based Ingestion: Automatically partition files into massive, fixed-size contiguous blocks.
- Rack-Aware State Replication: Automatically distribute block copies across distinct hardware racks to survive raw network switch or node failures.
2. Non-Functional Requirements & Performance Budgets
- High Read/Write Throughput: Maximize sequential data transfer bandwidth. Reading a 100 Gigabyte (GB) file must stream at a composite rate of greater than 10 Gbps over clustered interfaces.
- Fault Tolerance & Durability: Survive the concurrent failure of entire server racks without data loss or pipeline interruption.
- Metadata Lookup Latency: The NameNode must resolve file-to-block routing queries inside a P99 budget of less than 10ms.
- Write Pipeline Latency: Acknowledging a block write to the client must return in less than 50ms once successfully pipelined to replicas.
3. Back-of-the-Envelope Estimation: Metadata Memory Footprint
Unlike raw data, the HDFS NameNode holds the entire namespace structure and block-location map directly in its JVM heap to keep lookups sub-millisecond. We calculate the NameNode memory limits under massive scale:
- Target Cluster Capacity: 100 Petabytes (PB).
- Block Configuration: Fixed size of 128 Megabytes (MB) per block.
- Larger block sizes (e.g., 128MB or 256MB) are chosen specifically to minimize the metadata records stored in the NameNode memory, preventing GC pauses.
- Total Blocks Count: $100 \text{ PB} \div 128 \text{ MB} = 100,000,000 \text{ Gigabytes} \div 128 \text{ Megabytes} \approx 819,200 \text{ blocks}$.
- Replication Factor: $3\times$ replication.
- Total physical blocks tracked in the cluster = $819,200 \times 3 = 2,457,600 \text{ physical blocks}$.
- Metadata Footprint Sizing:
- On average, each block metadata record (Block ID, list of hosting DataNodes, generation stamp) requires 150 bytes of memory.
- Directory and file namespace objects require 150 bytes per inode.
- Assume our file system stores $10,000,000$ files across directories.
- Memory required for blocks: $2,457,600 \times 150 \text{ bytes} \approx 368 \text{ Megabytes (MB)}$.
- Memory required for namespace inodes: $10,000,000 \times 150 \text{ bytes} \approx 1.5 \text{ Gigabytes (GB)}$.
- Total NameNode JVM Heap footprint: $\approx 1.87 \text{ GB}$. This scales linearly; storing 1 Billion files requires a dedicated, robust physical master host equipped with at least 150 GB of highly optimized ECC RAM to prevent JVM garbage collection thrashing.
API Interfaces and Service Contracts
Distributed file system operations separate metadata leases from direct DataNode block pipelines.
1. Initiate File Write & Obtain Lease
POST /api/v1/files/open
Request Payload:
{
"filePath": "/analytics/logs/2026-05-31/raw_events.csv",
"blockSizeBytes": 134217728,
"replicationFactor": 3,
"clientHost": "client-worker-99ab"
}
Response Payload (201 Created):
{
"filePath": "/analytics/logs/2026-05-31/raw_events.csv",
"leaseToken": "lease_token_019a-88cd",
"blockSizeBytes": 134217728,
"firstBlock": {
"blockId": "blk_9981a-uuid7",
"generationStamp": 1,
"targetDataNodes": [
{
"hostName": "datanode-rack1-01.corp.internal",
"ipAddress": "10.0.1.12"
},
{
"hostName": "datanode-rack1-02.corp.internal",
"ipAddress": "10.0.1.13"
},
{
"hostName": "datanode-rack2-01.corp.internal",
"ipAddress": "10.0.2.12"
}
]
}
}
2. Register Block Completion
POST /api/v1/blocks/complete
Request Payload:
{
"sagaId": "blk_9981a-uuid7",
"leaseToken": "lease_token_019a-88cd",
"bytesWritten": 134217728,
"checksum": "sha256-a1928bc7...",
"confirmedNodes": [
"10.0.1.12",
"10.0.1.13",
"10.0.2.12"
]
}
Response Payload (200 OK):
{
"blockState": "FINALIZED",
"nextBlockReady": true
}
High-Level Design and Visualizations
Decoupling NameNode control flows from the direct DataNode client streams prevents master networking starvation under heavy bulk reads.
1. Active-Standby NameNode Cluster Layout
To ensure continuous metadata service availability and prevent Split-Brain states, we utilize ZooKeeper active election alongside shared EditLog journals.
graph TD
subgraph Client
C[HDFS Client]
end
subgraph NameNode High Availability Cluster
C -->|1. Get Block Metadata| NN_Active[Active NameNode]
NN_Standby[Standby NameNode]
%% ZooKeeper Fencing
ZKFC_A[ZooKeeper Failover A] <-->|Monitor Heartbeat| NN_Active
ZKFC_B[ZooKeeper Failover B] <-->|Monitor Heartbeat| NN_Standby
ZKFC_A <-->|Active Lock| ZK[(ZooKeeper Consensus)]
ZKFC_B <-->|Active Lock| ZK
%% Shared Journal
NN_Active -->|2. Write EditLogs| JN[(JournalNode Cluster)]
NN_Standby -.->|3. Tail EditLogs| JN
end
subgraph Storage Rack 1
C -->|4. Read/Write Block Direct| DN_A[(DataNode A)]
C -->|4. Read/Write Block Direct| DN_B[(DataNode B)]
end
subgraph Storage Rack 2
C -->|4. Read/Write Block Direct| DN_C[(DataNode C)]
end
DN_A <-->|Heartbeats & Block Reports| NN_Active
DN_B <-->|Heartbeats & Block Reports| NN_Active
DN_C <-->|Heartbeats & Block Reports| NN_Active
2. Block Write Replication Pipeline Sequence
HDFS clients stream block payloads directly to DataNodes in a sequential, pipelined packet queue.
sequenceDiagram
autonumber
participant Client as HDFS Client
participant DN1 as DataNode A (Rack 1)
participant DN2 as DataNode B (Rack 1)
participant DN3 as DataNode C (Rack 2)
Client->>DN1: Stream Packet (64 KB Chunk)
DN1->>DN2: Pipeline Packet (Forward)
DN2->>DN3: Pipeline Packet (Forward)
Note over DN3: Verify Packet Checksum
DN3-->>DN2: Acknowledge Packet (ACK)
DN2-->>DN1: Acknowledge Packet (ACK)
DN1-->>Client: Acknowledge Packet (ACK)
Low-Level Design and Schema Strategies
To support sub-millisecond indexing, we represent NameNode in-memory tables and block allocations clearly.
1. NameNode Memory Struct Representation
Internally, the NameNode maps files to blocks using time-sortable memory hierarchies.
{
"inodes": {
"/analytics/logs/2026-05-31/raw_events.csv": {
"inodeId": "inod_019a-99cd",
"owner": "hadoop_analytics",
"permissions": "rwxr-xr-x",
"fileSizeBytes": 268435456,
"blocksList": [
{
"blockId": "blk_9981a-uuid7",
"blockSize": 134217728,
"generationStamp": 1,
"replicas": ["10.0.1.12", "10.0.1.13", "10.0.2.12"]
},
{
"blockId": "blk_9982b-uuid7",
"blockSize": 134217728,
"generationStamp": 1,
"replicas": ["10.0.1.12", "10.0.2.12", "10.0.2.13"]
}
]
}
}
}
2. EditLog Journal Persistent Layout
Because in-memory state is volatile, every metadata transaction (creating a file, adding a block, renaming a folder) is appended to a sequential persistent journal (EditLog) on disk.
# EditLog transactional representation
OP_START_LOG_SEGMENT txid=1029881
OP_ADD filepath="/analytics/logs/2026-05-31/raw_events.csv" owner="hadoop" client="10.0.5.21" txid=1029882
OP_ALLOCATE_BLOCK blockid="blk_9981a-uuid7" txid=1029883
OP_CLOSE filepath="/analytics/logs/2026-05-31/raw_events.csv" length=268435456 txid=1029884
During startup, the NameNode loads a snapshot (FSImage) and replays the outstanding EditLog transaction logs to reconstruct the entire namespace in RAM, ensuring zero transaction loss.
Scaling and Operational Challenges
1. NameNode JVM Garbage Collection Exhaustion
As the file count grows to hundreds of millions, the sheer density of metadata objects in JVM heap creates massive garbage collection overhead.
- The Danger: When a Stop-the-World GC pause triggers, the NameNode freezes execution for up to 30 seconds. DataNodes stop receiving master ping responses, assume the NameNode is dead, and begin a chaotic failover sequence.
- Staff Mitigation:
- Enforce a strict Minimum Block Size (e.g. 128MB or 256MB) to limit the total block metadata count.
- Implement Namespace Federation. Instead of one NameNode, partition the file system directory trees across multiple independent NameNodes (e.g.,
/usermanaged by NameNode A,/analyticsmanaged by NameNode B), utilizing ViewFS client mount tables to present a unified directory mapping.
2. Rack-Aware Replication Placement Math
To achieve maximum fault tolerance, HDFS distributes block replicas across different physical racks using a rack-aware algorithm:
- The Placement Strategy:
- Replica 1: Placed on a local DataNode in the same rack as the client.
- Replica 2: Placed on a DataNode in a physically distinct remote rack.
- Replica 3: Placed on a different DataNode in the same remote rack as Replica 2.
- The Mathematical Benefit: Placing replicas across two racks rather than three preserves network switch bandwidth (since we only cross the core datacenter WAN switch once during the write pipeline) while guaranteeing that the block survives a total rack power loss or top-of-rack (TOR) switch blowout.
Architectural Trade-offs and Replication Decisions
Choosing how to protect data against node failure dictates storage costs and sequential read throughput budgets.
| Operational Dimension | $3\times$ Block Replication | Erasure Coding (Reed-Solomon 6+3) |
|---|---|---|
| Storage Overhead | Massive ($200%$ storage cost premium) | Low ($50%$ storage cost premium) |
| Write Pipeline Complexity | Low (Simple sequential TCP forwarding) | High (Heavy mathematical encoding computations) |
| CPU Saturation | Negligible | High (Constant parity bit calculation) |
| Read Recovery Speed | Fast (Direct read from active healthy replica) | Slow (Must read 6 remaining fragments to rebuild) |
Failure Modes and Fault Tolerance Strategies
1. ZooKeeper Active-Fencing & Split-Brain Mitigation
Under a network partition where the Active NameNode and Standby NameNode become isolated, both might assume they are the only active leader. If both accept client writes, the EditLog journals will permanently diverge, destroying file system integrity.
- The Failover Battle: ZooKeeper Failover Controllers (ZKFC) monitor active states. If the WAN sever splits, ZKFC B detects that ZKFC A's ZooKeeper active lease has expired, and attempts to promote Standby NameNode B to Active.
- Staff Mitigation: Enforce Strict Active-Fencing. Before Standby NameNode B is promoted to Active, ZKFC B must execute an SSH fencing command to log into the original Active NameNode A and execute a hard
kill -9process command, or trigger a physical IPMI power fencing command to cut power to Node A's motherboard, ensuring only one Active NameNode can possibly exist in the datacenter.
2. DataNode Checksum Corruption & Bit Rot
Under prolonged storage, silent disk corruption (bit rot) can alter blocks on DataNodes without throwing hard OS hardware exceptions.
- Mitigation: DataNodes store a block-checksum file (
.meta) for every block. When a client reads a block, it validates the payload checksum against the meta file. If a mismatch is detected, the client rejects the payload, fetches the block from a healthy replica, and alerts the NameNode. The NameNode schedules a background replication job to overwrite the corrupt DataNode block from a healthy copy, self-healing the cluster.
Staff Engineer Perspective
Production Readiness Checklist
Before moving a distributed file cluster into production:
- IPMI Power Fencing configured: Hardware fencing controllers are verified to cut power to partitioned NameNode hosts automatically during failovers.
- Federated directory ViewFS configured: Large directory mounts are split across multiple namespace NameNodes.
- DataNode Disk Health thresholds active: DataNodes are configured to take disks offline automatically if local write checksum failures exceed 1%.
- Rack Topology mappings updated: Datacenter node configurations reflect correct network switches to ensure rack-aware replication works.
Read Next
- Consistent Hashing: The Secret Sauce of Distributed Scalability
- High Availability: Building a Five Nines Infrastructure
Verbal Script
Interviewer: "How would you design a distributed file system to store petabytes of data, and how would you protect the system from master name node split-brain failures?"
Candidate: "To design a petabyte-scale distributed file system, I would adopt a decoupled master-worker architecture, similar to HDFS.
The control operations reside in a centralized, memory-optimized NameNode, which tracks the hierarchical namespace, folder inodes, and file-to-block routing maps directly in its JVM heap for sub-millisecond lookups. The raw file payloads are divided into massive, fixed-size blocks—such as 128MB—to keep metadata counts low, and stored directly on thousands of commodity DataNodes.
To read or write data, the client performs a fast handshake with the NameNode to retrieve block locations, and then streams the block data directly to the DataNodes in a pipelined packet queue. This prevents the NameNode from becoming a network or disk bottleneck.
To ensure high availability and prevent a catastrophic Split-Brain state where two NameNodes concurrently act as Active leaders, I would construct a ZooKeeper-backed High-Availability cluster. We run ZooKeeper Failover Controllers on both master hosts.
If a network partition occurs and the Standby NameNode is promoted, the failover controller must enforce strict active-fencing before the Standby is allowed to accept writes. It issues a remote SSH fencing command to kill the original Active NameNode process, or triggers a physical IPMI motherboard power fencing command to cut power to the host.
This guarantees that only a single NameNode is active, protecting our transaction journal EditLogs and shared storage folders from data corruption."