System Design Module 1: Introduction to Distributed Systems

For many years, web applications ran on single, large servers. As traffic grew, companies purchased larger hardware—adding more CPU, memory, and disk space (vertical scaling). However, physical hardware has limits. Eventually, a single machine cannot handle the write rate, connection limits, and network throughput of millions of global users.

A Distributed System is a collection of independent computer nodes that work together to solve a large computing problem, appearing to the end user as a single coherent service. Moving from a single monolithic server to a distributed network solves the scaling problem, but it introduces major architectural challenges. Networks fail, messages are delayed, databases become inconsistent, and servers crash. Master system design requires understanding how to manage these challenges safely.

This introductory module details the core goals, communication protocols, database schemas, and architectural trade-offs of distributed systems.

System Requirements

To build or analyze a distributed system, we establish the core design goals and architectural parameters that govern its behavior. These specifications form the baseline against which all engineering trade-offs (such as consistency vs. availability) are evaluated.

Target Goals of Distributed Systems

Resource Sharing: Pools compute power, memory, and storage arrays across multiple physical or virtual nodes. This allows large-scale parallel processing, where jobs are partitioned and executed concurrently across multiple machines, dramatically reducing the execution times of analytical and transactional workloads.
Horizontal Scalability: Scales the system's total processing and storage capacity linearly by adding standard commodity servers (scaling out) to the cluster. This avoids the cost curve and physical performance ceilings associated with purchasing high-end mainframe hardware (scaling up).
High Availability: Guarantees that the platform remains operational and accessible to clients even if several individual server nodes, network switches, or entire availability zones fail. High availability is typically measured in "nines" (e.g., 99.999% uptime), requiring automated redundancy.
Fault Tolerance: Automatically detects, isolates, and recovers from hardware, operating system, and network-link failures without human intervention. The system must degrade gracefully (fail-soft) rather than crashing entirely when sub-components become unresponsive.
Geographic Proximity: Routes user requests to regional edge servers and data centers that are physically close to them. This minimizes round-trip latency caused by physical limits (the speed of light in fiber optic cables) and complies with local data residency regulations.

Design Constraints

Asynchronous Networks: Under the asynchronous network model, there is no upper bound on the time a message takes to travel across the network. Sockets can drop connections, packets can be reordered, and latencies can fluctuate unpredictably.
Independent Node Failures: Nodes in a cluster fail independently due to hardware degradation, kernel panics, or power outages. The system must continue operating correctly and reach consensus even if a subset of nodes becomes permanently unavailable.
No Shared Memory: Cluster nodes do not share physical memory (RAM) or a global hardware backplane. They must communicate exclusively by exchanging messages over network links. This makes message serialization, deserialization, and state synchronization critical points of efficiency.
Physical Clock Drift: Physical hardware clocks on separate machines drift due to crystal oscillator variations, temperature changes, and NTP synchronization latency. Because of this drift, distributed systems cannot rely on physical timestamps to order transactions or determine causality.

API Design and Interface Contracts

Nodes in a distributed cluster communicate using standard Remote Procedure Calls (gRPC) to register status, exchange data partitions, and check heartbeats.

1. Cluster Membership Heartbeat API (gRPC Protocol)

Worker nodes push periodic health signals to the active master node to confirm availability.

syntax = "proto3";

package codesprintpro.cluster.discovery.v1;

service ClusterDiscoveryService {
  rpc ReportHeartbeat (HeartbeatRequest) returns (HeartbeatResponse);
}

message ResourceUsage {
  double cpu_percent = 1;
  int64 free_memory_bytes = 2;
  int64 active_connections = 3;
}

message HeartbeatRequest {
  string node_id = 1;
  string ip_address = 2;
  int64 epoch_timestamp_ms = 3;
  ResourceUsage resource_status = 4;
}

message HeartbeatResponse {
  enum Command {
    CONTINUE = 0;
    REBALANCE = 1;
    TERMINATE = 2;
  }
  Command recommended_command = 1;
  int64 sync_epoch_timestamp_ms = 2;
}

2. Node Metadata Sync Contract (HTTP GET `/v1/cluster/topology`)

Used by internal routers and proxies to fetch the current partition and shard map.

{
  "clusterEpoch": 10294,
  "masterNodeId": "node_master_001.prod",
  "partitionCount": 3,
  "partitions": [
    {
      "partitionIndex": 0,
      "primaryNodeIp": "10.0.1.5",
      "replicaNodeIps": ["10.0.1.6", "10.0.1.7"],
      "hashRangeStart": "00000000",
      "hashRangeEnd": "55555555"
    },
    {
      "partitionIndex": 1,
      "primaryNodeIp": "10.0.2.5",
      "replicaNodeIps": ["10.0.2.6"],
      "hashRangeStart": "55555556",
      "hashRangeEnd": "aaaaaaaa"
    },
    {
      "partitionIndex": 2,
      "primaryNodeIp": "10.0.3.5",
      "replicaNodeIps": ["10.0.3.6"],
      "hashRangeStart": "aaaaaaab",
      "hashRangeEnd": "ffffffff"
    }
  ]
}

High-Level Architecture

We analyze the architectural transition from a monolithic single-node system to a distributed horizontal cluster.

Monolithic Single Server vs. Distributed Horizontal Cluster

A monolith routes all connections, writes, and reads through a single server. A distributed cluster uses a load balancer to distribute connections across multiple stateless web nodes, which read and write to sharded databases.

graph TD
    subgraph Monolithic System
        ClientM1[Client 1] -->|All Traffic| Monolith[Single Monolith Server]
        ClientM2[Client 2] -->|All Traffic| Monolith
        Monolith --> RDBMS[(Single Database)]
    end

    subgraph Distributed System
        ClientD1[Client 1] --> LB[Load Balancer]
        ClientD2[Client 2] --> LB
        
        LB -->|Route Request| Web1[Web Server Node 1]
        LB -->|Route Request| Web2[Web Server Node 2]
        LB -->|Route Request| Web3[Web Server Node 3]
        
        Web1 --> Shard1[(DB Shard 1)]
        Web2 --> Shard2[(DB Shard 2)]
        Web3 --> Shard3[(DB Shard 3)]
        
        Shard1 <-->|Async Replication| Shard2
        Shard2 <-->|Async Replication| Shard3
    end

CAP Theorem: Partition Tolerance and Consistency Split

When a network partition occurs between Node A and Node B, the system must choose between Availability (AP) or Consistency (CP).

sequenceDiagram
    autonumber
    participant Client as User Client
    participant NodeA as Node A (Region 1)
    participant NodeB as Node B (Region 2)

    note over NodeA, NodeB: Network Partition occurs (WAN is cut)

    Client->>NodeA: Write Request: Set Balance = 100 USD
    NodeA->>NodeA: Commit locally
    NodeA--xNodeB: Replicate Update (Blocked by Network Cut)

    note over Client, NodeB: Client queries Node B
    Client->>NodeB: Read Request: Get Balance

    alt AP Architecture (Availability Priority)
        NodeB-->>Client: Return Balance = 50 USD (Old Data / Inconsistent)
        note over NodeB, Client: System remains available but stale
    else CP Architecture (Consistency Priority)
        NodeB-->>Client: Return Error: 500 Inconsistent (Blocked)
        note over NodeB, Client: System rejects request to guarantee accuracy
    end

Low-Level Design and Schema

Managing a distributed cluster requires tracking the state of individual nodes, monitoring replication health, and mapping data partitions. We model these metadata configurations in a PostgreSQL schema.

-- Tracks all active and dead nodes in the cluster
CREATE TABLE cluster_nodes (
    node_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    ip_address VARCHAR(45) NOT NULL UNIQUE,
    node_role VARCHAR(32) NOT NULL DEFAULT 'WORKER', -- MASTER, WORKER, ARBITER
    node_status VARCHAR(32) NOT NULL DEFAULT 'ACTIVE', -- ACTIVE, DEGRADED, DEAD
    cpu_utilization DECIMAL(5, 2) NOT NULL DEFAULT 0.00,
    memory_free_bytes BIGINT NOT NULL,
    last_heartbeat_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_nodes_status ON cluster_nodes (node_status, last_heartbeat_at);

-- Logs historic heartbeats for health analysis and flap detection
CREATE TABLE node_heartbeats (
    heartbeat_id BIGSERIAL PRIMARY KEY,
    node_id UUID NOT NULL REFERENCES cluster_nodes(node_id) ON DELETE CASCADE,
    response_time_ms INT NOT NULL,
    system_load_status JSONB NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_heartbeats_lookup ON node_heartbeats (node_id, created_at DESC);

-- Tracks replication lag across sharded database replicas
CREATE TABLE replication_logs (
    log_id BIGSERIAL PRIMARY KEY,
    source_node_id UUID NOT NULL REFERENCES cluster_nodes(node_id),
    target_node_id UUID NOT NULL REFERENCES cluster_nodes(node_id),
    replication_status VARCHAR(32) NOT NULL DEFAULT 'IN_SYNC', -- IN_SYNC, LAGGING, STALLED
    replication_lag_bytes BIGINT NOT NULL DEFAULT 0,
    last_replicated_sequence BIGINT NOT NULL,
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_replication_status ON replication_logs (replication_status, replication_lag_bytes DESC);

-- Maps database partitions to physical primary and replica nodes
CREATE TABLE data_partitions (
    partition_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    partition_index INT NOT NULL UNIQUE,
    primary_node_id UUID NOT NULL REFERENCES cluster_nodes(node_id),
    replica_node_ids UUID[] NOT NULL, -- Array of backup node IDs
    hash_range_start VARCHAR(64) NOT NULL,
    hash_range_end VARCHAR(64) NOT NULL
);

CREATE INDEX idx_partitions_range ON data_partitions (hash_range_start, hash_range_end);

Schema Rationale & Index Optimization

idx_nodes_status: The cluster leader query checks this index every 5 seconds to locate worker nodes that have missed their heartbeat window (e.g., querying for nodes where status is equal to 'ACTIVE' and last_heartbeat_at is less than NOW() - INTERVAL '15 seconds') and marks them as DEAD.
replica_node_ids (Array Data Type): Storing backup configurations in a relational array allows routers to quickly fetch fallback node options when a primary database connection fails.
idx_replication_status: Allows operations teams to monitor replication lag, alerting if any database replica drops behind by more than 10 MB.

Scaling Challenges and Capacity Estimation

Transitioning from a single monolithic server to a distributed cluster requires calculating connection boundaries, system memory allocations, and network bandwidth thresholds. Underestimating these physical limits leads to socket exhaustion and high latencies in production.

1. Monolithic Thread and Socket Limits

When operating a monolith, the operating system kernel and physical hardware resources place hard limits on concurrent throughput.

Assumptions:
- Web Server RAM = $16$ GB
- Memory allocated per thread for stack space (in Java or standard thread-per-connection setups) = $1$ MB
- Operating System socket port limit = $65,535$ ports
Calculations: $$\text{Theoretical Thread Limit} = \frac{16\text{ GB}}{1\text{ MB}} = 16,000\text{ threads}$$

In practice, a thread-per-connection model cannot scale to this theoretical limit. CPU context-switching overhead severely degrades system performance when active threads exceed 2,000. Each context switch forces the CPU to save and restore registers, invalidates the Translation Lookaside Buffer (TLB), and causes CPU cache misses, spending more time on scheduling overhead than actual computation.

Additionally, operating systems use file descriptors to manage network sockets. By default, standard Unix kernels impose file descriptor limits (e.g., ulimit -n configured to 1024 or 4096), which must be manually tuned. Even with system tuning, a client connection is uniquely identified by a 4-tuple: (Source IP, Source Port, Destination IP, Destination Port). When communicating with a single database server or upstream dependency, the client is limited to a maximum of 65,535 ephemeral ports.

Furthermore, when connections are closed, sockets transition to the TIME_WAIT state for double the Maximum Segment Lifetime ($2 \times \text{MSL}$, typically 120 seconds) to ensure stray packets are discarded. A high rate of short-lived connections quickly exhausts the ephemeral port pool, causing new connection attempts to fail with Address already in use errors.

To support 1,000,000 concurrent active users, we must:
- Transition to a non-blocking asynchronous event loop (such as Netty or Node-style async event loops) that handles thousands of connections on a small, fixed pool of worker threads.
- Scale horizontally across a cluster of multiple servers to distribute the file descriptor and ephemeral port load.

2. Bandwidth and Network Limitations

We must calculate the total raw network bandwidth required to ingress and egress data through the application layer.

Assumptions:
- Total concurrent request load = $100,000$ requests/second
- Average JSON request + response payload size = $50$ KB
Calculations: $$\text{Network Bandwidth Required} = 100,000\text{ requests/s} \times 50\text{ KB} = 5,000,000\text{ KB/second} = 5\text{ GB/second}$$ $$\text{Bandwidth in Bits} = 5\text{ GB/s} \times 8 \approx 40\text{ Gbps}$$

A standard high-performance commodity server's network interface card (NIC) typically supports 10 Gbps. A 40 Gbps load exceeds the physical throughput of a single server's network connection by a factor of four. To prevent packet loss and queue buffer bloat at the network level, we must use a layer-4 or layer-7 load balancer to distribute the network load across a cluster of 5 to 10 independent servers, each equipped with dedicated 10 Gbps or 25 Gbps interfaces.

Failure Scenarios and Resilience

Distributed systems must be designed assuming failures are common.

1. Network Partitions (The CAP Split)

A network switch fails, isolating one availability zone (AZ-A) from the remaining cluster.

The Threat: If AZ-A continues to accept writes while disconnected from the replica database shards, the database states diverge, causing data inconsistencies.
Resilience Design:
- We use Consensus Algorithms (such as Raft or Paxos).
- Shards are organized into replica groups. To commit a write, the primary node must receive a confirmation quorum from the majority of replicas: $$\text{Quorum} \ge \frac{N}{2} + 1$$
- If a partition isolates AZ-A, the isolated primary node fails to reach quorum, blocks writes, and transitions to read-only status. The healthy majority partition elects a new primary and continues to accept writes, maintaining consistency.

2. Cascading Failures and Thundering Herds

A database shard crashes under a high load. When it restarts, thousands of waiting client requests hit the database simultaneously.

The Threat: This sudden spike in traffic immediately overloads the database, causing it to crash again and creating a cascading outage loop.
Resilience Design:
- We configure Circuit Breakers on the clients. When the database fails, the client circuit breaker trips and returns immediate error responses (or cached values) for a set time, protecting the database.
- We apply Exponential Backoff with Jitter on retries: $$\text{Retry Delay} = \min(\text{max_delay}, \text{base_delay} \times 2^{\text{attempt}}) \pm \text{random_jitter}$$
- This spreads request retries evenly over time, preventing thundering herds.

3. Clock Drift and Out-of-Order Transactions

Two database servers in different regions sync their clocks using Network Time Protocol (NTP), but their clocks drift by 150 milliseconds.

The Threat: User A writes at 12:00:00.100 on Server A, and User B writes an update to the same row at 12:00:00.050 on Server B (due to clock drift, Server B is behind). Under a last-write-wins (LWW) resolution model, User A's update is kept, and User B's update is discarded, corrupting state.
Resilience Design:
- We avoid relying on physical timestamps to order writes.
- We use Vector Clocks or Logical Clocks to track causal relationships between updates.
- Modern databases use Hybrid Logical Clocks (HLC), combining physical time with monotonic counters to guarantee sequence ordering across nodes.

4. Telemetry and Telemetry Resource Starvation

A system logs every network socket connection event to diagnose errors.

The Threat: High logging volume consumes disk space and CPU cycles, stalling application threads.
Resilience Design:
- We use Log Level Filtering and Sampled Tracing.
- In production, we filter out informational logs and only sample a small percentage of successful requests (e.g., 1% of transactions) for distributed tracing, while logging all errors, protecting system resources.

Architectural Trade-offs

Designing a distributed system requires balancing resource limits against application complexity.

Trade-off 1: Horizontal Scaling vs. Vertical Scaling

Horizontal scaling adds more commodity servers; vertical scaling upgrades existing server hardware (CPU/RAM).

Feature / Metric	Horizontal Scaling	Vertical Scaling
Cost Efficiency	High. Uses cost-effective commodity hardware.	Low. Upgrading to high-end hardware is expensive.
Availability	High. Failures are bypassed automatically.	Low. The server remains a single point of failure.
System Complexity	High. Requires load balancing and sharding.	Low. No changes are required to application code.
Scaling Limit	High. Scalable to thousands of nodes.	Low. Limited by physical motherboard bounds.

Trade-off 2: Shared-Everything vs. Shared-Nothing Architectures

Shared-Everything nodes access the same disk and memory; Shared-Nothing nodes maintain private resources.

Feature / Metric	Shared-Everything	Shared-Nothing
Scalability	Low. Disk I/O bottlenecks limit horizontal scaling.	High. Nodes scale independently without shared resources.
Data Synchronization	High. Handled automatically by the hardware layer.	Low. Requires application-layer consensus and replication.
Fault Isolation	Poor. A corrupt disk crashes all nodes.	High. Node failures do not impact other nodes.

Staff Engineer Perspective

Operating distributed systems requires designing for partial failures and understanding network realities.

Verbal Script

Interviewer: "What is the difference between horizontal and vertical scaling, and when should you choose one over the other?"

Candidate: "Vertical scaling, or scaling up, involves adding more resources (like CPU, RAM, or disk space) to a single server instance. Horizontal scaling, or scaling out, involves adding more independent commodity servers to the application pool.

Vertical scaling is simpler because it requires no changes to the application code; you are running the same software on a larger machine.

However, vertical scaling is limited by physical hardware limits, and it creates a single point of failure (if that server fails, the system is offline).

We choose horizontal scaling when:

The write rate or network throughput exceeds the capacity of a single physical server.
We require high availability; horizontal scaling allows us to distribute nodes across availability zones and regions to survive hardware failures.
It is more cost-effective. Commodity servers are cheaper to scale horizontally than high-end database hardware.

However, horizontal scaling introduces complexity: we must implement load balancing, data sharding, and network consensus protocols to keep data synchronized."

Interviewer: "Explain the CAP Theorem and how it affects database design during network partitions."

Candidate: "The CAP Theorem states that a distributed data store can simultaneously provide at most two of three guarantees: Consistency (every read receives the most recent write or an error), Availability (every non-failing node returns a response), and Partition Tolerance (the system continues to operate despite network packet losses).

Crucially, Partition Tolerance (P) is a physical reality. Networks will eventually partition.

Therefore, a distributed database must choose between Consistency (C) and Availability (A) when a partition occurs:

If we choose Consistency (CP): The system rejects reads and writes on nodes that cannot reach the primary database to prevent stale data. This is critical for transactional systems (like financial ledgers) where accuracy is mandatory.
If we choose Availability (AP): All nodes accept reads and writes, even if they cannot synchronize. This results in eventual consistency, where nodes temporarily serve stale data. This is preferred for social media feeds or comment sections where uptime is more important than absolute real-time accuracy.

Our database design must match this CAP trade-off depending on the business requirements of each service."

Interviewer: "How do you handle clock drift in a distributed system, and why are physical timestamps dangerous for ordering events?"

Candidate: "We handle clock drift by using logical clocks (like Lamport timestamps) or Hybrid Logical Clocks (HLC), and we avoid relying on physical timestamps to order transactions.

Physical clocks on servers drift due to temperature, hardware quality, and NTP synchronization adjustments.

If Node A's clock drifts ahead by 100 milliseconds, it will write a transaction with a future timestamp.

If Node B later writes an update to the same row, its transaction will have an earlier timestamp.

Under a Last-Write-Wins (LWW) resolution model, Node B's newer update is discarded, corrupting the database.

To solve this:

We use HLCs, which combine physical timestamps with logical sequence numbers. When a node receives a write request, it compares the physical clock with the transaction's incoming timestamp. If the incoming timestamp is in the future, the HLC advances its logical counter beyond that value, guaranteeing monotonic ordering.
Alternatively, for transactional safety, we use consensus groups (like Raft) where a single leader node sequences all write events, bypassing physical clock dependencies."

System Design Module 1: Introduction to Distributed Systems

Bridge the gap between architecture diagrams and implementation details.

System Requirements

Target Goals of Distributed Systems

Design Constraints

API Design and Interface Contracts

1. Cluster Membership Heartbeat API (gRPC Protocol)

2. Node Metadata Sync Contract (HTTP GET `/v1/cluster/topology`)

High-Level Architecture

Monolithic Single Server vs. Distributed Horizontal Cluster

CAP Theorem: Partition Tolerance and Consistency Split

Low-Level Design and Schema

Schema Rationale & Index Optimization

Scaling Challenges and Capacity Estimation

1. Monolithic Thread and Socket Limits

2. Bandwidth and Network Limitations

Failure Scenarios and Resilience

1. Network Partitions (The CAP Split)

2. Cascading Failures and Thundering Herds

3. Clock Drift and Out-of-Order Transactions

4. Telemetry and Telemetry Resource Starvation

Architectural Trade-offs

Trade-off 1: Horizontal Scaling vs. Vertical Scaling

Trade-off 2: Shared-Everything vs. Shared-Nothing Architectures

Staff Engineer Perspective

Verbal Script

Read Next

Want to track your progress?

System Design Module 1: Introduction to Distributed Systems

Bridge the gap between architecture diagrams and implementation details.

System Requirements

Target Goals of Distributed Systems

Design Constraints

API Design and Interface Contracts

1. Cluster Membership Heartbeat API (gRPC Protocol)

2. Node Metadata Sync Contract (HTTP GET /v1/cluster/topology)

High-Level Architecture

Monolithic Single Server vs. Distributed Horizontal Cluster

CAP Theorem: Partition Tolerance and Consistency Split

Low-Level Design and Schema

Schema Rationale & Index Optimization

Scaling Challenges and Capacity Estimation

1. Monolithic Thread and Socket Limits

2. Bandwidth and Network Limitations

Failure Scenarios and Resilience

1. Network Partitions (The CAP Split)

2. Cascading Failures and Thundering Herds

3. Clock Drift and Out-of-Order Transactions

4. Telemetry and Telemetry Resource Starvation

Architectural Trade-offs

Trade-off 1: Horizontal Scaling vs. Vertical Scaling

Trade-off 2: Shared-Everything vs. Shared-Nothing Architectures

Staff Engineer Perspective

Verbal Script

Read Next

Want to track your progress?

2. Node Metadata Sync Contract (HTTP GET `/v1/cluster/topology`)