Lesson 43 of 105 12 minFlagship

System Design: Data Partitioning and Sharding Strategies

How to scale databases horizontally? A deep dive into Partitioning, Sharding (Horizontal vs Vertical), and choosing the right Shard Key to avoid Hotspots.

Reading Mode

Hide the curriculum rail and keep the lesson centered for focused reading.

Key Takeaways

  • **Vertical Partitioning (Normalization):** Splitting tables by columns (e.g., storing user metadata and user preferences in separate tables).
  • **Horizontal Partitioning (Sharding):** Splitting the table by rows. Server 1 holds user IDs 1-1M, Server 2 holds 1M-2M.
  • **Hash-based Sharding:** Distributing rows uniformly using a hash function on the shard key.
Recommended Prerequisites
System Design Interview Framework

Premium outcome

From vague architecture answers to staff-level trade-off thinking.

Backend engineers preparing for senior, staff, and architecture rounds.

What you unlock

  • A reusable system design answer framework for ambiguous prompts
  • Clear language for consistency, scaling, and reliability trade-offs
  • Case-study depth across feeds, payments, storage, and messaging systems

System Design: Data Partitioning and Sharding

When a database outgrows a single machine, vertical scaling (upgrading RAM, CPU, or storage) eventually hits a hard physical ceiling and becomes exponentially expensive. To scale to hundreds of millions of active users and petabytes of data, you must scale out horizontally.

Sharding is the architectural practice of breaking up a massive monolithic database into smaller, faster, and more manageable pieces called physical shards. This guide breaks down the core patterns, algorithmic mechanics, and production realities of sharding distributed databases at high scale.


Requirements and System Goals

Designing a sharded database infrastructure requires balancing query routing overhead, uniform data distribution, and partition tolerance.

1. Functional Requirements

  • Uniform Ingress & Egress: Distribute read and write queries evenly across all physical shards without creating bottlenecks.
  • Support for Arbitrary Queries: Maintain the ability to query data by both primary identifiers (e.g., user_id) and secondary attributes (e.g., email, created_at).
  • Elastic Cluster Resizing: Allow adding or removing physical database nodes with minimal data redistribution and zero client-visible downtime.
  • Deterministic Routing: Resolve the target physical database shard for any request in sub-millisecond lookup times.

2. Non-Functional Requirements

  • High Availability: Maintain a multi-primary or primary-replica architecture for every shard to ensure no single point of failure (SPOF) exists.
  • Latency Budgets: Query routing proxies must introduce less than 1.5ms of overhead to the end-to-end P99 latency.
  • Consistency Commitments: Support strong consistency (linearizability) within a single shard, and support configurable eventual consistency models for cross-shard queries.
  • Minimal Cross-Shard Operations: Minimize or completely eliminate multi-shard transactions and joins, which introduce severe network coordination overhead.

API Interfaces and Service Contracts

To separate application logic from physical database locations, systems utilize a routing proxy layer (such as Vitess or Citus). Below are the service contracts for both routing requests and dynamic directory-lookup operations.

1. User Entity Query API

When a service queries a sharded record, the API gateway or sharding proxy inspects the payload or headers to extract the shard key.

GET /api/v1/users/{userId}
Accept: application/json

Response Payload (200 OK):

{
  "user_id": "usr_9988776655",
  "organization_id": "org_443322",
  "email": "engineering@codesprintpro.com",
  "first_name": "Antigravity",
  "last_name": "Architect",
  "created_at": 1774895600,
  "shard_routing_metadata": {
    "assigned_shard_id": "shard_us_east_04",
    "routing_mechanism": "consistent_hashing",
    "routing_version": "v2.1.0"
  }
}

2. Dynamic Shard Directory Mapping Lookup

For directory-based sharding, a central coordinator (such as ZooKeeper or a dedicated metadata database) exposes an interface to resolve shard keys to physical IP addresses.

POST /api/v1/directory/resolve
Content-Type: application/json

Request Payload:

{
  "shard_key": "org_443322",
  "key_type": "organization_id"
}

Response Payload (200 OK):

{
  "shard_key": "org_443322",
  "shard_id": "shard_us_east_04",
  "physical_connection": {
    "primary_host": "db-primary-shard-04.us-east.internal",
    "primary_port": 5432,
    "replica_hosts": [
      "db-replica-shard-04a.us-east.internal",
      "db-replica-shard-04b.us-east.internal"
    ],
    "ssl_mode": "verify-full"
  },
  "status": "HEALTHY",
  "active_connections": 1420
}

High-Level Design and Visualizations

A robust sharded database system contains three main tiers: the stateless application services, a highly available sharding proxy router, and the partitioned storage nodes.

1. Database Proxy Routing Architecture

The routing layer inspects the query, extracts the shard key (e.g., tenant_id or user_id), hashes it, and forwards the command to the correct database instance.

graph TD
    App[Stateless App Services] -->|SQL Query / Extract Shard Key| Proxy[Sharding Proxy Router]
    Proxy -->|Read Shard Configuration| Zoo[ZooKeeper / Consul Registry]
    
    Proxy -->|Hash / Range Route| Shard1[Physical Shard Group 01]
    Proxy -->|Hash / Range Route| Shard2[Physical Shard Group 02]
    Proxy -->|Hash / Range Route| Shard3[Physical Shard Group 03]

    subgraph Shard1
        S1_Primary[(Primary DB 01)] -->|Async Replication| S1_Replica[(Replica DB 01)]
    end
    subgraph Shard2
        S2_Primary[(Primary DB 02)] -->|Async Replication| S2_Replica[(Replica DB 02)]
    end
    subgraph Shard3
        S3_Primary[(Primary DB 03)] -->|Async Replication| S3_Replica[(Replica DB 03)]
    end

2. Consistent Hashing Ring with Virtual Nodes

Consistent hashing ensures that when a new physical database server is introduced, only a minimal subset of keys are re-allocated. By hashing both keys and servers to a circular $360^\circ$ space ($0$ to $2^{32}-1$), we achieve elegant load distribution.

graph TD
    subgraph Consistent Hashing Ring
        RingNode0["0 degrees (Start)"] --> RingNode90["90 degrees (Shard A - vNode 1)"]
        RingNode90 --> RingNode180["180 degrees (Shard B - vNode 1)"]
        RingNode180 --> RingNode270["270 degrees (Shard A - vNode 2)"]
        RingNode270 --> RingNode360["360 degrees (Shard B - vNode 2)"]
    end
    
    Key1[User Key: usr_1001] -->|Hash Value: 120 deg| RingNode180
    Key2[User Key: usr_1002] -->|Hash Value: 290 deg| RingNode360
    
    style RingNode90 fill:#f9f,stroke:#333,stroke-width:2px
    style RingNode180 fill:#bbf,stroke:#333,stroke-width:2px
    style RingNode270 fill:#f9f,stroke:#333,stroke-width:2px
    style RingNode360 fill:#bbf,stroke:#333,stroke-width:2px

Low-Level Design and Schema Strategies

In a sharded architecture, the data layout must enforce local uniqueness and enable fast indexing within each database instance. We define a highly scalable structure for a multi-tenant SaaS application below.

1. Shard Mapping Directory Schema (Global Registry)

If using directory-based sharding, a global configuration store tracks the mappings. This table lives in a highly resilient database cluster or is aggressively cached in Redis.

CREATE TABLE global_shard_directory (
    tenant_id VARCHAR(64) PRIMARY KEY,
    shard_id VARCHAR(32) NOT NULL,
    status VARCHAR(16) NOT NULL DEFAULT 'ACTIVE',
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_global_shard_status ON global_shard_directory(status);

2. Sharded Table Schema (Instantiated on Each Physical Shard)

This is the tenant schema that resides inside each independent database shard instance. Note that the shard key (tenant_id) is composite-indexed with the primary keys to avoid cross-instance lookup requirements.

-- Executed on physical database instance: db-shard-us-east-04
CREATE TABLE customers (
    tenant_id VARCHAR(64) NOT NULL,
    customer_id VARCHAR(64) NOT NULL,
    email VARCHAR(255) NOT NULL,
    password_hash VARCHAR(255) NOT NULL,
    first_name VARCHAR(100),
    last_name VARCHAR(100),
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (tenant_id, customer_id)
);

-- Local index covering secondary lookup patterns within this specific shard
CREATE UNIQUE INDEX idx_local_customers_email ON customers (tenant_id, email);

3. Global Secondary Index Lookup Table

Since querying the database by email would require a slow, expensive multi-shard broadcast (fan-out query), we maintain a separate global lookup index table sharded by the email field itself.

CREATE TABLE global_email_to_tenant_lookup (
    email VARCHAR(255) PRIMARY KEY,
    tenant_id VARCHAR(64) NOT NULL,
    shard_id VARCHAR(32) NOT NULL
);

Scaling and Operational Challenges

Sharding is not a magic solution; it introduces complex mathematical constraints and significant operational challenges.

1. Capacity Estimation (Back-of-the-Envelope Calculations)

Let us calculate the storage and throughput capacity for a large-scale global service:

  • Total Users: 10 Billion ($1 \times 10^{10}$) registered profiles.
  • Row Storage footprint: Each user profile record requires 500 Bytes of storage (including overhead for indexes).
  • Total Raw Data Size: $$1 \times 10^{10} \times 500 \text{ Bytes} = 5 \times 10^{12} \text{ Bytes} = 5 \text{ Terabytes (TB)}$$
  • Indexing Overhead: Indexes on primary key and secondary lookups add another 40% storage overhead: $$5 \text{ TB} \times 1.4 = 7 \text{ TB}$$
  • Write Throughput (Peak): 50,000 write operations per second.
  • Single Database Limitation: A standard cloud-based PostgreSQL instance can safely handle up to 3,000 high-performance writes per second.
  • Required Shards (Minimum): $$\text{Shards} = \frac{50,000 \text{ writes/sec}}{3,000 \text{ writes/sec}} \approx 17 \text{ Shards}$$ To provide a safe buffer for spikes and uneven traffic, we select 32 physical shards.

2. The Mathematics of Consistent Hashing Re-Sharding

In a naive sharding strategy using a standard modulo algorithm: $$\text{Shard ID} = \text{Hash}(\text{Key}) \pmod N$$ where $N$ is the number of active shards.

  • If you scale your system from $N$ shards to $N+1$ shards, the modulo value changes for almost every single key. The percentage of keys that must be migrated over the network is: $$\text{Relocation Ratio}_{\text{modulo}} = \frac{N}{N+1}$$ For example, scaling from 9 to 10 nodes forces 90% of the entire database to migrate over the network, bringing down database performance.

  • By contrast, using Consistent Hashing, the system hashes both the key and the node onto a unit circle. When adding a new node, only the keys that immediately precede the new node on the circle need to be relocated. The expected fraction of keys migrated is only: $$\text{Relocation Ratio}_{\text{consistent}} = \frac{1}{N+1}$$ If scaling from 9 to 10 nodes, only 10% of the data is moved. This is a massive $9 \times$ reduction in network I/O and disk operations!


Trade-offs and Architectural Alternatives

No sharding strategy fits every query pattern. Architects must choose their partitioning dimensions carefully based on access profiles.

Sharding Strategy Read Performance (Range Queries) Write Distribution (Uniformity) Re-Sharding Complexity Best Use Case
Range-Based Excellent (Data is sorted; adjacent keys live on the same shard) Poor (New writes frequently hit the newest range shard, creating hotspots) Medium (Splitting a range is clean, but requires range metadata updates) Time-series databases, chronological ledger logs
Hash-Based Poor (Adjacent keys hash to entirely different shards; requires scatter-gather queries) Excellent (Uniformly distributes rows using cryptographic hashes) High (Mitigated by consistent hashing, but still requires physical key moves) General User/Identity data stores, transaction records
Directory-Based Good (Configurable mappings allow arbitrary co-location of records) Good (Can balance load dynamically by re-mapping key values to quiet shards) Low (Simply update the directory map pointers and move individual rows) Multi-tenant B2B SaaS with extremely large, highly dynamic accounts

Failure Modes and Fault Tolerance Strategies

Distributed databases operate under constant threat of hardware degradation, network failure, and split-brain configurations.

1. Shard Node Primary Failure & Election

If a primary node of Shard Group 04 goes offline due to physical hardware failure:

  • Detection: The sharding proxy and cluster coordinator detect missing heartbeats for longer than 3 seconds.
  • Failover Sequence: The coordinator initiates a consensus vote (using Raft/Paxos via ZooKeeper) to promote the most up-to-date read-replica of Shard Group 04 to primary.
  • Routing Update: The routing proxy updates its dynamic directory map to redirect all write traffic for Shard Group 04 to the new primary.

2. Network Partitions and Split-Brain Prevention

If a network partition isolates the primary node from its replicas:

  • The Danger: Replicas might assume the primary is dead and elect a new primary, resulting in two nodes accepting writes for the same shard simultaneously. This will corrupt the data.
  • The Solution: Enforce Quorum-based writes. A primary node is only allowed to accept writes if it can successfully replicate the write to a majority of its configured replicas: $$\text{Quorum Write Requirement} = \left( \lfloor \frac{R}{2} \rfloor + 1 \right)$$ where $R$ is the replica group size. If it cannot achieve quorum, it rejects writes and steps down as primary.

Staff Engineer Perspective


Verbal Script

Interviewer: "How would you design a horizontally scalable database system for a ride-hailing platform like Uber, handling millions of active rides globally?"

Candidate: "To handle millions of rides globally, a single database instance will quickly exhaust its disk write IOPS and connection pool limit. I would design a highly available, horizontally sharded relational database system.

First, I need to select the optimal Shard Key. A naive choice like ride_id would distribute rides uniformly using a hash-based sharding strategy. However, this is an anti-pattern because the most common operational queries—such as matching a rider with nearby drivers or showing active rides—are heavily dependent on geography. Querying by location on a ride_id hash-sharded system would force a scatter-gather query across all physical shards, destroying our latency budget.

Therefore, I would choose a spatial shard key based on geography, using a routing system like Google's S2 Geometry or Uber's H3 Hexagonal Hierarchical Spatial Index at a specific resolution level. We can map these spatial cells into logical regions—for example, city_id or a combination of country_code and s2_cell_id.

To manage this mapping dynamically, I would use Directory-Based Sharding paired with Consistent Hashing. stateless microservices will query our routing proxy layer. The routing proxy will query a high-performance, in-memory caching directory (backed by Redis and coordinated by ZooKeeper) to resolve the city_id to its physical database shard.

For instance, high-density metropolitan areas like New York City or London will have their own dedicated physical shard groups to isolate their high write throughput. Conversely, multiple lower-volume cities can be co-located within a single shard group.

To handle hot spots—such as a major sporting event or New Year's Eve traffic spikes in a specific city—we will introduce a Salting Strategy on the shard key. If a particular city cell is identified as extremely hot, we append a deterministic pseudo-random suffix—such as city_id_1, city_id_2—to split the write load across multiple shard instances.

For data integrity, every physical shard will run in a high-availability primary-replica group with semi-synchronous replication. This ensures strong consistency and instant failover without split-brain corruption, matching our strict CP operational requirements during network partitions."


Want to track your progress?

Sign in to save your progress, track completed lessons, and pick up where you left off.