System Design: Data Partitioning and Sharding
When a database outgrows a single machine, vertical scaling (upgrading RAM, CPU, or storage) eventually hits a hard physical ceiling and becomes exponentially expensive. To scale to hundreds of millions of active users and petabytes of data, you must scale out horizontally.
Sharding is the architectural practice of breaking up a massive monolithic database into smaller, faster, and more manageable pieces called physical shards. This guide breaks down the core patterns, algorithmic mechanics, and production realities of sharding distributed databases at high scale.
Requirements and System Goals
Designing a sharded database infrastructure requires balancing query routing overhead, uniform data distribution, and partition tolerance.
1. Functional Requirements
- Uniform Ingress & Egress: Distribute read and write queries evenly across all physical shards without creating bottlenecks.
- Support for Arbitrary Queries: Maintain the ability to query data by both primary identifiers (e.g.,
user_id) and secondary attributes (e.g.,email,created_at). - Elastic Cluster Resizing: Allow adding or removing physical database nodes with minimal data redistribution and zero client-visible downtime.
- Deterministic Routing: Resolve the target physical database shard for any request in sub-millisecond lookup times.
2. Non-Functional Requirements
- High Availability: Maintain a multi-primary or primary-replica architecture for every shard to ensure no single point of failure (SPOF) exists.
- Latency Budgets: Query routing proxies must introduce less than 1.5ms of overhead to the end-to-end P99 latency.
- Consistency Commitments: Support strong consistency (linearizability) within a single shard, and support configurable eventual consistency models for cross-shard queries.
- Minimal Cross-Shard Operations: Minimize or completely eliminate multi-shard transactions and joins, which introduce severe network coordination overhead.
API Interfaces and Service Contracts
To separate application logic from physical database locations, systems utilize a routing proxy layer (such as Vitess or Citus). Below are the service contracts for both routing requests and dynamic directory-lookup operations.
1. User Entity Query API
When a service queries a sharded record, the API gateway or sharding proxy inspects the payload or headers to extract the shard key.
GET /api/v1/users/{userId}
Accept: application/json
Response Payload (200 OK):
{
"user_id": "usr_9988776655",
"organization_id": "org_443322",
"email": "engineering@codesprintpro.com",
"first_name": "Antigravity",
"last_name": "Architect",
"created_at": 1774895600,
"shard_routing_metadata": {
"assigned_shard_id": "shard_us_east_04",
"routing_mechanism": "consistent_hashing",
"routing_version": "v2.1.0"
}
}
2. Dynamic Shard Directory Mapping Lookup
For directory-based sharding, a central coordinator (such as ZooKeeper or a dedicated metadata database) exposes an interface to resolve shard keys to physical IP addresses.
POST /api/v1/directory/resolve
Content-Type: application/json
Request Payload:
{
"shard_key": "org_443322",
"key_type": "organization_id"
}
Response Payload (200 OK):
{
"shard_key": "org_443322",
"shard_id": "shard_us_east_04",
"physical_connection": {
"primary_host": "db-primary-shard-04.us-east.internal",
"primary_port": 5432,
"replica_hosts": [
"db-replica-shard-04a.us-east.internal",
"db-replica-shard-04b.us-east.internal"
],
"ssl_mode": "verify-full"
},
"status": "HEALTHY",
"active_connections": 1420
}
High-Level Design and Visualizations
A robust sharded database system contains three main tiers: the stateless application services, a highly available sharding proxy router, and the partitioned storage nodes.
1. Database Proxy Routing Architecture
The routing layer inspects the query, extracts the shard key (e.g., tenant_id or user_id), hashes it, and forwards the command to the correct database instance.
graph TD
App[Stateless App Services] -->|SQL Query / Extract Shard Key| Proxy[Sharding Proxy Router]
Proxy -->|Read Shard Configuration| Zoo[ZooKeeper / Consul Registry]
Proxy -->|Hash / Range Route| Shard1[Physical Shard Group 01]
Proxy -->|Hash / Range Route| Shard2[Physical Shard Group 02]
Proxy -->|Hash / Range Route| Shard3[Physical Shard Group 03]
subgraph Shard1
S1_Primary[(Primary DB 01)] -->|Async Replication| S1_Replica[(Replica DB 01)]
end
subgraph Shard2
S2_Primary[(Primary DB 02)] -->|Async Replication| S2_Replica[(Replica DB 02)]
end
subgraph Shard3
S3_Primary[(Primary DB 03)] -->|Async Replication| S3_Replica[(Replica DB 03)]
end
2. Consistent Hashing Ring with Virtual Nodes
Consistent hashing ensures that when a new physical database server is introduced, only a minimal subset of keys are re-allocated. By hashing both keys and servers to a circular $360^\circ$ space ($0$ to $2^{32}-1$), we achieve elegant load distribution.
graph TD
subgraph Consistent Hashing Ring
RingNode0["0 degrees (Start)"] --> RingNode90["90 degrees (Shard A - vNode 1)"]
RingNode90 --> RingNode180["180 degrees (Shard B - vNode 1)"]
RingNode180 --> RingNode270["270 degrees (Shard A - vNode 2)"]
RingNode270 --> RingNode360["360 degrees (Shard B - vNode 2)"]
end
Key1[User Key: usr_1001] -->|Hash Value: 120 deg| RingNode180
Key2[User Key: usr_1002] -->|Hash Value: 290 deg| RingNode360
style RingNode90 fill:#f9f,stroke:#333,stroke-width:2px
style RingNode180 fill:#bbf,stroke:#333,stroke-width:2px
style RingNode270 fill:#f9f,stroke:#333,stroke-width:2px
style RingNode360 fill:#bbf,stroke:#333,stroke-width:2px
Low-Level Design and Schema Strategies
In a sharded architecture, the data layout must enforce local uniqueness and enable fast indexing within each database instance. We define a highly scalable structure for a multi-tenant SaaS application below.
1. Shard Mapping Directory Schema (Global Registry)
If using directory-based sharding, a global configuration store tracks the mappings. This table lives in a highly resilient database cluster or is aggressively cached in Redis.
CREATE TABLE global_shard_directory (
tenant_id VARCHAR(64) PRIMARY KEY,
shard_id VARCHAR(32) NOT NULL,
status VARCHAR(16) NOT NULL DEFAULT 'ACTIVE',
created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_global_shard_status ON global_shard_directory(status);
2. Sharded Table Schema (Instantiated on Each Physical Shard)
This is the tenant schema that resides inside each independent database shard instance. Note that the shard key (tenant_id) is composite-indexed with the primary keys to avoid cross-instance lookup requirements.
-- Executed on physical database instance: db-shard-us-east-04
CREATE TABLE customers (
tenant_id VARCHAR(64) NOT NULL,
customer_id VARCHAR(64) NOT NULL,
email VARCHAR(255) NOT NULL,
password_hash VARCHAR(255) NOT NULL,
first_name VARCHAR(100),
last_name VARCHAR(100),
created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (tenant_id, customer_id)
);
-- Local index covering secondary lookup patterns within this specific shard
CREATE UNIQUE INDEX idx_local_customers_email ON customers (tenant_id, email);
3. Global Secondary Index Lookup Table
Since querying the database by email would require a slow, expensive multi-shard broadcast (fan-out query), we maintain a separate global lookup index table sharded by the email field itself.
CREATE TABLE global_email_to_tenant_lookup (
email VARCHAR(255) PRIMARY KEY,
tenant_id VARCHAR(64) NOT NULL,
shard_id VARCHAR(32) NOT NULL
);
Scaling and Operational Challenges
Sharding is not a magic solution; it introduces complex mathematical constraints and significant operational challenges.
1. Capacity Estimation (Back-of-the-Envelope Calculations)
Let us calculate the storage and throughput capacity for a large-scale global service:
- Total Users: 10 Billion ($1 \times 10^{10}$) registered profiles.
- Row Storage footprint: Each user profile record requires 500 Bytes of storage (including overhead for indexes).
- Total Raw Data Size: $$1 \times 10^{10} \times 500 \text{ Bytes} = 5 \times 10^{12} \text{ Bytes} = 5 \text{ Terabytes (TB)}$$
- Indexing Overhead: Indexes on primary key and secondary lookups add another 40% storage overhead: $$5 \text{ TB} \times 1.4 = 7 \text{ TB}$$
- Write Throughput (Peak): 50,000 write operations per second.
- Single Database Limitation: A standard cloud-based PostgreSQL instance can safely handle up to 3,000 high-performance writes per second.
- Required Shards (Minimum): $$\text{Shards} = \frac{50,000 \text{ writes/sec}}{3,000 \text{ writes/sec}} \approx 17 \text{ Shards}$$ To provide a safe buffer for spikes and uneven traffic, we select 32 physical shards.
2. The Mathematics of Consistent Hashing Re-Sharding
In a naive sharding strategy using a standard modulo algorithm: $$\text{Shard ID} = \text{Hash}(\text{Key}) \pmod N$$ where $N$ is the number of active shards.
-
If you scale your system from $N$ shards to $N+1$ shards, the modulo value changes for almost every single key. The percentage of keys that must be migrated over the network is: $$\text{Relocation Ratio}_{\text{modulo}} = \frac{N}{N+1}$$ For example, scaling from 9 to 10 nodes forces 90% of the entire database to migrate over the network, bringing down database performance.
-
By contrast, using Consistent Hashing, the system hashes both the key and the node onto a unit circle. When adding a new node, only the keys that immediately precede the new node on the circle need to be relocated. The expected fraction of keys migrated is only: $$\text{Relocation Ratio}_{\text{consistent}} = \frac{1}{N+1}$$ If scaling from 9 to 10 nodes, only 10% of the data is moved. This is a massive $9 \times$ reduction in network I/O and disk operations!
Trade-offs and Architectural Alternatives
No sharding strategy fits every query pattern. Architects must choose their partitioning dimensions carefully based on access profiles.
| Sharding Strategy | Read Performance (Range Queries) | Write Distribution (Uniformity) | Re-Sharding Complexity | Best Use Case |
|---|---|---|---|---|
| Range-Based | Excellent (Data is sorted; adjacent keys live on the same shard) | Poor (New writes frequently hit the newest range shard, creating hotspots) | Medium (Splitting a range is clean, but requires range metadata updates) | Time-series databases, chronological ledger logs |
| Hash-Based | Poor (Adjacent keys hash to entirely different shards; requires scatter-gather queries) | Excellent (Uniformly distributes rows using cryptographic hashes) | High (Mitigated by consistent hashing, but still requires physical key moves) | General User/Identity data stores, transaction records |
| Directory-Based | Good (Configurable mappings allow arbitrary co-location of records) | Good (Can balance load dynamically by re-mapping key values to quiet shards) | Low (Simply update the directory map pointers and move individual rows) | Multi-tenant B2B SaaS with extremely large, highly dynamic accounts |
Failure Modes and Fault Tolerance Strategies
Distributed databases operate under constant threat of hardware degradation, network failure, and split-brain configurations.
1. Shard Node Primary Failure & Election
If a primary node of Shard Group 04 goes offline due to physical hardware failure:
- Detection: The sharding proxy and cluster coordinator detect missing heartbeats for longer than 3 seconds.
- Failover Sequence: The coordinator initiates a consensus vote (using Raft/Paxos via ZooKeeper) to promote the most up-to-date read-replica of Shard Group 04 to primary.
- Routing Update: The routing proxy updates its dynamic directory map to redirect all write traffic for Shard Group 04 to the new primary.
2. Network Partitions and Split-Brain Prevention
If a network partition isolates the primary node from its replicas:
- The Danger: Replicas might assume the primary is dead and elect a new primary, resulting in two nodes accepting writes for the same shard simultaneously. This will corrupt the data.
- The Solution: Enforce Quorum-based writes. A primary node is only allowed to accept writes if it can successfully replicate the write to a majority of its configured replicas: $$\text{Quorum Write Requirement} = \left( \lfloor \frac{R}{2} \rfloor + 1 \right)$$ where $R$ is the replica group size. If it cannot achieve quorum, it rejects writes and steps down as primary.
Staff Engineer Perspective
Verbal Script
Interviewer: "How would you design a horizontally scalable database system for a ride-hailing platform like Uber, handling millions of active rides globally?"
Candidate: "To handle millions of rides globally, a single database instance will quickly exhaust its disk write IOPS and connection pool limit. I would design a highly available, horizontally sharded relational database system.
First, I need to select the optimal Shard Key. A naive choice like ride_id would distribute rides uniformly using a hash-based sharding strategy. However, this is an anti-pattern because the most common operational queries—such as matching a rider with nearby drivers or showing active rides—are heavily dependent on geography. Querying by location on a ride_id hash-sharded system would force a scatter-gather query across all physical shards, destroying our latency budget.
Therefore, I would choose a spatial shard key based on geography, using a routing system like Google's S2 Geometry or Uber's H3 Hexagonal Hierarchical Spatial Index at a specific resolution level. We can map these spatial cells into logical regions—for example, city_id or a combination of country_code and s2_cell_id.
To manage this mapping dynamically, I would use Directory-Based Sharding paired with Consistent Hashing. stateless microservices will query our routing proxy layer. The routing proxy will query a high-performance, in-memory caching directory (backed by Redis and coordinated by ZooKeeper) to resolve the city_id to its physical database shard.
For instance, high-density metropolitan areas like New York City or London will have their own dedicated physical shard groups to isolate their high write throughput. Conversely, multiple lower-volume cities can be co-located within a single shard group.
To handle hot spots—such as a major sporting event or New Year's Eve traffic spikes in a specific city—we will introduce a Salting Strategy on the shard key. If a particular city cell is identified as extremely hot, we append a deterministic pseudo-random suffix—such as city_id_1, city_id_2—to split the write load across multiple shard instances.
For data integrity, every physical shard will run in a high-availability primary-replica group with semi-synchronous replication. This ensures strong consistency and instant failover without split-brain corruption, matching our strict CP operational requirements during network partitions."