System Design: Data Partitioning and Sharding
When your database outgrows a single machine, you have two choices: Scale Up (bigger machine) or Scale Out (multiple machines). Sharding (Horizontal Partitioning) is the art of scaling out by splitting your data across multiple servers.
1. Sharding vs. Partitioning
- Vertical Partitioning: Splitting tables by columns (e.g., storing user metadata in one table and user preferences in another); normalization is a common way this happens.
- Horizontal Partitioning (Sharding): Splitting the table by rows. Server 1 holds user IDs 1 to 1,000,000; Server 2 holds 1,000,001 to 2,000,000.
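The row-split above can be sketched as a simple routing function. The server names and the 1,000,000-row shard size are illustrative assumptions, not part of any real topology:

```python
# Route a user ID to the server holding its row range.
# SHARD_SIZE and the "server-N" names are illustrative.
SHARD_SIZE = 1_000_000

def shard_for_user(user_id: int) -> str:
    """Return the server responsible for this user's row."""
    index = (user_id - 1) // SHARD_SIZE  # IDs start at 1
    return f"server-{index + 1}"

print(shard_for_user(42))         # server-1
print(shard_for_user(1_500_000))  # server-2
```

The application (or a routing tier in front of it) calls this before every query, which is why the shard key must appear in your most common queries.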
2. Choosing the Right Shard Key
The shard key is the most important architectural decision. If you get it wrong, you end up with Hotspots.
- Hash-based Sharding: Apply a hash function to the shard key and take the result modulo the number of shards (e.g., `hash(user_id) % N`).
- Pros: Even distribution of data.
- Cons: Hard to perform range queries (e.g., "Find all users joined between X and Y") because data is scattered everywhere.
- Range-based Sharding: Assign contiguous key ranges to shards, e.g., Shard 1 for IDs 0-1000, Shard 2 for 1001-2000.
- Pros: Efficient range queries.
- Cons: Leads to "hot shards" if one range becomes popular (e.g., all new users being added to the newest shard).
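The trade-off between the two strategies can be sketched in a few lines. The shard count, the range width, and the choice of MD5 are illustrative assumptions:

```python
import hashlib

NUM_SHARDS = 4  # illustrative

def hash_shard(key: str) -> int:
    """Hash-based: spreads load evenly, but scatters adjacent keys."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def range_shard(user_id: int) -> int:
    """Range-based: co-locates adjacent IDs, enabling range scans."""
    return min(user_id // 1000, NUM_SHARDS - 1)

# Sequential IDs land on scattered shards under hashing...
print([hash_shard(str(i)) for i in range(5)])
# ...but all on shard 0 under range partitioning.
print([range_shard(i) for i in range(5)])
```

A range query like "users 100-200" touches one shard under `range_shard` but potentially every shard under `hash_shard`; conversely, a burst of new sequential IDs all hammers the last range shard while hashing spreads them out.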
3. Directory-Based Sharding
Maintain a "Shard Map" (a lookup table) that tells the application where each key lives.
- Pros: Extremely flexible; you can move users between shards without changing the key.
- Cons: The lookup table itself becomes a single point of failure and a performance bottleneck.
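A minimal sketch of a shard map, assuming an in-memory dict stands in for what would really be a replicated lookup store; the shard and user names are hypothetical:

```python
# Directory-based sharding: the map, not a formula, decides placement.
shard_map = {"alice": "shard-1", "bob": "shard-2"}

def shard_for(key: str, default: str = "shard-1") -> str:
    """Look up which shard owns this key."""
    return shard_map.get(key, default)

def move_user(key: str, new_shard: str) -> None:
    """Rebalancing is just a map update; the key itself never changes."""
    shard_map[key] = new_shard

move_user("alice", "shard-3")
print(shard_for("alice"))  # shard-3
```

This is why the map must be highly available: every single query pays a lookup before it can even find its data.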
4. The Challenge: Re-Sharding
What happens when your 10 shards are full and you need to add 2 more?
- If you use modulo placement such as `hash(key) % 10`, changing it to `hash(key) % 12` forces you to move almost 100% of your data, because nearly every key's modulus changes.
- The Solution: Use Consistent Hashing to ensure that only roughly K/N of the keys (K keys across N shards) need to be moved when the cluster size changes.
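A minimal consistent-hash ring illustrates the point. This sketch omits virtual nodes (which real systems add for smoother balance), and the shard/user names and MD5 choice are illustrative:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring (no virtual nodes, for clarity)."""

    def __init__(self, nodes):
        # Place each node at a point on the ring determined by its hash.
        self._ring = sorted((self._hash(n), n) for n in nodes)
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first node at or after the key's point.
        i = bisect.bisect(self._points, self._hash(key)) % len(self._ring)
        return self._ring[i][1]

# Growing from 10 to 12 shards: only keys on the arcs claimed by the
# new nodes move, instead of the ~100% a bare `% N` change would force.
old = ConsistentHashRing([f"shard-{i}" for i in range(10)])
new = ConsistentHashRing([f"shard-{i}" for i in range(12)])
keys = [f"user-{i}" for i in range(1000)]
moved = sum(old.node_for(k) != new.node_for(k) for k in keys)
print(f"{moved} of 1000 keys moved")
```

Keys that do not fall on a new node's arc map to the same shard before and after the resize, which is the property that makes incremental re-sharding feasible.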
5. Summary
Sharding is a "last resort" for scaling. It adds significant complexity to queries (cross-shard joins are expensive and often impractical). Before sharding, always exhaust other options: better indexing, caching (Redis), and read replicas. But when you finally do shard, choose your key based on your most frequent access pattern.
