Database Sharding Part 5: The Scatter-Gather Problem
When you implement a sharded database architecture, you establish a primary Shard Key (as discussed in Part 3) to route your queries. If your Shard Key is user_id, retrieving a user's profile is fast and predictable: the routing layer hashes the user_id, identifies the correct physical shard, and executes the query against that single node.
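Conceptually, that routing step is just a hash plus a modulo. A minimal sketch in Python (the shard count and key values are illustrative, not from any particular system):

```python
import hashlib

NUM_SHARDS = 100  # illustrative cluster size

def shard_for(user_id: str) -> int:
    """Map a shard key to a physical shard via simple hash-mod routing."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Every query that carries the shard key touches exactly one shard.
shard = shard_for("user_42")
assert 0 <= shard < NUM_SHARDS
```

Real routers often use consistent hashing instead of plain hash-mod so that adding shards doesn't remap every key, but the principle is the same: the key alone determines the destination.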
But what happens when you need to execute a query that does not contain the Shard Key?
Imagine your application allows users to search for other users by email address:
SELECT * FROM users WHERE email = 'ceo@codesprintpro.com';
Because the query lacks the user_id, the routing layer has no idea which physical server holds this row. It is forced to perform a Scatter-Gather.
The Mechanics of Scatter-Gather
To fulfill the email search query, the routing layer (or your application code) must execute the following steps:
- Scatter: Broadcast the exact same SELECT query to every single physical shard in your cluster concurrently. If you have 100 shards, it opens 100 database connections and fires 100 queries.
- Gather: Wait for all 100 databases to finish executing the query and return their results over the network.
- Merge: Aggregate, sort, and deduplicate the results in the application layer before returning the final payload to the client.
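The three steps above can be sketched in a few lines of Python. The per-shard query function here is an in-memory stand-in for a real database call:

```python
from concurrent.futures import ThreadPoolExecutor

def query_shard(shard_id: int, email: str) -> list:
    """Stand-in for executing the SELECT against one physical shard."""
    # In production this would use a real connection from a per-shard pool.
    fake_shard_data = {
        0: {"ceo@codesprintpro.com": {"user_id": 7, "email": "ceo@codesprintpro.com"}},
    }
    row = fake_shard_data.get(shard_id, {}).get(email)
    return [row] if row else []

def scatter_gather(email: str, num_shards: int) -> list:
    # Scatter: fire the same query at every shard concurrently.
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        futures = [pool.submit(query_shard, s, email) for s in range(num_shards)]
        # Gather: block until every shard has answered.
        results = [f.result() for f in futures]
    # Merge: flatten the per-shard results in the application layer.
    return [row for shard_rows in results for row in shard_rows]

rows = scatter_gather("ceo@codesprintpro.com", num_shards=8)
```

Note that `scatter_gather` cannot return until the slowest future resolves, which is exactly where the latency problem described next comes from.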
The P99 Latency Killer (Tail Latency Amplification)
Scatter-Gather is the ultimate enemy of highly available distributed systems because it makes latency spikes a statistical near-certainty. This is known as Tail Latency Amplification.
Let's assume your individual database shards are highly optimized. 99% of queries finish in 5ms or less. Only 1% of queries experience a latency spike of 200ms (perhaps due to a brief disk I/O wait, an OS context switch, or a Java Garbage Collection pause).
If you execute a single targeted query against one shard, you have a 1% chance of hitting that slow 200ms penalty.
If you execute a Scatter-Gather query against a cluster of 100 shards, the probability that at least one of those shards is currently experiencing a 1% latency spike is:
1 - (0.99 ^ 100) ≈ 0.634
There is a 63.4% chance that your query will take at least 200ms.
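You can verify the arithmetic directly:

```python
p_slow = 0.01   # probability a single shard is in its 1% slow tail
shards = 100

# P(at least one shard is slow) = 1 - P(all shards are fast)
p_any_slow = 1 - (1 - p_slow) ** shards
print(round(p_any_slow, 3))  # → 0.634
```

At 10 shards the same formula gives roughly a 10% chance of hitting the tail; at 500 shards it is over 99%. The effect compounds quickly.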
Because the Gather phase must wait for the slowest shard to respond before returning data to the user, your overall system latency is completely dictated by the absolute worst-performing node in your cluster at any given millisecond. The more shards you add, the slower your scatter-gather queries become.
Scatter-Gather also makes LIMIT and OFFSET pagination painful. If a user asks for ORDER BY created_at DESC LIMIT 10, you cannot ask each shard for 1 row, because any single shard might hold all 10 of the globally newest rows. You must ask every shard for its top 10 rows, pull 1,000 rows into application memory across 100 shards, sort them all globally, and return the final top 10. Deep pagination is worse still: OFFSET 10000 LIMIT 10 forces each shard to return 10,010 rows. The memory overhead on the application server is devastating.
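The global merge just described, assuming each shard has already returned its own sorted top n, looks roughly like this:

```python
def global_top_n(per_shard_results: list, n: int = 10) -> list:
    """Merge per-shard 'ORDER BY created_at DESC LIMIT n' results into a global top n."""
    # All shard_count * n rows land in application memory before sorting.
    all_rows = [row for rows in per_shard_results for row in rows]
    all_rows.sort(key=lambda r: r["created_at"], reverse=True)
    return all_rows[:n]

# Illustrative data: two shards, timestamps as plain integers.
shard_a = [{"id": 1, "created_at": 300}, {"id": 2, "created_at": 100}]
shard_b = [{"id": 3, "created_at": 200}]
top = global_top_n([shard_a, shard_b], n=2)  # rows with ids 1 and 3
```

A heap-based k-way merge can reduce the sort cost, but it cannot reduce what the shards must send over the network, which is the real problem.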
How to Fix Scatter-Gather
If disk is cheap and latency is expensive, the solution to Scatter-Gather is always data duplication.
1. Global Secondary Indexes (GSI)
If you frequently query by a secondary column (like email), you must create a mapping table. A Global Secondary Index is simply another sharded database table where the Shard Key is the email, and the value is the user_id.
When the user searches by email, your application performs two blazing-fast, targeted queries:
- Hit the GSI cluster using the email to get the user_id.
- Hit the primary cluster using the user_id to get the profile.
This completely eliminates the scatter-gather, transforming it into two targeted network requests whose cost stays constant no matter how many shards you add.
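The two-hop lookup reduces to the following shape. Here plain dictionaries stand in for the two separately sharded clusters; the names and data are illustrative:

```python
from typing import Optional

# Stand-ins: each dict represents a sharded cluster behind its own router.
gsi_cluster = {"ceo@codesprintpro.com": 42}             # shard key: email   -> user_id
primary_cluster = {42: {"user_id": 42, "name": "Ada"}}  # shard key: user_id -> profile

def find_user_by_email(email: str) -> Optional[dict]:
    user_id = gsi_cluster.get(email)        # hop 1: targeted query on the GSI cluster
    if user_id is None:
        return None
    return primary_cluster.get(user_id)     # hop 2: targeted query on the primary cluster

profile = find_user_by_email("ceo@codesprintpro.com")
```

Each hop carries its own cluster's shard key, so the router touches exactly one shard per hop.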
2. Eventual Consistency via CDC
Keeping a GSI synchronized with the primary database is challenging. Distributed transactions spanning both clusters (e.g. two-phase commit) are generally too slow and too fragile to run on every write.
Instead, modern architectures use Change Data Capture (CDC). Tools like Debezium monitor the Write-Ahead Log of your primary shards. When a user updates their email, Debezium streams that change to a Kafka topic. A background worker consumes the topic and updates the GSI cluster.
This means the GSI is eventually consistent—it might be a few seconds behind the primary database, but it guarantees that read latency remains flat.
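On the consumer side, the worker's job reduces to applying each change event to the GSI mapping. A simplified in-memory sketch; the event shape here is illustrative, not Debezium's actual envelope:

```python
gsi = {"old@example.com": 42}  # email -> user_id mapping table

def apply_change_event(event: dict) -> None:
    """Apply one CDC change event to the GSI (illustrative event shape)."""
    before, after = event.get("before"), event.get("after")
    if before and before["email"] in gsi:
        del gsi[before["email"]]                 # remove the stale mapping
    if after:
        gsi[after["email"]] = after["user_id"]   # write the new mapping

# A user changes their email; the WAL change arrives seconds later.
apply_change_event({
    "before": {"user_id": 42, "email": "old@example.com"},
    "after":  {"user_id": 42, "email": "new@example.com"},
})
```

In production this handler would also need to be idempotent, since Kafka consumers can redeliver events after a crash.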
3. Dedicated Search Engines
If you need to perform complex text searches (e.g., WHERE description LIKE '%system design%'), neither a primary database nor a GSI will suffice.
For full-text search, the standard pattern is to stream your data via CDC into a dedicated search cluster like Elasticsearch or OpenSearch. These engines are specifically designed to ingest documents and maintain massive inverted indices, abstracting the scatter-gather complexity away from your primary transactional databases.
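The core structure those engines maintain is an inverted index: a map from each term to the set of documents containing it. A toy version, assuming naive whitespace tokenization:

```python
from collections import defaultdict

def build_inverted_index(docs: dict) -> dict:
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {
    1: "system design interview prep",
    2: "sharded database design",
}
index = build_inverted_index(docs)
matches = index["design"]  # a dictionary hit, not a full-table LIKE scan
```

Real engines add tokenization, stemming, relevance scoring, and their own sharding of the index, but the lookup-by-term structure is why they answer text queries without scanning every row.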
Summary
Never allow your application to perform Scatter-Gather queries in the critical path of a user request. As your cluster scales horizontally, tail latency amplification will destroy your performance.
Embrace data duplication, implement Change Data Capture, and build Global Secondary Indexes to guarantee O(1) query routing.
