Speculative Retries: Solving the P99 Tail
In a large distributed system, the "tail latency" (P99.9) is often dominated by a single "slow" node. This is the Tail at Scale problem. No matter how much you optimize your code, things like GC pauses, network flaps, or noisy neighbors will eventually make one request out of 1,000 extremely slow.
Google's solution? Speculative Retries (also known as Hedged Requests or Backup Requests).
1. The Strategy: Hedged Requests
Instead of waiting for a slow request to time out, you issue a second request in parallel if the first one hasn't responded within a certain time (the "hedging" threshold).
- The Logic:
- Send request to Node A.
- Wait up to the 95th-percentile latency (e.g., 20 ms).
- If no response, send the same request to Node B.
- Use whichever response arrives first.
2. Why this works
The probability of two independent nodes being slow at the exact same moment is far lower than the probability of one node being slow. By issuing a backup request at the P95 mark, you effectively "clip" the tail of your latency distribution: a request is now slow only when both copies land in their own slow tails.
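The independence argument can be checked with a quick back-of-the-envelope calculation. This is a sketch of the arithmetic, assuming the original and backup latencies are fully independent (in practice correlated slowness, e.g. a shared overloaded switch, weakens the effect):

```java
public class HedgeMath {
    public static void main(String[] args) {
        // Hedging at the P95 threshold: 5% of first requests are "slow".
        double pSlow = 0.05;
        // Assuming independence, the hedged request is slow only when
        // both the original and the backup land in their slow tails.
        double pBothSlow = pSlow * pSlow; // 0.0025, i.e. 0.25%
        System.out.printf("P(one slow)  = %.4f%n", pSlow);
        System.out.printf("P(both slow) = %.4f%n", pBothSlow);
    }
}
```

That is a ~20x reduction in the fraction of requests that blow past the threshold, bought with at most 5% extra traffic.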
3. Implementation in Java (CompletableFuture)
// Assumes a ScheduledExecutorService field named "scheduler" and a
// callService(Request, String) helper returning CompletableFuture<Response>.
public CompletableFuture<Response> getHedgedResponse(Request req) {
    CompletableFuture<Response> first = callService(req, "NodeA");

    // Create a backup future that is only completed if the hedge actually fires.
    CompletableFuture<Response> backup = new CompletableFuture<>();
    scheduler.schedule(() -> {
        if (!first.isDone()) { // first call missed the hedging threshold
            callService(req, "NodeB").whenComplete((res, ex) -> {
                // Ignore backup failures: if NodeB errors, keep waiting on NodeA.
                if (ex == null) backup.complete(res);
            });
        }
    }, 20, TimeUnit.MILLISECONDS); // 20 ms ~= the observed P95 latency

    // Whichever future completes first wins; anyOf yields Object, so cast back.
    return CompletableFuture.anyOf(first, backup)
            .thenApply(res -> (Response) res);
}
4. The Trade-off: Resource Overhead
Speculative retries are not free.
- Cost: If you hedge at the P95 mark, you increase your total request volume by roughly 5%, since about one request in twenty gets a second copy.
- The Catch: You must ensure your backend has the capacity to absorb that extra load. If your system is already at 90% CPU, speculative retries can tip it into a Cascading Failure.
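One common guard against that failure mode is a hedging budget: cap the number of backup requests in flight so hedging cannot amplify load when the cluster is already struggling. The class below is a hypothetical sketch of that idea (gRPC's hedging policy uses a similar throttle), not part of the original example:

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Caps concurrent backup requests so hedging cannot amplify an overload. */
public class HedgeBudget {
    private final AtomicInteger inFlight = new AtomicInteger();
    private final int maxHedges;

    public HedgeBudget(int maxHedges) { this.maxHedges = maxHedges; }

    /** Try to reserve a slot for one backup request; caller must release() when done. */
    public boolean tryAcquire() {
        while (true) {
            int current = inFlight.get();
            if (current >= maxHedges) return false; // budget exhausted: skip the hedge
            if (inFlight.compareAndSet(current, current + 1)) return true;
        }
    }

    public void release() { inFlight.decrementAndGet(); }

    public static void main(String[] args) {
        HedgeBudget budget = new HedgeBudget(1);
        System.out.println(budget.tryAcquire()); // true: slot available
        System.out.println(budget.tryAcquire()); // false: budget exhausted, no hedge sent
        budget.release();
        System.out.println(budget.tryAcquire()); // true: slot freed and re-acquired
    }
}
```

When `tryAcquire()` returns false, the request simply proceeds without a hedge; you give up tail protection for that one call instead of risking a retry storm.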
Summary
Speculative retries are among the most powerful tools in a staff engineer's kit for maintaining consistent latency in large clusters. By spending a small amount of extra capacity (roughly 5%), you can cut your P99.9 tail latency from seconds to milliseconds.
