System Design · Advanced · Part 1 of 4 in Distributed Systems Mastery

Speculative Retries: The Google Approach to Solving Tail Latency

How to cut your P99.9 latency by 50% without optimizing a single line of business logic. A deep dive into Backup Requests and Hedged Requests.

Sachin Sarawgi · April 20, 2026 · 2 min read

Speculative Retries: Solving the P99 Tail

In a large distributed system, the "tail latency" (P99.9) is often dominated by a single "slow" node. This is the Tail at Scale problem. No matter how much you optimize your code, things like GC pauses, network flaps, or noisy neighbors will eventually make one request out of 1,000 extremely slow.

Google's solution? Speculative Retries (also known as Hedged Requests or Backup Requests).

1. The Strategy: Hedged Requests

Instead of waiting for a slow request to time out, you issue a second request in parallel if the first one hasn't responded within a certain time (the "hedging" threshold).

  • The Logic:
    1. Send request to Node A.
    2. Wait up to the 95th-percentile latency (e.g., 20ms).
    3. If no response, send the same request to Node B.
    4. Use whichever response arrives first.

2. Why this works

The probability of two independent nodes being slow at the exact same moment is roughly the square of the probability of one node being slow. By issuing a backup request at the P95 mark, you effectively "clip" the tail of your latency distribution.
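The independence argument can be checked with a little arithmetic. The numbers below are illustrative, not measurements:

```java
public class TailMath {
    public static void main(String[] args) {
        double pSlow = 0.001;              // assume 1 in 1,000 requests is "slow"
        double singleTail = pSlow;         // tail probability with a single request
        double hedgedTail = pSlow * pSlow; // both independent replicas must be slow

        // The hedged tail is ~1,000x smaller under the independence assumption
        System.out.printf("single: %.3e, hedged: %.3e%n", singleTail, hedgedTail);
    }
}
```

In practice nodes are never perfectly independent (shared switches, correlated load spikes), so the real improvement is smaller, but the squaring effect is why a single backup request is so effective.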

3. Implementation in Java (CompletableFuture)

private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

public CompletableFuture<Response> getHedgedResponse(Request req) {
    CompletableFuture<Response> first = callService(req, "NodeA");

    // Fire the backup only if the primary is still pending after the
    // hedging threshold (the P95 latency, ~20ms here)
    CompletableFuture<Response> backup = new CompletableFuture<>();
    scheduler.schedule(() -> {
        if (!first.isDone()) {
            callService(req, "NodeB").whenComplete((res, ex) -> {
                if (ex == null) backup.complete(res);
            });
        }
    }, 20, TimeUnit.MILLISECONDS);

    // Whichever future completes first wins; the later result is discarded
    return first.applyToEither(backup, Function.identity());
}
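To see the pattern end to end, here is a self-contained sketch with simulated node latencies. The node names, timings, and `call` helper are invented for illustration; a slow primary (200ms) is beaten by a backup hedged at the 20ms threshold:

```java
import java.util.concurrent.*;
import java.util.function.Function;

public class HedgedDemo {
    static final ScheduledExecutorService scheduler =
            Executors.newScheduledThreadPool(2);

    // Simulated backend call: completes with the node name after a fixed delay
    static CompletableFuture<String> call(String node, long latencyMs) {
        CompletableFuture<String> f = new CompletableFuture<>();
        scheduler.schedule(() -> f.complete(node), latencyMs, TimeUnit.MILLISECONDS);
        return f;
    }

    static CompletableFuture<String> hedged(long thresholdMs) {
        CompletableFuture<String> first = call("NodeA", 200); // slow primary

        CompletableFuture<String> backup = new CompletableFuture<>();
        scheduler.schedule(() -> {
            if (!first.isDone()) {
                call("NodeB", 10).whenComplete((res, ex) -> { // fast backup
                    if (ex == null) backup.complete(res);
                });
            }
        }, thresholdMs, TimeUnit.MILLISECONDS);

        // First response to arrive wins
        return first.applyToEither(backup, Function.identity());
    }

    public static void main(String[] args) throws Exception {
        String winner = hedged(20).get(1, TimeUnit.SECONDS);
        System.out.println("winner = " + winner); // the backup beats the slow primary
        scheduler.shutdown();
    }
}
```

In a real client you would also cancel the losing request (or at least let the server drop it) so the backup doesn't do wasted work once the primary responds.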

4. The Trade-off: Resource Overhead

Speculative retries are not free.

  • Cost: If you hedge at the P95 mark, roughly 5% of requests (those still outstanding at the threshold) get a second copy, increasing your total request volume by about 5%.
  • The Catch: Your backend must have the capacity to absorb that extra load. If your system is already at 90% CPU, speculative retries can push it over the edge and trigger a cascading failure.
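The capacity check is simple back-of-the-envelope arithmetic. The utilization figures below are assumptions for illustration:

```java
public class HedgeBudget {
    public static void main(String[] args) {
        double hedgeFraction = 0.05; // hedging at P95 fires for ~5% of requests
        double currentCpu = 0.90;    // assumed current cluster utilization

        // Extra load scales roughly with the hedge fraction
        double projectedCpu = currentCpu * (1 + hedgeFraction);
        System.out.printf("projected CPU: %.3f%n", projectedCpu);
    }
}
```

At 90% CPU this projects to about 94.5%, which leaves little headroom; hedging at a lower percentile (P99 instead of P95) cuts the extra load while still clipping the worst of the tail.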

Summary

Speculative retries are one of the most effective tools for maintaining consistent latency in large clusters. By spending a small amount of extra capacity (roughly 5%), you can reduce your P99.9 tail latency from seconds to milliseconds.



Written by Sachin Sarawgi

Engineering Manager and backend engineer with 10+ years building distributed systems across fintech, enterprise SaaS, and startups. CodeSprintPro is where I write practical guides on system design, Java, Kafka, databases, AI infrastructure, and production reliability.
