System DesignAdvancedguide

Microservices Patterns: Circuit Breaker, Retry, Bulkhead, and Saga

A comprehensive guide to microservices resiliency and transactional design. Learn how to implement Circuit Breakers, Bulkheads, Retries, and Sagas.

Sachin SarawgiApril 20, 202612 min read12 minute lesson

Reading Mode

Reduce distractions and widen the article focus for long-form reading.

Key Takeaways

What you will learn

**Resiliency Gating:** The Circuit Breaker pattern prevents cascading microservice failures by failing fast when downstream services exhibit high error rates.

**Resource Isolation:** The Bulkhead pattern segments thread pools, ensuring that a single saturated dependency cannot exhaust all gateway worker threads.

**Saga Transaction Coordination:** Sagas coordinate distributed transactions across independent microservices databases using asynchronous event-driven compensating steps.

Mental Model

Connecting isolated components into a resilient, scalable, and observable distributed web.

In monolithic systems, code executions are reliable local function calls. In distributed microservice environments, executions are remote network calls over unreliable networks. If a single downstream microservice experiences high latency or errors, it can cause thread pools to back up, leading to a cascading failure that can crash the entire platform. Mitigating this requires Resiliency Patterns—gating connections with Circuit Breakers, Retries, and Bulkheads, and coordinating data transactions via Sagas.


System Requirements

To establish a highly resilient distributed runtime, we define the following operational and capacity requirements:

Functional Requirements

  • Failure Isolation: If a non-critical downstream microservice (such as recommendation or email notifications) fails or slows down, the core e-commerce checkout path must remain operational.
  • Auto-Recovery: Gating systems must automatically probe failing downstream services and restore traffic when the systems recover.
  • Distributed Transactions: Cross-service workflows must execute atomic-like transactions, rolling back completed steps if subsequent steps fail.
  • Fallback Executions: The proxy gateway must provide logical default responses (static data, cache lookups) when downstream dependencies are unreachable.

Non-Functional Requirements

  • Failure Detection Time: The system must detect and isolate a degraded downstream microservice within 5 seconds of error rate breaches.
  • Gateway Thread Protection: Saturated downstream endpoints must be prevented from consuming more than 10% of the gateway's thread capacity.
  • Compensating Action SLA: Saga compensating actions must execute within 2 seconds of a primary transaction failure to prevent resource locking.
  • Throughput SLA: Resiliency filter evaluations (Circuit Breaker status, Semaphore bulkhead checking) must complete in less than 0.5ms per request.

API Design and Interface Contracts

To coordinate resiliency configurations, gateways and orchestrators utilize standardized schemas. Below is a structured JSON API payload representing the runtime state of a Circuit Breaker and Bulkhead manager, followed by the YAML configuration:

1. Resiliency Config (YAML for Gateway)

gateway_resilience:
  billing_circuit_breaker:
    failure_rate_threshold: 50 # Transition to OPEN if 50% of requests fail
    slow_call_rate_threshold: 75 # Transition to OPEN if 75% of calls take > 500ms
    slow_call_duration_threshold_ms: 500
    sliding_window_size: 100 # Track last 100 calls
    wait_duration_in_open_state_ms: 10000 # Wait 10s before transitioning to HALF-OPEN
  billing_bulkhead:
    max_concurrent_calls: 20 # Max concurrent threads allowed
    max_wait_duration_ms: 50 # Block queue up to 50ms before failing fast

2. Saga State Update Payload (Orchestrator to Database)

POST /api/v1/sagas/update

{
  "saga_id": "saga-uuid-998811-order",
  "saga_type": "ORDER_CHECKOUT",
  "current_step": "INVENTORY_RESERVE",
  "status": "COMPENSATING",
  "steps": [
    {
      "step_name": "ORDER_CREATE",
      "status": "SUCCESS",
      "updated_at": "2026-06-16T18:38:35Z"
    },
    {
      "step_name": "BILLING_CHARGE",
      "status": "SUCCESS",
      "updated_at": "2026-06-16T18:38:36Z"
    },
    {
      "step_name": "INVENTORY_RESERVE",
      "status": "FAILED",
      "updated_at": "2026-06-16T18:38:37Z"
    }
  ]
}

High-Level Architecture

Distributed systems enforce safety boundaries by segmenting resource paths and state flows. A load balancer terminates TLS and routes traffic to the API Gateway. The API Gateway integrates a Circuit Breaker filter and a Bulkhead coordinator before making downstream network calls. For multi-service transactions, the Gateway delegates coordination to a Saga Orchestrator, which logs states in a durable transactional database and routes tasks via RabbitMQ or Kafka event topics.

1. Circuit Breaker State Machine

A Circuit Breaker operates in three states:

  • CLOSED: Traffic flows normally. Telemetry rates are monitored.
  • OPEN: Downstream calls fail fast immediately, returning a fallback response without making network calls.
  • HALF-OPEN: A small percentage of trial traffic is allowed through to test downstream health.
stateDiagram-v2
    [*] --> Closed
    Closed --> Open: Error Rate > Threshold (Closed to Open)
    Open --> HalfOpen: Sleep Window Expires (Open to Half-Open)
    
    HalfOpen --> Closed: Trial Success Rate > Threshold
    HalfOpen --> Open: Trial Fails
    
    classDef stateColor fill:#e1f5fe,stroke:#01579b,stroke-width:2px;
    class Closed,Open,HalfOpen stateColor;

2. Orchestrated Saga Transaction Flow

If the Order Service Orchestrator fails to reserve inventory at Step 2, it triggers Compensating Transactions in reverse order: issuing a refund to the Billing Service and canceling the order, returning the system to a clean state.

sequenceDiagram
    autonumber
    participant Orchestrator as Saga Orchestrator
    participant Billing as Billing Service
    participant Inventory as Inventory Service

    rect rgb(240, 248, 255)
        Note over Orchestrator, Inventory: Happy Path Executions
        Orchestrator->>Billing: Step 1: Charge Card
        Billing-->>Orchestrator: Charged Success
        Orchestrator->>Inventory: Step 2: Reserve Inventory
        Inventory--xOrchestrator: Out of Stock! (Failed)
    end
    
    rect rgb(255, 235, 235)
        Note over Orchestrator, Billing: Compensating Rollback Steps
        Orchestrator->>Billing: Step 3 (Compensate): Issue Refund
        Billing-->>Orchestrator: Refunded Completed
    end

Low-Level Design and Schema

Below is a production-ready, compilable Java class modeling a Resilient Gateway Interceptor. It implements a local Semaphore-based Bulkhead and a sliding-window Circuit Breaker fallback layer:

package com.codesprintpro.resilience;

import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ResilientGatewayHandler {
    private final Semaphore bulkheadSemaphore;
    private final int timeoutMs;
    
    // Simple state indicators for the Circuit Breaker
    private String circuitState = "CLOSED";
    private final AtomicInteger failureCounter = new AtomicInteger(0);
    private final int failureThreshold = 5;

    public ResilientGatewayHandler(int maxConcurrency, int timeoutMs) {
        this.bulkheadSemaphore = new Semaphore(maxConcurrency);
        this.timeoutMs = timeoutMs;
    }

    /**
     * Executes a downstream network request under strict bulkhead isolation
     * and circuit breaker gating boundaries.
     */
    public String executeWithResilience(DownstreamClient client) {
        // 1. Circuit Breaker check
        if ("OPEN".equals(this.circuitState)) {
            return executeFallback(); // Fail fast immediately
        }

        // 2. Bulkhead execution
        boolean acquired = false;
        try {
            acquired = this.bulkheadSemaphore.tryAcquire(this.timeoutMs, TimeUnit.MILLISECONDS);
            if (!acquired) {
                return executeFallback(); // Bulkhead queue saturated
            }

            // 3. Execute downstream network call
            String response = client.call();
            this.failureCounter.set(0); // Reset failures on success
            return response;

        } catch (Exception e) {
            int failures = this.failureCounter.incrementAndGet();
            if (failures >= this.failureThreshold) {
                this.circuitState = "OPEN"; // Trip circuit breaker
            }
            return executeFallback();
        } finally {
            if (acquired) {
                this.bulkheadSemaphore.release();
            }
        }
    }

    private String executeFallback() {
        return "{\"status\":\"FALLBACK_DEFAULT_VALUE\",\"message\":\"Service temporarily degraded.\"}";
    }
}

// Representational interface for compilation
interface DownstreamClient {
    String call() throws Exception;
}

Relational Schema Design (Saga Log Registry)

When implementing orchestrated Sagas, storing the active transaction steps, compensating actions, and current completion states in a durable SQL database is critical. This prevents data loss during service crashes.

CREATE TABLE saga_instances (
    saga_id VARCHAR(255) PRIMARY KEY,
    saga_type VARCHAR(100) NOT NULL, -- e.g., 'ORDER_CHECKOUT'
    global_status VARCHAR(50) NOT NULL, -- STARTED, SUCCESS, COMPENSATING, ROLLED_BACK, FAILED
    payload_json TEXT NOT NULL, -- Original transaction inputs
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE saga_steps (
    step_id VARCHAR(255) PRIMARY KEY,
    saga_id VARCHAR(255) REFERENCES saga_instances(saga_id) ON DELETE CASCADE,
    step_name VARCHAR(100) NOT NULL,
    execution_order INT NOT NULL,
    status VARCHAR(50) NOT NULL, -- PENDING, SUCCESS, FAILED, COMPENSATING, COMPENSATED
    action_class VARCHAR(255) NOT NULL, -- Java class string to execute
    compensation_class VARCHAR(255) NOT NULL, -- Java class for compensating rollback
    retry_count INT NOT NULL DEFAULT 0,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_saga_steps_saga_id ON saga_steps(saga_id);
CREATE INDEX idx_saga_instances_status ON saga_instances(global_status);

Scaling Challenges and Capacity Estimation

Deploying resilience patterns at scale exposes distinct distributed bottlenecks that impact network utilization and thread pooling capacity:

1. Retry Storms under Network Flaps

If a downstream database experiences a brief 100ms connection drop, and 1,000 gateway clients immediately retry their calls 3 times without waiting, the database will be hit with 3,000 requests as it recovers, saturating CPU pools and keeping the system down.

  • Mathematical Retry Jitter Bound: We calculate retry delays using exponential backoff with randomized jitter to spread requests uniformly. The formula for wait time $T_i$ on attempt $i$ is: $$T_i = \text{Min}(T_{\text{max}}, T_{\text{base}} \times 2^i) \pm \text{RandomJitter}$$ For $T_{\text{base}} = 100\text{ms}$, on attempt 3, the backoff is $100 \times 8 = 800\text{ms}$. By applying a random jitter between 0 and 200ms, the request is dispatched between 600ms and 1000ms. This prevents synchronization spikes.
  • Mitigation: Implement a Retry Budget. Set the gateway to only allow retries if total retry requests represent less than 10% of overall traffic. This prevents retries from overwhelming recovering downstream systems.

2. Thread-Context Overhead in Bulkhead Pools

Using thread-pool-isolated bulkheads requires allocating separate thread pools per downstream service. This creates high CPU context-switching overhead under massive concurrency.

  • Mitigation: Transition from thread-pool bulkheads to Semaphore-based Bulkheads. Semaphores enforce concurrency limits without spawning new threads, running executions directly on the active caller threads. This cuts context switching by a factor of greater than 10 under peak request rates.

Architectural Trade-offs

Selecting the optimal transaction coordination strategy requires balancing locking overheads and consistency:

Pattern Read Isolation Availability Locking Overhead Implementation Complexity
Two-Phase Commit (2PC) Absolute (Pessimistic) Low (CP) Extremely High Low (Handled by DB engine natively)
Orchestrated Saga None (Eventually isolated) High (AP) Zero High (Requires state machine database)
Choreographed Saga None High (AP) Zero Extremely High (Difficult to trace and debug)
Outbox + Reconciliation None Extremely High Zero Medium

Trade-off Evaluation

  • Orchestrated vs. Choreographed Sagas: In an orchestrated Saga, a single central service coordination component decides the flow of events and executes rollback commands. In a choreographed Saga, there is no central controller; services publish messages and react to event streams asynchronously. Choreography simplifies architecture because there is no single orchestrator bottleneck, but it is extremely difficult to trace and debug, as no single database tracks the global state of a transaction.

Failure Scenarios and Resilience

Resilience loops must survive process crashes and network anomalies during distributed transactions:

Scenario A: Saga Orchestrator Crash mid-Workflow

If the Saga Orchestrator crashes halfway through executing an order checkout transaction, the system will remain in an incomplete state with resources locked.

  • Resiliency Mitigation: Store the active Saga state changes in a strongly consistent database (such as PostgreSQL or a durable event store). When a standby orchestrator instance boots up, it reads the pending transaction logs and resumes the execution workflow.

Scenario B: Compensating Action Failures

If a compensating step (e.g. issuing a refund) fails due to downstream database downtime, the system remains in an inconsistent financial state.

  • Resiliency Mitigation: Set up a dedicated Dead Letter Queue (DLQ) for failed compensating actions, and deploy background reconciliation jobs to retry these actions or alert operators for manual intervention.

Scenario C: Circuit Breaker State Flapping

If the threshold parameters for sliding windows are too low (e.g., sliding window of 5 requests), a transient network drop of less than 1 second can cause the Circuit Breaker to trip to OPEN. As soon as it tests HALF-OPEN, another transient drop trips it again, causing state flapping that ruins user metrics.

  • Resiliency Mitigation: Ensure the sliding window is statistically significant (minimum 50 to 100 requests) and configure the HALF-OPEN test volume to allow a progressive trial pool before shifting back to CLOSED.

Staff Engineer Perspective


Verbal Script

Verbal Script: Resiliency and Transactional Patterns

Interviewer: "How would you prevent a latency spike in a downstream recommendation service from knocking out our entire e-commerce checkout API Gateway?"

Candidate: "I would implement a hybrid protection boundary combining the Circuit Breaker and Bulkhead patterns. First, I would wrap the recommendation network calls inside a Semaphore-based Bulkhead, limiting the maximum concurrent calls to the recommendation service to 10% of our gateway thread capacity. If the service experiences a latency spike and saturates those semaphore slots, subsequent requests will fail fast immediately, protecting our core checkout threads. Second, I would configure a Circuit Breaker on the recommendation path. If the failure or slow-call rate exceeds 50% over a sliding window of 100 requests, the Circuit Breaker will trip to the OPEN state, gating all calls locally without making network requests."

Interviewer: "Excellent. If the billing service successfully charges a user, but the shipping service fails during checkout, how would you roll back the transaction?"

Candidate: "Since we cannot use blocking distributed locks across microservices, I would coordinate the transaction using the Orchestrated Saga Pattern. When the checkout starts, a Saga Orchestrator commits the transaction log to a database. It charges the user via the Billing Service. If the subsequent shipping call fails, the orchestrator initiates a rollback loop by executing Compensating Transactions in reverse order. It calls the Billing Service to issue a refund and marks the order as canceled. If a compensating step fails due to network loss, the orchestrator publishes the event to a Dead Letter Queue for background retry loops, maintaining consistency."

Interviewer: "How do you design the retry mechanisms downstream of the circuit breaker to prevent worsening a downstream database failure?"

Candidate: "We must implement three mechanisms. First, we use exponential backoff to increase the time between retries, ensuring we do not spam the recovering service. Second, we add randomized jitter to the backoff duration. This breaks lockstep synchronization among retrying clients, smoothing the traffic spike. Third, we establish a retry budget at the gateway level. If the ratio of retried calls to total calls exceeds 10%, we fail subsequent retries immediately, preserving recovery capacity for the downstream service."

Practical engineering notes

Get the next backend guide in your inbox

One useful note when a new deep dive is published: system design tradeoffs, Java production lessons, Kafka debugging, database patterns, and AI infrastructure.

No spam. Just practical notes you can use at work.

Sachin Sarawgi

Written by

Sachin Sarawgi

Engineering Manager and backend engineer with 10+ years building distributed systems across fintech, enterprise SaaS, and startups. CodeSprintPro is where I write practical guides on system design, Java, Kafka, databases, AI infrastructure, and production reliability.

Keep Learning

Move through the archive without losing the thread.

Related Articles

More deep dives chosen from shared tags, category overlap, and reading difficulty.

System DesignAdvanced

API Pagination at Scale: Why OFFSET 100,000 is a Database Killer

Designing a paginated API seems simple. Standard frameworks make it trivial: just use LIMIT 20 OFFSET 100. This works perfectly during development and for the first few pages of small tables. However, once your data scal…

Apr 20, 202611 min read
Deep DiveBackend Systems Mastery
#databases#java#performance
System DesignAdvanced

Event-Driven Architecture: CQRS and Event Sourcing in Practice

Mental Model In traditional CRUD (Create, Read, Update, Delete) architectures, the same database model is used for both writing and reading data. Under high traffic, this creates locking contention and complex SQL joins…

Apr 20, 202610 min read
Deep Dive
#performance#system-design
System DesignAdvanced

Bypassing the Kernel: User-Space Networking for Sub-Microsecond Performance

Mental Model For ultra-low-latency distributed systems—such as high-frequency trading (HFT) matching engines, real-time telemetry filters, and high-performance packet routers—even the optimized Linux kernel is too slow.…

Apr 20, 202611 min read
Deep DivePerformance & Optimization Mastery
#performance#system-design
System DesignAdvanced

HyperLogLog at Scale: Billion-Cardinality Estimation

Mental Model > Connecting isolated components into a resilient, scalable, and observable distributed web. Counting unique items (such as Daily Active Users - DAUs, unique page views, or IP addresses) is a classic problem…

Apr 20, 202614 min read
Deep Dive
#performance#system-design

More in System Design

Category-based suggestions if you want to stay in the same domain.