Microservices Patterns: Circuit Breaker, Retry, Bulkhead, and Saga

Mental Model

Connecting isolated components into a resilient, scalable, and observable distributed web.

In a monolith, calls between components are reliable, in-process function calls. In a distributed microservice environment, they are remote calls over an unreliable network. If a single downstream microservice suffers high latency or errors, callers' thread pools can back up, leading to a cascading failure that can take down the entire platform. Mitigating this requires resiliency patterns: gating connections with Circuit Breakers, Retries, and Bulkheads, and coordinating multi-service data changes with Sagas.

System Requirements

To establish a highly resilient distributed runtime, we define the following operational and capacity requirements:

Functional Requirements

Failure Isolation: If a non-critical downstream microservice (such as recommendation or email notifications) fails or slows down, the core e-commerce checkout path must remain operational.
Auto-Recovery: Gating systems must automatically probe failing downstream services and restore traffic when the systems recover.
Distributed Transactions: Cross-service workflows must execute atomic-like transactions, rolling back completed steps if subsequent steps fail.
Fallback Executions: The gateway must return sensible default responses (static data, cache lookups) when downstream dependencies are unreachable.

Non-Functional Requirements

Failure Detection Time: The system must detect and isolate a degraded downstream microservice within 5 seconds of its error rate breaching the configured threshold.
Gateway Thread Protection: Saturated downstream endpoints must be prevented from consuming more than 10% of the gateway's thread capacity.
Compensating Action SLA: Saga compensating actions must execute within 2 seconds of a primary transaction failure to prevent resource locking.
Latency Overhead SLA: Resiliency filter evaluations (Circuit Breaker status checks, Semaphore bulkhead checks) must add less than 0.5ms per request.

API Design and Interface Contracts

To coordinate resiliency configurations, gateways and orchestrators use standardized schemas. Below is the gateway's YAML configuration for a Circuit Breaker and Bulkhead, followed by the JSON payload a Saga Orchestrator sends to record a state update:

1. Resiliency Config (YAML for Gateway)

gateway_resilience:
  billing_circuit_breaker:
    failure_rate_threshold: 50 # Transition to OPEN if 50% of requests fail
    slow_call_rate_threshold: 75 # Transition to OPEN if 75% of calls take > 500ms
    slow_call_duration_threshold_ms: 500
    sliding_window_size: 100 # Track last 100 calls
    wait_duration_in_open_state_ms: 10000 # Wait 10s before transitioning to HALF-OPEN
  billing_bulkhead:
    max_concurrent_calls: 20 # Max concurrent calls allowed
    max_wait_duration_ms: 50 # Wait up to 50ms for a free slot before failing fast
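
To make the `failure_rate_threshold` and `sliding_window_size` settings concrete, here is a minimal sketch of a count-based sliding window. The class and method names are hypothetical (they are not taken from any library), and the rule that the window must be full before the breaker may trip is an assumption of this sketch.

```java
// Hypothetical helper illustrating a count-based sliding window.
// Names are illustrative and not part of any real library.
final class SlidingWindowFailureRate {
    private final boolean[] outcomes; // true = failed call
    private int next = 0;             // ring-buffer write position
    private int recorded = 0;         // calls recorded so far, capped at the window size
    private int failures = 0;         // failures currently inside the window

    SlidingWindowFailureRate(int windowSize) {
        this.outcomes = new boolean[windowSize];
    }

    /** Records one call outcome, evicting the oldest outcome once the window is full. */
    synchronized void record(boolean failed) {
        if (recorded == outcomes.length) {
            if (outcomes[next]) {
                failures--; // the slot being overwritten held a failure
            }
        } else {
            recorded++;
        }
        outcomes[next] = failed;
        if (failed) {
            failures++;
        }
        next = (next + 1) % outcomes.length;
    }

    /** Failure rate in percent over the recorded calls (0 when nothing is recorded). */
    synchronized double failureRatePercent() {
        return recorded == 0 ? 0.0 : 100.0 * failures / recorded;
    }

    /** Assumption: only trip once the window is full and the rate reaches the threshold. */
    synchronized boolean shouldOpen(double thresholdPercent) {
        return recorded == outcomes.length && failureRatePercent() >= thresholdPercent;
    }
}
```

With the configuration above, a gateway would create `new SlidingWindowFailureRate(100)`, call `record(...)` after every billing call, and trip the breaker when `shouldOpen(50)` returns true.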

2. Saga State Update Payload (Orchestrator to Database)

POST /api/v1/sagas/update

{
  "saga_id": "saga-uuid-998811-order",
  "saga_type": "ORDER_CHECKOUT",
  "current_step": "INVENTORY_RESERVE",
  "status": "COMPENSATING",
  "steps": [
    {
      "step_name": "ORDER_CREATE",
      "status": "SUCCESS",
      "updated_at": "2026-06-16T18:38:35Z"
    },
    {
      "step_name": "BILLING_CHARGE",
      "status": "SUCCESS",
      "updated_at": "2026-06-16T18:38:36Z"
    },
    {
      "step_name": "INVENTORY_RESERVE",
      "status": "FAILED",
      "updated_at": "2026-06-16T18:38:37Z"
    }
  ]
}

High-Level Architecture

Distributed systems enforce safety boundaries by segmenting resource paths and state flows. A load balancer terminates TLS and routes traffic to the API Gateway. The API Gateway integrates a Circuit Breaker filter and a Bulkhead coordinator before making downstream network calls. For multi-service transactions, the Gateway delegates coordination to a Saga Orchestrator, which logs states in a durable transactional database and routes tasks via RabbitMQ or Kafka event topics.

1. Circuit Breaker State Machine

A Circuit Breaker operates in three states:

CLOSED: Traffic flows normally. Telemetry rates are monitored.
OPEN: Downstream calls fail fast, returning a fallback response without making a network call.
HALF-OPEN: A small percentage of trial traffic is allowed through to test downstream health.

stateDiagram-v2
    [*] --> Closed
    Closed --> Open: Error Rate > Threshold (Closed to Open)
    Open --> HalfOpen: Sleep Window Expires (Open to Half-Open)
    
    HalfOpen --> Closed: Trial Success Rate > Threshold
    HalfOpen --> Open: Trial Fails
    
    classDef stateColor fill:#e1f5fe,stroke:#01579b,stroke-width:2px;
    class Closed,Open,HalfOpen stateColor;
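
The transitions in the diagram can be sketched as a pure function over states and events. The enum and event names below are illustrative assumptions chosen to mirror the diagram labels; events that do not apply to a state leave it unchanged.

```java
// Illustrative state machine mirroring the diagram above; names are assumptions.
final class CircuitBreakerTransitions {
    enum State { CLOSED, OPEN, HALF_OPEN }

    enum Event { FAILURE_THRESHOLD_EXCEEDED, WAIT_ELAPSED, TRIAL_SUCCEEDED, TRIAL_FAILED }

    /** Returns the state that follows {@code current} when {@code event} occurs. */
    static State next(State current, Event event) {
        switch (current) {
            case CLOSED:
                return event == Event.FAILURE_THRESHOLD_EXCEEDED ? State.OPEN : State.CLOSED;
            case OPEN:
                return event == Event.WAIT_ELAPSED ? State.HALF_OPEN : State.OPEN;
            case HALF_OPEN:
                if (event == Event.TRIAL_SUCCEEDED) {
                    return State.CLOSED;
                }
                if (event == Event.TRIAL_FAILED) {
                    return State.OPEN;
                }
                return State.HALF_OPEN;
            default:
                throw new IllegalStateException("Unknown state: " + current);
        }
    }
}
```

Keeping the transition logic free of timers and counters makes it easy to unit test; the surrounding breaker decides when each event fires.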

2. Orchestrated Saga Transaction Flow

If the Saga Orchestrator's inventory reservation fails at Step 2, it runs Compensating Transactions in reverse order: it refunds the charge through the Billing Service and cancels the order, returning the system to a consistent state.

sequenceDiagram
    autonumber
    participant Orchestrator as Saga Orchestrator
    participant Billing as Billing Service
    participant Inventory as Inventory Service

    rect rgb(240, 248, 255)
        Note over Orchestrator, Inventory: Forward Execution
        Orchestrator->>Billing: Step 1: Charge Card
        Billing-->>Orchestrator: Charge Succeeded
        Orchestrator->>Inventory: Step 2: Reserve Inventory
        Inventory--xOrchestrator: Out of Stock! (Failed)
    end
    
    rect rgb(255, 235, 235)
        Note over Orchestrator, Billing: Compensating Rollback Steps
        Orchestrator->>Billing: Step 3 (Compensate): Issue Refund
        Billing-->>Orchestrator: Refund Completed
    end
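
The reverse-order rollback shown in the sequence diagram can be sketched in a few lines. This is an in-memory illustration only: the `Step` interface, the `step` demo helper, and the log format are hypothetical, and a real orchestrator would persist each status change before moving on.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// In-memory sketch of orchestrated saga execution; all names are illustrative.
final class SagaSketch {
    /** One saga step: a forward action plus the compensation that undoes it. */
    interface Step {
        String name();
        void execute() throws Exception;
        void compensate();
    }

    /** Runs steps in order; on failure, compensates completed steps in reverse order. */
    static List<String> run(List<Step> steps) {
        List<String> log = new ArrayList<>();
        Deque<Step> completed = new ArrayDeque<>(); // used as a stack (LIFO)
        for (Step step : steps) {
            try {
                step.execute();
                completed.push(step);
                log.add(step.name() + ":SUCCESS");
            } catch (Exception e) {
                log.add(step.name() + ":FAILED");
                while (!completed.isEmpty()) {
                    Step done = completed.pop(); // most recently completed step first
                    done.compensate();
                    log.add(done.name() + ":COMPENSATED");
                }
                return log;
            }
        }
        return log;
    }

    /** Demo helper: builds a step that optionally fails when executed. */
    static Step step(String name, boolean fails) {
        return new Step() {
            public String name() { return name; }
            public void execute() throws Exception {
                if (fails) {
                    throw new Exception(name + " failed");
                }
            }
            public void compensate() { /* e.g. issue a refund or cancel the order */ }
        };
    }
}
```

Running it with `ORDER_CREATE`, `BILLING_CHARGE`, and a failing `INVENTORY_RESERVE` compensates `BILLING_CHARGE` first and `ORDER_CREATE` second, matching the step names in the Saga state payload.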

Low-Level Design and Schema

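Below is a compilable, deliberately simplified Java class modeling a Resilient Gateway Interceptor. It implements a local Semaphore-based Bulkhead and a Circuit Breaker that trips after a run of consecutive failures (not a full sliding window), then probes the downstream service again after a fixed wait:

package com.codesprintpro.resilience;

import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ResilientGatewayHandler {
    private enum CircuitState { CLOSED, OPEN, HALF_OPEN }

    // Mirrors wait_duration_in_open_state_ms from the gateway config
    private static final long OPEN_WAIT_NANOS = TimeUnit.MILLISECONDS.toNanos(10_000);

    private final Semaphore bulkheadSemaphore;
    private final int timeoutMs;

    // Circuit Breaker state; volatile/atomic so changes are visible across request threads
    private volatile CircuitState circuitState = CircuitState.CLOSED;
    private volatile long openedAtNanos = 0L;
    private final AtomicInteger failureCounter = new AtomicInteger(0);
    private final int failureThreshold = 5; // Consecutive failures before tripping

    public ResilientGatewayHandler(int maxConcurrency, int timeoutMs) {
        this.bulkheadSemaphore = new Semaphore(maxConcurrency);
        this.timeoutMs = timeoutMs;
    }

    /**
     * Executes a downstream network request under bulkhead isolation
     * and circuit breaker gating.
     */
    public String executeWithResilience(DownstreamClient client) {
        // 1. Circuit Breaker check
        if (this.circuitState == CircuitState.OPEN) {
            if (System.nanoTime() - this.openedAtNanos < OPEN_WAIT_NANOS) {
                return executeFallback(); // Fail fast while the breaker is OPEN
            }
            // Wait elapsed: allow trial calls. Simplification: any caller may
            // probe, limited only by the bulkhead below.
            this.circuitState = CircuitState.HALF_OPEN;
        }

        // 2. Bulkhead execution
        boolean acquired = false;
        try {
            acquired = this.bulkheadSemaphore.tryAcquire(this.timeoutMs, TimeUnit.MILLISECONDS);
            if (!acquired) {
                return executeFallback(); // Bulkhead saturated
            }

            // 3. Execute downstream network call
            String response = client.call();
            this.failureCounter.set(0); // Reset failures on success
            this.circuitState = CircuitState.CLOSED; // A successful call or trial closes the breaker
            return response;

        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // Preserve interrupt status; not a downstream failure
            return executeFallback();
        } catch (Exception e) {
            int failures = this.failureCounter.incrementAndGet();
            if (this.circuitState == CircuitState.HALF_OPEN || failures >= this.failureThreshold) {
                this.openedAtNanos = System.nanoTime();
                this.circuitState = CircuitState.OPEN; // Trip (or re-trip) the breaker
            }
            return executeFallback();
        } finally {
            if (acquired) {
                this.bulkheadSemaphore.release();
            }
        }
    }

    private String executeFallback() {
        return "{\"status\":\"FALLBACK_DEFAULT_VALUE\",\"message\":\"Service temporarily degraded.\"}";
    }
}

// Representational interface for compilation
interface DownstreamClient {
    String call() throws Exception;
}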

Relational Schema Design (Saga Log Registry)

When implementing orchestrated Sagas, storing the active transaction steps, compensating actions, and current completion states in a durable SQL database is critical. This prevents data loss during service crashes.

CREATE TABLE saga_instances (
    saga_id VARCHAR(255) PRIMARY KEY,
    saga_type VARCHAR(100) NOT NULL, -- e.g., 'ORDER_CHECKOUT'
    global_status VARCHAR(50) NOT NULL, -- STARTED, SUCCESS, COMPENSATING, ROLLED_BACK, FAILED
    payload_json TEXT NOT NULL, -- Original transaction inputs
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE saga_steps (
    step_id VARCHAR(255) PRIMARY KEY,
    saga_id VARCHAR(255) NOT NULL REFERENCES saga_instances(saga_id) ON DELETE CASCADE,
    step_name VARCHAR(100) NOT NULL,
    execution_order INT NOT NULL,
    status VARCHAR(50) NOT NULL, -- PENDING, SUCCESS, FAILED, COMPENSATING, COMPENSATED
    action_class VARCHAR(255) NOT NULL, -- Java class string to execute
    compensation_class VARCHAR(255) NOT NULL, -- Java class for compensating rollback
    retry_count INT NOT NULL DEFAULT 0,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_saga_steps_saga_id ON saga_steps(saga_id);
CREATE INDEX idx_saga_instances_status ON saga_instances(global_status);

Scaling Challenges and Capacity Estimation

Deploying resilience patterns at scale exposes distinct distributed bottlenecks that impact network utilization and thread pooling capacity:

1. Retry Storms under Network Flaps

If a downstream database experiences a brief 100ms connection drop, and 1,000 gateway clients each immediately retry 3 times without waiting, the database is hit with 3,000 additional requests just as it recovers, saturating its CPU and keeping the system down.

Mathematical Retry Jitter Bound: We calculate retry delays using exponential backoff with randomized jitter, so that retries spread out over time instead of arriving together. The wait time $T_i$ on attempt $i$ is:
$$T_i = \min(T_{\text{max}},\ T_{\text{base}} \times 2^i) \pm J, \qquad J \in [0, 200\text{ms}]$$
For $T_{\text{base}} = 100\text{ms}$ and attempt $i = 3$, the backoff is $100 \times 2^3 = 800\text{ms}$ (assuming $T_{\text{max}} \geq 800\text{ms}$). With a random jitter $J$ between 0 and 200ms added or subtracted, the request is dispatched between 600ms and 1000ms. This prevents synchronization spikes.
Mitigation: Implement a Retry Budget. Set the gateway to only allow retries if total retry requests represent less than 10% of overall traffic. This prevents retries from overwhelming recovering downstream systems.
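
As a rough sketch of both ideas, the code below computes the capped exponential backoff with a symmetric jitter, and tracks a simple retry budget. All names are hypothetical. The budget here uses cumulative counters and compares retries against first-attempt requests, which is an assumption of this sketch; a real gateway would more likely measure the ratio over a rolling time window.

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative helpers; names and the budget accounting are assumptions.
final class RetryTiming {
    /** min(maxMs, baseMs * 2^attempt), before jitter is applied. */
    static long backoffMs(long baseMs, long maxMs, int attempt) {
        long delay = baseMs;
        for (int i = 0; i < attempt && delay < maxMs; i++) {
            delay *= 2; // stop doubling once the cap is reached
        }
        return Math.min(maxMs, delay);
    }

    /** Backoff plus a uniform random offset in [-jitterMs, +jitterMs], never negative. */
    static long delayWithJitterMs(long baseMs, long maxMs, int attempt, long jitterMs) {
        long offset = ThreadLocalRandom.current().nextLong(-jitterMs, jitterMs + 1);
        return Math.max(0, backoffMs(baseMs, maxMs, attempt) + offset);
    }
}

final class RetryBudget {
    private final int maxRetryPercent;
    private long requests = 0; // first-attempt requests seen
    private long retries = 0;  // retries granted so far

    RetryBudget(int maxRetryPercent) {
        this.maxRetryPercent = maxRetryPercent;
    }

    synchronized void recordRequest() {
        requests++;
    }

    /** Grants a retry only if retries stay within maxRetryPercent of first-attempt requests. */
    synchronized boolean tryAcquireRetry() {
        if ((retries + 1) * 100 <= (long) maxRetryPercent * requests) {
            retries++;
            return true;
        }
        return false;
    }
}
```

For the worked example above, `RetryTiming.backoffMs(100, 10_000, 3)` yields 800, and `delayWithJitterMs(100, 10_000, 3, 200)` lands somewhere between 600 and 1000. A caller would check `tryAcquireRetry()` before sleeping for the computed delay and fall back immediately when the budget is exhausted.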

2. Thread-Context Overhead in Bulkhead Pools

Using thread-pool-isolated bulkheads requires allocating separate thread pools per downstream service. This creates high CPU context-switching overhead under massive concurrency.

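Mitigation: Transition from thread-pool bulkheads to Semaphore-based Bulkheads. Semaphores enforce concurrency limits without spawning new threads, running each call directly on the caller's thread. This avoids the hand-off between pools and can cut context switching substantially under peak request rates; the exact gain depends on the workload and should be measured. The trade-off is that a semaphore cannot time out a call that is already running on the caller's thread, so the downstream client needs its own timeouts.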
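
For contrast with the semaphore approach, here is a minimal sketch of a thread-pool bulkhead built on `java.util.concurrent.ThreadPoolExecutor`. The class name, sizing parameters, and fallback handling are illustrative assumptions. It shows where the extra threads and hand-offs come from, and also what the dedicated pool buys: the caller can give up on a slow call after a timeout.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative thread-pool bulkhead: one dedicated, bounded pool per downstream service.
final class ThreadPoolBulkhead {
    private final ThreadPoolExecutor pool;

    ThreadPoolBulkhead(int maxThreads, int queueCapacity) {
        this.pool = new ThreadPoolExecutor(
                maxThreads, maxThreads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.AbortPolicy()); // reject when pool and queue are full
    }

    /** Runs the task on the dedicated pool, returning the fallback on rejection, timeout, or error. */
    String call(Callable<String> task, long timeoutMs, String fallback) {
        Future<String> future;
        try {
            future = pool.submit(task); // hand-off to another thread
        } catch (RejectedExecutionException e) {
            return fallback; // bulkhead saturated
        }
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker so the slot is freed
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the downstream call itself failed
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

Every call crosses from the caller thread to a pool thread and back, which is the overhead the semaphore variant avoids by running on the caller's thread.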

Architectural Trade-offs

Selecting the optimal transaction coordination strategy requires balancing locking overheads and consistency:

Pattern	Read Isolation	Availability	Locking Overhead	Implementation Complexity
Two-Phase Commit (2PC)	Absolute (Pessimistic)	Low (CP)	Extremely High	Low (Handled by DB engine natively)
Orchestrated Saga	None (eventually consistent)	High (AP)	Zero	High (Requires state machine database)
Choreographed Saga	None	High (AP)	Zero	Extremely High (Difficult to trace and debug)
Outbox + Reconciliation	None	Extremely High	Zero	Medium

Trade-off Evaluation

Orchestrated vs. Choreographed Sagas: In an orchestrated Saga, a single central coordinator decides the flow of events and issues the rollback commands. In a choreographed Saga, there is no central controller; services publish messages and react to event streams asynchronously. Choreography simplifies the architecture because there is no central orchestrator to become a bottleneck, but it is much harder to trace and debug, because no single database tracks the global state of a transaction.

Failure Scenarios and Resilience

Resilience loops must survive process crashes and network anomalies during distributed transactions:

Scenario A: Saga Orchestrator Crash mid-Workflow

If the Saga Orchestrator crashes halfway through executing an order checkout transaction, the system will remain in an incomplete state with resources locked.

Resiliency Mitigation: Store the active Saga state changes in a strongly consistent database (such as PostgreSQL or a durable event store). When a standby orchestrator instance boots up, it reads the pending transaction logs and resumes the execution workflow.
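
One way to picture the resume logic is as a pure decision over the persisted step statuses. The sketch below is an assumption-laden illustration: the status values follow the `saga_steps.status` column, the class and method names are hypothetical, and loading the rows from the database is left out. Because a step may have been in flight when the orchestrator crashed, this only works safely if step actions and compensations are idempotent.

```java
import java.util.List;

// Illustrative recovery decisions over persisted step statuses (ordered by execution_order).
final class SagaRecovery {
    enum StepStatus { PENDING, SUCCESS, FAILED, COMPENSATING, COMPENSATED }

    /** Index of the next step to execute on the forward path, or -1 if none remain. */
    static int nextForwardStep(List<StepStatus> steps) {
        for (int i = 0; i < steps.size(); i++) {
            if (steps.get(i) == StepStatus.PENDING) {
                return i;
            }
        }
        return -1;
    }

    /** Index of the next step to compensate (latest first), or -1 if the rollback is complete. */
    static int nextCompensationStep(List<StepStatus> steps) {
        for (int i = steps.size() - 1; i >= 0; i--) {
            StepStatus status = steps.get(i);
            // COMPENSATING means a compensation was started but never confirmed: redo it.
            if (status == StepStatus.SUCCESS || status == StepStatus.COMPENSATING) {
                return i;
            }
        }
        return -1;
    }
}
```

A standby orchestrator would call `nextForwardStep` for sagas whose global status is `STARTED` and `nextCompensationStep` for those in `COMPENSATING`, then continue from the returned index.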

Scenario B: Compensating Action Failures

If a compensating step (e.g. issuing a refund) fails due to downstream database downtime, the system remains in an inconsistent financial state.

Resiliency Mitigation: Set up a dedicated Dead Letter Queue (DLQ) for failed compensating actions, and deploy background reconciliation jobs to retry these actions or alert operators for manual intervention.
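
A minimal sketch of that flow, under stated assumptions: the compensation is retried a fixed number of times, and if it still fails, an identifier is parked on a dead-letter queue for a reconciliation job. The in-memory queue stands in for a real broker-backed DLQ, and every name here is hypothetical.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative only: an in-memory queue stands in for a real Dead Letter Queue.
final class CompensationRunner {
    /** A compensating action, such as issuing a refund. */
    interface Compensation {
        void run() throws Exception;
    }

    private final Queue<String> deadLetters = new ConcurrentLinkedQueue<>();
    private final int maxAttempts;

    CompensationRunner(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    /** Returns true if the compensation succeeded; otherwise dead-letters it and returns false. */
    boolean compensate(String sagaId, String stepName, Compensation action) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                action.run();
                return true;
            } catch (Exception e) {
                // Swallow and retry; production code would back off between attempts.
            }
        }
        deadLetters.add(sagaId + ":" + stepName); // left for reconciliation or an operator
        return false;
    }

    Queue<String> deadLetters() {
        return deadLetters;
    }
}
```

The key property is that a failed compensation is never silently dropped: it either succeeds or leaves a durable record that something still needs to be undone.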

Scenario C: Circuit Breaker State Flapping

If the sliding window is too small (e.g., only 5 requests), a transient network drop of less than 1 second can push the failure rate over the threshold and trip the Circuit Breaker to OPEN. When it then moves to HALF-OPEN, another transient drop trips it again. The resulting state flapping serves fallbacks to users even though the downstream service is mostly healthy.

Resiliency Mitigation: Ensure the sliding window is statistically significant (minimum 50 to 100 requests) and configure the HALF-OPEN test volume to allow a progressive trial pool before shifting back to CLOSED.
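
One possible shape for the HALF-OPEN trial pool is sketched below: a fixed number of trial calls is admitted, and the breaker decides only after all of them have completed, so a single transient failure cannot reopen it on its own. The class name, the `Decision` enum, and the all-trials-must-complete rule are assumptions of this sketch.

```java
// Illustrative HALF-OPEN evaluator; names and the decision rule are assumptions.
final class HalfOpenTrial {
    enum Decision { UNDECIDED, CLOSE, REOPEN }

    private final int permittedCalls;
    private final double failureThresholdPercent;
    private int started = 0;
    private int completed = 0;
    private int failed = 0;

    HalfOpenTrial(int permittedCalls, double failureThresholdPercent) {
        this.permittedCalls = permittedCalls;
        this.failureThresholdPercent = failureThresholdPercent;
    }

    /** Admits a trial call while fewer than permittedCalls have been started. */
    synchronized boolean tryStartTrial() {
        if (started >= permittedCalls) {
            return false; // other callers keep receiving the fallback
        }
        started++;
        return true;
    }

    /** Records a trial outcome and decides once all permitted trials have completed. */
    synchronized Decision record(boolean callFailed) {
        completed++;
        if (callFailed) {
            failed++;
        }
        if (completed < permittedCalls) {
            return Decision.UNDECIDED;
        }
        double failureRate = 100.0 * failed / completed;
        return failureRate >= failureThresholdPercent ? Decision.REOPEN : Decision.CLOSE;
    }
}
```

With 4 permitted trials and a 50% threshold, one failed trial followed by three successes closes the breaker, whereas a breaker that reopens on the first failed trial would keep flapping.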

Staff Engineer Perspective

Verbal Script

Verbal Script: Resiliency and Transactional Patterns

Interviewer: "How would you prevent a latency spike in a downstream recommendation service from knocking out our entire e-commerce checkout API Gateway?"

Candidate: "I would implement a hybrid protection boundary combining the Circuit Breaker and Bulkhead patterns. First, I would wrap the recommendation network calls inside a Semaphore-based Bulkhead, limiting the maximum concurrent calls to the recommendation service to 10% of our gateway thread capacity. If the service experiences a latency spike and saturates those semaphore slots, subsequent requests will fail fast, protecting our core checkout threads. Second, I would configure a Circuit Breaker on the recommendation path. If the failure or slow-call rate exceeds 50% over a sliding window of 100 requests, the Circuit Breaker will trip to the OPEN state, rejecting all calls locally without making network requests."

Interviewer: "Excellent. If the billing service successfully charges a user, but the shipping service fails during checkout, how would you roll back the transaction?"

Candidate: "Since we cannot use blocking distributed locks across microservices, I would coordinate the transaction using the Orchestrated Saga Pattern. When the checkout starts, a Saga Orchestrator commits the transaction log to a database. It charges the user via the Billing Service. If the subsequent shipping call fails, the orchestrator initiates a rollback loop by executing Compensating Transactions in reverse order. It calls the Billing Service to issue a refund and marks the order as canceled. If a compensating step fails due to network loss, the orchestrator publishes the event to a Dead Letter Queue for background retry loops, maintaining consistency."

Interviewer: "How do you design the retry mechanisms downstream of the circuit breaker to prevent worsening a downstream database failure?"

Candidate: "We must implement three mechanisms. First, we use exponential backoff to increase the time between retries, ensuring we do not spam the recovering service. Second, we add randomized jitter to the backoff duration. This breaks lockstep synchronization among retrying clients, smoothing the traffic spike. Third, we establish a retry budget at the gateway level. If the ratio of retried calls to total calls exceeds 10%, we fail subsequent retries immediately, preserving recovery capacity for the downstream service."

Sachin Sarawgi
