System Design: Managing Distributed Transactions with the Saga Pattern

How do you maintain consistency across microservices? A deep dive into the Saga pattern (Choreography vs. Orchestration) and handling failures with compensating transactions.

Key Takeaways

  • If a step in the saga fails, the system executes a series of undo operations to revert the changes
  • Example: If the Payment Service fails after the Inventory Service has reserved stock, the system reverts the reservation by releasing the stock
  • Crucial Rule: Compensating transactions must be idempotent, as they might be retried due to failures

In a traditional monolithic architecture, maintaining data consistency is a solved problem. You wrap your database operations in a single local database transaction, and the database engine guarantees ACID (Atomicity, Consistency, Isolation, Durability) compliance. If any operation fails, the database rollback mechanism instantly reverts all changes, leaving the system in a clean, consistent state.

In a microservices architecture, this single transaction boundary disappears. A single business process, such as placing an order, often spans multiple services: an Order Service to write order records, an Inventory Service to reserve items, a Payment Service to charge credit cards, and a Delivery Service to dispatch courier drivers. Each service has its own private, isolated database.

Attempting to enforce consistency using distributed database transactions (such as Two-Phase Commit, or 2PC) at scale introduces major performance bottlenecks. 2PC holds row locks in every participating database until all participants have voted and the coordinator has announced the outcome, which reduces write throughput and increases latency, and the coordinator itself becomes a single point of failure. The Saga pattern is a widely used alternative that maintains eventual consistency across microservices without distributed locks.

This system design guide details the architectural blueprint for designing a resilient, high-throughput Saga coordination platform capable of managing 5,000 transactions per second.


System Requirements

To design an enterprise-grade Saga coordination system, we divide our requirements into functional capabilities, non-functional operational limits, and scale assumptions.

Functional Requirements

  • Sequence Coordination: Execute multi-step business transactions across isolated databases in a defined order.
  • Compensating Transactions: Automatically trigger compensating (rollback) tasks in reverse chronological order when any step in the Saga encounters a terminal failure.
  • Durable State Storage: Maintain the lifecycle state and execution history of every active Saga instance.
  • Asynchronous Execution: Decouple participant invocations using messaging queues to prevent blocking connection pools.
  • Idempotent Retries: Provide execution frameworks to retry failed participants and compensations safely without side effects.
  • Queryable Audits: Expose the execution timeline and state of active and completed Sagas for operational troubleshooting.

Non-Functional Requirements

  • Eventual Consistency: Accept that intermediate states will be temporarily inconsistent across databases, but guarantee that the system resolves to a final consistent state.
  • High Write Throughput: The coordination store must handle thousands of state transitions per second without locking tables.
  • Fail-Safe Recovery: If the coordinator process crashes, it must recover the state of in-progress Sagas from the database and resume executions without missing steps.
  • No Single Point of Failure (SPOF): Participant microservice failures must not stall unrelated Sagas or degrade overall system availability.

Scale Assumptions

  • Throughput: 5,000 new Saga executions per second at peak.
  • Average Steps: Each Saga involves 4 distinct microservice participants (e.g., Order, Inventory, Payment, Delivery).

API Design and Service Contracts

A Saga orchestrator coordinates execution using command/reply queues, while choreography systems utilize shared event channels.

1. Inbound Saga Trigger (POST /v1/sagas/orders)

Invoked by client systems to start an order placement Saga.

Request Payload:

{
  "sagaType": "order_placement",
  "customerId": "cust_uuid_99812",
  "items": [
    { "itemId": "item_8819", "quantity": 2 }
  ],
  "paymentMethodId": "pm_5521",
  "amountCents": 9900
}

Response Payload (202 Accepted):

{
  "sagaId": "saga_uuid_77182ab",
  "status": "PROCESSING",
  "startedAt": "2026-06-07T11:48:00Z"
}

2. Participant Command Contract (gRPC Protocol)

The orchestrator dispatches execution commands to participant microservices.

syntax = "proto3";

package codesprintpro.saga.v1;

service SagaParticipantService {
  rpc ExecuteStep (StepCommandRequest) returns (StepReplyResponse);
  rpc CompensateStep (CompensateCommandRequest) returns (CompensateReplyResponse);
}

message StepCommandRequest {
  string saga_id = 1;
  string step_name = 2;
  string payload_json = 3;
}

message StepReplyResponse {
  enum StepStatus {
    SUCCESS = 0;
    FAILURE = 1;
  }
  StepStatus status = 1;
  string output_json = 2;
  string error_message = 3;
}

message CompensateCommandRequest {
  string saga_id = 1;
  string step_name = 2;
  string payload_json = 3;
}

message CompensateReplyResponse {
  bool success = 1;
}

High-Level Architecture

There are two primary architectures for coordinating a Saga: Choreography and Orchestration.

In Choreography, services listen to Kafka event topics and trigger their local transactions independently. For example, when the Order Service publishes an ORDER_CREATED event, the Inventory Service consumes it, reserves stock, and publishes an INVENTORY_RESERVED event, which the Payment Service consumes.

In Orchestration, a centralized Saga Orchestrator manages the entire lifecycle. It reads the Saga definition, writes the state transitions to the Saga State DB, and pushes step commands into Kafka. Participant services process these commands and return results. If any step fails, the Orchestrator initiates compensating commands in reverse order.

Choreography (Event-Driven Decentralized Flow)

This sequence diagram shows decentralized participant cooperation via an event broker.

sequenceDiagram
    autonumber
    participant Order as Order Service
    participant Inventory as Inventory Service
    participant Payment as Payment Service
    participant Kafka as Kafka Event Broker
    
    Order->>Order: Write Pending Order
    Order->>Kafka: Publish ORDER_CREATED Event
    
    Kafka->>Inventory: Consume ORDER_CREATED Event
    Inventory->>Inventory: Reserve Items in DB
    Inventory->>Kafka: Publish INVENTORY_RESERVED Event
    
    Kafka->>Payment: Consume INVENTORY_RESERVED Event
    Payment->>Payment: Charge Card
    Payment->>Kafka: Publish PAYMENT_COMPLETED Event
    
    Kafka->>Order: Consume PAYMENT_COMPLETED Event
    Order->>Order: Mark Order as COMPLETED
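
The choreography flow above can be sketched with an in-memory event bus. This is a minimal illustration, not a production design: `EventBus` is a hypothetical stand-in for Kafka that delivers events synchronously, and the handler names and the `orderId` field are assumptions made for the example.

```python
from collections import defaultdict
from typing import Callable


class EventBus:
    """Hypothetical in-memory stand-in for the Kafka event broker."""

    def __init__(self) -> None:
        self.handlers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)
        self.log: list[str] = []

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self.handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        self.log.append(event_type)
        for handler in self.handlers[event_type]:
            handler(payload)


bus = EventBus()
orders: dict[str, str] = {}


def inventory_on_order_created(event: dict) -> None:
    # Local transaction: reserve items, then announce the result.
    bus.publish("INVENTORY_RESERVED", event)


def payment_on_inventory_reserved(event: dict) -> None:
    # Local transaction: charge the card, then announce the result.
    bus.publish("PAYMENT_COMPLETED", event)


def order_on_payment_completed(event: dict) -> None:
    orders[event["orderId"]] = "COMPLETED"


bus.subscribe("ORDER_CREATED", inventory_on_order_created)
bus.subscribe("INVENTORY_RESERVED", payment_on_inventory_reserved)
bus.subscribe("PAYMENT_COMPLETED", order_on_payment_completed)


def place_order(order_id: str) -> None:
    orders[order_id] = "PENDING"
    bus.publish("ORDER_CREATED", {"orderId": order_id})


place_order("order_1")
print(orders["order_1"], bus.log)
```

No service calls another directly; each one only knows which event it reacts to and which event it emits.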

Orchestration (Centralized Coordinator Flow)

This diagram illustrates the centralized command coordination and compensating rollback path during a payment failure.

sequenceDiagram
    autonumber
    participant Client as Checkout Client
    participant Orch as Saga Orchestrator
    participant State as Saga State DB
    participant Inv as Inventory Service
    participant Pay as Payment Service (Fails)
    
    Client->>Orch: POST /v1/sagas/orders
    Orch->>State: Create Saga Instance & Set status=STARTED
    Orch->>Inv: Send Command: RESERVE_STOCK
    Inv-->>Orch: Reply: STOCK_RESERVED (Success)
    Orch->>State: Log Step 1 Success & Set status=STOCK_RESERVED
    
    Orch->>Pay: Send Command: CHARGE_CARD
    Pay-->>Orch: Reply: INSUFFICIENT_FUNDS (Failure)
    Orch->>State: Log Step 2 Failure & Set status=COMPENSATING
    
    Note over Orch, Inv: Trigger Rollback Compensation
    Orch->>Inv: Send Command: RELEASE_STOCK (Compensate Step 1)
    Inv-->>Orch: Reply: STOCK_RELEASED (Success)
    Orch->>State: Set status=REJECTED
    Orch-->>Client: Return Order Rejected Status
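
The rollback path in this diagram can be sketched as a loop that remembers completed steps and compensates them in reverse order. This is a simplified, synchronous sketch: `run_saga`, `StepFailed`, and the in-memory `stock` dictionary are hypothetical names for illustration, and a real orchestrator would persist each transition to the Saga State DB.

```python
from typing import Callable


class StepFailed(Exception):
    """Raised by a participant call to signal a terminal failure."""


# Each step pairs an action with its compensation (names are illustrative).
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step], history: list[str]) -> str:
    completed: list[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except StepFailed:
            history.append(f"{name}:FAILED")
            # Compensate already-completed steps in reverse order.
            for done_name, _, compensate in reversed(completed):
                compensate()
                history.append(f"{done_name}:COMPENSATED")
            return "REJECTED"
        history.append(f"{name}:SUCCESS")
        completed.append(step)
    return "COMPLETED"


stock = {"item_8819": 5}


def reserve_stock() -> None:
    stock["item_8819"] -= 2


def release_stock() -> None:
    stock["item_8819"] += 2


def charge_card() -> None:
    raise StepFailed("INSUFFICIENT_FUNDS")


def refund_card() -> None:
    pass  # never reached in this run: the charge itself failed


history: list[str] = []
status = run_saga(
    [
        ("RESERVE_STOCK", reserve_stock, release_stock),
        ("CHARGE_CARD", charge_card, refund_card),
    ],
    history,
)
print(status, history, stock)
```

Only steps that succeeded are compensated; the failed step has nothing to undo.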

Low-Level Design and Schema

For centralized orchestration, the database must track Saga state, participant execution, and history logs. We model this using a PostgreSQL relational database.

-- Tracks overall Saga instance executions
CREATE TABLE saga_instances (
    saga_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    saga_type VARCHAR(128) NOT NULL,
    status VARCHAR(32) NOT NULL DEFAULT 'STARTED', -- STARTED, COMPLETED, COMPENSATING, REJECTED, FAILED, FAILED_COMPENSATION
    input_payload JSONB NOT NULL,
    current_step_index INT NOT NULL DEFAULT 0,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    completed_at TIMESTAMPTZ
);

CREATE INDEX idx_saga_instances_status 
ON saga_instances (status, created_at);

-- Tracks participant steps within a Saga instance
CREATE TABLE saga_steps (
    step_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    saga_id UUID NOT NULL REFERENCES saga_instances(saga_id) ON DELETE CASCADE,
    step_name VARCHAR(128) NOT NULL,
    step_index INT NOT NULL,
    status VARCHAR(32) NOT NULL DEFAULT 'PENDING', -- PENDING, SUCCESS, FAILED, COMPENSATED
    input_payload JSONB,
    output_payload JSONB,
    error_message TEXT,
    started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    CONSTRAINT uk_saga_step UNIQUE (saga_id, step_index)
);

-- No separate lookup index on (saga_id, step_index) is needed:
-- the uk_saga_step UNIQUE constraint already creates a backing index.

-- Outbox logs for publishing events atomically
CREATE TABLE transactional_outbox (
    event_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    aggregate_type VARCHAR(128) NOT NULL,
    aggregate_id VARCHAR(128) NOT NULL,
    event_type VARCHAR(128) NOT NULL,
    payload JSONB NOT NULL,
    status VARCHAR(32) NOT NULL DEFAULT 'PENDING', -- PENDING, PUBLISHED, FAILED
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    processed_at TIMESTAMPTZ
);

CREATE INDEX idx_outbox_pending 
ON transactional_outbox (status, created_at) 
WHERE status = 'PENDING';

Schema Rationale & Index Optimization

  1. uk_saga_step: Enforces uniqueness on (saga_id, step_index). If a participant timeout causes a command to be retried, the retry cannot create a second step record for the same index.
  2. idx_outbox_pending: A partial index restricted to status = 'PENDING'. A polling outbox publisher (such as a custom daemon) repeatedly queries this table for unpublished rows, and the partial index keeps that scan limited to rows still awaiting publication. A log-based CDC tool such as Debezium reads the PostgreSQL write-ahead log instead of querying the table, so it does not depend on this index.
  3. ON DELETE CASCADE: Deleting a finished Saga instance also deletes its child step logs in the same statement, which simplifies purging old records.
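
The interaction between the unique constraint and the outbox can be tried locally. The sketch below uses SQLite from the Python standard library as a stand-in for PostgreSQL, with trimmed-down versions of the tables; `record_step` is a hypothetical helper. It writes the outbox row and the step row in one transaction, so a duplicate step rolls back both.

```python
import sqlite3

# SQLite stands in for PostgreSQL; columns are a reduced subset of the schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE saga_steps (
    saga_id TEXT NOT NULL,
    step_index INTEGER NOT NULL,
    step_name TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'PENDING',
    UNIQUE (saga_id, step_index)
);
CREATE TABLE transactional_outbox (
    event_id INTEGER PRIMARY KEY,
    aggregate_id TEXT NOT NULL,
    event_type TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'PENDING'
);
""")


def record_step(saga_id: str, step_index: int, step_name: str) -> bool:
    """Insert the outbox command and the step row in one transaction.

    Returns False when the step already exists (for example, a retried command).
    """
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "INSERT INTO transactional_outbox (aggregate_id, event_type) VALUES (?, ?)",
                (saga_id, step_name),
            )
            conn.execute(
                "INSERT INTO saga_steps (saga_id, step_index, step_name) VALUES (?, ?, ?)",
                (saga_id, step_index, step_name),
            )
        return True
    except sqlite3.IntegrityError:
        return False


first = record_step("saga_1", 0, "RESERVE_STOCK")
duplicate = record_step("saga_1", 0, "RESERVE_STOCK")
pending = conn.execute(
    "SELECT COUNT(*) FROM transactional_outbox WHERE status = 'PENDING'"
).fetchone()[0]
print(first, duplicate, pending)
```

The duplicate attempt leaves exactly one pending outbox row, because the rejected step insert rolls back the outbox insert made in the same transaction.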

Scaling Challenges and Capacity Estimation

Coordinating distributed transactions at 5,000 TPS requires evaluating write IOPS, event size growth, and database sharding.

1. Database Write IOPS Capacity

  • Assumptions:

    • Peak Saga Rate ($S$) = $5,000$ Sagas/sec
    • Average Steps per Saga = $4$ steps
    • Each step writes $2$ database rows (1 to insert the step log, 1 to update the Saga instance status). Starting and completing the Saga adds $2$ writes.
  • Calculations: $$\text{Total Write Operations Per Second} = S \times (2 + 2 \times 4) = 5,000 \times 10 = 50,000\text{ writes/sec}$$

A single relational database instance cannot support 50,000 write operations per second under ordinary circumstances. To scale this write throughput:

  • We shard the saga_instances and saga_steps tables by hashing the saga_id across $16$ database instances.
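  • We keep active, hot Saga state in an in-memory key-value store (e.g., Redis, or keyed state in Flink) and flush logs to PostgreSQL asynchronously through a Kafka outbox queue. This trades durability for throughput: state that has not yet been flushed can be lost in a crash, so the in-memory store must itself be replicated or persisted to meet the Fail-Safe Recovery requirement.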
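
A minimal sketch of hash-based shard routing, together with the per-shard write load implied by the numbers above. The choice of SHA-256 and the `shard_for` helper are assumptions for illustration; the key point is to use a stable hash rather than Python's per-process salted `hash()`.

```python
import hashlib

NUM_SHARDS = 16


def shard_for(saga_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a saga_id to a shard index using a stable hash."""
    digest = hashlib.sha256(saga_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# 5,000 Sagas/sec, each with 2 bookkeeping writes plus 2 writes for each of 4 steps.
total_writes_per_sec = 5_000 * (2 + 2 * 4)
writes_per_shard = total_writes_per_sec / NUM_SHARDS

print(shard_for("saga_uuid_77182ab"), total_writes_per_sec, writes_per_shard)
```

Because saga_instances and saga_steps are both keyed by saga_id, all rows for one Saga land on the same shard, so each Saga's writes stay within a single local transaction. With an even spread, each shard absorbs roughly 3,125 writes per second.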

2. Event Broker Network Bandwidth

  • Assumptions:

    • Peak Saga Rate = $5,000$ Sagas/sec
    • Average message payload size ($M$) = $1$ KB (contains step payload + trace contexts)
  • Calculations:

    • Each Saga produces 1 start message, 4 command messages, 4 reply messages, and 1 completion message. Total events per Saga = $10$ events. $$\text{Total Broker Messages Per Second} = 5,000 \times 10 = 50,000\text{ messages/sec}$$ $$\text{Network Bandwidth} = 50,000 \times 1\text{ KB} = 50,000\text{ KB/sec} = 50\text{ MB/sec} \approx 400\text{ Mbps}$$

Sustaining this throughput requires spreading the load across Kafka partitions and brokers. We run a Kafka cluster with at least $3$ broker nodes and divide the order topics into $24$ partitions to spread the network load evenly.
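
The broker estimate can be reproduced with a few lines of arithmetic. The sketch uses decimal units as the text does, and assumes an even spread across the 24 partitions. It counts only producer ingress; replication between brokers and consumer reads add to this figure.

```python
SAGAS_PER_SEC = 5_000
EVENTS_PER_SAGA = 1 + 4 + 4 + 1  # start + commands + replies + completion
MESSAGE_KB = 1
PARTITIONS = 24

messages_per_sec = SAGAS_PER_SEC * EVENTS_PER_SAGA
mb_per_sec = messages_per_sec * MESSAGE_KB / 1_000  # 1 MB = 1,000 KB, as in the text
mbps = mb_per_sec * 8
messages_per_partition = messages_per_sec / PARTITIONS

print(messages_per_sec, mb_per_sec, mbps, round(messages_per_partition))
```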


Failure Scenarios and Resilience

Saga platforms must recover safely from process crashes and network partitions without generating transaction discrepancies.

1. Saga Orchestrator Crash Recovery

The orchestrator server crashes after sending a CHARGE_CARD command but before receiving the gateway reply.

  • The Threat: When the orchestrator recovers, it does not know if the card was charged, potentially leading to stuck orders or duplicate billing.
  • Resilience Design:
    • When the orchestrator restarts, it scans the saga_instances table for Sagas in a non-terminal status (status = 'STARTED' or status = 'COMPENSATING').
    • It reads the saga_steps history logs to reconstruct the execution state.
    • To verify the state of the in-progress step, it queries the participant service using the saga_id to get the outcome, resuming execution from that point forward.
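
The recovery decision can be sketched as a pure function over the persisted state. The action names (`QUERY_PARTICIPANT`, `DISPATCH_NEXT_STEP`, and so on) and the `StepRecord` type are hypothetical; the status values follow the schema comments above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StepRecord:
    step_index: int
    step_name: str
    status: str  # PENDING, SUCCESS, FAILED, COMPENSATED


TERMINAL = {"COMPLETED", "REJECTED", "FAILED"}


def next_action(saga_status: str, steps: list[StepRecord]) -> tuple[str, Optional[str]]:
    """Decide what a restarted orchestrator should do for one Saga."""
    if saga_status in TERMINAL:
        return ("NONE", None)
    ordered = sorted(steps, key=lambda s: s.step_index)
    if saga_status == "COMPENSATING":
        # Resume the rollback at the latest step that is still uncompensated.
        for step in reversed(ordered):
            if step.status == "SUCCESS":
                return ("COMPENSATE", step.step_name)
        return ("MARK_REJECTED", None)
    for step in ordered:
        if step.status == "PENDING":
            # Command was sent but never acknowledged: ask the participant.
            return ("QUERY_PARTICIPANT", step.step_name)
        if step.status == "FAILED":
            return ("START_COMPENSATION", step.step_name)
    return ("DISPATCH_NEXT_STEP", None)


in_flight = [
    StepRecord(0, "RESERVE_STOCK", "SUCCESS"),
    StepRecord(1, "CHARGE_CARD", "PENDING"),
]
print(next_action("STARTED", in_flight))
```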

2. Participant Microservice Timeouts

The orchestrator sends a RESERVE_STOCK command to the Inventory Service, but the network connection drops and the request times out.

  • The Threat: We do not know if the inventory was reserved, potentially leading to orphan stock locks.
  • Resilience Design:
    • The orchestrator must not treat a timeout as a failure and trigger compensations immediately.
    • It queries the Inventory Service using the saga_id to check the lock status.
    • If the reservation was written, the orchestrator updates its state to success. If it was not, the orchestrator retries the request. If the Inventory Service returns a terminal error, the orchestrator initiates compensations.
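
The query-then-retry logic above might look like the following sketch. The status strings, the `resolve_timeout` helper, and the `UNRESOLVED` outcome for exhausted retries are assumptions for illustration; the retried command is assumed to be idempotent on the participant side.

```python
from typing import Callable


def resolve_timeout(
    query_status: Callable[[str], str],
    retry: Callable[[str], str],
    saga_id: str,
    max_retries: int = 3,
) -> str:
    """Decide the step outcome after a timed-out command.

    query_status returns RESERVED, NOT_FOUND, or TERMINAL_ERROR (hypothetical values).
    retry re-sends the command and returns SUCCESS, TIMEOUT, or TERMINAL_ERROR.
    """
    status = query_status(saga_id)
    if status == "RESERVED":
        return "SUCCESS"  # the original command did go through
    if status == "TERMINAL_ERROR":
        return "COMPENSATE"
    for _ in range(max_retries):
        result = retry(saga_id)
        if result == "SUCCESS":
            return "SUCCESS"
        if result == "TERMINAL_ERROR":
            return "COMPENSATE"
    # Still timing out: the outcome is unknown, so do not guess.
    return "UNRESOLVED"


print(resolve_timeout(lambda s: "NOT_FOUND", lambda s: "SUCCESS", "saga_1"))
```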

3. Out-of-Order Event Deliveries

Due to network routing variances, a PAYMENT_COMPLETED reply arrives at the orchestrator before the INVENTORY_RESERVED confirmation event.

  • The Threat: The orchestrator state machine updates the Saga status out of sequence, violating execution constraints.
  • Resilience Design:
    • We attach a monotonically increasing sequence number to each Saga event.
    • The orchestrator checks whether the incoming event's sequence number matches the expected step index.
    • If the event arrives ahead of its turn, the orchestrator buffers it in a Redis cache and processes it only after the preceding steps complete.
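
A minimal sketch of sequence-based buffering for a single Saga. A plain dictionary stands in for the Redis cache, and the `OrderedEventProcessor` class is hypothetical. Dropping events whose sequence number has already been applied is an added assumption to make redelivery harmless.

```python
class OrderedEventProcessor:
    """Applies events for one Saga strictly in sequence order."""

    def __init__(self) -> None:
        self.expected_seq = 0
        self.buffer: dict[int, str] = {}  # stands in for the Redis cache
        self.applied: list[str] = []

    def receive(self, seq: int, event_type: str) -> None:
        if seq < self.expected_seq:
            return  # already applied: treat as a duplicate delivery
        self.buffer[seq] = event_type
        # Drain every buffered event that is now in order.
        while self.expected_seq in self.buffer:
            self.applied.append(self.buffer.pop(self.expected_seq))
            self.expected_seq += 1


processor = OrderedEventProcessor()
processor.receive(1, "PAYMENT_COMPLETED")    # early: buffered, not applied
early_snapshot = list(processor.applied)
processor.receive(0, "INVENTORY_RESERVED")   # unblocks both events
print(early_snapshot, processor.applied)
```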

4. Semantic Lock Conflicts (Lack of Isolation)

Because Saga steps commit local transactions immediately, a second Saga might read intermediate data (e.g., reserving stock that has already been provisionally reserved by a pending Saga).

  • The Threat: Over-selling inventory if both Sagas go on to complete, or building on a reservation that the first Saga later releases.
  • Resilience Design:
    • We implement Semantic Locks. When a Saga reserves stock, the Inventory Service updates the row status to PENDING_RESERVED and saves the saga_id.
    • If a second Saga attempts to purchase the same item, the inventory engine blocks it or prompts it to wait.
    • Once the first Saga completes or compensates, the status transitions to COMMITTED or back to AVAILABLE respectively, preventing double-selling.
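
The semantic lock can be sketched as a small state machine. This example models a single sellable unit, which is a simplifying assumption; `InventoryItem` and its method names are hypothetical, and it takes the "block" branch rather than making the second Saga wait.

```python
from typing import Optional


class InventoryItem:
    """One sellable unit guarded by a semantic lock."""

    def __init__(self) -> None:
        self.status = "AVAILABLE"
        self.locked_by: Optional[str] = None

    def reserve(self, saga_id: str) -> bool:
        if self.status == "AVAILABLE":
            self.status, self.locked_by = "PENDING_RESERVED", saga_id
            return True
        # A retry from the Saga holding the lock succeeds; other Sagas are blocked.
        return self.status == "PENDING_RESERVED" and self.locked_by == saga_id

    def commit(self, saga_id: str) -> None:
        if self.status == "PENDING_RESERVED" and self.locked_by == saga_id:
            self.status = "COMMITTED"

    def release(self, saga_id: str) -> None:
        if self.status == "PENDING_RESERVED" and self.locked_by == saga_id:
            self.status, self.locked_by = "AVAILABLE", None


item = InventoryItem()
first = item.reserve("saga_a")   # acquires the semantic lock
second = item.reserve("saga_b")  # blocked while saga_a is pending
item.release("saga_a")           # saga_a compensates
third = item.reserve("saga_b")   # now succeeds
item.commit("saga_b")
print(first, second, third, item.status)
```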

Architectural Trade-offs

Choosing the coordination model and consistency mechanism requires balancing operational overhead against system complexity.

Trade-off 1: Choreography (Decentralized) vs. Orchestration (Centralized)

Decentralized choreography relies on event reaction; centralized orchestration utilizes a dedicated workflow engine.

| Aspect | Choreography (Decentralized) | Orchestration (Centralized) |
| --- | --- | --- |
| System Coupling | Low (Services only know about the event broker) | High (Orchestrator coordinates all participants) |
| Operational Visibility | Low (Must trace events across multiple logs) | High (Single database tracks execution state) |
| Debugging Complexity | High (Hard to recreate failure event chains) | Low (Saga state logs show the exact failure point) |
| Single Point of Failure | Low (No central coordinator node) | High (If orchestrator DB fails, all Sagas halt) |
| Circular Dependencies | High risk (Services can easily create event loops) | Low (Orchestrator enforces a linear execution graph) |

Trade-off 2: Two-Phase Commit (2PC) vs. Saga Pattern

2PC enforces strong consistency through a coordinator that holds locks on every participant until commit; a Saga provides eventual consistency using local transactions and compensating transactions.

| Aspect | Two-Phase Commit (2PC) | Saga Pattern |
| --- | --- | --- |
| Consistency Model | Strong Consistency (ACID) | Eventual Consistency (BASE) |
| Write Throughput | Low (Blocked by network locks and disk writes) | High (Local database transactions execute immediately) |
| Resource Locking | High (Locks rows until all nodes confirm success) | Low (No global database locks are held) |
| Failure Behavior | Rolls back atomically during participant outages | Compensates asynchronously; intermediate states are visible |

Staff Engineer Perspective

Designing distributed state machines requires implementing strict safety barriers at the database layer.


Verbal Script

Interviewer: "How would you handle a scenario in an orchestration-based Saga where the orchestrator crashes immediately after sending a charge request to the Payment Service, and the service succeeds in charging the user?"

Candidate: "We solve this by separating command routing from execution confirmation, using idempotency keys and asynchronous orchestrator state recovery.

First, when the orchestrator initiates the payment step, it generates a unique transaction token (e.g., combining saga_id and the step name) and saves it to the saga_steps log as PENDING. It then passes this token as the idempotency key in the header to the Payment Service.

Second, when the orchestrator restarts, it scans the state database for Sagas in the STARTED status. It reads the history logs and finds that the payment step was sent but never acknowledged. Instead of retrying the charge or executing compensation immediately, the orchestrator queries the Payment Service using the transaction token to verify the status.

Because the Payment Service stores the outcome of each request under its idempotency key, it returns a status of CAPTURED and the transaction details. The orchestrator updates its state database, logs the step as a success, and proceeds to the next step, ensuring that the customer is not charged twice and the transaction completes safely."
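
The idempotency-key behaviour in this answer can be sketched as follows. `PaymentService`, `make_idempotency_key`, and the in-memory result store are hypothetical; a real service would persist results durably and check the key inside the same transaction as the charge.

```python
from typing import Optional


def make_idempotency_key(saga_id: str, step_name: str) -> str:
    return f"{saga_id}:{step_name}"


class PaymentService:
    """Remembers the outcome of each charge by idempotency key."""

    def __init__(self) -> None:
        self.results: dict[str, dict] = {}
        self.charge_count = 0  # how many times money actually moved

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        if idempotency_key in self.results:
            return self.results[idempotency_key]  # replay: no second charge
        self.charge_count += 1
        result = {"status": "CAPTURED", "amountCents": amount_cents}
        self.results[idempotency_key] = result
        return result

    def get_status(self, idempotency_key: str) -> Optional[dict]:
        return self.results.get(idempotency_key)


service = PaymentService()
key = make_idempotency_key("saga_uuid_77182ab", "CHARGE_CARD")
service.charge(key, 9900)              # original command; orchestrator then crashes
recovered = service.get_status(key)    # restarted orchestrator checks the outcome
service.charge(key, 9900)              # even a blind retry would be safe
print(recovered, service.charge_count)
```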

Interviewer: "What is your strategy for handling out-of-order execution if a compensation message is received by a participant before the original command message?"

Candidate: "This is known as the Out-of-Order Message Arrival problem, and it occurs when networks partition or route messages via divergent paths. If the participant receives a compensation command (e.g., RELEASE_STOCK) before the original command (RESERVE_STOCK), executing the compensation first would result in a no-op because no stock is currently reserved. When the original reservation command arrives later, it would reserve the stock, leaving it locked indefinitely.

To prevent this, the participant database must maintain a Saga Execution Ledger. When the compensation command arrives first, the participant writes a row to its local table: (saga_id, status = COMPENSATED).

When the original reservation command arrives later, the participant checks the ledger first. Finding that the Saga has already been compensated, the participant discards the reservation write immediately, returning a success status to the orchestrator. This prevents the lock from being written and ensures consistency."
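
A sketch of the participant-side ledger described in this answer. `InventoryParticipant` is hypothetical and tracks a simple count of reserved units; the ledger is an in-memory dictionary here, whereas the text places it in the participant's own database.

```python
class InventoryParticipant:
    """Keeps a per-Saga ledger so a late command cannot undo a compensation."""

    def __init__(self) -> None:
        self.ledger: dict[str, str] = {}  # saga_id -> RESERVED or COMPENSATED
        self.reserved_units = 0

    def compensate(self, saga_id: str) -> None:
        if self.ledger.get(saga_id) == "RESERVED":
            self.reserved_units -= 1
        # Record the outcome even when nothing was reserved yet.
        self.ledger[saga_id] = "COMPENSATED"

    def execute(self, saga_id: str) -> str:
        state = self.ledger.get(saga_id)
        if state is None:
            self.reserved_units += 1
            self.ledger[saga_id] = "RESERVED"
        # For COMPENSATED (late command) or RESERVED (retry) nothing is written.
        return "SUCCESS"


participant = InventoryParticipant()
participant.compensate("saga_1")         # RELEASE_STOCK arrives first
reply = participant.execute("saga_1")    # late RESERVE_STOCK is discarded
print(reply, participant.reserved_units, participant.ledger["saga_1"])
```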

Interviewer: "How would you prevent data inconsistency if a compensating transaction fails permanently due to a terminal error in a third-party API?"

Candidate: "If a compensating transaction fails terminally (e.g., a payment refund is rejected by the acquiring bank due to account closure), the automated system cannot reconcile the transaction.

First, we do not let the orchestrator retry indefinitely, as this ties up worker threads. The orchestrator halts automated retries after a configured limit (e.g., 5 attempts).

Second, the orchestrator writes a record to the transactional_outbox indicating a terminal reconciliation failure, updating the Saga instance status to FAILED_COMPENSATION.

Finally, a background monitor consumes this outbox message, triggers an immediate high-priority alert to our operations team via PagerDuty, and logs the case details in our manual review queue. The operations team can then review the case and execute a manual wire transfer or write-off, ensuring audit compliance."
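
The bounded-retry escalation in this answer can be sketched as below. The `compensate_with_limit` helper and the `COMPENSATION_FAILED` event type are hypothetical names, a plain list stands in for the transactional_outbox table, and any delay between attempts is omitted to keep the sketch short.

```python
from typing import Callable


def compensate_with_limit(
    compensate: Callable[[], bool],
    outbox: list[dict],
    saga_id: str,
    max_attempts: int = 5,
) -> str:
    """Retry a compensation a bounded number of times, then escalate."""
    for _ in range(max_attempts):
        if compensate():
            return "REJECTED"  # rollback finished; the Saga ends as rejected
    # Give up on automation and hand the case to a human via the outbox.
    outbox.append(
        {
            "aggregate_id": saga_id,
            "event_type": "COMPENSATION_FAILED",
            "attempts": max_attempts,
        }
    )
    return "FAILED_COMPENSATION"


attempts: list[int] = []


def refund_always_rejected() -> bool:
    attempts.append(1)
    return False


outbox: list[dict] = []
final_status = compensate_with_limit(refund_always_rejected, outbox, "saga_1")
print(final_status, len(attempts), outbox)
```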

