
System Design: Building a Workflow Orchestration Platform

Design a production workflow orchestration platform with durable workflow state, retries, compensation, timeouts, signals, timers, idempotent activities, and operational recovery.

Sachin Sarawgi · April 18, 2026 · 12 min read

Many backend processes start life as a queue consumer and a couple of retries.

That works until the business process gets longer and stranger.

Now the flow needs:

  • wait for payment confirmation
  • call three downstream services
  • retry one step but not another
  • compensate if shipment creation fails
  • pause for manual approval
  • resume after a webhook arrives
  • survive worker crashes without forgetting what already happened

At that point, you no longer have "a background job." You have a workflow.

This guide designs a production workflow orchestration platform.

Problem Statement

Build a platform that executes long-running, multi-step workflows durably and safely.

Examples:

  • order fulfillment workflow
  • loan approval workflow
  • user onboarding workflow
  • payout approval and disbursement workflow
  • document verification workflow
  • account recovery workflow

The platform should support:

  • durable workflow state
  • sequential and branching steps
  • retries and backoff
  • timers and delays
  • manual approval pauses
  • external signals and callbacks
  • compensation for partially completed work
  • visibility into workflow progress

This is not just a scheduler and not just a queue. It is a durable state machine for business processes.

Requirements

Functional requirements:

  • start a workflow
  • execute activities step by step
  • persist workflow state
  • retry failed steps
  • wait on timers
  • accept external signals or callbacks
  • support workflow cancellation
  • support manual intervention
  • expose workflow history
  • support replay and debugging

Non-functional requirements:

  • survive process crashes
  • avoid losing in-progress business state
  • bound duplicate side effects
  • scale to many concurrent workflows
  • handle workflows that run for seconds, hours, or days
  • provide clear observability

The most important constraint:

workflow progress must not depend on the memory of one worker process.

Workflow vs Job Scheduler

A scheduler answers:

When should this task run?

A workflow engine answers:

What state is this long-running process in, and what should happen next?

Schedulers are great for:

  • run a job at 9 AM
  • retry a failed task later
  • trigger periodic cleanup

Workflow engines are for:

  • reserve inventory
  • charge payment
  • wait for webhook
  • create shipment
  • send confirmation
  • compensate if shipment fails after payment succeeded

The distinction matters because workflows need durable, queryable state transitions.

High-Level Architecture

API / Starter
   |
   v
Workflow Engine
   |
   +--> workflow state store
   +--> timer queue
   +--> activity task queue
   +--> signal/event inbox
   |
   v
Workers / Activity Runners
   |
   +--> call external services
   +--> report completion / failure
   |
   v
Downstream systems

Supporting systems:

  • workflow UI
  • audit/history store
  • dead-letter / stuck-workflow tooling

Core Concepts

Separate these ideas clearly:

  • Workflow definition: the process model or code describing steps and transitions
  • Workflow execution: one running instance of that process
  • Activity: one side-effecting step, such as calling a payment API
  • Signal: an external event that wakes or changes a workflow
  • Timer: a delayed wake-up for retries, waiting, or timeouts

Example:

workflow definition: order_fulfillment
workflow execution: wf_123
activities:
  - reserve_inventory
  - charge_payment
  - create_shipment
  - send_confirmation

Example Workflow

Order fulfillment:

  1. reserve inventory
  2. charge payment
  3. create order record
  4. create shipment
  5. send confirmation

Failure path:

  • if payment fails -> release inventory
  • if shipment creation fails after payment -> mark manual review or compensate

This is why orchestration exists: each step may succeed independently, but the business process still needs a coherent outcome.

Data Model

Workflow executions

CREATE TABLE workflow_executions (
  workflow_id UUID PRIMARY KEY,
  tenant_id TEXT NOT NULL,
  workflow_type TEXT NOT NULL,
  business_key TEXT,
  status TEXT NOT NULL,             -- RUNNING, WAITING, SUCCEEDED, FAILED, CANCELLED
  current_state TEXT NOT NULL,
  input JSONB NOT NULL,
  context JSONB NOT NULL DEFAULT '{}'::jsonb,
  started_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  completed_at TIMESTAMPTZ,
  UNIQUE (tenant_id, workflow_type, business_key)
);

Workflow history

CREATE TABLE workflow_history (
  event_id BIGSERIAL PRIMARY KEY,
  workflow_id UUID NOT NULL REFERENCES workflow_executions(workflow_id),
  event_type TEXT NOT NULL,         -- started, activity_scheduled, activity_completed, timer_fired, signaled
  state_before TEXT,
  state_after TEXT,
  payload JSONB NOT NULL DEFAULT '{}'::jsonb,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_workflow_history_lookup
  ON workflow_history (workflow_id, event_id);

Activity tasks

CREATE TABLE workflow_activity_tasks (
  task_id UUID PRIMARY KEY,
  workflow_id UUID NOT NULL REFERENCES workflow_executions(workflow_id),
  activity_name TEXT NOT NULL,
  status TEXT NOT NULL,             -- PENDING, RUNNING, SUCCEEDED, FAILED, CANCELLED
  attempt_count INT NOT NULL DEFAULT 0,
  next_attempt_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  locked_by TEXT,
  locked_until TIMESTAMPTZ,
  input JSONB NOT NULL,
  result JSONB,
  last_error TEXT,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

Timers / wakeups

CREATE TABLE workflow_timers (
  timer_id UUID PRIMARY KEY,
  workflow_id UUID NOT NULL REFERENCES workflow_executions(workflow_id),
  timer_type TEXT NOT NULL,
  fire_at TIMESTAMPTZ NOT NULL,
  status TEXT NOT NULL DEFAULT 'PENDING', -- PENDING, FIRED, CANCELLED
  payload JSONB NOT NULL DEFAULT '{}'::jsonb,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_workflow_timers_due
  ON workflow_timers (status, fire_at)
  WHERE status = 'PENDING';

Signals / external events

CREATE TABLE workflow_signals (
  signal_id UUID PRIMARY KEY,
  workflow_id UUID NOT NULL REFERENCES workflow_executions(workflow_id),
  signal_name TEXT NOT NULL,
  payload JSONB NOT NULL,
  processed BOOLEAN NOT NULL DEFAULT false,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_workflow_signals_unprocessed
  ON workflow_signals (workflow_id, processed)
  WHERE processed = false;

State Machine Design

A workflow is a durable state machine.

Example order workflow states:

STARTED
INVENTORY_RESERVED
PAYMENT_CHARGED
ORDER_CREATED
SHIPMENT_CREATED
COMPLETED

Failure branches:
PAYMENT_FAILED
SHIPMENT_FAILED
MANUAL_REVIEW
CANCELLED

Keep state names boring and explicit. If operators cannot read the workflow state and understand what happened, the system becomes an archaeology project.
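Explicit states deserve an explicit transition table, so an illegal transition fails loudly instead of silently corrupting a workflow. A minimal sketch using the state names above (the exact transition map is an illustrative assumption, not a fixed specification):

```java
import java.util.Map;
import java.util.Set;

public class OrderWorkflowStates {

    // Allowed transitions for the order fulfillment workflow described above.
    // Which edges exist is a design decision; this map is one plausible choice.
    private static final Map<String, Set<String>> TRANSITIONS = Map.of(
        "STARTED",            Set.of("INVENTORY_RESERVED", "CANCELLED"),
        "INVENTORY_RESERVED", Set.of("PAYMENT_CHARGED", "PAYMENT_FAILED", "CANCELLED"),
        "PAYMENT_CHARGED",    Set.of("ORDER_CREATED", "SHIPMENT_FAILED"),
        "ORDER_CREATED",      Set.of("SHIPMENT_CREATED", "SHIPMENT_FAILED"),
        "SHIPMENT_CREATED",   Set.of("COMPLETED"),
        "PAYMENT_FAILED",     Set.of("CANCELLED", "MANUAL_REVIEW"),
        "SHIPMENT_FAILED",    Set.of("MANUAL_REVIEW"),
        "MANUAL_REVIEW",      Set.of("COMPLETED", "CANCELLED")
    );

    public static boolean canTransition(String from, String to) {
        return TRANSITIONS.getOrDefault(from, Set.of()).contains(to);
    }
}
```

The engine checks `canTransition` inside the same transaction that updates `current_state`, so a buggy worker cannot jump a workflow from STARTED straight to COMPLETED.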

Workflow Execution Model

There are two broad implementation styles:

Command/state-machine style

Store explicit state and let workers transition it.

Pros:

  • simple mental model

Cons:

  • more boilerplate
  • branching logic can sprawl

Event-history replay style

Store workflow event history and reconstruct current state by replay.

Pros:

  • strong determinism
  • easier debugging and replay

Cons:

  • more advanced engine design

This post stays implementation-agnostic enough to fit either style, but the reliability principles are the same.

Starting a Workflow

POST /v1/workflows
{
  "workflowType": "order_fulfillment",
  "tenantId": "merchant_42",
  "businessKey": "order_123",
  "input": {
    "orderId": "order_123",
    "paymentId": "pay_456",
    "userId": "user_77"
  }
}

Response:

{
  "workflowId": "wf_123",
  "status": "RUNNING",
  "state": "STARTED"
}

businessKey is important. It gives you idempotency for "start workflow for this order."
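One way to back the UNIQUE constraint at the application layer is to derive the workflow ID deterministically from the identity triple, so a retried start request maps to the same execution row instead of racing to insert a new one. A minimal sketch (name-based UUIDs are one option; the separator scheme here is an assumption):

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class WorkflowIds {

    // Derive a stable workflow ID from (tenant, type, businessKey), so a
    // client that retries "start workflow" produces the same ID each time.
    public static UUID deterministicId(String tenantId, String workflowType, String businessKey) {
        String name = tenantId + ":" + workflowType + ":" + businessKey;
        return UUID.nameUUIDFromBytes(name.getBytes(StandardCharsets.UTF_8));
    }
}
```

With deterministic IDs, the insert becomes an upsert and the duplicate start path is a cheap "already exists" read rather than a conflict to untangle.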

Scheduling Activities

An activity is the side-effecting part of a workflow.

Examples:

  • reserve inventory
  • charge payment
  • call KYC provider
  • create shipment

The workflow engine should schedule an activity task, not perform the side effect inline in the control transaction.

public void scheduleActivity(UUID workflowId, String activityName, JsonNode input) {
    activityTaskRepository.insert(
        ActivityTask.builder()
            .taskId(UUID.randomUUID())
            .workflowId(workflowId)
            .activityName(activityName)
            .status("PENDING")
            .input(input)
            .build()
    );

    workflowHistoryRepository.append(workflowId, "activity_scheduled", activityName);
}

This separation helps with retries and observability.

Claiming Activity Work With Leases

Workers must claim activity tasks safely.

UPDATE workflow_activity_tasks
SET status = 'RUNNING',
    locked_by = :worker_id,
    locked_until = now() + interval '30 seconds',
    attempt_count = attempt_count + 1,
    updated_at = now()
WHERE task_id = (
  SELECT task_id
  FROM workflow_activity_tasks
  WHERE status IN ('PENDING', 'FAILED')
    AND next_attempt_at <= now()
    AND (locked_until IS NULL OR locked_until < now())
  ORDER BY next_attempt_at ASC
  FOR UPDATE SKIP LOCKED
  LIMIT 1
)
RETURNING *;

Lease-based claiming is how the system recovers if a worker crashes after taking work.

Idempotent Activities

Workflow infrastructure can generally only give you at-least-once execution.

That means activities must be idempotent or externally deduplicated.

Example:

public PaymentResult chargePayment(String paymentId, UUID workflowId, UUID taskId) {
    String idempotencyKey = "workflow:" + workflowId + ":task:" + taskId;
    return paymentGateway.charge(paymentId, idempotencyKey);
}

Without idempotency keys, retries turn infrastructure recovery into duplicate charges.

Retries and Backoff

Not all failures are equal.

Retry categories:

  • transient: timeout, 502, temporary network issue
  • terminal: validation error, insufficient funds, invalid address
  • unknown: the request was submitted but the response timed out; the side effect may or may not have happened, so handle with care

Retry policy example:

{
  "maxAttempts": 5,
  "initialDelayMs": 1000,
  "backoffMultiplier": 2.0,
  "maxDelayMs": 300000
}

The workflow engine should apply retry policy per activity, not globally for all steps.
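The policy JSON above maps directly to a capped exponential backoff computation. A minimal sketch, with field names mirroring the policy example (adding jitter on top of this is a common refinement, omitted here for clarity):

```java
public class RetryPolicy {

    private final int maxAttempts;
    private final long initialDelayMs;
    private final double backoffMultiplier;
    private final long maxDelayMs;

    public RetryPolicy(int maxAttempts, long initialDelayMs,
                       double backoffMultiplier, long maxDelayMs) {
        this.maxAttempts = maxAttempts;
        this.initialDelayMs = initialDelayMs;
        this.backoffMultiplier = backoffMultiplier;
        this.maxDelayMs = maxDelayMs;
    }

    public boolean shouldRetry(int attemptCount) {
        return attemptCount < maxAttempts;
    }

    // Delay before attempt N (1-based): initial * multiplier^(N-1), capped at maxDelayMs.
    public long delayForAttempt(int attemptCount) {
        double delay = initialDelayMs * Math.pow(backoffMultiplier, attemptCount - 1);
        return (long) Math.min(delay, maxDelayMs);
    }
}
```

The engine uses `delayForAttempt` to set `next_attempt_at` on the failed task row, and `shouldRetry` to decide between rescheduling and marking the activity terminally failed.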

Timers and Waiting

Workflows often need to wait:

  • wait 15 minutes for payment webhook
  • retry shipment creation in 5 minutes
  • expire an approval request in 24 hours

Instead of sleeping in a process, create a durable timer:

public void waitUntil(UUID workflowId, Instant fireAt, String timerType) {
    timerRepository.insert(
        WorkflowTimer.builder()
            .timerId(UUID.randomUUID())
            .workflowId(workflowId)
            .timerType(timerType)
            .fireAt(fireAt)
            .build()
    );

    workflowRepository.moveToWaitingState(workflowId, timerType);
}

This is the key difference between robust orchestration and fragile in-memory logic.

Signals and External Events

Many workflows pause until an external event arrives.

Examples:

  • payment webhook
  • manual approval
  • KYC vendor callback
  • customer document upload

Signal API:

POST /v1/workflows/wf_123/signals
{
  "signalName": "payment_confirmed",
  "payload": {
    "paymentId": "pay_456",
    "status": "succeeded"
  }
}

Signals should be durable and idempotent. A signal that arrives before the workflow is ready should still be stored and processed when appropriate.
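Deduplication usually hangs off a uniqueness key the sender supplies (or one derived from the payload). A minimal sketch of the accept-or-ignore decision, with an in-memory set standing in for a database unique constraint on the dedupe key (the key format here is an assumption):

```java
import java.util.HashSet;
import java.util.Set;

public class SignalInbox {

    // Stand-in for a DB unique constraint on (workflow_id, signal_name, dedupe_key).
    private final Set<String> seen = new HashSet<>();

    // Returns true if the signal is new and should be recorded and processed,
    // false if it is a duplicate delivery and should be acknowledged but ignored.
    public boolean accept(String workflowId, String signalName, String dedupeKey) {
        return seen.add(workflowId + ":" + signalName + ":" + dedupeKey);
    }
}
```

In production this check belongs in the database, so duplicate webhook deliveries hitting two different API nodes still collapse to one stored signal.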

Compensation

This is where orchestration becomes much more than task retries.

Example:

  1. inventory reserved
  2. payment charged
  3. shipment creation fails permanently

Now you may need:

  • cancel charge or issue refund
  • release inventory
  • mark manual review

This is compensation, not rollback. Distributed systems do not magically undo side effects.

Example compensation flow:

reserve_inventory
charge_payment
create_shipment -> fails terminally
  -> compensate:
     refund_payment
     release_inventory
  -> mark workflow failed or manually resolved

Compensation rules should be explicit in workflow design, not improvised during incidents.
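One common way to keep compensation explicit is to record the undo action as each forward step completes, then run the recorded actions newest-first when a later step fails terminally. A minimal sketch of that bookkeeping (action names taken from the flow above; wiring this into the engine is left out):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class CompensationStack {

    private final Deque<String> compensations = new ArrayDeque<>();

    // Record the undo action when a forward step completes,
    // e.g. reserve_inventory completing registers release_inventory.
    public void onStepCompleted(String compensationAction) {
        compensations.push(compensationAction);
    }

    // On terminal failure, compensations run in reverse completion order.
    public List<String> compensationOrder() {
        return new ArrayList<>(compensations);
    }
}
```

Reverse order matters: refund the payment before releasing the inventory mirrors the order in which the side effects were created.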

Manual Tasks and Human-in-the-Loop

Some workflows cannot be fully automated.

Examples:

  • compliance approval
  • fraud review
  • exception handling after repeated failures

Manual task model:

CREATE TABLE workflow_manual_tasks (
  manual_task_id UUID PRIMARY KEY,
  workflow_id UUID NOT NULL REFERENCES workflow_executions(workflow_id),
  task_type TEXT NOT NULL,
  status TEXT NOT NULL,          -- OPEN, COMPLETED, CANCELLED
  assigned_to TEXT,
  input JSONB NOT NULL,
  result JSONB,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  completed_at TIMESTAMPTZ
);

The workflow should be able to pause in WAITING_FOR_MANUAL_ACTION and resume later.

Visibility and Querying

Operators need to answer:

  • what state is this workflow in?
  • which activity last failed?
  • how long has it been waiting?
  • what compensation already ran?
  • did the external signal arrive?

A workflow platform without good history and introspection becomes a black box.

Useful query:

SELECT event_type, state_before, state_after, payload, created_at
FROM workflow_history
WHERE workflow_id = :workflow_id
ORDER BY event_id ASC;

This should be a UI, not just a SQL habit.

Failure Modes

1. Worker crashes mid-activity

Fix:

  • lease expires
  • task becomes reclaimable
  • activity idempotency protects side effects

2. Timer lost after restart

Fix:

  • timers stored durably
  • timer scanner rebuilds due work from DB

3. Duplicate external signal

Fix:

  • signal dedupe key
  • state-aware signal handling

4. Compensation partially succeeds

Fix:

  • compensation itself must be observable and retryable
  • escalate to manual recovery if needed

5. Workflow stuck in waiting state forever

Fix:

  • waiting-state SLA monitoring
  • explicit timeout timers
  • dead-workflow sweeper
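The sweeper's core check is simple: a workflow in a waiting status whose last update is older than its SLA. A minimal sketch of that predicate (per-type SLA values are a configuration assumption):

```java
import java.time.Duration;
import java.time.Instant;

public class StuckWorkflowDetector {

    // A workflow counts as stuck if it is WAITING and its last state change
    // is older than the waiting-state SLA configured for its type.
    public static boolean isStuck(String status, Instant updatedAt,
                                  Instant now, Duration waitingSla) {
        return "WAITING".equals(status)
            && Duration.between(updatedAt, now).compareTo(waitingSla) > 0;
    }
}
```

A periodic job scans `workflow_executions` with this rule, emits a stuck-workflow metric, and opens manual tasks for anything past SLA.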

Orchestrator vs Choreography

Workflows can also be built through event choreography:

service A publishes event
service B reacts
service C reacts later

This can be fine for loosely coupled flows.

But orchestration is usually better when:

  • business process has clear owner
  • retries and compensation need central visibility
  • humans need to inspect state
  • compliance or auditability matters

Event choreography can drift into "nobody actually knows what the process is doing."

Observability

Track:

  • workflow start rate
  • success / failure / cancellation rate
  • activity retry counts
  • timer backlog
  • signal processing lag
  • stuck workflow count
  • manual task queue depth
  • workflow latency percentiles by type

Useful dashboards:

  • workflow states by type
  • top failing activity names
  • oldest waiting workflows
  • compensation activity volume

What I Would Build First

Phase 1:

  • workflow execution store
  • activity task queue
  • retries with backoff
  • workflow history

Phase 2:

  • timers
  • signals
  • compensation support
  • workflow UI

Phase 3:

  • manual tasks
  • richer replay/debug tooling
  • versioned workflow definitions
  • tenant-aware quotas and fairness

This order matters. Teams often rush to fancy visual workflow builders before they have durable state, retries, and observability nailed down.

Production Checklist

  • workflow execution state durable
  • activities claimed with leases
  • side-effecting activities idempotent
  • retries classified by error type
  • timers stored durably
  • signals deduplicated
  • compensation explicit
  • stuck workflows detected
  • workflow history queryable
  • manual recovery path exists

Final Takeaway

A workflow orchestration platform is how a distributed system remembers what a business process has already done and what it should do next.

If you design it well, long-running flows become understandable, recoverable, and safe.

If you design it poorly, every partial failure turns into custom repair scripts and support escalations.

