Many backend processes start life as a queue consumer and a couple of retries.
That works until the business process gets longer and stranger.
Now the flow needs:
- wait for payment confirmation
- call three downstream services
- retry one step but not another
- compensate if shipment creation fails
- pause for manual approval
- resume after a webhook arrives
- survive worker crashes without forgetting what already happened
At that point, you no longer have "a background job." You have a workflow.
This guide designs a production workflow orchestration platform.
Problem Statement
Build a platform that executes long-running, multi-step workflows durably and safely.
Examples:
- order fulfillment workflow
- loan approval workflow
- user onboarding workflow
- payout approval and disbursement workflow
- document verification workflow
- account recovery workflow
The platform should support:
- durable workflow state
- sequential and branching steps
- retries and backoff
- timers and delays
- manual approval pauses
- external signals and callbacks
- compensation for partially completed work
- visibility into workflow progress
This is not just a scheduler and not just a queue. It is a durable state machine for business processes.
Requirements
Functional requirements:
- start a workflow
- execute activities step by step
- persist workflow state
- retry failed steps
- wait on timers
- accept external signals or callbacks
- support workflow cancellation
- support manual intervention
- expose workflow history
- support replay and debugging
Non-functional requirements:
- survive process crashes
- avoid losing in-progress business state
- bound duplicate side effects
- scale to many concurrent workflows
- handle workflows that run for seconds, hours, or days
- provide clear observability
The most important constraint:
workflow progress must not depend on the memory of one worker process.
Workflow vs Job Scheduler
A scheduler answers:
When should this task run?
A workflow engine answers:
What state is this long-running process in, and what should happen next?
Schedulers are great for:
- run a job at 9 AM
- retry a failed task later
- trigger periodic cleanup
Workflow engines are for:
- reserve inventory
- charge payment
- wait for webhook
- create shipment
- send confirmation
- compensate if shipment fails after payment succeeded
The distinction matters because workflows need durable, queryable state transitions.
High-Level Architecture
API / Starter
|
v
Workflow Engine
|
+--> workflow state store
+--> timer queue
+--> activity task queue
+--> signal/event inbox
|
v
Workers / Activity Runners
|
+--> call external services
+--> report completion / failure
|
v
Downstream systems
Supporting systems:
- workflow UI
- audit/history store
- dead-letter / stuck-workflow tooling
Core Concepts
Separate these ideas clearly:
| Concept | Meaning |
|---|---|
| Workflow definition | The process model or code describing steps and transitions |
| Workflow execution | One running instance of that process |
| Activity | One side-effecting step, such as calling a payment API |
| Signal | External event that wakes or changes a workflow |
| Timer | Delayed wake-up for retries, waiting, or timeouts |
Example:
workflow definition: order_fulfillment
workflow execution: wf_123
activities:
- reserve_inventory
- charge_payment
- create_shipment
- send_confirmation
Example Workflow
Order fulfillment:
- reserve inventory
- charge payment
- create order record
- create shipment
- send confirmation
Failure path:
- if payment fails -> release inventory
- if shipment creation fails after payment -> mark manual review or compensate
This is why orchestration exists: each step may succeed independently, but the business process still needs a coherent outcome.
Data Model
Workflow executions
CREATE TABLE workflow_executions (
workflow_id UUID PRIMARY KEY,
tenant_id TEXT NOT NULL,
workflow_type TEXT NOT NULL,
business_key TEXT,
status TEXT NOT NULL, -- RUNNING, WAITING, SUCCEEDED, FAILED, CANCELLED
current_state TEXT NOT NULL,
input JSONB NOT NULL,
context JSONB NOT NULL DEFAULT '{}'::jsonb,
started_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
completed_at TIMESTAMPTZ,
UNIQUE (tenant_id, workflow_type, business_key)
);
Workflow history
CREATE TABLE workflow_history (
event_id BIGSERIAL PRIMARY KEY,
workflow_id UUID NOT NULL REFERENCES workflow_executions(workflow_id),
event_type TEXT NOT NULL, -- started, activity_scheduled, activity_completed, timer_fired, signaled
state_before TEXT,
state_after TEXT,
payload JSONB NOT NULL DEFAULT '{}'::jsonb,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_workflow_history_lookup
ON workflow_history (workflow_id, event_id);
Activity tasks
CREATE TABLE workflow_activity_tasks (
task_id UUID PRIMARY KEY,
workflow_id UUID NOT NULL REFERENCES workflow_executions(workflow_id),
activity_name TEXT NOT NULL,
status TEXT NOT NULL, -- PENDING, RUNNING, SUCCEEDED, FAILED, CANCELLED
attempt_count INT NOT NULL DEFAULT 0,
next_attempt_at TIMESTAMPTZ NOT NULL DEFAULT now(),
locked_by TEXT,
locked_until TIMESTAMPTZ,
input JSONB NOT NULL,
result JSONB,
last_error TEXT,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
Timers / wakeups
CREATE TABLE workflow_timers (
timer_id UUID PRIMARY KEY,
workflow_id UUID NOT NULL REFERENCES workflow_executions(workflow_id),
timer_type TEXT NOT NULL,
fire_at TIMESTAMPTZ NOT NULL,
status TEXT NOT NULL DEFAULT 'PENDING', -- PENDING, FIRED, CANCELLED
payload JSONB NOT NULL DEFAULT '{}'::jsonb,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_workflow_timers_due
ON workflow_timers (status, fire_at)
WHERE status = 'PENDING';
Signals / external events
CREATE TABLE workflow_signals (
signal_id UUID PRIMARY KEY,
workflow_id UUID NOT NULL REFERENCES workflow_executions(workflow_id),
signal_name TEXT NOT NULL,
payload JSONB NOT NULL,
processed BOOLEAN NOT NULL DEFAULT false,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_workflow_signals_unprocessed
ON workflow_signals (workflow_id, processed)
WHERE processed = false;
State Machine Design
A workflow is a durable state machine.
Example order workflow states:
STARTED
INVENTORY_RESERVED
PAYMENT_CHARGED
ORDER_CREATED
SHIPMENT_CREATED
COMPLETED
Failure branches:
PAYMENT_FAILED
SHIPMENT_FAILED
MANUAL_REVIEW
CANCELLED
Keep state names boring and explicit. If operators cannot read the workflow state and understand what happened, the system becomes an archaeology project.
Workflow Execution Model
There are two broad implementation styles:
Command/state-machine style
Store explicit state and let workers transition it.
Pros:
- simple mental model
Cons:
- more boilerplate
- branching logic can sprawl
Event-history replay style
Store workflow event history and reconstruct current state by replay.
Pros:
- strong determinism
- easier debugging and replay
Cons:
- more advanced engine design
This post stays implementation-agnostic enough to fit either style, but the reliability principles are the same.
Starting a Workflow
POST /v1/workflows
{
"workflowType": "order_fulfillment",
"tenantId": "merchant_42",
"businessKey": "order_123",
"input": {
"orderId": "order_123",
"paymentId": "pay_456",
"userId": "user_77"
}
}
Response:
{
"workflowId": "wf_123",
"status": "RUNNING",
"state": "STARTED"
}
businessKey is important. It gives you idempotency for "start workflow for this order."
Scheduling Activities
An activity is the side-effecting part of a workflow.
Examples:
- reserve inventory
- charge payment
- call KYC provider
- create shipment
The workflow engine should schedule an activity task, not perform the side effect inline in the control transaction.
public void scheduleActivity(UUID workflowId, String activityName, JsonNode input) {
activityTaskRepository.insert(
ActivityTask.builder()
.taskId(UUID.randomUUID())
.workflowId(workflowId)
.activityName(activityName)
.status("PENDING")
.input(input)
.build()
);
workflowHistoryRepository.append(workflowId, "activity_scheduled", activityName);
}
This separation helps with retries and observability.
Claiming Activity Work With Leases
Workers must claim activity tasks safely.
UPDATE workflow_activity_tasks
SET status = 'RUNNING',
locked_by = :worker_id,
locked_until = now() + interval '30 seconds',
attempt_count = attempt_count + 1,
updated_at = now()
WHERE task_id = (
SELECT task_id
FROM workflow_activity_tasks
WHERE status IN ('PENDING', 'FAILED')
AND next_attempt_at <= now()
AND (locked_until IS NULL OR locked_until < now())
ORDER BY next_attempt_at ASC
FOR UPDATE SKIP LOCKED
LIMIT 1
)
RETURNING *;
Lease-based claiming is how the system recovers if a worker crashes after taking work.
Idempotent Activities
Workflow infrastructure can generally only give you at-least-once execution.
That means activities must be idempotent or externally deduplicated.
Example:
public PaymentResult chargePayment(String paymentId, UUID workflowId, UUID taskId) {
String idempotencyKey = "workflow:" + workflowId + ":task:" + taskId;
return paymentGateway.charge(paymentId, idempotencyKey);
}
Without idempotency keys, retries turn infrastructure recovery into duplicate charges.
Retries and Backoff
Not all failures are equal.
Retry categories:
- transient: timeout, 502, temporary network issue
- terminal: validation error, insufficient funds, invalid address
- unknown: timed out after request submission, needs care
Retry policy example:
{
"maxAttempts": 5,
"initialDelayMs": 1000,
"backoffMultiplier": 2.0,
"maxDelayMs": 300000
}
The workflow engine should apply retry policy per activity, not globally for all steps.
Timers and Waiting
Workflows often need to wait:
- wait 15 minutes for payment webhook
- retry shipment creation in 5 minutes
- expire an approval request in 24 hours
Instead of sleeping in a process, create a durable timer:
public void waitUntil(UUID workflowId, Instant fireAt, String timerType) {
timerRepository.insert(
WorkflowTimer.builder()
.timerId(UUID.randomUUID())
.workflowId(workflowId)
.timerType(timerType)
.fireAt(fireAt)
.build()
);
workflowRepository.moveToWaitingState(workflowId, timerType);
}
This is a huge difference between robust orchestration and fragile in-memory logic.
Signals and External Events
Many workflows pause until an external event arrives.
Examples:
- payment webhook
- manual approval
- KYC vendor callback
- customer document upload
Signal API:
POST /v1/workflows/wf_123/signals
{
"signalName": "payment_confirmed",
"payload": {
"paymentId": "pay_456",
"status": "succeeded"
}
}
Signals should be durable and idempotent. A signal that arrives before the workflow is ready should still be stored and processed when appropriate.
Compensation
This is where orchestration becomes much more than task retries.
Example:
- inventory reserved
- payment charged
- shipment creation fails permanently
Now you may need:
- cancel charge or issue refund
- release inventory
- mark manual review
This is compensation, not rollback. Distributed systems do not magically undo side effects.
Example compensation flow:
reserve_inventory
charge_payment
create_shipment -> fails terminally
-> compensate:
refund_payment
release_inventory
-> mark workflow failed or manually resolved
Compensation rules should be explicit in workflow design, not improvised during incidents.
Manual Tasks and Human-in-the-Loop
Some workflows cannot be fully automated.
Examples:
- compliance approval
- fraud review
- exception handling after repeated failures
Manual task model:
CREATE TABLE workflow_manual_tasks (
manual_task_id UUID PRIMARY KEY,
workflow_id UUID NOT NULL REFERENCES workflow_executions(workflow_id),
task_type TEXT NOT NULL,
status TEXT NOT NULL, -- OPEN, COMPLETED, CANCELLED
assigned_to TEXT,
input JSONB NOT NULL,
result JSONB,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
completed_at TIMESTAMPTZ
);
The workflow should be able to pause in WAITING_FOR_MANUAL_ACTION and resume later.
Visibility and Querying
Operators need to answer:
- what state is this workflow in?
- which activity last failed?
- how long has it been waiting?
- what compensation already ran?
- did the external signal arrive?
A workflow platform without good history and introspection becomes a black box.
Useful query:
SELECT event_type, state_before, state_after, payload, created_at
FROM workflow_history
WHERE workflow_id = :workflow_id
ORDER BY event_id ASC;
This should be a UI, not just a SQL habit.
Failure Modes
1. Worker crashes mid-activity
Fix:
- lease expires
- task becomes reclaimable
- activity idempotency protects side effects
2. Timer lost after restart
Fix:
- timers stored durably
- timer scanner rebuilds due work from DB
3. Duplicate external signal
Fix:
- signal dedupe key
- state-aware signal handling
4. Compensation partially succeeds
Fix:
- compensation itself must be observable and retryable
- escalate to manual recovery if needed
5. Workflow stuck in waiting state forever
Fix:
- waiting-state SLA monitoring
- explicit timeout timers
- dead-workflow sweeper
Orchestrator vs Choreography
Workflows can also be built through event choreography:
service A publishes event
service B reacts
service C reacts later
This can be fine for loosely coupled flows.
But orchestration is usually better when:
- business process has clear owner
- retries and compensation need central visibility
- humans need to inspect state
- compliance or auditability matters
Event choreography can drift into "nobody actually knows what the process is doing."
Observability
Track:
- workflow start rate
- success / failure / cancellation rate
- activity retry counts
- timer backlog
- signal processing lag
- stuck workflow count
- manual task queue depth
- workflow latency percentiles by type
Useful dashboards:
- workflow states by type
- top failing activity names
- oldest waiting workflows
- compensation activity volume
What I Would Build First
Phase 1:
- workflow execution store
- activity task queue
- retries with backoff
- workflow history
Phase 2:
- timers
- signals
- compensation support
- workflow UI
Phase 3:
- manual tasks
- richer replay/debug tooling
- versioned workflow definitions
- tenant-aware quotas and fairness
This order matters. Teams often rush to fancy visual workflow builders before they have durable state, retries, and observability nailed down.
Production Checklist
- workflow execution state durable
- activities claimed with leases
- side-effecting activities idempotent
- retries classified by error type
- timers stored durably
- signals deduplicated
- compensation explicit
- stuck workflows detected
- workflow history queryable
- manual recovery path exists
Final Takeaway
A workflow orchestration platform is how a distributed system remembers what a business process has already done and what it should do next.
If you design it well, long-running flows become understandable, recoverable, and safe.
If you design it poorly, every partial failure turns into custom repair scripts and support escalations.
