Lesson 83 of 105 13 minFlagship

System Design: Building a Distributed Job Scheduler

Design a production job scheduler like Quartz, Airflow, or Cloud Scheduler: cron parsing, durable job storage, leases, retries, idempotency, delayed execution, worker pools, multi-tenancy, observability, and failure recovery.

Reading Mode

Hide the curriculum rail and keep the lesson centered for focused reading.

Key Takeaways

  • create one-time jobs
  • create recurring jobs with cron-like schedules
  • pause and resume jobs
Recommended Prerequisites
System Design Interview Framework

Premium outcome

From vague architecture answers to staff-level trade-off thinking.

Backend engineers preparing for senior, staff, and architecture rounds.

What you unlock

  • A reusable system design answer framework for ambiguous prompts
  • Clear language for consistency, scaling, and reliability trade-offs
  • Case-study depth across feeds, payments, storage, and messaging systems

Introduction: The Distributed Scheduling Problem

At small scales, running scheduled tasks is simple. Developers typically configure a single background thread using tools like Linux cron or in-process timers (e.g., Java’s ScheduledExecutorService). However, in large-scale microservice architectures, this monolithic approach fails.

If you run multiple instances of a service, each executing an in-process timer, they will run the same task simultaneously, leading to duplicate invoice runs, double-sent notifications, or database locking contention.

To prevent duplicate execution while guaranteeing reliability, we need a Distributed Job Scheduler. The scheduler must manage job registrations, calculate future execution triggers, dispatch work to a cluster of worker nodes, track attempts, handle worker crashes gracefully via leases, and prevent tenant starvation. It must provide at-least-once execution guarantees while supporting application-level idempotency to ensure safety.


Requirements and System Goals

Functional Requirements

  1. Flexible Schedule Types: Support both one-time delayed jobs (e.g., executing a task in 3 hours) and recurring cron-based schedules (e.g., executing at 0 9 * * * every day).
  2. Trigger Management (State Controls): Provide administrative APIs to register, pause, resume, cancel, and manually re-run jobs.
  3. Execution Tracking: Maintain complete history logs of execution attempts, run durations, error codes, and worker allocation targets.
  4. Misfire Policy Handling: Allow configuring misfire policies (e.g., fire immediately, skip missed runs, backfill all) when a job fails to run on time due to system outages.
  5. Dynamic Payload Binding: Support attaching arbitrary JSON payloads to job definitions, which are passed to workers at execution time.

Non-Functional Requirements

  1. At-Least-Once Delivery: Guarantee that every scheduled trigger is executed at least once by a worker node, even under network partitions or node failures.
  2. Fairness (Multi-Tenant Isolation): Prevent a single tenant from flooding the scheduling queue and starving other tenants’ workloads.
  3. High Scheduling Accuracy (Low Jitter): Jobs should execute within 1 second of their scheduled time.
  4. Durable, Lock-Safe Dispatch: Prevent multiple workers from claiming the same job attempt, avoiding split-brain runs without bottlenecking throughput.

API Interfaces and Service Contracts

We define the interface contracts for scheduling administration and worker dispatch.

1. Register Job Endpoint (REST)

Clients register new cron-based recurring jobs using this endpoint:

  • Endpoint: POST /api/v1/jobs
  • Request Payload (JSON):
{
  "tenant_id": "tenant_enterprise_77",
  "job_name": "daily-invoice-settlement",
  "job_type": "billing",
  "schedule_type": "CRON",
  "cron_expression": "0 9 * * *",
  "timezone": "America/New_York",
  "max_attempts": 3,
  "allow_overlap": false,
  "misfire_policy": "FIRE_ONCE",
  "payload": {
    "batch_size": 500,
    "payment_gateway": "stripe"
  }
}
  • Response Payload (JSON - 201 Created):
{
  "job_id": "job_90182a93-80f2",
  "status": "ACTIVE",
  "next_scheduled_run": "2026-06-07T09:00:00-04:00"
}

2. Worker Claim Task API (gRPC)

Workers pull tasks or coordinators dispatch tasks using structured protobufs:

syntax = "proto3";

package database.scheduler.v1;

service SchedulerDispatcher {
  // Claim a batch of due job triggers
  rpc ClaimTriggers(ClaimTriggersRequest) returns (ClaimTriggersResponse);
  
  // Update execution status of an attempt
  rpc UpdateAttemptStatus(UpdateAttemptRequest) returns (UpdateAttemptResponse);
}

message ClaimTriggersRequest {
  string worker_id = 1;
  int32 max_tasks = 2;
  int64 lease_duration_seconds = 3;
}

message TaskPayload {
  string trigger_id = 1;
  string job_id = 2;
  string tenant_id = 3;
  string job_type = 4;
  string payload_json = 5;
  int64 scheduled_time_epoch = 6;
}

message ClaimTriggersResponse {
  repeated TaskPayload tasks = 1;
}

message UpdateAttemptRequest {
  string attempt_id = 1;
  string trigger_id = 2;
  string status = 3;         // e.g., "SUCCEEDED", "FAILED"
  string error_message = 4;
}

message UpdateAttemptResponse {
  bool success = 1;
}

High-Level Design and Visualizations

The platform separates schedule generation (calculating future triggers) from execution (workers claiming and running tasks).

Distributed Job Scheduler Architecture

graph TD
    Client[Admin Client / Services] -->|1. Register Job REST| API[API Gateway / Admin Service]
    API -->|2. Persist Definitions| DB[(Metadata & Triggers DB: PostgreSQL)]
    
    subgraph Control Plane
        Coord[Scheduler Coordinator Nodes] -->|3. Lookahead Cron Parsing / Create Triggers| DB
    end

    subgraph Worker Pool
        W1[Worker Node 1] -->|4. ClaimTriggers FOR UPDATE SKIP LOCKED| DB
        W2[Worker Node 2] -->|4. ClaimTriggers FOR UPDATE SKIP LOCKED| DB
        W3[Worker Node 3] -->|4. ClaimTriggers FOR UPDATE SKIP LOCKED| DB
    end

    W1 -->|5. Execute Job Handler| TargetSvc[Downstream Microservice]

    style Client fill:#f8f9fa,stroke:#343a40
    style DB fill:#f8d7da,stroke:#dc3545
    style Coord fill:#fff3cd,stroke:#ffc107
    style W1 fill:#d4edda,stroke:#28a745
    style W2 fill:#d4edda,stroke:#28a745
    style W3 fill:#d4edda,stroke:#28a745

Database Row Claiming Cycle (SKIP LOCKED)

Multiple worker threads compete to claim pending triggers. We use pessimistic lock skipping to prevent thread blocking.

sequenceDiagram
    participant Worker1 as Worker Node A
    participant Worker2 as Worker Node B
    participant DB as SQL Trigger Table

    Worker1->>DB: Query due triggers (Limit 1, SKIP LOCKED)
    Worker2->>DB: Query due triggers (Limit 1, SKIP LOCKED)
    
    Note over DB: Row 101: Pending<br/>Row 102: Pending

    DB-->>Worker1: Returns Row 101 & Locks it
    DB-->>Worker2: Skips Row 101, Returns Row 102 & Locks it
    
    Worker1->>DB: UPDATE Row 101 (Status: RUNNING, Lease: +5m)
    Worker2->>DB: UPDATE Row 102 (Status: RUNNING, Lease: +5m)
    
    Note over Worker1,Worker2: Workers execute task handlers in parallel

Low-Level Design and Schema Strategies

To support retries, auditability, and time tracking, we normalize our scheduling tables into definitions, scheduled triggers, and attempt logs.

SQL Database Schema

CREATE TABLE scheduled_jobs (
    job_id UUID PRIMARY KEY,
    tenant_id VARCHAR(64) NOT NULL,
    job_name VARCHAR(128) NOT NULL,
    job_type VARCHAR(64) NOT NULL,
    schedule_type VARCHAR(16) NOT NULL CHECK (schedule_type IN ('ONCE', 'CRON')),
    cron_expression VARCHAR(64),
    timezone VARCHAR(64) NOT NULL DEFAULT 'UTC',
    run_at TIMESTAMP WITH TIME ZONE,
    payload_json TEXT NOT NULL DEFAULT '{}',
    job_status VARCHAR(16) NOT NULL DEFAULT 'ACTIVE' CHECK (job_status IN ('ACTIVE', 'PAUSED', 'CANCELLED')),
    max_attempts INT NOT NULL DEFAULT 5,
    allow_overlap BOOLEAN NOT NULL DEFAULT FALSE,
    misfire_policy VARCHAR(32) NOT NULL DEFAULT 'FIRE_ONCE',
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    UNIQUE (tenant_id, job_name)
);

CREATE TABLE job_triggers (
    trigger_id UUID PRIMARY KEY,
    job_id UUID NOT NULL REFERENCES scheduled_jobs(job_id) ON DELETE CASCADE,
    tenant_id VARCHAR(64) NOT NULL,
    scheduled_for TIMESTAMP WITH TIME ZONE NOT NULL,
    trigger_status VARCHAR(16) NOT NULL DEFAULT 'PENDING' CHECK (trigger_status IN ('PENDING', 'RUNNING', 'SUCCEEDED', 'FAILED', 'CANCELLED')),
    attempt_count INT NOT NULL DEFAULT 0,
    next_attempt_at TIMESTAMP WITH TIME ZONE NOT NULL,
    locked_by VARCHAR(128),
    locked_until TIMESTAMP WITH TIME ZONE,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    UNIQUE (job_id, scheduled_for)
);

CREATE TABLE job_attempts (
    attempt_id UUID PRIMARY KEY,
    trigger_id UUID NOT NULL REFERENCES job_triggers(trigger_id) ON DELETE CASCADE,
    job_id UUID NOT NULL,
    worker_id VARCHAR(128) NOT NULL,
    attempt_status VARCHAR(16) NOT NULL CHECK (attempt_status IN ('STARTED', 'SUCCEEDED', 'FAILED')),
    started_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    finished_at TIMESTAMP WITH TIME ZONE,
    error_message TEXT
);

-- Indices optimized for worker claiming query
CREATE INDEX idx_triggers_claim ON job_triggers (trigger_status, next_attempt_at)
WHERE trigger_status IN ('PENDING', 'FAILED');

Atomically Claiming Work via PostgreSQL SKIP LOCKED

Workers poll the database using the following lock-skipping transaction:

WITH target_trigger AS (
    SELECT trigger_id 
    FROM job_triggers
    WHERE trigger_status IN ('PENDING', 'FAILED')
      AND next_attempt_at <= NOW()
      -- Verify lease is expired or never set
      AND (locked_until IS NULL OR locked_until < NOW())
    ORDER BY next_attempt_at ASC
    LIMIT :batch_size
    FOR UPDATE SKIP LOCKED
)
UPDATE job_triggers
SET trigger_status = 'RUNNING',
    locked_by = :worker_id,
    locked_until = NOW() + INTERVAL '5 minutes',
    attempt_count = attempt_count + 1,
    updated_at = NOW()
WHERE trigger_id IN (SELECT trigger_id FROM target_trigger)
RETURNING *;

Scaling and Operational Challenges: Calculations & Formulations

As the scheduling system grows, database polling queries can saturate disk I/O. Let us calculate this throughput limitation.

Back-of-the-Envelope Dispatch Calculation

Let us define:

  • $N_{\text{jobs}}$: Number of active jobs dispatched per minute (e.g., 20,000 jobs/minute $\approx$ 333 jobs/second).
  • $N_{\text{workers}}$: Number of parallel worker threads polling the database (e.g., 100 workers).
  • $T_{\text{poll}}$: Polling interval for each worker (e.g., 1 second).

Step 1: Calculate Read QPS on the database

Each worker queries the database once per polling interval:

$$QPS_{\text{poll}} = \frac{N_{\text{workers}}}{T_{\text{poll}}} = \frac{100}{1 \text{ second}} = 100 \text{ read queries/second}$$

Even if no jobs are due, the database must execute 100 index scans per second. Under peak load, when 333 jobs are due per second, we must execute:

$$\text{DB Write IOPS} \ge 333 \text{ update writes/second} \times 2 \text{ (trigger status + attempt log)} = 666 \text{ IOPS}$$

For a standard cloud SSD block storage volume (typically capped at 3,000 IOPS), this polling and updating rate consumes over 22 percent of total disk I/O, slowing down other business tables.

Mitigation: Hybrid Push Dispatch (Redis Delay Queue)

To scale dispatching capacity to 100,000+ jobs per minute:

  1. Durable Definition: Persistent job metadata remains in SQL tables.
  2. Push Queueing: The scheduler coordinator parses cron expressions and pushes upcoming trigger metadata into a Redis Sorted Set (ZSET):
    • Key: scheduler:delay_queue
    • Member: trigger_id
    • Score: scheduled_for_timestamp
  3. Low-Latency Polling: Workers run a light ZRANGEBYSCORE query against Redis memory instead of SQL indexes:

$$\text{Redis Polling Latency} \approx 1.0 \text{ ms}$$

  1. Async Execution: Once a worker claims a trigger_id from Redis, it executes the job and writes attempt histories to the SQL database asynchronously. This reduces database read QPS from 100 to near zero, preserving disk resources for write audits.

Trade-offs and Architectural Alternatives

Choosing the correct distributed scheduling framework requires evaluating execution complexity and durability requirements.

Scheduling Framework Comparison

Dimension / Choice Pull-Based DB Polling (SQL) Push-Based Timer Queue (Redis ZSET) Distributed Workflow Orchestrator (Temporal)
Philosophy Workers query database for due tasks Central coordinator pushes tasks to workers State machine workflows with event histories
Execution Jitter Medium (Dependent on polling frequency) Low (Immediate Redis timer check) Very Low (Optimized queue dispatch)
Worker Coordination Low (SKIP LOCKED manages isolation) Medium (Requires Redis locking coordinator) High (Temporal Server manages state)
Long-Running Job Safety High (Through lease renewal updates) Poor (Requires out-of-band lock renewal) Excellent (Durable executions span months)
Operational Overhead Low (Uses existing database setup) Medium (Requires managing Redis cluster) High (Requires deploying Temporal cluster)

Key Trade-offs

  1. DB Polling vs. Distributed Workflows (Temporal):
    • DB Polling: Best for simple, independent tasks (e.g., sending daily emails, clearing transaction logs) where operations are stateless and can fail/retry without complex workflows.
    • Temporal: Mandatory for complex, multi-stage transactions (e.g., customer onboarding flows, payment collections) where tasks depend on each other and execution state must survive long outages.

Failure Modes and Fault Tolerance Strategies

Operating distributed schedulers exposes the platform to lease expirations and clock drifts.

1. Worker Node Crashes Mid-Job (Lease Expirations)

If Worker A claims a billing trigger, starts executing, but crashes or encounters a hardware failure 1 minute later, the trigger status remains stuck in RUNNING. No other worker attempts it, resulting in a silent failure.

  • Mitigation: Implement Lease Heartbeats. When Worker A claims a task, the coordinator sets a lease expiration: locked_until = NOW() + 5 minutes. For long-running tasks, Worker A’s execution thread runs a secondary daemon that updates locked_until by adding 2 minutes every 60 seconds. If Worker A crashes, the heartbeat stops. After 5 minutes, the lease expires. Another worker running the claim query detects the expired lease and retries the job trigger.

2. Timezone Shifts and Daylight Saving Time (DST)

If a job is scheduled for 0 2 * * * (2:00 AM) in a local timezone that undergoes a DST shift (where clocks jump forward or backward), the trigger might execute twice or be skipped entirely.

  • Mitigation: Always compute schedule offsets in Coordinated Universal Time (UTC). When registering a job, store the user's local timezone (e.g., America/New_York). The scheduler coordinator maps local cron expressions to absolute UTC timestamps, adjusting for local DST rules at generation time.

3. Multi-Tenant Starvation

If Tenant A registers 50,000 tasks due at 9:00 AM, and Tenant B registers 5 tasks due at 9:00 AM, Tenant A’s tasks will saturate the worker pool, delaying Tenant B's tasks.

  • Mitigation: Implement Partitioned Fair Queues. The claim query selects tasks grouped by Tenant ID, enforcing a maximum limit (e.g., LIMIT 5) per tenant per query iteration, preventing any single tenant from monopolizing worker capacity.


Verbal Script

Interviewer: "How would you design a distributed job scheduler capable of handling millions of scheduled tasks across a cluster of microservices?"

Candidate: "To design a distributed job scheduler, I would separate the architecture into three layers: an administrative API Service, a metadata and trigger SQL datastore, and a stateless worker pool.

To ensure reliability, we model execution using three tables: scheduled_jobs (definitions), job_triggers (individual runs), and job_attempts (execution history).

For a recurring cron job, a background coordinator parses the schedule and generates upcoming trigger rows 15 minutes in advance, writing them to the job_triggers table.

To distribute work safely without split-brain executions, worker nodes poll the database for due triggers using a pessimistic lock-skipping query: SELECT trigger_id FROM job_triggers WHERE trigger_status = 'PENDING' AND next_attempt_at <= NOW() LIMIT 10 FOR UPDATE SKIP LOCKED.

The worker that wins the rows updates their status to RUNNING and sets a lease lock expiration of 5 minutes (locked_until).

If a worker crashes mid-execution, the lease expires, and another worker claims the task automatically.

To prevent database disk saturation from continuous polling at scale, I would optimize this path by pushing upcoming triggers into a Redis Sorted Set delay queue.

Workers read from Redis memory instead of SQL indexes. This hybrid model provides sub-second scheduling accuracy, low latency, and absolute write audit trails."

Interviewer: "How do you prevent a single tenant from starving other tenants' scheduled jobs when they suddenly register a massive burst of tasks?"

Candidate: "We prevent tenant starvation by implementing Fair Share Scheduling at the claiming layer.

Instead of running a simple SKIP LOCKED query that pulls the oldest tasks globally, we partition task selection.

First, we maintain a configuration table that tracks active in-flight task limits per tenant (e.g., maximum 50 concurrent tasks).

Second, when workers query for work, they execute a two-step query.

We query for a list of active tenants who currently have available execution capacity in the cluster.

We then join this list with our due triggers table, fetching only the top 5 oldest tasks per eligible tenant.

This ensures that even if Tenant A submits 50,000 tasks, they are throttled by their concurrency quota.

Tenant B's 5 tasks are dispatched immediately in the next polling cycle because Tenant B has active capacity.

This guarantees multi-tenant isolation and resource fairness across the cluster."

Want to track your progress?

Sign in to save your progress, track completed lessons, and pick up where you left off.