System Design · Advanced · Case Study · Part 12 of 29 in Backend Systems Mastery

System Design: Designing a Distributed Task Scheduler

How to run millions of scheduled jobs reliably. A deep dive into Timing Wheels, Distributed Locking, and 'At-least-once' vs 'Exactly-once' execution guarantees.

Sachin Sarawgi · April 20, 2026 · 6 min read


Every backend engineer has written a cron job. It's simple: you put a script on a Linux server and tell the OS to run it every night at midnight.

But what happens when you are building a SaaS platform that needs to send 50 million scheduled push notifications per day, process millions of recurring billing invoices, and execute delayed Webhooks? A single Linux server with cron will crash, run out of memory, or represent a catastrophic Single Point of Failure (SPOF).

Designing a Distributed Task Scheduler (similar to AWS EventBridge, Apache Airflow, or Celery) is a foundational system design interview question. It tests your knowledge of Distributed Locking, Message Queues, and Fault Tolerance.


1. Capacity Estimation (The Math)

Let's assume we are building a scheduler for a massive e-commerce platform.

Assumptions:

  • Task Volume: 100 Million tasks scheduled per day.
  • QPS: 100,000,000 / 86,400 ≈ 1,157 QPS on average.
  • Peak QPS: Due to the "Thundering Herd" problem (everyone schedules tasks for the top of the hour, e.g., 12:00:00), peak load could easily hit 50,000 QPS.
  • Task Payload: Each task requires an arbitrary JSON payload (e.g., email data). Average 1KB.

Storage Estimates:

  • 100M * 1KB = 100GB per day.
  • If we retain executed task history for 30 days for auditing: 3 Terabytes.

Conclusion: 50,000 Peak QPS and 3TB of data means we need a horizontally scalable database and a message broker to decouple the scheduling logic from the execution logic.
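The back-of-envelope numbers above are easy to sanity-check in a few lines. This is just the arithmetic from the assumptions, nothing more:

```python
# Back-of-envelope capacity estimation using the assumptions above.
TASKS_PER_DAY = 100_000_000
SECONDS_PER_DAY = 86_400
PAYLOAD_BYTES = 1_024          # ~1 KB average task payload
RETENTION_DAYS = 30

avg_qps = TASKS_PER_DAY / SECONDS_PER_DAY
daily_storage_gb = TASKS_PER_DAY * PAYLOAD_BYTES / 1024**3
retained_tb = daily_storage_gb * RETENTION_DAYS / 1024

print(f"Average QPS:      {avg_qps:,.0f}")              # ~1,157
print(f"Daily storage:    {daily_storage_gb:,.0f} GB")  # ~95 GB (≈100 GB)
print(f"30-day retention: {retained_tb:.1f} TB")        # ~2.8 TB (≈3 TB)
```

Note the exact figures land slightly under the rounded "100GB / 3TB" headline numbers; for capacity planning, rounding up is the right instinct.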


2. High-Level Architecture

The golden rule of a task scheduler is decoupling the Scheduling (deciding when a task should run) from the Execution (actually running the code).

graph TD
    Client[Client App/Microservice] -->|1. Submit Task| API[API Gateway]
    
    API -->|2. Save Metadata| DB[(Task Metadata DB)]
    
    Poller[Scheduler / Poller Nodes] -->|3. Find due tasks| DB
    
    Poller -->|4. Publish to Queue| Kafka{Kafka / RabbitMQ}
    
    Kafka -->|5. Consume & Execute| Worker[Worker Nodes]
    Worker -->|6. Update Status| DB
    
    style DB fill:#047857,stroke:#fff,stroke-width:2px,color:#fff
    style Kafka fill:#b91c1c,stroke:#fff,stroke-width:2px,color:#fff
    style Poller fill:#1e40af,stroke:#fff,stroke-width:2px,color:#fff

3. The Deep Dive: The Scheduler / Poller

The hardest part of this system is the Poller. We need a cluster of servers that constantly check the database to see if any tasks are due to run right now.

The Naive Approach

Every 1 second, 10 Poller nodes execute SELECT * FROM tasks WHERE execute_at <= NOW() AND status = 'PENDING'.

  • The Problem: If 10 nodes execute this simultaneously, they will all pull the exact same 1,000 tasks. They will all push those 1,000 tasks to Kafka. Your workers will execute the tasks 10 times. You just billed a customer 10 times.

The Solution: Distributed Locking (Optimistic Concurrency Control)

How do we ensure only one Poller picks up a task?

We use the database itself as a lock via an UPDATE statement with a WHERE clause.

-- PostgreSQL does not support UPDATE ... LIMIT, so we claim a batch
-- through a locking subquery. SKIP LOCKED (PostgreSQL 9.5+, MySQL 8.0+)
-- lets concurrent pollers grab disjoint batches without blocking.
UPDATE tasks
SET status = 'QUEUED', locked_by = 'poller_node_42'
WHERE id IN (
    SELECT id FROM tasks
    WHERE execute_at <= NOW() AND status = 'PENDING'
    ORDER BY execute_at
    LIMIT 100
    FOR UPDATE SKIP LOCKED
);

Because relational databases like PostgreSQL use row-level locking, even if 10 Pollers run this exact query at the exact same millisecond, each row can be locked by only one of them.

  • With SKIP LOCKED, the Pollers don't even contend: each one skips rows already locked by a peer and claims the next unlocked batch, so Poller 1 gets tasks 1-100, Poller 2 gets 101-200, and so on.
  • Without SKIP LOCKED, the losing Pollers block behind the winner, re-evaluate the status = 'PENDING' predicate once it commits, and update 0 rows.

Either way, a task is transitioned to the QUEUED state exactly once.
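The claim-update pattern is easy to demonstrate with an in-memory SQLite database standing in for the task metadata DB. This sketch runs the ten "pollers" sequentially on one connection; in a real cluster, the database's row locks enforce the same exactly-once transition across concurrent connections. The table layout and `claim_batch` helper are illustrative, not the article's production schema:

```python
import sqlite3

# SQLite stands in for the Task Metadata DB. The claim is one atomic
# UPDATE guarded by status = 'PENDING', as in the SQL above.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE tasks (
    id INTEGER PRIMARY KEY,
    status TEXT NOT NULL DEFAULT 'PENDING',
    locked_by TEXT)""")
conn.executemany("INSERT INTO tasks (id) VALUES (?)",
                 [(i,) for i in range(1000)])

def claim_batch(poller_id: str, batch: int = 100) -> int:
    """Atomically transition up to `batch` PENDING tasks to QUEUED.
    Returns how many rows this poller actually won."""
    cur = conn.execute("""
        UPDATE tasks
        SET status = 'QUEUED', locked_by = ?
        WHERE id IN (SELECT id FROM tasks
                     WHERE status = 'PENDING' LIMIT ?)""",
        (poller_id, batch))
    conn.commit()
    return cur.rowcount

# Ten pollers compete over the same 1,000 due tasks.
claims = {f"poller_{n}": claim_batch(f"poller_{n}") for n in range(10)}
total = sum(claims.values())
print(total)  # 1000 — every task claimed exactly once, none twice
```

Each poller wins a disjoint batch of 100 because the status flip and the lock acquisition are the same statement; there is no window where two pollers can both see a row as PENDING and both claim it.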


4. The "Thundering Herd" and Hashed Timing Wheels

Let's revisit the 50,000-task spike at 12:00:00. If your Poller claims tasks with LIMIT 100, it takes 500 database round-trips to pick them all up, and tasks scheduled for 12:00:00 might not actually get queued until minutes later. That kind of drift defeats the point of a scheduler.

The Timing Wheel (Kafka/Netty approach)

Instead of hammering the database every second, the Poller queries the database every 5 minutes and pulls all tasks due within the next 5-minute window.

The Poller loads these tasks into a local, in-memory data structure called a Hashed Timing Wheel.

  • A Timing Wheel is essentially a circular array (a clock face).
  • If the array has 60 slots (one per second), a task due in 15 seconds is placed into the (current_index + 15) mod 60 bucket.
  • The Poller simply ticks an internal pointer forward one bucket per second. Any tasks sitting in the new bucket are instantly fired into Kafka.

This completely eliminates database latency from the critical path of execution.
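A minimal single-level wheel fits in a few lines. This sketch (the `TimingWheel` class and its method names are mine, not from Kafka or Netty) handles only tasks due within the wheel's span; the production versions use hierarchical or overflow wheels for longer delays:

```python
from collections import defaultdict

class TimingWheel:
    """Minimal single-level timing wheel: a circular array of buckets,
    one per second. Tasks due beyond the wheel's span would need a
    hierarchical / overflow wheel (omitted here)."""

    def __init__(self, slots: int = 60):
        self.slots = slots
        self.current = 0                  # index of the "now" bucket
        self.buckets = defaultdict(list)  # slot index -> list of tasks

    def schedule(self, task, delay_seconds: int):
        assert 0 < delay_seconds < self.slots, "beyond wheel span"
        slot = (self.current + delay_seconds) % self.slots
        self.buckets[slot].append(task)

    def tick(self):
        """Advance the pointer one second and return everything due.
        In the real system, each returned task is published to Kafka."""
        self.current = (self.current + 1) % self.slots
        return self.buckets.pop(self.current, [])

wheel = TimingWheel()
wheel.schedule("send_email_42", delay_seconds=15)
fired = [t for _ in range(15) for t in wheel.tick()]
print(fired)  # ['send_email_42'] — fired exactly on the 15th tick
```

Both `schedule` and `tick` are O(1) per operation, which is the whole appeal: no sorting, no heap, no per-second database query.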


5. Worker Execution and Fault Tolerance

Once the Poller drops a task into Kafka, the Worker Nodes take over.

At-Least-Once Delivery

In distributed systems, true "Exactly-Once" execution of side effects is practically impossible: the network can always fail between doing the work and acknowledging it. We must settle for At-Least-Once delivery and make the work itself safe to repeat.

  1. A Worker pulls a task from Kafka.
  2. The Worker executes the code (e.g., sends an email).
  3. The Worker commits its offset to Kafka and updates the Task DB to status = COMPLETED.

What if the Worker dies during step 2? Kafka itself has no "visibility timeout" (that is SQS terminology). Instead, because the offset is only committed after successful processing, a crashed Worker leaves its offset uncommitted: the consumer group detects the dead member via its session timeout, rebalances, and a different Worker resumes from the last committed offset and re-reads the task.

The Idempotency Requirement

Because the Worker might die after sending the email but before committing its offset, the second Worker will pick up the task and send the email again.

To prevent this, the actual execution logic must be Idempotent. Before sending the email, the worker should check a deduplication table (keyed by task_id) to verify the task hasn't already been executed.
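Here is the redelivery scenario in miniature. A plain Python set stands in for the deduplication table; in production this would be a unique-constrained DB insert or a Redis SETNX with a TTL, because a bare check-then-act like this one has a race under true concurrency. The function and variable names are illustrative:

```python
# Sketch of at-least-once redelivery meeting an idempotent worker.
emails_sent = []
executed_task_ids = set()   # stand-in for the deduplication table

def handle_task(task_id: str, payload: dict) -> str:
    if task_id in executed_task_ids:
        return "skipped_duplicate"     # already executed: do nothing
    emails_sent.append(payload["to"])  # the side effect (send email)
    executed_task_ids.add(task_id)     # record completion
    return "executed"

# The broker redelivers task 'abc' because the first worker died
# after sending the email but before committing its offset:
first = handle_task("abc", {"to": "user@example.com"})
second = handle_task("abc", {"to": "user@example.com"})
print(first, second, len(emails_sent))
# executed skipped_duplicate 1 — one email despite redelivery
```

The key property: re-running the handler with the same task_id is harmless, so at-least-once delivery plus idempotent execution behaves, from the user's point of view, like exactly-once.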


6. Which Database to Choose?

Because the Poller heavily relies on indexed range queries (WHERE execute_at <= NOW()) and atomic conditional batch updates (UPDATE ... WHERE status = 'PENDING'), NoSQL stores like Cassandra or DynamoDB are a poor fit. Time-sorted partition keys funnel every due task onto a hot partition, and neither system offers the row-level locking semantics (such as FOR UPDATE SKIP LOCKED) that make the batch-claim step safe and cheap.

You are far better served by a strongly consistent Relational Database.

  • For moderate scale: PostgreSQL or MySQL.
  • For massive scale: CockroachDB or Google Cloud Spanner (Distributed SQL).

Summary Checklist for the Interview

When designing a Distributed Task Scheduler, ensure you cover:

  1. Decoupling: Clearly separate the Poller (Scheduling) from the Worker (Execution) using a Message Broker (Kafka).
  2. Concurrency: Solve duplicate processing using Optimistic Concurrency Control (UPDATE WHERE status='PENDING').
  3. Thundering Herds: Implement a Hashed Timing Wheel to buffer database queries.
  4. Idempotency: Emphasize that workers must be idempotent to survive node crashes and network partitions.


