System Design Masterclass: Designing a Distributed Task Scheduler
Every backend engineer has written a cron job. It's simple: you put a script on a Linux server and tell the OS to run it every night at midnight.
But what happens when you are building a SaaS platform that needs to send 50 million scheduled push notifications per day, process millions of recurring billing invoices, and execute delayed webhooks? A single Linux server running cron will run out of memory or fall hopelessly behind, and either way it is a catastrophic Single Point of Failure (SPOF).
Designing a Distributed Task Scheduler (similar to AWS EventBridge, Apache Airflow, or Celery) is a foundational system design interview question. It tests your knowledge of Distributed Locking, Message Queues, and Fault Tolerance.
1. Capacity Estimation (The Math)
Let's assume we are building a scheduler for a massive e-commerce platform.
Assumptions:
- Task Volume: 100 Million tasks scheduled per day.
- Average QPS: 100,000,000 / 86,400 = ~1,150 QPS.
- Peak QPS: Due to the "Thundering Herd" problem (everyone schedules tasks at the top of the hour, e.g., 12:00:00), peak QPS could easily hit 50,000.
- Task Payload: Each task carries an arbitrary JSON payload (e.g., email data). Average 1KB.
Storage Estimates:
- 100M * 1KB = 100GB per day.
- If we retain executed task history for 30 days for auditing: ~3 Terabytes.
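As a sanity check, the arithmetic behind these estimates (plain Python, purely illustrative):

```python
tasks_per_day = 100_000_000
avg_qps = tasks_per_day / 86_400                 # ~1,157 -> "~1,150 QPS"
bytes_per_task = 1_000                           # ~1KB average payload
daily_gb = tasks_per_day * bytes_per_task / 1e9  # 100 GB per day
retained_tb = daily_gb * 30 / 1_000              # 3 TB at 30-day retention
```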
Conclusion: 50,000 peak QPS and 3TB of hot data mean we need a horizontally scalable database and a message broker to decouple the scheduling logic from the execution logic.
2. High-Level Architecture
The golden rule of a task scheduler is decoupling the Scheduling (deciding when a task should run) from the Execution (actually running the code).
```mermaid
graph TD
    Client[Client App/Microservice] -->|1. Submit Task| API[API Gateway]
    API -->|2. Save Metadata| DB[(Task Metadata DB)]
    Poller[Scheduler / Poller Nodes] -->|3. Find due tasks| DB
    Poller -->|4. Publish to Queue| Kafka{Kafka / RabbitMQ}
    Kafka -->|5. Consume & Execute| Worker[Worker Nodes]
    Worker -->|6. Update Status| DB
    style DB fill:#047857,stroke:#fff,stroke-width:2px,color:#fff
    style Kafka fill:#b91c1c,stroke:#fff,stroke-width:2px,color:#fff
    style Poller fill:#1e40af,stroke:#fff,stroke-width:2px,color:#fff
```
3. The Deep Dive: The Scheduler / Poller
The hardest part of this system is the Poller. We need a cluster of servers that constantly check the database to see if any tasks are due to run right now.
The Naive Approach
Every 1 second, 10 Poller nodes execute SELECT * FROM tasks WHERE execute_at <= NOW() AND status = 'PENDING'.
- The Problem: If 10 nodes execute this simultaneously, they will all pull the exact same 1,000 tasks. They will all push those 1,000 tasks to Kafka. Your workers will execute the tasks 10 times. You just billed a customer 10 times.
The Solution: Distributed Locking (Optimistic Concurrency Control)
How do we ensure only one Poller picks up a task?
We use the database itself as a lock via an UPDATE statement with a WHERE clause.
```sql
UPDATE tasks
SET status = 'QUEUED', locked_by = 'poller_node_42'
WHERE execute_at <= NOW() AND status = 'PENDING'
LIMIT 100;
```
Because relational databases (like MySQL) use row-level locking, even if 10 Pollers run this exact query at the exact same millisecond, the database engine will serialize the updates:
- Poller 1 locks the matching rows, updates 100 of them, and receives an affected-row count of 100.
- Pollers 2-10 block until Poller 1 commits, then re-evaluate the WHERE clause. The claimed rows no longer match status = 'PENDING', so (assuming only 100 tasks were due) they receive an affected-row count of 0.
Either way, each task is transitioned to the QUEUED state exactly once.
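Note that UPDATE ... LIMIT is MySQL syntax; PostgreSQL rejects it. The idiomatic Postgres pattern wraps the claim in a subquery with FOR UPDATE SKIP LOCKED, so concurrent pollers claim disjoint batches instead of blocking on each other. A minimal poller sketch, assuming a DB-API connection such as psycopg and an `id`/`payload` column layout (those names are assumptions, not part of the schema above):

```python
CLAIM_SQL = """
    UPDATE tasks
    SET status = 'QUEUED', locked_by = %(node)s
    WHERE id IN (
        -- SKIP LOCKED: rows another poller already holds are skipped,
        -- not waited on, so pollers grab disjoint batches of 100.
        SELECT id FROM tasks
        WHERE execute_at <= NOW() AND status = 'PENDING'
        ORDER BY execute_at
        LIMIT 100
        FOR UPDATE SKIP LOCKED
    )
    RETURNING id, payload;
"""

def claim_due_tasks(conn, node_id):
    """Atomically claim up to 100 due tasks and return the claimed rows."""
    with conn.cursor() as cur:
        cur.execute(CLAIM_SQL, {"node": node_id})
        rows = cur.fetchall()
    conn.commit()
    return rows
```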
4. The "Thundering Herd" and Hashed Timing Wheels
Let's revisit the 50,000 peak QPS at 12:00:00. If your Poller uses LIMIT 100, it will take 500 sequential database round-trips to drain the backlog. The tasks scheduled for 12:00:00 might not actually get queued until minutes later, blowing any reasonable latency target.
The Timing Wheel (Kafka/Netty approach)
Instead of hammering the database every second, the Poller queries the database every 5 minutes and pulls all tasks due within the next 5-minute window.
The Poller loads these tasks into a local, in-memory data structure called a Hashed Timing Wheel.
- A Timing Wheel is essentially a circular array (a clock face).
- If the array has 60 slots (one per second), a task due in 15 seconds is placed into the (current_index + 15) % 60 bucket.
- The Poller simply ticks an internal pointer forward one bucket per second. Any tasks sitting in that bucket are instantly fired into Kafka.
This completely eliminates database latency from the critical path of execution.
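A minimal in-memory sketch of the wheel (Python; illustrative, not Netty's HashedWheelTimer API):

```python
import threading

class HashedTimingWheel:
    """60 one-second slots; tasks due beyond one rotation carry a
    'remaining rounds' counter and are skipped until it reaches zero.
    Assumes delay_seconds >= 1."""

    def __init__(self, slots=60):
        self.slots = slots
        self.buckets = [[] for _ in range(slots)]  # each entry: [rounds, task]
        self.current = 0
        self.lock = threading.Lock()

    def schedule(self, task, delay_seconds):
        # A task due in 75s with 60 slots lands 15 slots ahead with rounds=1.
        rounds, offset = divmod(delay_seconds, self.slots)
        with self.lock:
            self.buckets[(self.current + offset) % self.slots].append([rounds, task])

    def tick(self, fire):
        # Called once per second. fire(task) publishes the task (e.g., to Kafka).
        with self.lock:
            self.current = (self.current + 1) % self.slots
            bucket = self.buckets[self.current]
            due = [task for rounds, task in bucket if rounds == 0]
            # Tasks with rounds left wait for the next rotation.
            self.buckets[self.current] = [[r - 1, t] for r, t in bucket if r > 0]
        for task in due:
            fire(task)
```

A real Poller would load each 5-minute window with one query, call schedule() for every row, and drive tick() from a 1-second timer.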
5. Worker Execution and Fault Tolerance
Once the Poller drops a task into Kafka, the Worker Nodes take over.
At-Least-Once Delivery
In distributed systems, true "Exactly-Once" execution is practically impossible without distributed transactions (which are far too slow at this scale). We must settle for At-Least-Once delivery, paired with idempotent workers.
1. A Worker pulls a task from Kafka.
2. The Worker executes the code (e.g., sends an email).
3. The Worker sends an ACK to Kafka and updates the Task DB to status = COMPLETED.
What if the Worker dies during step 2?
The broker redelivers it. Strictly speaking, a "visibility timeout" is SQS terminology; Kafka achieves the same effect through consumer groups: the Worker commits its offset only after finishing, and if it stops heartbeating, its partitions are reassigned after the session timeout and a different Worker re-reads the uncommitted message.
Because the Worker might die after sending the email but before committing the offset (our ACK), the second Worker will pick up the task and send the email again.
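To make the redelivery mechanics concrete, here is a minimal at-least-once consumer loop; a sketch assuming the confluent-kafka Python client, with the topic, group id, and execute_task helper being illustrative:

```python
from confluent_kafka import Consumer

# enable.auto.commit=False: we commit only AFTER the task succeeds, so a
# crash mid-task leaves the offset uncommitted and the message is
# redelivered to another worker on rebalance.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "task-workers",
    "enable.auto.commit": False,
})
consumer.subscribe(["due-tasks"])

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    execute_task(msg.value())                          # step 2: the side effect
    consumer.commit(message=msg, asynchronous=False)   # step 3: the "ACK"
```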
To prevent this, the actual execution logic must be Idempotent. Before sending the email, the worker should check a deduplication table (using the task_id) to verify it hasn't been executed recently.
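One way to build that deduplication check is a unique-keyed claims table; a minimal sketch, where the executions table and helper names are assumptions for illustration:

```python
def execute_idempotently(conn, task_id, side_effect):
    """Run side_effect at most once per task_id.

    Assumes a table: executions(task_id TEXT PRIMARY KEY, executed_at TIMESTAMPTZ).
    """
    with conn.cursor() as cur:
        # The primary key makes the claim atomic: ON CONFLICT DO NOTHING
        # means a second worker's insert affects 0 rows.
        cur.execute(
            "INSERT INTO executions (task_id, executed_at)"
            " VALUES (%s, NOW()) ON CONFLICT (task_id) DO NOTHING",
            (task_id,),
        )
        claimed = cur.rowcount == 1
    conn.commit()
    if claimed:
        side_effect()  # e.g., actually send the email
```

Note the design choice: claiming before the side effect trades duplicates for possible drops (a crash between claim and send loses the email), while claiming after does the reverse. For notifications, drops usually beat double-sends; for billing, you would pair the claim and the ledger write in one transaction.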
6. Which Database to Choose?
Because the Poller leans heavily on indexed range queries (WHERE execute_at <= NOW()) and atomic multi-row claims (UPDATE ... WHERE status = 'PENDING'), NoSQL stores like Cassandra or DynamoDB are poor choices here.
In those systems, a globally time-sorted index funnels every "due now" read and write onto a hot partition, and conditional updates are per-item only, so atomically claiming a batch of rows carries massive performance penalties.
You must use a strongly consistent Relational Database.
- For moderate scale: PostgreSQL or MySQL.
- For massive scale: CockroachDB or Google Cloud Spanner (Distributed SQL).
Summary Checklist for the Interview
When designing a Distributed Task Scheduler, ensure you cover:
- Decoupling: Clearly separate the Poller (Scheduling) from the Worker (Execution) using a Message Broker (Kafka).
- Concurrency: Solve duplicate processing using Optimistic Concurrency Control (UPDATE ... WHERE status = 'PENDING').
- Thundering Herds: Implement a Hashed Timing Wheel to buffer database queries.
- Idempotency: Stress to the interviewer that workers must be idempotent to survive node crashes and network partitions.
