System Design: Designing an Online Judge
An Online Judge system like LeetCode, Codeforces, or HackerRank allows users to submit code, which is then executed and verified against hidden test cases. The core technical challenge is Security (running untrusted code) and Scalability (handling bursts of submissions during contests).
1. Core Requirements
- Submission: Users can submit code in various languages (Java, Python, C++, etc.).
- Execution: Running the code against multiple test cases.
- Verification: Comparing the output with the expected result.
- Limits: Enforcing Time Limits (TLE) and Memory Limits (MLE).
- Security: Preventing the user code from accessing the host file system or network.
2. High-Level Architecture
- Web API: Receives code and queues the submission.
- Message Queue (Kafka): Buffers submissions to handle traffic spikes.
- Judge Workers: Pull submissions from the queue and execute them in a secure sandbox.
- Result Store (Postgres/Redis): Stores submission results and real-time status.
3. Code Isolation (The Security Heart)
Running untrusted user code on your server is dangerous. You must isolate it completely.
- Option A: Docker Containers. A good starting point, but "Root" in a container can sometimes escape to the host.
- Option B: gVisor (Google's choice). A user-space kernel that intercepts system calls, providing a much stronger security boundary than standard Docker.
- Option C: Firecracker (AWS Lambda style). MicroVMs that provide hardware-level isolation with the speed of containers.
4. Enforcing Limits (Resource Management)
- Time Limits: Use OS-level timers or
cgroupsto kill the process if it exceeds 1-2 seconds. - Memory Limits: Use
cgroupsto set a hard RAM limit (e.g., 256MB). If exceeded, the process is killed (MLE). - Fork Bombs: Limit the number of processes/threads the user code can create to prevent
while(true) { fork(); }attacks.
5. Handling Test Cases
For a single problem, there might be 100+ test cases.
- Input/Output Store: Store large test cases in Amazon S3 and cache them on the Judge Workers to avoid network latency for every submission.
- Sequential vs. Parallel: To save time, run different test cases for the same submission in parallel across multiple threads/containers.
6. Real-time Status (WebSockets)
Users want to see "Running Test Case 5/100..." in real-time.
- The Flow: The Judge Worker publishes progress to a Redis Pub/Sub channel. The Web API listens and pushes updates to the client via WebSockets.
Summary
The engineering of an Online Judge is about Defensive Programming. By using advanced sandboxing technologies like gVisor and robust resource management via cgroups, you can build a platform that allows the world to code on your servers with absolute safety and massive scale.
