System Design Masterclass: Designing a Distributed Rate Limiter
In a distributed environment, a single malicious script, a misconfigured client, or a massive traffic spike can easily overwhelm your backend servers, bringing down your entire business.
A Rate Limiter is the shield that protects your infrastructure. It dictates how many requests a specific user, IP address, or API key is allowed to make within a specific time window.
Designing a rate limiter that operates globally across thousands of microservices, with sub-millisecond latency and near-perfect accuracy, is a classic test of a senior engineer's ability to handle concurrency and distributed state.
1. Capacity Estimation (The Math)
Let's assume we are building a rate limiter for a public API like Stripe or Twitter.
Assumptions:
- DAU (Daily Active Users): 10 Million.
- Traffic: Each user makes an average of 50 requests per day (~500 million requests/day, or roughly 5,800 requests/sec on average). Peak QPS is 20,000 requests/sec.
- Limits: Free tier allows 100 requests/minute. Premium tier allows 1,000 requests/minute.
Memory Storage Estimates: To track the rate limit, we need to store data in memory (Redis). For a simple counter approach, we need to store:
- `user_id` (8 bytes)
- `count` (4-byte integer)
- `timestamp` (8 bytes)
Total = 20 bytes per user.
For 10 Million users actively making requests in the same minute: 10M * 20 bytes = 200 Megabytes.
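As a sanity check, here is the same arithmetic in a few lines of Python (the figures simply mirror the assumptions above):

```python
# Back-of-envelope check of the capacity numbers above.
DAU = 10_000_000
REQUESTS_PER_USER_PER_DAY = 50
BYTES_PER_USER = 8 + 4 + 8            # user_id + count + timestamp = 20 bytes

avg_qps = DAU * REQUESTS_PER_USER_PER_DAY / 86_400
memory_mb = DAU * BYTES_PER_USER / 1_000_000

print(f"Average QPS: {avg_qps:,.0f}")   # Average QPS: 5,787
print(f"Memory: {memory_mb:,.0f} MB")   # Memory: 200 MB
```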
Conclusion: The memory footprint is tiny, and a single Redis node can easily hold this data. Even 20,000 QPS is within a single node's throughput, but we will still want a Redis Cluster for high availability and headroom during traffic spikes.
2. High-Level Architecture
The rate limiter should sit as close to the edge as possible to prevent malicious traffic from ever reaching your application servers. It is typically integrated directly into the API Gateway.
```mermaid
graph TD
    Client1[Client App] --> GW[API Gateway]
    Client2[Malicious Bot] --> GW
    subgraph Rate Limiting Subsystem
        GW -->|1. Check Limit| Redis[(Redis Cluster)]
        Redis -->|2. Allow or Deny| GW
    end
    GW -.->|3a. HTTP 429 Too Many Requests| Client2
    GW -->|3b. Forward Request| Services[Backend Microservices]
    style GW fill:#1e40af,stroke:#fff,stroke-width:2px,color:#fff
    style Redis fill:#b91c1c,stroke:#fff,stroke-width:2px,color:#fff
```
3. The Core Algorithms
You cannot design a rate limiter without selecting an underlying mathematical algorithm. Here are the three most common.
Algorithm A: Fixed Window Counter (The Flawed Approach)
You divide time into fixed windows (e.g., 12:00 to 12:01). You increment a counter for every request. If the counter hits 100, reject. When the minute rolls over, reset to 0.
- The Problem: Traffic spikes at the edges. A user could send 100 requests at `12:00:59` and another 100 requests at `12:01:01`. They just pushed 200 requests in 2 seconds, completely bypassing the intended 100 req/min limit and crashing your database.
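For reference, a minimal sketch of the fixed window counter using redis-py (key names are illustrative; note that `INCR` is atomic, so concurrent gateways cannot corrupt the count itself, only exploit the window boundary):

```python
import time

import redis

r = redis.Redis()  # assumes a reachable Redis instance

def allow_request_fixed_window(user_id: str, limit: int = 100) -> bool:
    """Fixed Window Counter: one counter per user per minute window."""
    window = int(time.time() // 60)          # e.g. everything in 12:00:00-12:00:59
    key = f"rate:{user_id}:{window}"
    count = r.incr(key)                      # atomic increment in Redis
    if count == 1:
        r.expire(key, 120)                   # garbage-collect old windows
    return count <= limit
```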
Algorithm B: Sliding Window Log (The Expensive Approach)
Instead of a fixed counter, you store the exact timestamp of every single request the user makes in a Redis Sorted Set. When a new request arrives, you delete all timestamps older than 1 minute, and count the remaining timestamps.
- The Problem: It uses a massive amount of memory. Storing 1,000 individual timestamps for a premium user takes `1000 * 8 bytes = 8 KB` per user, compared to 20 bytes for a counter.
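A sliding window log sketch against a Redis Sorted Set, again with redis-py (key names are illustrative; note that this read-then-write sequence also suffers from the race condition discussed in section 4):

```python
import time
import uuid

import redis

r = redis.Redis()

def allow_request_sliding_log(user_id: str, limit: int = 100) -> bool:
    """Sliding Window Log: one timestamp per request in a sorted set."""
    key = f"rate:log:{user_id}"
    now = time.time()
    r.zremrangebyscore(key, 0, now - 60)     # evict timestamps older than 60s
    if r.zcard(key) >= limit:                # count what remains in the window
        return False
    # Score = timestamp; members must be unique, hence the random suffix.
    r.zadd(key, {f"{now}:{uuid.uuid4().hex}": now})
    r.expire(key, 60)
    return True
```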
Algorithm C: Token Bucket (The Industry Standard)
This is the algorithm used by Amazon API Gateway and Stripe.
Imagine a bucket associated with a user. The bucket has a capacity of `C` tokens.
- A background process "refills" the bucket at a constant rate of `R` tokens per minute.
- When a request arrives, it takes 1 token out of the bucket.
- If the bucket is empty, the request is dropped.
Why it's the best: It requires very little memory (just the current token count and the last refill timestamp), and it naturally handles bursts of traffic smoothly.
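A single-process sketch of the bucket. One common variant, assumed here: instead of a literal background refill process, the bucket is refilled lazily on each access based on elapsed time, which is exactly why only the token count and last-refill timestamp need to be stored:

```python
import time
from dataclasses import dataclass

@dataclass
class TokenBucket:
    capacity: float      # C: maximum tokens the bucket holds
    refill_rate: float   # R: tokens added per second
    tokens: float
    last_refill: float

    def allow(self) -> bool:
        now = time.monotonic()
        # Lazy refill: top up in proportion to elapsed time, capped at capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1   # the request consumes one token
            return True
        return False           # bucket empty: drop the request

# 100 requests/minute => refill at 100/60 tokens/sec, burst capacity of 100.
bucket = TokenBucket(capacity=100, refill_rate=100 / 60,
                     tokens=100, last_refill=time.monotonic())
```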
4. The Deep Dive: Distributed Concurrency and Lua Scripts
Here is where junior engineers fail the interview. Let's assume you are using Redis to store the Token Bucket data.
Your API Gateway receives a request, reads the token count (`GET tokens`), sees `count = 5`, decrements it to 4, and saves it (`SET tokens 4`). This looks correct on a single machine.
In a distributed system, you have 50 API Gateway instances processing requests concurrently. If Gateway A and Gateway B both receive a request for the same user at the exact same millisecond, they both run `GET tokens`. They both see 5. They both calculate 4, and they both run `SET tokens 4`.
Two requests were processed, but the token count only went down by 1. The user has successfully bypassed the rate limiter.
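In redis-py terms, the lost update looks like this (the key name is illustrative):

```python
import redis

r = redis.Redis()

# BROKEN: this read-modify-write can interleave across gateways.
tokens = int(r.get("tokens:user42") or 0)  # Gateway A and B both read 5
if tokens > 0:
    r.set("tokens:user42", tokens - 1)     # both write 4 -> one decrement is lost
```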
The Solution: Redis Lua Scripting
To solve the race condition without using slow distributed locks, we use Redis Lua Scripts.
Redis executes Lua scripts atomically. Because Redis is single-threaded, no other command from any other client can run while a Lua script is executing.
You write a tiny Lua script that performs the `GET`, does the math to check the bucket, and performs the `SET`, all in one atomic step. The API Gateway sends this script to Redis, where it executes in microseconds, completely eliminating the race condition.
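A minimal sketch of such a script, embedded in a Python caller via redis-py's `eval`. The key names and parameters are illustrative, and passing `now` in from the client is a simplification; production scripts often call `redis.call('TIME')` instead to avoid gateway clock skew:

```python
import time

import redis

r = redis.Redis()

# Atomic token-bucket check-and-decrement: Redis runs the whole script
# without interleaving any other command.
TOKEN_BUCKET_LUA = """
local tokens_key = KEYS[1]
local ts_key     = KEYS[2]
local capacity   = tonumber(ARGV[1])
local rate       = tonumber(ARGV[2])   -- tokens per second
local now        = tonumber(ARGV[3])

local tokens = tonumber(redis.call('GET', tokens_key)) or capacity
local last   = tonumber(redis.call('GET', ts_key)) or now

-- Lazy refill, capped at capacity.
tokens = math.min(capacity, tokens + (now - last) * rate)

local allowed = 0
if tokens >= 1 then
  tokens = tokens - 1
  allowed = 1
end

redis.call('SET', tokens_key, tokens)
redis.call('SET', ts_key, now)
return allowed
"""

def allow_request(user_id: str, capacity: int = 100, rate: float = 100 / 60) -> bool:
    keys = [f"tb:{user_id}:tokens", f"tb:{user_id}:ts"]
    return r.eval(TOKEN_BUCKET_LUA, len(keys), *keys, capacity, rate, time.time()) == 1
```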
5. Scaling to Global Traffic
What happens when your API is deployed in US-East, Europe, and Asia? Do you have one global Redis cluster in the US?
If a user in Tokyo hits the Asia API Gateway, and the Gateway has to perform a Lua script execution against a Redis cluster in Virginia, the network latency will be ~200ms. Adding 200ms of latency to every single API request is unacceptable.
The Asynchronous Sync Pattern (Local + Global)
To solve global latency, you must deploy a Redis cluster in every region.
- The Asia Gateway checks the Asia Redis cluster (~1ms latency).
- But what if the user routes traffic to Asia, uses up their 100 requests, and then routes traffic to Europe to get another 100 requests?
The solution is Eventual Consistency. Each regional Redis cluster enforces the limit locally. Every few seconds, a background worker synchronizes the counters across all global regions using a message broker like Kafka.
Yes, a malicious user might briefly exceed their limit by a few percentage points if they rapidly switch continents before the sync occurs. But in system design, accepting roughly 5% inaccuracy in exchange for ~1ms latency in every region is the correct engineering trade-off.
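One possible shape of that background sync worker, sketched with the kafka-python client. The broker address, topic name, and message schema are all assumptions for illustration:

```python
import json
import time

from kafka import KafkaProducer  # kafka-python

REGION = "asia"
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_regional_usage(local_counts: dict) -> None:
    """Broadcast this region's per-user usage so the other regions can fold
    it into their local Redis counters. Runs every few seconds."""
    producer.send("rate-limit-sync", {
        "region": REGION,
        "ts": time.time(),
        "counts": local_counts,          # e.g. {"user42": 37}
    })
    producer.flush()
```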
Summary Checklist for the Interview
When designing a rate limiter, ensure you cover:
- Choose Token Bucket (or the memory-efficient Sliding Window Counter hybrid) rather than the Sliding Window Log to optimize memory.
- Defend Redis as the storage layer due to its in-memory speed.
- Explicitly solve the distributed race condition using Atomic Lua Scripts.
- Address multi-region latency by proposing local Redis clusters with asynchronous global synchronization.
