System Design: Designing a Distributed Lock Manager (DLM)
In a microservices architecture, multiple instances of a service often need to access a shared resource (like an inventory item or a single-use coupon) at the same time. Language-level locks (like Java’s synchronized) only coordinate threads inside a single process, so they cannot protect a resource shared across multiple servers. We need a Distributed Lock.
1. Core Requirements
- Safety: Mutual exclusion — only one client can hold the lock at any time.
- Liveness (Deadlock-free): A lock must eventually be released, even if the client holding it crashes.
- Performance: Acquiring and releasing locks must have low latency.
- Fault Tolerance: The locking service itself must remain available even if some nodes fail.
2. Redis-based Locking (The Performance Choice)
Redis is a popular choice because it keeps data in memory, which makes acquiring and releasing a lock very fast.
- Implementation: Use SET resource_name my_random_value NX PX 30000. This atomically sets the key only if it doesn't exist (NX) with an expiry (PX) of 30 seconds (30,000 ms) to ensure deadlock freedom. The random value identifies the owner, so on release a client should delete the key only if it still holds that value.
- Redlock Algorithm: To make it fault-tolerant, Redis author Antirez proposed Redlock, where a client acquires locks from a majority of independent Redis masters.
- The Catch: Redlock is controversial. Critics argue that its safety depends on timing assumptions (bounded clock drift, process pauses, and network delays) that do not always hold in distributed environments.
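The single-instance command and the majority rule above can be sketched as follows. This is a minimal illustration, not a production client: the function names (acquire, release, acquire_majority) and the drift_ms allowance are hypothetical, and client is assumed to be a redis-py style object exposing set(name, value, nx=..., px=...) and eval(script, numkeys, ...).

```python
import secrets
import time

# Lua script: delete the key only if it still holds our random value.
# Redis executes a script atomically, so the check and the delete
# cannot interleave with another client's SET.
RELEASE_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0
end
"""


def acquire(client, resource, ttl_ms=30000):
    """Try once to take the lock; return the owner value, or None if held."""
    owner = secrets.token_hex(16)  # plays the role of my_random_value
    # Equivalent to: SET resource owner NX PX ttl_ms
    if client.set(resource, owner, nx=True, px=ttl_ms):
        return owner
    return None


def release(client, resource, owner):
    """Release only if we still own the lock; return True on success."""
    return client.eval(RELEASE_SCRIPT, 1, resource, owner) == 1


def acquire_majority(clients, resource, ttl_ms=30000, drift_ms=10):
    """Redlock-style attempt across independent masters (illustrative only).

    drift_ms is an assumed allowance for clock drift, not a recommended value.
    Returns (owner, remaining_validity_ms) on success, otherwise None.
    """
    owner = secrets.token_hex(16)  # same value is used on every master
    start = time.monotonic()
    granted = 0
    for c in clients:
        try:
            if c.set(resource, owner, nx=True, px=ttl_ms):
                granted += 1
        except Exception:
            pass  # an unreachable master simply counts as "not granted"
    elapsed_ms = (time.monotonic() - start) * 1000
    validity_ms = ttl_ms - elapsed_ms - drift_ms
    if granted >= len(clients) // 2 + 1 and validity_ms > 0:
        return owner, validity_ms
    # No majority, or the lock took too long to collect: undo partial grants.
    for c in clients:
        try:
            release(c, resource, owner)
        except Exception:
            pass
    return None
```

With a real server, client would typically be something like redis.Redis(); the caller keeps the returned owner value and passes it back to release. The sketch makes no attempt to retry or to extend the expiry, and it does not address the timing concerns raised by Redlock's critics.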
3. Zookeeper: The Consistency Choice
Zookeeper is designed for coordination and provides strong consistency.
- Implementation: A client creates an "ephemeral" node in the Zookeeper hierarchy; the client that succeeds in creating it holds the lock. If the client crashes or its session expires, Zookeeper automatically deletes the node, releasing the lock.
- Pros: Writes must be acknowledged by a quorum, so a partitioned minority cannot grant the lock. "Watches" (event notifications) let clients wait for the lock without polling.
- Cons: Higher latency than Redis; managing a Zookeeper cluster adds operational complexity.
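The ephemeral-node and watch behaviour described above can be modelled with a small in-memory stand-in. This is a toy model for illustration only, not the Zookeeper API: ToyCoordinator, NodeExists, try_lock, and all method names are hypothetical, and it ignores networking, quorum, and concurrency.

```python
class NodeExists(Exception):
    """Raised when a node already exists at the requested path."""


class ToyCoordinator:
    """In-memory stand-in for ephemeral nodes and one-shot delete watches."""

    def __init__(self):
        self._owner = {}    # path -> session id that created the node
        self._watches = {}  # path -> callbacks fired once when it is deleted

    def create_ephemeral(self, session_id, path):
        if path in self._owner:
            raise NodeExists(path)
        self._owner[path] = session_id

    def watch_delete(self, path, callback):
        self._watches.setdefault(path, []).append(callback)

    def delete(self, path):
        if self._owner.pop(path, None) is not None:
            # Watches are one-shot: remove them before notifying.
            for callback in self._watches.pop(path, []):
                callback(path)

    def expire_session(self, session_id):
        # Ephemeral nodes disappear together with the session that made them.
        owned = [p for p, s in self._owner.items() if s == session_id]
        for path in owned:
            self.delete(path)


def try_lock(coord, session_id, path, on_free):
    """Take the lock if free; otherwise register on_free and return False."""
    try:
        coord.create_ephemeral(session_id, path)
        return True
    except NodeExists:
        coord.watch_delete(path, on_free)
        return False
```

In this model, a waiting client is notified through on_free when the holder's session expires and can then call try_lock again, which mirrors how a crashed holder releases the lock without any explicit unlock call.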
4. The Fencing Token (The Safety Essential)
Regardless of the tool, a process might lose its lock (e.g., the lock expires during a long GC pause) but still think it owns it. This leads to Split-Brain writes: two clients modify the resource as if each held the lock, and one can silently overwrite the other.
- The Solution: Every time a lock is acquired, the lock manager returns a Fencing Token (a monotonically increasing version number). When the client writes to the shared resource, it must include this token. The resource remembers the highest token it has seen and rejects any write carrying a lower one, effectively "fencing out" the process that lost its lock.
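A minimal sketch of the fencing check, assuming an in-process counter stands in for the lock manager and a simple object stands in for the shared resource. The class and method names (LockManager, FencedStore, write) are hypothetical.

```python
import itertools


class LockManager:
    """Hands out a strictly increasing fencing token with every grant."""

    def __init__(self):
        self._counter = itertools.count(1)

    def acquire(self):
        return next(self._counter)


class FencedStore:
    """Shared resource that rejects writes carrying a stale token."""

    def __init__(self):
        self.highest_token = 0
        self.value = None

    def write(self, token, value):
        if token < self.highest_token:
            return False  # a newer lock holder has already written
        self.highest_token = token
        self.value = value
        return True


manager = LockManager()
store = FencedStore()

stale = manager.acquire()  # client 1 takes the lock, then stalls (e.g., GC pause)
fresh = manager.acquire()  # the lock expired, so client 2 is granted it

accepted_fresh = store.write(fresh, "written by client 2")
accepted_stale = store.write(stale, "written by client 1")  # client 1 wakes up late
```

Here the late write from client 1 is refused because its token is lower than the one the store has already seen. Writes with the current token are still accepted, so the legitimate holder can write more than once.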
Summary
- Redis: Use for high-performance, short-lived locks where minor risks are acceptable.
- Zookeeper: Use for mission-critical coordination where consistency is paramount.
- Postgres: Use (for example, via advisory locks) for simple, low-throughput systems where extra infrastructure is unnecessary.
