Distributed Locking: The Danger of Fencing Tokens
The most common failure in distributed locking is assuming the lock is absolutely safe. A process pause (e.g., a 2-second garbage-collection pause in your Java app) can leave your process believing it holds a lock long after the lease has expired.
1. The Pause Problem
- Process A acquires a lock (expires in 5s).
- Process A enters a 6-second GC pause.
- Lock expires. Process B acquires the lock.
- Process A wakes up, unaware it has lost the lock, and writes to the database: corrupt data.
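The race above can be reproduced in a few lines. This is a minimal sketch with a toy in-process lock and a simulated clock, not a real lock service:

```python
class SimulatedLock:
    """Toy TTL lock driven by a fake clock (illustrative, not a real service)."""
    def __init__(self):
        self.holder = None
        self.expires_at = 0.0

    def acquire(self, who, now, ttl):
        # Grant the lock if it is free or its lease has expired.
        if self.holder is None or now >= self.expires_at:
            self.holder, self.expires_at = who, now + ttl
            return True
        return False

db = {"balance": 100}
lock = SimulatedLock()

# t=0: Process A acquires the lock (5s TTL).
assert lock.acquire("A", now=0, ttl=5)

# t=6: A has been paused (GC) for 6s; the lease expired at t=5,
# so B legitimately acquires the lock and writes.
assert lock.acquire("B", now=6, ttl=5)
db["balance"] = 150

# t=7: A wakes up, still believing it holds the lock, and writes anyway.
db["balance"] = 90

assert db["balance"] == 90  # corrupt: B's update was silently lost
```

Note that A never had to misbehave: it checked the lock correctly before pausing, which is exactly why a check-then-write pattern cannot fix this.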
2. The Solution: Fencing Tokens
Every time a lock service grants a lock, it returns a monotonically increasing token (a version number).
When you write to the DB, you include the token in a conditional clause on the write.
If Process A (holding token 12344) tries to write after Process B (holding token 12345) has finished, the database rejects the write.
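A minimal sketch of the resource-side check, using the token values from the example above (the FencedStore class is illustrative, not a real library):

```python
class FencedStore:
    """Toy store that rejects any write carrying a token <= the last accepted one."""
    def __init__(self):
        self.value = None
        self.last_token = 0

    def write(self, value, token):
        if token <= self.last_token:
            return False  # stale writer: reject
        self.value, self.last_token = value, token
        return True

store = FencedStore()
assert store.write("B's data", token=12345)      # B finishes first: accepted
assert not store.write("A's data", token=12344)  # A wakes up late: rejected
assert store.value == "B's data"
```

The key property is that the comparison happens inside the resource, so it works no matter how long the stale client was paused.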
3. Why locks alone are insufficient
Distributed lock services (Redis, ZooKeeper, etcd) protect coordination, but they cannot fully prevent stale clients from acting after lease expiry.
Causes include:
- GC pauses
- process suspension
- network delays/partitions
- clock drift assumptions
Fencing shifts protection to the resource itself, where correctness can be enforced deterministically.
4. Resource-side enforcement pattern
Fencing only works if downstream systems check tokens:
- DB row stores last_token
- write condition enforces incoming_token > last_token
- accepted writes update last_token
This converts stale-writer risk into predictable rejected writes.
5. Integrating with SQL and storage layers
Examples:
- SQL: UPDATE ... WHERE id=? AND ? > last_token
- object store: metadata version check
- message processor: compares token before commit
If the protected resource cannot enforce token ordering, lock safety is weaker than expected.
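The SQL variant above can be sketched with SQLite; table and column names here are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, data TEXT, last_token INTEGER)"
)
conn.execute("INSERT INTO accounts VALUES (1, 'initial', 0)")

def fenced_update(conn, row_id, data, token):
    # The WHERE clause enforces incoming_token > last_token;
    # rowcount tells us whether the write was accepted.
    cur = conn.execute(
        "UPDATE accounts SET data = ?, last_token = ? "
        "WHERE id = ? AND ? > last_token",
        (data, token, row_id, token),
    )
    return cur.rowcount == 1

assert fenced_update(conn, 1, "written by B", 12345)      # accepted
assert not fenced_update(conn, 1, "written by A", 12344)  # stale: rejected
row = conn.execute("SELECT data FROM accounts WHERE id = 1").fetchone()
assert row[0] == "written by B"
```

Because the token update and the data write happen in the same statement, there is no window in which a stale writer can slip between the check and the write.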
6. Token source requirements
Token issuer must guarantee:
- monotonic increase across lock grants for same resource
- no token reuse
- durability through leader failover/restart
Weak token generation invalidates fencing semantics.
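As a sketch of these requirements, here is a file-backed counter that survives process restarts. This is a single-node toy: real issuers derive tokens from a replicated log (e.g., ZooKeeper's zxid or etcd's revision) so monotonicity also survives leader failover.

```python
import os
import tempfile

class TokenIssuer:
    """Toy issuer: a durable, strictly increasing counter (single-node only)."""
    def __init__(self, path):
        self.path = path

    def next_token(self):
        current = 0
        if os.path.exists(self.path):
            with open(self.path) as f:
                current = int(f.read())
        token = current + 1
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            f.write(str(token))
            f.flush()
            os.fsync(f.fileno())  # make the token durable BEFORE handing it out
        os.replace(tmp, self.path)  # atomic rename: no token reuse after a crash
        return token

path = os.path.join(tempfile.mkdtemp(), "lock_token")  # illustrative location
issuer = TokenIssuer(path)
a = issuer.next_token()
b = issuer.next_token()
assert b == a + 1  # strictly monotonic across grants
```

The fsync-before-grant ordering matters: if the issuer crashed after granting but before persisting, a restarted issuer could re-issue the same token, silently breaking the no-reuse guarantee.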
7. Operational considerations
Track:
- stale-token write rejection rate
- lock acquisition latency
- lease expiry while processing
- lock contention hotspots
High stale-token rejections can indicate pauses, overloaded workers, or bad lease configuration.
8. Common mistakes
- using lock TTL without fencing
- generating tokens at clients instead of lock authority
- not persisting last_token atomically with the write
- treating a rejected stale write as a generic retryable error
Rejected stale writes are correctness signals and should be handled explicitly.
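One way to make that handling explicit is a dedicated error type, sketched below with a toy store (names are illustrative):

```python
class Store:
    """Minimal token-checking store, as in the enforcement pattern above."""
    def __init__(self):
        self.value, self.last_token = None, 0

    def write(self, value, token):
        if token <= self.last_token:
            return False
        self.value, self.last_token = value, token
        return True

class StaleTokenError(Exception):
    """Signals lost lock ownership; callers must re-acquire, not blindly retry."""

def write_or_fail(store, value, token):
    if not store.write(value, token):
        # A stale-token rejection means the lease expired mid-work.
        # Surface it as its own error so callers re-acquire the lock and
        # re-read state, instead of retrying the same (stale) write.
        raise StaleTokenError(f"token {token} rejected by resource")

store = Store()
write_or_fail(store, "fresh", 2)
try:
    write_or_fail(store, "stale", 1)
except StaleTokenError:
    pass  # handled explicitly, not as a generic retryable error
assert store.value == "fresh"
```

Retrying a stale write with the same token can never succeed, so folding it into a generic retry loop only hides the lost-lock signal.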
9. Practical guidance
Use distributed locks for coordination and fencing tokens for correctness.
If you must choose one for data integrity, choose resource-enforced fencing.
