System Design · Expert · Part 3 of 7 in Reliability Engineering Mastery

Distributed Locking: The Danger of Fencing Tokens

Why TTLs are not enough for distributed locks. A deep dive into Fencing Tokens and why your lock-protected service is still corrupting data.

Sachin Sarawgi · April 20, 2026 · 3 min read


The most common failure in distributed locking is assuming that the lock is 100% secure. A system pause (e.g., a 2-second Garbage Collection pause in your Java app) can make your process think it holds a lock long after it has expired.

1. The Pause Problem

  1. Process A acquires a lock (expires in 5s).
  2. Process A enters a 6-second GC pause.
  3. Lock expires. Process B acquires the lock.
  4. Process A wakes up, unaware it has lost the lock, and writes to the database. Corrupted data.
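To make the race concrete, here is a minimal Java sketch of the vulnerable TTL-only pattern. `LockService` and `Database` are hypothetical stand-ins for whatever lock service and datastore you use, not a real client API; the point is that nothing between acquire and write re-checks ownership.

```java
// Hypothetical stand-ins for a lock service (Redis, ZooKeeper, etcd, ...)
// and a datastore -- illustrative interfaces, not a real SDK.
interface LockService {
    boolean acquire(String lockName, long ttlMillis); // true if lock granted
    void release(String lockName);
}

interface Database {
    void write(String key, String value);
}

class VulnerableWorker {
    private final LockService locks;
    private final Database db;

    VulnerableWorker(LockService locks, Database db) {
        this.locks = locks;
        this.db = db;
    }

    void process(String key, String value) {
        if (!locks.acquire("lock:" + key, 5_000)) {
            return; // another process holds the lock
        }
        try {
            // DANGER: a GC pause longer than the 5s TTL can happen right
            // here. The lock expires, Process B acquires it, and this write
            // still goes through -- without a token the database cannot
            // tell the stale writer from the legitimate one.
            db.write(key, value);
        } finally {
            locks.release("lock:" + key);
        }
    }
}
```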

2. The Solution: Fencing Tokens

Every time a lock service grants a lock, it returns a monotonically increasing token (version number).

  • When you write to the DB, you include the fencing token in the write's condition, so the database accepts the write only if it has not already seen a newer token (see the sketch after this list).

  • If Process A (with token 12344) tries to write after Process B (with token 12345) has finished, the database will reject the write.
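A minimal JDBC sketch of such a conditional write, assuming a resources table with value and last_token columns (the table and column names are illustrative):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class FencedWriter {
    /**
     * Writes only if the fencing token is newer than the last token the
     * database has accepted for this row. Returns false for a stale write.
     */
    static boolean fencedWrite(Connection conn, long id, String value,
                               long token) throws SQLException {
        String sql = "UPDATE resources "
                   + "SET value = ?, last_token = ? "
                   + "WHERE id = ? AND last_token < ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, value);
            ps.setLong(2, token);
            ps.setLong(3, id);
            ps.setLong(4, token);
            // 0 rows updated means the token was stale -- exactly what
            // happens to Process A's 12344 once 12345 has been written.
            return ps.executeUpdate() == 1;
        }
    }
}
```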

3. Why locks alone are insufficient

Distributed lock services (Redis, Zookeeper, etcd) protect coordination, but they cannot fully prevent stale clients from acting after lease expiry.

Causes include:

  • GC pauses
  • process suspension
  • network delays/partitions
  • clock drift breaking lease-expiry assumptions

Fencing shifts protection to the resource itself, where correctness can be enforced deterministically.

4. Resource-side enforcement pattern

Fencing only works if downstream systems check tokens:

  • DB row includes last_token
  • write condition enforces incoming_token > last_token
  • accepted writes update last_token

This converts stale-writer risk into predictable rejected writes.
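As an in-process illustration of the same check, here is a hedged sketch of a resource that remembers the highest token it has accepted and atomically rejects anything older. In production the enforcement would live in the datastore itself, as in the SQL example above.

```java
// In-process illustration of resource-side fencing: the resource itself
// tracks the highest accepted token and rejects anything older.
class FencedResource {
    private long lastToken = 0;
    private String value;

    /** Atomically accepts the write only if the token is newer. */
    synchronized boolean write(long token, String newValue) {
        if (token <= lastToken) {
            return false; // predictable rejection of a stale writer
        }
        lastToken = token;
        value = newValue;
        return true;
    }

    synchronized String read() {
        return value;
    }
}
```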

5. Integrating with SQL and storage layers

Examples:

  • SQL UPDATE ... WHERE id=? AND ? > last_token
  • object store metadata version check
  • message processor compares token before commit

If the protected resource cannot enforce token ordering, lock safety is weaker than expected.
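For non-SQL storage the shape is the same: a server-side conditional write keyed on the last accepted token. This sketch uses a hypothetical object-store client; ObjectStore and putIf are illustrative names, not a real SDK:

```java
// Hypothetical object-store client -- what matters is that putIf is a
// conditional write enforced server-side, not by the caller.
interface ObjectStore {
    long lastAcceptedToken(String key);
    boolean putIf(String key, byte[] data,
                  long newToken, long expectedPriorToken);
}

class ObjectStoreFencing {
    static boolean fencedPut(ObjectStore store, String key,
                             byte[] data, long token) {
        long prior = store.lastAcceptedToken(key);
        if (token <= prior) {
            return false; // already superseded; skip the round trip
        }
        // The local check alone is racy; the store must re-check the
        // expected prior token atomically and fail the put if it changed.
        return store.putIf(key, data, token, prior);
    }
}
```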

6. Token source requirements

The token issuer must guarantee:

  • monotonic increase across lock grants for same resource
  • no token reuse
  • durability through leader failover/restart

Weak token generation invalidates fencing semantics.
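Consensus-backed stores usually give you these guarantees for free (ZooKeeper's zxid, etcd's revision). If you issue tokens yourself, one hedged option is a durable counter in the database, sketched here assuming PostgreSQL's UPDATE ... RETURNING and an illustrative fencing_tokens(resource, counter) table:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

class TokenIssuer {
    /**
     * Issues a durable, monotonic, per-resource token. Monotonicity comes
     * from the single atomic UPDATE; durability from the database itself.
     */
    static long nextToken(Connection conn, String resource) throws SQLException {
        String sql = "UPDATE fencing_tokens SET counter = counter + 1 "
                   + "WHERE resource = ? RETURNING counter";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, resource);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) {
                    throw new SQLException("unknown resource: " + resource);
                }
                return rs.getLong(1);
            }
        }
    }
}
```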

7. Operational considerations

Track:

  • stale-token write rejection rate
  • lock acquisition latency
  • lease expiry while processing
  • lock contention hotspots

A high stale-token rejection rate can indicate pauses, overloaded workers, or a bad lease configuration.
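A minimal sketch of wiring the first two metrics with Micrometer (the metric names are illustrative):

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

import java.util.function.Supplier;

class LockMetrics {
    private final Counter staleRejections;
    private final Timer acquireLatency;

    LockMetrics(MeterRegistry registry) {
        // Metric names are illustrative; match your own conventions.
        staleRejections = Counter.builder("lock.stale_token.rejections")
                .register(registry);
        acquireLatency = Timer.builder("lock.acquire.latency")
                .register(registry);
    }

    void recordStaleRejection() {
        staleRejections.increment();
    }

    // Times the lock-acquisition call and passes its result through.
    <T> T timedAcquire(Supplier<T> acquire) {
        return acquireLatency.record(acquire);
    }
}
```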

8. Common mistakes

  • using lock TTL without fencing
  • generating tokens at clients instead of lock authority
  • not persisting last_token atomically with write
  • treating a rejected stale write as a generic retryable error

Rejected stale writes are correctness signals and should be handled explicitly.
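Explicit handling might look like this sketch: a dedicated exception type (illustrative) so a stale rejection is surfaced and counted, then followed by re-acquiring the lock for a fresh token rather than retried with the old one.

```java
// Illustrative: a dedicated exception so stale writes are never lumped in
// with transient, retryable failures.
class StaleTokenException extends RuntimeException {
    StaleTokenException(long token) {
        super("write rejected: fencing token " + token + " is stale");
    }
}

class WriteHandler {
    void afterWrite(boolean accepted, long token) {
        if (!accepted) {
            // Do NOT retry with the same token -- the lock was lost.
            // Surface the event, bump the rejection metric, and if the
            // work still needs doing, re-acquire the lock for a fresh token.
            throw new StaleTokenException(token);
        }
    }
}
```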

9. Practical guidance

Use distributed locks for coordination and fencing tokens for correctness.
If you must choose one for data integrity, choose resource-enforced fencing.


Written by Sachin Sarawgi

Engineering Manager and backend engineer with 10+ years building distributed systems across fintech, enterprise SaaS, and startups. CodeSprintPro is where I write practical guides on system design, Java, Kafka, databases, AI infrastructure, and production reliability.
