System Design · Expert · Part 3 of 7 in Reliability Engineering Mastery

Distributed Locking: The Danger of Fencing Tokens

Why TTLs are not enough for distributed locks. A deep dive into Fencing Tokens and why your lock-protected service is still corrupting data.

Sachin Sarawgi · April 20, 2026 · 3 min read


The most common failure in distributed locking is assuming that the lock is 100% secure. A system pause (e.g., a 2-second Garbage Collection pause in your Java app) can make your process think it holds a lock long after it has expired.

1. The Pause Problem

  1. Process A acquires a lock (expires in 5s).
  2. Process A enters a 6-second GC pause.
  3. Lock expires. Process B acquires the lock.
  4. Process A wakes up, unaware it has lost the lock, and writes to the database. Corrupted data.
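To make the race concrete, here is a minimal Java sketch of the vulnerable TTL-only pattern. `LockService` and `Database` are hypothetical stand-ins for whatever lock service and datastore you use, not a real client API; the point is that nothing between acquire and write re-checks ownership.

```java
// Hypothetical stand-ins for a lock service (Redis, ZooKeeper, etcd, ...)
// and a datastore -- illustrative interfaces, not a real SDK.
interface LockService {
    boolean acquire(String lockName, long ttlMillis); // true if lock granted
    void release(String lockName);
}

interface Database {
    void write(String key, String value);
}

class VulnerableWorker {
    private final LockService locks;
    private final Database db;

    VulnerableWorker(LockService locks, Database db) {
        this.locks = locks;
        this.db = db;
    }

    void process(String key, String value) {
        if (!locks.acquire("lock:" + key, 5_000)) {
            return; // another process holds the lock
        }
        try {
            // DANGER: a GC pause longer than the 5s TTL can happen right
            // here. The lock expires, Process B acquires it, and this write
            // still goes through -- without a token the database cannot
            // tell the stale writer from the legitimate one.
            db.write(key, value);
        } finally {
            locks.release("lock:" + key);
        }
    }
}
```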

2. The Solution: Fencing Tokens

Every time a lock service grants a lock, it returns a monotonically increasing token (version number).

  • When you write to the DB, you include the fencing token in the write's condition, so the database accepts the write only if it has not already seen a newer token (see the sketch after this list).

  • If Process A (with token 12344) tries to write after Process B (with token 12345) has finished, the database will reject the write.
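A minimal JDBC sketch of such a conditional write, assuming a resources table with value and last_token columns (the table and column names are illustrative):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class FencedWriter {
    /**
     * Writes only if the fencing token is newer than the last token the
     * database has accepted for this row. Returns false for a stale write.
     */
    static boolean fencedWrite(Connection conn, long id, String value,
                               long token) throws SQLException {
        String sql = "UPDATE resources "
                   + "SET value = ?, last_token = ? "
                   + "WHERE id = ? AND last_token < ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, value);
            ps.setLong(2, token);
            ps.setLong(3, id);
            ps.setLong(4, token);
            // 0 rows updated means the token was stale -- exactly what
            // happens to Process A's 12344 once 12345 has been written.
            return ps.executeUpdate() == 1;
        }
    }
}
```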

3. Why locks alone are insufficient

Distributed lock services (Redis, Zookeeper, etcd) protect coordination, but they cannot fully prevent stale clients from acting after lease expiry.

Causes include:

  • GC pauses
  • process suspension
  • network delays/partitions
  • clock drift breaking lease-expiry assumptions

Fencing shifts protection to the resource itself, where correctness can be enforced deterministically.

4. Resource-side enforcement pattern

Fencing only works if downstream systems check tokens:

  • DB row includes last_token
  • write condition enforces incoming_token > last_token
  • accepted writes update last_token

This converts stale-writer risk into predictable rejected writes.
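As an in-process illustration of the same check, here is a hedged sketch of a resource that remembers the highest token it has accepted and atomically rejects anything older. In production the enforcement would live in the datastore itself, as in the SQL example above.

```java
// In-process illustration of resource-side fencing: the resource itself
// tracks the highest accepted token and rejects anything older.
class FencedResource {
    private long lastToken = 0;
    private String value;

    /** Atomically accepts the write only if the token is newer. */
    synchronized boolean write(long token, String newValue) {
        if (token <= lastToken) {
            return false; // predictable rejection of a stale writer
        }
        lastToken = token;
        value = newValue;
        return true;
    }

    synchronized String read() {
        return value;
    }
}
```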

5. Integrating with SQL and storage layers

Examples:

  • SQL UPDATE ... WHERE id=? AND ? > last_token
  • object store metadata version check
  • message processor compares token before commit

If the protected resource cannot enforce token ordering, lock safety is weaker than expected.
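For non-SQL storage the shape is the same: a server-side conditional write keyed on the last accepted token. This sketch uses a hypothetical object-store client; ObjectStore and putIf are illustrative names, not a real SDK:

```java
// Hypothetical object-store client -- what matters is that putIf is a
// conditional write enforced server-side, not by the caller.
interface ObjectStore {
    long lastAcceptedToken(String key);
    boolean putIf(String key, byte[] data,
                  long newToken, long expectedPriorToken);
}

class ObjectStoreFencing {
    static boolean fencedPut(ObjectStore store, String key,
                             byte[] data, long token) {
        long prior = store.lastAcceptedToken(key);
        if (token <= prior) {
            return false; // already superseded; skip the round trip
        }
        // The local check alone is racy; the store must re-check the
        // expected prior token atomically and fail the put if it changed.
        return store.putIf(key, data, token, prior);
    }
}
```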

6. Token source requirements

The token issuer must guarantee:

  • monotonic increase across lock grants for same resource
  • no token reuse
  • durability through leader failover/restart

Weak token generation invalidates fencing semantics.
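Consensus-backed stores usually give you these guarantees for free (ZooKeeper's zxid, etcd's revision). If you issue tokens yourself, one hedged option is a durable counter in the database, sketched here assuming PostgreSQL's UPDATE ... RETURNING and an illustrative fencing_tokens(resource, counter) table:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

class TokenIssuer {
    /**
     * Issues a durable, monotonic, per-resource token. Monotonicity comes
     * from the single atomic UPDATE; durability from the database itself.
     */
    static long nextToken(Connection conn, String resource) throws SQLException {
        String sql = "UPDATE fencing_tokens SET counter = counter + 1 "
                   + "WHERE resource = ? RETURNING counter";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, resource);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) {
                    throw new SQLException("unknown resource: " + resource);
                }
                return rs.getLong(1);
            }
        }
    }
}
```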

7. Operational considerations

Track:

  • stale-token write rejection rate
  • lock acquisition latency
  • lease expiry while processing
  • lock contention hotspots

A high stale-token rejection rate can indicate pauses, overloaded workers, or a bad lease configuration.
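A minimal sketch of wiring the first two metrics with Micrometer (the metric names are illustrative):

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

import java.util.function.Supplier;

class LockMetrics {
    private final Counter staleRejections;
    private final Timer acquireLatency;

    LockMetrics(MeterRegistry registry) {
        // Metric names are illustrative; match your own conventions.
        staleRejections = Counter.builder("lock.stale_token.rejections")
                .register(registry);
        acquireLatency = Timer.builder("lock.acquire.latency")
                .register(registry);
    }

    void recordStaleRejection() {
        staleRejections.increment();
    }

    // Times the lock-acquisition call and passes its result through.
    <T> T timedAcquire(Supplier<T> acquire) {
        return acquireLatency.record(acquire);
    }
}
```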

8. Common mistakes

  • using lock TTL without fencing
  • generating tokens at clients instead of lock authority
  • not persisting last_token atomically with write
  • treating a rejected stale write as a generic retryable error

Rejected stale writes are correctness signals and should be handled explicitly.
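Explicit handling might look like this sketch: a dedicated exception type (illustrative) so a stale rejection is surfaced and counted, then followed by re-acquiring the lock for a fresh token rather than retried with the old one.

```java
// Illustrative: a dedicated exception so stale writes are never lumped in
// with transient, retryable failures.
class StaleTokenException extends RuntimeException {
    StaleTokenException(long token) {
        super("write rejected: fencing token " + token + " is stale");
    }
}

class WriteHandler {
    void afterWrite(boolean accepted, long token) {
        if (!accepted) {
            // Do NOT retry with the same token -- the lock was lost.
            // Surface the event, bump the rejection metric, and if the
            // work still needs doing, re-acquire the lock for a fresh token.
            throw new StaleTokenException(token);
        }
    }
}
```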

9. Practical guidance

Use distributed locks for coordination and fencing tokens for correctness.
If you must choose one for data integrity, choose resource-enforced fencing.


Written by Sachin Sarawgi

Engineering Manager and backend engineer with 10+ years building distributed systems across fintech, enterprise SaaS, and startups. CodeSprintPro is where I write practical guides on system design, Java, Kafka, databases, AI infrastructure, and production reliability.
