Distributed Deadlock Detection
Distributed locking gets hard the moment one workflow needs multiple resources and lock acquisition order is not globally consistent.
At that point, "just add TTL" is not enough. TTL handles forgotten locks, not active circular waits between healthy nodes.
What is a distributed deadlock?
A deadlock occurs when transactions form a cycle of waiting:
- T1 holds A, waits for B
- T2 holds B, waits for C
- T3 holds C, waits for A
No participant can proceed, and without intervention they may wait forever.
Why TTL alone is insufficient
Teams often rely on lock expiration to eventually break deadlocks. That introduces major problems:
- high TTL means long stalls and poor tail latency
- low TTL increases false lock loss during long but valid operations
- retries can amplify contention and create lock thrashing
TTL is still useful as a safety net, but it should not be your primary deadlock strategy.
Wait-for graph model
Represent lock waits as a directed graph:
- node = transaction/session/work item
- edge Ti -> Tj means Ti waits for a resource held by Tj
Deadlock exists if and only if the graph contains a cycle.
This gives you a deterministic detection mechanism rather than timeout guessing.
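A minimal sketch of this model in Python, assuming a `WaitForGraph` class of our own invention: transactions are nodes, each edge points from a waiter to the holder it is blocked on, and a depth-first search finds cycles.

```python
# Minimal wait-for graph sketch. WaitForGraph, add_wait, and has_cycle
# are illustrative names, not a real lock-service API.

class WaitForGraph:
    def __init__(self):
        self.edges = {}  # waiter -> set of holders it waits on

    def add_wait(self, waiter, holder):
        self.edges.setdefault(waiter, set()).add(holder)

    def has_cycle(self):
        # DFS with three colors: 0 = unvisited, 1 = on stack, 2 = done.
        color = {}

        def visit(node):
            color[node] = 1
            for nxt in self.edges.get(node, ()):
                if color.get(nxt, 0) == 1:
                    return True          # back edge -> cycle
                if color.get(nxt, 0) == 0 and visit(nxt):
                    return True
            color[node] = 2
            return False

        return any(color.get(n, 0) == 0 and visit(n) for n in list(self.edges))

g = WaitForGraph()
g.add_wait("T1", "T2")
g.add_wait("T2", "T3")
print(g.has_cycle())  # False: a chain, not a cycle
g.add_wait("T3", "T1")
print(g.has_cycle())  # True: T1 -> T2 -> T3 -> T1
```

A chain of waits is normal contention; only closing the loop turns it into a deadlock, which is exactly what the color-1 check catches.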
Core architecture
A practical design includes:
- Lock service that tracks ownership and wait queues
- Dependency tracker that emits wait edges on blocked lock attempts
- Deadlock detector that runs cycle detection periodically or event-driven
- Resolution policy engine that chooses a victim transaction to abort
The detector can be centralized (simpler) or partitioned by lock namespace (more scalable).
Building the graph safely
When transaction T1 requests lock held by T2:
- add edge T1 -> T2
- if the lock is granted, remove the corresponding wait edge
- on abort, timeout, or completion, remove all edges incident to the transaction
Stale edges cause false positives, so cleanup correctness is critical.
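A hedged sketch of that edge lifecycle, with hypothetical hook names (`on_blocked`, `on_granted`, `on_finished`) standing in for whatever events your lock service emits:

```python
# Hypothetical edge lifecycle around the lock service: add an edge when a
# request blocks, remove it on grant, and drop ALL incident edges when a
# transaction finishes or aborts, so no stale edges linger.

class DependencyTracker:
    def __init__(self):
        self.waits = {}  # waiter -> set of holders

    def on_blocked(self, waiter, holder):
        self.waits.setdefault(waiter, set()).add(holder)

    def on_granted(self, waiter, holder):
        self.waits.get(waiter, set()).discard(holder)

    def on_finished(self, txn):
        # Outgoing edges: txn no longer waits on anyone.
        self.waits.pop(txn, None)
        # Incoming edges: nobody should still appear to wait on txn.
        for holders in self.waits.values():
            holders.discard(txn)

t = DependencyTracker()
t.on_blocked("T1", "T2")
t.on_blocked("T3", "T2")
t.on_finished("T2")   # T2 completes: both stale edges must disappear
print(t.waits)        # {'T1': set(), 'T3': set()}
```

Note that `on_finished` removes edges in both directions; forgetting the incoming direction is exactly the stale-edge bug that produces false-positive cycles.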
Cycle detection approaches
Two common methods:
- DFS from newly added edge source (fast for sparse graph)
- Tarjan/Kosaraju strongly connected components on a schedule
Event-driven incremental detection usually gives lower mean time to recovery.
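The event-driven variant reduces to a reachability question: adding edge `waiter -> holder` creates a cycle if and only if `waiter` is already reachable from `holder`. A sketch with illustrative names:

```python
# Incremental check sketch: before committing the new edge waiter -> holder,
# BFS from holder over the existing edges; reaching waiter means the new
# edge would close a cycle.

from collections import deque

def creates_cycle(edges, waiter, holder):
    """edges: dict mapping txn -> set of txns it waits on (new edge not yet added)."""
    seen, queue = {holder}, deque([holder])
    while queue:
        node = queue.popleft()
        if node == waiter:
            return True  # holder reaches waiter, so waiter -> holder closes a loop
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

edges = {"T1": {"T2"}, "T2": {"T3"}}
print(creates_cycle(edges, "T3", "T1"))  # True: would close T1 -> T2 -> T3 -> T1
print(creates_cycle(edges, "T0", "T1"))  # False: T0 is not reachable from T1
```

Because the check runs at edge-insertion time, a cycle is caught on the blocking lock attempt itself rather than at the next periodic scan, which is where the lower mean time to recovery comes from.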
Victim selection policy
Once a cycle is found, choose one transaction to abort.
Good heuristics:
- lowest priority workload first
- youngest transaction first (less wasted work)
- smallest rollback cost
- lowest user-facing impact
Bad heuristic:
- a random victim with no fairness tracking, which can starve specific workloads
Starvation prevention
A transaction repeatedly chosen as victim may never finish.
Mitigations:
- exponential backoff with jitter
- retry budget limits
- priority inheritance/escalation after repeated aborts
- bounded attempt count with surfaced error to caller
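The first two mitigations can be sketched together: full-jitter exponential backoff inside a bounded retry budget, surfacing the error once the budget is spent. `run_transaction` and `DeadlockVictim` are hypothetical placeholders for your own transaction runner and error type:

```python
# Retry policy sketch for a transaction aborted as a deadlock victim:
# full-jitter exponential backoff plus a hard attempt cap.

import random
import time

class DeadlockVictim(Exception):
    """Hypothetical error raised when this transaction was chosen as victim."""

MAX_ATTEMPTS = 5
BASE_DELAY = 0.05   # seconds
MAX_DELAY = 2.0

def retry_with_backoff(run_transaction):
    for attempt in range(MAX_ATTEMPTS):
        try:
            return run_transaction()
        except DeadlockVictim:
            if attempt == MAX_ATTEMPTS - 1:
                raise  # budget exhausted: surface the error to the caller
            # Full jitter: uniform in [0, capped exponential delay) to
            # decorrelate retries from other victims of the same cycle.
            delay = random.uniform(0, min(MAX_DELAY, BASE_DELAY * 2 ** attempt))
            time.sleep(delay)
```

The jitter matters as much as the exponent: if every victim of the same cycle retries after the same delay, they re-collide and re-form the cycle.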
Lock acquisition best practices
Deadlock detection is important, but prevention reduces detector load:
- global lock ordering where possible
- acquire all required locks in deterministic sequence
- hold locks for shortest possible duration
- avoid user/network calls while holding locks
Even with prevention, keep detector as fallback for complex dynamic resource sets.
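Global ordering is the cheapest of these to implement: sort the lock keys by any stable total order before acquiring, so every participant acquires in the same sequence and a cycle cannot form. A sketch, assuming a lock service with hypothetical `acquire`/`release` methods:

```python
# Ordered acquisition sketch: sort keys canonically, acquire in sequence,
# and release in reverse on failure. lock_service is any object exposing
# acquire(key) / release(key); the interface is an assumption.

def acquire_all(lock_service, keys):
    ordered = sorted(keys)   # any stable total order works, as long as it is global
    held = []
    try:
        for key in ordered:
            lock_service.acquire(key)
            held.append(key)
        return held
    except Exception:
        # Partial failure: release what we hold, in reverse order, then re-raise.
        for key in reversed(held):
            lock_service.release(key)
        raise

class InMemoryLocks:
    """Toy stand-in for a real lock service, for illustration only."""
    def __init__(self):
        self.order = []
    def acquire(self, key):
        self.order.append(key)
    def release(self, key):
        self.order.remove(key)

svc = InMemoryLocks()
acquire_all(svc, ["users:42", "accounts:7", "orders:9"])
print(svc.order)  # ['accounts:7', 'orders:9', 'users:42']
```

Lexicographic order on key strings is arbitrary but sufficient; what matters is that every caller uses the same order.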
Failure handling and client semantics
When a victim is aborted:
- release all held locks atomically
- return a retryable error code with the reason (DEADLOCK_VICTIM)
- include a backoff hint
Client libraries should treat this differently from generic timeout errors.
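A client-side sketch of that distinction, assuming a made-up error shape with `code` and `backoff_hint_ms` fields:

```python
# Client handling sketch: deadlock-victim aborts are retryable by design
# (the server already released our locks), while generic lock timeouts may
# just be contention, where blind retries amplify load.

def handle_lock_error(err):
    if err.get("code") == "DEADLOCK_VICTIM":
        return {"retry": True, "delay_ms": err.get("backoff_hint_ms", 100)}
    if err.get("code") == "LOCK_TIMEOUT":
        return {"retry": False, "delay_ms": None}
    raise ValueError(f"unexpected lock error: {err!r}")

print(handle_lock_error({"code": "DEADLOCK_VICTIM", "backoff_hint_ms": 250}))
# {'retry': True, 'delay_ms': 250}
```

Collapsing both cases into one generic "lock failed" error is what pushes clients toward the unbounded-retry behavior warned about below.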
Multi-region and partition concerns
If locking spans regions, detector visibility may lag due to replication delay.
Strategies:
- region-local locking for most workflows
- cross-region lock domains only for truly shared resources
- monotonic event sequencing where possible
During network partitions, prefer safety over liveness for critical invariants.
Observability signals
Track:
- lock wait time percentiles
- deadlock cycles detected per minute
- victim abort count by service/tenant
- average recovery time after cycle detection
- lock hold duration distribution
High deadlock rate often indicates poor lock granularity or inconsistent ordering.
Example deadlock resolution flow
- T1 waits on T2 -> edge added
- T2 waits on T3 -> edge added
- T3 waits on T1 -> cycle found
- Policy picks T3 as victim
- T3 aborted, locks released
- T2 proceeds, then T1
This deterministic break prevents long TTL-based stalls.
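The flow above can be simulated end to end in a few lines, using the same three transactions and an incremental reachability check (names and structure are illustrative):

```python
# End-to-end simulation of the resolution flow: add wait edges one at a
# time, detect the cycle on the closing edge, abort the victim, and verify
# the remaining waiters can proceed.

waits = {}

def add_edge(waiter, holder):
    waits.setdefault(waiter, set()).add(holder)
    # Walk from holder; getting back to waiter means this edge closed a cycle.
    seen, stack = set(), [holder]
    while stack:
        node = stack.pop()
        if node == waiter:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(waits.get(node, ()))
    return False

assert add_edge("T1", "T2") is False
assert add_edge("T2", "T3") is False
assert add_edge("T3", "T1") is True   # cycle found on the closing edge
# Policy picks T3: drop all its incident edges, releasing its locks.
waits.pop("T3", None)
for holders in waits.values():
    holders.discard("T3")
print(waits)  # {'T1': {'T2'}, 'T2': set()}
```

After the abort, T2 has no outstanding waits and can proceed, after which T1 follows, matching the step list above.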
Common mistakes in production
- relying only on lock TTLs
- forgetting to remove edges on cancellation
- lock service that cannot atomically release all victim locks
- no distinction between deadlock and contention metrics
- unbounded automatic retries causing traffic amplification
Practical recommendation
If your workload acquires multiple distributed locks per transaction, implement wait-for graph detection early. It is cheaper to add before scale than to retrofit after contention incidents begin affecting revenue-critical paths.
