
Distributed Deadlock Detection: Wait-For-Graphs

Why simple TTLs in distributed locks are dangerous and how to implement robust deadlock detection.

Sachin Sarawgi · April 20, 2026 · 4 min read


Distributed locking gets hard the moment one workflow needs multiple resources and lock acquisition order is not globally consistent.

At that point, "just add TTL" is not enough. TTL handles forgotten locks, not active circular waits between healthy nodes.

What is a distributed deadlock?

A deadlock occurs when transactions form a cycle of waiting:

  • T1 holds A, waits for B
  • T2 holds B, waits for C
  • T3 holds C, waits for A

No participant can proceed, and without intervention they may wait forever.

Why TTL alone is insufficient

Teams often rely on lock expiration to eventually break deadlocks. That introduces major problems:

  • high TTL means long stalls and poor tail latency
  • low TTL increases false lock loss during long but valid operations
  • retries can amplify contention and create lock thrashing

TTL is still useful as a safety net, but it should not be your primary deadlock strategy.

Wait-for graph model

Represent lock waits as a directed graph:

  • node = transaction/session/work item
  • edge Ti -> Tj means Ti waits for resource held by Tj

Deadlock exists if and only if the graph contains a cycle.

This gives you a deterministic detection mechanism rather than timeout guessing.
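As a concrete sketch (illustrative, not a production lock service), the wait-for graph and its cycle check can be as small as:

```python
from collections import defaultdict

class WaitForGraph:
    """Nodes are transaction ids; an edge Ti -> Tj means Ti waits on Tj."""

    def __init__(self):
        self.edges = defaultdict(set)  # waiter -> set of holders it waits on

    def add_wait(self, waiter, holder):
        self.edges[waiter].add(holder)

    def remove_transaction(self, txn):
        # Drop the node and every edge pointing at it.
        self.edges.pop(txn, None)
        for targets in self.edges.values():
            targets.discard(txn)

    def has_cycle(self):
        # Iterative DFS with white/grey/black coloring; a back edge to a
        # grey node means we returned to a node on the current path.
        WHITE, GREY, BLACK = 0, 1, 2
        color = defaultdict(int)
        for start in list(self.edges):
            if color[start] != WHITE:
                continue
            stack = [(start, iter(self.edges[start]))]
            color[start] = GREY
            while stack:
                node, neighbors = stack[-1]
                nxt = next(neighbors, None)
                if nxt is None:
                    color[node] = BLACK
                    stack.pop()
                elif color[nxt] == GREY:
                    return True  # back edge: cycle found
                elif color[nxt] == WHITE:
                    color[nxt] = GREY
                    stack.append((nxt, iter(self.edges[nxt])))
        return False
```

With the T1/T2/T3 example above, adding the third edge `T3 -> T1` is exactly what flips `has_cycle()` from False to True.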

Core architecture

A practical design includes:

  1. Lock service that tracks ownership and wait queues
  2. Dependency tracker that emits wait edges on blocked lock attempts
  3. Deadlock detector that runs cycle detection periodically or event-driven
  4. Resolution policy engine that chooses a victim transaction to abort

The detector can be centralized (simpler) or partitioned by lock namespace (more scalable).

Building the graph safely

When transaction T1 requests a lock held by T2:

  • add edge T1 -> T2
  • if the lock is granted, remove the corresponding wait edge
  • on abort, timeout, or completion, remove all edges incident to the transaction

Stale edges cause false positives, so cleanup correctness is critical.
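A minimal sketch of that bookkeeping inside a lock service might look like this (the `LockTable` name and shape are hypothetical; the point is that `finish` drops every incident edge, leaving nothing stale):

```python
class LockTable:
    """Toy lock table tracking ownership and wait edges for one namespace."""

    def __init__(self):
        self.owner = {}          # resource -> owning transaction
        self.wait_edges = set()  # (waiter, holder) pairs

    def request(self, txn, resource):
        holder = self.owner.get(resource)
        if holder is None:
            self.owner[resource] = txn
            return True                      # granted immediately, no edge
        self.wait_edges.add((txn, holder))   # blocked: txn now waits on holder
        return False

    def release(self, txn, resource):
        if self.owner.get(resource) == txn:
            del self.owner[resource]

    def finish(self, txn):
        # On commit, abort, or cancellation: drop every edge incident to
        # txn so no stale edge survives to cause a false-positive cycle.
        self.wait_edges = {(w, h) for (w, h) in self.wait_edges
                           if w != txn and h != txn}
```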

Cycle detection approaches

Two common methods:

  • DFS from the newly added edge's source (fast for sparse graphs)
  • Tarjan/Kosaraju strongly connected components on a schedule

Event-driven incremental detection usually gives lower mean time to recovery.
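The incremental variant can be checked before the edge is even added: appending `waiter -> holder` closes a cycle exactly when `waiter` is already reachable from `holder`. A sketch (edges represented as a plain adjacency dict):

```python
def would_deadlock(edges, waiter, holder):
    """Return True if adding edge waiter -> holder would close a cycle.

    edges: dict mapping each transaction to an iterable of transactions
    it currently waits on.
    """
    seen, stack = set(), [holder]
    while stack:
        node = stack.pop()
        if node == waiter:
            return True          # holder can already reach waiter: cycle
        if node in seen:
            continue
        seen.add(node)
        stack.extend(edges.get(node, ()))
    return False
```

Running this check on each blocked lock attempt is what gives the lower mean time to recovery: the cycle is caught at the instant it forms.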

Victim selection policy

Once a cycle is found, choose one transaction to abort.

Good heuristics:

  • lowest priority workload first
  • youngest transaction first (less wasted work)
  • smallest rollback cost
  • lowest user-facing impact

Bad heuristic:

  • random victim selection without fairness tracking, which can starve specific workloads
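The heuristics above compose naturally as a lexicographic sort key. A sketch, assuming per-transaction metadata with illustrative field names (`priority`, `start_time`, `rollback_cost`):

```python
def pick_victim(cycle, meta):
    """Choose one transaction in the cycle to abort.

    meta[txn] holds: priority (higher = more important),
    start_time (larger = younger), rollback_cost (arbitrary units).
    """
    return min(
        cycle,
        key=lambda t: (meta[t]["priority"],        # lowest priority first
                       -meta[t]["start_time"],     # then youngest first
                       meta[t]["rollback_cost"]),  # then cheapest rollback
    )
```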

Starvation prevention

A transaction repeatedly chosen as victim may never finish.

Mitigations:

  • exponential backoff with jitter
  • retry budget limits
  • priority inheritance/escalation after repeated aborts
  • bounded attempt count with surfaced error to caller
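Two of these mitigations, jittered backoff and a retry budget, fit in a few lines. A sketch (the `DeadlockVictim` exception and parameter defaults are illustrative):

```python
import random
import time

class DeadlockVictim(Exception):
    """Retryable error raised when a transaction is aborted as a victim."""

def backoff_delay(attempt, base=0.05, cap=2.0):
    # Full jitter: uniform in [0, min(cap, base * 2**attempt)], so
    # retrying victims do not re-collide in lockstep.
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))

def run_with_budget(op, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return op()
        except DeadlockVictim:
            if attempt == max_attempts - 1:
                raise  # budget exhausted: surface the error to the caller
            time.sleep(backoff_delay(attempt))
```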

Lock acquisition best practices

Deadlock detection is important, but prevention reduces detector load:

  • global lock ordering where possible
  • acquire all required locks in deterministic sequence
  • hold locks for shortest possible duration
  • avoid user/network calls while holding locks

Even with prevention, keep the detector as a fallback for complex, dynamic resource sets.
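Global ordering is worth seeing concretely: if every transaction sorts its resources by the same key before acquiring, no circular wait can form. A sketch, where `try_lock` stands in for whatever non-blocking lock primitive your service exposes:

```python
def acquire_in_order(try_lock, txn, resources):
    """Acquire locks in a globally deterministic order (sorted by name).

    try_lock(txn, resource) -> bool is any non-blocking acquire.
    Returns (granted, blocked_on); blocked_on is None on full success.
    """
    granted = []
    for r in sorted(resources):
        if not try_lock(txn, r):
            return granted, r   # blocked on r; caller waits or backs off
        granted.append(r)
    return granted, None
```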

Failure handling and client semantics

When a victim is aborted:

  • release all held locks atomically
  • return a retryable error code with the reason (DEADLOCK_VICTIM)
  • include backoff hint

Client libraries should treat this differently from generic timeout errors.
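A minimal sketch of that distinction on the client side (the `LockError` shape and code strings are assumptions, not a real library's API):

```python
from dataclasses import dataclass

@dataclass
class LockError:
    code: str             # e.g. "DEADLOCK_VICTIM" vs "LOCK_TIMEOUT"
    retryable: bool
    backoff_hint_ms: int  # server-suggested delay before retry

def classify(err):
    # Branch on the code: a deadlock victim should retry after the
    # hinted backoff, while a plain timeout may mean genuine contention.
    if err.code == "DEADLOCK_VICTIM":
        return "retry_after_backoff"
    if err.code == "LOCK_TIMEOUT":
        return "retry_or_escalate"
    return "fail"
```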

Multi-region and partition concerns

If locking spans regions, detector visibility may lag due to replication delay.

Strategies:

  • region-local locking for most workflows
  • cross-region lock domains only for truly shared resources
  • monotonic event sequencing where possible

During network partitions, prefer safety over liveness for critical invariants.

Observability signals

Track:

  • lock wait time percentiles
  • deadlock cycles detected per minute
  • victim abort count by service/tenant
  • average recovery time after cycle detection
  • lock hold duration distribution

High deadlock rate often indicates poor lock granularity or inconsistent ordering.

Example deadlock resolution flow

  1. T1 waits on T2 -> edge added
  2. T2 waits on T3 -> edge added
  3. T3 waits on T1 -> cycle found
  4. Policy picks T3 as victim
  5. T3 aborted, locks released
  6. T2 proceeds, then T1

This deterministic break prevents long TTL-based stalls.
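The six steps above can be simulated end to end in a toy model where each blocked transaction waits on exactly one holder, so a cycle is just a closed chain (illustrative only):

```python
def detect_cycle(edges, waiter):
    """Follow the wait chain from `waiter`; revisiting it means a cycle.

    edges: dict mapping each blocked transaction to the single
    transaction it waits on.
    """
    seen, node = [waiter], edges.get(waiter)
    while node is not None:
        if node == waiter:
            return seen                 # closed chain: deadlock
        seen.append(node)
        node = edges.get(node)
    return None

def resolve(edges, cycle):
    # Abort the transaction that closed the cycle, then drop both its
    # own wait edge and every edge pointing at it, so waiters unblock.
    victim = cycle[0]
    edges.pop(victim, None)
    for w, h in list(edges.items()):
        if h == victim:
            del edges[w]
    return victim
```

After resolution only `T1 -> T2` remains, matching step 6: T2 proceeds, then T1.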

Common mistakes in production

  • relying only on lock TTLs
  • forgetting to remove edges on cancellation
  • lock service that cannot atomically release all victim locks
  • no distinction between deadlock and contention metrics
  • unbounded automatic retries causing traffic amplification

Practical recommendation

If your workload acquires multiple distributed locks per transaction, implement wait-for graph detection early. It is cheaper to add before scale than to retrofit after contention incidents begin affecting revenue-critical paths.


Written by

Sachin Sarawgi

Engineering Manager and backend engineer with 10+ years building distributed systems across fintech, enterprise SaaS, and startups. CodeSprintPro is where I write practical guides on system design, Java, Kafka, databases, AI infrastructure, and production reliability.
