System Design · Expert · Part 1 of 7 in Reliability Engineering Mastery

Distributed Snapshots: Chandy-Lamport Algorithm

How to capture the state of an entire distributed system without stopping traffic. Deep dive into Chandy-Lamport.

Sachin Sarawgi · April 20, 2026 · 2 min read


How do you take a "global photo" of a system where every node has a different time and no central master?

1. The Problem

You need to save the state of a system for debugging or checkpointing. If you stop the system, you break availability.

2. Chandy-Lamport Algorithm

The algorithm assumes reliable FIFO channels between nodes.

  1. An initiator starts a snapshot by recording its own local state, sending a "Marker" message on every outgoing channel, and starting to record messages arriving on every incoming channel.
  2. When a node receives its first Marker (say, on channel c), it records its local state, records channel c's state as empty, forwards the Marker on all outgoing channels, and starts recording messages on its other incoming channels.
  3. When a node receives a subsequent Marker on a channel it is recording, it stops recording that channel: the channel's state is exactly the messages that arrived on it between the node's own snapshot and that Marker.
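The rules above can be sketched as a minimal single-threaded simulation, assuming reliable FIFO channels (modeled as deques). Names like `Process` and `MARKER` are illustrative, not from any framework:

```python
from collections import deque

MARKER = object()  # marker message, distinguishable from app messages

class Process:
    def __init__(self, name, state, incoming):
        self.name = name
        self.state = state            # local application state (a balance)
        self.snapshot = None          # recorded local state, once taken
        self.incoming = incoming      # names of incoming channels
        self.channel_state = {}       # channel name -> recorded in-flight msgs
        self.recording = set()        # channels still being recorded

    def _take_snapshot(self, outgoing):
        self.snapshot = self.state
        for ch in outgoing:           # marker goes out behind earlier sends
            ch.append(MARKER)

    def start_snapshot(self, outgoing):
        self._take_snapshot(outgoing)
        self.recording = set(self.incoming)
        for n in self.incoming:
            self.channel_state[n] = []

    def receive(self, ch_name, msg, outgoing):
        if msg is MARKER:
            if self.snapshot is None:            # first marker seen
                self._take_snapshot(outgoing)
                self.channel_state[ch_name] = [] # marker's channel is empty
                self.recording = set(self.incoming) - {ch_name}
                for n in self.recording:
                    self.channel_state[n] = []
            else:                                # later marker: stop recording
                self.recording.discard(ch_name)
        else:
            if ch_name in self.recording:        # message was in flight
                self.channel_state[ch_name].append(msg)
            self.state += msg                    # normal processing continues

# Two processes moving "money"; the snapshot should total 30.
ab, ba = deque(), deque()
A = Process("A", state=10, incoming=["ba"])
B = Process("B", state=20, incoming=["ab"])

ab.append(5); A.state -= 5        # A sends 5 to B; it is now in flight
A.start_snapshot(outgoing=[ab])   # A records 5; marker queued behind the 5

B.receive("ab", ab.popleft(), outgoing=[ba])  # B applies the 5 (state 25)
B.receive("ab", ab.popleft(), outgoing=[ba])  # marker: B records 25
A.receive("ba", ba.popleft(), outgoing=[ab])  # B's marker: A closes "ba"

in_flight = sum(sum(p.channel_state[c]) for p in (A, B) for c in p.channel_state)
print(A.snapshot + B.snapshot + in_flight)  # 5 + 25 + 0 = 30, consistent
```

Note that no process ever pauses application work: the message of 5 is processed normally while the snapshot is in progress, yet the recorded global state still conserves the total.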

3. Consistency

The snapshot forms a "consistent cut" of the distributed state: if an event is included, every event that happened-before it (in the causal sense) is also included. In particular, the snapshot can never capture a message's receipt without also capturing its send.
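The consistency condition can be stated as a small predicate. This is a sketch with hypothetical event names; `happens_before` maps each event to its direct causal predecessors:

```python
# A cut is consistent iff it is closed under happens-before:
# if event e is in the cut, every predecessor of e is in the cut too.
def is_consistent_cut(cut, happens_before):
    return all(pred in cut
               for e in cut
               for pred in happens_before.get(e, ()))

# A message must be sent before it is received.
happens_before = {"recv_m": {"send_m"}}

print(is_consistent_cut({"send_m", "recv_m"}, happens_before))  # True: both captured
print(is_consistent_cut({"send_m"}, happens_before))            # True: message in flight
print(is_consistent_cut({"recv_m"}, happens_before))            # False: receipt without send
```

The third case is exactly what Chandy-Lamport rules out: a snapshot that shows a message arriving that no one ever sent.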

4. Why this matters in production

Distributed snapshots are foundational for:

  • checkpointing long-running stream processors
  • debugging incident-time global state
  • recovery after node failures
  • verifying consistency invariants offline

Without consistent cuts, restored systems can violate causal assumptions.

5. Intuition behind marker messages

Markers delimit channel history:

  • messages before marker belong to snapshot channel state
  • messages after marker do not

This removes the need for synchronized clocks and still preserves causality.
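Concretely, on a single FIFO channel the marker is just a split point. A tiny sketch, assuming the receiver has already recorded its own local state when these messages arrive:

```python
from itertools import takewhile

MARKER = object()

# Messages delivered in FIFO order on one channel:
delivered = ["m1", "m2", MARKER, "m3", "m4"]

# Everything before the marker is in-flight state that belongs to the
# snapshot; everything after it is post-snapshot traffic.
in_snapshot = list(takewhile(lambda m: m is not MARKER, delivered))
after_snapshot = delivered[len(in_snapshot) + 1:]

print(in_snapshot)     # ['m1', 'm2']
print(after_snapshot)  # ['m3', 'm4']
```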

6. Practical implementation considerations

Real systems need extra details:

  • snapshot IDs and metadata tracking
  • handling dynamic membership/topology changes
  • bounded memory for in-flight channel recording
  • persistent storage format for checkpoint state

Algorithmic correctness must be paired with engineering constraints.
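A couple of these constraints can be combined in one small sketch: a per-channel recorder with a snapshot ID and a hard cap on in-flight buffering. The field names here are assumptions for illustration, not a standard format:

```python
from dataclasses import dataclass, field

@dataclass
class ChannelRecorder:
    snapshot_id: str                 # metadata: which snapshot this belongs to
    limit: int                       # bounded memory for in-flight recording
    messages: list = field(default_factory=list)
    aborted: bool = False

    def record(self, msg):
        if self.aborted:
            return                   # snapshot already given up on
        if len(self.messages) >= self.limit:
            self.aborted = True      # too many in-flight messages: abort
        else:
            self.messages.append(msg)

rec = ChannelRecorder(snapshot_id="snap-42", limit=2)
for m in ["a", "b", "c"]:
    rec.record(m)
print(rec.aborted, rec.messages)  # True ['a', 'b']
```

Aborting on overflow is one policy; another is spilling recorded messages to disk. Either way, the bound has to exist, because channel recording is otherwise unbounded under load.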

7. Failure and restart behavior

During snapshot collection:

  • a node crash must not leave the global snapshot orphaned
  • the coordinator (if one is used) needs timeout and abort logic
  • partial snapshots should be garbage-collected safely

A robust system treats snapshot orchestration as a fault-tolerant workflow in its own right.
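A hypothetical coordinator sketch showing the timeout/abort shape (the class and status names are assumptions, and the clock is injected so the example is deterministic):

```python
import time

class SnapshotCoordinator:
    def __init__(self, participants, timeout_s, clock=time.monotonic):
        self.pending = set(participants)      # nodes that have not acked yet
        self.clock = clock
        self.deadline = clock() + timeout_s
        self.status = "RUNNING"

    def ack(self, node):
        self.pending.discard(node)
        if not self.pending and self.status == "RUNNING":
            self.status = "COMPLETE"

    def poll(self):
        if self.status == "RUNNING" and self.clock() > self.deadline:
            self.status = "ABORTED"           # partial snapshot can be GC'd
        return self.status

# Fake clock for a deterministic run.
now = [0.0]
coord = SnapshotCoordinator({"A", "B"}, timeout_s=5, clock=lambda: now[0])
coord.ack("A")
now[0] = 6.0                                  # B never acks before the deadline
print(coord.poll())  # ABORTED
```

Once a snapshot is `ABORTED`, its partial per-node and per-channel state becomes garbage to collect, and a fresh attempt gets a new snapshot ID.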

8. Common pitfalls

  • assuming FIFO channels where they are not guaranteed
  • mixing snapshot and business control messages without ordering guarantees
  • unbounded buffering while waiting for markers
  • restoring from inconsistent partial snapshots

Test snapshot logic under network delay, packet reordering, and process restarts.
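The first pitfall is easy to demonstrate. A sketch of what goes wrong when the network reorders a post-marker message ahead of the marker (the receiver here is assumed to have already taken its local snapshot):

```python
MARKER = object()

def recorded_channel_state(delivery_order):
    # A snapshotted receiver records everything until the marker arrives.
    recorded = []
    for msg in delivery_order:
        if msg is MARKER:
            break
        recorded.append(msg)
    return recorded

sent = ["m1", MARKER, "m2"]   # sender's order: m1, then marker, then m2

# FIFO delivery: only the pre-marker message is recorded. Correct.
print(recorded_channel_state(sent))              # ['m1']

# Non-FIFO delivery: m2 overtakes the marker and is wrongly included
# in the channel snapshot; after restore it would be double-counted.
reordered = ["m1", "m2", MARKER]
print(recorded_channel_state(reordered))         # ['m1', 'm2']
```

If your transport does not guarantee per-channel ordering, you need sequence numbers (or a snapshot variant designed for non-FIFO channels) before the markers mean anything.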

9. Modern examples

Streaming frameworks and stateful dataflow engines such as Apache Flink, whose checkpoint barriers are a variant of Chandy-Lamport markers, use snapshot principles to achieve exactly-once state recovery.
The core idea remains Chandy-Lamport's: capture local state plus in-transit message boundaries consistently.


Written by Sachin Sarawgi

Engineering Manager and backend engineer with 10+ years building distributed systems across fintech, enterprise SaaS, and startups. CodeSprintPro is where I write practical guides on system design, Java, Kafka, databases, AI infrastructure, and production reliability.

