Distributed Snapshots: Chandy-Lamport

How do you take a "global photo" of a system where every node has a different time and no central master?

1. The Problem

You need to save the state of a system for debugging or checkpointing. If you stop the system, you break availability.

2. Chandy-Lamport Algorithm

A node starts a snapshot by saving its local state and sending a "Marker" message to all neighbors.
When a node receives a Marker, it saves its state and forwards the Marker to its neighbors.
If it receives another Marker, it simply records the messages sent between the two markers.

3. Consistency

It guarantees that if Event A happened before the snapshot, it is included. If Event B happened after, it is excluded. This is a "consistent cut" of the distributed state.

4. Why this matters in production

Distributed snapshots are foundational for:

checkpointing long-running stream processors
debugging incident-time global state
recovery after node failures
verifying consistency invariants offline

Without consistent cuts, restored systems can violate causal assumptions.

5. Intuition behind marker messages

Markers delimit channel history:

messages before marker belong to snapshot channel state
messages after marker do not

This removes the need for synchronized clocks and still preserves causality.

6. Practical implementation considerations

Real systems need extra details:

snapshot IDs and metadata tracking
handling dynamic membership/topology changes
bounded memory for in-flight channel recording
persistent storage format for checkpoint state

Algorithmic correctness must be paired with engineering constraints.

7. Failure and restart behavior

During snapshot collection:

node crash must not orphan snapshot globally
coordinator (if used) needs timeout/abort logic
partial snapshots should be garbage-collected safely

A robust system treats snapshot orchestration as fault-tolerant workflow.

8. Common pitfalls

assuming FIFO channels where they are not guaranteed
mixing snapshot and business control messages without ordering guarantees
unbounded buffering while waiting for markers
restoring from inconsistent partial snapshots

Test snapshot logic under network delay, packet reordering, and process restarts.

9. Modern examples

Streaming frameworks and stateful dataflow engines use snapshot principles to achieve exactly-once state recovery.
The core idea remains Chandy-Lamport: capture local state plus in-transit message boundaries consistently.

Sachin Sarawgi

Engineering Manager and backend engineer with 10+ years building distributed systems across fintech, enterprise SaaS, and startups. CodeSprintPro is where I write practical guides on system design, Java, Kafka, databases, AI infrastructure, and production reliability.

Distributed Snapshots: Chandy-Lamport Algorithm