Distributed Snapshots: Chandy-Lamport
How do you take a "global photo" of a system where every node has a different time and no central master?
1. The Problem
You need to save the state of a system for debugging or checkpointing. If you stop the system, you break availability.
2. Chandy-Lamport Algorithm
- A node starts a snapshot by saving its local state and sending a "Marker" message to all neighbors.
- When a node receives a Marker, it saves its state and forwards the Marker to its neighbors.
- If it receives another Marker, it simply records the messages sent between the two markers.
3. Consistency
It guarantees that if Event A happened before the snapshot, it is included. If Event B happened after, it is excluded. This is a "consistent cut" of the distributed state.
4. Why this matters in production
Distributed snapshots are foundational for:
- checkpointing long-running stream processors
- debugging incident-time global state
- recovery after node failures
- verifying consistency invariants offline
Without consistent cuts, restored systems can violate causal assumptions.
5. Intuition behind marker messages
Markers delimit channel history:
- messages before marker belong to snapshot channel state
- messages after marker do not
This removes the need for synchronized clocks and still preserves causality.
6. Practical implementation considerations
Real systems need extra details:
- snapshot IDs and metadata tracking
- handling dynamic membership/topology changes
- bounded memory for in-flight channel recording
- persistent storage format for checkpoint state
Algorithmic correctness must be paired with engineering constraints.
7. Failure and restart behavior
During snapshot collection:
- node crash must not orphan snapshot globally
- coordinator (if used) needs timeout/abort logic
- partial snapshots should be garbage-collected safely
A robust system treats snapshot orchestration as fault-tolerant workflow.
8. Common pitfalls
- assuming FIFO channels where they are not guaranteed
- mixing snapshot and business control messages without ordering guarantees
- unbounded buffering while waiting for markers
- restoring from inconsistent partial snapshots
Test snapshot logic under network delay, packet reordering, and process restarts.
9. Modern examples
Streaming frameworks and stateful dataflow engines use snapshot principles to achieve exactly-once state recovery.
The core idea remains Chandy-Lamport: capture local state plus in-transit message boundaries consistently.
