Distributed Garbage Collection
In a microservices world, if Service A creates a resource in Service B, who is responsible for deleting it? If Service A crashes before issuing the delete, that resource can leak indefinitely. This is the problem of distributed memory management.
1. Reference Counting vs. Leases
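- Ref Counting: The owning service counts how many references to a resource exist and frees it when the count reaches zero. This is fragile; a single missed decrement (for example, a client that crashes before releasing) leads to a permanent leak.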
- Leases: The resource is granted to a service for a fixed time (e.g., 60 seconds). If the service doesn't renew the lease, the backend automatically deletes the resource.
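The fragility of reference counting can be shown with a toy in-memory store. This is only a sketch: `RefCountedStore` and the resource name `blob-1` are hypothetical, and a crash is simulated simply by never calling `release()`.

```python
class RefCountedStore:
    """Toy resource store that frees a resource when its count reaches zero."""

    def __init__(self):
        self.counts = {}

    def acquire(self, resource_id):
        # Each holder increments the count when it takes a reference.
        self.counts[resource_id] = self.counts.get(resource_id, 0) + 1

    def release(self, resource_id):
        # Each holder is expected to decrement exactly once.
        self.counts[resource_id] -= 1
        if self.counts[resource_id] == 0:
            del self.counts[resource_id]  # resource is reclaimed


store = RefCountedStore()
store.acquire("blob-1")  # Service A takes a reference
store.acquire("blob-1")  # Service C takes a reference
store.release("blob-1")  # Service C releases normally
# Service A crashes here and never calls release().
leaked = "blob-1" in store.counts  # the count is stuck at 1
```

Nothing in the store can tell a crashed holder from a slow one, which is the gap a lease closes by putting a deadline on every reference.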
2. The Cycle Problem
If a resource in Service A references one in Service B, and that resource references back to A, you have a distributed cycle. Reference counting, and any collector that sees only one service, fails here because neither count ever reaches zero. You need a distributed tracing collector (a mark-and-sweep pass that traverses service boundaries) or, more simply, enforced time-based TTLs on all shared resources.
3. Why this appears in real architectures
Examples:
- workflow engine creates temporary objects in storage service
- authorization service issues delegated grants consumed by others
- media pipeline creates intermediate blobs across stages
When ownership spans services, cleanup guarantees become unclear.
4. Lease-based strategy in practice
Leases are often the safest default:
- creator obtains resource lease for fixed duration
- active owner renews lease via heartbeat
- missed renewals trigger automatic expiration cleanup
This bounds leak lifetime and removes dependence on perfect explicit delete calls.
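The grant, renew, and expire steps above can be sketched as a small lease table. This is a minimal in-memory illustration, not a production design: `LeaseTable` and its method names are hypothetical, and the current time is passed in explicitly so the behavior is easy to follow.

```python
class LeaseTable:
    """Tracks one expiry deadline per resource (all names are illustrative)."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.expires_at = {}

    def grant(self, resource_id, now):
        # Creator obtains a lease for a fixed duration.
        self.expires_at[resource_id] = now + self.ttl

    def renew(self, resource_id, now):
        # Heartbeat from the active owner; refused once the lease has lapsed.
        deadline = self.expires_at.get(resource_id)
        if deadline is None or deadline <= now:
            return False
        self.expires_at[resource_id] = now + self.ttl
        return True

    def expire(self, now):
        # Periodic cleanup: drop every lease whose deadline has passed.
        dead = [r for r, t in self.expires_at.items() if t <= now]
        for r in dead:
            del self.expires_at[r]
        return dead


table = LeaseTable(ttl_seconds=60)
table.grant("tmp-a", now=0)
table.grant("tmp-b", now=0)
table.renew("tmp-a", now=45)     # owner of tmp-a is alive and heartbeats
expired = table.expire(now=70)   # tmp-b was never renewed, so it is reclaimed
```

In this sketch a leaked resource lives at most one TTL past its last successful renewal, plus however long the cleanup pass takes to run.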
5. Tombstones and deferred cleanup
Hard delete can be unsafe if references may still exist.
Many systems use:
- soft-delete tombstone
- grace period
- asynchronous sweeper that verifies no active references
This pattern reduces accidental data loss during transient reference delays.
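One way to picture the tombstone, grace period, and sweeper together is the sketch below. The `Resource` record, its `referrers` set, and the `sweep` function are assumptions made for illustration; a real system would have to query other services to learn whether references remain.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Resource:
    resource_id: str
    deleted_at: Optional[float] = None  # tombstone timestamp, None if live
    referrers: set = field(default_factory=set)  # services still holding a reference


def soft_delete(resource, now):
    # Mark the resource as deleted without removing it; repeat calls are no-ops.
    if resource.deleted_at is None:
        resource.deleted_at = now


def sweep(resources, now, grace_seconds):
    """Hard-delete tombstoned resources that are past the grace period
    and have no active referrers. Returns the IDs that were removed."""
    removed = []
    for rid in list(resources):
        r = resources[rid]
        if r.deleted_at is None:
            continue  # still live
        if now - r.deleted_at < grace_seconds:
            continue  # still inside the grace period
        if r.referrers:
            continue  # someone still references it; try again on a later pass
        del resources[rid]
        removed.append(rid)
    return removed


resources = {
    "a": Resource("a"),
    "b": Resource("b", referrers={"svc-x"}),
    "c": Resource("c"),
}
soft_delete(resources["a"], now=0)
soft_delete(resources["b"], now=0)
early = sweep(resources, now=100, grace_seconds=300)  # nothing is old enough yet
late = sweep(resources, now=400, grace_seconds=300)   # only "a" is unreferenced
```

Resource "b" survives both passes because a referrer is still recorded, and "c" is untouched because it was never tombstoned.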
6. Detecting distributed reference leaks
Track:
- orphan resource count by type
- lease renewal failure rate
- average resource age beyond expected TTL
- cleanup backlog depth
Without leak telemetry, distributed GC failures surface only as storage/cost explosions.
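A few of these signals can be computed from a plain resource inventory. The sketch below assumes each record carries a `type`, a `created_at` timestamp, and an `owner` field that is `None` for orphans; those field names and the `leak_metrics` function are hypothetical.

```python
from collections import Counter


def leak_metrics(resources, now, expected_ttl):
    """Summarize leak indicators from a list of resource records."""
    # Orphan count by type: resources with no recorded owner.
    orphans = Counter(r["type"] for r in resources if r["owner"] is None)

    # How far past the expected TTL each overdue resource is, in seconds.
    overdue = [
        now - r["created_at"] - expected_ttl
        for r in resources
        if now - r["created_at"] > expected_ttl
    ]
    avg_past_ttl = sum(overdue) / len(overdue) if overdue else 0.0

    return {
        "orphans_by_type": dict(orphans),
        "overdue_count": len(overdue),
        "avg_seconds_past_ttl": avg_past_ttl,
    }


sample = [
    {"type": "blob", "created_at": 0, "owner": None},
    {"type": "blob", "created_at": 900, "owner": "pipeline"},
    {"type": "grant", "created_at": 100, "owner": None},
]
metrics = leak_metrics(sample, now=1000, expected_ttl=600)
```

Exporting numbers like these on a schedule makes a slow leak visible as a trend rather than as a surprise on the storage bill.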
7. Handling cycles safely
For complex dependency graphs:
- model resources as graph nodes and references between them as edges, with ownership metadata on each
- run periodic graph traversal to find unreachable components
- sweep in topological order when possible
For many teams, strict TTL + explicit ownership conventions give better ROI than full global tracing GC.
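The traversal step can be sketched as a mark phase over a reference graph. This assumes the full graph has already been gathered into one place, which is the hard part in practice; `find_unreachable`, the `edges` mapping, and the node names are illustrative only.

```python
def find_unreachable(edges, roots):
    """Return nodes that cannot be reached from any root.

    edges: dict mapping a node to the nodes it references.
    roots: nodes known to be live (for example, active owners).
    """
    # Mark phase: depth-first walk from the roots.
    reachable = set()
    stack = list(roots)
    while stack:
        node = stack.pop()
        if node in reachable:
            continue
        reachable.add(node)
        stack.extend(edges.get(node, ()))

    # Every node mentioned anywhere in the graph, as a source or a target.
    all_nodes = set(edges)
    for targets in edges.values():
        all_nodes.update(targets)

    # Sweep candidates: everything that was not marked.
    return all_nodes - reachable


edges = {
    "workflow-1": ["blob-1"],
    "blob-1": ["blob-2"],
    "a": ["b"],  # a and b reference each other but no live root reaches them
    "b": ["a"],
}
garbage = find_unreachable(edges, roots=["workflow-1"])
```

The cycle between `a` and `b` is reported as garbage even though each node still has an incoming reference, which is exactly the case reference counting cannot handle.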
8. Design guidelines
- every resource has a declared owner
- every shared object has expiry policy
- renewal protocol is idempotent
- cleanup jobs are retry-safe and observable
- emergency manual cleanup runbook exists
Distributed GC is mostly about ownership contracts and lifecycle discipline.
