System Design · Expert · Part 4 of 7 in Reliability Engineering Mastery

Distributed Garbage Collection: Managing References Across Networks

How do you manage reference counting in a microservices environment? Deep dive into distributed cycles and lease-based memory management.

Sachin Sarawgi · April 20, 2026 · 3 min read

Distributed Garbage Collection

In a microservices world, if Service A creates a resource in Service B, who is responsible for deleting it? If Service A crashes, that resource leaks forever. This is the problem of distributed memory management.

1. Reference Counting vs. Leases

  • Ref Counting: A service counts how many references exist to a resource and deletes it when the count drops to zero. This is fragile; a single missed decrement leads to a permanent leak.
  • Leases: The resource is granted to a service for a fixed time (e.g., 60 seconds). If the service doesn't renew the lease, the backend automatically deletes the resource.
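To make the fragility concrete, here is a minimal in-process sketch of reference counting. The class and method names (`RefCountedStore`, `acquire`, `release`) are hypothetical, not a real service API; the point is only to show why a crashed caller leaks the resource forever:

```python
class RefCountedStore:
    """Toy resource store using distributed-style reference counting."""

    def __init__(self):
        self._refs = {}  # resource id -> reference count

    def create(self, rid):
        self._refs[rid] = 0

    def acquire(self, rid):
        self._refs[rid] += 1

    def release(self, rid):
        # If a caller crashes before calling release(), the count
        # never reaches zero and the resource leaks forever.
        self._refs[rid] -= 1
        if self._refs[rid] == 0:
            del self._refs[rid]

store = RefCountedStore()
store.create("blob-1")
store.acquire("blob-1")  # Service A takes a reference
# Service A crashes here: release() is never called,
# so "blob-1" stays in the store indefinitely.
```

A lease flips the default: instead of deletion requiring an explicit action that might never arrive, *survival* requires an explicit renewal, so a crash converges to cleanup.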

2. The Cycle Problem

If Service A depends on B, and B depends on A, you have a distributed cycle. Standard reference counting fails here: the counts never reach zero even though nothing live uses either resource. You need a global distributed garbage collector (like a mark-and-sweep collector that traverses service boundaries) or, more simply, enforced time-based TTLs on all shared resources.

3. Why this appears in real architectures

Examples:

  • a workflow engine creates temporary objects in a storage service
  • an authorization service issues delegated grants consumed by other services
  • a media pipeline creates intermediate blobs across processing stages

When ownership spans services, cleanup guarantees become unclear.

4. Lease-based strategy in practice

Leases are often the safest default:

  • creator obtains resource lease for fixed duration
  • active owner renews lease via heartbeat
  • missed renewals trigger automatic expiration cleanup

This bounds leak lifetime and removes dependence on perfect explicit delete calls.
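The renewal loop above can be sketched as follows. This is an illustrative in-process model, not a production lease service; `LeaseManager` and its methods are hypothetical names, and real systems would back this with a durable store:

```python
class LeaseManager:
    """Minimal sketch of lease-based cleanup with heartbeat renewal."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._expiry = {}  # resource id -> absolute expiry time

    def grant(self, rid, now):
        self._expiry[rid] = now + self.ttl

    def renew(self, rid, now):
        # Idempotent: renewing twice at the same instant is harmless.
        if rid in self._expiry:
            self._expiry[rid] = now + self.ttl

    def sweep(self, now):
        """Delete every resource whose lease was not renewed in time."""
        expired = [r for r, t in self._expiry.items() if t <= now]
        for r in expired:
            del self._expiry[r]
        return expired

mgr = LeaseManager(ttl_seconds=60)
mgr.grant("blob-1", now=0)
mgr.renew("blob-1", now=50)      # heartbeat arrives in time
assert mgr.sweep(now=100) == []  # lease still valid until t=110
print(mgr.sweep(now=120))        # owner went silent, resource expires
```

Note that `sweep` is safe to re-run: an already-expired resource is removed once and never reappears, which is exactly the retry-safety property the design guidelines below call for.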

5. Tombstones and deferred cleanup

Hard delete can be unsafe if references may still exist.
Many systems use:

  • soft-delete tombstone
  • grace period
  • asynchronous sweeper that verifies no active references

This pattern reduces accidental data loss during transient reference delays.
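The three steps above can be combined into one small sketch. The API (`TombstoneStore`, `soft_delete`, the `has_active_refs` callback) is hypothetical, and the grace period value is illustrative:

```python
GRACE_PERIOD = 300  # seconds; illustrative value

class TombstoneStore:
    """Soft-delete with a grace period before a sweeper hard-deletes."""

    def __init__(self):
        self._data = {}        # resource id -> payload
        self._tombstones = {}  # resource id -> time of soft delete

    def soft_delete(self, rid, now):
        if rid in self._data:
            self._tombstones[rid] = now  # mark, but keep the data

    def sweep(self, now, has_active_refs):
        """Hard-delete tombstoned entries past the grace period,
        skipping any that still have live references."""
        for rid, deleted_at in list(self._tombstones.items()):
            if now - deleted_at < GRACE_PERIOD:
                continue             # still inside the grace period
            if has_active_refs(rid):
                continue             # a reference resurfaced; keep it
            del self._data[rid]
            del self._tombstones[rid]

store = TombstoneStore()
store._data["img-1"] = b"..."
store.soft_delete("img-1", now=0)
store.sweep(now=100, has_active_refs=lambda r: False)  # within grace period
assert "img-1" in store._data                          # data survives
store.sweep(now=400, has_active_refs=lambda r: False)  # grace period over
assert "img-1" not in store._data                      # hard-deleted
```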

6. Detecting distributed reference leaks

Track:

  • orphan resource count by type
  • lease renewal failure rate
  • average resource age beyond expected TTL
  • cleanup backlog depth

Without leak telemetry, distributed GC failures surface only as storage/cost explosions.
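A leak-telemetry job can be as simple as a periodic scan over the resource inventory. The field names below (`type`, `created_at`, `owner`) are assumed for illustration; adapt them to whatever metadata your resource store actually records:

```python
def leak_metrics(resources, now, expected_ttl):
    """Compute simple leak-detection metrics from a resource inventory.

    Each resource is a dict with illustrative fields:
    {"type": ..., "created_at": ..., "owner": ...}; owner None = orphan.
    """
    orphans_by_type = {}
    past_ttl = 0
    for r in resources:
        if r["owner"] is None:  # no declared owner -> orphan
            orphans_by_type[r["type"]] = orphans_by_type.get(r["type"], 0) + 1
        if now - r["created_at"] > expected_ttl:  # older than expected
            past_ttl += 1
    return {"orphans_by_type": orphans_by_type, "past_ttl_count": past_ttl}

inventory = [
    {"type": "blob", "created_at": 0,   "owner": None},     # orphaned and old
    {"type": "blob", "created_at": 900, "owner": "svc-a"},  # healthy
]
print(leak_metrics(inventory, now=1000, expected_ttl=600))
```

Exporting these counts as gauges turns a silent storage leak into an alertable signal long before the cost report does.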

7. Handling cycles safely

For complex dependency graphs:

  • model resources as graph edges with ownership metadata
  • run periodic graph traversal to find unreachable components
  • sweep in topological order when possible

For many teams, strict TTL + explicit ownership conventions give better ROI than full global tracing GC.
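The traversal step can be sketched as a mark-and-sweep over the cross-service reference graph. The resource names and graph shape below are invented for illustration; the key property is that a cycle with no live root pointing into it is correctly identified as garbage:

```python
def find_garbage(roots, edges):
    """Mark-and-sweep over a cross-service reference graph.

    `edges` maps each resource to the set of resources it references;
    `roots` are externally reachable entry points. Anything unreachable
    from a root (including mutual A<->B cycles) is garbage.
    """
    marked = set()
    stack = list(roots)
    while stack:
        node = stack.pop()
        if node in marked:
            continue
        marked.add(node)                   # mark phase
        stack.extend(edges.get(node, ()))
    return set(edges) - marked             # sweep candidates

# svc-a-tmp and svc-b-tmp reference each other, but nothing live
# points at them, so both are unreachable garbage:
edges = {
    "root-cfg": {"live-blob"},
    "live-blob": set(),
    "svc-a-tmp": {"svc-b-tmp"},  # distributed cycle
    "svc-b-tmp": {"svc-a-tmp"},
}
print(find_garbage(roots={"root-cfg"}, edges=edges))
```

In a real system the hard part is not this traversal but assembling a consistent snapshot of `edges` across services, which is why many teams prefer TTLs.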

8. Design guidelines

  • every resource has a declared owner
  • every shared object has expiry policy
  • renewal protocol is idempotent
  • cleanup jobs are retry-safe and observable
  • emergency manual cleanup runbook exists

Distributed GC is mostly about ownership contracts and lifecycle discipline.


Written by

Sachin Sarawgi

Engineering Manager and backend engineer with 10+ years building distributed systems across fintech, enterprise SaaS, and startups. CodeSprintPro is where I write practical guides on system design, Java, Kafka, databases, AI infrastructure, and production reliability.
