System Design · Expert · Part 5 of 7 in Reliability Engineering Mastery

Backpressure Propagation: Designing Flow Control in Microservices

Stop your system from crashing under load. Learn how to propagate backpressure signals from the database through Kafka to the client.

Sachin Sarawgi · April 20, 2026 · 3 min read

Backpressure Propagation

When your database is slow, your worker is slow. When your worker is slow, your Kafka consumer lags. When Kafka lags, your producer buffer fills up. Backpressure is the signal that propagates this state upstream so you don't overwhelm the system.

1. TCP-Level vs. App-Level

  • TCP: the default. When the receive buffer fills, the receiver's advertised window shrinks to zero and the sender stalls. You get this for free, but it is blunt and invisible to the application.
  • Application: the service must explicitly signal "busy" to upstream callers, typically with HTTP 429 (Too Many Requests) or 503 (Service Unavailable), ideally with a retry hint.
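
An application-level "busy" signal can be as simple as a capped in-flight counter around the request handler. A minimal sketch, assuming an illustrative limit of 2 concurrent requests (the `handle` shape and `MAX_IN_FLIGHT` value are hypothetical, not from the article):

```java
import java.util.concurrent.Semaphore;

// Sketch: application-level backpressure for a request handler.
// MAX_IN_FLIGHT is an illustrative assumption; tune it from load tests.
public class BusySignal {
    static final int MAX_IN_FLIGHT = 2;               // cap on concurrent requests
    static final Semaphore inFlight = new Semaphore(MAX_IN_FLIGHT);

    // Returns an HTTP-style status code: 200 if processed, 429 if shed.
    public static int handle(Runnable work) {
        if (!inFlight.tryAcquire()) {
            // Overloaded: reject explicitly instead of queueing forever.
            // A real server would also set a Retry-After header here.
            return 429;
        }
        try {
            work.run();
            return 200;
        } finally {
            inFlight.release();
        }
    }
}
```

The key property is that rejection is immediate and explicit, so upstream services learn about the overload instead of piling requests into a hidden queue.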

2. Reactive Streams

Using libraries like Project Reactor or Akka Streams, you can implement a demand-based flow. The consumer asks for exactly N messages, ensuring it is never fed more than it can handle.
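
The same demand-based contract is available in the JDK itself via `java.util.concurrent.Flow` (the standard Reactive Streams interfaces that Project Reactor also implements). A sketch where the subscriber requests exactly one item at a time:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Flow;
import java.util.concurrent.SubmissionPublisher;

// Sketch: demand-based flow with the JDK's built-in Reactive Streams API.
// The subscriber controls the pace: it never receives more than it asked for.
public class DemandSubscriber implements Flow.Subscriber<Integer> {
    final List<Integer> received = new CopyOnWriteArrayList<>();
    Flow.Subscription subscription;

    @Override public void onSubscribe(Flow.Subscription s) {
        subscription = s;
        s.request(1);                 // initial demand: exactly one item
    }
    @Override public void onNext(Integer item) {
        received.add(item);           // process at our own pace...
        subscription.request(1);      // ...then ask for the next item
    }
    @Override public void onError(Throwable t) { t.printStackTrace(); }
    @Override public void onComplete() { }

    public static List<Integer> run() {
        DemandSubscriber sub = new DemandSubscriber();
        try (SubmissionPublisher<Integer> pub = new SubmissionPublisher<>()) {
            pub.subscribe(sub);
            for (int i = 1; i <= 5; i++) pub.submit(i);  // submit blocks if demand lags
        }
        // Delivery is asynchronous; wait briefly for all items to arrive.
        long deadline = System.currentTimeMillis() + 2000;
        while (sub.received.size() < 5 && System.currentTimeMillis() < deadline) {
            try { Thread.sleep(10); } catch (InterruptedException ignored) { }
        }
        return sub.received;
    }
}
```

Reactor's `BaseSubscriber` exposes the same `request(n)` hook; the JDK version is used here only so the example runs without external dependencies.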

3. Backpressure must cross service boundaries

Many teams implement backpressure inside one process but lose control between services.
Real resilience requires propagation through every layer:

  • DB pool saturation -> worker concurrency reduction
  • worker lag -> broker consumer pause or reduced poll volume
  • queue depth growth -> upstream rate limiting
  • API pressure -> client-visible 429/503 with retry hints

If any boundary ignores pressure, the system shifts failure rather than absorbing it.
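
The first hop in that chain, DB pool saturation driving worker concurrency down, can be a pure function of pool usage. A minimal sketch; the thresholds (0.7, 0.9) and worker limits are illustrative assumptions, not values from the article:

```java
// Sketch: propagate DB connection pool saturation into worker concurrency.
// Thresholds and limits are illustrative; derive real values from load tests.
public class AdaptiveConcurrency {
    static final int MAX_WORKERS = 32;
    static final int MIN_WORKERS = 2;

    // poolUsage: fraction of DB connections currently checked out (0.0 to 1.0).
    public static int targetWorkers(double poolUsage) {
        if (poolUsage >= 0.9) return MIN_WORKERS;   // pool nearly exhausted: back off hard
        if (poolUsage >= 0.7) return Math.max(MIN_WORKERS, MAX_WORKERS / 4); // throttle
        return MAX_WORKERS;                         // healthy: full concurrency
    }
}
```

Each boundary in the list above can apply the same pattern: read a downstream saturation signal, map it to a local limit, and let the reduced limit propagate pressure one hop further upstream.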

4. Synchronous call chain patterns

For request/response microservices:

  • set strict per-hop timeouts
  • cap concurrent in-flight requests
  • use bounded queues (avoid infinite buffering)
  • shed non-critical features first

Infinite queueing hides overload until latency collapse becomes a broad outage.
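
A bounded queue with explicit rejection is one line of configuration on a standard `ThreadPoolExecutor`. A sketch with illustrative sizes (2 workers, 4 queue slots):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sketch: a bounded work queue that rejects instead of buffering infinitely.
// Pool and queue sizes are illustrative; in production they come from load testing.
public class BoundedExecutor {
    public static ThreadPoolExecutor create() {
        return new ThreadPoolExecutor(
                2, 2,                                   // fixed pool of 2 workers
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(4),            // at most 4 waiting tasks
                new ThreadPoolExecutor.AbortPolicy());  // overflow -> RejectedExecutionException
    }
}
```

The `RejectedExecutionException` is the point: it surfaces overload at submission time, where the caller can translate it into a 429/503 or shed a non-critical feature.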

5. Async pipeline patterns (Kafka/SQS)

For event-driven systems:

  • dynamic consumer concurrency based on downstream health
  • pause/resume partitions when processing backlog crosses thresholds
  • dead-letter poison messages quickly
  • differentiate retryable vs non-retryable failures

Throughput goals should never exceed safe downstream processing capacity.
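
The pause/resume decision above boils down to a threshold check with hysteresis, sketched here independently of the Kafka client API (with kafka-clients you would call `consumer.pause(...)` and `consumer.resume(...)` based on this decision; the thresholds are illustrative assumptions):

```java
// Sketch of the pause/resume decision a consumer loop makes each poll cycle.
// PAUSE_AT and RESUME_AT are illustrative; the gap between them (hysteresis)
// prevents rapid pause/resume flapping around a single threshold.
public class BacklogGate {
    static final int PAUSE_AT = 1000;   // pause once in-process backlog exceeds this
    static final int RESUME_AT = 200;   // resume only after the backlog drains
    private boolean paused = false;

    // Returns true if the consumer should keep polling this cycle.
    public boolean shouldPoll(int backlog) {
        if (!paused && backlog >= PAUSE_AT) paused = true;
        else if (paused && backlog <= RESUME_AT) paused = false;
        return !paused;
    }
}
```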

6. Backpressure and priority

Not all workloads are equal.
Introduce priority classes:

  • Tier 0: payments/login/core writes
  • Tier 1: standard business operations
  • Tier 2: analytics/enrichment/non-critical jobs

During overload, shed Tier 2 first, then Tier 1, while preserving Tier 0 as long as possible.
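
As a sketch, the admission decision can be a small function of tier and current load. The load levels (0.80, 0.95) are illustrative cutoffs, not from the article:

```java
// Sketch: priority-aware load shedding. Tier 0 is most critical.
// Load cutoffs are illustrative assumptions; tune them per system.
public class TierShedder {
    // load: 0.0 (idle) to 1.0 (saturated). Returns true if the request is admitted.
    public static boolean admit(int tier, double load) {
        if (load >= 0.95) return tier == 0;   // near saturation: Tier 0 only
        if (load >= 0.80) return tier <= 1;   // heavy load: shed Tier 2 first
        return true;                          // normal: admit everything
    }
}
```

Rejected requests should still get the explicit 429/503-with-retry-hint treatment from section 3, so clients of shed tiers back off rather than retry immediately.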

7. Observability signals

Track these together:

  • queue depth and age
  • consumer lag by partition
  • request rejection rate (429/503)
  • thread pool and connection pool saturation
  • end-to-end latency percentiles

Backpressure is healthy when rejection increases in a controlled way while core SLOs stay stable.

8. Common anti-patterns

  • retry storms without jitter/backoff
  • unbounded in-memory buffers
  • no distinction between overload and functional errors
  • silently dropping critical messages
  • autoscaling without load-shedding controls

Backpressure is not "failing more"; it is failing intentionally to protect system integrity.
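
The first anti-pattern, retry storms, has a standard fix: exponential backoff with full jitter, so retries from many clients spread out instead of arriving in synchronized waves. A sketch with illustrative base and cap values:

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch: exponential backoff with "full jitter".
// BASE_MS and CAP_MS are illustrative assumptions.
public class Backoff {
    static final long BASE_MS = 100;
    static final long CAP_MS = 30_000;

    // Returns a randomized delay for the given retry attempt (0-based).
    public static long delayMs(int attempt) {
        // Exponential growth, capped to avoid overflow and runaway waits.
        long exp = Math.min(CAP_MS, BASE_MS * (1L << Math.min(attempt, 20)));
        return ThreadLocalRandom.current().nextLong(exp + 1);  // full jitter: [0, exp]
    }
}
```

Because each client draws a uniform delay from the full window, a burst of simultaneous failures decorrelates over the next few attempts instead of hammering the recovering dependency in lockstep.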


Written by

Sachin Sarawgi

Engineering Manager and backend engineer with 10+ years building distributed systems across fintech, enterprise SaaS, and startups. CodeSprintPro is where I write practical guides on system design, Java, Kafka, databases, AI infrastructure, and production reliability.
