Backpressure Propagation
When your database is slow, your worker is slow. When your worker is slow, your Kafka consumer lags. When Kafka lags, your producer buffer fills up. Backpressure is the signal that propagates this state upstream so you don't overwhelm the system.
1. TCP-Level vs. App-Level
- TCP: The default. When the receiver's buffer fills, TCP flow control shrinks the advertised window and the sender stalls automatically.
- Application: You must explicitly send a "Server Busy" (503/429) signal to upstream services.
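The contrast can be sketched in a few lines of Python with a bounded queue: TCP-style backpressure blocks the producer silently, while application-level backpressure rejects with an explicit, retryable status. The `submit` helper and status codes are illustrative, not from any real framework:

```python
import queue

# TCP-like backpressure: a bounded buffer where a blocking put() would
# simply stall the producer (analogue: sender stops once the window is 0).
buf = queue.Queue(maxsize=2)
buf.put("a")
buf.put("b")
print(buf.full())  # True: a further blocking put() would stall the producer

# Application-level backpressure: never block silently; reject explicitly
# so the caller gets a retryable signal (the HTTP analogue: 429/503).
def submit(item, q):
    try:
        q.put_nowait(item)
        return 202  # accepted
    except queue.Full:
        return 429  # tell the caller to back off and retry

print(submit("c", buf))  # -> 429
```

The point is visibility: the blocking variant hides overload inside the transport, while the rejecting variant turns it into a signal upstream services can act on.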
2. Reactive Streams
Using libraries like Project Reactor or Akka Streams, you can implement a demand-based flow. The consumer asks for exactly N messages, ensuring it is never fed more than it can handle.
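A toy model of that demand contract, assuming nothing beyond plain Python (this mimics the Reactive Streams `Subscription.request(n)` semantics, not the actual Reactor or Akka API):

```python
# Toy Reactive Streams demand model: the subscriber calls request(n) and
# the publisher may emit at most n items before waiting for more demand.
class Subscription:
    def __init__(self, source):
        self.source = iter(source)
        self.demand = 0

    def request(self, n):
        self.demand += n
        emitted = []
        while self.demand > 0:
            try:
                emitted.append(next(self.source))
            except StopIteration:
                break
            self.demand -= 1
        return emitted

sub = Subscription(range(100))
print(sub.request(3))  # consumer asks for exactly 3 -> [0, 1, 2]
print(sub.request(2))  # -> [3, 4]; the other 95 are never pushed uninvited
```

Because nothing flows without explicit demand, the consumer's capacity is the throughput ceiling by construction.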
3. Backpressure must cross service boundaries
Many teams implement backpressure inside one process but lose control between services.
Real resilience requires propagation through every layer:
- DB pool saturation -> worker concurrency reduction
- worker lag -> broker consumer pause or reduced poll volume
- queue depth growth -> upstream rate limiting
- API pressure -> client-visible 429/503 with retry hints
If any boundary ignores pressure, the system shifts failure rather than absorbing it.
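One of the layers above, DB pool saturation driving worker concurrency, can be sketched as a pure function. All names and thresholds here are illustrative assumptions, not taken from a real pool library:

```python
# Sketch: derive worker concurrency from downstream (DB pool) utilization,
# so saturation propagates upstream instead of piling up as queued work.
def target_concurrency(max_workers, pool_in_use, pool_size):
    utilization = pool_in_use / pool_size
    if utilization >= 0.9:      # pool nearly saturated: back off hard
        return max(1, max_workers // 4)
    if utilization >= 0.7:      # pressure building: reduce gradually
        return max(1, max_workers // 2)
    return max_workers          # healthy: run at full concurrency

print(target_concurrency(16, 9, 10))  # 90% pool use -> 4 workers
print(target_concurrency(16, 7, 10))  # 70% -> 8
print(target_concurrency(16, 2, 10))  # healthy -> 16
```

The same shape repeats at every boundary: read a downstream health signal, reduce your own intake, and thereby become the health signal for the layer above you.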
4. Synchronous call chain patterns
For request/response microservices:
- set strict per-hop timeouts
- cap concurrent in-flight requests
- use bounded queues (avoid infinite buffering)
- shed non-critical features first
Unbounded queueing hides overload until a latency collapse turns into a broad outage.
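The "cap in-flight, shed non-critical first" rules above can be condensed into a small admission function. The status codes and thresholds are an illustrative policy, not a standard:

```python
# Sketch of per-hop admission control: a hard cap on in-flight requests,
# plus a lower soft cap above which non-critical work is shed first.
def admit(in_flight, critical, max_in_flight=4, shed_threshold=2):
    if in_flight >= max_in_flight:
        return 503              # hard cap: reject even critical work
    if not critical and in_flight >= shed_threshold:
        return 429              # under pressure: shed non-critical first
    return 200                  # admit

print(admit(1, critical=False))  # light load -> 200
print(admit(3, critical=False))  # pressure -> 429, non-critical shed
print(admit(3, critical=True))   # critical still admitted -> 200
print(admit(4, critical=True))   # saturated -> 503
```

Note there is no queue at all in this sketch: a request either runs now or is rejected immediately, which is exactly what keeps overload visible instead of buffered.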
5. Async pipeline patterns (Kafka/SQS)
For event-driven systems:
- dynamic consumer concurrency based on downstream health
- pause/resume partitions when processing backlog crosses thresholds
- dead-letter poison messages quickly
- differentiate retryable vs non-retryable failures
Throughput goals should never exceed safe downstream processing capacity.
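The pause/resume pattern is usually hysteresis around a backlog threshold. This toy decision function models the shape of Kafka's `consumer.pause()`/`resume()` usage without the client library; the watermark values are illustrative:

```python
# Pause fetching when local backlog crosses a high-water mark; resume only
# once processing drains it below a low-water mark (hysteresis avoids
# rapid pause/resume flapping near a single threshold).
HIGH_WATER, LOW_WATER = 100, 20

def next_action(backlog, paused):
    if not paused and backlog >= HIGH_WATER:
        return "pause"    # stop fetching; let processing catch up
    if paused and backlog <= LOW_WATER:
        return "resume"   # backlog drained; start fetching again
    return "wait" if paused else "poll"

print(next_action(150, paused=False))  # -> "pause"
print(next_action(60, paused=True))    # -> "wait" (still draining)
print(next_action(10, paused=True))    # -> "resume"
```

The gap between the two watermarks is deliberate: with a single threshold, the consumer would oscillate between paused and polling on every message.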
6. Backpressure and priority
Not all workloads are equal.
Introduce priority classes:
- Tier 0: payments/login/core writes
- Tier 1: standard business operations
- Tier 2: analytics/enrichment/non-critical jobs
During overload, shed Tier 2 first, then Tier 1, while preserving Tier 0 as long as possible.
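That shedding order can be expressed as a mapping from load to the lowest-priority tier still admitted. Tier numbers follow the text (0 = most critical); the load thresholds are illustrative assumptions:

```python
# Map system load to the lowest tier still admitted, so Tier 2 is shed
# first and Tier 0 (payments/login/core writes) survives the longest.
def max_admitted_tier(load):
    if load < 0.7:
        return 2   # healthy: admit everything, including analytics
    if load < 0.9:
        return 1   # pressure: shed Tier 2 first
    return 0       # overload: preserve only Tier 0

def admit(tier, load):
    return tier <= max_admitted_tier(load)

print(admit(2, load=0.8))   # analytics shed under pressure -> False
print(admit(0, load=0.95))  # core writes survive overload -> True
```

Keeping the policy in one pure function like this also makes the shedding order testable, rather than scattered across handlers.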
7. Observability signals
Track these together:
- queue depth and age
- consumer lag by partition
- request rejection rate (429/503)
- thread pool and connection pool saturation
- end-to-end latency percentiles
Backpressure is healthy when rejection increases in a controlled way while core SLOs stay stable.
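That "controlled rejection with stable SLOs" condition can be checked mechanically. The field names, SLO value, and thresholds below are illustrative, not a standard metric schema:

```python
# Sketch: evaluate rejection rate and core latency together. "Controlled"
# means the system is shedding load while the p99 SLO still holds.
def backpressure_health(metrics, p99_slo_ms=250):
    rejection_rate = metrics["rejected"] / max(1, metrics["requests"])
    shedding = rejection_rate > 0.01
    slo_ok = metrics["p99_ms"] <= p99_slo_ms
    if shedding and slo_ok:
        return "controlled"   # rejections rising, core SLO intact
    if shedding:
        return "overloaded"   # shedding AND SLO breached: not contained
    return "nominal"

print(backpressure_health({"requests": 1000, "rejected": 50, "p99_ms": 180}))
# -> "controlled"
```

Alerting on the combination, rather than on rejection rate alone, is what distinguishes backpressure working as designed from backpressure failing to contain the load.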
8. Common anti-patterns
- retry storms without jitter/backoff
- unbounded in-memory buffers
- no distinction between overload and functional errors
- silently dropping critical messages
- autoscaling without load-shedding controls
Backpressure is not "failing more"; it is failing intentionally to protect system integrity.
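The antidote to the first anti-pattern, retry storms, is capped exponential backoff with jitter. A minimal sketch of the "full jitter" variant, with illustrative base and cap values:

```python
import random

# Capped exponential backoff with full jitter: each client waits a random
# duration in [0, min(cap, base * 2^attempt)], so synchronized retries
# spread out instead of hammering a recovering service in lockstep.
def backoff_delay(attempt, base=0.1, cap=30.0):
    upper = min(cap, base * (2 ** attempt))
    return random.uniform(0, upper)

delays = [round(backoff_delay(a), 3) for a in range(5)]
print(delays)  # random, but each bounded by 0.1 * 2^attempt (capped at 30s)
```

Without the jitter, every client that failed at the same moment retries at the same moment, recreating the spike that caused the failure.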
