Retry Queues vs. DLQ: Beyond Simple Retries
In a high-scale messaging system, failures are a matter of 'when', not 'if'. Two common mistakes are retrying a failed message indefinitely in place, which blocks the partition, and dropping it, which silently loses data. Neither is acceptable in a production system.
1. The Poison Pill Problem
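A 'Poison Pill' is a message that can never be processed successfully, no matter how many times it is retried (e.g., malformed JSON or a bug in the handler logic). If your consumer keeps retrying this message in place, it never advances past it. Because a partition is consumed in order, every message behind it in that partition is stuck as well.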
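One way to avoid retrying a poison pill forever is to separate failures that can never succeed from failures that might succeed later. The sketch below is a minimal illustration, not a definitive implementation: NonRetryableError, parse_order, handle, and the order_id field are all hypothetical names chosen for this example.

```python
import json


class NonRetryableError(Exception):
    """Raised for messages that can never succeed, such as malformed payloads."""


def parse_order(raw: bytes) -> dict:
    """Decode a message body; the 'order_id' field is an assumed schema."""
    try:
        payload = json.loads(raw)
    except (json.JSONDecodeError, UnicodeDecodeError) as exc:
        # Malformed input will be just as malformed on the next attempt.
        raise NonRetryableError(f"malformed payload: {exc}") from exc
    if not isinstance(payload, dict) or "order_id" not in payload:
        raise NonRetryableError("missing required field: order_id")
    return payload


def handle(raw: bytes, process) -> str:
    """Return the action the consumer should take for this message."""
    try:
        process(parse_order(raw))
        return "ack"
    except NonRetryableError:
        # Poison pill: retrying cannot help, so park it for inspection.
        return "dead_letter"
    except Exception:
        # Anything else (timeouts, a downstream outage) may be transient.
        return "retry"
```

In this sketch a malformed body goes straight to "dead_letter", while an exception raised by the process callback is treated as possibly transient and returns "retry".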
2. Non-Blocking Retries
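A common solution is to publish failed messages to a separate Retry Topic instead of retrying them in place.
- The Flow: Main Topic -> Failure -> Retry Topic 1 (5s delay) -> Retry Topic 2 (30s delay) -> DLQ (Dead Letter Queue).
- Benefit: The Main Topic keeps processing new messages while failed ones wait out their delay on the retry topics.
- Trade-off: A retried message is processed after messages that arrived later, so strict ordering is no longer guaranteed.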
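The routing decision in this flow can be sketched as a small pure function. This is only an illustration under stated assumptions: the topic names, the Route type, and the idea of tracking a retries_used counter (for example, in a message header) are hypothetical, and the delays simply mirror the 5s and 30s stages above.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical topic names; the delays mirror the flow described above.
RETRY_STAGES = [("orders.retry.5s", 5.0), ("orders.retry.30s", 30.0)]
DLQ_TOPIC = "orders.dlq"


@dataclass(frozen=True)
class Route:
    topic: str
    delay_seconds: Optional[float]  # None means no further retry (DLQ)
    retries_used: int               # counter to store on the forwarded message


def next_route(retries_used: int) -> Route:
    """Pick the destination for a message that has just failed.

    retries_used is the number of retry stages the message has already
    passed through (0 when it fails on the main topic).
    """
    if retries_used < 0:
        raise ValueError("retries_used must be non-negative")
    if retries_used < len(RETRY_STAGES):
        topic, delay = RETRY_STAGES[retries_used]
        return Route(topic, delay, retries_used + 1)
    # All retry stages are exhausted: park the message in the DLQ.
    return Route(DLQ_TOPIC, None, retries_used)
```

A failure on the main topic calls next_route(0) and lands on the 5s topic, and a message that has already used both stages goes to the DLQ. The consumer of each retry topic is assumed to wait until the delay has elapsed before processing the message again.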
3. Implementing Exponential Backoff
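Avoid retrying at a constant interval. If many consumers retry on the same fixed schedule, their retries arrive in synchronized waves and can overwhelm a downstream service that is already struggling. Use Exponential Backoff with Jitter: increase the delay after each failed attempt, and add randomness so that retries from different consumers are spread out over time.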
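As a rough sketch of the idea, the function below computes a delay using the 'full jitter' variant, where the delay is drawn uniformly between zero and an exponentially growing ceiling. This is one of several possible jitter strategies, and the function name and the base and cap defaults are assumptions for illustration, not recommended values.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng=random) -> float:
    """Return a randomized delay in seconds for a zero-based retry attempt.

    The ceiling doubles with each attempt (base, 2*base, 4*base, ...) and
    is limited to cap. The returned delay is uniform in [0, ceiling].
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    exponent = min(attempt, 32)
    ceiling = min(cap, base * (2 ** exponent))
    return rng.uniform(0.0, ceiling)
```

With the defaults, attempt 0 waits at most 1 second, attempt 3 at most 8 seconds, and every attempt from 6 onward is capped at 60 seconds. Passing a seeded random.Random instance as rng makes the delays reproducible in tests.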
