Distributed Data Observability: High-Signal Metrics
In a distributed data system, CPU and RAM are "low-signal" metrics. A system can have 10% CPU usage and still be completely stalled. To truly understand the health of your infrastructure, you need to monitor metrics that reflect the internal state of the distributed engine.
1. Kafka: The Lag and the ISR
- Consumer Lag: The number of messages produced but not yet consumed. High lag indicates your consumers are too slow or stalled.
- Under-Replicated Partitions (URP): The most critical Kafka metric. It indicates that some replicas are not caught up with the leader. If this stays high, you are at risk of data loss.
- ISR Changes: Frequent changes in the "In-Sync Replica" set indicate network flaps or broker instability.
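Consumer lag is just the gap between the broker's log-end offset and the group's committed offset, per partition. As a minimal sketch (the `consumer_lag` helper is hypothetical; in practice you would fetch both offset maps from the broker, e.g. with the `kafka-consumer-groups.sh` tool or an admin client):

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: messages produced but not yet consumed.

    Both arguments map partition id -> offset; a partition with no
    committed offset is treated as fully unconsumed.
    """
    return {
        p: end - committed_offsets.get(p, 0)
        for p, end in log_end_offsets.items()
    }

# Example: partition 2 is 4,500 messages behind.
lag = consumer_lag({0: 1000, 1: 1000, 2: 5000},
                   {0: 1000, 1: 980, 2: 500})
# lag == {0: 0, 1: 20, 2: 4500}
```

Alert on the maximum per-partition lag, not just the sum: a single stalled partition can hide behind healthy ones in an aggregate.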
2. Redis: Memory and the Slow Log
- Memory Fragmentation Ratio: If this is > 1.5, Redis is wasting memory. If it's < 1.0, Redis is swapping to disk (a performance death sentence).
- Evicted Keys: If this is rising, your cache is too small for your working set.
- Slow Log Count: Use SLOWLOG GET to identify O(N) operations that are blocking the single-threaded event loop.
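All of these numbers come from the INFO command, which returns one key:value pair per line. A minimal parser and threshold check (the sample reply below is illustrative, not real output):

```python
def parse_info(raw):
    """Parse the key:value lines of a Redis INFO reply into a dict."""
    stats = {}
    for line in raw.splitlines():
        # Section headers start with '#'; data lines are key:value.
        if line and not line.startswith("#") and ":" in line:
            key, _, value = line.partition(":")
            stats[key] = value
    return stats

sample = """# Memory
used_memory:1048576
used_memory_rss:1887437
mem_fragmentation_ratio:1.80
evicted_keys:120
"""

stats = parse_info(sample)
ratio = float(stats["mem_fragmentation_ratio"])
if ratio > 1.5:
    print("warn: high fragmentation:", ratio)
elif ratio < 1.0:
    print("warn: RSS below used memory; Redis is likely swapping")
```

In production you would call INFO through your client library rather than parse raw text, but the thresholds are the same.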
3. Cassandra: Read Repair and Compaction
- Pending Compactions: If this number keeps growing, your disk I/O cannot keep up with your write volume.
- Background Read Repairs: High read-repair activity indicates that your replicas are frequently out of sync, possibly due to network issues or dropped mutations.
- Local Read/Write Latency: Monitor the 99th percentile (p99) specifically for the local DC.
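Why p99 rather than the mean? A handful of slow reads can be invisible in the average while dominating tail latency. A small nearest-rank percentile sketch (illustrative only; in practice you would read these histograms from the database's own metrics):

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# 100 reads: 98 fast ones and two 250 ms outliers.
latencies_ms = [2.0] * 98 + [250.0] * 2
mean = sum(latencies_ms) / len(latencies_ms)   # ~7 ms: looks fine
p99 = percentile(latencies_ms, 99)             # 250 ms: the real story
```

The mean here is under 7 ms while the p99 is 250 ms, which is exactly the kind of tail problem a compaction backlog produces.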
4. The Golden Signals for All Data Systems
- Latency: How long does it take for a request to be serviced?
- Traffic: How many requests are being made?
- Errors: What percentage of requests are failing?
- Saturation: How "full" is your service (e.g., connection pool usage)?
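The four signals can be summarized from one scrape window of raw counters. A hedged sketch, where `golden_signals` and the shape of the `window` dict are hypothetical names for illustration:

```python
def golden_signals(window):
    """Reduce one scrape window of raw counters to the four signals.

    `window` holds request/error counters, the window length in
    seconds, raw latency samples, and connection-pool gauges.
    """
    total = window["requests"]
    latencies = sorted(window["latencies_ms"])
    return {
        "latency_p50_ms": latencies[len(latencies) // 2],
        "traffic_rps": total / window["seconds"],
        "error_pct": 100.0 * window["errors"] / total if total else 0.0,
        "saturation_pct": 100.0 * window["pool_in_use"] / window["pool_size"],
    }

signals = golden_signals({
    "requests": 1200, "errors": 6, "seconds": 60,
    "latencies_ms": [3, 4, 5, 6, 7],
    "pool_in_use": 45, "pool_size": 50,
})
# 20 rps of traffic, 0.5% errors, and a pool at 90% saturation.
```

Saturation is the early-warning signal of the four: a pool at 90% will hit 100% long before error rates move.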
5. Tracing the Data Path
Use Distributed Tracing (OpenTelemetry) to track a request as it moves from your application into your data store. This helps you identify if a slow response is due to a network hop, a slow query, or a database locking issue.
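OpenTelemetry does the heavy lifting in a real deployment, but the core mechanic is simple: every span records a name, a duration, and the trace ID that ties it to its parent. A toy sketch of that idea in plain Python (all names here are illustrative, not the OpenTelemetry API):

```python
import time
import uuid
from contextlib import contextmanager

SPANS = []  # a real system exports spans to a collector instead

@contextmanager
def span(name, trace_id=None, parent_id=None):
    """Record a timed span; spans sharing a trace_id form one trace."""
    record = {
        "name": name,
        "trace_id": trace_id or uuid.uuid4().hex,
        "parent_id": parent_id,
        "span_id": uuid.uuid4().hex[:16],
        "start": time.monotonic(),
    }
    try:
        yield record
    finally:
        record["duration_s"] = time.monotonic() - record["start"]
        SPANS.append(record)

# One request crossing app -> cache -> database:
with span("handle_request") as root:
    with span("redis_get", root["trace_id"], root["span_id"]):
        pass  # cache miss
    with span("cassandra_read", root["trace_id"], root["span_id"]):
        time.sleep(0.01)  # the slow hop shows up in this span's duration
```

Comparing the child span durations against the root immediately shows which hop (network, cache, or database) consumed the request's time budget.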
Summary
Monitoring is about finding the "signal" in the "noise." By focusing on system-specific metrics like Kafka's ISR and Cassandra's pending compactions, you can gain a deeper understanding of your cluster's health and prevent outages before they happen.
