
Distributed Data Observability: Metrics That Actually Matter

Beyond CPU and RAM: Learn the high-signal metrics for Kafka, Redis, and Cassandra that help you identify production issues before they become outages.

Sachin Sarawgi | April 20, 2026 | 2 min read

Distributed Data Observability: High-Signal Metrics

In a distributed data system, CPU and RAM are "low-signal" metrics. A system can have 10% CPU usage and still be completely stalled. To truly understand the health of your infrastructure, you need to monitor metrics that reflect the internal state of the distributed engine.

1. Kafka: The Lag and the ISR

  • Consumer Lag: The number of messages produced but not yet consumed, measured per partition as the gap between the log end offset and the consumer group's committed offset. Lag that keeps growing indicates your consumers are too slow or stalled.
  • Under-Replicated Partitions (URP): Arguably the most critical Kafka metric. It counts partitions where some replicas have not caught up with the leader. The healthy value is 0; if it stays above zero, a further broker failure puts you at risk of data loss.
  • ISR Changes: Frequent shrinking and expanding of the "In-Sync Replica" set indicates network flaps or broker instability.
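
As a rough illustration of the consumer lag definition above, here is a minimal sketch that computes per-partition lag from two offset maps. The function name consumer_lag and the sample offsets are hypothetical; in practice the offsets would come from your Kafka client or admin tooling. Treating a partition with no committed offset as fully unconsumed is a simplifying assumption.

```python
from typing import Dict


def consumer_lag(end_offsets: Dict[int, int], committed: Dict[int, int]) -> Dict[int, int]:
    """Per-partition lag = log end offset minus committed offset."""
    lag = {}
    for partition, end in end_offsets.items():
        # Assumption: a partition with no committed offset counts as fully unconsumed.
        done = committed.get(partition, 0)
        # Clamp at zero so a stale end-offset reading never yields negative lag.
        lag[partition] = max(end - done, 0)
    return lag


# Hypothetical sample data: partition id -> offset.
end_offsets = {0: 1500, 1: 980, 2: 2000}
committed = {0: 1500, 1: 900}

lag = consumer_lag(end_offsets, committed)
total_lag = sum(lag.values())
print(lag, total_lag)
```

Alerting on the trend of total_lag over time, rather than on a single reading, tends to separate a stalled consumer from a normal traffic burst.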

2. Redis: Memory and the Slow Log

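  • Memory Fragmentation Ratio: The ratio of memory the OS has allocated to Redis (used_memory_rss) to the memory Redis has actually requested (used_memory). If this is > 1.5, Redis is wasting memory through fragmentation. If it's < 1.0, part of Redis's memory has likely been swapped to disk by the OS (a performance death sentence).
  • Evicted Keys: If this is rising, your cache is too small for your working set.
  • Slow Log Count: Use SLOWLOG LEN to see how many entries are in the slow log and SLOWLOG GET to identify the O(N) operations that are blocking the single-threaded event loop.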
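
The thresholds above can be checked mechanically against the text that the Redis INFO command returns. The sketch below parses a hard-coded sample rather than a live server; the helper names parse_info and fragmentation_status and the sample values are hypothetical, while the field names mem_fragmentation_ratio and evicted_keys are the ones INFO reports.

```python
def parse_info(raw: str) -> dict:
    """Parse the 'key:value' lines of Redis INFO output into a dict of strings."""
    fields = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Memory".
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def fragmentation_status(info: dict, high: float = 1.5, low: float = 1.0) -> str:
    """Classify the fragmentation ratio using the thresholds from the article."""
    ratio = float(info["mem_fragmentation_ratio"])
    if ratio > high:
        return "fragmented"
    if ratio < low:
        return "possible-swap"
    return "ok"


# Hypothetical sample of INFO output, trimmed to the relevant fields.
SAMPLE_INFO = """# Memory
used_memory:1048576
used_memory_rss:1782579
mem_fragmentation_ratio:1.70
# Stats
evicted_keys:42
"""

info = parse_info(SAMPLE_INFO)
status = fragmentation_status(info)
evicted = int(info["evicted_keys"])
print(status, evicted)
```

Because evicted_keys is a cumulative counter, the useful signal is its rate of change between two samples, not its absolute value.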

3. Cassandra: Read Repair and Compaction

  • Pending Compactions: If this number keeps growing, your disk I/O cannot keep up with your write volume.
  • Read Repair Background: High read-repair activity indicates that your replicas are frequently out of sync, possibly due to network issues or dropped mutations.
  • Local Read/Write Latency: Monitor the 99th percentile (p99) specifically for the local DC.
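
"Keeps growing" is a statement about trend, not about any single reading. One way to sketch that check is a least-squares slope over recent pending-compaction samples. The function names, the sample series, and the threshold of one extra pending task per sample are all illustrative assumptions, not recommended values.

```python
from typing import Sequence


def slope(samples: Sequence[float]) -> float:
    """Least-squares slope of the samples against their index (needs >= 2 points)."""
    n = len(samples)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def is_backlog_growing(samples: Sequence[float], min_slope: float = 1.0) -> bool:
    """True when pending compactions rise faster than min_slope per sample."""
    return slope(samples) > min_slope


# Hypothetical readings of pending compactions, one per scrape interval.
steady = [3, 5, 2, 4, 3, 4]
growing = [5, 9, 14, 22, 31, 45]

print(is_backlog_growing(steady), is_backlog_growing(growing))
```

A slope-based check tolerates the normal sawtooth of compaction activity, where the count rises after flushes and falls as compactions complete.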

4. The Golden Signals for All Data Systems

  1. Latency: How long does it take for a request to be serviced?
  2. Traffic: How many requests are being made?
  3. Errors: What percentage of requests are failing?
  4. Saturation: How "full" is your service (e.g., connection pool usage)?
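
To make the four signals concrete, here is a minimal sketch that derives them from one window of request records. The Request type, the golden_signals function, and the connection pool numbers are hypothetical, and p99 uses the simple nearest-rank method.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    latency_ms: float
    ok: bool


def golden_signals(requests: List[Request], window_s: float,
                   pool_in_use: int, pool_size: int) -> Dict[str, float]:
    """Compute latency (p99), traffic, error rate, and saturation for one window."""
    latencies = sorted(r.latency_ms for r in requests)
    n = len(latencies)
    # Nearest-rank p99: ceil(0.99 * n), done in integer arithmetic.
    rank = max((99 * n + 99) // 100, 1)
    errors = sum(1 for r in requests if not r.ok)
    return {
        "latency_p99_ms": latencies[rank - 1],
        "traffic_rps": n / window_s,
        "error_rate": errors / n,
        "saturation": pool_in_use / pool_size,
    }


# Hypothetical window: 100 requests in 10 s, latencies 1..100 ms, slowest 5 failed.
requests = [Request(latency_ms=float(i), ok=i <= 95) for i in range(1, 101)]
signals = golden_signals(requests, window_s=10, pool_in_use=18, pool_size=20)
print(signals)
```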

5. Tracing the Data Path

Use Distributed Tracing (OpenTelemetry) to track a request as it moves from your application into your data store. This helps you identify whether a slow response is due to a network hop, a slow query, or a database locking issue.
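
The sketch below is not the OpenTelemetry API. It is a standard-library stand-in that shows the idea a tracer builds on: wrap each stage of the data path in a named, timed span, then compare durations. The Trace class, the span names, and the sleep calls that simulate work are all hypothetical.

```python
import time
from contextlib import contextmanager


class Trace:
    """Collects (name, duration_seconds) pairs for one request."""

    def __init__(self):
        self.spans = []

    @contextmanager
    def span(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the span even if the wrapped stage raised.
            self.spans.append((name, time.perf_counter() - start))

    def slowest(self):
        """Name of the span with the longest duration."""
        return max(self.spans, key=lambda s: s[1])[0]


trace = Trace()
with trace.span("network_hop"):
    time.sleep(0.01)   # simulated network round trip
with trace.span("query_execution"):
    time.sleep(0.05)   # simulated slow query
with trace.span("lock_wait"):
    time.sleep(0.02)   # simulated wait on a database lock

print(trace.slowest())
```

A real tracer additionally propagates a trace ID across process boundaries, so spans recorded by the application and by the data store client can be stitched into one timeline.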

Summary

Monitoring is about finding the "signal" in the "noise." By focusing on system-specific metrics like Kafka's ISR and Cassandra's pending compactions, you can gain a deeper understanding of your cluster's health and prevent outages before they happen.
