System Design: Designing a Metrics Monitoring and Alerting System

How does Prometheus or Datadog monitor millions of time-series metrics? A technical deep dive into Pull vs. Push models, TSDB architecture, and alerting.

Sachin Sarawgi · April 20, 2026 · 3 min read

Monitoring is the nervous system of any production infrastructure. A metrics monitoring system must ingest millions of data points every second, store them efficiently for years, and raise alerts quickly when things go wrong.

1. Core Requirements

  • Metrics Collection: Gathering data (CPU, RAM, Error rates) from thousands of servers.
  • Time-Series Storage: Efficiently storing (Timestamp, Value) pairs.
  • Querying: Supporting complex aggregations (e.g., a tail-latency percentile such as p99 across all shards).
  • Alerting: Notifying engineers when thresholds are crossed.

2. Ingestion Models: Pull vs. Push

Pull Model (Prometheus)

The monitoring server actively "scrapes" metrics from targets via HTTP.

  • Pros: Pairs naturally with service discovery; the server controls the scrape rate and therefore its own load; target health is easy to monitor (a failed scrape means the target is down or unreachable).
  • Cons: Hard to use with short-lived jobs (serverless) or behind firewalls.
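A minimal sketch of one scrape cycle, assuming a simplified text format with one "name value" pair per line (real exposition formats also carry optional timestamps and type metadata, which this ignores). The function names and the /metrics path are illustrative assumptions. The fetcher is passed in so the parsing and failure handling can be exercised without a network.

```python
import urllib.request
from typing import Callable, Dict, Optional


def parse_metrics(body: str) -> Dict[str, float]:
    """Parse a simplified text body: one 'name value' pair per line."""
    samples: Dict[str, float] = {}
    for line in body.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and comment lines
            continue
        # Split on the last space so label sets stay attached to the name.
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


def http_fetch(target: str) -> str:
    """Fetch the metrics page over HTTP (path is an assumed convention)."""
    with urllib.request.urlopen(f"http://{target}/metrics", timeout=5) as resp:
        return resp.read().decode("utf-8")


def scrape(target: str, fetch: Callable[[str], str] = http_fetch) -> Optional[Dict[str, float]]:
    """Scrape one target; None means the scrape failed and the target is treated as unhealthy."""
    try:
        return parse_metrics(fetch(target))
    except (OSError, ValueError):  # network errors or an unparseable body
        return None
```

Because the server initiates every request, a None result doubles as a health signal for the target, which is the "easy to monitor target health" advantage listed above.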

Push Model (StatsD, Datadog)

Targets push their metrics to the monitoring server.

  • Pros: Great for serverless/ephemeral jobs; works behind firewalls.
  • Cons: Targets can overwhelm the server during a spike; harder to detect if a target has silently crashed.
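As a rough illustration of the push side, the sketch below formats a metric in the StatsD line protocol (name:value|type) and sends it over UDP. The host, the default port 8125, and the function names are assumptions for the example, not a client library API.

```python
import socket


def statsd_line(name: str, value: float, metric_type: str = "c") -> bytes:
    """Format one metric as <name>:<value>|<type> ('c' counter, 'g' gauge, 'ms' timer)."""
    return f"{name}:{value}|{metric_type}".encode("ascii")


def push(name: str, value: float, metric_type: str = "c",
         host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send; the sender receives no acknowledgement."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(statsd_line(name, value, metric_type), (host, port))
```

The lack of acknowledgement is what makes this cheap for short-lived jobs, and it is also why a silently crashed target is hard to tell apart from one that simply has nothing to report.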

3. Storage Architecture: TSDB (Time-Series Database)

General-purpose relational databases struggle with the write rate and storage footprint of metrics workloads, so most monitoring systems rely on a specialized TSDB.

  • Data Layout: Metrics are grouped by (Name + Labels). All data for a single "time series" is stored contiguously on disk.
  • Compression: Samples usually arrive at regular intervals and values for the same metric often change slowly, so TSDBs use specialized encodings popularized by Facebook's Gorilla paper: delta-of-delta encoding for timestamps and XOR-based encoding for floating-point values. Together these can reduce storage by 90%+.
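To see why these encodings help, here is a small sketch of both ideas at the integer level. It shows only the transformation step; a real TSDB would follow it with variable-length bit packing, which is omitted here. All function names are illustrative.

```python
import struct
from typing import List


def delta_of_delta(timestamps: List[int]) -> List[int]:
    """Encode as [first timestamp, first delta, then the change in delta per point]."""
    if len(timestamps) < 2:
        return list(timestamps)
    out = [timestamps[0], timestamps[1] - timestamps[0]]
    for i in range(2, len(timestamps)):
        prev_delta = timestamps[i - 1] - timestamps[i - 2]
        out.append((timestamps[i] - timestamps[i - 1]) - prev_delta)
    return out


def decode_delta_of_delta(encoded: List[int]) -> List[int]:
    """Inverse of delta_of_delta."""
    if len(encoded) < 2:
        return list(encoded)
    ts = [encoded[0], encoded[0] + encoded[1]]
    delta = encoded[1]
    for dod in encoded[2:]:
        delta += dod
        ts.append(ts[-1] + delta)
    return ts


def float_bits(x: float) -> int:
    """Raw IEEE 754 bit pattern of a 64-bit float."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_stream(values: List[float]) -> List[int]:
    """First value's raw bits, then each value XORed with its predecessor."""
    bits = [float_bits(v) for v in values]
    return bits[:1] + [b ^ a for a, b in zip(bits, bits[1:])]
```

With a steady 10-second scrape interval, almost every delta-of-delta is 0, and an unchanged gauge XORs to 0 as well. Long runs of zeros are what the bit-packing stage then stores in a bit or two per sample.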

4. Querying and Aggregation

Most queries involve "rolling up" data (e.g., "Show me the average CPU over the last 5 minutes").

  • Optimization: Pre-compute common aggregations (Recording Rules) and store them as separate time series to make dashboards blazing fast.
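The pre-computation idea can be sketched as a fixed-window rollup: raw samples are averaged into 5-minute buckets, and the result would be written back as its own, much smaller, time series. This is a simplification (a real recording rule evaluates a query expression on a schedule), and the names here are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[int, float]  # (unix timestamp in seconds, value)


def rollup_avg(samples: List[Sample], window_s: int = 300) -> List[Sample]:
    """Average raw samples into fixed windows, keyed by window start time."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % window_s].append(value)  # align to the window start
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]
```

A dashboard that reads the rolled-up series scans one point per window instead of every raw sample, which is where the speedup comes from.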

5. High Availability and Scalability

Ingestion is comparatively easy to scale because collectors can be added horizontally; storage is harder because it is stateful.

  • Clustering: Use a distributed TSDB like M3DB or VictoriaMetrics.
  • Long-term Storage: Move older, less-frequently accessed data to cheaper object storage (S3) while keeping "hot" data in local SSDs.
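Two decisions sit behind these bullets: which node owns a series, and which tier holds a block of data. The sketch below is one naive way to express both, assuming hash-modulo sharding and a 15-day hot retention; neither value nor any function name comes from a specific product.

```python
import hashlib
from typing import Dict


def series_key(name: str, labels: Dict[str, str]) -> str:
    """Canonical series identity: metric name plus labels in sorted order."""
    parts = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{parts}}}"


def shard_for(key: str, num_shards: int) -> int:
    """Stable hash so every sample of a series lands on the same shard."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def tier_for(block_end_ts: int, now_ts: int, hot_retention_s: int = 15 * 86400) -> str:
    """Blocks newer than the hot retention stay on SSD; older ones go to object storage."""
    return "ssd" if now_ts - block_end_ts < hot_retention_s else "object_storage"
```

Plain modulo hashing remaps most series whenever the shard count changes, so a production cluster would more likely use consistent hashing or a fixed number of virtual shards.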

6. Alerting (The "Silence" Problem)

The biggest challenge in alerting is Alert Fatigue.

  • The Logic: An alerting daemon runs queries against the TSDB periodically. If the result crosses a threshold for a sustained period (e.g., 5 minutes), it triggers an alert.
  • Optimization: Group related alerts (e.g., if a whole DC is down, send one alert, not 1,000 for every server).
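Both ideas above can be sketched in a few lines: a rule that only fires once a breach has been sustained, and a grouping step that collapses many firing alerts into one notification per datacenter. The class, the label names ("dc", "host"), and the state names are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional


class ThresholdRule:
    """Fires only after the value stays above the threshold for `for_s` seconds."""

    def __init__(self, threshold: float, for_s: int = 300) -> None:
        self.threshold = threshold
        self.for_s = for_s
        self.breach_start: Optional[int] = None

    def evaluate(self, now_ts: int, value: float) -> str:
        if value <= self.threshold:
            self.breach_start = None  # condition cleared, reset the timer
            return "ok"
        if self.breach_start is None:
            self.breach_start = now_ts  # first evaluation in breach
        return "firing" if now_ts - self.breach_start >= self.for_s else "pending"


def group_alerts(alerts: List[Dict[str, str]], by: str = "dc") -> Dict[str, List[str]]:
    """Collapse firing alerts into one notification per value of the `by` label."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for alert in alerts:
        groups[alert.get(by, "unknown")].append(alert["host"])
    return dict(groups)
```

The "pending" state is what filters out brief spikes, and grouping by a shared label is what turns 1,000 per-server pages into a single message listing the affected hosts.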

Summary

Building a monitoring system is about managing Throughput and Compression. By leveraging the Pull model for discovery and a specialized TSDB for storage, you can gain deep visibility into your systems without breaking the bank or the server.

Written by Sachin Sarawgi

Engineering Manager and backend engineer with 10+ years building distributed systems across fintech, enterprise SaaS, and startups. CodeSprintPro is where I write practical guides on system design, Java, Kafka, databases, AI infrastructure, and production reliability.
