Lesson 57 of 105 20 minFlagship

System Design: Designing a Metrics Monitoring and Alerting System

How does Prometheus or Datadog monitor millions of time-series metrics? A technical deep dive into Pull vs. Push models, TSDB architecture, and alerting.

Reading Mode

Hide the curriculum rail and keep the lesson centered for focused reading.

Key Takeaways

  • Monitoring systems collect multi-dimensional time-series data using a metric name and key-value label pairs.
  • Prometheus uses a pull-based ingestion model, dynamically discovering targets and scraping metrics via HTTP.
  • Specialized TSDBs leverage delta-of-delta compression and Gorilla float compression to reduce storage footprints by 90%.
Recommended Prerequisites
System Design Interview Framework

Premium outcome

From vague architecture answers to staff-level trade-off thinking.

Backend engineers preparing for senior, staff, and architecture rounds.

What you unlock

  • A reusable system design answer framework for ambiguous prompts
  • Clear language for consistency, scaling, and reliability trade-offs
  • Case-study depth across feeds, payments, storage, and messaging systems

Monitoring is the operational nervous system of any production infrastructure. At modern scale, a metrics monitoring system must ingest millions of data points every second, store them efficiently for extended periods, support real-time aggregation queries, and trigger alerts in milliseconds when thresholds are violated.

Traditional relational databases fail at this scale because the volume of writes, combined with the overhead of index updates and locking mechanisms, creates immediate disk I/O bottlenecks. Conversely, general-purpose NoSQL databases lack specialized time-series compression and query optimizations. Consequently, modern observability platforms rely on custom-designed Time-Series Databases (TSDBs) and decoupled, resilient metric collection pipelines.

This system design guide details the architectural blueprint for designing a metrics monitoring and alerting system similar to Prometheus, capable of handling 10 million active time-series metrics and 100,000 target scrapes per second.


System Requirements

An enterprise-grade metrics monitoring platform requires a clear division of operational responsibilities. We model these as functional requirements, non-functional requirements, and explicit scale parameters.

Functional Requirements

  • Metric Ingestion: Periodically collect metrics from thousands of targets (such as microservice instances, servers, databases, and containers) using a pull-based HTTP mechanism.
  • Multi-Dimensional Data Model: Store time-series samples identified by a metric name and a set of key-value pairs (labels/tags).
  • Time-Series Query Engine: Support real-time mathematical operations, aggregation over dimensions, and rate calculations over moving time windows.
  • Alert Evaluation: Continuously execute query expressions against the data store to detect anomalies and trigger alerts.
  • Service Discovery Integration: Dynamically discover new scrape targets as containers scale up or down, integrating with platforms like Kubernetes, Consul, or AWS EC2 APIs.
  • Alert Routing & Management: Deduplicate, group, inhibit, and route fired alerts to notification endpoints (such as PagerDuty, Slack, or webhooks) while managing active silences.

Non-Functional Requirements

  • High Ingestion Scalability: Accept millions of data writes per second with low latency, ensuring the system handles spikes in infrastructure size.
  • Query Isolation: Prevent complex dashboard queries from stalling the ingestion pipeline or causing memory exhaustion.
  • Operational Decoupling: Design the monitoring infrastructure to operate independently of the monitored application. If monitored applications experience a catastrophic failure, the monitoring system must remain active to report the outage.
  • Storage Efficiency: Compress metrics using specialized time-series codecs to minimize disk space requirements for long-term retention.
  • Fault-Tolerant Scheduling: Ensure targets are scraped at consistent intervals without drift, distributing network request loads evenly.

Scale Assumptions

  • Active Time-Series: 10,000,000 unique series (combinations of metric names and label sets).
  • Scrape Targets: 10,000 active service instances.
  • Metrics Per Target: Average of 1,000 unique metrics per target instance.
  • Scrape Interval: 15 seconds.
  • Data Retention: 30 days of raw historical data.

API Design and Interface Contracts

The monitoring pipeline uses clean interface definitions to manage data exposition, internal alert propagation, and query execution.

1. Scrape Target Exposition Format (HTTP GET /metrics)

Monitored service instances expose their metric state using a line-oriented, plaintext exposition format. This design minimizes parsing overhead.

# HELP http_requests_total The total number of HTTP requests processed by the service.
# TYPE http_requests_total counter
http_requests_total{method="POST",handler="/v1/checkout",status="200"} 1027 1717757280000
http_requests_total{method="POST",handler="/v1/checkout",status="500"} 3 1717757280000
# HELP system_cpu_utilization The fractional CPU utilization of the host.
# TYPE system_cpu_utilization gauge
system_cpu_utilization{cpu="0"} 0.42 1717757280000
system_cpu_utilization{cpu="1"} 0.38 1717757280000

2. Prometheus to Alertmanager Alert Schema (HTTP POST /api/v2/alerts)

When Prometheus evaluates an alert rule and finds it active, it pushes the alert state payload to the Alertmanager cluster.

[
  {
    "labels": {
      "alertname": "InstanceDown",
      "instance": "order-service-app-55bf9.prod",
      "severity": "critical",
      "job": "order-service"
    },
    "annotations": {
      "summary": "Service instance is unreachable",
      "description": "Scraping of order-service-app-55bf9.prod has failed for greater than 3 minutes."
    },
    "startsAt": "2026-06-07T11:48:00Z"
  }
]

3. Query API contracts (gRPC Protocol)

To support federated querying and dashboard visualization, the storage and query engine expose a gRPC API.

syntax = "proto3";

package codesprintpro.monitoring.v1;

service QueryService {
  rpc EvaluateInstantQuery (InstantQueryRequest) returns (InstantQueryResponse);
  rpc EvaluateRangeQuery (RangeQueryRequest) returns (RangeQueryResponse);
}

message LabelPair {
  string name = 1;
  string value = 2;
}

message Sample {
  int64 timestamp_ms = 1;
  double value = 2;
}

message Series {
  repeated LabelPair labels = 1;
  repeated Sample samples = 2;
}

message InstantQueryRequest {
  string query = 1;
  int64 evaluation_timestamp_ms = 2;
}

message InstantQueryResponse {
  repeated Series result_series = 1;
}

message RangeQueryRequest {
  string query = 1;
  int64 start_timestamp_ms = 2;
  int64 end_timestamp_ms = 3;
  int64 step_ms = 4;
}

message RangeQueryResponse {
  repeated Series result_series = 1;
}

High-Level Architecture

The monitoring platform is divided into two distinct components: the Metrics Engine (which pulls metrics and stores them in a TSDB) and the Alertmanager (which aggregates and dispatches alerts).

Metrics Collection Pipeline

The Metrics Engine uses service discovery to identify target endpoints, invokes the pull worker pool to scrape metrics via HTTP, and appends the incoming samples to the Time-Series Database.

graph TD
    SD[Service Discovery: K8s / Consul] -->|Target Lists| Scraper[Scrape Coordinator]
    Scraper -->|Worker Allocation| PullPool[HTTP Scrape Worker Pool]
    PullPool -->|GET /metrics| T1[Target App Instance A]
    PullPool -->|GET /metrics| T2[Target App Instance B]
    PullPool -->|GET /metrics| T3[Target App Instance C]
    
    T1 -.->|Plaintext Metrics| PullPool
    T2 -.->|Plaintext Metrics| PullPool
    T3 -.->|Plaintext Metrics| PullPool
    
    PullPool -->|Append Samples| HeadBlock[In-Memory Head Block]
    HeadBlock -->|Durable Append| WAL[(Write-Ahead Log)]
    
    HeadBlock -->|Flush every 2h| DiskBlocks[(Persistent Disk Block Storage)]
    QueryEngine[Query Engine / PromQL] -->|Read Hot Samples| HeadBlock
    QueryEngine -->|Read Cold Samples| DiskBlocks
    Grafana[Visualization: Grafana] -->|Query Request| QueryEngine

Alert Evaluation and Routing Engine

Alert rules are evaluated as database queries. When a query evaluates to true over a designated time window, it fires an alert. The Alertmanager cluster receives this alert, groups it with related failures, and routes it to the designated notification platform.

graph TD
    RuleEngine[Alert Rule Evaluator] -->|Query Execution| TSDB[(Time-Series Database)]
    TSDB -->|Series Data| RuleEngine
    RuleEngine -->|Fires Active Alerts| AM_Ingest[Alertmanager Ingestion Queue]
    
    subgraph Alertmanager Cluster
        AM_Ingest --> Gossip[Gossip Network Layer]
        Gossip --> Deduplicator[Deduplicator & Silver Filter]
        Deduplicator --> Grouper[Alert Grouper & Buffer]
        Grouper --> Router[Notification Router]
    end
    
    Router -->|Push Pagers| PagerDuty[PagerDuty API]
    Router -->|Send Chat Message| Slack[Slack Webhook]
    Router -->|Generic Event| Webhook[Webhook Server]

Low-Level Design and Schema

Prometheus manages time-series storage through a custom TSDB layout designed to write high volumes of samples sequentially. However, metadata and alerting states require structured schema definitions.

TSDB Data Layout and Storage Mechanics

The custom TSDB does not run typical relational queries. Instead, it groups metrics into blocks. Each block covers a 2-hour period and resides in a directory containing:

  • meta.json: Details the time range, block ID, version, and compaction metadata.
  • chunks/: Files containing raw time-series data compressed using Gorilla floating-point encoding.
  • index: A search index mapping metric names and labels to the byte offsets of the compressed chunks.

The active directory layout on disk is structured as follows:

data/
├── 01FPS6V7A3QG713XE932A5B9CX/       # Completed 2-hour block
│   ├── meta.json                     # Time boundaries and block type
│   ├── index                         # Label-to-chunk index
│   └── chunks/                       # Compressed time-series segment files
│       └── 000001
├── 01FPS8T9B4RH824YF943B6C0DY/       # Older compacted block
│   ├── meta.json
│   ├── index
│   └── chunks/
│       └── 000001
├── wal/                              # Write-Ahead Log for active memory
│   ├── 00000001
│   └── 00000002

Relational Database Schemas for Metadata, Alert Rules, and Auditing

While raw samples reside in the TSDB, target definitions, alert rules configurations, and historic alert event logs are managed in a Postgres relational database.

-- Represents dynamic monitoring targets discovered or configured
CREATE TABLE monitored_targets (
    target_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    job_name VARCHAR(128) NOT NULL,
    target_url VARCHAR(256) NOT NULL UNIQUE,
    scrape_interval_seconds INT NOT NULL DEFAULT 15,
    scrape_timeout_seconds INT NOT NULL DEFAULT 10,
    static_labels JSONB NOT NULL DEFAULT '{}'::jsonb,
    is_active BOOLEAN NOT NULL DEFAULT TRUE,
    last_scraped_at TIMESTAMPTZ,
    last_scrape_status VARCHAR(32) DEFAULT 'UNKNOWN', -- UP, DOWN, TIMEOUT, PARSE_ERROR
    last_scrape_duration_ms INT,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_targets_job_status ON monitored_targets (job_name, is_active);
CREATE INDEX idx_targets_labels ON monitored_targets USING GIN (static_labels);

-- Stores alert rules defined in the system
CREATE TABLE alert_rules (
    rule_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    rule_name VARCHAR(128) NOT NULL UNIQUE,
    promql_expression TEXT NOT NULL,
    for_duration_seconds INT NOT NULL, -- Threshold duration before triggering alert
    rule_labels JSONB NOT NULL DEFAULT '{}'::jsonb,
    rule_annotations JSONB NOT NULL DEFAULT '{}'::jsonb,
    is_enabled BOOLEAN NOT NULL DEFAULT TRUE,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Tracks historical alert lifecycle events for analytics and dashboards
CREATE TABLE fired_alerts (
    alert_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    rule_id UUID NOT NULL REFERENCES alert_rules(rule_id) ON DELETE CASCADE,
    instance_name VARCHAR(256) NOT NULL,
    alert_status VARCHAR(32) NOT NULL DEFAULT 'FIRING', -- FIRING, RESOLVED
    fingerprint VARCHAR(64) NOT NULL, -- Hash of labels to deduplicate active instances
    alert_labels JSONB NOT NULL,
    started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    resolved_at TIMESTAMPTZ,
    silenced_by VARCHAR(128),
    resolved_by VARCHAR(32) DEFAULT 'SYSTEM' -- SYSTEM, MANUAL
);

CREATE INDEX idx_fired_alerts_fingerprint ON fired_alerts (fingerprint) WHERE alert_status = 'FIRING';
CREATE INDEX idx_fired_alerts_history ON fired_alerts (started_at DESC, alert_status);

Schema Rationale & Index Optimization

  1. idx_fired_alerts_fingerprint: A partial index restricted to alert_status = 'FIRING'. When a new alert ingestion payload lands from the evaluation engine, the system hashes the alert labels to generate a fingerprint. This index allows the engine to locate existing active instances of that alert in less than a millisecond, preventing duplicate rows.
  2. static_labels (GIN Index): The labels column is indexed using a generalized inverted index (GIN). This allows operators to query targets based on dynamic attributes (e.g., static_labels @> '{"env": "production"}') without scanning the entire table.
  3. ON DELETE CASCADE on fired_alerts: Ensures that deleting an alert rule automatically cleans up historical logs in developmental or testing environments, preventing orphaned rows.

Scaling Challenges and Capacity Estimation

Managing 10 million active series with 15-second scrape intervals requires evaluating ingestion throughput, memory overhead, and disk footprints.

1. Ingestion Rate and Network Throughput

  • Assumptions:

    • Active Series ($N$) = $10,000,000$ unique metrics
    • Scrape Interval ($I$) = $15$ seconds
    • Average Scrape Size per Target ($S$) = $1,000$ metrics
    • Total Scrape Targets ($T$) = $10,000$ instances
  • Calculations: $$\text{Ingestion Throughput} = \frac{N}{I} = \frac{10,000,000\text{ series}}{15\text{ seconds}} \approx 666,667\text{ samples/second}$$

    Each HTTP scrape returns plaintext data. Assuming an average size of $100$ bytes per metric line (including label descriptions and comments): $$\text{Payload per Target Scrape} = 1,000\text{ metrics} \times 100\text{ bytes} = 100\text{ KB}$$ $$\text{Total Network Input Rate} = \frac{10,000\text{ targets} \times 100\text{ KB}}{15\text{ seconds}} = 66.7\text{ MB/second} \approx 533\text{ Mbps}$$

This volume of constant ingress network traffic requires spreading the network load. To scale ingestion, we use consistent hashing over target identifiers to shard scrape lists across $10$ collection instances, keeping each container's incoming network bandwidth under a manageable $6.7$ MB/second.

2. Time-Series Storage Footprint

  • Assumptions:

    • Ingestion Rate = $666,667$ samples/second
    • Average byte footprint per sample (using Gorilla float compression) = $1.5$ bytes
    • Retention period ($R$) = $30$ days ($2,592,000$ seconds)
  • Calculations: $$\text{Bytes Written Per Second} = 666,667 \times 1.5\text{ bytes} = 1,000,000\text{ bytes/second} = 1\text{ MB/second}$$ $$\text{Hourly Storage Rate} = 1\text{ MB/s} \times 3,600\text{ seconds} = 3.6\text{ GB/hour}$$ $$\text{Daily Storage Rate} = 3.6\text{ GB} \times 24\text{ hours} = 86.4\text{ GB/day}$$ $$\text{Total 30-Day Storage Size} = 86.4\text{ GB/day} \times 30\text{ days} = 2.592\text{ TB}$$

While $2.6$ TB is easily stored on a single solid-state drive (SSD), index files for 10 million active series grow significantly. The TSDB index mapping names to chunks requires solid-state random read performance to prevent dashboard query timeouts.

3. Active Memory (RAM) Calculation for In-Memory Head Block

  • Assumptions:

    • Prometheus holds the active 2 hours of data in memory inside the Head Block before flushing to disk.
    • Active Series = $10,000,000$
    • Each series uses a dedicated 120-byte struct in RAM to track active chunk pointers, label indexes, and metric labels.
    • A sample consists of an 8-byte timestamp and an 8-byte float value ($16$ bytes raw).
  • Calculations: $$\text{Samples per Series in 2 Hours} = \frac{7,200\text{ seconds}}{15\text{ seconds}} = 480\text{ samples}$$ Using Gorilla compression, these 480 samples compress to a chunk size of approximately $720$ bytes. $$\text{Total Raw Series Structure Data} = 10,000,000\text{ series} \times (120\text{ bytes} + 720\text{ bytes}) = 8.4\text{ GB}$$

    In Go or Java environments, heap structures, garbage collection pointers, and label index trees introduce an overhead multiplier of 3x to 4x. Therefore, the scraping node requires at least $32$ GB of physical RAM to maintain ingestion stability without triggering Out-Of-Memory (OOM) exceptions.


Failure Scenarios and Resilience

Observability systems must be resilient to infrastructure overloads. If the monitoring platform crashes during an outage, engineers lose the ability to diagnose the root cause.

1. High-Cardinality Label Explosion

A developer releases code containing a metric with a high-cardinality dynamic tag (e.g., http_request_duration_seconds{user_id="<uuid>"}).

  • The Threat: The active time-series count jumps from 10 million to 100 million within minutes. The index structures consume all available memory, triggering an OOM-kill loop on the collection engine.
  • Resilience Design:
    • We enforce Scrape Limits. Every job configuration defines a hard limit: sample_limit: 2000. If a target exposes more than 2,000 metrics, the scrape worker aborts the scrape, discards the payload, and sets the target status to OVERLIMIT.
    • We implement a Cardinality Controller that profiles labels. If the number of unique values for a label key on a target exceeds 1,000, the controller drops the label or replaces the value with a fallback string like reassigned_high_cardinality.

2. Alertmanager Network Partition (Split-Brain Event)

A network partition isolates Alertmanager nodes in a multi-region deployment, cutting off communication between instance A and instance B.

  • The Threat: Instance A and Instance B both detect a service failure. If both dispatch pages to the same engineer, they cause alert fatigue.
  • Resilience Design:
    • Alertmanagers are configured as a cluster using the HashiCorp Memberlist library (Gossip protocol).
    • During a partition, they default to sending notifications. We prefer duplicate alerts over missed alerts (high availability over consistency).
    • To minimize duplicates, we implement client-side gossip tracking. Each node delays sending notifications by a calculated buffer window: $$\text{Wait Interval} = \text{base_delay} + (\text{index_of_node} \times \text{offset})$$ This gives the primary node time to send the notification and publish the dispatch confirmation event to the remaining accessible cluster members.

3. Target Scrape Timeouts and Slow Connections

A target service experiences garbage collection pauses or thread pool exhaustion, causing HTTP requests to hang.

  • The Threat: Scrape worker threads block waiting for HTTP responses, starving the work queue and preventing other healthy targets from being scraped.
  • Resilience Design:
    • We enforce a strict timeout constraint: scrape_timeout: 10s. This value must be less than the scrape interval ($15$ seconds).
    • The HTTP client uses non-blocking asynchronous socket selectors. We allocate a fixed worker pool for each job. If a target fails to respond within the timeout, the socket is forced closed, the failure is logged, and the worker is returned to the pool.

4. Alert Storms during Cascading Failures

A central database instance crashes, causing 500 dependent microservice instances to throw connection errors and trigger alerts simultaneously.

  • The Threat: The Alertmanager dispatches 500 individual pages within seconds, overloading notification gateways and overwhelming the on-call engineer.
  • Resilience Design:
    • We configure Inhibition Rules. We define a top-level alert rule for the database (e.g., DatabaseDown). If DatabaseDown is active, it inhibits child alerts containing the label match dependency="database".
    • We use Alert Grouping. Alerts are grouped by service name and environment (e.g., group_by: ['alertname', 'service']). When multiple targets in the same service fail, they are held in a buffer for 30 seconds (group_wait: 30s) and sent as a single aggregated notification.

Architectural Trade-offs

Choosing the ingestion model and database architecture involves balancing system complexity against resource efficiency.

Trade-off 1: Pull Ingestion Model vs. Push Ingestion Model

The pull model fetches metrics via periodic HTTP scrapes; the push model requires applications to publish metrics directly to a collector.

Feature / Aspect Pull Model (Prometheus) Push Model (StatsD / Datadog)
Service Discovery Centralized. Server query APIs detect targets and control scrape lists. Decentralized. Targets must be configured with the destination address.
Load Control / Protection High. Server determines scrape intervals and concurrency limits. Low. A spike in traffic can cause instances to overwhelm the collector.
Firewall / Network Complexity High. Requires bidirectional routing or proxy gateways for private subnets. Low. Targets only need outbound egress to the monitoring endpoint.
Health Monitoring Inherent. If a scrape fails, the target is flagged as offline (up = 0). Passive. Requires complex dead-man's switch daemons to detect missing endpoints.
Ephemeral Jobs Support Poor. Requires intermediate buffers (e.g., Pushgateway) to hold state. High. Short-lived tasks can push metrics right before exiting.

Trade-off 2: Time-Series Database (TSDB) vs. Relational Database (RDBMS)

A custom TSDB stores compressed metrics sequentially; an RDBMS uses standard tables, index trees, and transaction logs.

Feature / Aspect Custom Time-Series Database (TSDB) Relational Database (RDBMS)
Write Performance Maximum. Writes are buffered in memory and written sequentially to disk. Low. Row locks, index leaf splits, and write-ahead logs limit throughput.
Storage Efficiency High. Delta-of-delta and Gorilla compression compress samples to 1.5 bytes. Low. Row overhead, alignment bytes, and indices increase space usage.
Query Specialization Optimized for temporal aggregations (e.g., rate, percentile over time). Poor. Multi-row aggregations require table scans or complex index updates.
Data Integrity Weak consistency. No support for cross-row transactions or ACID. Strong consistency. Enforces schema validation and transactional safety.

Staff Engineer Perspective

Operating a large-scale metrics monitoring infrastructure requires implementing strict limits at the collection layer.


Verbal Script

Interviewer: "Why does Prometheus use a pull model, and how would you monitor ephemeral, short-lived jobs in a pull-based architecture?"

Candidate: "Prometheus uses a pull model because it places control over ingestion rates and resource utilization on the monitoring server itself. The server decides when to scrape and can protect itself from being overwhelmed by application spikes. Additionally, pull scraping provides an out-of-the-box health check: if the server cannot reach a target, it marks the target as offline.

For short-lived jobs—like serverless functions or batch cron tasks that finish in less than 15 seconds—a pull model struggles because the target might exit before the next scrape loop runs.

To solve this, we use an intermediate buffer called the Pushgateway. Ephemeral jobs push their metrics to the Pushgateway via HTTP POST before they terminate. The Pushgateway caches these metrics and exposes them as a static target that Prometheus can scrape at its regular interval.

However, this introduces a trade-off: we lose direct health monitoring of the ephemeral job, and we must manage the lifecycle of metrics in the Pushgateway manually to prevent stale metrics from persisting."


Interviewer: "How would you handle a high-cardinality tag explosion that causes a Prometheus container to trigger an OOM-kill?"

Candidate: "We handle high-cardinality explosions using a multi-tiered defense strategy: prevention at the scraper layer, rate-limiting on target metrics, and database write isolation.

First, on the scraper nodes, we configure a hard limit on the number of metrics allowed per scrape target using the sample_limit configuration. If a target attempts to expose 50,000 metrics due to an unbounded label (like user_uuid), the scraper aborts the scrape immediately, discards the payload, and logs the event. This prevents the high-cardinality data from entering the active Head block memory.

Second, if the metrics bypass the scraper limit, we use metric relabeling rules to drop or normalize high-cardinality labels before they are committed to the TSDB. For instance, we can rewrite the path label on HTTP metrics to remove query parameters or ID paths, grouping them into static buckets like /users/:id.

Finally, we ensure query isolation by setting resource limits on PromQL evaluations. We configure maximum memory boundaries for single queries so that a developer running a broad regex aggregation query across millions of series is blocked before consuming all system memory and crashing the container."


Interviewer: "How would you design a monitoring architecture for a global deployment across multiple cloud regions with 50 million active metrics?"

Candidate: "At a scale of 50 million active metrics across multiple regions, a single centralized monitoring node is not viable due to network costs, ingestion limits, and single-point-of-failure risks. We design a Federated and Distributed Observability Architecture using Thanos or VictoriaMetrics.

First, we deploy local, independent Prometheus agents in every availability zone and region. These agents perform local target discovery and scrape metrics, writing them to a local TSDB with a short retention period (e.g., 2 hours). This guarantees that network issues between regions do not impact local metrics collection.

Second, we attach a Thanos Sidecar to each local Prometheus agent. As the agent flushes its 2-hour TSDB blocks to disk, the sidecar uploads them to a regional, low-cost object storage bucket (like AWS S3 or Google Cloud Storage).

Third, we deploy a stateless Thanos Query layer globally. When an operator runs a dashboard query in Grafana, the Query engine splits the request, queries the active Head blocks via regional gRPC sidecars for real-time data, and queries the object store for historical data. It performs deduping, grouping, and aggregation on the fly.

This architecture scales horizontally, keeps cross-region network traffic low, and provides high availability because historical data is decoupled from the compute nodes."


Want to track your progress?

Sign in to save your progress, track completed lessons, and pick up where you left off.