Lesson 72 of 105 12 minFlagship

System Design: Building a Distributed Tracing Platform

Design a production distributed tracing platform with trace ingestion, context propagation, sampling, span storage, trace query, retention, tenant isolation, and cost controls.

Reading Mode

Hide the curriculum rail and keep the lesson centered for focused reading.

Key Takeaways

  • request path across API gateway, auth, inventory, payment, and notification services
  • batch workflow trace spanning queue consumers and downstream APIs
  • async event flow from producer to Kafka to consumer to database
Recommended Prerequisites
System Design Interview Framework

Premium outcome

From vague architecture answers to staff-level trade-off thinking.

Backend engineers preparing for senior, staff, and architecture rounds.

What you unlock

  • A reusable system design answer framework for ambiguous prompts
  • Clear language for consistency, scaling, and reliability trade-offs
  • Case-study depth across feeds, payments, storage, and messaging systems

Metrics tell you that application latency is bad.

Logs tell you that something failed somewhere.

Traces tell you which request went where, in what order, and where the execution time actually disappeared.

That is why tracing becomes essential once one user request crosses many microservices. Without it, debugging a slow checkout or a flaky order flow turns into a manual archaeology exercise across distinct log files, shifting timestamps, and blind guesses. With it, you get a unified request tree showing exactly which hop or third-party dependency caused the latency spike.

This guide designs a production distributed tracing platform.


Requirements and System Goals

A distributed tracing platform is a high-volume telemetry pipeline. The design must handle millions of writes per second while ensuring query paths remain responsive.

Functional Requirements

  • Span & Log Ingestion: Accept tracing spans and events from applications in standard formats (e.g., OpenTelemetry, Jaeger).
  • Context Propagation: Support tracking trace context (Trace ID, Parent Span ID, sampling flags) across HTTP, gRPC, and messaging queue boundaries.
  • Trace Tree Assembly: Assemble unstructured span events into a visual directed acyclic graph (DAG) representing parent-child execution paths.
  • Advanced Query & Search: Search traces by service name, operation, duration, error status, tenant ID, and custom metadata attributes.
  • Adaptive Ingestion Sampling: Support both head-based and tail-based sampling rules to prioritize errors and slow operations over healthy requests.
  • Configurable Retention & Purging: Automatically retire and delete old trace records to manage storage growth.

Non-Functional Requirements

  • High Write Throughput: Ingest and process over 100,000 spans per second.
  • Low Query Latency: Retrieve complete trace details by Trace ID in less than 500ms.
  • Low Host CPU Overhead: The host agent and instrumentation libraries must consume less than 1% CPU and memory of the application process.
  • Cardinality Explosion Protection: Prevent search indexes from collapsing under high-cardinality metadata keys (e.g., User IDs, SQL queries).
  • Multi-Tenant Isolation: Ensure performance boundaries and query isolation between distinct developer teams or enterprise accounts.

API Interfaces and Service Contracts

We use OpenTelemetry protobuf protocols over gRPC or HTTP for span ingestion. We expose REST endpoints for querying traces in the UI.

Ingestion Service Contract (gRPC / Protobuf)

syntax = "proto3";

package codesprintpro.telemetry.v1;

service TraceIngestionService {
  rpc ExportSpans (ExportSpansRequest) returns (ExportSpansResponse);
}

message ExportSpansRequest {
  repeated ResourceSpans resource_spans = 1;
}

message ResourceSpans {
  string service_name = 1;
  string environment = 2;
  repeated Span spans = 3;
}

message Span {
  bytes trace_id = 1;         // 16-byte unique identifier
  bytes span_id = 2;          // 8-byte unique identifier
  bytes parent_span_id = 3;   // Optional 8-byte parent identifier
  string name = 4;            // Operation name
  int64 start_time_unix_nano = 5;
  int64 end_time_unix_nano = 6;
  map<string, string> attributes = 7;
  repeated Event events = 8;
  int32 status_code = 9;      // 0 = Unset, 1 = OK, 2 = Error
}

message Event {
  int64 time_unix_nano = 1;
  string name = 2;
  map<string, string> attributes = 3;
}

message ExportSpansResponse {
  bool partial_failure = 1;
  string rejection_reason = 2;
}

Search Traces REST Endpoint

  • Endpoint: GET /v1/traces/search
  • Query Parameters:
    • service=payment-service
    • operation=AuthorizePayment
    • status=ERROR
    • minDurationMs=500
    • limit=20
  • Response Payload (HTTP 200 OK):
{
  "matches": [
    {
      "traceId": "fa91b0021cda40a993e1104e76839bb2",
      "rootName": "POST /checkout",
      "durationMs": 1280,
      "spanCount": 14,
      "hasErrors": true,
      "timestamp": "2026-06-06T14:20:00Z"
    }
  ]
}

Fetch Complete Assembled Trace Tree

  • Endpoint: GET /v1/traces/fa91b0021cda40a993e1104e76839bb2
  • Response Payload (HTTP 200 OK):
{
  "traceId": "fa91b0021cda40a993e1104e76839bb2",
  "spans": [
    {
      "spanId": "87f0b90c12e873a1",
      "parentSpanId": null,
      "service": "api-gateway",
      "name": "POST /checkout",
      "startOffsetMs": 0,
      "durationMs": 1280,
      "status": "ERROR"
    },
    {
      "spanId": "bc110a22de44598a",
      "parentSpanId": "87f0b90c12e873a1",
      "service": "payment-service",
      "name": "AuthorizePayment",
      "startOffsetMs": 20,
      "durationMs": 1260,
      "status": "ERROR"
    }
  ]
}

High-Level Design and Visualizations

Our design decouples write ingestion path scaling from read query parsing. We insert a message queue buffer (Kafka) to handle ingestion spikes during production incidents.

Ingestion and Sampling Architecture

flowchart TD
    App[App Container VM] -->|1. Async buffering of spans| Agent[OTel Host Agent]
    Agent -->|2. Send Span Batch| Ingress[Ingress Gateway Load Balancer]
    Ingress -->|3. Route to Collector| Collector[Collector Pool]
    
    subgraph StreamBuffer [Spike Buffering & Tail-Sampling]
        Collector -->|4. Push raw spans| Kafka[Kafka / Telemetry Stream Broker]
        Kafka -->|5. Read spans| Sampler[Tail-Sampling Coordinator]
        Sampler -->|6. Keep in memory ring buffer 10s| RingBuffer[(Sampling Ring Buffer)]
    end
    
    Sampler -->|7. If error, slow request, or 1% pass| Ingester[Ingestion Worker Pool]
    Sampler -.->|8. Drop healthy non-slow spans| DropPool[Discard Sink]
    
    subgraph Storage [Telemetry Tiered Storage]
        Ingester -->|9. Write columnar records| ClickHouse[(ClickHouse Columnar Database)]
        Ingester -->|10. Index search tags| Elasticsearch[(Elasticsearch Metadata Index)]
    end
    
    UI[Trace UI / Jaeger Dashboard] -->|11. Run search query| QueryService[Query API Gateway]
    QueryService -->|12. Scan index| Elasticsearch
    QueryService -->|13. Fetch trace DAG| ClickHouse
    QueryService -->|14. Return compiled tree| UI

Context Propagation Flow Across Networks

This diagram illustrates how trace context propagates across network boundaries (HTTP and Kafka) to assemble a single distributed trace.

sequenceDiagram
    autonumber
    actor User as User Browser
    participant GW as API Gateway
    participant Ord as Order Service
    participant Msg as Kafka Message Queue
    participant Inv as Inventory Service
    
    User->>GW: POST /checkout (Create Trace ID: t_991, Span ID: s_001)
    note over GW: Start Span s_001
    GW->>Ord: POST /orders (Inject Headers: traceparent: 00-t_991-s_001-01)
    note over Ord: Extract Trace ID: t_991, Parent: s_001.<br/>Start Span s_002.
    
    Ord->>Msg: Publish 'order_created' event (Inject metadata key: traceparent: 00-t_991-s_002-01)
    note over Ord: End Span s_002. Return HTTP 202
    GW->>User: HTTP 202 Accepted (End Span s_001)
    
    note over Inv: Consume 'order_created' event
    Msg->>Inv: Read record headers (Extract traceparent: 00-t_991-s_002-01)
    note over Inv: Start Span s_003 with Parent: s_002
    note over Inv: End Span s_003. Update DB.

Low-Level Design and Schema Strategies

We use ClickHouse as our primary columnar database to optimize raw span writes and queries. We store indexable search metadata in a separate Elasticsearch cluster.

ClickHouse Columnar Table DDL

-- Primary columnar table designed for writing massive telemetry data
CREATE TABLE telemetry_spans (
    trace_id FixedString(16),                    -- Binary representation of 128-bit Trace ID
    span_id FixedString(8),                      -- Binary representation of 64-bit Span ID
    parent_span_id FixedString(8),
    tenant_id LowCardinality(String),            -- LowCardinality saves index space
    service_name LowCardinality(String),
    operation_name LowCardinality(String),
    status_code UInt8,                           -- 0: Unset, 1: OK, 2: Error
    start_time DateTime64(6, 'UTC'),             -- Microsecond accuracy
    duration_us UInt64,                          -- Microseconds duration
    attributes Map(String, String),              -- Key-value mappings
    events Nested(
        time DateTime64(6, 'UTC'),
        name String,
        attributes Map(String, String)
    ),
    created_date Date DEFAULT toDate(start_time)
) 
ENGINE = MergeTree()
PARTITION BY created_date                       -- Auto-partitioning by date allows cheap drop partitions
ORDER BY (tenant_id, service_name, operation_name, start_time, trace_id)
TTL created_date + INTERVAL 7 DAY DELETE;       -- Automatic 7-day retention cleanup
{
  "settings": {
    "index": {
      "number_of_shards": 6,
      "number_of_replicas": 1,
      "refresh_interval": "10s"
    }
  },
  "mappings": {
    "properties": {
      "trace_id": { "type": "keyword", "doc_values": true },
      "tenant_id": { "type": "keyword" },
      "service_name": { "type": "keyword" },
      "operation_name": { "type": "keyword" },
      "status_code": { "type": "integer" },
      "start_time": { "type": "date" },
      "duration_ms": { "type": "long" },
      "searchable_attributes": {
        "type": "nested",
        "properties": {
          "key": { "type": "keyword" },
          "value": { "type": "keyword" }
        }
      }
    }
  }
}

Scaling and Operational Challenges

To design a tracing platform that does not consume more hosting resources than the application itself, we must evaluate ingestion calculations.

Back-of-the-Envelope Capacity Estimations

Let us estimate the capacity requirements for a system processing 100,000 requests/second.

  • Spans per Request: Let us assume each request averages 15 spans across the microservices.
  • Span Ingestion rate: $$\text{Ingest rate} = 100,000 \times 15 = 1,500,000\text{ spans/second}$$
  • Data Footprint: If each raw span (including metadata, path headers, and log events) averages 1 KB serialized: $$\text{Data throughput} = 1,500,000 \times 1\text{ KB} = 1,500,000\text{ KB} = 1.5\text{ GB/second}$$ $$\text{Daily storage} = 1.5\text{ GB/sec} \times 86,400\text{ seconds} = 129,600\text{ GB/day} \approx 130\text{ TB/day}$$
  • Applying Tail Sampling: Instead of storing every request, we apply tail-sampling:
    • Keep 100% of error traces (assume 1% error rate = 1,000 requests/sec).
    • Keep 100% of slow traces (duration greater than 500ms, assume 4% = 4,000 requests/sec).
    • Keep 1% of healthy, low-latency traces (1% of 95,000 = 950 requests/sec).
    • Total traces kept: $$\text{Kept requests} = 1,000 + 4,000 + 950 = 5,950\text{ requests/second}$$ $$\text{Kept spans rate} = 5,950 \times 15 = 89,250\text{ spans/second}$$
    • Sampling Storage Savings: $$\text{New daily storage} = 89,250\text{ spans/sec} \times 1\text{ KB} \times 86,400\text{ sec} \approx 7.7\text{ TB/day}$$ This reduces our storage footprint and cost by over 94% while preserving the most valuable debugging data.

Trade-offs and Architectural Alternatives

Ingestion Sampling: Head-based vs. Tail-based Sampling

Feature Head-based Sampling Tail-based Sampling
Decision Point Made at trace initialization (e.g., in the API Gateway). Made at collector layer after trace assembly.
Accuracy Misses rare error traces and unexpected latency spikes. Captures 100% of errors and slow execution paths.
Memory Cost 0 bytes (no buffer required on collectors). High memory footprint; requires holding partial traces in buffers for 10 seconds.
Complexity Simple configuration; SDK makes the decision. High; requires a coordinated collector routing layer.

Telemetry Storage Engines

  • ClickHouse (Columnar Engine):
    • Pros: Extremely high write throughput; highly efficient compression of trace attributes; fast aggregations (e.g., calculating p99 latencies).
    • Cons: Poor support for full-text search or high-concurrency pointwise reads.
  • Elasticsearch (Document Indexer):
    • Pros: Native support for searching nested fields and high-cardinality metadata; fast text queries.
    • Cons: High memory footprint; expensive storage requirements due to index bloat.
  • Raw Object Storage (S3 / GCS):
    • Pros: Cheapest storage; highly durable.
    • Cons: Extremely slow query performance; not indexable without secondary index databases.

Failure Modes and Fault Tolerance Strategies

Ingestion Ring Buffer Memory Saturation

During high-load incidents, downstream databases (like ClickHouse or Elasticsearch) can slow down, causing the Kafka queues or collector buffers to fill up.

  • Mitigation: We implement non-blocking drop policies. If the collector's memory buffer exceeds 85% capacity, it drops healthy spans while preserving error spans. If buffer utilization reaches 95%, it switches to a random drop policy to protect the collector process from crashing.

Trace Parent-Child Alignment Clock Drift

Microservices running on distinct VMs may experience NTP clock drift. This can cause a child span's start time to appear earlier than its parent's start time in the UI.

  • Mitigation: The query service runs a parent-child alignment heuristic. If a child's offset is less than the parent's offset, we adjust the child's display offset to align with the parent's start time: $$\text{Adjusted Offset} = \max(\text{Child Offset}, \text{Parent Offset} + 1\text{ms})$$

Index Cardinality Explosion

Developers may accidentally attach high-cardinality keys (like full SQL queries, HTTP request bodies, or user GUIDs) as tags. This can bloat the Elasticsearch index, causing out-of-memory errors on index nodes.

  • Mitigation: We enforce an Index Allowlist. Only registered keys (e.g., tenant_id, status_code, http.method) are sent to Elasticsearch for indexing. Unregistered keys remain inside the raw span payload in ClickHouse, allowing searches by trace ID while protecting the search index from cardinality explosion.

Staff Engineer Perspective


Verbal Script

Interviewer: "How would you design a distributed tracing system to handle 1,000,000 spans per second without saturating storage?"

Candidate: "I would implement a hybrid storage architecture combined with tail-sampling.

First, I would use ClickHouse as the primary columnar database for raw spans. Its columnar format compresses repetitive attributes, reducing our storage footprint by up to 80% compared to row-oriented databases.

Second, I would separate the search index from the raw span store. I would store only indexable metadata (like service name, status, and tenant ID) in Elasticsearch, keeping high-cardinality keys in ClickHouse.

Finally, I would use tail-sampling to drop healthy, low-latency traces. We hold partial traces in collector buffers for 10 seconds. We keep 100% of traces containing errors or latency spikes, but sample only 1% of healthy requests. This captures the most useful debugging data while reducing our storage volume and costs by over 90%."

Interviewer: "How do you handle clock drift between different servers when visualizing a trace?"

Candidate: "Clock drift is common in distributed systems. When parent and child spans are generated on different hosts, NTP drift can cause a child span to appear to start before its parent.

To fix this during visualization, the query service runs a parent-child alignment heuristic. We construct the DAG using parent-child links. We then traverse the tree. If a child's start timestamp is earlier than its parent's start timestamp, we adjust the child's start offset to match the parent's start time plus 1 millisecond. This ensures the visualization is chronological while preserving the duration of the child span."


Want to track your progress?

Sign in to save your progress, track completed lessons, and pick up where you left off.