Lesson 14 of 105 13 minFlagship

Building Production Observability with OpenTelemetry and Grafana Stack

End-to-end observability implementation: distributed tracing with OpenTelemetry, metrics with Prometheus, structured logging with Loki, and the dashboards and alerts that actually help during incidents.

Reading Mode

Hide the curriculum rail and keep the lesson centered for focused reading.

Key Takeaways

  • **Logs**: Immutable, timestamped records of discrete events. (Loki, ELK).
  • **Metrics**: Aggregated numerical data (counters, gauges, histograms) over time. (Prometheus).
  • **Traces**: The journey of a single request through the entire system. (OpenTelemetry, Jaeger).
Recommended Prerequisites
System Design Interview FrameworkMicroservices vs Monolith

Premium outcome

From vague architecture answers to staff-level trade-off thinking.

Backend engineers preparing for senior, staff, and architecture rounds.

What you unlock

  • A reusable system design answer framework for ambiguous prompts
  • Clear language for consistency, scaling, and reliability trade-offs
  • Case-study depth across feeds, payments, storage, and messaging systems

In a distributed system, when something goes wrong, it is often impossible to tell why without a robust Observability stack. Observability is the measure of how well you can understand the internal state of your system based on its external outputs. In modern engineering, this translates to the integration of distributed tracing, system metrics, and structured log events into a single, unified observability plane.

This guide provides an end-to-end architectural blueprint and implementation plan for building a highly resilient, enterprise-grade observability fabric using the standard OpenTelemetry and Grafana LGTM (Loki, Grafana, Tempo, Mimir) stack.


System Requirements and Goals

Designing a distributed observability architecture for production workloads requires establishing precise operational bounds and service level objectives.

1. Functional Requirements

  • Unified Correlation: The observability stack must guarantee tight correlation between telemetry signals. A user should be able to click on a slow trace span in Grafana Tempo, view the exact structured JSON log statements written during that span in Grafana Loki, and review system metrics (CPU, database connection pool saturation) on the host pod at that millisecond using Grafana Mimir/Prometheus.
  • Standardized Protocol: All microservices must emit telemetry using open-source, vendor-agnostic protocols—specifically the OpenTelemetry Protocol (OTLP)—preventing proprietary vendor lock-in.
  • Context Propagation: Distributed trace context must propagate seamlessly across network boundaries (HTTP, gRPC, and asynchronous message brokers like Apache Kafka).

2. Non-Functional Requirements

  • Ingestion Latency: Telemetry data must be aggregated, indexed, and queryable within Grafana dashboards in under $5$ seconds from the moment of execution.
  • Workload Overhead: The observability agent sidecar and instrumentation libraries must not increase service CPU consumption by more than $2%$ or add more than $10\text{ MB}$ of memory overhead per host container pod.
  • Retention and Storage tiering: Telemetry must be retained with structured cost controls:
    • Metrics: 30 days retention (low storage footprint time-series).
    • Logs: 14 days retention (compressed and indexed).
    • Traces: 7 days retention (due to massive raw JSON payload size).

High-Level Design Architecture

Distributed observability requires decoupling telemetry generation from ingestion pipelines. The architecture below showcases the telemetry flow: Microservices push metrics, logs, and traces to a localized OpenTelemetry Collector sidecar. The sidecars process, batch, and compress the payload before forwarding it to a centralized, horizontally scalable OTel Collector Gateway Fleet. This fleet distributes the signals to the respective storage engines.

graph TD
    %% Define Nodes
    subgraph "Workload Pods (EKS Cluster)"
        AppA[Billing Microservice]
        AppB[Order Microservice]
        SidecarA[Local OTel Collector Agent]
        SidecarB[Local OTel Collector Agent]
    end

    subgraph "Collector Gateway Layer"
        GW[OTel Collector Gateway Fleet]
        Buffer[Apache Kafka Telemetry Queue]
    end

    subgraph "Storage & Indexing Engines"
        Mimir[(Grafana Mimir - Metrics)]
        Loki[(Grafana Loki - Logs)]
        Tempo[(Grafana Tempo - Traces)]
    end

    subgraph "Visualization & Alerting"
        Grafana[Grafana Visualization Server]
        AM[Alertmanager Engine]
    end

    %% Node Connections
    AppA -->|OTLP gRPC| SidecarA
    AppB -->|OTLP gRPC| SidecarB
    
    SidecarA -->|Compressed OTLP| GW
    SidecarB -->|Compressed OTLP| GW
    
    GW -->|Trace Buffering| Buffer
    Buffer -->|Consume & Store| Tempo
    
    GW -->|Write Metrics| Mimir
    GW -->|Write Logs| Loki
    
    Mimir --> Grafana
    Loki --> Grafana
    Tempo --> Grafana
    
    Mimir -->|Evaluate PromQL Rules| AM
    AM -->|Slack / PagerDuty Alerts| Ops[On-Call Engineer]

    %% Styling
    classDef workload fill:#2c3e50,stroke:#fff,stroke-width:1px,color:#fff;
    classDef collector fill:#e67e22,stroke:#fff,stroke-width:2px,color:#fff;
    classDef storage fill:#27ae60,stroke:#fff,stroke-width:1px,color:#fff;
    classDef visual fill:#8e44ad,stroke:#fff,stroke-width:1px,color:#fff;
    
    class AppA,AppB,SidecarA,SidecarB workload;
    class GW,Buffer collector;
    class Mimir,Loki,Tempo storage;
    class Grafana,AM visual;

Telemetry Pipeline Correlation Flow

To resolve distributed incident issues quickly, engineers must trace the life cycle of a query across components using correlation keys. The diagram below illustrates how a single HTTP trace token correlates logs and metrics:

sequenceDiagram
    autonumber
    participant Client
    participant API Gateway
    participant Order Service
    participant Database

    Client->>API Gateway: GET /order/checkout (No headers)
    Note over API Gateway: Generates TraceID: 9a3e21c8...<br/>Generates SpanID: 4f12...
    API Gateway->>Order Service: GET /order/checkout [Header: traceparent]
    Note over Order Service: Extracts Context.<br/>Writes Log: "Processing checkout"<br/>correlated with TraceID: 9a3e21c8...
    Order Service->>Database: SQL SELECT * FROM inventory
    Note over Database: Driver records database execution latency metric<br/>tagged with TraceID: 9a3e21c8...
    Database-->>Order Service: Returns Inventory State
    Order Service-->>API Gateway: Returns Order State
    API Gateway-->>Client: HTTP 200 OK (Response)
    Note over API Gateway: OpenTelemetry exports full trace parent-child<br/>graph to Tempo. Logs to Loki.

API Design and Interface Contracts

Standardizing API interface contracts guarantees that heterogeneous microservices communicate telemetry seamlessly.

1. The W3C Trace Context Header standard

To propagate context across HTTP boundaries, services must inject and extract the standardized traceparent HTTP header.

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

Breakdown of the Traceparent Header:

  • 00: Protocol version (currently 00).
  • 4bf92f3577b34da6a3ce929d0e0e4736: Trace ID ($16$ bytes hex string, unique across the entire distributed query path).
  • 00f067aa0ba902b7: Parent Span ID ($8$ bytes hex string, unique for this specific node execution span).
  • 01: Trace flags ($1$ byte bitmask. 01 means "sampled", meaning the trace was selected for storage).

2. Structured JSON Log Payload Contract

To correlate logs with traces inside Grafana Loki, application logs must be serialized to JSON with explicit trace tags:

{
  "timestamp": "2026-05-22T23:22:45.992Z",
  "level": "ERROR",
  "service.name": "order-service",
  "service.version": "v1.2.4",
  "environment": "production",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "message": "Payment processing failed: insufficient funds.",
  "exception": {
    "type": "InsufficientFundsException",
    "message": "User wallet balance is below requested transaction amount.",
    "stacktrace": "at com.codesprintpro.OrderService.chargeWallet(OrderService.java:104)\n at com.codesprintpro.OrderController.checkout(OrderController.java:42)"
  }
}

3. OpenTelemetry Collector YAML Configuration

Below is the system configuration blueprint for an OpenTelemetry Collector Gateway. It receives telemetry over gRPC/HTTP OTLP, processes batch memory optimization limits, filters logs, and routes data to Tempo, Mimir, and Loki backends.

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    send_batch_size: 8192
    timeout: 1s
    send_batch_max_size: 16384
  
  memory_limiter:
    check_interval: 1s
    limit_percentage: 75
    spike_limit_percentage: 20

  tail_sampling:
    decision_wait: 10s
    num_traces: 10000
    expected_new_traces_per_sec: 2000
    policies:
      [
        {
          name: filter_errors,
          type: status_code,
          status_code: {status_codes: [ERROR]}
        },
        {
          name: filter_latency,
          type: numeric_attribute,
          numeric_attribute: {key: "http.status_code", value_condition: {greater_than_or_equal: 500}}
        },
        {
          name: rate_limit_successes,
          type: probabilistic,
          probabilistic: {sampling_percentage: 1.0}
        }
      ]

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: "production"
  
  otlp/tempo:
    endpoint: "tempo-ingester.tempo:4317"
    tls:
      insecure: true

  loki:
    endpoint: "http://loki-write.loki:3100/loki/api/v1/push"
    tenant_id: "production-logs"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/tempo]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]

Low-Level Design & Component Mechanics

To ensure clean execution and low latency, we must implement custom trace capture routines in application code.

Java Spring Boot Observability: Custom Trace Context & Manual Span Generation

Below is a highly optimized Java component using the standard OpenTelemetry SDK. It demonstrates how to manually propagate context, generate child spans, inject custom enterprise attributes, and record runtime exceptions safely.

package com.codesprintpro.observability;

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.context.Scope;
import io.opentelemetry.context.Context;
import org.springframework.stereotype.Service;

import javax.annotation.PostConstruct;

@Service
public class OrderProcessingObservability {

    private Tracer tracer;

    @PostConstruct
    public void init() {
        // Initialize Tracer using Global OpenTelemetry registry
        this.tracer = GlobalOpenTelemetry.getTracer("com.codesprintpro.order", "1.2.4");
    }

    public String processPaymentWithTrace(String orderId, double amount, String customerTier) {
        // Retrieve current active context or fallback to root context
        Context parentContext = Context.current();
        
        // Start a custom trace child span
        Span paymentSpan = tracer.spanBuilder("ProcessPaymentTransaction")
                .setParent(parentContext)
                .setAttribute("order.id", orderId)
                .setAttribute("payment.amount", amount)
                .setAttribute("customer.tier", customerTier)
                .startSpan();

        // Establish the execution Scope to enable thread local context access
        try (Scope scope = paymentSpan.makeCurrent()) {
            
            // Execute business logic simulated here
            String transactionId = executePaymentGatewayCall(orderId, amount);
            
            paymentSpan.setAttribute("payment.transaction_id", transactionId);
            paymentSpan.setStatus(StatusCode.OK, "Payment executed successfully");
            
            return transactionId;

        } catch (IllegalArgumentException ex) {
            // Record exception specifics onto the trace span
            paymentSpan.recordException(ex);
            paymentSpan.setStatus(StatusCode.ERROR, ex.getMessage());
            paymentSpan.setAttribute("payment.error_code", "INVALID_ARGUMENTS");
            throw ex;
            
        } catch (Exception ex) {
            paymentSpan.recordException(ex);
            paymentSpan.setStatus(StatusCode.ERROR, "Unexpected system crash occurred");
            throw new RuntimeException("System processing exception: " + ex.getMessage(), ex);
            
        } finally {
            // Guarantee span completion is called to flush data to sidecar
            paymentSpan.end();
        }
    }

    private String executePaymentGatewayCall(String orderId, double amount) {
        if (amount <= 0) {
            throw new IllegalArgumentException("Amount must be greater than zero");
        }
        return "TXN-" + System.currentTimeMillis();
    }
}

Scaling Challenges & Production Bottlenecks

1. Scaling Telemetry Storage Systems

At large-scale operations ($100,000$ concurrent user requests per second), raw telemetry files generate unsustainable ingestion pressure.

Back-of-the-Envelope Estimation:

  • Assumed Constants:
    • Ingress Rate: $50,000$ HTTP requests/sec.
    • Spans per request: $6$ spans (Gateway $\to$ Auth $\to$ Orders $\to$ Payments $\to$ Database $\to$ Cache).
    • Size per span: $750$ bytes of structured JSON attributes.
    • Log lines per request: $8$ log statements at $300$ bytes each.
  • Tracing Throughput Calculation: $$\text{Spans per sec} = 50,000 \times 6 = 300,000\text{ spans/sec}$$ $$\text{Throughput} = 300,000 \times 750\text{ bytes} = 225,000,000\text{ bytes/sec} = 225\text{ MB/s}$$ $$\text{Daily Storage} = 225\text{ MB/s} \times 86,400\text{ seconds} \approx 19.44\text{ Terabytes per day}$$
  • Log Ingestion Throughput Calculation: $$\text{Log throughput} = 50,000 \times 8 \times 300\text{ bytes} = 120,000,000\text{ bytes/sec} = 120\text{ MB/s}$$ $$\text{Daily Log Storage} = 120\text{ MB/s} \times 86,400\text{ seconds} \approx 10.37\text{ Terabytes per day}$$
  • Total Raw Data: $\approx 30\text{ Terabytes per day}$ of telemetry.
  • Storage Cost: At $$0.023$ per GB on standard AWS S3, storing this without optimization costs thousands of dollars daily.

2. Operational Scaling Solutions

  • Adaptive Sampling: Do not capture $100%$ of successful traces. Implement Head-Based Probabilistic Sampling to keep only $1%$ of normal requests ($200\text{ OK}$). Scale up to Tail-Based Sampling to retain $100%$ of error traces ($5\text{xx}$ or $4\text{xx}$) and any request with a latency exceeding $1500\text{ ms}$.
  • Index Cardinality Optimization: In Grafana Loki, avoid extracting dynamic variables (like user_id or order_id) as index labels. This creates high-cardinality label space that crashes memory caches. Keep labels broad (e.g., service, env, namespace), and parse high-cardinality fields at query time using Loki LogQL line filters.

Technical Trade-offs & Strategic Compromises

When designing systems observability, engineering teams must balance resource costs with resolution precision.

Observability Decision Pros Cons Performance / Cost Matrix
Head-Based Sampling (Decided at request entry) * Extremely cost-effective; drops traffic before network transmission.
* Low CPU/Memory requirements on collectors.
* Can drop rare, critical edge-case errors if they happen within the unsampled request bucket. * Ingestion Costs: Very Low
* Memory Cost: Negligible
Tail-Based Sampling (Decided at request exit) * Guarantees $100%$ capturing of errors and long-tail latency anomalies.
* Ideal for debugging intermittent bugs.
* High memory overhead on OTel Collectors, as all spans must be held in memory until the root span terminates. * Ingestion Costs: Medium
* Memory Cost: Extremely High
Auto-Instrumentation (Java Agents / Node wrappers) * Zero code modifications required.
* Instant boot-up time and broad standard framework metrics.
* High overhead; instruments everything, including large nested libraries.
* Pulls in unused metrics that bloat storage index files.
* Development Cost: Zero
* CPU Overhead: Medium-High
Manual Instrumentation (Explicit OpenTelemetry API) * Absolute precision; only captures critical performance boundaries.
* Extremely low CPU overhead.
* Requires developer time.
* High risk of code drift or developers missing system errors.
* Development Cost: High
* CPU Overhead: Extremely Low

Failure Scenarios and Fault Tolerance

1. Collector Buffer Bloat and Out of Memory (OOM)

If downstream telemetry storage (e.g., Grafana Loki/Tempo) degrades or suffers network segmentation, local collector containers will queue telemetry payloads in memory buffers. If left unchecked, this triggers an Out-Of-Memory (OOM) crash loop across the collector sidecars.

Fault-Tolerance Mitigation:

  1. Configure the memory_limiter processor in the OpenTelemetry YAML file to drop telemetry data proactively when memory usage reaches $80%$ of container limits.
  2. Configure sidecar agents to store backlogged metrics on local disk buffers (disk_to_write_on_overflow) rather than memory queues.

2. Cascading Failure from Diagnostic Loops

If a microservice deep health check /health/deep executes real-time database query scans, load balancers checking this route every $500\text{ ms}$ across $100$ nodes will generate high database traffic. If the database starts running slowly, health checks take longer, connection pools saturate, and this diagnostics overhead crashes the system completely.

Fault-Tolerance Mitigation:

Implement a caching bridge for the health status. The /health/deep path must read the health state from a lightweight memory buffer which is updated asynchronously by a background cron job once every $15$ seconds, decoupling load balancer verification from real-time query executions.


Staff Engineer Perspective

<!-- Logback configuration for high-volume non-blocking logs -->
<appender name="ASYNC_LOKI" class="ch.qos.logback.classic.AsyncAppender">
    <appender-ref ref="LOKI" />
    <queueSize>10240</queueSize>
    <!-- Keep critical error logs, drop low severity INFO/DEBUG logs if queue is 80% full -->
    <discardingThreshold>20</discardingThreshold>
    <neverBlock>true</neverBlock>
</appender>

Verbal Script & Mock Interview

Verbal Script: Production Observability Design

Interviewer: "How would you design a robust, cost-effective observability stack for a multi-region microservice architecture processing 50,000 requests per second?"

Candidate: "To design a resilient, high-throughput observability platform at $50,000\text{ requests/sec}$, I would implement a decoupled, agent-collector-gateway architecture utilizing OpenTelemetry and the Grafana LGTM stack.

First, I would deploy a local OpenTelemetry Collector Agent as a DaemonSet on every host node. This agent receives OTLP telemetry locally over fast gRPC. This keeps microservices stateless and lightweight: instead of writing telemetry to remote endpoints, they push locally to the node. The DaemonSet batches, compresses, and forwards this data to a horizontally scalable OTel Collector Gateway Fleet running behind a Network Load Balancer.

Second, to address cost and storage constraints, capturing $100%$ of traces at $50\text{k req/s}$ is economically unsustainable and operationally unnecessary. I would implement an Adaptive Sampling strategy at the Gateway layer:

  1. Head-Based Probabilistic Sampling immediately at the edge router to discard $99%$ of healthy $200\text{ OK}$ traces.
  2. Tail-Based Sampling in the Collector Gateway to hold trace spans in memory buffers temporarily. If a request returns a $5\text{xx}$ error, a $4\text{xx}$ status, or its latency exceeds $1,000\text{ ms}$, we capture and store $100%$ of those traces. This captures all system anomalies while reducing daily trace volumes by over $90%$.

Third, to ensure rapid incident triage (lowering MTTR), I would enforce strict telemetry correlation. I would configure the logging framework to inject the active trace_id and span_id into every structured JSON log line. Inside Grafana, this enables Trace-to-Logs and Trace-to-Metrics correlations: when an engineer clicks on a slow trace segment, they are presented with a direct button that instantly queries Loki for logs containing that identical trace_id and queries Mimir for host resource utilization metrics at that precise millisecond, eliminating guesswork during an incident.

Finally, to make the system fault-tolerant, I would configure the collectors' memory_limiter processor to drop telemetry gracefully under severe system load, ensuring that telemetry pipelines can never consume host resources to the point where they starve primary application containers."


Want to track your progress?

Sign in to save your progress, track completed lessons, and pick up where you left off.