System Design: Building a Secrets Management Platform

Secrets management looks like a simple storage problem until you operate it in production.

At first, the requirement sounds simple: "Store API keys and database passwords somewhere safe." Then real systems arrive. Services need different values per environment. Developers need break-glass access. Kubernetes workloads need secrets without restarting every pod. Rotations must happen without downtime. Audit logs must answer who read what and why. A leaked secret must be revoked quickly. The platform must be available during deploys, but it cannot become a convenient place for everyone to dump plaintext credentials.

This guide designs a secrets management platform from first principles. It covers secret models, envelope encryption, KMS integration, access control, versioning, rotation, audit logs, delivery into applications, caching, Kubernetes integration, observability, and failure modes.

Requirements and System Goals

A secrets management platform serves as the root of trust for workload authentication and data protection. It must enforce security boundaries while remaining resilient to infrastructure outages.

Functional Requirements

Envelope Encryption: Secrets must be encrypted using data-encryption keys (DEKs), which are in turn protected by key-encryption keys (KEKs) hosted in a hardware security module (HSM) or cloud KMS.
Dynamic Lease Generation: Generate dynamic credentials (e.g., temporary database users) that are automatically revoked when their lease expires.
Strict Role-Based Access Control (RBAC): Support access policies based on workload identity, environment, namespace, and actions.
Immutable Audit Trails: Write audit records for every read, write, deny, and break-glass operation.
Zero-Downtime Rotation: Allow secrets to be rotated by maintaining an overlap window where old and new versions remain active.
Version Control & Promotion: Support versioned secrets and alias tracking (e.g., current, previous, next).
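
One way to read the access-control requirement is as a default-deny rule match over identity, environment, path, and action. The sketch below is illustrative only: the `Rule` fields, the glob-style path patterns, the example SPIFFE ID, and the `is_allowed` helper are assumptions made here, not part of a defined policy language for this platform.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Rule:
    """Hypothetical policy rule; field names are illustrative."""
    identity: str        # workload identity, e.g. a SPIFFE ID
    environment: str     # 'dev', 'staging', 'prod'
    path_pattern: str    # glob over secret paths; '*' also matches '/'
    actions: frozenset   # e.g. {'READ'}


def is_allowed(rules, identity, environment, path, action):
    """Default deny: allow only if some rule matches every attribute."""
    for rule in rules:
        if (rule.identity == identity
                and rule.environment == environment
                and fnmatchcase(path, rule.path_pattern)
                and action in rule.actions):
            return True
    return False


rules = [
    Rule("spiffe://example.org/payments-api", "prod",
         "payments/*", frozenset({"READ"})),
]
```

With these rules, the payments workload can read `payments/stripe-api-key` in `prod`, while a write, a different environment, or a different identity falls through to the default deny.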

Non-Functional Requirements

Sub-10ms Read Latency: Retrieve secrets quickly, relying on secure client-side caches to avoid synchronous roundtrips to the central store during request processing.
High Availability & Fallback: Support regional replication of encrypted secrets. If the primary region goes offline, replica hosts must continue serving read requests.
Fail-Closed Security: If the metadata store or the audit logger is unreachable, the API gateway must fail closed, blocking secret retrieval to prevent unauthorized or unaudited access.
Secure Memory Cleaning: Wipe cryptographic keys and plaintext secrets from server memory as soon as they are no longer needed; cached data keys are wiped when their cache entry expires.

API Interfaces and Service Contracts

Workloads interact with the secrets platform using REST APIs for administration and gRPC for high-throughput runtime retrievals.

Register a New Secret Path

Endpoint: POST /v1/secrets
Request Payload:

{
  "environment": "prod",
  "path": "payments/stripe-api-key",
  "ownerTeam": "payments-core",
  "rotationStrategy": "automated-30d"
}

Response Payload (HTTP 201 Created):

{
  "secretId": "sec_091f-b283-4a11",
  "path": "prod/payments/stripe-api-key",
  "status": "ACTIVE",
  "createdAt": "2026-06-06T14:30:00Z"
}

Propose Secret Value Version

Endpoint: POST /v1/secrets/sec_091f-b283-4a11/versions
Request Payload:

{
  "plaintextValue": "sk_live_51M..."
}

Response Payload (HTTP 201 Created):

{
  "versionNumber": 3,
  "status": "PENDING",
  "sha256Fingerprint": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
}

Promote Version to Current Alias

Endpoint: POST /v1/secrets/sec_091f-b283-4a11/aliases/current
Request Payload:

{
  "versionNumber": 3,
  "expectedPreviousVersion": 2
}

Response Payload (HTTP 200 OK):

{
  "alias": "current",
  "versionNumber": 3,
  "updatedAt": "2026-06-06T14:31:00Z"
}
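
The `expectedPreviousVersion` field suggests an optimistic, compare-and-swap style promotion: the alias moves only if it still points at the version the caller last saw. The following is a minimal sketch of that idea, using an in-memory SQLite table purely to stay self-contained (the design here uses PostgreSQL); the `promote_alias` helper and the simplified table are assumptions.

```python
import sqlite3


def promote_alias(conn, secret_id, alias, new_version, expected_previous):
    """Move the alias only if it still points at expected_previous.

    Returns True on success, False if another writer moved it first.
    """
    cur = conn.execute(
        "UPDATE secret_aliases SET version_number = ? "
        "WHERE secret_id = ? AND alias_name = ? AND version_number = ?",
        (new_version, secret_id, alias, expected_previous),
    )
    conn.commit()
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE secret_aliases ("
    " secret_id TEXT NOT NULL, alias_name TEXT NOT NULL,"
    " version_number INTEGER NOT NULL,"
    " PRIMARY KEY (secret_id, alias_name))"
)
conn.execute(
    "INSERT INTO secret_aliases VALUES ('sec_091f-b283-4a11', 'current', 2)"
)

# The first caller saw version 2 and wins; the second caller's view is stale.
first = promote_alias(conn, "sec_091f-b283-4a11", "current", 3, 2)
stale = promote_alias(conn, "sec_091f-b283-4a11", "current", 4, 2)
```

A stale promotion leaves the alias untouched. An API built this way would probably surface it as a conflict response so the caller can re-read the alias and retry.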

Workload Ingress gRPC Service Contract

Workloads retrieve secrets using gRPC to minimize latency and connection overhead:

syntax = "proto3";

package codesprintpro.secrets.v1;

service SecretDeliveryService {
  rpc GetSecret (SecretRequest) returns (SecretResponse);
  rpc WatchSecretChanges (WatchRequest) returns (stream SecretChangeEvent);
}

message SecretRequest {
  string secret_path = 1;         // e.g. "prod/payments/stripe-api-key"
  string client_token = 2;        // Workload identity token (OIDC / SPIFFE)
  string version_or_alias = 3;    // e.g. "current" or "v3"
}

message SecretResponse {
  string secret_path = 1;
  int64 version_number = 2;
  string plaintext_value = 3;
  int64 expires_at_unix = 4;
}

message WatchRequest {
  repeated string secret_paths = 1;
  string client_token = 2;
}

message SecretChangeEvent {
  string secret_path = 1;
  int64 new_version_number = 2;
  string alias = 3;
}
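
The `version_or_alias` field carries either an alias name or an explicit version. A server might resolve it roughly as follows; the `parse_selector` helper and its strictness (rejecting anything that is not a known alias or `v` followed by a positive integer) are assumptions, since the contract above only gives "current" and "v3" as examples.

```python
import re

ALIASES = {"current", "previous", "next"}
_VERSION = re.compile(r"v([1-9][0-9]*)")


def parse_selector(selector):
    """Return ('alias', name) or ('version', number); reject anything else."""
    if selector in ALIASES:
        return ("alias", selector)
    match = _VERSION.fullmatch(selector)
    if match:
        return ("version", int(match.group(1)))
    raise ValueError(f"unrecognized version_or_alias: {selector!r}")
```

Rejecting unknown selectors up front keeps malformed input away from the database lookup.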

High-Level Design and Visualizations

Our secrets management architecture separates secret metadata and ciphertext storage from key operations, which are routed through a managed KMS.

Secrets API Access and Decryption Pipeline

flowchart TD
    App[Workload Container App] -->|1. Get Secret request with OIDC Token| Gateway[Secrets API Gateway]
    Gateway -->|2. Verify Caller Identity| IdentityService[OIDC / Workload Identity Validator]
    Gateway -->|3. Evaluate Policy| PolicyEngine[RBAC Policy Engine]
    
    PolicyEngine -->|4. Allow / Deny| Gateway
    Gateway -->|5. Fetch Ciphertext & Key ID| MetadataDB[(Metadata PostgreSQL)]
    
    subgraph Cryptography [KMS Envelope Decryption]
        Gateway -->|6. Decrypt Data Key request| KMS[Cloud KMS / HSM Cluster]
        KMS -->|7. Decrypted Data Key DEK| Gateway
    end
    
    Gateway -->|8. Decrypt Ciphertext in Memory| Decryptor[AES-GCM Decryption Worker]
    Gateway -->|9. Write Audit Record| AuditLog[Append-only Audit Stream]
    
    AuditLog -->|10. Write ACK| Gateway
    Gateway -->|11. Return Plaintext Secret| App

Version Rotation Sequencing with Overlap Window

During credential rotation, old and new credentials must overlap to ensure that running workloads do not experience database or API authentication failures.

sequenceDiagram
    actor Admin as Security Team
    participant Vault as Secrets Platform
    participant DB as Production Database
    participant App as App Container Pool
    
    Admin->>Vault: 1. Create version v2 (Status: PENDING)
    Vault->>DB: 2. Create secondary database user (e.g. app_v2 with new password)
    DB-->>Vault: 3. User created successfully
    Vault->>Vault: 4. Point the 'next' alias at v2
    
    Note over App: App instances run using 'current' (v1 credentials)
    Vault->>App: 5. Signal configuration update: new 'next' credential available
    App->>Vault: 6. Fetch 'next' (v2) credential & store in secondary cache
    
    Note over App: App begins validating connections using both v1 and v2 credentials
    Admin->>Vault: 7. Point the 'current' alias at v2 (v1 becomes 'previous')
    Vault->>App: 8. Signal update: 'current' is now v2
    App->>Vault: 9. Fetch 'current' (v2) and make it primary connection
    
    Note over App: Wait for SDK cache TTL and connection pool migration (e.g. 15 minutes)
    Admin->>Vault: 10. Mark v1 as DISABLED
    Vault->>DB: 11. Drop database user app_v1
    DB-->>Vault: 12. User app_v1 dropped
    Vault->>Vault: 13. Mark v1 as DESTROYED
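
The lifecycle in this sequence can be enforced as a small state machine so that, for example, a version still in use is never destroyed directly. The sketch below uses the status names from the rotation flow. The rollback edge (PREVIOUS back to CURRENT) and the ability to disable a PENDING version are assumptions added for illustration.

```python
ALLOWED_TRANSITIONS = {
    "PENDING": {"CURRENT", "DISABLED"},     # DISABLED: abandon a bad candidate (assumed)
    "CURRENT": {"PREVIOUS"},
    "PREVIOUS": {"CURRENT", "DISABLED"},    # CURRENT: rollback (assumed)
    "DISABLED": {"DESTROYED"},
    "DESTROYED": set(),                     # terminal state
}


def transition(status, new_status):
    """Return new_status if the move is legal, otherwise raise ValueError."""
    if new_status not in ALLOWED_TRANSITIONS.get(status, set()):
        raise ValueError(f"illegal transition {status} -> {new_status}")
    return new_status


# Walk v1 through the tail of the rotation shown above.
v1 = "CURRENT"
for step in ("PREVIOUS", "DISABLED", "DESTROYED"):
    v1 = transition(v1, step)
```

Requiring DISABLED before DESTROYED preserves the overlap window: a version can be switched off and observed for errors before its ciphertext is removed.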

Low-Level Design and Schema Strategies

We use PostgreSQL to store secret metadata and encrypted secret versions. Separate tables hold secret metadata, version history, alias pointers to the active versions, and audit records.

PostgreSQL Table DDLs

-- Core secret metadata configuration
CREATE TABLE secrets (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id VARCHAR(64) NOT NULL,
    environment VARCHAR(32) NOT NULL,            -- 'dev', 'staging', 'prod'
    path VARCHAR(512) NOT NULL,                  -- e.g. 'payments/stripe-api-key'
    owner_team VARCHAR(128) NOT NULL,
    rotation_strategy VARCHAR(64) NOT NULL,      -- 'manual', 'automated-30d'
    created_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    CONSTRAINT uk_tenant_env_path UNIQUE (tenant_id, environment, path)
);

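-- Index for path-prefix lookups (e.g. path LIKE 'payments/%').
-- The UNIQUE constraint above already indexes exact matches on these columns;
-- varchar_pattern_ops lets the B-tree serve prefix patterns under non-C collations.
CREATE INDEX idx_secrets_path_lookup 
ON secrets (tenant_id, environment, path varchar_pattern_ops);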

-- Versioned historical table holding encrypted secrets and data keys
CREATE TABLE secret_versions (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    secret_id UUID NOT NULL REFERENCES secrets(id) ON DELETE CASCADE,
    version_number BIGINT NOT NULL,
    ciphertext BYTEA NOT NULL,                   -- Encrypted secret value
    encrypted_data_key BYTEA NOT NULL,           -- Encrypted DEK protected by KMS
    key_id VARCHAR(256) NOT NULL,                -- KMS key reference (KEK)
    status VARCHAR(32) NOT NULL,                 -- 'PENDING', 'CURRENT', 'PREVIOUS', 'DISABLED', 'DESTROYED'
    created_by VARCHAR(128) NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    expires_at TIMESTAMPTZ,
    CONSTRAINT uk_secret_version UNIQUE (secret_id, version_number)
);

-- Indexing to quickly retrieve the latest active versions
CREATE INDEX idx_secret_versions_lookup 
ON secret_versions (secret_id, status, version_number DESC);

-- Pointers to version maps (aliases)
CREATE TABLE secret_aliases (
    secret_id UUID NOT NULL REFERENCES secrets(id) ON DELETE CASCADE,
    alias_name VARCHAR(64) NOT NULL,             -- 'current', 'previous', 'next'
    version_number BIGINT NOT NULL,
    updated_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (secret_id, alias_name)
);

-- Audit log recording access events
CREATE TABLE secret_access_audits (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    secret_id UUID NOT NULL REFERENCES secrets(id) ON DELETE RESTRICT,
    actor_identity VARCHAR(256) NOT NULL,         -- Workload ID or operator email
    action VARCHAR(64) NOT NULL,                  -- 'READ', 'WRITE', 'ROTATE', 'DESTROY'
    decision VARCHAR(32) NOT NULL,                -- 'ALLOW', 'DENY'
    client_ip VARCHAR(45) NOT NULL,
    request_id VARCHAR(128) NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_secret_audits_lookup 
ON secret_access_audits (secret_id, created_at DESC);

Scaling and Operational Challenges

To design a secrets management platform that remains stable under massive workloads, we must evaluate KMS throughput budgets and thundering herd scenarios.

Back-of-the-Envelope Capacity Estimations

Let us evaluate the platform during a massive cluster deployment containing 10,000 application pods booting concurrently.

Boot-Time Secret Requests: Assume each pod requires 5 secrets on startup.
Total Secret Requests: $$\text{Total Requests} = 10{,}000 \text{ pods} \times 5 = 50{,}000 \text{ secrets}$$
Boot Window Duration: Assume the scheduler boots the pods over a 10-second window.
Target Request Rate: $$\text{Request rate} = \frac{50{,}000 \text{ secrets}}{10 \text{ seconds}} = 5{,}000 \text{ requests/second}$$
KMS API Limits: Managed KMS services (e.g., AWS KMS) enforce per-account request quotas that vary by region and key type; assume 10,000 decrypt operations/second for this estimate. A decrypt rate of 5,000 requests/second would consume 50% of that quota, so if other systems share the KMS account, the burst can trigger throttling errors.
Mitigating Decrypt Spikes: To protect KMS limits:
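1. We cache decrypted data keys (DEKs) in the API gateway's memory with a short TTL (5 minutes). Because many pods request the same secrets, and one DEK may protect several secrets or versions, each distinct DEK needs at most one KMS call per TTL window on a given gateway instance. Assuming about 5,000 distinct DEKs in active use on a single gateway instance (an illustrative figure): $$\text{KMS Decrypt Rate} \approx \frac{\text{Unique DEKs}}{300\text{ seconds}} = \frac{5{,}000}{300\text{ seconds}} \approx 17 \text{ decrypts/second}$$ This is roughly 99.7% below the uncached rate of 5,000 decrypts/second.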
2. Client-side caching: Client SDKs cache decrypted secrets locally in memory with a jittered 2-minute TTL, avoiding calls to the secrets platform entirely during request processing.
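
Both mitigations rest on the same building block: a TTL cache whose expiry is jittered so that entries do not all expire at once. The sketch below is a simplified, single-threaded illustration; `TtlCache`, `FakeKms`, and the choice of 5 distinct DEKs are assumptions, and a real gateway would also need locking, request coalescing, and memory wiping on eviction.

```python
import random


class TtlCache:
    """In-memory cache whose entries expire after ttl plus random jitter."""

    def __init__(self, ttl_seconds, jitter_seconds, clock, rng=None):
        self._ttl = ttl_seconds
        self._jitter = jitter_seconds
        self._clock = clock              # callable returning seconds
        self._rng = rng or random.Random()
        self._entries = {}               # key -> (value, expires_at)

    def get_or_load(self, key, loader):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[1] > now:
            return hit[0]
        value = loader(key)              # cache miss: one upstream call
        expires_at = now + self._ttl + self._rng.uniform(0, self._jitter)
        self._entries[key] = (value, expires_at)
        return value


class FakeKms:
    """Stand-in for a KMS client; it only counts decrypt calls."""

    def __init__(self):
        self.calls = 0

    def decrypt(self, encrypted_dek):
        self.calls += 1
        return b"plaintext-dek-for-" + encrypted_dek


now = [0.0]                              # controllable clock for the demo
kms = FakeKms()
cache = TtlCache(ttl_seconds=300, jitter_seconds=30, clock=lambda: now[0])

# 50,000 boot-time reads that share 5 distinct encrypted DEKs.
for i in range(50_000):
    cache.get_or_load(b"dek-%d" % (i % 5), kms.decrypt)
calls_during_boot = kms.calls

now[0] = 400.0                           # past ttl + max jitter: entries expired
cache.get_or_load(b"dek-0", kms.decrypt)
```

In this toy run the boot burst costs one KMS call per distinct DEK rather than one per request, and a further call happens only after an entry expires.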

Trade-offs and Architectural Alternatives

Runtime Secret Delivery: Direct SDK vs. Sidecar Volume Injection

Dimension	Direct SDK Fetch	Sidecar Volume Injection
Workload Integration	Requires application changes; language-specific SDK libraries.	Language-agnostic; secrets are written to memory-backed files (e.g., tmpfs).
Rotation Support	Simple; the SDK handles cache expiry and re-fetches updated secrets automatically.	Requires file watchers or application reload paths to detect file updates.
Audit Fidelity	High; each read request includes the workload's identity token.	Low; the sidecar retrieves all secrets at once, hiding which fields the application reads.
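
For the sidecar option, the application still needs a reload path. A minimal polling sketch is shown below, assuming the sidecar replaces the file atomically (write to a temporary file, then rename). `FileSecret` and `write_atomically` are hypothetical helpers, not part of any particular sidecar.

```python
import os
import tempfile


class FileSecret:
    """Re-reads a secret file when its modification metadata changes."""

    def __init__(self, path):
        self._path = path
        self._stamp = None
        self._value = None

    def get(self):
        st = os.stat(self._path)
        stamp = (st.st_mtime_ns, st.st_size, st.st_ino)
        if stamp != self._stamp:
            with open(self._path, "rb") as f:
                self._value = f.read()
            self._stamp = stamp
        return self._value


def write_atomically(path, data):
    """What a sidecar might do: write a temp file, then rename over the target."""
    directory = os.path.dirname(path)
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.replace(tmp, path)    # readers see either the old or the new file
```

Calling `get()` each time a connection is opened keeps the check cheap (one `stat` call) and avoids a dependency on platform-specific file-watch APIs.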

Secret Types: Static Versioned vs. Dynamic Leased Secrets

Static Versioned Secrets:
- Pros: Simple design; low database overhead; easy to implement and debug.
- Cons: If a secret is leaked, it remains valid until rotated manually; rotation requires updating downstream systems.
Dynamic Leased Secrets:
- Pros: Temporary credentials; automatic expiration reduces the blast radius of leaks.
- Cons: Requires deep integration with downstream systems (databases, APIs); high database overhead from continuous user creation and deletion.
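
The lease mechanics behind dynamic secrets can be sketched as a priority queue ordered by expiry time. This is an in-memory illustration only: `LeaseManager`, its `revoke` callback, and the credential names are assumptions, and a real implementation would persist leases and handle renewal and failed revocations.

```python
import heapq


class LeaseManager:
    """Tracks leases and revokes credentials whose lease has expired."""

    def __init__(self, revoke, clock):
        self._revoke = revoke    # callable(credential_id), e.g. drops a DB user
        self._clock = clock      # callable returning seconds
        self._heap = []          # (expires_at, credential_id), soonest first

    def issue(self, credential_id, ttl_seconds):
        expires_at = self._clock() + ttl_seconds
        heapq.heappush(self._heap, (expires_at, credential_id))
        return expires_at

    def revoke_expired(self):
        """Revoke every lease with expires_at <= now; return their ids."""
        now = self._clock()
        revoked = []
        while self._heap and self._heap[0][0] <= now:
            _, credential_id = heapq.heappop(self._heap)
            self._revoke(credential_id)
            revoked.append(credential_id)
        return revoked
```

A background task would call `revoke_expired()` periodically; the heap keeps each sweep proportional to the number of leases that actually expired.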

Failure Modes and Fault Tolerance Strategies

KMS Provider Outage and Decryption Failures

If the cloud KMS provider is unreachable, the secrets platform cannot decrypt DEKs, blocking secret retrieval.

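Mitigation: We cache decrypted DEKs in an encrypted in-memory cache on the gateway, which lets reads of recently used secrets continue for up to the cache TTL during a short outage; it does not help for DEKs that are not already cached. For longer outages, we replicate the KMS master key (KEK) across multiple regions so that, if one region goes offline, the gateway can route decrypt requests to a replica KMS endpoint in another region.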
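
Regional failover for decrypt calls can be sketched as an ordered list of clients tried in turn. The `KmsUnavailable` exception and the shape of `regional_clients` (pairs of region name and a decrypt callable) are assumptions for illustration; the approach only works if the KEK is actually replicated to each region listed.

```python
class KmsUnavailable(Exception):
    """Raised by a (hypothetical) regional KMS client that cannot be reached."""


def decrypt_with_failover(encrypted_dek, regional_clients):
    """Try each regional KMS client in order; raise if every region fails."""
    errors = []
    for region, client in regional_clients:
        try:
            return client(encrypted_dek)
        except KmsUnavailable as exc:
            errors.append(f"{region}: {exc}")
    # No region answered: fail closed rather than return anything.
    raise KmsUnavailable("all regions failed: " + "; ".join(errors))
```

Only availability errors trigger failover here. An authorization or integrity error from the KMS propagates immediately, since retrying it elsewhere would not be expected to help.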

Plaintext Secret Leaks in Application Logs

A common security issue occurs when debugging statements log entire request or response payloads, accidentally writing plaintext secrets to logging systems.

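Mitigation: We implement request/response sanitization. The API gateway replaces plaintext secret values with a masked string (e.g., [REDACTED_SECRET]) before request or response payloads are written to its own execution logs. Application-side logs are outside the gateway's control, so the client SDK should also avoid exposing secret values through string conversion or logging.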
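
On the application side, two small guards cover the most common leak paths: a wrapper type whose string form is masked, and a logging filter that masks known values. Both `SecretValue` and `RedactingFilter` are hypothetical names for this sketch, and the filter only catches exact matches of secrets it has been given.

```python
import logging


class SecretValue:
    """Wraps a secret so accidental str()/repr() never prints it."""

    def __init__(self, value):
        self._value = value

    def reveal(self):
        """Explicit, greppable access to the plaintext."""
        return self._value

    def __repr__(self):
        return "[REDACTED_SECRET]"

    __str__ = __repr__


class RedactingFilter(logging.Filter):
    """Masks known secret strings in formatted log messages."""

    def __init__(self, secrets):
        super().__init__()
        self._secrets = [s for s in secrets if s]   # skip empty strings

    def filter(self, record):
        message = record.getMessage()               # applies %-style args
        for secret in self._secrets:
            message = message.replace(secret, "[REDACTED_SECRET]")
        record.msg, record.args = message, None     # store the masked text
        return True                                 # never drop the record
```

A filter like this is typically attached to handlers with `handler.addFilter(...)`, so it applies regardless of which logger emitted the record.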

Audit Ingestion Failures

If the append-only audit log database goes offline, the platform could continue serving secrets without recording access.

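Mitigation: We enforce a fail-closed policy. The API gateway writes each audit record to a durable local buffer before returning the secret, and forwards buffered records to the audit database. If the buffer fills up or a record cannot be written, the gateway blocks secret retrieval and returns an error. HTTP 503 Service Unavailable fits better than a generic 500 Internal Server Error here, because it tells clients that a dependency is down and the request can be retried.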
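
The ordering that makes this fail-closed is simple: the plaintext leaves the gateway only after the audit write has been acknowledged. The sketch below illustrates that ordering with injected callables; `read_secret`, `AuditUnavailable`, and the record fields are assumptions, and the 503 status is this sketch's choice for signalling a retryable outage.

```python
class AuditUnavailable(Exception):
    """Raised when the audit record cannot be durably written."""


def read_secret(path, decrypt, write_audit):
    """Return (http_status, body); release plaintext only after the audit ACK."""
    plaintext = decrypt(path)
    try:
        write_audit({"action": "READ", "decision": "ALLOW", "path": path})
    except AuditUnavailable:
        # Fail closed: the caller gets no plaintext, only a retryable error.
        return 503, None
    return 200, plaintext
```

Because the audit write sits on the request path, its latency adds to every read, which is one reason to acknowledge from a durable local buffer rather than from the remote audit database.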

Staff Engineer Perspective

Verbal Script

Interviewer: "How would you design a secrets management platform to handle a thundering herd scenario where 10,000 container instances reboot and query secrets at the same time?"

Candidate: "A thundering herd scenario can overwhelm the secrets gateway and exhaust cloud KMS decryption quotas.

To protect the platform, I would implement caching at both the client and gateway layers.

First, the client SDK uses in-memory caching with jittered TTLs (e.g., 2 minutes). This prevents services from querying the gateway on every request.

Second, at the gateway layer, we avoid calling KMS to decrypt the Data Encryption Key (DEK) for every read request. Instead, we cache decrypted DEKs in an encrypted memory cache for 5 minutes. Since one DEK is used to decrypt multiple secrets or versions, caching the DEK reduces KMS API calls by over 99%, preventing rate-limiting errors.

Finally, we apply client-side retry backoff with randomized jitter to prevent retries from creating load spikes during reboots."

Interviewer: "If the audit logging system goes offline, should the secrets platform continue serving secrets or block access?"

Candidate: "We enforce a strict fail-closed security policy.

A secrets platform must be able to show who accessed each credential. If the audit logging system goes offline, we lose that ability, and any read during the outage becomes untraceable.

Therefore, if the gateway's audit buffer fills up or cannot write to the audit store, the gateway blocks secret retrieval and returns an error, ideally HTTP 503 Service Unavailable rather than a generic 500, so clients know the failure is retryable.

While this impacts application availability, it preserves accountability: no secret can be read without a durable audit record."