System Design: Building a Secrets Management Platform

Secrets management looks like a simple storage problem until you operate it in production.

At first, the requirement sounds like this: "Store API keys and database passwords somewhere safe." Then real systems arrive. Services need different values per environment. Developers need break-glass access. Kubernetes workloads need secrets without restarting every pod. Rotations must happen without downtime. Audit logs must answer who read what and why. A leaked secret must be revoked quickly. The platform must be available during deploys, but it cannot become a convenient place for everyone to dump plaintext credentials.

This guide designs a secrets management platform from first principles. It covers secret models, envelope encryption, KMS integration, access control, versioning, rotation, audit logs, delivery into applications, caching, Kubernetes integration, observability, and failure modes.

Problem Statement

Build a platform that lets services and operators store, read, rotate, and audit secrets.

Examples:

database passwords
webhook signing secrets
third-party API keys
OAuth client secrets
TLS private keys
encryption subkeys
service-to-service credentials

The platform should reduce secret sprawl. It should not make secrets magically harmless. A secret read by an application can still be logged, leaked, copied to a laptop, or embedded into a crash dump. Good design narrows the blast radius and gives teams a reliable way to rotate after something goes wrong.

Requirements

Functional requirements:

create a secret
read the latest secret version
read a specific version
rotate a secret
disable or destroy old versions
attach metadata and ownership
enforce access policies
audit reads and writes
support break-glass access
deliver secrets to applications

Non-functional requirements:

strong encryption at rest
secure transport
high availability for reads
strict authorization
low read latency for applications
safe rotation with overlap windows
immutable audit trail
tenant and environment isolation
predictable behavior during outages

The hardest requirement is rotation. A platform that can store secrets but cannot rotate them safely often creates a false sense of security.

Core Concepts

Use a small vocabulary:

Secret: logical name, such as prod/payments/stripe-api-key
SecretVersion: immutable encrypted value
Alias: pointer like current, previous, or next
Policy: who can read, write, rotate, or administer the secret
Lease: optional short-lived access grant for dynamic credentials
AuditEvent: immutable record of reads and writes

Versioning matters because rotation is rarely instant. During a rollout, some instances may still use the old database password while new instances use the new one. The platform must represent that overlap explicitly.

High-Level Architecture

           +-------------------+
           |   Admin Console   |
           +---------+---------+
                     |
                     v
+----------+   +-----+------+   +----------------+
| Service  |-->| Secret API |-->| Policy Engine  |
| SDK      |   +-----+------+   +----------------+
+----------+         |
                     v
             +-------+--------+
             | Metadata Store |
             +-------+--------+
                     |
                     v
             +-------+--------+       +------+
             | Encrypted Blob |<----->| KMS  |
             | Store          |       +------+
             +-------+--------+
                     |
                     v
              +------+------+
              | Audit Log   |
              +-------------+

The API service handles request validation and authentication. The policy engine decides whether the caller can act. The metadata store tracks names, owners, versions, and aliases. The encrypted blob store holds ciphertext. KMS protects data-encryption keys. The audit log receives every sensitive operation.

For small systems, metadata and encrypted blobs can live in one relational database. For large systems, metadata can remain relational while encrypted values are stored in object storage or a dedicated key-value store.

Data Model

CREATE TABLE secrets (
  id UUID PRIMARY KEY,
  tenant_id TEXT NOT NULL,
  environment TEXT NOT NULL,
  path TEXT NOT NULL,
  owner_team TEXT NOT NULL,
  description TEXT,
  rotation_strategy TEXT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  UNIQUE (tenant_id, environment, path)
);

CREATE TABLE secret_versions (
  id UUID PRIMARY KEY,
  secret_id UUID NOT NULL REFERENCES secrets(id),
  version_number BIGINT NOT NULL,
  ciphertext BYTEA NOT NULL,
  encrypted_data_key BYTEA NOT NULL,
  key_id TEXT NOT NULL,
  status TEXT NOT NULL,
  created_by TEXT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  expires_at TIMESTAMPTZ,
  UNIQUE (secret_id, version_number)
);

CREATE TABLE secret_aliases (
  secret_id UUID NOT NULL REFERENCES secrets(id),
  alias TEXT NOT NULL,
  version_number BIGINT NOT NULL,
  updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  PRIMARY KEY (secret_id, alias)
);

version_number should increase monotonically per secret. status can be:

PENDING: created but not yet used
CURRENT: default version for readers
PREVIOUS: kept for rollback or overlap
DISABLED: not returned to applications
DESTROYED: cryptographic material removed or made unrecoverable

Avoid updating ciphertext in place. Create a new version.
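
The status lifecycle above can be made explicit with a small transition table. This is a minimal sketch: the allowed transitions, including PREVIOUS back to CURRENT as a rollback, are assumptions for illustration rather than rules the schema enforces, and canTransition and assertTransition are hypothetical helper names.

```typescript
type VersionStatus =
  | "PENDING"
  | "CURRENT"
  | "PREVIOUS"
  | "DISABLED"
  | "DESTROYED";

// Assumed lifecycle; adjust to match your own rotation and retention rules.
const ALLOWED_TRANSITIONS: Record<VersionStatus, readonly VersionStatus[]> = {
  PENDING: ["CURRENT", "DISABLED"],
  CURRENT: ["PREVIOUS"],
  PREVIOUS: ["CURRENT", "DISABLED"], // PREVIOUS -> CURRENT models a rollback
  DISABLED: ["DESTROYED"],
  DESTROYED: [], // terminal: nothing can follow destruction
};

function canTransition(from: VersionStatus, to: VersionStatus): boolean {
  return ALLOWED_TRANSITIONS[from].includes(to);
}

// Throws instead of silently applying an invalid status change.
function assertTransition(from: VersionStatus, to: VersionStatus): void {
  if (!canTransition(from, to)) {
    throw new Error(`invalid status transition: ${from} -> ${to}`);
  }
}
```

Checking the transition in the API layer before writing the new status keeps a buggy client from, for example, reviving a DESTROYED version.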

Envelope Encryption

Do not encrypt every secret directly with a master key. Use envelope encryption.

Write path:

1. Generate a random data key.
2. Encrypt the secret value with the data key.
3. Ask KMS to encrypt the data key.
4. Store ciphertext and encrypted data key.
5. Destroy plaintext key material from memory as soon as possible.

Read path:

1. Load ciphertext and encrypted data key.
2. Ask KMS to decrypt the data key.
3. Decrypt the secret value in memory.
4. Return the value to the authorized caller.
5. Emit an audit event.

Pseudo-code:

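import * as crypto from "node:crypto";

// Assumed collaborators for this sketch: swap in your real KMS client and repository.
declare const kms: {
  encrypt(input: { keyId: string; plaintext: Buffer }): Promise<Buffer>;
};
declare const secretVersionRepository: {
  insert(row: {
    secretId: string;
    ciphertext: Buffer;
    encryptedDataKey: Buffer;
    keyId: string;
    createdBy: string;
  }): Promise<void>;
};

const KMS_KEY_ID = "secrets-platform-prod";

async function storeSecretVersion(input: {
  secretId: string;
  plaintext: Buffer;
  createdBy: string;
}): Promise<void> {
  // Fresh 256-bit data key and 96-bit IV for every version; never reuse either.
  const dataKey = crypto.randomBytes(32);
  const iv = crypto.randomBytes(12);

  try {
    const cipher = crypto.createCipheriv("aes-256-gcm", dataKey, iv);
    const ciphertext = Buffer.concat([
      cipher.update(input.plaintext),
      cipher.final(),
    ]);
    const tag = cipher.getAuthTag();

    // Wrap the data key with the KMS-held master key.
    const encryptedDataKey = await kms.encrypt({
      keyId: KMS_KEY_ID,
      plaintext: dataKey,
    });

    // Stored layout: 12-byte IV | 16-byte auth tag | ciphertext.
    // The repository is assumed to assign version_number and status.
    await secretVersionRepository.insert({
      secretId: input.secretId,
      ciphertext: Buffer.concat([iv, tag, ciphertext]),
      encryptedDataKey,
      keyId: KMS_KEY_ID,
      createdBy: input.createdBy,
    });
  } finally {
    // Best-effort zeroing, also on failure. The runtime may still hold copies.
    dataKey.fill(0);
    input.plaintext.fill(0);
  }
}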

In real systems, use well-reviewed cryptographic libraries and managed KMS clients. The important system design point is separation: the metadata store holds only ciphertext and KMS-wrapped data keys, while the master key never leaves KMS.
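
The read path listed earlier is roughly the mirror image of the write path. The sketch below assumes the stored layout from the pseudo-code (12-byte IV, then 16-byte auth tag, then ciphertext) and takes an already unwrapped data key as input, so the KMS decrypt call is left out. sealValue and openValue are hypothetical helper names; sealValue is included only so the example round-trips.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

const IV_LENGTH = 12; // bytes, matches the IV size used on the write path
const TAG_LENGTH = 16; // bytes, the default AES-GCM auth tag length in Node.js

// Packs iv | tag | ciphertext, the same layout the write path stores.
function sealValue(dataKey: Buffer, plaintext: Buffer): Buffer {
  const iv = randomBytes(IV_LENGTH);
  const cipher = createCipheriv("aes-256-gcm", dataKey, iv);
  const body = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  return Buffer.concat([iv, cipher.getAuthTag(), body]);
}

// Splits the stored blob and decrypts it. final() throws when the tag does
// not verify, which is how tampering or a wrong data key shows up.
function openValue(dataKey: Buffer, stored: Buffer): Buffer {
  if (stored.length < IV_LENGTH + TAG_LENGTH) {
    throw new Error("stored ciphertext is too short");
  }
  const iv = stored.subarray(0, IV_LENGTH);
  const tag = stored.subarray(IV_LENGTH, IV_LENGTH + TAG_LENGTH);
  const body = stored.subarray(IV_LENGTH + TAG_LENGTH);

  const decipher = createDecipheriv("aes-256-gcm", dataKey, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(body), decipher.final()]);
}
```

In a real read handler the data key would come from a KMS decrypt of encrypted_data_key and should be zeroed after openValue returns, just as on the write path.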

API Design

Create:

POST /v1/secrets

{
  "environment": "prod",
  "path": "payments/stripe-api-key",
  "ownerTeam": "payments",
  "rotationStrategy": "manual-overlap"
}

Create version:

POST /v1/secrets/{secretId}/versions

{
  "value": "sk_live_redacted",
  "status": "PENDING"
}

Read:

GET /v1/secrets/prod/payments/stripe-api-key?alias=current

Response:

{
  "path": "payments/stripe-api-key",
  "version": 42,
  "value": "sk_live_redacted",
  "expiresAt": null
}

Promote:

POST /v1/secrets/{secretId}/aliases/current

{
  "version": 42,
  "expectedPreviousVersion": 41
}

Use compare-and-swap semantics when updating aliases. That prevents two rotations from racing and accidentally promoting the wrong version.
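
A minimal in-memory sketch of that compare-and-swap check follows. AliasStore is a hypothetical class used only for illustration; the comment shows how the same idea could look as a conditional UPDATE against the secret_aliases table from the data model.

```typescript
// In a relational store the same check can be a conditional update, e.g.:
//   UPDATE secret_aliases SET version_number = $1, updated_at = now()
//   WHERE secret_id = $2 AND alias = $3 AND version_number = $4;
// and the promotion succeeded only if exactly one row was affected.
class AliasStore {
  private readonly aliases = new Map<string, number>();

  get(secretId: string, alias: string): number | null {
    return this.aliases.get(`${secretId}:${alias}`) ?? null;
  }

  // Returns true when the alias moved, false when the caller's view was stale.
  // Pass null as expectedPreviousVersion when the alias should not exist yet.
  promote(
    secretId: string,
    alias: string,
    version: number,
    expectedPreviousVersion: number | null,
  ): boolean {
    const actual = this.get(secretId, alias);
    if (actual !== expectedPreviousVersion) {
      return false; // another rotation got there first
    }
    this.aliases.set(`${secretId}:${alias}`, version);
    return true;
  }
}
```

A caller that gets false back should re-read the alias and decide whether its rotation is still relevant, rather than retrying blindly.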

Access Control

Access should be based on workload identity, team ownership, environment, and operation.

Policy example:

{
  "effect": "allow",
  "principal": "service:prod:payments-api",
  "actions": ["secrets:read"],
  "resources": ["prod/payments/*"],
  "conditions": {
    "sourceCluster": "prod-us-east-1",
    "mfa": false
  }
}

Admin policy:

{
  "effect": "allow",
  "principal": "group:payments-oncall",
  "actions": ["secrets:rotate", "secrets:read-break-glass"],
  "resources": ["prod/payments/*"],
  "conditions": {
    "mfa": true,
    "ticketRequired": true
  }
}

Separate read, write, rotate, administer, and break-glass permissions. A service account that reads a secret does not need permission to rotate it. A developer who can rotate a secret does not automatically need to see its plaintext.
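
To make the evaluation concrete, here is a small sketch of how policies shaped like the examples above might be checked. It is deliberately simplified: conditions such as mfa and sourceCluster are ignored, the "deny" effect and the deny-overrides rule are assumptions, and only exact paths or a trailing /* wildcard are supported. isAllowed and resourceMatches are hypothetical names.

```typescript
type Policy = {
  effect: "allow" | "deny";
  principal: string;
  actions: string[];
  resources: string[]; // exact path, or a prefix ending in "/*"
};

type AccessRequest = { principal: string; action: string; resource: string };

function resourceMatches(pattern: string, resource: string): boolean {
  if (pattern.endsWith("/*")) {
    // Keep the trailing slash so "prod/payments/*" does not match "prod/payments-admin/x".
    return resource.startsWith(pattern.slice(0, -1));
  }
  return pattern === resource;
}

// Default deny: a request is allowed only if some allow policy matches
// and no matching policy denies it.
function isAllowed(policies: Policy[], request: AccessRequest): boolean {
  const matching = policies.filter(
    (policy) =>
      policy.principal === request.principal &&
      policy.actions.includes(request.action) &&
      policy.resources.some((pattern) => resourceMatches(pattern, request.resource)),
  );
  if (matching.some((policy) => policy.effect === "deny")) {
    return false;
  }
  return matching.some((policy) => policy.effect === "allow");
}
```

With the service policy above, service:prod:payments-api can perform secrets:read on prod/payments/stripe-api-key, but a secrets:rotate request or a read under prod/billing/ falls through to the default deny.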

Application Delivery Patterns

There are three common delivery patterns.

Direct SDK fetch.

The application calls the secrets API at startup or when it needs a value.

Pros:

simple freshness model
easy audit attribution
no sidecar dependency

Cons:

application needs client code
secrets API latency can affect startup
every language needs SDK support

Sidecar agent.

A local agent fetches and refreshes secrets, then exposes them through files or a local socket.

Pros:

language agnostic
centralizes caching and renewal
works well with Kubernetes

Cons:

more moving parts
local file permissions matter
agent bugs affect every workload

Build-time or deploy-time injection.

CI/CD injects secrets into environment variables or config.

Pros:

simple to adopt
no runtime dependency

Cons:

rotation usually requires redeploy
environment variables can leak in process dumps and logs
weak audit attribution for individual reads

For production services, prefer runtime delivery through SDK or sidecar. For static configuration that changes rarely, deploy-time injection may be acceptable.

SDK Caching

Without caching, every request path can become dependent on the secrets platform.

SDK cache behavior:

type CachedSecret = {
  value: string;
  version: number;
  expiresAt: number;
};

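class SecretClient {
  private cache = new Map<string, CachedSecret>();
  private fetchSecret: (path: string) => Promise<{ value: string; version: number }>;

  // fetchSecret is the authenticated call to the secrets API, injected by the caller.
  constructor(
    fetchSecret: (path: string) => Promise<{ value: string; version: number }>,
  ) {
    this.fetchSecret = fetchSecret;
  }

  async get(path: string): Promise<string> {
    const cached = this.cache.get(path);

    // Serve from cache while the entry is still within its TTL.
    if (cached && cached.expiresAt > Date.now()) {
      return cached.value;
    }

    const fresh = await this.fetchSecret(path);
    this.cache.set(path, {
      value: fresh.value,
      version: fresh.version,
      // Start the TTL when the fetch completes, not when it began.
      expiresAt: Date.now() + 60_000,
    });

    return fresh.value;
  }
}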

Cache TTL should be short enough for rotation and revocation to take effect, but long enough to survive brief platform hiccups.

Add jitter:

const baseTtlMs = 60_000;
const jitterMs = Math.floor(Math.random() * 15_000);
const ttlMs = baseTtlMs + jitterMs;

Jitter prevents every instance from refreshing at the same time after a deploy.

Rotation Workflow

A safe rotation flow:

1. Create version 42 as PENDING.
2. Update the downstream system to accept both version 41 and version 42.
3. Promote alias next -> 42 for canary workloads.
4. Verify canary metrics.
5. Promote alias current -> 42.
6. Wait for SDK caches and application rollout.
7. Mark version 41 as PREVIOUS.
8. After the rollback window, disable version 41.
9. Destroy version 41 when retention policy allows.
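
Step 6 is easier to reason about with a number attached. A rough lower bound for the wait is the longest time a well-behaved client could keep serving the old version: cache TTL, plus maximum jitter, plus any stale window the SDK tolerates during refresh failures, plus however long the rollout takes. The sketch below does that arithmetic; the type, the function names, and the idea that the stale window begins when the TTL expires are assumptions.

```typescript
type ClientCachePolicy = {
  baseTtlMs: number; // normal cache TTL
  maxJitterMs: number; // upper bound on random jitter added to the TTL
  maxStaleMs: number; // how long a client may serve an expired entry when refresh fails
};

// Longest time a client could keep using the old version after promotion.
function maxClientLagMs(policy: ClientCachePolicy): number {
  return policy.baseTtlMs + policy.maxJitterMs + policy.maxStaleMs;
}

// Earliest time at which disabling the old version should not surprise any client.
function earliestSafeDisable(
  promotedAt: Date,
  policy: ClientCachePolicy,
  rolloutMs: number,
): Date {
  return new Date(promotedAt.getTime() + maxClientLagMs(policy) + rolloutMs);
}

const examplePolicy: ClientCachePolicy = {
  baseTtlMs: 60_000,
  maxJitterMs: 15_000,
  maxStaleMs: 900_000,
};

// 60 s + 15 s + 900 s = 975 s of possible client lag.
console.log(maxClientLagMs(examplePolicy)); // 975000
```

With a 60 second TTL, up to 15 seconds of jitter, and a 900 second stale window, a client can lag by up to 975 seconds, so disabling version 41 within about 16 minutes of promotion risks breaking clients that are riding out a refresh failure.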

For database passwords, rotation often requires both the database and application to support overlap:

CREATE USER app_v42 WITH PASSWORD 'new_password';
GRANT app_role TO app_v42;

Then update applications to use app_v42, monitor connection success, and remove the old user after the overlap window:

REVOKE app_role FROM app_v41;
DROP USER app_v41;

If the downstream system does not support multiple active credentials, the rotation plan must include a coordinated maintenance window or a proxy layer that can absorb the change.

Dynamic Secrets

Some secrets should not be stored as long-lived values at all.

For database access, the platform can issue short-lived credentials:

{
  "username": "svc_payments_20260408_103000",
  "password": "temporary-password",
  "leaseId": "lease_123",
  "expiresAt": "2026-04-08T11:00:00Z"
}

The platform creates the credential in the downstream database, returns it to the service, and revokes it when the lease expires.

Dynamic secrets reduce blast radius, but they require deeper integration with downstream systems. They also need renewal, cleanup, and emergency revocation flows.
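
The lease bookkeeping behind those flows can be sketched in a few lines. This example only tracks leases in memory; creating and dropping the actual database user is left out, and LeaseManager, its method names, and the username format are hypothetical.

```typescript
type Lease = {
  leaseId: string;
  username: string;
  expiresAt: number; // epoch milliseconds
  revoked: boolean;
};

class LeaseManager {
  private readonly leases = new Map<string, Lease>();
  private readonly now: () => number;
  private nextId = 1;

  // The clock is injectable so expiry can be tested without waiting.
  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  issue(service: string, ttlMs: number): Lease {
    const leaseId = `lease_${this.nextId++}`;
    const lease: Lease = {
      leaseId,
      username: `svc_${service}_${leaseId}`,
      expiresAt: this.now() + ttlMs,
      revoked: false,
    };
    this.leases.set(leaseId, lease);
    return lease;
  }

  // Renewal extends an active lease; expired or revoked leases cannot be revived.
  renew(leaseId: string, ttlMs: number): Lease {
    const lease = this.leases.get(leaseId);
    if (!lease || lease.revoked || lease.expiresAt <= this.now()) {
      throw new Error(`lease ${leaseId} is not active`);
    }
    lease.expiresAt = this.now() + ttlMs;
    return lease;
  }

  // Emergency revocation: takes effect regardless of remaining TTL.
  revoke(leaseId: string): void {
    const lease = this.leases.get(leaseId);
    if (lease) {
      lease.revoked = true;
    }
  }

  // Cleanup: returns leases whose downstream credentials should now be dropped.
  sweepExpired(): Lease[] {
    const expired: Lease[] = [];
    this.leases.forEach((lease) => {
      if (!lease.revoked && lease.expiresAt <= this.now()) {
        lease.revoked = true;
        expired.push(lease);
      }
    });
    return expired;
  }
}
```

A production version would persist leases and run the sweep on a schedule, so that a platform restart does not orphan credentials in the downstream database.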

Kubernetes Integration

Kubernetes Secrets are useful, but they are not a complete secrets management strategy.

A common production pattern:

Secrets platform -> External Secrets controller -> Kubernetes Secret -> pod volume

Mount secrets as files when rotation matters:

apiVersion: v1
kind: Pod
metadata:
  name: payments-api
spec:
  containers:
    - name: app
      image: payments-api:2026.04.08
      volumeMounts:
        - name: app-secrets
          mountPath: /var/run/secrets/app
          readOnly: true
  volumes:
    - name: app-secrets
      secret:
        secretName: payments-api-secrets

Environment variables are easy, but they are usually harder to rotate without restart and easier to leak through diagnostics.

If you use a sidecar agent, it can write refreshed secrets to a memory-backed volume and notify the application to reload. The application still needs a reload path. Rotation is not just a platform feature; it is also an application feature.
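
On the application side, a reload path can be as small as re-reading the mounted file and reacting when its contents change. The sketch below is one possible shape, not a prescribed pattern: ReloadableFileSecret is a hypothetical class, and the file reader is injectable so the logic can be exercised without a real volume.

```typescript
import { readFileSync } from "node:fs";

class ReloadableFileSecret {
  private current: string | null = null;
  private readonly path: string;
  private readonly readFile: (path: string) => string;

  constructor(
    path: string,
    readFile: (path: string) => string = (p) => readFileSync(p, "utf8"),
  ) {
    this.path = path;
    this.readFile = readFile;
  }

  // Re-reads the file and reports whether the value changed since the last call.
  refresh(): boolean {
    const next = this.readFile(this.path);
    if (next === this.current) {
      return false;
    }
    this.current = next;
    return true;
  }

  value(): string {
    if (this.current === null) {
      throw new Error("secret not loaded; call refresh() first");
    }
    return this.current;
  }
}
```

An application might call refresh() on a timer or when the agent signals it, and rebuild connection pools or clients only when it returns true. A path such as /var/run/secrets/app/db-password (a hypothetical file under the mountPath shown above) would be the input. Note that Kubernetes does not propagate Secret updates to volumes mounted with subPath, so this approach relies on mounting the whole volume.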

Audit Logging

Every sensitive operation should emit an audit event:

{
  "eventType": "SECRET_READ",
  "tenantId": "internal",
  "environment": "prod",
  "path": "payments/stripe-api-key",
  "version": 42,
  "principal": "service:prod:payments-api",
  "sourceIp": "10.2.4.17",
  "requestId": "req_abc",
  "decision": "ALLOW",
  "createdAt": "2026-04-08T10:30:00Z"
}

Do not store the plaintext value in audit logs. Not even "temporarily." Audit logs often have broader access than secret stores.
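
One way to keep reads from outrunning the audit trail is to record the event before releasing the value and to fail the read if that write fails. The sketch below illustrates the ordering under that assumption; readWithAudit, the AuditSink type, and the reduced set of event fields are hypothetical, and the event type has no field that could carry the plaintext.

```typescript
type AuditEvent = {
  eventType: "SECRET_READ";
  path: string;
  version: number;
  principal: string;
  requestId: string;
  decision: "ALLOW";
  createdAt: string;
};

type AuditSink = (event: AuditEvent) => Promise<void>;

type ReadContext = { path: string; principal: string; requestId: string };

async function readWithAudit(
  load: () => Promise<{ value: string; version: number }>,
  sink: AuditSink,
  context: ReadContext,
): Promise<string> {
  const secret = await load();

  // Record the read first. If the sink rejects, the caller gets the error
  // and never sees the secret value.
  await sink({
    eventType: "SECRET_READ",
    path: context.path,
    version: secret.version,
    principal: context.principal,
    requestId: context.requestId,
    decision: "ALLOW",
    createdAt: new Date().toISOString(),
  });

  return secret.value;
}
```

The trade-off is availability: an audit pipeline outage now blocks reads, so the sink usually needs its own durability and alerting.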

Read events can be high volume. Store full-fidelity audit logs in an append-only stream or warehouse, and expose summarized views in the UI:

reads by principal
reads by secret
break-glass reads
denied reads
rotations by owner team
stale secrets without rotation

Break-glass reads should require a reason and ticket:

{
  "reason": "production incident INC-1842",
  "ticketUrl": "https://tracker.example.com/INC-1842",
  "durationMinutes": 30
}
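
A request like that is only useful if the platform rejects incomplete ones. Here is a minimal validation sketch; the 60 minute cap and the https requirement are assumptions chosen for illustration, and validateBreakGlass is a hypothetical name.

```typescript
type BreakGlassRequest = {
  reason: string;
  ticketUrl: string;
  durationMinutes: number;
};

const MAX_DURATION_MINUTES = 60; // assumed upper bound for a break-glass grant

// Returns a list of problems; an empty list means the request is acceptable.
function validateBreakGlass(request: BreakGlassRequest): string[] {
  const errors: string[] = [];

  if (request.reason.trim().length === 0) {
    errors.push("reason is required");
  }

  let ticketOk = false;
  try {
    ticketOk = new URL(request.ticketUrl).protocol === "https:";
  } catch {
    ticketOk = false; // not a parseable URL
  }
  if (!ticketOk) {
    errors.push("ticketUrl must be a valid https URL");
  }

  if (
    !Number.isInteger(request.durationMinutes) ||
    request.durationMinutes < 1 ||
    request.durationMinutes > MAX_DURATION_MINUTES
  ) {
    errors.push(`durationMinutes must be an integer between 1 and ${MAX_DURATION_MINUTES}`);
  }

  return errors;
}
```

Rejecting the request up front keeps the audit trail meaningful: every break-glass read that does succeed carries a reason, a ticket, and a bounded duration.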

Availability Strategy

Secrets access sits on a painful boundary. If the platform is unavailable, new pods may fail to start. If the platform fails open, an authorization bug can leak everything.

Use different behavior for startup and steady state.

Startup:

fail closed for missing critical secrets
retry with exponential backoff
surface clear error messages
do not log plaintext values

Steady state:

continue using cached values for a bounded stale window
refresh asynchronously
alert on cache staleness
stop using a secret if revocation is explicitly signaled

Example SDK policy:

{
  "cacheTtlSeconds": 60,
  "maxStaleSeconds": 900,
  "failOpenOnRefreshError": true,
  "failOpenOnExplicitRevocation": false
}

This says: tolerate a temporary refresh outage, but do not ignore a revocation event.
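
As a rough sketch, an SDK could apply that policy with a single decision function, called when a refresh attempt has failed or is still in flight. The field names mirror the JSON above; canServeCached, CachedEntry, and the choice to start the stale window when the TTL expires are assumptions.

```typescript
type SdkPolicy = {
  cacheTtlSeconds: number;
  maxStaleSeconds: number;
  failOpenOnRefreshError: boolean;
  failOpenOnExplicitRevocation: boolean;
};

type CachedEntry = {
  fetchedAtMs: number; // when the value was last fetched successfully
  revoked: boolean; // set when the platform explicitly signals revocation
};

// Decides whether the cached value may still be handed to the application.
function canServeCached(policy: SdkPolicy, entry: CachedEntry, nowMs: number): boolean {
  // An explicit revocation wins over everything else unless the policy says otherwise.
  if (entry.revoked && !policy.failOpenOnExplicitRevocation) {
    return false;
  }

  const ageSeconds = (nowMs - entry.fetchedAtMs) / 1000;
  if (ageSeconds <= policy.cacheTtlSeconds) {
    return true; // still fresh
  }

  // Past the TTL: tolerate staleness only if allowed, and only within the bound.
  return (
    policy.failOpenOnRefreshError &&
    ageSeconds <= policy.cacheTtlSeconds + policy.maxStaleSeconds
  );
}
```

With the example policy, a value fetched 500 seconds ago is still served during a refresh outage, while one fetched more than 960 seconds ago is not, and a revoked value is refused immediately regardless of age.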

Multi-Region Design

For multi-region systems, decide whether secrets are regional or global.

Regional secrets:

lower blast radius
simpler data residency
region-specific credentials
more operational work

Global secrets:

simpler application config
easier single control plane
larger blast radius
cross-region replication concerns

For critical production systems, keep the control plane multi-region but avoid making every read cross-region. Replicate encrypted secret versions and policy snapshots into each serving region. Writes and rotations can go through a primary region if the operational model is simpler.

The KMS strategy must match the data model. If a region cannot decrypt replicated secrets during an outage, replication did not buy you much.

Observability

Metrics:

secret read latency
read success and failure rate
KMS decrypt latency
policy decision latency
cache hit rate
stale cache usage
denied access count
break-glass count
rotation success rate
secrets past rotation SLA
SDK refresh failures

Logs should include secret path, version, principal, request ID, and decision. They should never include plaintext.

Alerts:

spike in denied reads
break-glass access in production
KMS decrypt errors
sudden increase in reads for a sensitive secret
stale cache usage above threshold
secret past rotation deadline
failed rotation for critical secret

Security teams care about unusual access. Platform teams care about startup failures and refresh errors. Product teams care that deploys continue working.

Failure Modes

Plaintext appears in logs. A debugging statement logs request or response bodies from the secrets API.

No version overlap. Rotation breaks production because old and new credentials cannot coexist.

Policy too broad. A wildcard permission gives one service access to another team's secrets.

No owner metadata. Nobody knows who should approve rotation or deletion.

Stale caches last forever. An SDK keeps using a revoked secret because max stale age is not enforced.

Audit log is best effort only. Sensitive reads succeed even when audit logging is broken.

KMS throttling becomes an outage. Every application starts at once and triggers a decrypt storm.

Environment variables leak. Secrets show up in process dumps, debug endpoints, or support bundles.

Break-glass is invisible. Humans can read production secrets without reason, ticket, or alert.

Destroyed versions are still recoverable. Backups retain encrypted data and keys without a destruction policy.

Production Checklist

Store immutable secret versions.
Use envelope encryption.
Keep master keys outside the metadata database.
Separate read, rotate, administer, and break-glass permissions.
Require workload identity for application reads.
Make rotation a first-class workflow.
Support overlap windows for old and new credentials.
Prefer runtime delivery for frequently rotated secrets.
Use SDK caching with TTL, jitter, and max stale age.
Emit audit events for reads, writes, denies, and break-glass access.
Never log plaintext secrets.
Alert on unusual reads and failed rotations.
Track owners and rotation SLAs.
Test KMS outage, policy denial, stale cache, and rotation rollback.
Document what happens when the secrets platform is unavailable.

Sachin Sarawgi
