
Multi-Tenancy in NoSQL: Data Isolation Strategies for SaaS

Designing SaaS backends? Learn how to implement multi-tenancy in NoSQL. Explore Database-per-tenant, Schema-per-tenant, and Shared-schema (Partitioning) strategies.


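Key Takeaways

  • **Database-per-tenant (Silo), pros**: Strongest isolation; individual tenants are easy to back up and restore.
  • **Database-per-tenant (Silo), cons**: High cost; thousands of databases are hard to manage.
  • **Schema-per-tenant (Bridge), pros**: Balanced cost and isolation.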

Recommended Prerequisites

  • System Design Interview Framework
  • System Design Module 7: Sharding & Partitioning

Designing a Software-as-a-Service (SaaS) platform like Slack, Shopify, or Zendesk introduces a fundamental database engineering challenge: Multi-Tenancy. You must decide how to isolate customer data (tenants) to prevent data leaks and maintain strict security, while simultaneously maximizing hardware resource efficiency and keeping operational costs low.

When utilizing NoSQL databases (such as DynamoDB, Cassandra, or MongoDB), traditional relational boundaries (like separate database schemas or table-level namespaces) are either missing or expensive to scale.

This case study explores the architectural patterns, security controls, and low-level data structures required to build a highly scalable, secure, and cost-effective multi-tenant NoSQL storage engine.


System Requirements and Goals

To design a multi-tenant SaaS datastore, we must establish strict functional boundaries and clear non-functional security goals.

1. Functional Requirements

  • Dynamic Tenant Identification: The system must resolve the tenant context (e.g., tenant_id) from incoming request headers, JWT authentication claims, or subdomains (e.g., apple.slack.com) on every request.
  • Strict Data Isolation: Prevent "cross-tenant data leaks" where bug-ridden application queries accidentally expose Tenant A's private data to Tenant B.
  • Noisy Neighbor Mitigation: Dynamically throttle or isolate high-volume tenants (noisy neighbors) who saturate shared database CPU/IO resources.
  • Tenant Lifecycle Management: Support instant provisioning of new tenants, tenant offboarding (complete data deletion), and seamless tenant migrations between database nodes.

2. Non-Functional Constraints

  • Ultra-Low Latency Overhead: Tenant resolution and routing middleware must add less than $2\text{ ms}$ of latency to the write/read paths.
  • High Scale & Cost Efficiency: Maximize cluster packing density, sharing compute resources to minimize operating costs for smaller (long-tail) tenants.
  • Compliance & Auditing: Support custom encryption-at-rest keys (BYOK - Bring Your Own Key) per tenant to satisfy high-tier enterprise compliance.

API Design and Interface Contracts

A multi-tenant service gateway maps public client calls to isolated internal database partitions by injecting tenant context securely.

1. External REST Request (Public Endpoint)

GET /v1/orders?limit=10

Request Headers:

Host: client-corp-a.saasapp.com
Authorization: Bearer jwt_secure_token_xyz

Decoded JWT Claims (Validated by API Gateway):

{
  "sub": "user_12345",
  "tenant_id": "tenant_corp_a",
  "role": "billing_admin"
}

2. Internal Context Routing API Contract

The API Gateway forwards the request to the backend microservice, injecting the authenticated tenant context as a secure request header.

GET /v1/internal/orders

Internal Routing Headers:

X-Tenant-Id: tenant_corp_a
X-User-Id: user_12345
X-Tracing-Id: trace_88b2a3
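
One detail the contract above leaves implicit is that the internal headers should be built only from the validated claims, and any X-Tenant-Id or X-User-Id the client sent should be discarded. The following is a minimal sketch of that step. The helper name buildInternalHeaders, the HeaderMap type, and the allowed character set for tenant IDs are assumptions made for illustration, not part of the original design.

```typescript
// Claims as validated by the API Gateway (shape taken from the JWT example above).
interface TenantClaims {
  sub: string;
  tenant_id: string;
  role: string;
}

type HeaderMap = Record<string, string>;

// Hypothetical helper: builds the headers forwarded to internal services.
function buildInternalHeaders(
  incoming: HeaderMap,
  claims: TenantClaims,
  traceId: string,
): HeaderMap {
  // Assumption: tenant IDs are restricted to a conservative character set.
  if (!/^[A-Za-z0-9_-]+$/.test(claims.tenant_id)) {
    throw new Error('Invalid tenant_id claim');
  }

  const forwarded: HeaderMap = {};
  for (const headerName of Object.keys(incoming)) {
    const lower = headerName.toLowerCase();
    // Never trust tenant or user context supplied by the client.
    if (lower === 'x-tenant-id' || lower === 'x-user-id') continue;
    forwarded[headerName] = incoming[headerName];
  }

  // Tenant context always comes from the authenticated claims.
  forwarded['X-Tenant-Id'] = claims.tenant_id;
  forwarded['X-User-Id'] = claims.sub;
  forwarded['X-Tracing-Id'] = traceId;
  return forwarded;
}
```

With this shape, a client that sends its own x-tenant-id header cannot influence which tenant the backend sees; the value is overwritten from the token.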

High-Level Design Architecture

SaaS multi-tenancy is structured around three major data isolation models: Silo, Bridge, and Pool.

1. The Three Data Isolation Models

graph TD
    %% Silo Model
    subgraph "Silo Model (Database-per-Tenant)"
        AppSilo[App Gateway] -->|Direct Connect| DBSiloA[(DB Tenant A)]
        AppSilo -->|Direct Connect| DBSiloB[(DB Tenant B)]
    end

    %% Bridge Model
    subgraph "Bridge Model (Schema-per-Tenant)"
        AppBridge[App Gateway] -->|Schema Selector| DBSchema[(Shared DB Instance)]
        DBSchema -.->|Isolated Namespace| TableTenantA[Tables Tenant A]
        DBSchema -.->|Isolated Namespace| TableTenantB[Tables Tenant B]
    end

    %% Pool Model
    subgraph "Pool Model (Shared-Database-Shared-Schema)"
        AppPool[App Gateway] -->|Inject tenant_id| DBPooled[(Shared Pooled DB)]
        DBPooled -->|Composite partitions| PartitionA[Partition: tenant_id = CorpA]
        DBPooled -->|Composite partitions| PartitionB[Partition: tenant_id = CorpB]
    end

    %% Styles
    style DBSiloA fill:#111827,stroke:#10b981,stroke-width:2px,color:#fff
    style DBSiloB fill:#111827,stroke:#10b981,stroke-width:2px,color:#fff
    style DBSchema fill:#1e1b4b,stroke:#4f46e5,stroke-width:2px,color:#fff
    style DBPooled fill:#0f172a,stroke:#3b82f6,stroke-width:2px,color:#fff

2. Context-Aware Tenant Router Architecture

When an API call enters the cluster, a dedicated middleware intercepts it, resolves the tenant's sharding rules via Redis, and routes it to the target database.

graph TD
    UserRequest[Client API Call] -->|1. Authenticate Request| Gateway[API Gateway]
    Gateway -->|2. Resolve tenant_id| ContextResolver[Tenant Context Resolver]
    
    subgraph "Routing Control Plane"
        ContextResolver -->|3. Query routing metadata| RedisCache[(Redis Routing Cache)]
        RedisCache -.->|Cache Miss Lookup| MetadataStore[(Postgres Routing DB)]
    end

    ContextResolver -->|4. Forward Route| TargetStorage{Evaluate Tier Strategy}
    TargetStorage -->|Premium Tier| SiloDB[(Dedicated Silo Database)]
    TargetStorage -->|Standard Tier| PooledDB[(Shared Pooled Shard)]

    %% Styles
    style RedisCache fill:#1a1c23,stroke:#ef4444,stroke-width:2px,color:#fff
    style MetadataStore fill:#1a1c23,stroke:#3b82f6,stroke-width:2px,color:#fff
    style PooledDB fill:#0f172a,stroke:#10b981,stroke-width:2px,color:#fff
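
The lookup path in the diagram (cache first, metadata store on a miss, then a tier decision) can be sketched as below. This is an illustration under stated assumptions: an in-memory Map stands in for the Redis routing cache, the injected loadRoute function stands in for the Postgres routing DB, and TenantRouter, TenantRoute, and the target labels are hypothetical names.

```typescript
type Tier = 'premium' | 'standard';

interface TenantRoute {
  tier: Tier;
  // Hypothetical label for a dedicated silo database or a shared pooled shard.
  target: string;
}

// Stand-in for the metadata store lookup; returns undefined for unknown tenants.
type RouteLoader = (tenantId: string) => Promise<TenantRoute | undefined>;

interface CachedRoute {
  route: TenantRoute;
  expiresAtMs: number;
}

class TenantRouter {
  // In-memory stand-in for the Redis routing cache.
  private cache = new Map<string, CachedRoute>();
  private loadRoute: RouteLoader;
  private ttlMs: number;

  constructor(loadRoute: RouteLoader, ttlMs: number = 60000) {
    this.loadRoute = loadRoute;
    this.ttlMs = ttlMs;
  }

  async resolve(tenantId: string): Promise<TenantRoute> {
    const cached = this.cache.get(tenantId);
    if (cached !== undefined && cached.expiresAtMs > Date.now()) {
      return cached.route; // cache hit
    }
    // Cache miss (or expired entry): fall back to the metadata store.
    const route = await this.loadRoute(tenantId);
    if (route === undefined) {
      throw new Error(`No route configured for tenant ${tenantId}`);
    }
    this.cache.set(tenantId, { route, expiresAtMs: Date.now() + this.ttlMs });
    return route;
  }

  // Call after a tenant migration so the next request re-reads the metadata store.
  invalidate(tenantId: string): void {
    this.cache.delete(tenantId);
  }
}
```

Keeping a TTL on cache entries bounds how long a stale route can survive if an invalidation is missed after a tenant moves between the pooled and silo tiers.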

Low-Level Design & Component Mechanics

To implement the Pooled (Shared-schema) model efficiently in NoSQL databases, we must structure table partitions around tenant_id identifiers.

1. Amazon DynamoDB Partition Schema Layout

In DynamoDB, we partition our data using a composite Primary Key structure:

  • Partition Key (PK): tenant_id (e.g. TENANT#corp_a)
  • Sort Key (SK): entity_type#entity_id (e.g. ORDER#12345)

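All items that share a tenant's partition key form a single item collection, so one Query can fetch a tenant's records without touching other tenants' data. DynamoDB normally stores an item collection together, although it may split a very large or hot collection across partitions by sort key.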

{
  "PK": { "S": "TENANT#tenant_corp_a" },
  "SK": { "S": "ORDER#ord_887766" },
  "email": { "S": "billing@corpa.com" },
  "amount": { "N": "15000" },
  "status": { "S": "COMPLETED" },
  "created_at": { "S": "2026-05-23T08:00:00Z" }
}
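
The key layout above can be captured in a few small helpers so that keys are built in exactly one place. This is a sketch: the function names are hypothetical, and rejecting # inside identifiers is an assumption made here so that composite keys stay unambiguous.

```typescript
const KEY_SEPARATOR = '#';

// Rejects empty parts and parts that contain the separator.
function checkKeyPart(label: string, value: string): string {
  if (value.length === 0 || value.indexOf(KEY_SEPARATOR) !== -1) {
    throw new Error(`${label} must be non-empty and must not contain "${KEY_SEPARATOR}"`);
  }
  return value;
}

// PK: TENANT#<tenantId>
function tenantPartitionKey(tenantId: string): string {
  return `TENANT${KEY_SEPARATOR}${checkKeyPart('tenantId', tenantId)}`;
}

// SK: <ENTITY_TYPE>#<entityId>
function entitySortKey(entityType: string, entityId: string): string {
  const typePart = checkKeyPart('entityType', entityType);
  const idPart = checkKeyPart('entityId', entityId);
  return `${typePart}${KEY_SEPARATOR}${idPart}`;
}

// Splits a sort key back into its two parts.
function parseSortKey(sortKey: string): { entityType: string; entityId: string } {
  const separatorIndex = sortKey.indexOf(KEY_SEPARATOR);
  if (separatorIndex <= 0 || separatorIndex === sortKey.length - 1) {
    throw new Error(`Malformed sort key: ${sortKey}`);
  }
  return {
    entityType: sortKey.slice(0, separatorIndex),
    entityId: sortKey.slice(separatorIndex + 1),
  };
}
```

For the item shown above, tenantPartitionKey('tenant_corp_a') and entitySortKey('ORDER', 'ord_887766') produce the PK and SK values in the example.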

2. ScyllaDB Multi-Tenant Table DDL Configuration

When deploying on ScyllaDB or Cassandra, we define our wide-column schemas to enforce logical partition separation:

CREATE KEYSPACE saas_datastore WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'us-east-1a': 3,
    'us-east-1b': 3
};

USE saas_datastore;

-- Multi-Tenant Pooled Table
CREATE TABLE customer_orders (
    tenant_id varchar,
    order_id uuid,
    customer_id varchar,
    order_amount decimal,
    order_status varchar,
    created_at timestamp,
    PRIMARY KEY (tenant_id, order_id)
) WITH CLUSTERING ORDER BY (order_id DESC);

3. Application-Level Query Interceptor (TypeScript)

To guarantee that a developer never forgets to inject the tenant_id filter in their queries, we implement a strict query interceptor using DynamoDB Document Client logic:

import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, QueryCommand } from '@aws-sdk/lib-dynamodb';

const client = new DynamoDBClient({ region: 'us-east-1' });
const ddbDocClient = DynamoDBDocumentClient.from(client);

interface SecureQueryOptions {
  tenantId: string;
  entityType: string;
  limit?: number;
}

// Tenant-scoped query helper: callers pass a tenant ID and never a raw partition key.
export async function getTenantEntities(options: SecureQueryOptions) {
  // CRITICAL: the partition key is always derived from tenantId, so the Query
  // can only read items inside that tenant's partition.
  const partitionKey = `TENANT#${options.tenantId}`;
  const sortKeyPrefix = `${options.entityType}#`;

  const command = new QueryCommand({
    TableName: 'SaaS_Application_Table',
    KeyConditionExpression: 'PK = :pk AND begins_with(SK, :skPrefix)',
    ExpressionAttributeValues: {
      ':pk': partitionKey,
      ':skPrefix': sortKeyPrefix
    },
    // Limit caps the items evaluated per request; page with LastEvaluatedKey for more.
    Limit: options.limit ?? 50
  });

  try {
    const response = await ddbDocClient.send(command);
    return response.Items ?? [];
  } catch (err) {
    // Log with tenant context for triage; a failed query is not in itself a security incident.
    console.error(`Query failed for tenant ${options.tenantId}:`, err);
    throw new Error('Database query execution rejected.');
  }
}

Scaling Challenges & Production Bottlenecks

Shared-schema systems are highly cost-efficient, but they introduce unique scaling challenges under heavy production loads.

1. The Noisy Neighbor Partition Saturation

In a Pooled database model, multiple tenants share the same physical server instance. If one tenant (e.g., a large enterprise client) suddenly spikes its traffic, for example during a marketing campaign, it can saturate the database node's read/write capacity and starve its smaller neighbors.

graph TD
    subgraph "Noisy Neighbor Resource Contention"
        TenantMega[Noisy Tenant: 10,000 RPS] -->|Saturates Host CPU| SharedNode[(ScyllaDB Node 1)]
        TenantSmallA[Tenant A: 5 RPS] -->|Starved & Timeout| SharedNode
        TenantSmallB[Tenant B: 5 RPS] -->|Starved & Timeout| SharedNode
    end

Mitigations:

  • Tenant-Level Token Bucket Rate Limiting: Deploy a Redis-backed rate limiter at the API gateway, enforcing strict requests-per-second (RPS) limits by tenant tier.
  • Auto-Sharding / Shard Relocation: If a tenant consistently exceeds 20% of a shared shard's total capacity, trigger an automated background migration script to extract their partition and migrate it to a dedicated Silo database (reallocating them to a Premium tier).
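
The token bucket mitigation can be sketched as follows. This is an in-memory illustration only: the lesson's design keeps bucket state in Redis so that all gateway instances share it, whereas this sketch uses a local Map, and the names TenantRateLimiter, BucketConfig, and tryAcquire are hypothetical.

```typescript
interface BucketConfig {
  capacity: number;        // maximum burst size, in requests
  refillPerSecond: number; // sustained requests per second
}

interface BucketState {
  tokens: number;
  lastRefillMs: number;
}

class TenantRateLimiter {
  // Local stand-in for shared (e.g. Redis-backed) bucket state.
  private buckets = new Map<string, BucketState>();
  private configForTenant: (tenantId: string) => BucketConfig;
  private nowMs: () => number;

  constructor(
    configForTenant: (tenantId: string) => BucketConfig,
    nowMs: () => number = () => Date.now(),
  ) {
    this.configForTenant = configForTenant;
    this.nowMs = nowMs;
  }

  // Returns true if the request may proceed, false if it should receive HTTP 429.
  tryAcquire(tenantId: string): boolean {
    const config = this.configForTenant(tenantId);
    const currentMs = this.nowMs();
    const state = this.buckets.get(tenantId) ?? {
      tokens: config.capacity, // new tenants start with a full bucket
      lastRefillMs: currentMs,
    };

    // Refill in proportion to elapsed time, never above capacity.
    const elapsedSeconds = Math.max(0, currentMs - state.lastRefillMs) / 1000;
    state.tokens = Math.min(
      config.capacity,
      state.tokens + elapsedSeconds * config.refillPerSecond,
    );
    state.lastRefillMs = currentMs;

    const allowed = state.tokens >= 1;
    if (allowed) state.tokens -= 1;
    this.buckets.set(tenantId, state);
    return allowed;
  }
}
```

Because each tenant has its own bucket, a burst from one tenant drains only that tenant's tokens. The per-tier limits would be supplied through configForTenant.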

2. Cross-Tenant Data Leaks

A single developer writing a generic query like SELECT * FROM orders WHERE status = 'PENDING' without a strict tenant_id filter will immediately leak cross-tenant private data, resulting in a catastrophic security violation.

Mitigations:

  • SDK-Level Interceptors: Force all database client initializations to wrap requests inside a decorator that automatically appends tenant_id filters to every query context.
  • Logical Database Routing: Separate client connections. The Context Router initializes distinct database client sessions with narrow IAM permissions (IAM Roles per Tenant) configured to permit access exclusively to specific partition key prefixes.
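
For DynamoDB, the database-enforced restriction described above is typically expressed with the dynamodb:LeadingKeys condition key, which limits access to items whose partition key matches the listed values. The sketch below builds such a policy document. The helper name buildTenantSessionPolicy, the action list, and the tenant ID character set are assumptions, and the document would still need to be attached to temporary credentials (for example as a session policy when assuming a role).

```typescript
interface TenantPolicyStatement {
  Effect: 'Allow';
  Action: string[];
  Resource: string[];
  Condition: {
    'ForAllValues:StringEquals': { 'dynamodb:LeadingKeys': string[] };
  };
}

interface TenantPolicyDocument {
  Version: '2012-10-17';
  Statement: TenantPolicyStatement[];
}

// Hypothetical helper: restricts item access to one tenant's partition key.
function buildTenantSessionPolicy(tableArn: string, tenantId: string): TenantPolicyDocument {
  // Assumption: tenant IDs use a conservative character set (no wildcards).
  if (!/^[A-Za-z0-9_-]+$/.test(tenantId)) {
    throw new Error('Invalid tenantId');
  }
  return {
    Version: '2012-10-17',
    Statement: [
      {
        Effect: 'Allow',
        // Scan is deliberately omitted: it is not scoped to a single partition key.
        Action: [
          'dynamodb:GetItem',
          'dynamodb:Query',
          'dynamodb:PutItem',
          'dynamodb:UpdateItem',
          'dynamodb:DeleteItem',
        ],
        Resource: [tableArn],
        Condition: {
          'ForAllValues:StringEquals': {
            'dynamodb:LeadingKeys': [`TENANT#${tenantId}`],
          },
        },
      },
    ],
  };
}
```

With credentials scoped this way, a query that targets another tenant's partition key is denied by DynamoDB itself, independent of any application-level interceptor.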

Technical Trade-offs & Strategic Compromises

Architecting a multi-tenant NoSQL datastore requires a deliberate compromise between data isolation, cost, and operational complexity.

| Architectural Dimension | Silo Model (DB-per-Tenant) | Bridge Model (Table-per-Tenant) | Pool Model (Shared-Table) |
| --- | --- | --- | --- |
| Data Isolation Security | Maximum (Physical boundaries) | High (Logical schema split) | Low (Application-layer guard) |
| Resource Cost | Extremely Expensive | Medium-High | Ultra-Cheap (Maximum packing density) |
| Scale & Provisioning Speed | Slow (Minutes to spin up DBs) | Medium | Near-instant (Milliseconds - insert row) |
| Operational Complexity | Extremely High (Thousands of DBs) | High (Table limits, migrations) | Low (Single cluster database) |
| Bring-Your-Own-Key (BYOK) | Easiest (Instance-level keys) | Medium | Extremely Complex (Row-level encryption) |

Failure Scenarios and Fault Tolerance

Multi-tenant platforms must be designed to withstand failures without cascading outages.

1. Row-Level BYOK Cryptographic Failures

High-tier enterprise tenants require Bring Your Own Key (BYOK) encryption at rest. If our central Key Management Service (KMS) becomes unreachable, for example during a network partition, we cannot obtain the keys needed to decrypt data for the affected premium tenants.

Fault Tolerance Strategy:

  • Graceful Degradation: If key retrieval fails, return a 503 Service Unavailable error only for the affected tenant's requests. The caller is still authenticated, so 401 Unauthorized would be misleading. Pooled tenants on the same shard that do not use BYOK must continue to operate unaffected, which limits the blast radius.
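
The degradation rule above can be sketched as a request handler that fails only the tenant whose key is unavailable. Everything named here is hypothetical: fetchTenantKey stands in for the KMS call, readRecords for the storage read, and the 30-second retry hint is an arbitrary illustrative value.

```typescript
interface HttpResult {
  status: number;
  body: string;
  retryAfterSeconds?: number;
}

// Stand-in for a KMS call that returns a handle to the tenant's data key.
type TenantKeyFetcher = (tenantId: string) => Promise<string>;

// Stand-in for the storage read; keyHandle is undefined for non-BYOK tenants.
type RecordReader = (tenantId: string, keyHandle: string | undefined) => Promise<string>;

async function handleTenantRead(
  tenantId: string,
  usesByok: boolean,
  fetchTenantKey: TenantKeyFetcher,
  readRecords: RecordReader,
): Promise<HttpResult> {
  let keyHandle: string | undefined;
  if (usesByok) {
    try {
      keyHandle = await fetchTenantKey(tenantId);
    } catch (err) {
      // Fail only this tenant's request; other tenants never reach this branch.
      return {
        status: 503,
        body: 'Tenant encryption key is temporarily unavailable',
        retryAfterSeconds: 30,
      };
    }
  }
  return { status: 200, body: await readRecords(tenantId, keyHandle) };
}
```

The key lookup is scoped to the request, so a KMS outage surfaces as a 503 for BYOK tenants while requests from other tenants on the same shard skip the lookup entirely.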

Staff Engineer Perspective


Verbal Script & Mock Interview

Mock Interview Dialogue

Interviewer: "Welcome! Let's explore multi-tenant systems. If you were designing a B2B SaaS platform like Slack, how would you structure your NoSQL database layer to balance high-security data isolation with cost efficiency? What are the key bottlenecks at scale?"

Candidate: *"To balance data isolation and cost efficiency in a high-scale SaaS platform, we must partition our tenants into distinct storage tiers based on their size and security requirements. We use a hybrid model combining the Silo (Database-per-tenant) and Pool (Shared-schema) models.

For 95% of our customer base (small-to-medium businesses), we implement a highly cost-effective Pool Model. We utilize a single large NoSQL database (such as Amazon DynamoDB or ScyllaDB) and enforce logical partition separation. In DynamoDB, we structure our primary keys as composite keys: the Partition Key is tenant_id (e.g., TENANT#corp_a), and the Sort Key is the entity identifier (e.g., USER#user_123). This groups all data for a single tenant under one partition key, so a tenant's queries never touch another tenant's items, enabling low-latency operations while keeping infrastructure costs minimal.

For our premium enterprise customers (the remaining 5%), who require strict physical data isolation and custom encryption keys (BYOK), we deploy a Silo Model. They are allocated to dedicated, isolated database instances, fully neutralizing any noisy neighbor interference."*

Interviewer: "Excellent. You mentioned that smaller customers share the same Pooled database. How do you prevent a single 'Noisy Neighbor' from completely starving the resources of other tenants sharing that database?"

Candidate: *"To protect against Noisy Neighbors in our pooled storage tier, we implement three distinct layers of resilience:

First, we deploy a Redis-backed Token Bucket Rate Limiter at the API Gateway. This limiter enforces strict Requests-Per-Second (RPS) quotas mapped to each tenant's pricing tier. If an application attempts to exceed its quota, we immediately return an HTTP 429 Too Many Requests at the edge, blocking the traffic before it hits our database.

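Second, we enforce capacity limits in the database itself. DynamoDB throttles each physical partition automatically (roughly 3,000 read units and 1,000 write units per second), but it has no built-in per-tenant quota, so tenant-level limits still have to come from the application. In ScyllaDB, we can use workload prioritization, which attaches service levels to roles so that each tenant tier receives a proportional share of CPU and I/O.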

Third, we monitor shard utilization. If a tenant consistently uses more than 20% of a shared database's resource capacity, our monitoring tools trigger a background Shard Migration. We execute an asynchronous read-stream script to copy their partition to a dedicated Silo database, dynamically updating our Context Router metadata in Redis without downtime."*

Interviewer: "Very impressive. What about security? How do you prevent developer error from leaking Tenant A's private data to Tenant B in a shared table?"

Candidate: *"We completely eliminate developer-level leakage risks by removing manual partition query construction from the application layer.

We implement a strict SDK-Level Query Interceptor inside our database client wrapper. When a query is executed, our interceptor automatically extracts the authenticated tenant_id from the request-scoped context (populated from the JWT claims validated by our API Gateway) and injects it into the Partition Key prefix before sending the command to the database.

Furthermore, we configure Logical Database Routing. The application API does not connect using a superuser credential. Instead, the Context Router initializes distinct sessions using temporary IAM credentials whose policy restricts item access by partition key (for example, a dynamodb:LeadingKeys condition that allows only TENANT#tenant_corp_a). This ensures that even if a developer writes a buggy query, the database engine itself rejects the call, giving us a second line of defense for our tenant boundaries."*

Interviewer: "Outstanding! That shows a deep, practical understanding of SaaS sharding, security, and runtime resilience."

