Multi-tenancy is the architectural core of modern Software-as-a-Service (SaaS) platforms. By enabling a single deployed application instance to serve multiple distinct business entities (tenants), companies achieve high economy of scale, consolidated maintenance, and rapid deployment velocity. However, the choice of multi-tenant isolation patterns shapes every subsequent engineering decision—including database sharding, caching, security models, compliance, and disaster recovery.
Selecting an suboptimal tenant isolation design on day one can impose catastrophic limitations as the business scales. This system design guide provides the architectural blueprints, database DDL, operational runbooks, and compilable source code to implement Grade A multi-tenant backends.
1. Requirements & Core Constraints
To build a high-performance, resilient multi-tenant platform, we must define specific operational targets and constraints.
Functional Constraints
- Tenant Onboarding & Lifecycle: New tenants must be provisioned automatically in under 60 seconds, allocating isolated database structures depending on their purchased subscription tiers.
- Dynamic Suspension: Platform administrators must be able to suspend a tenant instantly, terminating active database connections and blocking API access across the entire cluster.
- Tenant-Aware Access: Every incoming transaction must be cryptographically verified and bound to a single tenant context, preventing any form of cross-tenant data leakage.
- Custom Schema Extension: High-tier enterprise tenants must have the ability to register custom columns and extension fields without requiring database schema changes for other tenants.
Non-Functional SLAs
- Global Tenancy Capacity: The backend must scale to support 50,000 distinct tenants on a single shared-nothing application cluster.
- Latency SLAs: Database query routing and context parsing must add less than 1 millisecond of overhead. The overall p99 API latency must remain under 20 milliseconds.
- High Uptime Posture: Maintain 99.999% system availability, ensuring database migrations or failovers for single tenants do not cause cascading downtime for other healthy tenants.
- Noisy Neighbor Isolation: Restrict resource utilization (CPU, memory, disk I/O, connection limits) so that a traffic spike on Tenant A does not degrade the performance of Tenant B.
Back-of-the-Envelope Estimates
Let us compute the database footprint, connection pooling scaling limits, and memory constraints for a platform housing 50,000 tenants:
- Connection Saturated Overhead: Relational databases (like PostgreSQL) allocate process memory per open connection. By default, each connection consumes approximately 10 Megabytes of server RAM. If we utilize a "Database-per-Tenant" model with a modest pool size of 10 connections per tenant, maintaining 50,000 tenants would require: $$\text{Total Connections} = 50,000 \times 10 = 500,000 \text{ connections}$$ $$\text{Total Database RAM} = 500,000 \times 10 \text{ MB} = 5,000,000 \text{ MB} \approx 5 \text{ Terabytes of RAM}$$ This is a physical impossibility for standard database hardware. Thus, a naive database-per-tenant pattern cannot scale beyond a few hundred tenants without using dynamic connection multiplexers or moving to a shared database model.
- PostgreSQL Catalog Limits: Under the "Schema-per-Tenant" model, a single Postgres instance hosts multiple schemas. Postgres stores metadata in its system catalogs (
pg_class,pg_attribute). As the schema count grows, the catalogs bloat. Assuming each tenant schema contains 100 tables with indices, 50,000 schemas would generate: $$\text{Total Catalog Objects} = 50,000 \text{ schemas} \times 100 \text{ tables} \approx 5,000,000 \text{ catalog objects}$$ At this scale, system catalog queries (used internally by the query planner) fail to fit into RAM, causing query planning times to balloon from 0.5 milliseconds to over 500 milliseconds. Hence, a single Postgres instance should be limited to a maximum of 1,000 schemas before sharding physically across multiple server nodes.
2. API Design & Core Contracts
API endpoints in a multi-tenant platform must explicitly handle tenant identifiers, routing parameters, and admin onboarding commands.
1. Platform Admin Tenant Onboarding API
POST /api/v1/admin/tenants
Invoked by the platform management portal to register a new tenant and trigger database structure provisioning.
Request Headers:
Content-Type: application/json
Authorization: Bearer <ADMIN_SUPERUSER_JWT>
X-Request-ID: onboarding-tx-9988-77
Request Payload:
{
"tenantName": "Acme Enterprise Corp",
"subdomain": "acme",
"planTier": "ENTERPRISE",
"isolationModel": "DATABASE_PER_TENANT",
"region": "us-east-1",
"customDbConfig": {
"allocatedCpu": 4.0,
"allocatedRamGb": 16,
"storageLimitGb": 500
}
}
Response Payload (201 Created):
{
"tenantId": "tnt_88a7b6c5-d4e3-4f2a-b1c0-998877665544",
"subdomain": "acme.csp-saas.com",
"status": "PROVISIONING",
"databaseHost": "db-pool-03.us-east-1.rds.amazonaws.com",
"allocatedSchema": "tenant_acme",
"createdAt": "2026-05-22T22:00:00Z"
}
2. Tenant Context Routing Contracts
For all runtime application APIs, the tenant identifier is resolved implicitly through HTTP headers, secure JWT claims, or DNS hostnames.
Option A: Subdomain Binding (Recommended for Browsers)
The client browser makes a request:
GET /api/v1/orders/ORD-998877 HTTP/1.1
Host: acme.csp-saas.com
Authorization: Bearer <TENANT_SPECIFIC_JWT>
Option B: Dedicated Tenant Header (Recommended for Internal Microservices)
For backend-to-backend communication:
GET /api/v1/orders/ORD-998877 HTTP/1.1
Host: internal-service.csp-saas.local
X-Tenant-ID: tnt_88a7b6c5-d4e3-4f2a-b1c0-998877665544
Authorization: Bearer <SERVICE_MESH_JWT>
Standard API Error Response (403 Forbidden - Tenant Suspension/Leakage Prevention):
{
"errorCode": "TENANT_ACCESS_DENIED",
"message": "Access denied for tenant account. Reason: ACCOUNT_SUSPENDED",
"tenantId": "tnt_88a7b6c5-d4e3-4f2a-b1c0-998877665544",
"timestamp": 1782236500
}
3. High-Level Design (HLD)
The system architecture utilizes a centralized metadata routing directory and dynamic application middleware to intercept and route tenant queries.
1. Global Multi-Tenant Architecture
The following topology maps clients through an Anycast Load Balancer and API Gateway to stateless servers, which dynamically route requests to the appropriate database structures depending on tenant context:
graph TD
%% Clients
ClientAcme[Acme Web Client - acme.app.com] -->|1. Request| LB[DNS Load Balancer]
ClientBeta[Beta Web Client - beta.app.com] -->|1. Request| LB
%% API Gateway
LB -->|L7 Load Balancing| APIGateway[API Gateway / Envoy Proxy]
%% Router Directory lookup
APIGateway -->|2. Check Authentication & Context| AuthSvc[Global Auth & Tenant Directory Service]
AuthSvc -->|Cache Catalog Hit| CatalogCache[("Redis Tenant Directory Catalog")]
%% App Fleet
APIGateway -->|3. Forward with X-Tenant-ID| AppFleet[Stateless Application Fleet]
%% Data Isolation Routing
subgraph Data Tier (Three Isolation Models)
AppFleet -->|Model 3: Database-per-Tenant| DBEnterprise[("Dedicated DB: Enterprise Tenant (acme)")]
AppFleet -->|Model 2: Schema-per-Tenant| DBSharedServer[("Shared DB Server (pool-01)")]
subgraph Schema Pool
DBSharedServer -->|Schema: tenant_beta| SchemaBeta[Tenant Beta Tables]
DBSharedServer -->|Schema: tenant_gamma| SchemaGamma[Tenant Gamma Tables]
end
AppFleet -->|Model 1: Shared Schema / RLS| DBSharedRLS[("Shared Postgres DB (pool-02)")]
subgraph Shared Tables
DBSharedRLS -->|RLS Policy: current_setting| RlsOrders[orders table with tenant_id col]
end
end
classDef database fill:#0d3b66,stroke:#f4d35e,stroke-width:2px,color:#fff;
classDef cluster fill:#2e0f38,stroke:#f4d35e,stroke-width:2px,color:#fff;
classDef client fill:#3d5a80,stroke:#293241,stroke-width:2px,color:#fff;
classDef gateway fill:#ee6c4d,stroke:#293241,stroke-width:2px,color:#fff;
class DBEnterprise,DBSharedServer,DBSharedRLS,CatalogCache database;
class AppFleet,AuthSvc cluster;
class ClientAcme,ClientBeta client;
class APIGateway,LB gateway;
2. Thread-Local Tenant Context & Query Routing Sequence
The sequence diagram below shows how an HTTP request is parsed, authenticated, converted into a Thread-Local context, and executed against the target database connection:
sequenceDiagram
autonumber
actor Client as Client Browser
participant Gateway as API Gateway
participant Interceptor as Tenant Interceptor (App)
participant Context as TenantContext (ThreadLocal)
participant Router as TenantRoutingDataSource
participant DB as Target Database Server
Client->>Gateway: GET /api/orders (Host: acme.csp-saas.com)
Gateway->>Gateway: Authenticate JWT & Extract Tenant ID
Gateway->>Interceptor: Forward request with header [X-Tenant-ID: tnt_acme]
Interceptor->>Context: SetCurrentTenant("tnt_acme")
Interceptor->>Interceptor: Proceed with Controller Method
Interceptor->>Router: Execute database lookup (Repository)
Router->>Context: getCurrentTenant()
Context-->>Router: Return "tnt_acme"
Router->>Router: Fetch or Create Connection Pool for "tnt_acme"
Router->>DB: Open Connection & Execute: SET search_path TO tenant_acme
DB-->>Router: Return Query Result
Router-->>Interceptor: Return Entity List
Interceptor->>Context: clear() (CRITICAL: Prevent thread pollution)
Interceptor-->>Client: 200 OK Response
4. Low-Level Design & Database Models
Selecting the appropriate database isolation strategy determines your cost footprint, scaling limits, and security compliance.
1. Database Isolation Comparison Matrix
| Selection Factor | Shared Schema (Model 1) | Schema-Per-Tenant (Model 2) | Database-Per-Tenant (Model 3) |
|---|---|---|---|
| Isolation Strength | Logical only. Shared CPU, RAM, Disk, Tables. | Logical Schema namespace. Shared CPU, RAM, Disk. | Physical process isolation. Isolated CPU, RAM, Disk. |
| Hardware Efficiency | Highest. Zero idle resource overhead. | High. Shares connection pools. | Lowest. High idle memory overhead. |
| Operational Scaling | Up to 100,000+ tenants. | Up to 1,000 tenants per database. | Around 100-200 tenants per database. |
| Data Deletion / GDPR | Complex. Requires filtering and running batch deletes. | Simple. Run DROP SCHEMA tenant_id CASCADE. |
Simplest. Run DROP DATABASE tenant_id. |
| Custom Schema Ability | Very difficult (requires dynamic JSONB attributes). | Possible (custom tables in tenant schema). | Easiest (complete schema customizability). |
| Average Cost per Tenant | Micro-cents per tenant. | Low. | High (typically $15 to $100+ monthly). |
2. Low-Level Database Schemas (PostgreSQL DDL)
Primary Catalog Database (Platform Directory)
Resides on a central management instance. Tracks tenant connection configurations.
-- Catalog Database Registration
CREATE TABLE tenant_catalog (
tenant_id VARCHAR(64) PRIMARY KEY,
tenant_name VARCHAR(255) NOT NULL,
subdomain VARCHAR(128) UNIQUE NOT NULL,
plan_tier VARCHAR(32) NOT NULL DEFAULT 'STANDARD', -- 'STANDARD', 'PREMIUM', 'ENTERPRISE'
isolation_model VARCHAR(32) NOT NULL, -- 'SHARED_SCHEMA', 'SCHEMA_PER_TENANT', 'DATABASE_PER_TENANT'
connection_url TEXT NOT NULL, -- Destination DB connection string
db_username VARCHAR(128) NOT NULL,
db_password_encrypted TEXT NOT NULL,
status VARCHAR(32) NOT NULL DEFAULT 'ACTIVE', -- 'ACTIVE', 'SUSPENDED', 'PROVISIONING'
created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_tenant_catalog_subdomain ON tenant_catalog(subdomain);
Shared Schema Table with PostgreSQL Row-Level Security (RLS)
Used for STANDARD tier tenants to maximize resource utilization.
-- Orders table shared by multiple standard tenants
CREATE TABLE shared_orders (
order_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id VARCHAR(64) NOT NULL,
customer_id VARCHAR(64) NOT NULL,
total_amount DECIMAL(15, 4) NOT NULL,
currency VARCHAR(3) NOT NULL DEFAULT 'USD',
status VARCHAR(32) NOT NULL,
created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);
-- Index by tenant_id first to ensure fast isolated index scans
CREATE INDEX idx_shared_orders_tenant_lookup ON shared_orders(tenant_id, created_at DESC);
-- Enable RLS on the table
ALTER TABLE shared_orders ENABLE ROW LEVEL SECURITY;
-- Construct the security policy referencing application configuration settings
CREATE POLICY tenant_isolation_policy ON shared_orders
FOR ALL
USING (tenant_id = current_setting('app.current_tenant_id', true));
3. Compilable Java Multi-Tenant Router Class
This production-grade, thread-safe Java engine utilizes Spring Framework's AbstractRoutingDataSource and HikariCP connection pooling to switch databases dynamically based on the thread-local tenant context:
package com.codesprintpro.multitenancy;
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import org.springframework.jdbc.datasource.lookup.AbstractRoutingDataSource;
import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.SQLException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.logging.Level;
import java.util.logging.Logger;
/**
* Thread-safe DataSource Router that dynamically resolves the active tenant context
* and provisions or routes to the appropriate HikariCP Connection Pool.
*/
public class TenantRoutingDataSource extends AbstractRoutingDataSource {
private static final Logger LOGGER = Logger.getLogger(TenantRoutingDataSource.class.getName());
// In-memory cache of dynamically provisioned data sources per tenant
private final Map<Object, Object> tenantDataSources = new ConcurrentHashMap<>();
public TenantRoutingDataSource() {
// Register the backing map to AbstractRoutingDataSource
this.setTargetDataSources(tenantDataSources);
}
@Override
protected Object determineCurrentLookupKey() {
String tenantId = TenantContext.getCurrentTenant();
if (tenantId == null) {
LOGGER.log(Level.FINE, "No tenant context resolved. Falling back to default data source.");
return "DEFAULT_DATASOURCE";
}
return tenantId;
}
@Override
protected DataSource determineTargetDataSource() {
Object lookupKey = determineCurrentLookupKey();
if (lookupKey.equals("DEFAULT_DATASOURCE")) {
return (DataSource) this.getResolvedDefaultDataSource();
}
// Thread-safe dynamic provisioning if the tenant pool does not exist
return (DataSource) tenantDataSources.computeIfAbsent(lookupKey, tenantId -> {
LOGGER.log(Level.INFO, "Provisioning new dynamic connection pool for tenant: {0}", tenantId);
return createHikariDataSourceForTenant((String) tenantId);
});
}
private DataSource createHikariDataSourceForTenant(String tenantId) {
// In production, you would fetch these credentials from your central tenant_catalog DB or Secrets Manager
String jdbcUrl = "jdbc:postgresql://localhost:5432/tenant_pool_db";
String username = "tenant_app_user";
String password = "secure_password_123";
HikariConfig config = new HikariConfig();
config.setJdbcUrl(jdbcUrl);
config.setUsername(username);
config.setPassword(password);
// Enforce tight pool limits per tenant to prevent connection starvation on the database
config.setMaximumPoolSize(8);
config.setMinimumIdle(2);
config.setIdleTimeout(300000); // 5 minutes
config.setConnectionTimeout(3000); // 3 seconds
config.setPoolName("HikariPool-Tenant-" + tenantId);
// Optimize performance
config.addDataSourceProperty("cachePrepStmts", "true");
config.addDataSourceProperty("prepStmtCacheSize", "250");
config.addDataSourceProperty("prepStmtCacheSqlLimit", "2048");
return new HikariDataSource(config);
}
// ThreadLocal context manager for application threads
public static class TenantContext {
private static final ThreadLocal<String> CURRENT_TENANT = new ThreadLocal<>();
public static void setCurrentTenant(String tenantId) {
CURRENT_TENANT.set(tenantId);
}
public static String getCurrentTenant() {
return CURRENT_TENANT.get();
}
public static void clear() {
CURRENT_TENANT.remove();
}
}
}
5. Scaling Challenges & System Bottlenecks
Scaling multi-tenant applications to tens of thousands of active tenants reveals structural bottlenecks that typical single-tenant backends never encounter.
1. Connection Pool Exhaustion (Noisy Neighbors)
- The Bottleneck: Relational databases have a strict limit on active TCP connections (e.g. 500 connections on a medium cloud instance). If 50 tenants spike in traffic and fully consume their connection allocations, they saturate the database. All remaining tenants will receive connection timeout errors, creating a platform-wide outage.
- The Mitigation: PgBouncer + Dynamic Sizing:
- We deploy PgBouncer in transaction mode between our application servers and the PostgreSQL database. PgBouncer multiplexes thousands of virtual application connections down to a few dozen physical database connections.
- We implement an adaptive pool sizing algorithm in the application framework: standard tier tenants share a shared connection pool, whereas premium tier tenants have dynamically adjusted pools with hard limits enforced via rate limiting.
2. Noisy Neighbor CPU & Memory Starvation
- The Bottleneck: In shared database models, one tenant executing massive analytics queries (e.g., pulling 5 million records for an end-of-year report) saturates the database CPU and disk read queues. This starves the hardware, driving p99 write latency for all other tenants from 15 milliseconds to over 5,000 milliseconds.
- The Mitigation: Application-Level Throttle Gates:
- We restrict query complexity for lower-tier tenants. Large analytical requests are blocked on the primary transaction database and must be routed to a read replica or an asynchronous analytics pipeline (e.g., snowflake or BigQuery).
- Use database resource queues (e.g. PostgreSQL
pg_cgroupsor AWS Aurora Custom Endpoints) to prioritize high-paying tenant queries.
3. Schema Migrations at Massive Scale
- The Bottleneck: Running a DDL schema change (e.g.,
ALTER TABLE orders ADD COLUMN discount DECIMAL) takes 1 second. Under a database-per-tenant or schema-per-tenant pattern with 10,000 tenants, looping through and running this migration sequentially takes: $$\text{Total Duration} = 10,000 \times 1 \text{ second} = 10,000 \text{ seconds} \approx 2.7 \text{ hours}$$ If a single migration step fails halfway through (e.g. at tenant 4,500 due to a lock timeout), the system enters a split-state, with some tenants on version N and others on version N+1. - The Mitigation: Asynchronous Partitioned Migrations:
- We never run migrations sequentially in a single thread. Instead, we register database migrations inside a message queue (Kafka/RabbitMQ) partitioned by
tenant_id. - A cluster of asynchronous migration workers consumes these migration tasks in parallel, executing schema upgrades concurrently across 100 databases at a time. This reduces overall migration times from hours to under 3 minutes.
- The application codebase is strictly written to support both version N and version N+1 schemas simultaneously, utilizing feature-flags to activate new features only after all database structures are successfully upgraded.
- We never run migrations sequentially in a single thread. Instead, we register database migrations inside a message queue (Kafka/RabbitMQ) partitioned by
6. Technical Trade-offs & Compromises
1. Cost Efficiency vs. Compliance Hard-Lines
2. Multi-Tenant DB Migration Strategy
[!PITFALL] Dangers of Locking Schema Upgrades Executing an
ALTER TABLEstatement on PostgreSQL can trigger an exclusive lock (AccessExclusiveLock) on the target table. If there is an active, long-running query on that table, the migration waits. This blocks all subsequent queries behind it in the lock queue, locking up the database within seconds. We mitigate this by enforcing strict migration guidelines:
- Never execute locking migrations during peak business hours.
- Force all migrations to use low lock timeouts (
SET lock_timeout = '2s'). If a lock cannot be acquired within 2 seconds, the migration fails gracefully and retries later.- Utilize tools like
pg_repackto reconstruct indexes online without blocking concurrent reads or writes.
7. Failure Scenarios & Operational Resiliency
1. The Superuser RLS Bypass Vulnerability
If an application connection pool accidentally authenticates to the database as a superuser (e.g. postgres or rds_superuser), PostgreSQL silently bypasses Row-Level Security. This completely disables the tenant_isolation_policy, allowing any application bug to dump data across all tenants.
- Recovery & Prevention Protocol:
- We run strict automated CI/CD security sweeps to confirm the database credentials used in application properties belong to a highly restricted user account.
- The application user is explicitly granted the
BYPASSRLSproperty set toNOBYPASSRLS:ALTER ROLE tenant_app_user NOBYPASSRLS; - We configure an automated audit trigger in the database to alert immediately if any session executing a query bypasses security filters.
2. Tenant Suspend Lockouts
When a tenant's billing fails or their account is compromised, the operator suspends them. However, active HTTP sessions or open database connection pools on the gateway nodes can continue to execute cached transactions.
- Recovery & Prevention Protocol:
- We implement an event-driven suspension pipeline. When a tenant is suspended, an event is published to a Kafka broadcast topic.
- Every application instance consumes this event, evicts the tenant's metadata from its local cache, and calls:
tenantDataSourceService.getDataSource(tenantId).close(); - The stateless instances forcefully terminate active Hikari connection pools for that tenant, severing database access in under 500 milliseconds globally.
8. Candidate Verbal Script (Interview Guide)
Below is an verbatim mock interview dialogue showcasing how a candidate navigates the multi-tenancy system design:
Interviewer: "How would you design the storage layer for a high-scale B2B SaaS platform that serves 50,000 business tenants, ranging from small startups to Fortune 500 companies?"
Candidate: *"To scale a platform to 50,000 tenants, I will architect a hybrid multi-tenant database tier. A single model cannot serve this entire distribution. I will divide tenants into two categories based on compliance requirements and billing tiers:
- Standard & Growth Tiers (Model 1 - Shared Schema): Startups and standard users share a single PostgreSQL database instance and the same database tables. To prevent cross-tenant data leakage, I will enable PostgreSQL Row-Level Security (RLS) on all tenant-scoped tables. When a database connection is retrieved from the application pool, the middleware sets the active tenant context using
SET app.current_tenant_id = 'tnt_123'. RLS automatically filters all queries at the database engine level, preventing accidental data dumps. - Enterprise Tier (Model 3 - Database-per-Tenant): Enterprise accounts with strict security mandates, custom encryption needs (BYOK), and massive workloads are provisioned on dedicated database instances. This eliminates noisy neighbor CPU starvation and isolates their data completely.
To manage this dynamically, I will deploy a stateless Application Fleet utilizing a custom
TenantRoutingDataSourceextension of Spring’sAbstractRoutingDataSource. This class resolves the active tenant ID from JWT claims or HTTP headers, retrieves the tenant’s metadata from a fast Redis Catalog Cache, and dynamically routes transactions to either the shared RLS database or the tenant's dedicated database connection pool."*
Interviewer: "Excellent. But under the RLS model, how do you handle noisy neighbors? If a massive Standard tenant runs a query that consumes all database IOPS, it ruins the experience for the other 1,000 standard tenants."
Candidate: *"Yes, this is the classic row-level noisy neighbor problem. I will deploy a three-layered defense:
- Application Rate Limiting: I will enforce strict per-tenant QPS limits at the API Gateway level using Redis token buckets. Standard tenants are capped at 50 requests/second. Any spike beyond that receives an HTTP
429 Too Many Requestserror, protecting the downstream databases from overload. - Execution Limits & Read Replica Offloading: I set a strict timeout on all user queries on the shared database (
SET statement_timeout = '5s'). This prevents long-running, poorly indexed queries from locking up database resources. Furthermore, any aggregate analytics queries are blocked on the primary transaction instance and routed to read replicas or a data warehouse like Snowflake. - Database-level CPU Allocations: I will group database users into resource groups or utilize Amazon Aurora Custom Endpoints to guarantee CPU allocations for premium tier tenants, ensuring they always have dedicated execution threads."*
Interviewer: "What happens when you need to run a schema migration on your 50,000 shared schema database or across the enterprise databases?"
Candidate: "For the shared schema database, it's a single table modification. To prevent database lockups, I will run the migration using a low lock timeout (SET lock_timeout = '2s') and implement index additions using the CONCURRENTLY keyword.
For the dedicated enterprise databases, we cannot run migrations in a single loop. I will publish the migration jobs (which contain the target database host, script, and credentials) to a partitioned Kafka topic. A cluster of asynchronous migration workers will consume these jobs in parallel, updating 100 enterprise databases concurrently. The application code is built to support both version N and version N+1 simultaneously. Once all databases report a successful migration, I will flip a feature flag to activate the new version's features globally."