Case Study: Designing a Global Payment Gateway (Stripe Scale)

Designing a payment gateway (like Stripe, Adyen, or PayPal) is a unique engineering challenge. Unlike social platforms where eventual consistency is perfectly acceptable, payment systems operate under a zero-fault tolerance mandate. Dropping a transaction, charging a card twice, or producing mismatched ledger entries represents direct financial loss, legal liability, and a breach of trust.

This case study designs a globally resilient, PCI-compliant payment gateway orchestration platform operating at stripe-scale, processing millions of transactions daily under absolute atomic guarantees.

1. Requirements & Core Constraints

Functional Requirements

Tokenization Vault: Cards must be securely tokenized at the edge to isolate the system from raw credit card numbers, maintaining strict PCI-DSS compliance boundaries.
Payment Orchestration: Process charges asynchronously or synchronously across multiple third-party bank card processors (Visa, Mastercard, Stripe, Chase).
Double-Entry Ledger: Maintain an immutable audit trail of every cent that enters or leaves the platform.
Idempotent Executions: Absolutely guarantee that any payment request, regardless of merchant retries, network failures, or internal system timeouts, is processed exactly once.

Non-Functional Requirements (SLAs)

ACID Consistency: Financial ledger entries must possess strong transactional consistency. Eventual consistency is structurally unacceptable.
High Availability: The API endpoint must achieve 99.999% availability ("Five Nines"). Merchants rely on continuous checkout uptime.
Secure Compliance: Enforce strict cryptographic separation between cardholder data (PAN) and the rest of the application ecosystem.

Back-of-the-Envelope Estimation

Daily Transactions: $10,000,000$ transactions/day
Average QPS: $$\text{Average QPS} = \frac{10,000,000}{86400} \approx 115.7 \text{ QPS}$$
Peak QPS (10x Factor during Black Friday): $$\text{Peak QPS} = 1,157 \text{ QPS}$$
Ledger Storage Space:
- Each payment transaction creates roughly 4 double-entry ledger rows.
- Each ledger row occupies approximately 256 bytes of data.
- Daily storage requirement: $$\text{Storage/Day} = 10,000,000 \times 4 \text{ entries} \times 256 \text{ bytes} \approx 10.24 \text{ GB/day}$$
- For 5 years of historical retention: $$\text{Storage (5 years)} = 10.24 \text{ GB/day} \times 365 \times 5 \approx 18.7 \text{ TB}$$
- This volume is easily manageable by modern sharded PostgreSQL or Google Spanner clusters, ensuring excellent transactional query latencies.

2. API Design & Core Contracts

Merchants interact with the Gateway via secure HTTPS endpoints. All state changes require cryptographic signatures and unique idempotency contexts.

Initiating a Payment Intent

Create an ephemeral payment intent to track the transaction lifecycle.

Endpoint: POST /v1/payment_intents

Headers:

Authorization: Bearer sk_live_51NABC12345
Idempotency-Key: idemp_UUID_9901823908
Content-Type: application/json

Request Payload:

{
  "amount": 15000,
  "currency": "usd",
  "payment_method_types": ["card"],
  "description": "Enterprise Subscription SaaS Charge",
  "metadata": {
    "order_id": "ord_8829103",
    "tenant_id": "tenant_4452"
  }
}

Response Payload (HTTP 201 Created):

{
  "id": "pi_881a2f9b8c",
  "object": "payment_intent",
  "amount": 15000,
  "currency": "usd",
  "status": "requires_payment_method",
  "client_secret": "pi_881a2f9b8c_secret_XYZ991823",
  "created_at": 1774312860
}

3. High-Level Design (HLD)

The architecture isolates the highly secure tokenization environment (PCI-DSS Zone) from the complex business logic (Core Zone) to simplify security audits.

graph TD
    Client[Client Browser / App] -->|1. Post Card Details| Vault[Secure PCI Token Vault]
    Vault -->|2. Return Token token_4432| Client
    Client -->|3. Submit Order + token_4432| Merchant[Merchant Server]
    Merchant -->|4. POST /v1/charges| Gateway[API Gateway & Idempotency Filter]
    
    subgraph Core Transaction Zone
        Gateway -->|5. Intercept & Check Key| IdempDB[(Idempotency Store Redis)]
        Gateway -->|6. Route Request| PaymentCore[Payment Core Orchestrator]
        PaymentCore -->|7. Query Token| Vault
        PaymentCore -->|8. Create Pending Intent| MainDB[(PostgreSQL Master)]
    end
    
    subgraph Asynchronous Ledger & Event Zone
        PaymentCore -->|9. Dispatch to Bank| Acquirer[Card Acquirer / Visa Net]
        Acquirer -->|10. Capture Confirm| PaymentCore
        PaymentCore -->|11. Commit Double-Entry| LedgerService[Ledger Engine]
        LedgerService -->|12. Transactional Write| LedgerDB[(Double-Entry PostgreSQL)]
        
        LedgerService -->|13. Push State Update| Kafka[Kafka Event Stream]
        Kafka -->|14. Consume & Push| Webhooks[Webhook Dispatcher]
        Webhooks -->|15. Webhook HTTP POST| Merchant
    end
    
    style Vault fill:#047857,stroke:#fff,stroke-width:2px,color:#fff
    style MainDB fill:#1e40af,stroke:#fff,stroke-width:2px,color:#fff
    style LedgerDB fill:#b91c1c,stroke:#fff,stroke-width:2px,color:#fff

End-to-End Request Flow:

Tokenization Flow: The client browser sends card details directly to a hardened, isolated PCI Token Vault. The vault saves the card number, generates a safe token (token_4432), and returns it to the client. The merchant's backend never touches raw credit card numbers.
Gateway Check: The merchant backend calls /v1/charges using the token and an Idempotency-Key. The gateway intercepts the call and checks Redis. If the key exists, it safely returns the cached state.
Acquiring Transaction: The Payment Core Orchestrator requests detokenization from the vault and sends a secure request to the financial acquirer (e.g., Visa Net or Chase Paymentech).
Immutable Ledger Logging: Once the bank approves the charge, the system writes entries into the Double-Entry Ledger DB and fires an event to Kafka to trigger asynchronous webhooks.

4. Low-Level Design (LLD) & Data Models

1. Database Rationale: ACID Consistency

We utilize PostgreSQL or Google Spanner to back our transaction layer. Financial transactions require absolute transactional correctness, serializable isolation levels, and ACID compliance to prevent race conditions during updates. NoSQL databases like Cassandra or MongoDB are entirely excluded from the ledger path.

2. SQL DDL Declarations

Below is the production-grade schema enforcing structural limits:

-- Enforces transactional idempotency states
CREATE TABLE idempotency_keys (
    idempotency_key VARCHAR(255) PRIMARY KEY,
    merchant_id VARCHAR(50) NOT NULL,
    request_hash VARCHAR(64) NOT NULL, -- SHA-256 of payload body
    status VARCHAR(20) NOT NULL, -- 'PENDING', 'SUCCESS', 'FAILED'
    response_code INT,
    response_body TEXT,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

-- Main payment transaction intents
CREATE TABLE payment_intents (
    intent_id VARCHAR(50) PRIMARY KEY,
    merchant_id VARCHAR(50) NOT NULL,
    amount_cents BIGINT NOT NULL,
    currency VARCHAR(3) NOT NULL,
    status VARCHAR(30) NOT NULL, -- 'REQUIRES_PAYMENT_METHOD', 'PROCESSING', 'SUCCEEDED', 'FAILED'
    token_id VARCHAR(100) NOT NULL,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

-- Core immutable double-entry ledger table
CREATE TABLE ledger_entries (
    entry_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    intent_id VARCHAR(50) REFERENCES payment_intents(intent_id),
    account_id VARCHAR(50) NOT NULL, -- Account mapping target
    entry_type VARCHAR(10) NOT NULL, -- 'DEBIT' (Subtract), 'CREDIT' (Add)
    amount_cents BIGINT NOT NULL,
    currency VARCHAR(3) NOT NULL,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_ledger_account ON ledger_entries(account_id, created_at);
CREATE INDEX idx_ledger_intent ON ledger_entries(intent_id);

5. Scaling Challenges & System Bottlenecks

1. Database Hot Spots & Multi-Tenant Lock Contention

High-volume merchants (e.g., Shopify, Uber) initiate thousands of concurrent charges per second. If the system updates the merchant's balance block synchronously via SQL:

UPDATE merchant_balances SET balance = balance + 15000 WHERE merchant_id = 'merchant_uber';

This forces row-level locks on 'merchant_uber', blocking all concurrent transaction threads and causing immediate database starvation.

The Solution: Avoid balances updates on every transaction. Instead, append rows into ledger_entries (which is insert-only and avoids row locks). Balances are computed dynamically as aggregations of entries, or aggregated via asynchronous batch daemons feeding read caches.

2. High-Throughput Webhook Fanout

During peak sales events, the gateway fires hundreds of thousands of asynchronous webhooks. If merchant servers respond slowly (e.g., taking 3 seconds to process webhooks), the gateway's outbound thread pool will starve.

The Solution: Route webhook events into a partitioned Kafka Cluster. Implement an async consumer engine using Go goroutines or Node event-loops backing a retry queue with exponential backoff.

sequenceDiagram
    autonumber
    Merchant Server->>API Gateway: POST /charges (Idempotency-Key)
    API Gateway->>Redis Store: SETNX (key, "PENDING")
    alt Key already exists (Status: PENDING)
        Redis Store-->>API Gateway: Locked
        API Gateway-->>Merchant Server: HTTP 409 Conflict (Retry in progress)
    else Key exists (Status: SUCCESS)
        Redis Store-->>API Gateway: Fetch response payload
        API Gateway-->>Merchant Server: HTTP 200 OK (Return cached response)
    else New Request
        API Gateway->>Payment Processor: Execute Bank Capture
        Payment Processor-->>API Gateway: Capture Complete
        API Gateway->>Postgres DB: Commit Ledger Transaction
        API Gateway->>Redis Store: UPDATE key (SUCCESS, Payload)
        API Gateway-->>Merchant Server: HTTP 200 OK
    end

6. Resilience & Failure Scenarios

1. Third-Party API Timeouts & Network Fractures

Banks and payment processors are notoriously unstable, suffering from high latency spikes. If the payment gateway blocks waiting for a bank API to respond, client HTTP connection pools will saturate.

Resilience Plan: Configure short connection timeouts (e.g., 2.5 seconds) on bank connections. If the bank fails to respond, drop back to an Asynchronous Retry Queue. The transaction state moves to PROCESSING or PENDING and is resolved by an offline worker querying the acquirer's transaction status endpoint.

2. Double-Writes and the Transactional Outbox Pattern

When a payment succeeds, the system must write the change to Postgres and publish an event to Kafka. If Postgres commits successfully, but the server crashes before sending the event to Kafka, the merchant never receives the webhook notification.

Resilience Plan: Implement the Transactional Outbox Pattern. Write the Kafka event directly into an outbox table inside the same Postgres transaction. A separate, high-performance background publisher (e.g., Debezium CDC) reads from the outbox table and publishes events to Kafka with exactly-once delivery guarantees.

7. Staff Engineer Perspective & Key Technical Trade-offs

1. PostgreSQL Ledger Aggregation vs. Cassandra Wide-Rows

PostgreSQL (ACID Approach):
- Pros: Serializable isolation levels, absolute ACID correctness. Guaranteed lack of ledger balance corruption.
- Cons: Write performance degrades as database grows, requiring manual sharding by merchant_id.
Cassandra (Wide-Row Approach):
- Pros: Linear write scaling.
- Cons: Lack of multi-row transactions. Risk of missing ledger entry items during node partitions, leading to bookkeeping discrepancies.
Trade-off Decision: We trade absolute throughput for 100% Correctness. We choose PostgreSQL (or CockroachDB/Google Spanner) because financial accounting mistakes destroy company value far faster than hardware cost spikes.

8. Candidate Verbal Mock Interview Script

Interviewer: "How do you handle a scenario where a transaction fails halfway through? For example, the acquirer bank successfully captures the funds, but your system crashes before writing the success state to the ledger database?"

Candidate: "This is the classic distributed transaction coordination failure. To handle this, we separate our payment execution into a two-step pattern: Authorization followed by Reconciliation.

First, our payment state machine starts with a state of intent_created. When we call the card processor, we assign a unique, cryptographic transaction_id generated by our payment system. Even if our system crashes immediately after the bank captures the money, the funds are safely captured under that transaction ID at the acquiring bank.

Second, our system has a continuous, low-latency Reconciliation Daemon running asynchronously. This daemon polls our processor's transaction logs and matches them against our local PostgreSQL ledger records.

If the daemon finds a transaction that is recorded as 'Captured' on the bank's side but has no corresponding credit/debit entries in our ledger_entries table:

It immediately opens a local Postgres ACID transaction.
It writes the double-entry records (CREDIT merchant, DEBIT settlement) for the matching transaction.
It moves the payment_intents state to SUCCEEDED.
It marks the reconciliation record as resolved.

By ensuring our bank APIs accept our idempotency transaction IDs and relying on asynchronous reconciliation, we guarantee ledger correctness even during complete server infrastructure crashes."

System Design: Designing a Global Payment Gateway (Stripe Scale)

Distributed systems mechanics for engineers building serious backend platforms.