Payments look deceptively simple from the product interface. A customer clicks a pay button, the screen displays a spinner, and a green checkmark confirms that the transaction succeeded. Behind the scenes, however, payment processing is a complex distributed systems challenge.
A production payment system must coordinate with external, legacy banking networks, handle unstable cellular connections, manage API timeouts from payment gateways, prevent double-spending, and maintain an auditable financial ledger. If your system has even a tiny race condition, a customer might be double-billed, or a merchant might ship goods for an order that was never paid.
This system design case study details the architectural blueprint for building a secure, scalable, and resilient payment system capable of handling 10,000 transactions per second under peak load, drawing inspiration from global platforms like Stripe and regional real-time networks like UPI.
System Requirements
To build a mission-critical payment platform, we divide our requirements into functional, non-functional, and scale specifications.
Functional Requirements
- Payment Execution: Support card payments (via gateways like Stripe or Adyen) and instant account-to-account transfers (via protocols like UPI).
- Payment Sessions and Intents: Provide a stateful payment intent flow to track transactions from checkout initiation to bank settlement.
- Durable Ledgering: Record all money movements in an append-only, double-entry financial ledger.
- Refunds and Chargebacks: Expose APIs to reverse transactions partially or fully, and handle disputes asynchronously.
- Reconciliation Ingestion: Ingest end-of-day bank statement files and gateway reports to automatically match internal transactions against actual cash movements.
Non-Functional Requirements
- Strict Idempotency: Guarantee that retrying a payment request due to network timeouts never charges the customer twice.
- Strong Consistency: Ensure that ledger balances and payment states are linearizable; double-spends and duplicate captures must be rejected by the storage layer itself (constraints and transactions), not only by application checks.
- High Ingestion Availability: The front gate must remain highly available to accept payments, even if downstream third-party networks are slow or degraded.
- Data Durability: Transaction histories and ledger entries must be written to non-volatile storage and replicated across availability zones before confirming success.
- Security Compliance: Ensure the system is PCI-DSS compliant, encrypting all card numbers and relying on tokens for communication.
Scale Assumptions
- Peak Throughput: 10,000 transactions per second (TPS).
- Daily Volume: 100,000,000 payment events processed per day (about 1,160 per second on average, so the peak is roughly 8.6x the mean).
API Design and Service Contracts
Our payment system exposes secure REST interfaces to clients and transactional endpoints internally.
1. Create Payment Intent (POST /v1/payment_intents)
Invoked by the checkout frontend to start a payment session.
Request Payload:
{
  "amountCents": 9900,
  "currency": "USD",
  "customerId": "cust_uuid_88192a",
  "merchantId": "merch_uuid_1029a",
  "paymentMethodId": "pm_uuid_33219"
}
Response Payload (201 Created):
{
  "paymentIntentId": "pi_uuid_001a2b",
  "clientSecret": "sec_001a2b_tok_9912a",
  "status": "REQUIRES_ACTION",
  "nextAction": "AUTHENTICATE_3DS"
}
2. Confirm Payment Intent (POST /v1/payment_intents/{id}/confirm)
Invoked once the customer completes security authentication (e.g., 3D Secure or UPI PIN validation).
Request Payload:
{
  "paymentIntentId": "pi_uuid_001a2b",
  "idempotencyKey": "idem_tok_77218ac89b"
}
Response Payload (200 OK):
{
  "paymentIntentId": "pi_uuid_001a2b",
  "status": "PROCESSING",
  "chargeId": "ch_uuid_4410ab"
}
3. Payment Gateway Webhook (POST /v1/webhooks/stripe)
Asynchronously receives transaction outcome confirmations from the payment gateway.
Request Payload:
{
  "id": "evt_99182ab771",
  "type": "charge.succeeded",
  "data": {
    "object": {
      "id": "ch_uuid_4410ab",
      "amount": 9900,
      "currency": "usd",
      "payment_intent": "pi_uuid_001a2b",
      "status": "succeeded"
    }
  }
}
Response Payload (200 OK):
{
  "received": true
}
High-Level Architecture
The system splits routing paths between high-speed customer checkouts (the hot path) and ledger/reconciliation updates (the async path).
The Client Checkout UI posts to the Payment Gateway Service, which issues a payment intent and writes to the Idempotency Store. The gateway sends authorization requests to the External Payment Processor (like Stripe).
Once authorized, events are published to Kafka. The Ledger Service consumes these events and updates the append-only Ledger Database. At the end of the day, the Reconciliation Engine downloads bank statements from Bank SFTP servers and matches them against the ledger.
End-to-End Payment Execution Lifecycle
This diagram tracks the sequence of events from a checkout click to bank authorization and eventual ledger settlement.
sequenceDiagram
autonumber
participant Client as Client Browser
participant Gate as Payment Gateway Service
participant Exec as Payment Executor
participant PSP as External PSP (Stripe/UPI)
participant Kafka as Kafka Event Broker
participant Ledger as Ledger Service
Client->>Gate: POST /v1/payment_intents (amount, method)
Gate->>Gate: Check Idempotency Key
Gate->>Gate: Write Intent to Database
Gate-->>Client: Return paymentIntentId (requires 3DS)
Client->>Exec: POST /v1/payment_intents/{id}/confirm
Exec->>PSP: Direct Charge Request (Auth + Capture)
PSP-->>Exec: Charge Accepted (200 OK)
Exec->>Gate: Update status = COMPLETED
Exec->>Kafka: Publish PAYMENT_CAPTURED Event
Exec-->>Client: Confirm Success UI
Kafka->>Ledger: Consume Event & Write to Double-Entry Ledger
Double-Entry Ledger Transaction Flow
This diagram details how the system records funds moving across balance sheet accounts when a customer purchases goods.
graph TD
ClientFunds[Customer Payment: $100.00] --> Split{Ledger Engine}
Split -->|Debit Asset| BankAcc[Cash Account: +$100.00]
Split -->|Credit Liability| MerchRev[Merchant Payable: -$97.00]
Split -->|Credit Revenue| FeeAcc[Gateway Fees: -$3.00]
classDef acc fill:#f9f,stroke:#333,stroke-width:2px;
class BankAcc,MerchRev,FeeAcc acc;
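The invariant behind this flow is that the debits and credits of one posting sum to the same amount. The following is a minimal sketch of that check, not the ledger engine itself. The names Entry, split_payment, and assert_balanced, the account IDs, and the basis-point fee parameter are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Entry:
    account_id: str      # hypothetical account IDs, for illustration only
    direction: str       # "DEBIT" or "CREDIT"
    amount_cents: int    # always positive; direction carries the sign


def assert_balanced(entries: List[Entry]) -> None:
    """Raise if total debits and total credits differ."""
    debits = sum(e.amount_cents for e in entries if e.direction == "DEBIT")
    credits = sum(e.amount_cents for e in entries if e.direction == "CREDIT")
    if debits != credits:
        raise ValueError(f"unbalanced posting: debits={debits} credits={credits}")


def split_payment(amount_cents: int, fee_bps: int) -> List[Entry]:
    """Build the entries for one captured payment.

    fee_bps is the fee in basis points (300 = 3%). The fee is rounded down
    and the merchant side receives the remainder, so no cent is lost.
    """
    fee = amount_cents * fee_bps // 10_000
    entries = [
        Entry("cash", "DEBIT", amount_cents),
        Entry("merchant", "CREDIT", amount_cents - fee),
        Entry("fees", "CREDIT", fee),
    ]
    # Drop zero-amount rows, since ledger amounts must be strictly positive.
    entries = [e for e in entries if e.amount_cents > 0]
    assert_balanced(entries)
    return entries


# The $100.00 example from the diagram, with a 3% fee.
posting = split_payment(10_000, 300)
```

Working in integer cents and giving the remainder to one side avoids the rounding drift that floating-point percentages would introduce.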
Low-Level Design and Schema
For financial operations, data integrity is paramount. We implement a relational schema in PostgreSQL utilizing strong constraints and unique composite indexes.
-- Tracks user checkout intents and current status
CREATE TABLE payment_intents (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    customer_id VARCHAR(64) NOT NULL,
    merchant_id VARCHAR(64) NOT NULL,
    amount_cents BIGINT NOT NULL CHECK (amount_cents > 0),
    currency CHAR(3) NOT NULL,
    status VARCHAR(32) NOT NULL DEFAULT 'CREATED', -- CREATED, REQUIRES_ACTION, PROCESSING, COMPLETED, FAILED
    idempotency_key VARCHAR(128) NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    CONSTRAINT uk_tenant_idempotency UNIQUE (customer_id, idempotency_key)
);

CREATE INDEX idx_payment_intents_status
    ON payment_intents (status, created_at);

-- Tracks active transactions routed to external gateways
CREATE TABLE transaction_records (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    payment_intent_id UUID NOT NULL REFERENCES payment_intents(id) ON DELETE RESTRICT,
    psp_name VARCHAR(64) NOT NULL,
    external_charge_id VARCHAR(128) UNIQUE,
    amount_cents BIGINT NOT NULL,
    status VARCHAR(32) NOT NULL, -- SENT, CAPTURED, REJECTED, REFUNDED
    error_code VARCHAR(64),
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Chart of Accounts for double-entry ledger
CREATE TABLE ledger_accounts (
    account_id VARCHAR(64) PRIMARY KEY,
    account_name VARCHAR(128) NOT NULL,
    account_type VARCHAR(32) NOT NULL, -- ASSET, LIABILITY, EQUITY, REVENUE, EXPENSE
    currency CHAR(3) NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Append-only double-entry financial ledger entries
CREATE TABLE ledger_entries (
    entry_id BIGSERIAL PRIMARY KEY,
    transaction_id UUID NOT NULL REFERENCES transaction_records(id) ON DELETE RESTRICT,
    account_id VARCHAR(64) NOT NULL REFERENCES ledger_accounts(account_id) ON DELETE RESTRICT,
    direction VARCHAR(6) NOT NULL CHECK (direction IN ('DEBIT', 'CREDIT')),
    amount_cents BIGINT NOT NULL CHECK (amount_cents > 0),
    posted_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_ledger_entries_lookup
    ON ledger_entries (account_id, posted_at DESC);

-- Tracks refunds linked to original intents
CREATE TABLE refunds (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    payment_intent_id UUID NOT NULL REFERENCES payment_intents(id) ON DELETE RESTRICT,
    amount_cents BIGINT NOT NULL CHECK (amount_cents > 0),
    status VARCHAR(32) NOT NULL DEFAULT 'PENDING', -- PENDING, COMPLETED, FAILED
    reason TEXT,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
Schema Rationale & Integrity Constraints
- uk_tenant_idempotency: Enforces uniqueness on (customer_id, idempotency_key). If a checkout client makes duplicate API calls due to a network glitch, this constraint rejects the second insert at the database level, so a retry cannot create a second intent and therefore cannot produce a second charge.
- ON DELETE RESTRICT: We prevent cascading deletions on financial records. Deleting a transaction or account is rejected if it has linked ledger entries, ensuring that our audit history remains intact.
- idx_ledger_entries_lookup: Supports balance and reconciliation queries. By indexing account_id and sorting by posted_at DESC, the query planner can read an account's most recent entries with an index range scan instead of scanning the whole table.
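To make the first point concrete, here is a small sketch of how a unique constraint on (customer_id, idempotency_key) turns a retried request into a lookup of the original row. It uses Python's built-in sqlite3 module and a trimmed-down table purely so the example is runnable. The create_intent helper is a hypothetical name, and in PostgreSQL the same effect is usually written with INSERT ... ON CONFLICT.

```python
import sqlite3
from typing import Tuple

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE payment_intents (
        id INTEGER PRIMARY KEY,
        customer_id TEXT NOT NULL,
        idempotency_key TEXT NOT NULL,
        amount_cents INTEGER NOT NULL CHECK (amount_cents > 0),
        UNIQUE (customer_id, idempotency_key)
    )
    """
)


def create_intent(customer_id: str, idempotency_key: str, amount_cents: int) -> Tuple[int, bool]:
    """Return (intent_id, created). A retry returns the original row's id."""
    try:
        with conn:  # commits on success, rolls back if the insert raises
            cur = conn.execute(
                "INSERT INTO payment_intents (customer_id, idempotency_key, amount_cents) "
                "VALUES (?, ?, ?)",
                (customer_id, idempotency_key, amount_cents),
            )
        return cur.lastrowid, True
    except sqlite3.IntegrityError:
        row = conn.execute(
            "SELECT id FROM payment_intents WHERE customer_id = ? AND idempotency_key = ?",
            (customer_id, idempotency_key),
        ).fetchone()
        if row is None:
            raise  # a different constraint failed (for example, the amount check)
        return row[0], False


first = create_intent("cust_uuid_88192a", "idem_tok_77218ac89b", 9900)
retry = create_intent("cust_uuid_88192a", "idem_tok_77218ac89b", 9900)
```

Both calls resolve to the same intent ID, and only one row exists afterwards.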
Scaling Challenges and Capacity Estimation
A system processing 10,000 transactions per second requires deep capacity planning.
1. Ingress Network Bandwidth
- Assumptions:
  - Peak Transaction Rate = $10,000$ TPS
  - Average JSON request size = $500$ bytes
  - Average JSON response size = $800$ bytes
- Calculations:
$$\text{Ingress Traffic} = 10,000 \times 500\text{ bytes} = 5,000,000\text{ bytes/sec} = 5\text{ MB/sec}$$
$$\text{Egress Traffic} = 10,000 \times 800\text{ bytes} = 8,000,000\text{ bytes/sec} = 8\text{ MB/sec}$$
While this raw bandwidth is easily handled by modern network interfaces, the harder cost is connection handling: at 10,000 requests per second, the edge must sustain thousands of concurrent TLS connections (the exact number depends on request latency and keep-alive reuse). We use high-throughput reverse proxies (such as NGINX or Envoy) to terminate TLS and route traffic to the payment gateway cluster.
2. Ledger Database Storage Growth
- Assumptions:
  - Every transaction generates at least $2$ ledger entry rows (1 debit, 1 credit); a fee split adds a third, so this is a lower bound.
  - Row size of a single ledger entry = $150$ bytes.
- Calculations:
$$\text{Daily Transactions} = 10,000\text{ TPS} \times 86,400\text{ seconds/day} = 864,000,000\text{ txns/day (under continuous peak)}$$
$$\text{Daily Ledger Entries} = 864,000,000 \times 2 = 1,728,000,000\text{ rows/day}$$
$$\text{Daily Storage Footprint} = 1,728,000,000 \times 150\text{ bytes} \approx 259.2\text{ GB/day}$$
$$\text{Yearly Storage Footprint} = 259.2\text{ GB/day} \times 365 \approx 94.6\text{ TB/year}$$
This volume of database growth is highly challenging for single-instance relational databases. To scale:
- We partition the ledger_entries table by posted_at date ranges.
- We implement database sharding by hashing account_id, dividing the read and write loads across several database nodes.
- Legacy ledger records (older than 180 days) are archived into columnar formats (like Parquet) in cold storage.
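The storage estimate above assumes peak load around the clock. The sketch below reruns the same arithmetic for both that worst case and the stated daily volume of 100,000,000 events, so the two figures can be compared. The function name ledger_growth is illustrative, and the numbers cover raw row bytes only (no indexes, row headers, or replication).

```python
SECONDS_PER_DAY = 86_400
ENTRIES_PER_TXN = 2      # one debit and one credit, as assumed above
ROW_BYTES = 150          # assumed size of one ledger entry row


def ledger_growth(txns_per_day: int) -> dict:
    """Estimate ledger growth for a given daily transaction count."""
    rows_per_day = txns_per_day * ENTRIES_PER_TXN
    bytes_per_day = rows_per_day * ROW_BYTES
    return {
        "rows_per_day": rows_per_day,
        "gb_per_day": bytes_per_day / 1e9,
        "tb_per_year": bytes_per_day * 365 / 1e12,
    }


# Worst case: 10,000 TPS sustained for the whole day.
continuous_peak = ledger_growth(10_000 * SECONDS_PER_DAY)

# Stated daily volume from the scale assumptions.
stated_volume = ledger_growth(100_000_000)

for label, est in (("continuous peak", continuous_peak), ("stated volume", stated_volume)):
    print(f"{label}: {est['rows_per_day']:,} rows/day, "
          f"{est['gb_per_day']:.1f} GB/day, {est['tb_per_year']:.1f} TB/year")
```

At the stated daily volume the ledger grows by about 30 GB per day (roughly 11 TB per year), so the 94.6 TB figure is best read as an upper bound for provisioning.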
Failure Scenarios and Resilience
In a mission-critical payment architecture, failure is expected. The system must degrade gracefully.
1. Double-Charge API Retries
A customer clicks the "Submit Payment" button twice in rapid succession.
- The Threat: If both requests are processed, the gateway registers two different charges, double-billing the user.
- Resilience Design:
- The client UI generates a unique client-side idempotency key on checkout page loads.
- The payment gateway utilizes a Redis cache as a fast-access idempotency filter.
- When a request arrives, the gateway attempts to write (idempotencyKey, IN_PROGRESS) using SETNX.
- If the write succeeds, the request proceeds. If it fails, the gateway waits and returns the cached result of the in-progress transaction once it completes.
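The set-if-absent step can be sketched without a running Redis server. In the example below, InMemoryStore is a stand-in whose set_if_absent method mimics the semantics of SETNX. It is not the Redis client API, and handle_payment is a hypothetical handler name. For brevity a duplicate request returns whatever is currently stored instead of waiting, and key expiry and failure cleanup are left out.

```python
import threading
from typing import Callable, Dict, Optional


class InMemoryStore:
    """Stand-in for Redis, used only to keep the example self-contained."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._lock = threading.Lock()

    def set_if_absent(self, key: str, value: str) -> bool:
        """Write the key only if it does not exist; return True if written."""
        with self._lock:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def set(self, key: str, value: str) -> None:
        with self._lock:
            self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        with self._lock:
            return self._data.get(key)


def handle_payment(store: InMemoryStore, idempotency_key: str,
                   charge: Callable[[], str]) -> Optional[str]:
    """Run charge() at most once per idempotency key."""
    if store.set_if_absent(idempotency_key, "IN_PROGRESS"):
        result = charge()                   # only the first request reaches the PSP
        store.set(idempotency_key, result)  # cache the outcome for later duplicates
        return result
    # Duplicate request: return "IN_PROGRESS" or the cached result.
    return store.get(idempotency_key)


charge_calls = []


def fake_charge() -> str:
    charge_calls.append("called")
    return "ch_uuid_4410ab"


store = InMemoryStore()
first_response = handle_payment(store, "idem_tok_77218ac89b", fake_charge)
second_response = handle_payment(store, "idem_tok_77218ac89b", fake_charge)
```

In a real deployment the key would also carry an expiry, so that a request that crashes mid-flight cannot hold IN_PROGRESS forever.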
2. Network Timeout During Gateway API Calls
The executor requests Stripe to charge the card, but the network connection drops before Stripe can return a response.
- The Threat: The transaction status is left in an indeterminate state. We do not know if Stripe processed the payment.
- Resilience Design:
- The executor must not simply retry or mark the transaction as failed.
- The executor queries Stripe using the payment intent ID to verify the status.
- If Stripe has no record, the executor retries the charge with the same idempotency key, so a late-arriving original request cannot produce a second charge. If Stripe confirms success, the executor marks the internal transaction as captured.
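The query-then-retry rule can be exercised against a stub. FakePSP below is a hypothetical test double, not the Stripe SDK, and its fail_mode values are invented for this sketch. It simulates a request that is lost before reaching the processor and a response that is lost on the way back.

```python
from typing import Dict, Optional


class FakePSP:
    """Hypothetical processor stub that deduplicates on the idempotency key."""

    def __init__(self, fail_mode: str) -> None:
        self.charges: Dict[str, int] = {}   # idempotency key -> amount charged
        self.fail_mode = fail_mode          # "lose_request", "lose_response", or "none"
        self.calls = 0

    def charge(self, key: str, amount_cents: int) -> str:
        self.calls += 1
        first_call = self.calls == 1
        if first_call and self.fail_mode == "lose_request":
            raise TimeoutError("request never reached the PSP")
        self.charges.setdefault(key, amount_cents)  # same key never charges twice
        if first_call and self.fail_mode == "lose_response":
            raise TimeoutError("response lost on the way back")
        return "succeeded"

    def lookup(self, key: str) -> Optional[str]:
        return "succeeded" if key in self.charges else None


def charge_with_recovery(psp: FakePSP, key: str, amount_cents: int,
                         max_attempts: int = 3) -> str:
    """Resolve an indeterminate charge by querying before retrying."""
    for _ in range(max_attempts):
        try:
            status = psp.charge(key, amount_cents)
            return "CAPTURED" if status == "succeeded" else "REJECTED"
        except TimeoutError:
            if psp.lookup(key) == "succeeded":
                return "CAPTURED"   # the charge went through; do not retry
            # No record at the PSP: loop and retry with the same key.
    return "SENT"   # still indeterminate; leave it for reconciliation


lost_response = FakePSP("lose_response")
outcome_a = charge_with_recovery(lost_response, "pi_uuid_001a2b", 9900)

lost_request = FakePSP("lose_request")
outcome_b = charge_with_recovery(lost_request, "pi_uuid_001a2b", 9900)
```

In the lost-response case the executor never sends a second charge call. In the lost-request case it retries once, and the processor still holds a single charge.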
3. Browser Disconnect Mid-Checkout
The user inputs their security credentials, the payment is charged successfully by the card issuer, but the user's internet drops before they can redirect back to the confirmation page.
- The Threat: The internal order state remains unpaid, but the customer has been billed.
- Resilience Design:
- We rely on asynchronous webhooks. The payment processor is configured to send a webhook event (e.g., charge.succeeded) to our servers.
- An ingestion worker processes the webhook event, updating our database and ledger, ensuring the order progresses even if the user's browser was closed.
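Processors typically redeliver webhooks when an acknowledgement is slow or lost, so the ingestion worker should treat the event ID as a deduplication key. The sketch below assumes the payload shape shown in the webhook API section and uses in-memory structures in place of the database. handle_webhook and both module-level containers are illustrative names, and signature verification is omitted.

```python
from typing import Dict, Set

# In-memory stand-ins for database tables (assumption for this sketch).
processed_events: Set[str] = set()
intent_status: Dict[str, str] = {"pi_uuid_001a2b": "PROCESSING"}


def handle_webhook(event: dict) -> dict:
    """Apply a gateway event at most once, and always acknowledge it."""
    event_id = event["id"]
    if event_id in processed_events:
        return {"received": True}   # duplicate delivery: acknowledge, change nothing

    if event["type"] == "charge.succeeded":
        intent_id = event["data"]["object"]["payment_intent"]
        intent_status[intent_id] = "COMPLETED"

    processed_events.add(event_id)
    return {"received": True}


event = {
    "id": "evt_99182ab771",
    "type": "charge.succeeded",
    "data": {"object": {"id": "ch_uuid_4410ab", "payment_intent": "pi_uuid_001a2b"}},
}
ack_first = handle_webhook(event)
ack_duplicate = handle_webhook(event)
```

In production the status update and the processed-event record would be written in one database transaction, so a crash between the two cannot leave them out of step.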
4. Database Ledger Write Failures
The payment executor successfully captures funds from the gateway, but the ledger database experiences a brief write lock or outage, preventing the ledger entry from being written.
- The Threat: Money has moved, but our internal ledger doesn't reflect the change, leading to data inconsistency.
- Resilience Design:
- We implement the Transactional Outbox Pattern.
- The payment executor writes the transaction state change and an outbox message to the database in a single local transaction.
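- A background relay publishes each outbox row to Kafka, either by polling the outbox table or by using a change-data-capture tool such as Debezium to read the database log.
- The ledger consumer reads the message from Kafka and retries the ledger write until it succeeds. Because delivery is at-least-once, the write must be idempotent (for example, keyed on the transaction ID).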
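The outbox steps can be sketched end to end with sqlite3 standing in for the transactional database and a plain callable standing in for the Kafka producer. The outbox table layout and the mark_captured and relay_once helpers are assumptions made for this example.

```python
import json
import sqlite3
from typing import Callable

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE transaction_records (id TEXT PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox (
        seq INTEGER PRIMARY KEY AUTOINCREMENT,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
    """
)
conn.execute("INSERT INTO transaction_records (id, status) VALUES ('txn_1', 'SENT')")
conn.commit()


def mark_captured(txn_id: str) -> None:
    """Write the state change and the outbox message in one local transaction."""
    payload = json.dumps({"type": "PAYMENT_CAPTURED", "transactionId": txn_id})
    with conn:  # both statements commit together or not at all
        conn.execute(
            "UPDATE transaction_records SET status = 'CAPTURED' WHERE id = ?", (txn_id,)
        )
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (payload,))


def relay_once(publish: Callable[[str], None]) -> int:
    """Publish unpublished outbox rows in order; return how many were found."""
    rows = conn.execute(
        "SELECT seq, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, payload in rows:
        publish(payload)  # if this raises, the row stays unpublished and is retried
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)


sent_messages = []
mark_captured("txn_1")
first_sweep = relay_once(sent_messages.append)
second_sweep = relay_once(sent_messages.append)
```

Because a row is marked published only after the publish call returns, a crash between the two steps causes a redelivery rather than a loss, which is why the consumer side must tolerate duplicates.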
Architectural Trade-offs
Designing a payment system requires balancing the write performance of database engines against financial consistency requirements.
Trade-off 1: Distributed Row Locking vs. Saga-Based Orchestration
When updating balances across accounts, we can use distributed database locks to enforce consistency, or orchestrate the steps using Saga patterns with compensating actions.
| Aspect | Distributed Row Locking (Strong CP) | Saga-Based Orchestration (Eventual Consistency) |
|---|---|---|
| Write Throughput | Low (Blocked by lock queues and network hops) | High (All steps run asynchronously) |
| Data Integrity | Absolute (Guarantees atomic balances) | Conditional (Account states are temporarily inconsistent) |
| Complexity | Low (Managed by DB engine capabilities) | High (Requires designing complex compensating transactions) |
| Downstream Outage Behavior | Blocked (Requests fail if a database node is down) | Buffered (Tasks queue up in Kafka until nodes recover) |
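To illustrate the "compensating transactions" cost in the table, here is a minimal saga runner. It executes steps in order and, if one fails, runs the compensations of the steps that already completed, in reverse order. run_saga, the step names, and the balances dictionary are illustrative assumptions. A production orchestrator would also persist its progress so it can resume after a crash.

```python
from typing import Callable, Dict, List, Tuple

# Each step is (name, action, compensation).
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    log: List[str] = []
    completed: List[Tuple[str, Callable[[], None]]] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            break
        log.append(f"done:{name}")
        completed.append((name, compensate))
    return log


balances: Dict[str, int] = {"customer": 10_000, "merchant": 0}


def debit_customer() -> None:
    balances["customer"] -= 10_000


def refund_customer() -> None:
    balances["customer"] += 10_000


def credit_merchant() -> None:
    raise RuntimeError("merchant shard unavailable")  # simulated downstream failure


saga_log = run_saga([
    ("debit_customer", debit_customer, refund_customer),
    ("credit_merchant", credit_merchant, lambda: None),
])
```

Between the debit and its compensation the customer's balance is temporarily wrong, which is the "Conditional" data-integrity cell in the table made visible.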
Trade-off 2: Relational PostgreSQL vs. Distributed Append-Only Ledger Databases
We compare a traditional PostgreSQL instance with a specialized append-only database (e.g., Amazon QLDB or a blockchain-style ledger).
| Aspect | Relational PostgreSQL | Distributed Ledger Database |
|---|---|---|
| Query Flexibility | High (Supports complex SQL joins and analytics) | Lower (Query models are typically more restricted, with limited joins and ad hoc analytics) |
| Cryptographic Verifiability | Low (Admin users can potentially modify raw tables) | High (Every transaction is chained and immutable) |
| Operational Maturity | High (Decades of production history and tooling) | Low (Newer tech, limited developer ecosystem) |
Staff Engineer Perspective
Operating financial infrastructure requires implementing strict safeguards against edge cases and system bugs.
Verbal Script
Interviewer: "How would you design a payment system that guarantees exactly-once processing when communicating with unstable third-party payment gateways?"
Candidate: "We handle this by implementing strict idempotency gateways paired with two-phase status checks.
First, when the client checkout triggers a payment, we generate a unique paymentIntentId and write it to our database. When routing the charge request to the gateway (such as Stripe), the executor passes this intent ID as the idempotency key in the API header.
Second, if the network connection drops during the API call, the executor does not retry or fail immediately. Instead, it queries the gateway to check the status of that idempotency key. If the gateway confirms that the charge was captured, the executor marks the transaction as completed. If the gateway has no record, the executor retries the request safely using the same key. Delivery is still at-least-once, but because the gateway deduplicates on the key, the customer is charged exactly once even during network failures."
Interviewer: "How do you handle currency conversions and rounding errors when reconciling international payments?"
Candidate: "We handle this by maintaining separate accounts for currency exchange variances within our double-entry ledger.
When a transaction involves currency conversion (e.g., a customer pays in EUR and we settle in USD), the ledger engine records the transaction using the mid-market rate on that day. However, when the bank statement lands, the actual settled USD amount will vary slightly due to gateway conversion fees and timing differences.
During reconciliation, if the difference is within our configured tolerance range (e.g., less than 0.5%), the system matches the transaction, posts an offsetting entry to the cash account, and writes the variance to an FX Gain/Loss expense account. If the variance exceeds the tolerance range, the transaction is routed to a manual case queue for verification."
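The tolerance rule can be written as a small function. This is a sketch under the candidate's stated assumptions (a 0.5% tolerance, amounts in integer cents of the settlement currency). The reconcile name and the outcome labels are illustrative.

```python
from decimal import Decimal

TOLERANCE = Decimal("0.005")  # 0.5%, the example tolerance from the answer above


def reconcile(expected_cents: int, settled_cents: int) -> dict:
    """Compare the ledger's expected amount with the bank-settled amount.

    expected_cents must be positive. The variance is signed: negative means
    the bank settled less than the ledger expected.
    """
    variance = settled_cents - expected_cents
    ratio = abs(Decimal(variance)) / Decimal(expected_cents)
    outcome = "MATCHED" if ratio < TOLERANCE else "MANUAL_REVIEW"
    return {"outcome": outcome, "fx_variance_cents": variance}


within_tolerance = reconcile(10_000, 9_970)   # 0.3% short: auto-match, book the variance
outside_tolerance = reconcile(10_000, 9_900)  # 1.0% short: route to the manual queue
```

Decimal keeps the ratio comparison exact, which matters when an amount sits right on the tolerance boundary.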
Interviewer: "What database architecture would you choose to scale the append-only ledger table to billions of rows?"
Candidate: "I would use PostgreSQL with table partitioning by date for our active operations, paired with hot-cold storage archiving.
Active transaction ledgers are partitioned monthly. Since 95% of queries only look at the current month's transactions, the query planner only scans the active month's index partition, keeping search speeds fast.
At the end of each month, we run a batch process that takes any partition that has aged past our hot-storage retention window (for example, 180 days), converts its records to compressed Parquet files, and writes them to Amazon S3. Once the export is verified, the old partition is dropped from PostgreSQL. This keeps our database storage footprint small and predictable, while allowing finance teams to run analytical queries over the S3 archive using Amazon Athena."
Interviewer: "How would you prevent ledger corruption if an admin user attempts to update the balance of an account directly in the database?"
Candidate: "We enforce ledger integrity using two main controls: append-only schema designs and cryptographic hashing.
First, our database security policy revokes UPDATE and DELETE privileges on the ledger_entries table from every role that applications and operators use. A PostgreSQL superuser can bypass grants, so superuser access is tightly restricted and audited separately. All balances are computed as sums of entries, so modifying a balance requires inserting a new debit or credit entry.
Second, we store a cryptographic hash with every ledger entry. The hash is calculated using the entry's ID, transaction ID, account ID, amount, and the hash of the preceding ledger row (similar to a blockchain journal). If a rogue user manages to bypass database permissions and alter a row, the chain of hashes breaks, triggering immediate system alerts and halting payment processing until the discrepancy is resolved."
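The hash chain in that last answer can be sketched in a few lines. The example hashes exactly the fields the candidate lists (entry ID, transaction ID, account ID, amount, previous hash) with SHA-256. The field separator, the all-zero genesis value, and the function names are assumptions made for illustration.

```python
import hashlib
from typing import List, Optional

GENESIS_HASH = "0" * 64  # assumed starting value for the first row


def entry_hash(entry: dict, prev_hash: str) -> str:
    """Hash the entry's identifying fields together with the previous row's hash."""
    material = "|".join([
        str(entry["entry_id"]),
        entry["transaction_id"],
        entry["account_id"],
        str(entry["amount_cents"]),
        prev_hash,
    ])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


def append_entry(chain: List[dict], entry: dict) -> None:
    """Append an entry, linking it to the hash of the current last row."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    chain.append({**entry, "hash": entry_hash(entry, prev_hash)})


def first_broken_index(chain: List[dict]) -> Optional[int]:
    """Return the index of the first row whose stored hash no longer verifies."""
    prev_hash = GENESIS_HASH
    for index, row in enumerate(chain):
        if entry_hash(row, prev_hash) != row["hash"]:
            return index
        prev_hash = row["hash"]
    return None


ledger: List[dict] = []
append_entry(ledger, {"entry_id": 1, "transaction_id": "txn_1", "account_id": "cash", "amount_cents": 10_000})
append_entry(ledger, {"entry_id": 2, "transaction_id": "txn_1", "account_id": "merchant", "amount_cents": 9_700})
append_entry(ledger, {"entry_id": 3, "transaction_id": "txn_1", "account_id": "fees", "amount_cents": 300})

intact_check = first_broken_index(ledger)      # None while the chain is untouched
ledger[1]["amount_cents"] = 97_000             # simulate a direct UPDATE by a rogue user
tampered_check = first_broken_index(ledger)    # index of the altered row
```

Note that the listed fields do not include the entry's direction. A real implementation would likely hash every column that carries financial meaning.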