System Design: Building a File Upload Platform

File upload platforms look deceptively simple until production traffic arrives. Users upload very large files (e.g., raw videos or large datasets), mobile connections drop halfway through, browsers aggressively retry requests, malicious files threaten backend systems, and metadata tables drift out of sync with physical storage.

A robust, enterprise-grade file upload platform separates the control path (metadata, security rules, and quota verification) from the data path (transferring raw bytes). By allowing clients to stream chunks directly to object storage (like AWS S3 or Google Cloud Storage) and validating uploads asynchronously, we can scale to millions of uploads while protecting application servers from thread exhaustion.

This guide designs a highly available, secure, and horizontally scalable file upload platform.

Requirements and System Goals

Successfully scaling a file upload platform requires balancing security, upload resilience, and bandwidth utilization.

1. Functional Requirements

Multipart Chunk Ingestion: Allow users to upload large files (up to 50 Gigabytes) divided into parallel, independent chunks.
Resume/Retry Capability: Support resuming partial uploads from the last successful chunk if a client's network disconnects mid-transfer.
Direct Object Storage Uploads: Enable clients to upload files directly to cloud storage using short-lived, secure presigned URLs.
Asynchronous Malware Isolation: Scan all uploaded files for viruses and malware before exposing them for download or processing.
Access Control & Visibility: Support granular private download permissions via presigned GET URLs, and expose real-time upload progress to clients.

2. Non-Functional Requirements

API Gateway Preservation: Never proxy large file bytes through application servers; long-running transfers tie up worker threads and exhaust the pool.
Highly Available Read Path: Serve hot, public files with millisecond-level P99 latency using Content Delivery Network (CDN) edge caching.
Durable Metadata Consistency: Guarantee eventual consistency between the relational database metadata store and the physical cloud objects.
Cost-Controlled Storage: Automatically transition older files to cold storage tiers (e.g., Glacier) using standard lifecycle policies.

API Interfaces and Service Contracts

The upload pipeline relies on a secure handshake where the application server authorizes the upload and the client streams bytes directly to object storage.

1. Request Upload Session API

The client requests an upload session, declaring file size, type, and checksum.

POST /api/v1/uploads/sessions
Authorization: Bearer <token>
Content-Type: application/json

Request Payload:

{
  "filename": "annual_presentation.mp4",
  "content_type": "video/mp4",
  "size_bytes": 104857600,
  "checksum_sha256": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
}

Response Payload (201 Created):

{
  "file_id": "file_8877665544",
  "session_id": "session_998877",
  "upload_type": "MULTIPART",
  "chunk_size_bytes": 10485760,
  "total_parts": 10,
  "expires_at": 1774896500,
  "upload_urls": [
    { "part_number": 1, "url": "https://s3.us-east-1.amazonaws.com/uploads/org_01/file_8877?partNumber=1&uploadId=mp_id_101..." },
    { "part_number": 2, "url": "https://s3.us-east-1.amazonaws.com/uploads/org_01/file_8877?partNumber=2&uploadId=mp_id_101..." }
  ]
}
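The `chunk_size_bytes` and `total_parts` fields in the response follow directly from the declared `size_bytes`. A minimal sketch of that arithmetic is shown below. The function name `plan_parts` and the `PartRange` type are hypothetical, and the two limits are the documented S3 multipart constraints (5 MiB minimum for every part except the last, at most 10,000 parts).

```python
from dataclasses import dataclass
from typing import List

MIN_PART_SIZE = 5 * 1024 * 1024   # S3 minimum for every part except the last
MAX_PARTS = 10_000                # S3 maximum number of parts per upload


@dataclass(frozen=True)
class PartRange:
    part_number: int   # 1-based, matching the upload_urls list
    start: int         # inclusive byte offset
    end: int           # exclusive byte offset


def plan_parts(size_bytes: int, chunk_size_bytes: int) -> List[PartRange]:
    """Split a file into consecutive parts of at most chunk_size_bytes each."""
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    if chunk_size_bytes < MIN_PART_SIZE:
        raise ValueError("chunk_size_bytes is below the 5 MiB minimum part size")
    total_parts = -(-size_bytes // chunk_size_bytes)  # ceiling division
    if total_parts > MAX_PARTS:
        raise ValueError("too many parts; increase chunk_size_bytes")
    return [
        PartRange(n, (n - 1) * chunk_size_bytes, min(n * chunk_size_bytes, size_bytes))
        for n in range(1, total_parts + 1)
    ]


# The request above: 104,857,600 bytes in 10,485,760-byte chunks.
parts = plan_parts(104_857_600, 10_485_760)
print(len(parts), parts[-1])
```

The same plan can drive the client: each `PartRange` tells it which byte slice to send to the presigned URL with the matching `part_number`.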

2. Finalize Upload API

Once the client completes streaming all parts to object storage, it calls this endpoint to commit the transaction and queue the validation runner.

POST /api/v1/uploads/sessions/session_998877/complete
Authorization: Bearer <token>
Content-Type: application/json

Request Payload:

{
  "parts": [
    { "part_number": 1, "etag": "\"part1_etag_value\"" },
    { "part_number": 2, "etag": "\"part2_etag_value\"" }
  ]
}

Response Payload (200 OK):

{
  "file_id": "file_8877665544",
  "status": "PROCESSING",
  "quarantine_scan": "IN_PROGRESS",
  "message": "Upload session closed successfully. File sent to asynchronous validation and malware scanning queue."
}
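Before the server forwards the `parts` list to object storage, it is worth rejecting obviously incomplete requests. The sketch below is one way to do that, assuming the expected `total_parts` is read back from the stored session; `validate_completion` is a hypothetical helper, not part of any SDK. It also sorts the parts, since S3 expects them in ascending part-number order when completing a multipart upload.

```python
from typing import Dict, List


def validate_completion(parts: List[Dict], total_parts: int) -> List[Dict]:
    """Return the parts sorted by part_number, or raise ValueError."""
    numbers = [p["part_number"] for p in parts]
    expected = set(range(1, total_parts + 1))

    if len(set(numbers)) != len(numbers):
        raise ValueError("duplicate part_number in request")
    missing = sorted(expected - set(numbers))
    if missing:
        raise ValueError(f"missing parts: {missing}")
    unexpected = sorted(set(numbers) - expected)
    if unexpected:
        raise ValueError(f"unexpected parts: {unexpected}")
    for p in parts:
        if not p.get("etag"):
            raise ValueError(f"part {p['part_number']} has no etag")

    return sorted(parts, key=lambda p: p["part_number"])


ordered = validate_completion(
    [
        {"part_number": 2, "etag": "\"part2_etag_value\""},
        {"part_number": 1, "etag": "\"part1_etag_value\""},
    ],
    total_parts=2,
)
print([p["part_number"] for p in ordered])
```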

High-Level Design and Visualizations

A resilient file upload architecture completely separates the upload control logic from the actual data stream.

1. Direct-to-Storage Ingestion Pipeline

This diagram illustrates the separation of concerns. Stateless services issue credentials, the client streams raw bytes directly to AWS S3, and S3 event triggers coordinate background workers.

graph TD
    Client[Mobile/Web Client] -->|1. POST /sessions| API[Upload API Service]
    API -->|2. Validate Quota & Create Metadata| DB[(PostgreSQL Metadata)]
    API -->|3. Request Presigned URL| S3[(AWS S3 Bucket)]
    API -->|4. Return Presigned PUT Links| Client

    Client -->|5. Stream Bytes Directly| S3
    
    S3 -->|6. Trigger ObjectCreated Event| Kafka[Kafka Event Bus]
    Kafka -->|7. Consume Event| Worker[Virus Scan & Processing Worker]
    Worker -->|8. Scan Objects| ClamAV[Malware Scanner Engine]
    
    Worker -->|9. Update File Status to READY| DB
    Worker -->|10. Invalidate CDN Edge| CDN[CDN Edge Cache / CloudFront]

2. Chunk Upload Resiliency and Retry Flow

If a client experiences network dropouts during a multipart upload, it queries the API for completed parts to resume dynamically.

sequenceDiagram
    autonumber
    participant Client as Client Device
    participant API as Upload API Service
    participant S3 as Object Storage (S3)
    
    Client->>S3: Upload Part 1 (Success)
    Client->>S3: Upload Part 2 (Success)
    Note over Client: Network Drops!
    
    Client->>API: GET /sessions/{id}/active-parts
    API->>S3: ListParts(UploadId)
    S3-->>API: Return Parts: [Part 1, Part 2]
    API-->>Client: Completed: [1, 2]. Resume from Part 3.
    
    Client->>S3: Upload Part 3 (Success)
    Client->>API: POST /sessions/{id}/complete
    API->>S3: CompleteMultipartUpload(UploadId, Parts)
    S3-->>API: Complete (ETag)
    API-->>Client: File Upload Finished (200 OK)
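On the client side, the resume step reduces to a set difference between the planned parts and the parts the API reports as complete. A minimal sketch follows; `parts_to_upload` is a hypothetical helper name. Because parts upload in parallel, the completed set can contain gaps, so the client should resume every missing part rather than only those after the highest completed number.

```python
from typing import Iterable, List


def parts_to_upload(total_parts: int, completed: Iterable[int]) -> List[int]:
    """Return the part numbers that still need to be uploaded, in order."""
    done = set(completed)
    return [n for n in range(1, total_parts + 1) if n not in done]


# The flow in the diagram: parts 1 and 2 finished before the network dropped.
print(parts_to_upload(3, [1, 2]))
# A parallel upload where part 3 failed but part 4 had already succeeded.
print(parts_to_upload(5, [1, 2, 4]))
```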

Low-Level Design and Schema Strategies

To trace uploads, verify chunk consistency, and enforce access isolation, the database schema tracks files, active sessions, and raw chunks.

1. Files Master Metadata Schema

This schema tracks the global metadata state and security flags of all user files.

CREATE TABLE files_metadata (
    file_id VARCHAR(64) PRIMARY KEY,
    tenant_id VARCHAR(64) NOT NULL,
    owner_id VARCHAR(64) NOT NULL,
    original_filename VARCHAR(255) NOT NULL,
    object_storage_key VARCHAR(512) NOT NULL UNIQUE,
    content_type VARCHAR(128) NOT NULL,
    expected_size_bytes BIGINT NOT NULL,
    actual_size_bytes BIGINT,
    checksum_sha256 VARCHAR(64),
    status VARCHAR(32) NOT NULL DEFAULT 'UPLOAD_REQUESTED', -- UPLOAD_REQUESTED, UPLOADED, READY, QUARANTINED, DELETED
    visibility VARCHAR(16) NOT NULL DEFAULT 'PRIVATE', -- PRIVATE, PUBLIC
    scan_status VARCHAR(16) NOT NULL DEFAULT 'PENDING', -- PENDING, SCANNING, CLEAN, INFECTED
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    deleted_at TIMESTAMP WITH TIME ZONE
);

CREATE INDEX idx_files_tenant_owner ON files_metadata (tenant_id, owner_id, created_at DESC) WHERE deleted_at IS NULL;
CREATE INDEX idx_files_status_scan ON files_metadata (status, scan_status);

2. Active Upload Sessions Schema

Tracks active multipart sessions and their expiration boundaries.

CREATE TABLE upload_sessions (
    session_id VARCHAR(64) PRIMARY KEY,
    file_id VARCHAR(64) NOT NULL REFERENCES files_metadata(file_id) ON DELETE CASCADE,
    tenant_id VARCHAR(64) NOT NULL,
    aws_multipart_id VARCHAR(255), -- ID returned by cloud provider
    status VARCHAR(32) NOT NULL DEFAULT 'CREATED', -- CREATED, ACTIVE, COMPLETED, ABORTED, EXPIRED
    expires_at TIMESTAMP WITH TIME ZONE NOT NULL,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    completed_at TIMESTAMP WITH TIME ZONE
);

CREATE INDEX idx_session_expires ON upload_sessions (expires_at) WHERE status = 'CREATED' OR status = 'ACTIVE';

3. Upload Parts Registry Schema

Tracks individual chunks uploaded for active sessions, verifying checksums and chunk ordering.

CREATE TABLE upload_parts_registry (
    session_id VARCHAR(64) NOT NULL REFERENCES upload_sessions(session_id) ON DELETE CASCADE,
    part_number INT NOT NULL,
    etag VARCHAR(255) NOT NULL,
    size_bytes BIGINT NOT NULL,
    uploaded_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (session_id, part_number)
);
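Because clients and browsers retry, the same part can be reported more than once. The composite primary key makes it straightforward to record parts idempotently with an upsert. The sketch below uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL so that it runs without a server (it assumes SQLite 3.24 or newer for `ON CONFLICT ... DO UPDATE`). The column types are simplified and `record_part` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE upload_parts_registry (
        session_id  TEXT    NOT NULL,
        part_number INTEGER NOT NULL,
        etag        TEXT    NOT NULL,
        size_bytes  INTEGER NOT NULL,
        PRIMARY KEY (session_id, part_number)
    )
    """
)


def record_part(conn, session_id, part_number, etag, size_bytes):
    """Insert the part, or overwrite it if a retry re-uploaded the same part."""
    conn.execute(
        """
        INSERT INTO upload_parts_registry (session_id, part_number, etag, size_bytes)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (session_id, part_number)
        DO UPDATE SET etag = excluded.etag, size_bytes = excluded.size_bytes
        """,
        (session_id, part_number, etag, size_bytes),
    )
    conn.commit()


record_part(conn, "session_998877", 1, "etag-first-attempt", 10_485_760)
record_part(conn, "session_998877", 1, "etag-retry", 10_485_760)  # retried part
rows = conn.execute(
    "SELECT part_number, etag FROM upload_parts_registry ORDER BY part_number"
).fetchall()
print(rows)
```

The last write wins here, which matches how a re-uploaded part replaces the earlier one in a multipart upload.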

Scaling and Operational Challenges

Direct ingestion architectures prevent severe application bottlenecks under high loads.

1. Bandwidth Starvation (Mathematical Proof of the Direct-to-S3 Design)

To understand why large file uploads must bypass application servers, let us estimate the network and thread saturation caused by a proxy-based upload design (where files are streamed through our Java containers before reaching S3).

Active Daily Uploads: 1,000,000 files/day.
Average File Size: 100 Megabytes (MB).
Total Daily Bandwidth: $$\text{Daily Data} = 1,000,000 \times 100 \text{ MB} = 100,000,000 \text{ MB} = 100 \text{ Terabytes (TB)}$$
Average Ingress Throughput: $$\text{Average Throughput} = \frac{100 \text{ TB}}{86,400 \text{ seconds}} \approx 1.15 \text{ Gigabytes per second (GB/s)} \approx 9.2 \text{ Gbps}$$
Peak Traffic Multiplier: $3 \times$ peak multiplier: $$\text{Peak Ingress} = 9.2 \text{ Gbps} \times 3 = 27.6 \text{ Gbps}$$
Proxy-based Design Bottleneck: If our Java application servers proxy this traffic, every active upload holds one worker thread and its memory buffers for the full duration of the transfer. Assuming a typical cloud instance with a 10 Gbps network limit, absorbing the 27.6 Gbps peak takes at least: $$\text{Required Servers} = \left\lceil \frac{27.6 \text{ Gbps}}{10 \text{ Gbps}} \right\rceil = 3 \text{ dedicated nodes}$$ and that leaves no headroom for failover or traffic spikes. The larger problem is thread occupancy. A 100 MB upload holds a thread for about 20 seconds on a fast connection and for minutes on a slow mobile link. Taking 5,000 concurrent uploads at peak as the planning figure, a thread-per-request server needs: $$\text{Concurrent Threads} = 5,000 \text{ threads}$$ At this thread density, scheduling lag and GC pressure climb until the gateway stops responding.
Direct-to-S3 Benefit: With presigned URLs, the 27.6 Gbps of raw byte traffic goes straight to AWS S3, which absorbs this ingress without any capacity planning on our side. Our Upload API Gateway only handles session orchestration, at less than 500 bytes per request. Even at a generous 5,000 requests/sec, the gateway's peak ingress is: $$\text{Gateway Peak Ingress} = 5,000 \text{ requests/sec} \times 500 \text{ bytes} \approx 2.5 \text{ MB/s} \approx 20 \text{ Mbps}$$ This is a 1,380x reduction in gateway ingress ($27.6 \text{ Gbps} / 20 \text{ Mbps}$). The metadata API can run on a small pool of lightweight instances, sized for redundancy rather than throughput.
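The estimate above can be reproduced in a few lines, which makes it easy to rerun with different assumptions. This is a rough sketch using the inputs stated in the text (1,000,000 uploads/day, 100 MB files, 3x peak, 20-second hold time, 5,000 requests/sec at 500 bytes); the function name `upload_capacity` is hypothetical. The unrounded results (about 9.26 Gbps average, 27.8 Gbps peak, roughly 1,389x) differ slightly from the rounded figures in the prose. The concurrency line applies Little's law: a 20-second hold time at the peak arrival rate gives roughly 694 in-flight uploads, so the 5,000-thread figure is best read as a conservative number that allows for much slower connections.

```python
SECONDS_PER_DAY = 86_400


def upload_capacity(uploads_per_day, avg_file_bytes, peak_multiplier,
                    hold_seconds, control_rps, control_bytes):
    """Back-of-envelope throughput and concurrency figures for an upload service."""
    daily_bytes = uploads_per_day * avg_file_bytes
    avg_gbps = daily_bytes * 8 / SECONDS_PER_DAY / 1e9
    peak_gbps = avg_gbps * peak_multiplier
    peak_uploads_per_sec = uploads_per_day / SECONDS_PER_DAY * peak_multiplier
    concurrent = peak_uploads_per_sec * hold_seconds  # Little's law: L = lambda * W
    gateway_mbps = control_rps * control_bytes * 8 / 1e6
    return {
        "daily_tb": daily_bytes / 1e12,
        "avg_gbps": avg_gbps,
        "peak_gbps": peak_gbps,
        "concurrent_uploads": concurrent,
        "gateway_mbps": gateway_mbps,
        "reduction_factor": peak_gbps * 1000 / gateway_mbps,
    }


figures = upload_capacity(1_000_000, 100 * 10**6, 3, 20, 5_000, 500)
for name, value in figures.items():
    print(f"{name}: {value:,.2f}")
```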

Trade-offs and Architectural Alternatives

Architects must select their upload models based on client environments and security isolation boundaries.

| Ingestion Strategy | Gateway Network Tax | Client Implementation Complexity | Security & Token Control | Best Use Case |
| --- | --- | --- | --- | --- |
| Direct Presigned PUT URLs | Zero (bytes stream directly to S3) | Low (standard PUT request, but the client must handle URL expiration) | Medium (a presigned URL grants raw write access, so it requires short lifetimes) | General file attachments, standard B2B document repositories |
| Server-Proxied Streaming | Catastrophic (proxying bytes saturates network interface cards and thread pools) | Very Low (standard form POST request) | Excellent (server inspects bytes, checks schemas in flight, and authorizes) | Ultra-secure government platforms where raw client access to storage is banned |
| Multipart Cloud Uploads | Zero (cloud provider manages parts directly via S3 Multipart APIs) | High (client must manage part sizes, compute byte offsets, track part ETags, and send the complete call) | Good (API Gateway controls the multipart lifecycle) | Media platforms, raw video uploads, massive scientific databases |
| API Gateway Chunking | High (server buffers and reassembles parts in memory or on local disk) | Medium (standard chunk retry logic) | Outstanding (zero exposure of backend S3 endpoints) | Medium-scale platforms with custom security proxies |

Failure Modes and Fault Tolerance Strategies

Operating global file pipelines requires handling client drops, lost events, and malware infections defensively.

1. The Missed Object Storage Event Bug

When a client completes an upload directly to S3, S3 generates an ObjectCreated event to trigger our virus scanner. However, event queues (like SQS or Kafka) can occasionally experience transient dropouts, losing the event.

The Failure: The database metadata remains stuck in UPLOAD_REQUESTED indefinitely, and the user's file is never processed.
The Solution: Cron-Driven Reconciliation Daemon. We run a background reconciliation job every 15 minutes that queries the database for records that have been stale for more than an hour:
```
SELECT file_id, object_storage_key 
FROM files_metadata 
WHERE status = 'UPLOAD_REQUESTED' 
  AND created_at < CURRENT_TIMESTAMP - INTERVAL '1 hour';
```
For each stale record, the daemon issues a HeadObject call directly to S3. If S3 reports that the object exists, the daemon sets the status to UPLOADED and enqueues a virus scan job, recovering from the lost event.
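The reconciliation loop itself is small. The sketch below keeps the storage and database calls behind injected callables so that it runs without any cloud dependency; `reconcile`, `object_exists`, `mark_uploaded`, and `enqueue_scan` are hypothetical names, and in a real deployment `object_exists` would wrap the HeadObject call described above.

```python
from typing import Callable, Iterable, List, Tuple


def reconcile(
    stale_rows: Iterable[Tuple[str, str]],   # (file_id, object_storage_key)
    object_exists: Callable[[str], bool],    # wraps a HeadObject-style check
    mark_uploaded: Callable[[str], None],    # sets status = 'UPLOADED'
    enqueue_scan: Callable[[str], None],     # queues the virus scan job
) -> List[str]:
    """Repair records whose object exists in storage but whose event was lost."""
    repaired = []
    for file_id, key in stale_rows:
        if not object_exists(key):
            continue  # upload never finished; leave it to session expiry
        mark_uploaded(file_id)
        enqueue_scan(file_id)
        repaired.append(file_id)
    return repaired


# In-memory stand-ins for storage, the database, and the scan queue.
stored_keys = {"uploads/org_01/file_a"}
statuses, scan_queue = {}, []
result = reconcile(
    [("file_a", "uploads/org_01/file_a"), ("file_b", "uploads/org_01/file_b")],
    object_exists=lambda key: key in stored_keys,
    mark_uploaded=lambda fid: statuses.update({fid: "UPLOADED"}),
    enqueue_scan=scan_queue.append,
)
print(result, statuses, scan_queue)
```

Since the job runs repeatedly, both `mark_uploaded` and `enqueue_scan` should be safe to call more than once for the same file.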

2. Malicious File Upload and Executable Quarantine

If a malicious user uploads a malware executable disguised as a PDF (exploit.pdf):

The Failure: If the system serves this file directly from the S3 bucket via public links, other users' browsers may execute the exploit.
The Mitigation: Sandbox Ingestion and Quarantine Tiers
- We separate our storage into two buckets: Ingestion Bucket (Raw) and Production Bucket (Clean).
- Presigned PUT URLs only grant write access to the Ingestion Bucket. Normal users have zero read access to this bucket.
- When the virus worker scans a file and it passes, the worker copies the object to the Production Bucket and deletes the raw original.
- If the file is infected, the worker moves it to an isolated Quarantine Bucket for SRE investigation, flags scan_status = 'INFECTED', and writes a security audit alert.
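The worker's decision after a scan can be expressed as a pure function, which keeps the routing rules easy to test separately from the scanner. This is a sketch under assumptions: the bucket labels are placeholders, and `route_after_scan` and `ScanOutcome` are hypothetical names. The status values are taken from the `files_metadata` schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanOutcome:
    destination_bucket: str   # where the object is copied before the raw copy is deleted
    status: str               # files_metadata.status
    scan_status: str          # files_metadata.scan_status
    raise_alert: bool         # whether to write a security audit alert


def route_after_scan(infected: bool) -> ScanOutcome:
    """Map a scan verdict to the bucket move and metadata update described above."""
    if infected:
        return ScanOutcome("quarantine-bucket", "QUARANTINED", "INFECTED", True)
    return ScanOutcome("production-bucket", "READY", "CLEAN", False)


print(route_after_scan(infected=False))
print(route_after_scan(infected=True))
```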

Staff Engineer Perspective

[!WARNING] The Nightmare of Uncapped S3 Storage Costs

It is easy to assume that because cloud object storage is cheap (e.g., $0.023 per GB per month), we don't need deletion policies.

However, if your platform has millions of users uploading 100MB files, and 10% of those uploads are abandoned halfway through or deleted by users (but soft-deleted in metadata only), you will accumulate petabytes of orphaned files. Nothing ever removes those bytes, so your storage bill keeps climbing every month.

To keep this under control, configure two S3 lifecycle rules:

AbortIncompleteMultipartUpload: Automatically abort multipart uploads that have not completed within 7 days of initiation, which deletes their stored parts. This prevents orphaned part fragments from lingering in storage.

Object Expiration Policies: Configure rules to automatically delete objects in the raw Ingestion bucket after 24 hours, and transition older production objects to Glacier Deep Archive after 90 days.
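As a rough illustration, the rules above could be written as lifecycle configuration documents like the ones below. The field names follow the S3 lifecycle configuration format. The rule IDs are placeholders, the empty prefix (apply to the whole bucket) is an assumption, and the split into one document per bucket is just one way to organize it.

```python
import json

# Ingestion (raw) bucket: clean up abandoned multipart parts and unprocessed objects.
INGESTION_LIFECYCLE = {
    "Rules": [
        {
            "ID": "abort-stale-multipart",           # placeholder rule name
            "Status": "Enabled",
            "Filter": {"Prefix": ""},                # assumption: whole bucket
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
        },
        {
            "ID": "expire-raw-objects",              # placeholder rule name
            "Status": "Enabled",
            "Filter": {"Prefix": ""},
            "Expiration": {"Days": 1},               # 24 hours
        },
    ]
}

# Production (clean) bucket: move older objects to cold storage.
PRODUCTION_LIFECYCLE = {
    "Rules": [
        {
            "ID": "archive-old-objects",             # placeholder rule name
            "Status": "Enabled",
            "Filter": {"Prefix": ""},
            "Transitions": [{"Days": 90, "StorageClass": "DEEP_ARCHIVE"}],
        }
    ]
}

print(json.dumps(INGESTION_LIFECYCLE, indent=2))
print(json.dumps(PRODUCTION_LIFECYCLE, indent=2))
```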

Verbal Script

Interviewer: "How would you design a highly scalable and secure file upload platform for a system like Google Drive, where users can upload files up to 20 GB?"

Candidate: "To design a resilient Google Drive-scale file upload platform handling 20 GB files, we must separate our control plane from our data plane. Proxying 20 GB file bytes through our application API servers is a critical anti-pattern—it will instantly saturate our network cards, fill up Java heap memory, and trigger a cascading outage.

Therefore, my core architecture will rely on Direct-to-Store Ingestion using short-lived S3 Presigned URLs and Multipart Uploads.

First, when a user wants to upload a file, the client device makes an RPC call to our stateless API Gateway. The Gateway performs validation: it checks the user's storage quota, authenticates permissions, and sanitizes the filename, creating a metadata record in our PostgreSQL database marked as UPLOAD_REQUESTED.

Since the file is large, the API Gateway calls the S3 client to initiate a multipart upload and returns an ordered list of Presigned PUT URLs—one for each 100MB chunk—along with a secure SessionID.

Second, the client device receives these URLs and uploads the raw byte chunks directly and in parallel to our isolated Ingestion S3 Bucket. If a network dropout occurs mid-upload, the client queries our API using GET /sessions/{id}/active-parts.

The gateway queries S3 to see which parts are already complete, and tells the client exactly where to resume, avoiding the need to re-upload the entire 20 GB.

Third, once the client completes all parts, it notifies our API Gateway. The gateway calls the S3 API to complete the multipart assembly.

To ensure security, all raw uploads are treated as hostile and untrusted. They are written to our Ingestion S3 Bucket, which has strict bucket policies blocking all public read access.

The completion of the S3 object triggers an asynchronous event routed through a Kafka Event Bus to our Virus Scan and Processing Workers. The workers pull the file, execute a malware scan using ClamAV, and extract metadata.

If the file is infected, it is moved to a locked Quarantine Bucket and the metadata is marked INFECTED.

If it is clean, the worker copies the file to our Production S3 Bucket, deletes the raw original, and marks the file as READY.

Fourth, to handle downloads, public files are cached globally at edge nodes using Amazon CloudFront CDN.

For private files, the client must request a signed URL from our API Gateway. The gateway verifies user access permissions in our database and generates a highly secure, short-lived presigned GET link.

Finally, we guard against storage leaks by configuring an S3 Lifecycle Policy that automatically deletes unfinished multipart uploads after 7 days, and moves older, inactive files to cold Glacier tiers, keeping our operational costs strictly under control."
