System Design: Designing Google Drive (Distributed File Storage)

Case Study: Design a File Storage System (Google Drive)

Designing a distributed file storage system like Google Drive or Dropbox evaluates your capability to solve hard problems in storage efficiency, real-time consistency, conflict resolution, and high-throughput network transport. In this case study, we will outline the concrete architecture that handles multi-gigabyte files, mitigates network failures, and scales to hundreds of millions of active accounts.

1. Requirements & Core Constraints

Functional Requirements

File Upload & Download: Users can upload and retrieve files from any device (Web, Desktop, Mobile) up to a maximum file size of 50GB.
Cross-Device Syncing: When a file is modified on one device, the changes must automatically and silently synchronize to all other devices logged into that user's account.
Offline Mode: Users must be able to view, edit, and create files offline. Once connectivity is restored, the changes must seamlessly synchronize and reconcile.
File Versioning & Revision History: The system must track file edits and maintain a complete historical revision log (up to 100 historical versions or 30 days of history).
Sharing & Permissions: Users can share files or entire folders with specific users (Read-Only or Read-Write access).

Non-Functional Requirements (SLAs)

High Data Durability: The system must ensure absolute resistance to data corruption or loss. We target a durability SLA of 99.9999999999% (12 nines).
High Availability: Target 99.99% availability for file downloads and metadata synchronization, and 99.9% availability for file uploads.
Low Sync Latency: When a device uploads a file modification, notification of the modification must reach all other online paired devices in under 1 second.
Storage Cost Minimization: The platform must aggressively optimize storage usage via block-level de-duplication to prevent expensive redundant physical cloud storage.

Back-of-the-Envelope Capacity Estimates

To accurately size our metadata and object storage systems, we make the following assumptions based on a scale of 100 Million Daily Active Users (DAU):

DAU: 100,000,000 users.
Average Active Devices Per User: 2 devices.
Upload Rate: Assume an average user uploads or modifies 2 files daily.
Total Ingestion Volume: 200,000,000 files processed daily.
Average File Size: 1 Megabyte (MB) (mix of spreadsheets, large PDF manuals, slide decks, and images).
Average Block (Chunk) Size: 4 Megabytes (MB). Files smaller than 4MB represent 1 chunk; larger files are split into multiple chunks.

Raw Storage Ingestion:

Daily raw data uploaded: $$200,000,000 \text{ files/day} \times 1 \text{ MB/file} \approx 200 \text{ Terabytes (TB) per day raw.}$$
Block-Level De-duplication Efficiency: In enterprise file systems, approximately 30% of uploaded blocks are identical duplicates (e.g., unmodified attachments, system templates, popular public files).
Net new storage added daily: $$200 \text{ TB} \times (1 - 0.30) \approx \mathbf{140 \text{ TB per day physical storage.}}$$
Over 5 Years: $$140 \text{ TB/day} \times 365 \text{ days/year} \times 5 \text{ years} \approx \mathbf{255.5 \text{ Petabytes (PB)}} \text{ of cold object storage.}$$

Network Traffic Estimation:

Average continuous ingress: $$\text{Throughput} = \frac{200 \text{ TB}}{86,400 \text{ seconds}} \approx \mathbf{18.5 \text{ Gigabits per second (Gbps) ingress.}}$$
Assuming download requests are $5\times$ higher than upload volume: $$\text{Egress Throughput} = 18.5 \text{ Gbps} \times 5 \approx \mathbf{92.5 \text{ Gbps egress.}}$$

2. API Design & Core Contracts

The Client Sync Agent communicates with the stateless server layers using REST APIs for metadata and high-performance chunk-streaming channels for block storage.

A. Metadata Sync Protocol (Detect changes)

Clients query the Metadata Server to discover differences between their local SQLite file index and the server state.

POST /v1/metadata/sync
Authorization: Bearer <TOKEN>
Content-Type: application/json

{
  "client_last_sync_token": "token_8347291834_ver9",
  "local_changes": [
    {
      "file_path": "/Documents/Invoices/Q2_report.xlsx",
      "action": "MODIFY",
      "modified_time": "2026-05-22T10:00:00Z",
      "latest_version": 4,
      "size_bytes": 8388608,
      "block_hashes": [
        "sha256_e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
        "sha256_85860d5cde7492c68f27cf17b9b1d7d0a6ee41d8e1261d7b1e7e4a362a2656bc"
      ]
    }
  ]
}

Response:

{
  "status": "DIVERGED",
  "server_sync_token": "token_8347291834_ver10",
  "required_actions": [
    {
      "file_path": "/Documents/Invoices/Q2_report.xlsx",
      "action": "UPLOAD_BLOCKS",
      "missing_block_hashes": [
        "sha256_85860d5cde7492c68f27cf17b9b1d7d0a6ee41d8e1261d7b1e7e4a362a2656bc"
      ]
    }
  ]
}

B. Block Transfer API

Clients bypass the metadata service when uploading physical data chunks, talking directly to the Block Service via HTTP PUT to maximize network parallelism.

PUT /v1/blocks/sha256_85860d5cde7492c68f27cf17b9b1d7d0a6ee41d8e1261d7b1e7e4a362a2656bc
Authorization: Bearer <TOKEN>
Content-Type: application/octet-stream
X-Block-Checksum: sha256_85860d5cde7492c68f27cf17b9b1d7d0a6ee41d8e1261d7b1e7e4a362a2656bc

<Binary Octet Stream - 4MB Chunks>

Response:

{
  "block_hash": "sha256_85860d5cde7492c68f27cf17b9b1d7d0a6ee41d8e1261d7b1e7e4a362a2656bc",
  "status": "STORED",
  "stored_url": "https://block-storage.csp.io/b_85860d5cde749"
}

3. High-Level Design (HLD)

The architecture isolates block transfers (data planes) from file organization hierarchies (control planes). This ensures that heavy upload operations never starve real-time WebSocket notifications or catalog browsing queries.

graph TD
    %% Client Stack
    subgraph Client Device
        Watcher[Local File Watcher] -->|Detect Mod| SyncAgent[Sync Agent Engine]
        SyncAgent -->|Cache/Read State| LocalDB[(SQLite Local Cache)]
        SyncAgent -->|Chunking/Hashing| RabinChunker[Rabin Chunker Engine]
    end
    
    %% Gateway
    SyncAgent -->|1. REST API / HTTPS| GW[API Gateway / Load Balancer]
    
    %% Ingress Services
    GW -->|2. Sync Metadata| MetaSrv[Metadata Service]
    GW -->|3. Put Chunks| BlockSrv[Block Service]
    SyncAgent -->|4. WebSocket Conn| NotificationSrv[Notification Service]
    
    %% Block Data Plane
    BlockSrv -->|5. Query Hash Existence| DedupSrv[Deduplication Service]
    DedupSrv -->|Cache Read| Cache[(Redis Cache Cluster)]
    DedupSrv -->|6. Storage Path Map| ProductionStorage[(Object Storage S3/GCS)]
    
    %% Control Plane & Metadata
    MetaSrv -->|7. Write Hierarchy/Versions| MetadataDB[(Sharded PostgreSQL)]
    MetadataDB -->|Async Replica| Backup[(Disaster Recovery Replica)]
    
    %% Notifications Flow
    MetaSrv -->|8. Push Version Event| EventBus[Kafka Cluster]
    EventBus -->|9. Consume Change| NotificationSrv
    NotificationSrv -->|10. Alert Peer Devices| RemoteClient((Remote Client Device))

The Synchronization Flow (Sequence Diagram)

This diagram illustrates the end-to-end flow when User modifies a file locally, highlighting chunk deduplication and peer device notifications.

sequenceDiagram
    autonumber
    actor Creator as User Device A
    participant Chunker as Local Rabin Engine
    participant Meta as Metadata Service
    participant Block as Block Service
    participant DB as Metadata DB (Postgres)
    participant Bus as Kafka Event Bus
    participant Notify as Notification Service
    actor Peer as User Device B

    Creator->>Chunker: Modify "Q2_report.xlsx" (12MB)
    Note over Chunker: Generates 3 Chunks (C1, C2, C3)
    Chunker-->>Creator: Returns Checksums [H1, H2, H3]
    Creator->>Meta: POST /v1/metadata/sync with Checksums [H1, H2, H3]
    
    Note over Meta: Queries DB to check if hashes exist
    DB-->>Meta: Hashes H1, H2 exist. Hash H3 is MISSING.
    Meta-->>Creator: Returns: "Upload Block H3" (H1/H2 Deduplicated!)
    
    Creator->>Block: PUT /v1/blocks/H3 (Payload)
    Block->>Block: Save H3 payload to S3 Storage
    Block-->>Creator: Stored Successfully
    
    Creator->>Meta: Commit Metadata Update (Ver 5)
    Meta->>DB: Save File Version 5 with mappings [H1, H2, H3]
    DB-->>Meta: DB Transaction Committed (Strong Consistency)
    Meta->>Bus: Publish event "File Q2_report.xlsx Updated"
    Meta-->>Creator: Sync Complete
    
    Bus->>Notify: Consume Update Event
    Notify->>Peer: Push WebSocket: "Updates available for Q2_report.xlsx"
    Peer->>Meta: Fetch Delta Chunks metadata
    Peer->>Block: Download H3 (Applies Delta locally)

4. Low-Level Design (LLD) & Data Models

Database Selection Rationale

Metadata Database (PostgreSQL with Vitess): Hierarchical file structures are naturally relational (directories contain files, files contain versions). To enforce strict isolation (e.g., two clients creating folders inside the same directory simultaneously), we require ACID transactions and row-level locking. We shard PostgreSQL by workspace_owner_id to guarantee that all folders, files, and versions of a user's company tenant live on a single logical shard.
Local DB (SQLite): A lightweight local relational engine on the client machine is vital. It acts as the ground truth of the local physical files, keeping track of modification timestamps and block hashes to prevent redundant network synchronization requests.

Low-Level SQL Schema (Metadata Store)

-- Represents individual accounts / enterprise workspace environments
CREATE TABLE workspace_owners (
    owner_id VARCHAR(64) PRIMARY KEY,
    owner_name VARCHAR(128) NOT NULL,
    storage_limit_bytes BIGINT DEFAULT 15000000000, -- 15GB default
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

-- Recursive Hierarchy Table sharded by workspace_owner_id
CREATE TABLE files (
    file_id VARCHAR(64) NOT NULL,
    workspace_owner_id VARCHAR(64) NOT NULL,
    name VARCHAR(255) NOT NULL,
    parent_folder_id VARCHAR(64), -- NULL if in root directory
    is_directory BOOLEAN DEFAULT FALSE,
    is_deleted BOOLEAN DEFAULT FALSE,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (workspace_owner_id, file_id)
);

CREATE INDEX idx_files_parent ON files(workspace_owner_id, parent_folder_id);

-- Version Tracking table supporting historic rollback
CREATE TABLE file_versions (
    version_id VARCHAR(64) NOT NULL,
    workspace_owner_id VARCHAR(64) NOT NULL,
    file_id VARCHAR(64) NOT NULL,
    version_number INT NOT NULL,
    size_bytes BIGINT NOT NULL,
    created_by_user VARCHAR(64) NOT NULL,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (workspace_owner_id, version_id),
    FOREIGN KEY (workspace_owner_id, file_id) REFERENCES files(workspace_owner_id, file_id)
);

-- Maps logical files to deduplicated physical block hashes
CREATE TABLE file_version_blocks (
    workspace_owner_id VARCHAR(64) NOT NULL,
    version_id VARCHAR(64) NOT NULL,
    block_hash VARCHAR(64) NOT NULL, -- SHA-256 string
    block_order INT NOT NULL,        -- 0 for first block, 1, 2, 3...
    PRIMARY KEY (workspace_owner_id, version_id, block_order)
);

-- Global Registry of physical block references
CREATE TABLE physical_blocks (
    block_hash VARCHAR(64) PRIMARY KEY,
    byte_size INT NOT NULL,
    s3_storage_path TEXT NOT NULL,
    ref_count BIGINT DEFAULT 1, -- Decremented on file deletions (Garbage Collection)
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

5. Scaling Challenges & Bottlenecks

A. Rabin Fingerprint Chunking (Solving Chunk Insertion Shift)

A core bottleneck in file storage systems is Fixed-Size Chunking. If a file is sliced into static 4MB blocks (0-4MB, 4MB-8MB, 8MB-12MB), inserting a single character at byte index 0 shifts the boundaries of every block. The hashing engine yields entirely different hashes for all blocks, rendering deduplication useless and forcing a complete re-upload.

Solution: Rabin Fingerprints (Rolling Hash Chunking)

Instead of static size limits, a window of bytes (e.g., 64 bytes) slides across the file, calculating a rolling checksum using a polynomial function.

A boundary is declared when the polynomial output satisfies a mathematical condition: $$h(x) \pmod D == \text{Target_Constant}$$ Where $D$ determines the average chunk size (e.g., $D = 4 \times 10^6$ for an average 4MB chunk).

If a character is inserted at index 0, only the very first block undergoes a hash change. The subsequent boundaries align naturally at identical content locations, ensuring that all remaining blocks yield identical checksum hashes. Delta sync uploads only the singular modified chunk.

Fixed Chunking:
[Block 1 (4MB)] [Block 2 (4MB)] [Block 3 (4MB)]
Insert 1 byte -> Boundaries Shift:
[B1_modified (4MB)] [B2_modified (4MB)] [B3_modified (4MB)] -> ALL CHUNKS RE-UPLOADED!

Rabin Rolling Hash:
[Block 1 (boundary matched)] [Block 2 (boundary matched)] [Block 3 (boundary matched)]
Insert 1 byte -> Only block 1 boundary moves locally:
[B1_modified (shifted)] [Block 2 (UNTOUCHED)] [Block 3 (UNTOUCHED)] -> ONLY 1 CHUNK UPLOADED!

6. Real-World Trade-offs

A. Client-Side vs. Server-Side Hashing & Chunking

Option A: Client-Side Fingerprinting: The user's device splits the file, calculates the rolling Rabin hashes, and transmits only the checksum array to check for deduplication.
- Trade-off: High client CPU and battery consumption on mobile devices. However, this saves enormous server bandwidth and egress costs since duplicate blocks never travel over the network.
Option B: Server-Side Fingerprinting: Client uploads raw files; the server performs chunking and deduplication on the fly.
- Trade-off: Wastes client cellular data, burdens server infrastructure with parsing multi-GB streams, and exposes raw unencrypted files to the API layer.
Our Decision: Perform Client-Side Chunks & Hash Computing for desktop sync clients (which have ample CPU and power) and fallback to Server-Side Processing for mobile clients to preserve device battery life.

7. Failure Scenarios & Fault Tolerance

A. Conflicted Revisions (Concurrent Offline Edits)

If User A and User B both edit the exact same document offline:

The Problem: There is no live lock coordinator. When they reconnect, both upload revisions claiming to be Version 3 of the file.
Solution: Optimistic Concurrency Control in the Metadata DB database transaction.
- The server attempts to insert a record into file_versions under version 3.
- The database unique constraint PRIMARY KEY (workspace_owner_id, file_id, version_number) blocks the second committer.
- The second client receives a conflict exception. To prevent data loss, the sync agent creates a branched conflicted copy, saving the second upload as file_name_Conflicted_Copy_DeviceB.xlsx and notifies the client to resolve the discrepancy.

B. Garbage Collecting Orphaned Blocks

With intensive deduplication and file deletions, blocks in physical_blocks may lose all active file version references.

Solution: A background distributed batch processing job runs a weekly map-reduce pipeline. It aggregates all active hashes in file_version_blocks and performs a difference comparison against the physical_blocks registry. Blocks with zero reference connections are marked, decremented, and deleted from S3 origin storage.

8. Staff Engineer Perspective (Operational Deep Dive)

Pitfall

Orphaned Block S3 Race Conditions

A naive architectural design deletes a physical block from S3 as soon as its ref_count hits zero.

Consider this catastrophic race condition:

User A deletes their file version, causing the Metadata Server to decrement block H3's reference count to zero.
Simultaneously, User B uploads a completely unrelated file that happens to contain block H3. The Deduplication Service queries the DB, finds block H3 exists (with ref_count = 0), increments the counter back to 1, and tells User B the upload is skipped.
Meanwhile, the async deletion worker reads the state from step 1, deletes physical block H3 from S3. User B's file is now silently corrupted, pointing to a deleted S3 block path.

To mitigate this, never perform synchronous physical block deletion. Implement a soft-tombstone delay. Blocks with zero active references are placed in a garbage-collection queue with a 7-day TTL grace period before physical eradication from object storage.

9. Candidate Verbal Script (Mock Interview Guide)

Below is a first-person mock interview response displaying elite technical depth:

Candidate: "To design Google Drive, my fundamental focus is optimizing network transport efficiency and maximizing storage cost margins. Slicing files into static 4MB blocks fails under real-world usage because small edits shift the byte index and invalidate all downstream deduplication hashes. To solve this, I will implement client-side Rabin Fingerprints for rolling chunk boundaries. This guarantees that inserting or deleting a single word only triggers the synchronization of the locally affected chunk, maintaining structural alignment across the rest of the file.

For the catalog and nested hierarchy management, I require strict transactional guarantees to prevent folder sync anomalies. I will select a sharded PostgreSQL database utilizing Vitess. By sharding the tables using workspace_owner_id, we guarantee that all recursive file lookups and folder permission evaluations for a tenant complete on a single physical host, avoiding latency-heavy cross-shard distributed transactions.

To handle real-time sync announcements to client devices, I'll leverage a WebSocket-based Notification Service. When the Metadata Service commits a new version to PostgreSQL, it publishes an event to a high-throughput Kafka topic. Stateless Notification workers consume this stream and push a payload containing only the modified path to the active WebSocket channels. The peer devices then pull only the missing chunk offsets, resulting in seamless, sub-second delta synchronization without polling overhead."