System Design: Designing Google Drive
Designing a distributed file storage system like Google Drive or Dropbox requires more than just uploading files to S3. You must handle large files efficiently, synchronize state across multiple devices, and minimize storage costs through deduplication.
1. Core Requirements
- Upload/Download: Users can store and retrieve files.
- Sync: Files must stay in sync across all of a user's devices.
- Large Files: The system must support files up to 50GB.
- Conflict Handling: The system must handle simultaneous edits to the same file.
2. The Chunking Strategy
Instead of treating a file as one big blob, we split it into smaller Chunks (e.g., 4MB each).
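- Efficiency: If only a small part of a 1GB file changes in place, we only need to re-upload the modified chunks, not the whole file. (With fixed-size chunks, an insertion shifts every later chunk boundary, so more chunks change.)
- Resilience: If an upload fails, we resume from the failed chunk instead of restarting the whole upload.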
- Storage: Chunks are stored in a distributed object store like Amazon S3.
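The chunking idea can be sketched in a few lines of Python. This is a minimal illustration, not a production uploader: the function names (iter_chunks, chunk_manifest) are hypothetical, it assumes fixed-size chunks, and it uses a tiny chunk size in the demo so the effect of an in-place edit is easy to see.

```python
import hashlib
import io
from typing import BinaryIO, Iterator, List, Tuple

CHUNK_SIZE = 4 * 1024 * 1024  # 4MB, the example chunk size from the text


def iter_chunks(stream: BinaryIO, chunk_size: int = CHUNK_SIZE) -> Iterator[Tuple[str, bytes]]:
    """Yield (sha256_hex, data) for each fixed-size chunk read from the stream."""
    while True:
        data = stream.read(chunk_size)
        if not data:  # empty read means end of stream
            break
        yield hashlib.sha256(data).hexdigest(), data


def chunk_manifest(stream: BinaryIO, chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Return the ordered list of chunk hashes that describes one file version."""
    return [digest for digest, _ in iter_chunks(stream, chunk_size)]


# Demo with a 10-byte chunk size (an assumption chosen only for readability).
original = io.BytesIO(b"a" * 10 + b"b" * 10 + b"c" * 5)
edited = io.BytesIO(b"a" * 10 + b"X" * 10 + b"c" * 5)  # middle chunk edited in place

before = chunk_manifest(original, chunk_size=10)
after = chunk_manifest(edited, chunk_size=10)

# Only chunks whose hash changed need to be re-uploaded.
changed = [i for i, (old, new) in enumerate(zip(before, after)) if old != new]
print(changed)  # [1]
```

The ordered list of hashes is also what a client would need to resume an interrupted upload: it can ask which hashes the server already holds and send only the rest.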
3. Data Deduplication
Deduplication is the process of eliminating duplicate copies of data to save storage space.
- How it works: Each chunk is hashed (e.g., SHA-256). If two users upload the same file, the chunk hashes will be identical. The system stores each chunk only once in S3 and records two references to it in the Metadata database, one per user's file.
- The Saving: This drastically reduces storage costs for popular files (like viral videos or OS updates).
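A rough sketch of hash-based deduplication with reference counting follows. The class name DedupChunkStore and its in-memory dictionaries are assumptions for illustration: one dictionary stands in for the object store (S3) and the other for the reference counts that would live in the Metadata database.

```python
import hashlib
from typing import Dict


class DedupChunkStore:
    """In-memory stand-in for an object store plus per-chunk reference counts."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}   # chunk hash -> bytes (stand-in for S3)
        self.refcount: Dict[str, int] = {}  # chunk hash -> number of references

    def put(self, data: bytes) -> str:
        """Store a chunk if it is new, add one reference, and return its hash."""
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.blobs:
            self.blobs[digest] = data  # only the first upload writes the bytes
        self.refcount[digest] = self.refcount.get(digest, 0) + 1
        return digest

    def release(self, digest: str) -> None:
        """Drop one reference and delete the bytes once nothing points at them."""
        self.refcount[digest] -= 1
        if self.refcount[digest] == 0:
            del self.refcount[digest]
            del self.blobs[digest]


store = DedupChunkStore()
hash_alice = store.put(b"same chunk of a popular video")
hash_bob = store.put(b"same chunk of a popular video")

print(hash_alice == hash_bob)       # True
print(len(store.blobs))             # 1
print(store.refcount[hash_alice])   # 2
```

Reference counting matters on delete: the bytes can only be removed when the last file version that points at the chunk is gone.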
4. Metadata Architecture
The Metadata database is the source of truth for the file system. It stores:
- File Names and Hierarchy (folders).
- Chunk Map: Which hashes make up which file version.
- User Permissions.
For the store itself, use a distributed SQL database like CockroachDB or a NoSQL store like DynamoDB (using strongly consistent reads).
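One possible shape for this metadata is sketched below. The table and column names are hypothetical, and an in-memory SQLite database stands in for the distributed database so the example can run locally; a real deployment would adapt the schema to its own store.

```python
import sqlite3
from typing import List

SCHEMA = """
CREATE TABLE files (
    file_id    INTEGER PRIMARY KEY,
    owner_id   INTEGER NOT NULL,
    parent_id  INTEGER REFERENCES files(file_id),  -- NULL at the root; folders are rows too
    name       TEXT NOT NULL,
    is_folder  INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE file_versions (
    file_id  INTEGER NOT NULL REFERENCES files(file_id),
    version  INTEGER NOT NULL,
    PRIMARY KEY (file_id, version)
);
CREATE TABLE version_chunks (                      -- the chunk map
    file_id     INTEGER NOT NULL,
    version     INTEGER NOT NULL,
    chunk_index INTEGER NOT NULL,
    chunk_hash  TEXT NOT NULL,
    PRIMARY KEY (file_id, version, chunk_index),
    FOREIGN KEY (file_id, version) REFERENCES file_versions(file_id, version)
);
CREATE TABLE permissions (
    file_id  INTEGER NOT NULL REFERENCES files(file_id),
    user_id  INTEGER NOT NULL,
    role     TEXT NOT NULL CHECK (role IN ('owner', 'editor', 'viewer')),
    PRIMARY KEY (file_id, user_id)
);
"""


def open_metadata_db() -> sqlite3.Connection:
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def chunk_map(conn: sqlite3.Connection, file_id: int, version: int) -> List[str]:
    """Return the ordered chunk hashes that make up one file version."""
    rows = conn.execute(
        "SELECT chunk_hash FROM version_chunks "
        "WHERE file_id = ? AND version = ? ORDER BY chunk_index",
        (file_id, version),
    ).fetchall()
    return [row[0] for row in rows]


conn = open_metadata_db()
conn.execute("INSERT INTO files (file_id, owner_id, name) VALUES (1, 42, 'report.pdf')")
conn.execute("INSERT INTO file_versions (file_id, version) VALUES (1, 1)")
conn.executemany(
    "INSERT INTO version_chunks VALUES (1, 1, ?, ?)",
    [(1, "hash-b"), (0, "hash-a")],  # inserted out of order on purpose
)
print(chunk_map(conn, 1, 1))  # ['hash-a', 'hash-b']
```

Keeping the chunk map per version means an older version can be restored by reading its rows, without copying any chunk data.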
5. Synchronization & Delta Sync
To keep devices in sync, we use a Notification Service (via WebSockets or Long Polling).
- Device A uploads a new chunk and updates Metadata.
- The Metadata Service notifies the Notification Service.
- The Notification Service pings Device B.
- Device B requests only the "Delta" (the changed chunk hashes) from the Metadata Service and downloads just those chunks from the object store.
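The last step can be illustrated with a small delta computation. This is a sketch under the assumption that Device B keeps a record of the chunk hashes it already holds locally; the function name chunks_to_download is hypothetical.

```python
from typing import Iterable, List, Set


def chunks_to_download(remote_manifest: List[str], local_hashes: Iterable[str]) -> List[str]:
    """Return the hashes in the new manifest that the device does not hold yet.

    Order follows the manifest, and each missing hash is listed only once.
    """
    have: Set[str] = set(local_hashes)
    missing: List[str] = []
    for chunk_hash in remote_manifest:
        if chunk_hash not in have:
            missing.append(chunk_hash)
            have.add(chunk_hash)  # a repeated chunk only needs one download
    return missing


# Device B holds version 1; Device A has just published version 2.
version_1 = ["h-intro", "h-body", "h-outro"]
version_2 = ["h-intro", "h-body-edited", "h-outro"]

print(chunks_to_download(version_2, version_1))  # ['h-body-edited']
```

After downloading the missing chunks, the device can rebuild the file by concatenating chunks in manifest order, reusing the ones it already had.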
6. Offline Support & Conflicts
- Client Metadata: Every client maintains a local database (SQLite) to track file states.
- Conflict Resolution: If two devices edit the same file while offline, the system typically either uses a "Last Write Wins" strategy, which silently discards the earlier edit, or creates a conflicted copy (e.g., "File_copy_from_DeviceB.txt") so that both edits survive.
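The conflicted-copy approach can be sketched as follows. It assumes each client remembers the server version its edit was based on (the kind of state the local SQLite database would hold); the class and function names are hypothetical, and this is one way to detect a conflict rather than the only one.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class ServerFile:
    name: str
    version: int  # current version known to the Metadata Service


@dataclass
class PendingEdit:
    device: str
    base_version: int  # server version the device had synced before editing offline


def conflicted_copy_name(name: str, device: str) -> str:
    """Build a name like 'File_copy_from_DeviceB.txt' from 'File.txt'."""
    path = PurePosixPath(name)
    return f"{path.stem}_copy_from_{device}{path.suffix}"


def apply_edit(server: ServerFile, edit: PendingEdit) -> str:
    """Apply an edit and return the file name it was saved under.

    If the edit was based on the current version it becomes the next version.
    Otherwise another device got there first, so the edit is kept as a
    conflicted copy instead of overwriting the newer version.
    """
    if edit.base_version == server.version:
        server.version += 1
        return server.name
    return conflicted_copy_name(server.name, edit.device)


server = ServerFile(name="File.txt", version=3)
# Both devices edited version 3 while offline; Device A reconnects first.
print(apply_edit(server, PendingEdit("DeviceA", base_version=3)))  # File.txt
print(apply_edit(server, PendingEdit("DeviceB", base_version=3)))  # File_copy_from_DeviceB.txt
```

A "Last Write Wins" variant would skip the version check and always overwrite, which is simpler but loses Device A's edit in this scenario.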
Summary
A Google Drive-style system is a good case study in chunk-based storage. By separating file data (an object store such as S3) from metadata (a strongly consistent database) and deduplicating chunks by hash, you can build a system that manages petabytes of data while keeping the user experience seamless.
