Designing a Distributed ID Generator
Prerequisite: To understand why distributed IDs are hard, first read about Database Sharding and Partitioning.
In a distributed system, you often need to generate unique identifiers (IDs) for every record (e.g., Tweets, Orders, Photos). A standard database auto-increment won't scale because it requires a central "master" that becomes a bottleneck. We need a Distributed ID Generator that is unique, time-ordered, and highly available.
1. The Problem with UUIDs
UUIDs (e.g., 123e4567-e89b-12d3-a456-426614174000) are 128-bit strings.
- Problem A: They are long (128 bits vs 64 bits), increasing index size and memory footprint.
- Problem B: They are not naturally time-ordered. Random inserts into a B-Tree index cause massive performance degradation (page splits).
2. The Solution: Twitter Snowflake
Snowflake generates a 64-bit integer that is roughly time-ordered.
Bit Structure (64 bits):
- 1 bit: Unused (sign bit).
- 41 bits: Timestamp (ms) since a custom epoch. Provides ~69 years of unique IDs.
- 10 bits: Worker ID (Machine ID). Identifies the specific machine/server (allows for 1,024 unique workers).
- 12 bits: Sequence number. A counter for IDs generated in the same millisecond on the same worker (4,096 IDs per ms per worker).
![Snowflake Bit Structure Diagram Placeholder]
3. Why it Scales
- Decentralized: Each worker generates IDs independently. No central coordinator required.
- High Throughput: A single worker can generate over 4 million IDs per second.
- Time-ordered: Sorting by ID is equivalent to sorting by time, which optimizes database index performance.
4. The Critical Challenge: Clock Drift
What if one server's clock is ahead of others, or an NTP sync moves the clock backward?
- The Problem: If the clock moves backward, the generator could produce a duplicate ID that was already used in the "future."
- The Solution: The implementation must check if
current_timestamp < last_timestamp. If so, it should wait for the clock to catch up or throw an exception.
Summary
The Snowflake algorithm is a masterclass in bit manipulation and distributed autonomy. By embedding time and location into a 64-bit integer, you build a system that generates millions of unique, index-friendly IDs per second with zero coordination.
Next: Distributed Locking with Zookeeper Previous: Kafka Consumer Groups Explained
