Distributed ID Generation
--
· Characteristics of Unique ID
∘ Why we need monotonicity
· Alex Xu mind map
∘ chapter notes
· Different approaches to generate unique IDs
∘ 1. UUID
∘ 2.a. Snowflake
∘ 2.b. Problem with snowflake design
∘ 2.c. Snowflake using logical clocks
∘ 3. DB auto-increment
∘ 4. DB segment / range handler
∘ 5. Redis
∘ 6. Sonyflake (64 bits)
∘ 7. Use UNIX time stamps
∘ 8. TrueTime API
· System Diagram
· Steps Followed by ID Generator
∘ Generating the Epoch Timestamp
∘ Create Node ID
∘ Create Counter
∘ ID Generated
· Advance
· Best article
· TODO
· References
Characteristics of Unique ID
- Characteristics?
→ Unique (across multiple databases)
→ Sortable
→ IDs increase with time (not necessarily by 1); numeric only? (depends on the use case)
- Length? 64 bits
- Scale? Generate 10,000 IDs per second with minimal delay
- Uniqueness: We need to assign unique identifiers to different events for identification purposes.
- Scalability: The ID generation system should generate at least a billion unique IDs per day.
- Availability: Since multiple events happen even at the level of nanoseconds, our system should generate IDs for all the events that occur.
- 64-bit numeric ID: We restrict the length to 64 bits because this bit size is enough for many years in the future. Let’s calculate the number of years after which our ID range will wrap around.
→ Total numbers available = 2⁶⁴ = 1.8446744 x 10¹⁹
→ Estimated number of events per day = 1 billion = 10⁹
→ Number of events per year = 365 billion = 365×10⁹
→ Number of years to deplete identifier range = 2⁶⁴/(365×10⁹) = 50,539,024.8595 years
→ 64 bits should be enough for a unique ID length considering these calculations.
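As a quick check on the arithmetic above, a few lines of Java reproduce the depletion estimate (the one-billion-IDs-per-day figure is the assumption stated in the requirements):
import java.math.BigInteger;

public class IdRangeDepletion {
    public static void main(String[] args) {
        BigInteger totalIds = BigInteger.valueOf(2).pow(64);               // 2^64 possible 64-bit IDs
        BigInteger idsPerYear = BigInteger.valueOf(365L * 1_000_000_000L); // 1 billion IDs per day
        System.out.println(totalIds.divide(idsPerYear));                   // prints 50539024 (years)
    }
}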
Why we need monotonicity
- Ordering
- Conflict resolution, e.g., the earlier ID wins (transaction conflicts)
- Apps like Twitter, Facebook, and Instagram need it for timeline rendering
- Who gets a lock? Often the request with the smaller ID. Example: Amazon needs this when there's a race condition while allocating inventory
Different approaches to generate unique IDs
1. UUID
- A UUID has 128 bits. — it’s a long ID and it’s not something you’d easily write down or remember.
- It is simple to generate and no need to call another service. — Every service in your distributed system can roll out its own unique ID, no chit-chat needed.
- However, it is not sequential, which makes it inefficient for database indexing.
- Additionally, a UUID doesn't strictly guarantee global uniqueness, so we need to be careful about ID conflicts (although the chances are slim).
- UUIDs aren’t sortable (except for versions 1 and 2).
- It’s simple, there’s no need for initial setups or a centralized system to manage the ID.
- Using 128-bit numbers as primary keys makes the primary-key indexing slower, which results in slow inserts.
- A workaround might be to interpret an ID as a hex string instead of a number. However, non-numeric identifiers might not be suitable for many use cases.
- The ID isn't 64 bits. Moreover, there's a chance of duplication. Although this chance is minimal, we can't claim a UUID to be deterministically unique.
- Additionally, UUIDs given to clients over time might not be monotonically increasing.
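For reference, generating a random (version 4) UUID in Java is a single standard-library call; the sketch below also splits the 128-bit value into its two 64-bit halves, which is why it can't fit into a single 64-bit numeric ID:
import java.util.UUID;

public class UuidExample {
    public static void main(String[] args) {
        UUID id = UUID.randomUUID();                 // version 4, randomly generated
        System.out.println(id);                      // 36-character hex-and-dash string
        // 128 bits = two 64-bit halves, so it cannot be stored in one long.
        System.out.println(id.getMostSignificantBits());
        System.out.println(id.getLeastSignificantBits());
    }
}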
2.a. Snowflake
Snowflake’s ID generation process has multiple components: a timestamp, a machine ID, and a serial number. The first bit is left unused to ensure IDs stay positive. The generator doesn’t need to talk to a central ID service over the network, so it is fast and scalable.
Snowflake implementations vary. For example, data center ID can be added to the “MachineID” component to guarantee global uniqueness.
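As a rough sketch, the commonly used Snowflake layout packs a 41-bit timestamp, a 10-bit machine ID, and a 12-bit sequence number into one positive 64-bit value; the epoch constant below is Twitter's commonly cited start time, and the class and method names are illustrative:
public class SnowflakeLayout {
    static final long CUSTOM_EPOCH = 1288834974657L; // Twitter's commonly cited epoch (Nov 2010)
    static final int MACHINE_ID_BITS = 10;           // up to 1,024 machines
    static final int SEQUENCE_BITS = 12;             // up to 4,096 IDs per machine per millisecond

    // Pack the three components into one 64-bit ID; the sign bit stays 0.
    static long compose(long timestampMillis, long machineId, long sequence) {
        return ((timestampMillis - CUSTOM_EPOCH) << (MACHINE_ID_BITS + SEQUENCE_BITS))
                | (machineId << SEQUENCE_BITS)
                | sequence;
    }
}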
2.b. Problem with snowflake design
Physical clocks are unreliable; for such clocks, the error can be as much as 17 seconds per day. If we measure time with them on a server, the measured time drifts away from the true time.
Considering a single server, we won’t be affected by the drifting away of time since all transactions land on a single server. But in a distributed environment, the clocks won’t remain synced.
Due to the unreliability of measuring accurate time, no matter how often we synchronize these clocks with each other or other clocks with accurate measurement methods, there will always be skew between the various clocks involved in a distributed system.
Another weak point of this system is its reliance on time. NTP can affect how it works. If the clock on one server drifts two seconds into the future, the other servers appear two seconds behind it. NTP recognizes the drift and recalibrates the clock, after which all servers are aligned again. However, during that drift, IDs could have been generated for a time that hadn't occurred yet, and we can end up with a pair of possibly nonconcurrent events carrying the same time stamp. Lastly, the causality of our events won't be maintained.
Note: The Network Time Protocol (NTP) is a networking protocol for clock synchronization between computer systems over packet-switched, variable-latency data networks. NTP intends to synchronize all participating computers within a few milliseconds of Coordinated Universal Time (UTC). It mitigates the effects of variable network latency.
Having accurate time still remains an issue. We can read a machine’s time-of-day clock with microsecond or even nanosecond resolution. Even with this fine-grained measurement, the risks of NTP remain. Since we can’t rely on physical clocks, let’s put logical clocks to use.
2.c. Snowflake using logical clocks
- In Lamport clocks, each node has its own counter. All of the system’s nodes are equipped with a numeric counter that begins at zero when first activated. Before executing an event, the node increments its counter by one, and the message sent for that event to another node carries the counter value. When the other node receives the message, it first updates its logical clock to the maximum of its own clock value and the value carried in the message, and then executes the message (a minimal sketch follows below).
- Lamport clocks provide a unique partial ordering of events using the happened-before relationship. We can also get a total ordering of events by tagging unique node/process identifiers, though such ordering isn’t unique and will change with a different assignment of node identifiers. However, we should note that Lamport clocks don’t allow us to infer causality at the global level. This means we can’t simply compare two clock values on any server to infer happened-before relationship. Vector clocks overcome this shortcoming.
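A minimal sketch of the Lamport rule described above, assuming a single counter per node (the class and method names are illustrative, not from any particular library):
import java.util.concurrent.atomic.AtomicLong;

public class LamportClock {
    private final AtomicLong counter = new AtomicLong(0); // starts at zero when the node is activated

    // Called before executing a local event or sending a message; the returned value is stamped on the message.
    public long tick() {
        return counter.incrementAndGet();
    }

    // Called when a message arrives: take the maximum of the local and received counters, then increment.
    public long onReceive(long receivedCounter) {
        return counter.updateAndGet(local -> Math.max(local, receivedCounter) + 1);
    }
}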
Vector clocks maintain causal history — that is, all information about the happened-before relationships of events. So, we must choose an efficient data structure to capture the causal history of each event.
Consider the design shown below. We’ll generate our ID by concatenating relevant information, just like the Twitter snowflake, with the following division:
- Sign bit: A single bit is assigned as a sign bit, and its value will always be zero.
- Vector clock: This is 53 bits and holds the counters of the nodes.
- Worker number: This is 10 bits. It gives us 2¹⁰ = 1,024 worker IDs.
Our approach with vector clocks works. However, to completely capture causality, a vector clock must hold at least n counters for n participating nodes. As a result, when the total number of participating nodes is enormous, vector clocks require a significant amount of storage. Some systems nowadays, such as web applications, treat every browser as a client of the system. Such information increases the ID length significantly, making it difficult to handle, store, use, and scale.
3. DB auto-increment
Most database products offer auto-increment identity columns. Since this is supported in the database, we can leverage its transaction management to handle concurrent visits to the ID generator. This guarantees uniqueness in one table. However, this involves network communications and may expose sensitive business data to the outside. For example, if we use this as a user ID, our business competitors will have a rough idea of the total number of users registered on our website.
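A minimal JDBC sketch of leaning on the database's auto-increment column; the table name, column names, and MySQL-style AUTO_INCREMENT DDL in the comment are assumptions for illustration:
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class AutoIncrementId {
    // Assumes a table like: CREATE TABLE events (id BIGINT AUTO_INCREMENT PRIMARY KEY, payload VARCHAR(255))
    static long insertAndGetId(Connection conn, String payload) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO events (payload) VALUES (?)", Statement.RETURN_GENERATED_KEYS)) {
            ps.setString(1, payload);
            ps.executeUpdate();
            try (ResultSet keys = ps.getGeneratedKeys()) {
                keys.next();
                return keys.getLong(1); // the ID assigned by the database's auto-increment counter
            }
        }
    }
}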
4. DB segment / range handler
An alternative approach is to retrieve IDs from the database in batches and cache them in the ID servers, each ID server handling a segment of IDs. This greatly saves the I/O pressure on the database.
Let’s try to overcome the problems identified in the previous methods. We can use ranges in a central server. Suppose we have multiple ranges for one to two billion, such as 1 to 1,000,000; 1,000,001 to 2,000,000; and so on. In such a case, a central microservice can provide a range to a server upon request.
Any server can claim a range when it needs it for the first time or if it runs out of the range. Suppose a server has a range, and now it keeps the start of the range in a local variable. Whenever a request for an ID is made, it provides the local variable value to the requestor and increments the value by one.
Let’s say server 1 claims the number range 300,001 to 400,000. After this range claim, the user ID 300,001 is assigned to the first request. The server then returns 300,002 to the next user, incrementing its current position within the range. This continues until user ID 400,000 is released by the server. The application server then queries the central server for the next available range and repeats this process.
This resolves the problem of the duplication of user IDs. Each application server can respond to requests concurrently. We can add a load balancer over a set of servers to mitigate the load of requests.
We use a microservice called range handler that keeps a record of all the taken and available ranges. The status of each range can determine if a range is available or not. The state — that is, which server has what range assigned to it — can be saved on a replicated storage.
This microservice can become a single point of failure, but a failover server acts as the savior in that case. The failover server hands out ranges when the main server is down. We can recover the state of available and unavailable ranges from the latest checkpoint of the replicated store.
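A minimal sketch of the range-claiming logic on an application server, as described above: hand out IDs from the locally held range and ask the range handler for a fresh block only when the current one is exhausted. RangeHandlerClient and claimNextRange are hypothetical names for the central microservice's API:
// Hypothetical client for the central range-handler microservice.
interface RangeHandlerClient {
    long[] claimNextRange(); // returns {startInclusive, endInclusive}, e.g. {300001, 400000}
}

public class RangeBasedIdGenerator {
    private final RangeHandlerClient rangeHandler;
    private long next;          // next ID to hand out
    private long endInclusive;  // last ID in the currently claimed range

    public RangeBasedIdGenerator(RangeHandlerClient rangeHandler) {
        this.rangeHandler = rangeHandler;
        claim();
    }

    public synchronized long nextId() {
        if (next > endInclusive) {
            claim(); // current range exhausted; fetch a new one from the range handler
        }
        return next++;
    }

    private void claim() {
        long[] range = rangeHandler.claimNextRange();
        this.next = range[0];
        this.endInclusive = range[1];
    }
}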
Pros
This system is scalable, available, and yields user IDs that have no duplicates. Moreover, we can maintain this range in 64 bits, which is numeric.
Cons
We lose a significant range when a server dies and can only provide a new range once it’s live again. We can overcome this shortcoming by allocating shorter ranges to the servers, although ranges should be large enough to serve identifiers for a while.
5. Redis
We can also use Redis key-value pair to generate unique IDs. Redis stores data in memory, so this approach offers better performance than the database.
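A minimal sketch assuming the Jedis client and Redis's atomic INCR command (the key name and connection details are placeholders; any Redis client with an incr operation works the same way):
import redis.clients.jedis.Jedis;

public class RedisIdGenerator {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // INCR is atomic on the Redis server, so concurrent callers never receive the same value.
            long id = jedis.incr("global:next_id");
            System.out.println(id);
        }
    }
}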
6. Sonyflake (64 bits)
Inspired by Snowflake, Sonyflake makes a few alterations to the distribution of its bits:
- Sign bit (1 bit)
- Timestamp (39 bits): Sonyflake uses a 10-millisecond time unit, which expands the duration coverage from ~70 years (as in Snowflake) to ~174 years.
- Machine/process ID (16 bits)
- Sequence (8 bits): This permits only 256 IDs every 10 ms, which is slower than Snowflake and increases the risk of running out of sequence numbers during peak traffic.
Given its specifications, Sonyflake seems more suitable for small to medium-sized systems where extreme speed and scale aren't critical.
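A rough sketch of Sonyflake's packing for comparison; the 10 ms time unit is what stretches 39 bits to roughly 174 years, and the start-time constant and field ordering follow Sonyflake's published description (treat them as assumptions here):
public class SonyflakeLayout {
    static final long START_EPOCH_MILLIS = 1409529600000L; // 2014-09-01 00:00:00 UTC, Sonyflake's default start time
    static final int SEQUENCE_BITS = 8;    // 256 IDs per 10 ms per machine
    static final int MACHINE_ID_BITS = 16;

    // Timestamp is counted in 10 ms units: 2^39 * 10 ms ≈ 174 years.
    static long compose(long nowMillis, long sequence, long machineId) {
        long elapsedTenMillis = (nowMillis - START_EPOCH_MILLIS) / 10;
        return (elapsedTenMillis << (SEQUENCE_BITS + MACHINE_ID_BITS))
                | (sequence << MACHINE_ID_BITS)
                | machineId;
    }
}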
7. Use UNIX time stamps
UNIX time stamps are granular to the millisecond and can be used to distinguish different events. Suppose we have a single ID-generating server that can generate one ID per millisecond. Any request for a unique ID is routed to that server, which uses the current time stamp to produce the unique ID. Generating one ID per millisecond gives us a thousand identifiers per second, that is, 24 (hours) × 60 (min/hour) × 60 (sec/min) × 1,000 (IDs/sec) = 86,400,000 IDs in a day. That's less than a billion per day.
Our system generates IDs well, but it has a crucial problem: the ID-generating server is a single point of failure (SPOF). To address the SPOF, we can add more servers, each generating a unique ID every millisecond. To keep the overall identifier unique across the system, we attach the server ID to the UNIX time stamp. We then add a load balancer to distribute the traffic efficiently. The design of a unique ID generator using UNIX time stamps is given below:
Pros
This approach is simple, scalable, and easy to implement. It also enables multiple servers to handle concurrent requests.
Cons
For two concurrent events, the same time stamp is returned and the same ID can be assigned to them. This way, the IDs are no longer unique.
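A minimal sketch of the timestamp-plus-server-ID idea: keep the millisecond timestamp in the high bits (so IDs stay roughly time-sorted) and append the server's ID in the low bits. The 10-bit server-ID width is an assumption for illustration:
import java.time.Instant;

public class TimestampPlusServerId {
    static final int SERVER_ID_BITS = 10; // assumed split: up to 1,024 servers

    // One ID per server per millisecond: timestamp in the high bits, server ID in the low bits.
    static long nextId(long serverId) {
        long millis = Instant.now().toEpochMilli();
        return (millis << SERVER_ID_BITS) | serverId;
    }
}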
8. TrueTime API
- Google’s TrueTime API in Spanner is an interesting option. Instead of a particular time stamp, it reports an interval of time.
- When asking for the current time, we get back two values: the earliest and latest ones. These are the earliest possible and latest possible time stamps.
- Based on its uncertainty calculations, the clock knows that the actual current time is somewhere within that interval.
- The width of the interval depends, among other things, on how long it has been since the local quartz clock was last synchronized with a more accurate clock source.
- Google deploys a GPS receiver or atomic clock in each data center, and clocks are synchronized within about 7 ms. This allows Spanner to keep the clock uncertainty to a minimum. The uncertainty of the interval is represented as epsilon.
Spanner guarantees that if two confidence intervals don't overlap (that is, A_earliest < A_latest < B_earliest < B_latest), then B definitely happened after A.
We generate our unique ID using TrueTime intervals. Let's say the earliest time in the interval is T_E, the latest is T_L, and the uncertainty is ε. We use T_E in milliseconds as the time stamp in our unique ID.
- Time stamp: The time stamp is 41 bits. We use T_E as the time stamp.
- Uncertainty: The uncertainty is four bits. Since the maximum uncertainty is claimed to be 6–10 ms, we’ll use four bits for storing it.
- Worker number: This is 10 bits. It gives us 2¹⁰ = 1,024 worker IDs.
- Sequence number: This is eight bits. For every ID generated on the server, the sequence number is incremented by one. It gives us 2⁸ = 256 combinations. We’ll reset it to zero when it reaches 256.
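A sketch of packing the four fields above into 64 bits. T_E (relative to a custom epoch), the uncertainty, the worker number, and the sequence are assumed to be supplied by the surrounding system; this is not Spanner's actual API:
public class TrueTimeIdLayout {
    static final int UNCERTAINTY_BITS = 4;
    static final int WORKER_BITS = 10;
    static final int SEQUENCE_BITS = 8;

    // 1 sign bit + 41 timestamp bits + 4 uncertainty bits + 10 worker bits + 8 sequence bits = 64 bits.
    static long compose(long earliestMillisSinceCustomEpoch, long uncertaintyMillis, long workerId, long sequence) {
        return (earliestMillisSinceCustomEpoch << (UNCERTAINTY_BITS + WORKER_BITS + SEQUENCE_BITS))
                | (uncertaintyMillis << (WORKER_BITS + SEQUENCE_BITS))
                | (workerId << SEQUENCE_BITS)
                | sequence;
    }
}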
Pros
- TrueTime satisfies all the requirements. We’re able to generate a globally unique 64-bit identifier. The causality of events is maintained. The approach is scalable and highly available.
Cons
- If two intervals overlap, then we’re unsure in what order A and B occurred.
- It’s possible that they’re concurrent events, but a 100% guarantee can’t be given.
- Additionally, Spanner is expensive because it ensures high database consistency.
- The dollar cost of a Spanner-like system is also high due to its elaborate infrastructure needs and monitoring.
System Diagram
An application calls the ID generator to generate a new ID. For example, a social networking application calls the ID generator to assign a unique ID to each post.
This is how the request will flow:
Each new post reaches the load balancer and is directed to an application server. The application server calls the distributed ID generator to assign a unique ID to the new request. The ID generator creates an ID and returns it to the application server.
Next, let’s zoom into the “ID Generator” block in the above diagram and understand how it works. Here’s the system diagram for the ID Generator:
- When a new request reaches the ID Generator, a vacant id of datatype long is created. This id has 64 bits that are initially vacant.
- Next, the system determines the time of the request and writes this epoch timestamp into the id, shifted 22 bits to the left so that there's space for appending the Node ID and Counter later on.
- Next, the system determines the Node ID (or Machine ID) and fills it into the id, shifted 12 bits to the left, leaving space for the counter.
- Finally, the system appends the counter (which increments with every new request in the same millisecond).
- The final ID is returned to the application server, which stores this id against the request in its database.
Steps Followed by ID Generator
Generating the Epoch Timestamp
- The epoch timestamp is the number of seconds that have elapsed since January 1, 1970, midnight (GMT). Since our timestamp needs millisecond precision, we use the epoch time in milliseconds.
- This value can be generated by the system as a long and used as the prefix of the unique ID. With time as the prefix, every ID generated by the system, even across multiple machines, will be larger than the IDs generated before it.
- For instance, it was 22 October 2021, 12:39:16 at the time of writing this sentence. Converted to an epoch-millisecond timestamp, that is 1634906356000.
- To get the epoch-millisecond timestamp for a request, import the library java.time.Instant at the start of the program and call Instant.now(). Adjust it for the custom epoch (the time at which the system started, which gives the full ~69.73 years before the timestamp rolls over) to get the timestamp that you return to the ID-generating function.
- Suppose our ID generator started on 23 October 2018, 5:26:20. Taking the custom epoch from this reference, the field CUSTOM_EPOCH is set to 1540272380000.
private static long timestamp() {
    return Instant.now().toEpochMilli() - CUSTOM_EPOCH;
}
With the above code, we have the first 41 bits of our id. We fill the timestamp into the id using a left shift:
long id = Timestamp << (NODEIDBITS + COUNTERBITS);
By shifting Timestamp 10 + 12 = 22 bits to the left when adding it to the id, we leave the rightmost 22 bits vacant for appending the Node ID and Counter bits later.
This is what the id looks like now: the timestamp fills the upper bits, and the lower 22 bits are still vacant.
Create Node ID
To generate the Node ID or Machine ID, we will use the machine’s MAC address. This ensures that the Machine ID is unique for every machine. To get the MAC address of the machine, make sure you import the library java.net.NetworkInterface at the start of the program.
You can then call networkInterface.getHardwareAddress() to read the MAC address, from which the Node ID component of int datatype is derived.
Now that we have the next component, i.e. the Node ID, let’s use it to fill the next 10 bits of the id:
id |= (nodeId << COUNTERBITS);
By shifting the Node ID 12 bits to the left, we place the Node ID in its correct position next to the Timestamp while keeping the last 12 bits vacant for the Counter bits.
This is what the id looks like now:
Create Counter
Now the last 12 bits are still vacant. This space is reserved for the Counter.
We know that the counter cannot exceed 2¹² - 1 = 4095. When the counter reaches 4095, it rolls over to 0 for the next id. For this, we need to define the max value of the counter.
private static final int maxCounter =
    (int)(Math.pow(2, COUNTERBITS) - 1);
To increment the counter for each new request that comes in during the same millisecond, we have the following code. Note that if the counter wraps around after reaching maxCounter, the request has to wait until the next millisecond to be given an id. If the request comes in during a fresh millisecond, the counter is reset to 0 and then increments for each new request within that millisecond.
if (currentTimestamp == lastTimestamp) {
    counter = (counter + 1) & maxCounter;
    if (counter == 0) {
        // maxCounter reached, wait until next millisecond
        currentTimestamp = wait(currentTimestamp);
    }
} else {
    // reset counter to 0 for each new millisecond
    counter = 0;
}
The counter is appended to the id with the following code.
id |= counter;
The above code places the counter into its place in the id (the rightmost bits that were vacant and reserved for the counter).
ID Generated
All the bits are now filled, and the complete id is ready for the request. Depending on the application, the id can be associated with a post, a user, a transaction, or something else. The unique id is returned to the app server, which stores it in its database against the request.
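Putting the snippets from this walkthrough together, a consolidated sketch of the generator could look like the following. The 41/10/12 bit split and CUSTOM_EPOCH match the values used above; deriving the node ID from the MAC address is left to the caller, and a busy-wait stands in for the wait() helper mentioned earlier:
import java.time.Instant;

public class UniqueIdGenerator {
    private static final int NODEIDBITS = 10;
    private static final int COUNTERBITS = 12;
    private static final long CUSTOM_EPOCH = 1540272380000L; // 23 October 2018, 05:26:20 UTC
    private static final int maxCounter = (int) (Math.pow(2, COUNTERBITS) - 1); // 4095

    private final long nodeId;          // e.g. derived from the MAC address, truncated to 10 bits
    private long lastTimestamp = -1L;
    private long counter = 0L;

    public UniqueIdGenerator(long nodeId) {
        this.nodeId = nodeId & ((1L << NODEIDBITS) - 1);
    }

    public synchronized long nextId() {
        long currentTimestamp = timestamp();
        if (currentTimestamp == lastTimestamp) {
            counter = (counter + 1) & maxCounter;
            if (counter == 0) {
                // maxCounter reached in this millisecond; wait until the next one
                while (currentTimestamp <= lastTimestamp) {
                    currentTimestamp = timestamp();
                }
            }
        } else {
            // reset counter to 0 for each new millisecond
            counter = 0;
        }
        lastTimestamp = currentTimestamp;

        long id = currentTimestamp << (NODEIDBITS + COUNTERBITS); // 41 timestamp bits
        id |= (nodeId << COUNTERBITS);                            // 10 node-ID bits
        id |= counter;                                            // 12 counter bits
        return id;
    }

    private static long timestamp() {
        return Instant.now().toEpochMilli() - CUSTOM_EPOCH;
    }
}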
Advance
- Clock Synchronization
— In our design, we assume that the ID-generation servers share the same clock. This assumption might not hold when a server is running on multiple cores.
— The same challenge exists in multi-machine scenarios.
— You don't need to discuss solutions to this problem in an interview; doing so increases the complexity and might send the discussion down a different path altogether.
— Network Time Protocol (NTP) is the most popular solution to this problem. For interested readers, refer to the reference materials below.
- Section Length Tuning — For example, fewer sequence-number bits but more timestamp bits work well for low-concurrency, long-lived applications.
- High Availability — Since an ID generator is a very critical system and many different systems depend on it, it must be highly available.
Best article
TODO
- https://freedium.cfd/https://medium.com/double-pointer/system-design-interview-scalable-unique-id-generator-twitter-snowflake-or-a-similar-service-18af22d74343
- https://freedium.cfd/https://levelup.gitconnected.com/how-to-generate-unique-ids-in-distributed-systems-6-key-strategies-37a8ab3b367d
- https://youtu.be/V3btjD1Jv6Y?si=b8vj_biEo040ak9U&t=341