Unique ID Generation in Distributed Systems

Q: How does the UUID version 4 ensure uniqueness and what are the trade-offs involved in using it for ID generation in distributed systems?

UUID version 4 relies on randomness: 122 of its 128 bits are chosen randomly, and the remaining 6 bits are reserved for the version and variant. The likelihood of collision is extremely low; generating 1 billion UUIDs per second for about 86 years would yield only a 50% chance of a single collision. The trade-off is that a version 4 UUID carries no intrinsic information such as a timestamp, so it cannot be sorted by creation time. This is a limitation wherever sortable or time-based IDs are necessary. Furthermore, the length of UUIDs can be a burden for database performance and human readability.

Q: How do the bit allocations of Snowflake and Sonyflake ID systems impact their ability to distribute load across various nodes or datacenters?

The bit allocations in Snowflake and Sonyflake determine how each spreads load. Snowflake uses 5 bits for the datacenter ID and 5 bits for the machine ID, allowing for 32 datacenters with 32 machines each. Sonyflake allocates 16 bits to the machine ID (65,536 machines) and only 8 bits to the sequence number, in contrast to Snowflake's 12-bit sequence. This lets Sonyflake address far more machines, but each machine can issue fewer IDs per time unit. As a result, Snowflake is better suited to large-scale, multi-datacenter operations that need high throughput, while Sonyflake spreads generation across many machines at a slower rate per machine.

Q: What modifications does Sonyflake introduce to Snowflake's ID generation model, and how do these changes impact its use in different scale systems?

Sonyflake modifies Snowflake's ID generation model by reallocating its bits and reducing temporal precision in exchange for a longer timespan. Specifically, it uses a 39-bit timestamp in 10-millisecond units, which covers about 174 years, compared to Snowflake's 41-bit millisecond timestamp, which covers about 70 years. It allocates 16 bits to the Machine/Process ID and 8 bits to the sequence number, so each machine can issue at most 256 IDs per 10 ms, a slower pace than Snowflake's 4,096 IDs per millisecond. These changes make Sonyflake better suited to smaller systems where extreme speed and scalability aren't essential, while keeping the ID scheme usable for decades longer.

Q: How could NanoID's character set advantage impact its use in web development compared to traditional UUIDs?

NanoID's use of a 64-character set that includes only URL-safe characters (including hyphens and underscores) makes it particularly convenient for web development, where URLs should be compact and need no escaping. This can reduce encoding issues, shorten URLs, and improve readability across different browsers and servers. Traditional UUIDs, being longer and less URL-optimized, make for longer URLs and HTTP requests.

Q: What challenges do decentralized sequence-based ID generation models face, and how do these affect their implementation in distributed systems?

Decentralized sequence-based ID generation models face challenges such as non-global sequences, which complicate sorting, and possible load imbalances if a particular node handles an excessive number of requests. Each node in such a model increments its own counter, so ordering is inconsistent across nodes. Moreover, under high traffic a node still processes its requests one after another, which increases latency. In practice this means extra coordination and additional mechanisms are needed to balance load and keep data coherently ordered across the system.

Q: In what way does the restructuring of time in UUID version 6 address sorting issues in distributed databases?

UUID version 6 addresses sorting issues in distributed databases by rearranging the UUID format so that the time components come first, which makes the UUIDs sort in creation order. This allows better database optimization, since chronologically sortable keys are easier to index, query, and manage in large datasets. It resolves the sorting inefficiency of UUID layouts that do not put the time components first.

Q: In what scenarios is Twitter's Snowflake ID generation method particularly useful, and what are its limitations for smaller systems?

Twitter's Snowflake ID generation is particularly useful in large-scale distributed systems that require high throughput and unique, sortable IDs. The Snowflake ID's structure, consisting of a timestamp, datacenter ID, machine ID, and a sequence number, allows for efficient ID generation and sorting, which is critical for systems serving a user base as large as Twitter's. However, its complexity can be a limitation for smaller systems, where features like multiple datacenters or millisecond-level timestamps amount to over-engineering. Small to medium-sized businesses might find this level of engineering excessive, and simpler methods could suffice.

Q: Why might using a centralized sequence-based ID generation system lead to failure in a high-demand distributed environment?

A centralized sequence-based ID generation system can fail in high-demand distributed environments because it is both a bottleneck and a single point of failure. When multiple nodes or systems rely on a central counter, any delay or failure in that counter slows transactions system-wide or, worse, brings them to a complete standstill. Additionally, under high load or many concurrent requests, the central system may become overloaded and fail to keep up with demand for new IDs.

Q: What are the specific advantages of using NanoID over UUID in distributed systems?

NanoID's main advantages over UUID in distributed systems are its shorter length and URL-friendliness. A NanoID is 21 characters drawn from a 64-character set, which is more compact than the typical 36-character UUID string. This helps when IDs are stored as strings or placed in URLs. However, in binary form the difference is only 2 bits (126 versus 128), which is often negligible. NanoID is therefore most attractive where string storage and URL length are critical factors.

Q: How do ObjectIDs maintain unique generation in distributed systems like MongoDB, and what advantages do they present?

ObjectIDs maintain uniqueness in distributed systems like MongoDB by combining a 4-byte timestamp, a 5-byte random value that is unique to the machine or process, and a 3-byte incrementing counter. This structure gives each ObjectID a sortable component (the timestamp) and makes collisions very unlikely thanks to the combination of per-process randomness and a counter. Advantages of ObjectIDs include their compact 96-bit size, which is shorter than a UUID, and their suitability for sorting. Additionally, they embed the creation time, which is helpful for debugging and system audits.

The document discusses strategies for generating unique IDs in distributed systems. It summarizes 6 key strategies: UUIDs, NanoIDs, sequences, ObjectIDs, Twitter Snowflakes, and Sonyflakes. Each method has benefits and challenges for ensuring ID uniqueness at scale. UUIDs are very unlikely to collide but are long, while Snowflakes are more compact at 64 bits but have a 70 year time limit. Sequences can be sorted but don't perform well under load. ObjectIDs include timestamps but reveal machine details.

Uploaded by

butko.yehor


How to Generate Unique IDs in Distributed Systems: 6 Key Strategies

In a distributed environment, two nodes can assign IDs simultaneously. The
challenge is ensuring these IDs remain unique, avoiding overlaps and keeping the
system consistent.

Phuong Le (@func25)

Published in Level Up Coding
8 min read · Oct 10

Distributed Unique ID (source: [Link])

If you found the infographic helpful, please give it some 👏 to let me know. I’d also love to
hear any feedback you have.

Imagine you have two nodes operating concurrently in a distributed system and
both are responsible for generating IDs for objects stored in a shared storage.

How do you ensure that these nodes produce unique IDs that won’t collide?

Various strategies exist, and each can be judged against these requirements:

Uniqueness: Every ID generated should be unique across all nodes in the
system.

Scalability: The system should be able to generate IDs at a high rate without
collisions.

Monotonically Increasing: (If needed) The IDs should increase over time.

I’ve come across many ways to generate IDs. Together, let’s take a closer look at a
few of them:

UUID.

NanoID.

Sequence.

ObjectID.

Twitter Snowflake.

Sonyflake (inspired by Snowflake).

Each method has its own benefits and challenges. As we go through each one, I’ll
share my thoughts.

1. UUID or GUID (128 bits)

When talking about generating unique IDs, UUIDs or Universally Unique Identifiers
come to mind.

A UUID is made of 32 hexadecimal characters. Remember, each character is 4 bits.

So, all in all, it’s 128 bits. And when you include the 4 hyphens, you’ll see 36
characters:

6e965784-98ef-4ebf-b477-8bd14164aaa4
5fd6c336-48c4-4510-bfe5-f7928a83a3e2
0333be18-5ecc-4d7e-98d4-80cc362e4ade

UUID (source: [Link])

Here are the common UUID versions:

Version 1 - Time-based MAC: This UUID uses the MAC address of your
computer and the current time.

Version 2 - DCE Security: Similar to Version 1 but with extra info like POSIX
UID or GID.

Version 3 - Name-based MD5: This one takes a namespace and a string, then
uses MD5 to create the UUID.

Version 4 - Randomness: Nearly every bit is chosen randomly.

Version 5 - Name-based SHA1: Think of Version 3, but instead of MD5, it uses
SHA-1.

Version 6 - Reordered Time: A tweak to Version 1 that allows UUIDs to be sorted
by creation time, optimizing database storage.

You may also want to consider other newer proposals such as Version 7 - Unix Epoch
Time, among the latest drafts at ramsey/uuid.

I won’t go into the details of each version right now. But if you’re unsure about which
to choose, I’ve found Version 4 - Randomness to be a good starting point. It’s
straightforward and effective.

“Random and unique? How’s that even possible?”

The magic lies in its super low chance of collision.

Pulling from what I’ve read on Wikipedia, imagine generating 1 billion UUIDs every
second for about 86 years. Only then would you have a 50% chance of seeing at
least one collision.
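
If you want to sanity-check that figure yourself, here is a rough sketch based on the standard birthday approximation, assuming 122 random bits per UUID. The function names are mine, made up for this example.

```go
package main

import (
	"fmt"
	"math"
)

// uuidsForCollision estimates how many random IDs must be generated before
// the chance of at least one collision reaches p, using the birthday
// approximation n ≈ sqrt(2 * N * ln(1/(1-p))) with N = 2^randomBits.
func uuidsForCollision(randomBits int, p float64) float64 {
	space := math.Pow(2, float64(randomBits))
	return math.Sqrt(2 * space * math.Log(1/(1-p)))
}

// yearsAtRate converts a number of IDs into years at a fixed generation rate.
func yearsAtRate(count, perSecond float64) float64 {
	const secondsPerYear = 365.25 * 24 * 60 * 60
	return count / perSecond / secondsPerYear
}

func main() {
	n := uuidsForCollision(122, 0.5) // UUID v4 has 122 random bits
	fmt.Printf("IDs needed for a 50%% collision chance: %.3g\n", n)
	fmt.Printf("Years at 1 billion IDs per second: %.0f\n", yearsAtRate(n, 1e9))
}
```

The estimate comes out at roughly 2.7 quintillion IDs, which is where the 86-year figure comes from.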

“Why do some say UUID has only 122 bits when it’s clearly 128 bits?”

When people talk about UUIDs, they often refer to the most common type, which is
variant 1, version 4.

In this type, 6 out of the 128 bits are already set for specific purposes: 4 bits tell us
it’s version 4 (or “v4”), and 2 bits are reserved for variant information.

So, only 122 bits are left to be filled in randomly.
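
To make those 6 fixed bits concrete, here is a minimal sketch of a version 4 generator using only the Go standard library. The name newUUIDv4 is my own; in real projects you would more likely reach for an existing UUID package.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newUUIDv4 fills 16 bytes with random data, then overwrites the 6 fixed
// bits: 4 version bits (0100) and 2 variant bits (10).
func newUUIDv4() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // high nibble of byte 6 becomes the version, 4
	b[8] = (b[8] & 0x3f) | 0x80 // top two bits of byte 8 become the variant, 10
	// %x prints each byte as two lowercase hex digits, giving 8-4-4-4-12.
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

func main() {
	id, err := newUUIDv4()
	if err != nil {
		panic(err)
	}
	fmt.Println(id)
}
```

This is why the third group of a version 4 UUID always starts with 4, and the fourth group always starts with 8, 9, a, or b.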

Pros
It’s simple: there’s no need for initial setup or a centralized system to manage
the ID.

Every service in your distributed system can roll out its own unique ID, no
chit-chat needed.

Cons
With 128 bits, it’s a long ID and it’s not something you’d easily write down or
remember.

It doesn’t reveal much information about itself.

Version 4 UUIDs aren’t sortable by creation time. Versions 1 and 2 do embed a
timestamp, but its fields aren’t stored in chronological order, which is the gap
Version 6 closes.

2. NanoID (126 bits)

Drawing from the concept of UUID, NanoID streamlines things a bit with just 21
characters, sourced from an alphabet of 64 characters that includes hyphens and
underscores.

Doing the math, each NanoID character carries 6 bits (2⁶ = 64), as opposed to the
4 bits of a UUID character. A quick multiplication, 21 × 6, and we see NanoID
coming in at a neat 126 bits.

NUp3FRBx-27u1kf1rmOxn
XytMg-01fzdSaHoKXnPMJ
_4hP-0rh8pNbx6-Qw1pMl
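
To show how little is involved, here is a small NanoID-style sketch. It is not the official library: the alphabet ordering and the name newNanoID are assumptions made for illustration.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// alphabet holds 64 URL-safe symbols. The ordering is an assumption for this
// example; the real NanoID library uses its own ordering.
const alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_-"

// newNanoID returns a random ID of the given length. Because 256 is a
// multiple of 64, keeping the low 6 bits of each random byte picks every
// symbol with equal probability.
func newNanoID(size int) (string, error) {
	buf := make([]byte, size)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	for i, b := range buf {
		buf[i] = alphabet[b&63]
	}
	return string(buf), nil
}

func main() {
	id, err := newNanoID(21)
	if err != nil {
		panic(err)
	}
	fmt.Println(id)
}
```

Each symbol contributes 6 random bits, so the default length of 21 gives the 126 bits mentioned above.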

“Does storing NanoID vs. UUID in a database make much of a difference?”

Well, if you’re saving them as strings, NanoID might be a bit more efficient, being 15
characters shorter than UUID. In their binary forms, though, the difference is a mere 2
bits, a minor detail in most storage scenarios.

Nano ID (source: [Link])

Pros
NanoID uses characters (A-Za-z0-9_-) which are friendly with URLs.

At just 21 characters, it’s more compact than UUID, shaving off 15 characters to
be precise (though it’s 126 bits versus UUID’s 128).

Cons
NanoID is newer and might not be as widely supported as UUID.

3. Sequences

Sequence or auto-increment might come to mind, as it’s the method that databases
like PostgreSQL and MySQL commonly use.

At its core, there’s a centralized counter ticking upwards. But picture a scenario with
millions of simultaneous requests: this central point then turns into both a
bottleneck and a potential single point of failure.

“So, what? We can’t distribute the load or something?”

We can. Instead of one centralized generator, each node can have its very own ID
generator, incrementing as it goes:

Node A: 0 10 20 30 40
Node B: 1 11 21 31 41
Node C: 2 12 22 32 42

// Alternatively
Node A: a_1 a_2 a_3 a_4
Node B: b_1 b_2 b_3 b_4

But, sorting becomes a bit tricky.
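
As a rough illustration of the first scheme, here is a sketch of a per-node generator where every node shares a step of 10 and gets its own starting offset. The type and function names are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// StrideSequence hands out offset, offset+stride, offset+2*stride, and so on.
// Giving every node a distinct offset below the shared stride keeps the
// sequences of different nodes from overlapping.
type StrideSequence struct {
	mu     sync.Mutex
	next   uint64
	stride uint64
}

// NewStrideSequence creates a generator for one node.
func NewStrideSequence(offset, stride uint64) *StrideSequence {
	return &StrideSequence{next: offset, stride: stride}
}

// Next returns the node's next ID. The mutex serializes callers, which is
// also why a single busy node can still become a bottleneck.
func (s *StrideSequence) Next() uint64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	id := s.next
	s.next += s.stride
	return id
}

func main() {
	nodeB := NewStrideSequence(1, 10)
	nodeC := NewStrideSequence(2, 10)
	for i := 0; i < 4; i++ {
		fmt.Println("B:", nodeB.Next(), "C:", nodeC.Next())
	}
}
```

Note that an ID of 12 from node C may well have been created before an ID of 11 from node B, which is exactly the sorting problem.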

Sequence (source: [Link])

While the system is distributed, each node can still be a bottleneck: if a node is
overwhelmed with countless requests, it will process them one after the other.

Pros
It’s a straightforward approach with the added bonus of sortable IDs, making it
apt for small to medium-sized systems.

Cons
Doesn’t perform well under sudden, high-volume request spikes.

Removing nodes in a decentralized system can complicate matters.

IDs from a decentralized model don’t follow a global sequence, complicating any
sorting efforts.

4. ObjectID (96 bits)

ObjectID is MongoDB’s answer to a unique document ID. This 12-byte identifier
typically resides in the “_id” field of a document, and if you’re not setting it yourself,
MongoDB steps in to do it for you.

Here’s what makes up an ObjectID:

Timestamp (4 bytes): This represents the time the object was created, measured
in seconds from the Unix epoch (a timestamp from 1970, for those who might need a
refresher).

Random Value (5 bytes): Each machine or process gets its own random value.

Counter (3 bytes): A simple incrementing counter for a given machine or process.

“But how does each process ensure its random value is unique?”

With 5 bytes, we’re talking about 2⁴⁰ potential values. Given the limited number of
machines or processes, collisions are exceedingly rare.

ObjectID (source: [Link])

When representing ObjectIDs, MongoDB goes with hexadecimal, turning those 12
bytes (or 96 bits) into 24 characters (shown here with spaces between the three parts):

6502b4ab cf09f864b0 074858
6502b4ab cf09f864b0 074859
6502b4ab cf09f864b0 07485a
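
Going the other way, here is a small sketch that splits one of these hex strings back into its three parts. The ObjectIDParts type and parseObjectID function are hypothetical helpers for this example, not part of the MongoDB driver.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
	"time"
)

// ObjectIDParts is a hypothetical helper type for this example.
type ObjectIDParts struct {
	Timestamp time.Time
	Random    [5]byte
	Counter   uint32
}

// parseObjectID splits a 24-character hex ObjectID into a 4-byte big-endian
// timestamp in seconds, 5 random bytes, and a 3-byte counter.
func parseObjectID(s string) (ObjectIDParts, error) {
	var p ObjectIDParts
	raw, err := hex.DecodeString(s)
	if err != nil {
		return p, err
	}
	if len(raw) != 12 {
		return p, fmt.Errorf("expected 12 bytes, got %d", len(raw))
	}
	p.Timestamp = time.Unix(int64(binary.BigEndian.Uint32(raw[0:4])), 0).UTC()
	copy(p.Random[:], raw[4:9])
	p.Counter = uint32(raw[9])<<16 | uint32(raw[10])<<8 | uint32(raw[11])
	return p, nil
}

func main() {
	// The first example ID from above, written without spaces.
	p, err := parseObjectID("6502b4abcf09f864b0074858")
	if err != nil {
		panic(err)
	}
	fmt.Println(p.Timestamp.Format(time.RFC3339), hex.EncodeToString(p.Random[:]), p.Counter)
}
```

In the three examples the timestamp and random value are identical and only the counter moves, which suggests they came from the same process within the same second.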

For those acquainted with Go, here’s a peek at its implementation:

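// Process-wide state: a counter that starts at a random value, plus the
// random bytes that identify this process.
var objectIDCounter = readRandomUint32()
var processUnique = processUniqueBytes()

func NewObjectIDFromTimestamp(timestamp time.Time) ObjectID {
	var b [12]byte

	// Bytes 0-3: seconds since the Unix epoch, big-endian.
	binary.BigEndian.PutUint32(b[0:4], uint32(timestamp.Unix()))
	// Bytes 4-8: the per-process random value.
	copy(b[4:9], processUnique[:])
	// Bytes 9-11: the low 24 bits of an atomically incremented counter.
	putUint24(b[9:12], atomic.AddUint32(&objectIDCounter, 1))

	return b
}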

Pros
Gives IDs a rough global order by creation time (to the second) without needing a
centralized authority to oversee uniqueness.

In terms of byte size, it’s more compact than both UUID and NanoID.

Using IDs for sorting is straightforward, and you can easily see when each object
was made.

Reveals the specific process or machine that created an item.

Scales gracefully, thanks to its time-based structure ensuring no future conflicts.

Cons
Despite its relative compactness, 96 bits can still be considered long.

Be careful when sharing ObjectIDs with clients, as they might reveal too much.

5. Twitter Snowflake (64 bits)

Commonly known as “Snowflake ID”, this system was developed by Twitter to
efficiently generate IDs for their massive user base.

A Snowflake ID boils down to a 64-bit integer, which is more compact than
MongoDB’s ObjectID. It is split into the following fields:

Sign Bit (1 bit): This bit is typically unused, though it can be reserved for specific
functions.

Timestamp (41 bits): Much like ObjectID, it records the creation time, here in
milliseconds since a chosen epoch, spanning ~70 years from its starting point.

Datacenter ID (5 bits): Identifies the physical datacenter location. With 5 bits, we
can have up to 2⁵ = 32 datacenters.

Machine/Process ID (5 bits): Tied to individual machines, services, or processes
creating data.

Sequence (12 bits): An incrementing counter that resets to 0 every millisecond.
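
To see how those fields fit into a single integer, here is a minimal sketch that packs them with shifts and pulls them back out with masks. The constant and function names are mine, and it assumes callers pass values that already fit each field.

```go
package main

import "fmt"

// Field widths from the layout above; the sign bit stays 0.
const (
	datacenterBits = 5
	machineBits    = 5
	sequenceBits   = 12

	machineShift    = sequenceBits
	datacenterShift = sequenceBits + machineBits
	timestampShift  = sequenceBits + machineBits + datacenterBits
)

// pack combines the four fields into one 64-bit ID. Values are assumed to
// already fit their field widths.
func pack(ms, datacenter, machine, seq int64) int64 {
	return ms<<timestampShift | datacenter<<datacenterShift | machine<<machineShift | seq
}

// unpack reverses pack by shifting and masking each field back out.
func unpack(id int64) (ms, datacenter, machine, seq int64) {
	ms = id >> timestampShift
	datacenter = (id >> datacenterShift) & (1<<datacenterBits - 1)
	machine = (id >> machineShift) & (1<<machineBits - 1)
	seq = id & (1<<sequenceBits - 1)
	return
}

func main() {
	id := pack(123456789, 3, 17, 42)
	ms, dc, machine, seq := unpack(id)
	fmt.Println(id, ms, dc, machine, seq)
}
```

Because the timestamp sits in the highest bits, comparing two IDs as plain integers compares their creation times first, which is what makes Snowflake IDs sortable.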

“Hold on. 70 years? So from 1970, it’s done by 2040?”

Exactly, if you count from 1970.

That’s why many Snowflake implementations use a custom epoch that begins more recently,
like Nov 04 2010 01:42:54 UTC, for instance. As for its advantages, they’re pretty
evident given the design.

Twitter Snowflake (source: [Link])
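
Here is a quick sketch of how much a custom epoch buys you. It assumes the epoch constant 1288834974657, which, as far as I know, is the Nov 04 2010 value bwmarrin/snowflake uses by default; lastYear is a hypothetical helper.

```go
package main

import (
	"fmt"
	"time"
)

// twitterEpochMs is the Nov 04 2010 custom epoch in Unix milliseconds
// (assumed here to match the default in bwmarrin/snowflake).
const twitterEpochMs int64 = 1288834974657

// lastYear returns the calendar year (UTC) in which a timestamp field of the
// given width runs out, counting ticks of unitMs milliseconds from epochMs.
func lastYear(epochMs int64, bits uint, unitMs int64) int {
	ticks := int64(1)<<bits - 1
	return time.UnixMilli(epochMs + ticks*unitMs).UTC().Year()
}

func main() {
	fmt.Println("41 bits from 1970 run out in:", lastYear(0, 41, 1))
	fmt.Println("41 bits from Nov 2010 run out in:", lastYear(twitterEpochMs, 41, 1))
}
```

Moving the epoch forward doesn’t add range, it just stops you from spending it on years that passed before your system existed.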

You can take a look at how it’s implemented in Go with bwmarrin/snowflake :

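func (n *Node) Generate() ID {
	n.mu.Lock()

	// Milliseconds elapsed since the node's epoch.
	now := time.Since(n.epoch).Nanoseconds() / 1000000

	if now == n.time {
		// Same millisecond as the previous ID: advance the sequence.
		n.step = (n.step + 1) & n.stepMask

		if n.step == 0 {
			// Sequence exhausted: spin until the next millisecond.
			for now <= n.time {
				now = time.Since(n.epoch).Nanoseconds() / 1000000
			}
		}
	} else {
		n.step = 0
	}

	n.time = now

	// Combine timestamp, node ID, and sequence into one integer.
	r := ID((now)<<n.timeShift |
		(n.node << n.nodeShift) |
		(n.step),
	)

	n.mu.Unlock()
	return r
}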

Cons
Might be over-engineered for medium-sized businesses, especially with complex
setups like multiple data centers, millisecond-level timestamps, sequence
resets...

It packs features some may find excessive, but for giants like Twitter, it’s right on the
money.

6. Sonyflake (64 bits)

Inspired by Snowflake, Sonyflake makes a few alterations in the distribution of
its bits:

Sign bit (1 bit)

Timestamp (39 bits): Sonyflake counts in units of 10 milliseconds, which expands the
duration coverage from ~70 years (like in Snowflake) to ~174 years.

Machine/Process ID (16 bits)

Sequence (8 bits): This permits 256 IDs every 10 ms per machine. That is considerably
slower than Snowflake, so during peak times a generator can exhaust its sequence
and has to wait for the next 10 ms window.

Sonyflake (source: [Link])

Given its specifications, Sonyflake seems more suitable for small to medium-sized
systems where extreme speed and scale aren’t important.
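
The trade-off is easy to put into numbers. Here is a small back-of-the-envelope sketch that compares the two layouts using only the bit widths given above; the helper names are made up for this example.

```go
package main

import "fmt"

// spanYears returns how many years a timestamp field of the given width
// covers when each tick lasts unitMs milliseconds.
func spanYears(bits uint, unitMs float64) float64 {
	const msPerYear = 365.25 * 24 * 60 * 60 * 1000
	return float64(uint64(1)<<bits) * unitMs / msPerYear
}

// idsPerSecond returns the per-node ceiling: 2^seqBits IDs per tick.
func idsPerSecond(seqBits uint, unitMs float64) float64 {
	return float64(uint64(1)<<seqBits) * 1000 / unitMs
}

func main() {
	fmt.Printf("Snowflake: %.1f years, %.0f IDs/s per node\n", spanYears(41, 1), idsPerSecond(12, 1))
	fmt.Printf("Sonyflake: %.1f years, %.0f IDs/s per node\n", spanYears(39, 10), idsPerSecond(8, 10))
}
```

So Sonyflake gives up a large share of per-node throughput (25,600 versus about 4 million IDs per second) in exchange for a longer lifetime and many more machine IDs.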

Reflecting on these methods, there’s flexibility here and you’re not confined to a
one-size-fits-all solution.

Depending on your unique challenges and goals, there’s room to adapt these
systems or even design a custom ID generator that aligns perfectly with your
business needs.

Connect with me, I’m sharing insights on [Link] and thoughts on Twitter.
