0% found this document useful (0 votes)
48 views

How To Generate Unique IDs in Distributed Systems

The document discusses strategies for generating unique IDs in distributed systems. It summarizes 6 key strategies: UUIDs, NanoIDs, sequences, ObjectIDs, Twitter Snowflakes, and Sonyflakes. Each method has benefits and challenges for ensuring ID uniqueness at scale. UUIDs are very unlikely to collide but are long, while Snowflakes are more compact at 64 bits but have a 70 year time limit. Sequences can be sorted but don't perform well under load. ObjectIDs include timestamps but reveal machine details.

Uploaded by

butko.yehor
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
48 views

How To Generate Unique IDs in Distributed Systems

The document discusses strategies for generating unique IDs in distributed systems. It summarizes 6 key strategies: UUIDs, NanoIDs, sequences, ObjectIDs, Twitter Snowflakes, and Sonyflakes. Each method has benefits and challenges for ensuring ID uniqueness at scale. UUIDs are very unlikely to collide but are long, while Snowflakes are more compact at 64 bits but have a 70 year time limit. Sequences can be sorted but don't perform well under load. ObjectIDs include timestamps but reveal machine details.

Uploaded by

butko.yehor
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 21

Open in app

Member-only story

How to Generate Unique IDs in Distributed


Systems: 6 Key Strategies
In a distributed environment, two nodes can simultaneously assign IDs, the
challenge is ensuring these IDs remain unique, avoiding overlaps and ensuring
system consistency.

Phuong Le (@func25) · Follow


Published in Level Up Coding
8 min read · Oct 10

Listen Share More


Distributed Unique ID (source: blog.devtrovert.com)

If you found the infographic helpful, please give it some 👏 to let me know. I’d also love to
hear any feedback you have.

Imagine you have two nodes operating concurrently in a distributed system and
both are responsible for generating IDs for objects stored in a shared storage.

How do you ensure that these nodes produce unique IDs that won’t collide?

Various strategies cater to these requirements, with each being apt to these
requirements:

Uniqueness: Every ID generated should be unique across all nodes in the


system.

Scalable: The system should be able to generate IDs at a high rate without
collisions.

Monotonically Increasing: (If needed) The IDs should increase over time.
I’ve come across many ways to generate IDs and together, let’s take a closer look at a
few of them:

UUID.

NanoID.

Sequence.

ObjectID.

Twitter Snowflake.

Sonyflake (inspired by Snowflake).

Each method has its own benefits and challenges, as we go through each one, I’ll
share my thoughts.

1. UUID or GUID (128 bits)


When talking about generating unique IDs, UUIDs or Universal Unique Identifiers
come to mind.

A UUID is made of 32 hexadecimal characters. Remember, each character is 4 bits.


So, all in all, it’s 128 bits. And when you include the 4 hyphens, you’ll see 36
characters:

6e965784–98ef-4ebf-b477–8bd14164aaa4

5fd6c336-48c4-4510-bfe5-f7928a83a3e2

0333be18-5ecc-4d7e-98d4-80cc362e4ade
UUID (source: blog.devtrovert.com)

There are 5 common types of UUID:

Version 1 — Time-based MAC: This UUID uses the MAC Address of your
computer and the current time.

Version 2 — DCE Security: Similar to Version 1 but with extra info like POSIX
UID or GID.

Version 3 — Name-based MD5: This one takes a namespace and a string, then
uses MD5 to create the UUID.

Version 4 — Randomness: Every character is chosen randomly.

Version 5 — Name-based SHA1: Think of Version 3, but instead of MD5, it uses


SHA-1.

Version 6 — Reordered Time: A tweak to Version 1, allowing UUIDs to be sorted


by creation time, optimizing database storage.

…You may want to consider other drafts like Version 6 — Reordered Time and
Version 7 — Unix Epoch Time, etc, among the latest proposals at ramsey/uuid

I won’t go into the details of each version right now. But if you’re unsure about which
to choose, I’ve found Version 4 — Randomness to be a good starting point. It’s
straightforward and effective.

“Random and unique? How’s that even possible?”

The magic lies in its super low chance of collision.


Pulling from what I’ve read on Wikipedia, imagine generating 1 billion UUIDs every
second for 86 whole years and only then would you have a 50% chance of getting a
single match.

“Why do some say UUID has only 122 bits when it’s clearly 128 bits?”

When people talk about UUIDs, they often refer to the most common type, which is
variant 1, version 4.

In this type, 6 out of the 128 bits are already set for specific purposes: 4 bits tell us
it’s version 4 (or “v4”), and 2 bits are reserved for variant information.

So, only 122 bits are left to be filled in randomly.

Pros
It’s simple, there’s no need for initial setups or a centralized system to manage
the ID.

Every service in your distributed system can roll out its own unique ID, no chit-
chat needed.
Cons
With 128 bits, it’s a long ID and it’s not something you’d easily write down or
remember.

It doesn’t reveal much information about itself.

UUIDs aren’t sortable (except for versions 1 and 2).

2. NanoID (126 bits)


Drawing from the concept of UUID, NanoID streamlines things a bit with just 21
characters but these characters are sourced from an alphabet of 64 characters,
inclusive of hyphens and underscores.

Doing the math, each NanoID character takes up 6 bits, as opposed to the 4 bits of
UUID and a quick multiplication, and we see NanoID coming in at a neat 126 bits.

NUp3FRBx-27u1kf1rmOxn
XytMg-01fzdSaHoKXnPMJ
_4hP-0rh8pNbx6-Qw1pMl
“Does storing NanoID vs. UUID in a database make much of a difference?”

Well, if you’re saving them as strings, NanoID might be a bit more efficient, being 15
characters shorter than UUID, but in their binary forms, the difference is a mere 2
bits, often a minor detail in most storage scenarios.

Nano ID (source: blog.devtrovert.com)

Pros
NanoID uses characters (A-Za-z0–9_-) which is friendly with URLs.

At just 21 characters, it’s more compact than UUID, shaving off 15 characters to
be precise (though it’s 126 bits versus UUID’s 128)

Cons
NanoID is newer and might not be as widely supported as UUID.

3. Sequences
Sequence or auto-increment might come to mind, as it’s the method that databases
like PostgreSQL and MySQL commonly use

At its core, there’s a centralized counter ticking upwards, but picture a scenario with
millions of simultaneous requests. This central point then turns into both a
bottleneck and a potential single point of failure.

“So, what? We can’t distribute the load or something?”

Absolutely. Instead of one centralized generator, each node can have its very own ID
generator, incrementing as it goes:
Node A: 10 20 30 40
Node B: 1 11 21 31 41
Node C: 2 12 22 32 42

// Alternatively
Node A: a_1 a_2 a_3 a_4
Node B: b_1 b_2 b_3 b_4

But, sorting becomes a bit tricky.

Sequence (source: blog.devtrovert.com)

While the system is distributed, each node can still be a bottleneck, if a node is
overwhelmed with countless requests, it will process them one after the other.

Pros
It’s a straightforward approach with the added bonus of sortable IDs, making it
apt for small to medium-sized systems.

Cons
Doesn’t perform well under sudden, high-volume request spikes.

Removing nodes in a decentralized system can complicate matters

IDs from a decentralized model don’t follow a global sequence, complicating any
sorting efforts.

4. ObjectID (96 bits)


ObjectID is MongoDB’s answer to a unique document ID, this 12-byte identifier
typically resides in the “_id” field of a document, and if you’re not setting it yourself,
MongoDB steps in to do it for you.

Here’s what makes up an ObjectID :

Timestamp (4 bytes): This represents the time the object was created, measured
from the Unix epoch (a timestamp from 1970, for those who might need a
refresher).

Random Value (5 bytes): Each machine or process gets its own random value.

Counter (3 bytes): A simple incrementing counter for a given machine

“But how does each process ensure its random value is unique?”

With 5 bytes, we’re talking about 2⁴⁰ potential values, given the limited number of
machines or processes, collisions are exceedingly rare

ObjectID (source: blog.devtrovert.com)

When representing ObjectIDs, MongoDB goes with hexadecimal, turning those 12


bytes (or 96 bits) into 24 characters

6502b4ab cf09f864b0 074858


6502b4ab cf09f864b0 074859
6502b4ab cf09f864b0 07485a

For those acquainted with Go, here’s a peek at its implementation:


var objectIDCounter = readRandomUint32()
var processUnique = processUniqueBytes()

func NewObjectIDFromTimestamp(timestamp time.Time) ObjectID {


var b [12]byte

binary.BigEndian.PutUint32(b[0:4], uint32(timestamp.Unix()))
copy(b[4:9], processUnique[:])
putUint24(b[9:12], atomic.AddUint32(&objectIDCounter, 1))

return b
}

Pros
Ensures a global order without needing a centralized authority to oversee
uniqueness

In terms of byte size, it’s more compact than both UUID and NanoID.

Using IDs for sorting is straightforward, and you can easily see when each object
was made.

Reveals the specific process or machine that created an item.

Scales gracefully, thanks to its time-based structure ensuring no future conflicts.

Cons
Despite its relative compactness, 96 bits can still be considered long.

Be careful when sharing ObjectIDs with clients, they might reveal too much.

5. Twitter Snowflake (64 bits)


Commonly known as “Snowflake ID”, this system was developed by Twitter to
efficiently generate IDs for their massive user base.

Also, a Snowflake ID boils down to a 64-bit integer, which is more compact than
MongoDB’s ObjectID

Sign Bit (1 bit): This bit is typically unused, though it can be reserved for specific
functions.

Timestamp (41 bits): Much like ObjectID, it represents data creation time in
milliseconds, spanning ~70 years from its starting point.
Datacenter ID (5 bits): Identifies the physical datacenter location. With 5 bits, we
can have up to 2⁵ = 32 datacenter.

Machine/ Process ID (5 bits): Tied to individual machines, services, or processes


creating data.

Sequence (12 bits): An incrementing counter that resets to 0 every millisecond.

“Hold on. 70 years? So from 1970, it’s done by 2040?”

Exactly.

Many Snowflake implementations use a custom epoch that begins more recently,
like Nov 04 2010 01:42:54 UTC, for instance. As for its advantages, they’re pretty
evident given the design.

Twitter Snowflake (source: blog.devtrovert.com)

You can take a look at how it’s implemented in Go with bwmarrin/snowflake :

func (n *Node) Generate() ID {

n.mu.Lock()

now := time.Since(n.epoch).Nanoseconds() / 1000000

if now == n.time {
n.step = (n.step + 1) & n.stepMask

if n.step == 0 {
for now <= n.time {
now = time.Since(n.epoch).Nanoseconds() / 1000000
}
}
} else {
n.step = 0
}

n.time = now

r := ID((now)<<n.timeShift |
(n.node << n.nodeShift) |
(n.step),
)

n.mu.Unlock()
return r
}

Cons
Might be over-engineered for medium-sized businesses, especially with complex
setups like multiple data centers, millisecond-level timestamps, sequence
resets...

It packs features some may find excessive, but for giants like Twitter, it’s right on the
money.

6. Sonyflake (64 bits)


Get inspired by Snowflake, Sonyflake makes a few alterations in the distribution of
its bits:

Sign bit (1 bit)

Timestamp (39 bits): Sonyflake operates at 10 milliseconds, expands the


duration coverage from ~70 years (like in Snowflake) to ~174 years

Machine/ Process ID (16 bits)

Sequence (8 bits): This permits 256 IDs every 10 ms, this is somewhat slower
than Snowflake, it increases the chance of ID overlap during peak time.
Sonyflake (source: blog.devtrovert.com)

Given its specifications, Sonyflake seems suitable more for small to medium-sized
systems where extreme speed and scale aren’t important.

Reflecting on these methods, there’s flexibility here and you’re not confined to a
one-size-fits-all solution.

Depending on your unique challenges and goals, there’s room to adapt these
systems or even design a custom ID generator that aligns perfectly with your
business needs.

Connect with me, I’m sharing insights on https://www.linkedin.com/in/quocphuong


and thoughts on Twitter.

System Design Interview Software Development Programming


Follow

Written by Phuong Le (@func25)


1K Followers · Writer for Level Up Coding

I enjoy helping new writers because I was one too. You can connect with me on LinkedIn:
www.linkedin.com/in/quocphuong or email me at [email protected].

More from Phuong Le (@func25) and Level Up Coding

Phuong Le (@func25) in Level Up Coding

Goroutine Scheduler Revealed: You’ll Never See Goroutines the Same


Way Again
You might have heard of the Goroutine Scheduler before, but how well do we really know how it
works? How does it pair goroutines with…

· 8 min read · Oct 29

269 4
Dr. Ashish Bamania in Level Up Coding

Google Rejected Max Howell(Creator Of Homebrew) For Getting This


Interview Question Wrong. Can You?
Can you solve this Google interview question?

· 4 min read · Oct 3

7.1K 75

Jayden Levitt in Level Up Coding


I Earned $5,795.00 From One Blog Post Using a Simple Strategy I See Few
Writers Sharing.
It’s a breeze to replicate.

· 8 min read · Oct 10

8.5K 176

Phuong Le (@func25) in Level Up Coding

Goroutines: Think You Know Go Basics? Think Again


Goroutines are the heart of Golang, with the efforts of the Go Team, developers can handle
tasks concurrently using just the ‘go’ keyword

· 7 min read · Sep 29

166 3

See all from Phuong Le (@func25)

See all from Level Up Coding


Recommended from Medium

Niels Claeys in datamindedbe

How we reduced our docker build times by 40%


This post describes two ways to speed up building your Docker images: caching build info
remotely, using the link option when copying files

5 min read · Oct 4

1.2K 11
Dr. Ashish Bamania in Level Up Coding

Google Rejected Max Howell(Creator Of Homebrew) For Getting This


Interview Question Wrong. Can You?
Can you solve this Google interview question?

· 4 min read · Oct 3

7.1K 75

Lists

General Coding Knowledge


20 stories · 541 saves

It's never too late or early to start something


15 stories · 193 saves

Coding & Development


11 stories · 258 saves

Stories to Help You Grow as a Software Developer


19 stories · 521 saves
Daniel Foo

Software Architecture and Design Trend 2023


2023 is almost coming to the end. It’s always a good idea to reflect back on what has been the
popular topic on Software Architecture and…

7 min read · Oct 1

Tom Jay
Stop using Integer ID’s in your Database
I’ve seen this over and over for the last 30 years, people let the database set the ID or Primary
Key of a table from the database, at…

· 3 min read · May 22

139

Karan Pratap Singh

System Design: Uber


Let’s design an Uber like ride-hailing service, similar to services like Lyft, OLA Cabs, etc.

11 min read · Jun 25

3
Simeon Grancharov in Stackademic

Solving Concurrency Problems with Redis and Golang


Introduction

11 min read · Oct 9

See more recommendations

You might also like