dice-free-license-cc0-980x649

Random ID Generator

Let’s continue our system design interview questions discussion. If you are new to this series, you can check our previous posts. Basically, each week we are going to pick several interesting interview questions and provide in-depth analysis.

It’s worth to note that the post is not about giving you something like a standard answer. Instead, we focus more on analyzing the problem and how to come up with reasonable approaches. This is even more true for system design interviews because the question can be extremely open-ended.

This week, we will talk about how to a random ID generator. We will cover a bunch of topics including scaling the ID generator and pros and cons of each approach.

 

Random ID Generator

In case you don’t know what an ID generator is, let me briefly explain this here. Suppose you are building a social network product like Twitter, you need to store each user in your database with a user ID, which is unique and used to identify each user in the system.

In some systems, you can just keep incrementing the ID from 1, 2, 3 … N. In other systems, we may need to generate the ID as a random string. Usually, there are few requirements for ID generators:

  • They cannot be arbitrarily long. Let’s say we keep it within 64 bits.
  • ID is incremented by date. This gives the system a lot of flexibility, e.g. you can sort users by ID, which is same as ordering by register date.

There can be some other requirements, especially when you want to scale the system to support millions or even billions of users.

 

Single machine

As we suggested previously, it’s good to start with something simple and keep optimizing it later, which is more true when the question is broad. If I got this question in a system design interview, most likely I’ll start with a single machine design. Not only is this easy to design, but it’s good enough for most of the cases.

In the simplest case, we can just keep incrementing ID from 1, 2, 3 … N, which in fact is one of the most popular ways to generate ID in many real life projects. If user A’s ID is larger than user B, then we know that A registered later.

However, this approach is hard to scale. Let’s say after one year there are too many users everyday that we have to scale the database to multiple instances. You’ll see that this approach won’t work because it may generate duplicate IDs for different users.

 

3rd party service

To scale the ID generator to multiple machines, one natural solution is to keep a single separate server that is only responsible for ID generation. More specifically, when a user registers to the product, for whichever database that handles this request, it’ll connect to the 3rd party server to ask for a random ID. Since all the ID generation is handled in a single server, there’s no risk of generating duplicate IDs.

However, the downside of this solution is obvious. Suppose the product is so popular that there can be a huge number of people registering within a single second, the 3rd party server will soon become the bottleneck. The server may either block registration or just crash.

 

Multiple machine solution

Thus, we have to scale the ID generation to multiple servers. If we want to have no communication between ID generation servers, each server itself should be able to generate unique IDs that are incremented by time. It should be quite natural to think about using timestamp to generate IDs.

Since within a single timestamp there can also be multiple users, we could solve this with two approaches.

  1. We assign a server ID to each ID generation server and the final ID is a combination of timestamp and the server ID.
  2. We can also allow multiple requests within a single timestamp on a single server. We can keep a counter on each server, which indicates how many IDs have been generated in the current timestamp. So the final ID is a combination of timestamp, serverID and the counter.

As mentioned previously, the ID cannot be arbitrarily long so that the counter may end up with only 8 bits for instance. In this case, the server can handle 256 requests within a single timestamp at most. If it frequently exceeds this limit, we need to add more instances.

In fact, this solution is what Twitter does to solve the problem. They open sourced their ID generator called Snowflake.

 

Clock synchronization

We ignored a crucial problem in the above analysis. In fact, there’s a hidden assumption that all ID generation servers have the same clock to generate the timestamp, which might not be true in distributed systems.

In reality, system clocks can drastically skew in distributed systems, which can cause our ID generators provide duplicate IDs or IDs with incorrect order. Clock synchronization is out of the scope of this discussion, however, it’s important for you to know such issue in this system. There are quite a few ways to solve this issue, check NTP if you want to know more about it.

 

Summary

Random ID generator is a very practical issue in many real life projects. Similar to many other systems, it seems to be easy at first glance, but there are quite a lot issues when scaling to multiple machines. If you want to know more about this topic, you can also check Flickr’s ticket servers, which is another approach besides Snowflake.

The post is written by Gainlo - a platform that allows you to have mock interviews with employees from Google, Amazon etc..

I'd like to learn more

Share on Facebook12Tweet about this on TwitterShare on LinkedIn20Share on Reddit11

5 thoughts on “Random ID Generator

  1. I think you should remove the “random” part of the title, since you are discussing a plain id generator (not a random one).

    “A random-number generator (RNG) is a computational or physical device designed to generate a sequence of numbers or symbols that can not be reasonably predicted better than by a random chance.”

  2. My solution to the problem was to use a Unique key on the database so that the ID is always unique. I’d also use a hardware RNG to generate a number in the 64 bit integer range. If the row won’t insert because of it having a duplicate key, then the DB will throw an error. We could try to insert it again under another number, or we could just pass the very rare error message to the end user so they can resubmit.

    The trade-off is that the DB will have to verify that that ID is unique, and thus it could cost an O(log n) overhead from a binary search with each insert. Another trade-off is that we wouldn’t know when they registered by looking at the ID.

    On the upside, the hardware module will give an excellent entropy source, thus reducing ID collisions, and it’s far less involved to set up the system this way as you don’t need to account for timestamps, server ID, etc. As a 64 bit integer is far larger than the total Earth population, then the risk of an ID collision is already immensely small.

    My example favors time over space, as we could still use the traditional createdAt, updatedAt, (and possibly deletedAt for soft-deletes), to handle the timestamps for actions on the account.

  3. The post mentions “In reality, system clocks can drastically skew in distributed systems, which can cause our ID generators provide duplicate IDs”, how could this happen due to Clock Synchronization?
    If each machine still uses “combination of timestamp, serverID and the counter.” I can understand IDs with incorrect order exist but how do duplicates come in place?

Leave a Reply

Your email address will not be published. Required fields are marked *