Design a Key-Value Store (Part II)

This is the second post in the Design a Key-Value Store series. If you haven't read the first post, please check it out first.

In our previous post, we mostly focused on the basic concepts of a key-value store, especially the single-machine scenario. When it comes to scaling, we need to distribute the data across multiple machines according to some rule, and a coordinator machine can direct clients to the machine that holds the requested resource.

There are many things you need to consider when designing a distributed system. When splitting data across multiple machines, it's important to balance the traffic, which is why it's better to make sure keys are distributed evenly, for example by hashing them.
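As a quick illustration of even key distribution, here's a minimal Python sketch (the `shard_for` helper is invented for this example) that maps keys to shards by hashing. Even highly sequential keys end up spread across machines:

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index by hashing it, so keys spread
    evenly across machines regardless of their original pattern."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Sequential keys like user:0, user:1, ... still land evenly on 4 shards.
counts = [0] * 4
for i in range(10_000):
    counts[shard_for(f"user:{i}", 4)] += 1
print(counts)  # each shard holds roughly 2,500 of the 10,000 keys
```

One caveat worth knowing: a plain `hash % N` scheme forces most keys to move when `N` changes, which is why real systems often use consistent hashing instead.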

In this post, we will continue our discussion of the distributed key-value storage system, covering topics like system availability, consistency and so on.


System availability

To evaluate a distributed system, one key metric is system availability. For instance, suppose one of our machines crashes for some reason (maybe a hardware issue or a program bug). How does this affect our key-value storage system?

Apparently, if someone requests resources from this machine, we won't be able to return the correct response. You may not worry about this issue when building a side project. However, if you are serving millions of users across tons of servers, machine failures happen quite often and you can't afford to manually restart a server every time. This is why availability is essential in every distributed system nowadays. So how would you address this issue?

Of course, you can write more robust code with test cases. However, your program will always have bugs, and hardware issues are even harder to protect against. The most common solution is replication. By setting up machines that hold duplicate copies of the data, we can significantly reduce system downtime. If a single machine has a 10% chance of being down at any given moment, then with a single backup machine (and assuming failures are independent), the chance that both are down at the same time drops to 10% × 10% = 1%.
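The arithmetic behind that claim is just independent probabilities multiplied together. A quick sketch:

```python
def unavailability(p_down: float, copies: int) -> float:
    """Probability that every copy is down at the same moment,
    assuming each machine fails independently with probability p_down."""
    return p_down ** copies

print(unavailability(0.10, 1))  # single machine: down 10% of the time
print(unavailability(0.10, 2))  # with one backup: both down only ~1% of the time
print(unavailability(0.10, 3))  # with two backups: ~0.1%
```

The independence assumption matters: replicas sharing a rack, power supply, or software bug will fail together far more often than this formula suggests.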


Replica vs. sharding

At first glance, replication seems quite similar to sharding. So how are the two related? And how would you choose between them when designing a distributed key-value store?

First of all, we need to be clear about the purpose of each technique. Sharding is basically used to split data across multiple machines, since a single machine cannot store all of it. Replication is a way to protect the system from downtime. With that in mind, if a single machine can't store all the data, replication alone won't help.



Consistency

By introducing replicas, we can make the system more robust. However, one issue is consistency. Let's say machine A1 has a replica A2. How do you make sure that A1 and A2 hold the same data? For instance, when inserting a new entry, we need to update both machines, but it's possible that the write operation fails on one of them. So over time, A1 and A2 can accumulate quite a lot of inconsistent data, which is a big problem.

There are a couple of solutions here. The first approach is to keep a local copy in the coordinator. Whenever a resource is updated, the coordinator keeps a copy of the updated version, so if the update fails on a replica, the coordinator is able to redo the operation.
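A sketch of that coordinator-retry idea, with a toy in-memory `Replica` class standing in for real machines (all names and the retry policy here are invented for illustration):

```python
class Replica:
    """Toy stand-in for a replica machine; can simulate failed writes."""
    def __init__(self, fail_times: int = 0):
        self.store = {}
        self._fail_times = fail_times

    def put(self, key, value):
        if self._fail_times > 0:
            self._fail_times -= 1
            raise ConnectionError("simulated write failure")
        self.store[key] = value

def _try_put(node, key, value) -> bool:
    try:
        node.put(key, value)
        return True
    except ConnectionError:
        return False

def replicate_write(key, value, replicas, max_retries: int = 3) -> bool:
    """Coordinator keeps the update around and re-sends it to any replica
    whose write failed, so all replicas eventually hold the same value."""
    pending = list(replicas)
    for _ in range(max_retries):
        pending = [node for node in pending if not _try_put(node, key, value)]
        if not pending:
            return True
    return False  # some replica never accepted the write; flag it for repair

a, b = Replica(), Replica(fail_times=1)
print(replicate_write("user:42", "alice", [a, b]))  # True: b succeeds on retry
```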

Another approach is a commit log. If you have used Git, the concept of a commit log should be quite familiar. Basically, each node machine keeps a commit log of every operation, which is like a history of all updates. So when we want to update an entry on machine A, the machine first stores the request in its commit log, and then a separate program processes all the commit logs in order (as a queue). Whenever an operation fails, we can easily recover because we can look up the commit log.
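Here's a minimal sketch of a node backed by a commit log, assuming one JSON record per line; the class name and log format are invented for illustration:

```python
import json
import os

class LoggedNode:
    """Toy node that appends every update to an on-disk commit log before
    applying it, so its in-memory table can be rebuilt after a crash."""
    def __init__(self, log_path: str):
        self.log_path = log_path
        self.store = {}
        self._replay()

    def _replay(self):
        # Recover state by re-applying every logged operation in order.
        if os.path.exists(self.log_path):
            with open(self.log_path) as f:
                for line in f:
                    op = json.loads(line)
                    self.store[op["key"]] = op["value"]

    def put(self, key, value):
        # Log first, then apply: if we crash after logging, replay fixes us up.
        with open(self.log_path, "a") as f:
            f.write(json.dumps({"key": key, "value": value}) + "\n")
        self.store[key] = value
```

Creating a fresh `LoggedNode` on the same log path simulates restarting after a crash: `_replay` rebuilds exactly the same `store` from the log.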

The last approach I'd like to introduce is resolving conflicts at read time. Suppose the requested resource is located on A1, A2 and A3. The coordinator can ask all three machines, and if by any chance the data differs, the system can resolve the conflict on the fly.
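A sketch of read-time conflict resolution (often called "read repair"), using per-key version numbers to decide which copy wins; the classes here are toy stand-ins for real replica machines:

```python
class VersionedReplica:
    """In-memory replica holding a (version, value) pair for each key."""
    def __init__(self):
        self.store = {}

    def get(self, key):
        return self.store.get(key, (0, None))

    def put(self, key, version, value):
        self.store[key] = (version, value)

def read_with_repair(key, replicas):
    """Ask every replica, return the value with the highest version,
    and push that value back to any replica holding a stale copy."""
    answers = [(node, *node.get(key)) for node in replicas]
    _, best_version, best_value = max(answers, key=lambda a: a[1])
    for node, version, _ in answers:
        if version < best_version:
            node.put(key, best_version, best_value)  # repair the stale copy
    return best_value
```

Real systems need a more careful notion of "newest" than a single counter (e.g. vector clocks), but the shape of the read path is the same.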

It's worth noting that these approaches are not mutually exclusive. You can definitely combine several of them depending on the application.


Read throughput

I'd also like to briefly mention read throughput in this post. Usually, a key-value storage system should be able to support a large number of read requests. So what approaches would you use to improve read throughput?

To improve read throughput, the common approach is to take advantage of memory. If the data is stored on disk inside each node machine, we can move part of it into memory. A more general idea is to use a cache. Since the post Design a Cache System has an in-depth analysis of this topic, I won't cover it in much detail here.
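As a taste of the idea, here's a minimal LRU cache sketch that could sit in front of disk reads on each node (the class and its capacity-based API are invented for illustration):

```python
from collections import OrderedDict

class LruCache:
    """Keep the hottest entries in memory; evict the least recently used
    entry when over capacity, so reads of popular keys can skip the disk."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None  # cache miss: caller falls back to disk
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # drop the least recently used
```

With capacity 2, inserting a third key evicts whichever of the first two was touched least recently.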



Don't take the analysis here as a set of standard answers. Instead, these common solutions should give you inspiration to come up with your own ideas.

There's no solution that works for every system, and you should always adjust your approach to the particular scenario.

The post is written by Gainlo - a platform that allows you to have mock interviews with employees from Google, Amazon, etc.


6 thoughts on "Design a Key-Value Store (Part II)"

  1. This is regarding consistency. Suppose a resource on a machine is updated. How are the replicas of that resource updated? What protocol do we follow? The coordinator-based approach just keeps track that the resource has been updated, but there would be some limit on the amount of data the coordinator can maintain. So how and when does the actual propagation of the update take place?
    In the commit log approach, how does the background process know the updated value of the resource? Do we also maintain timestamps? If yes, we also know that in a distributed environment the timestamps of different machines are not exactly synced.

    1. The coordinator keeps only the updates, not the entire data set. Some sort of request/response protocol can be used. That is, once the update request has been processed by the replicas, they can acknowledge with a response and the coordinator can go on with the rest of the updates.

      The commit log is implemented like a queue, so all the applied updates have already been dequeued. You wouldn't be able to see them in the commit log, right?

  2. "If a single machine has a 10% chance of being down, then with a single backup machine, we reduce the probability to 1% that both are down."

    Not sure how 1% came about.

      1. I think the confusion is with the word "crash". A crash generally means a failure event. In this case, the author was speaking of "downtime". If a single machine has a 10% chance of being down at any given moment, then with a single backup machine we reduce the probability of both machines being down at the same time to 1%.

        BTW – if you have a machine that’s down 10% of the time, you have a really big problem.
