Over the past couple of weeks, a lot of people have asked us to discuss eCommerce websites. Not only has this topic come up in quite a lot of system design interviews, but eCommerce websites are also so popular today that a lot of techniques and research have been developed for them.
Before digging into this topic, it’s worth understanding why designing an eCommerce website is popular in system design interviews. First of all, building an eCommerce website requires things like database design, system availability, concurrency considerations and so on. All of them are extremely important in today’s distributed systems. In addition, everyone has used an eCommerce website like Amazon. If you are generally curious about your surroundings, you have probably already thought about this topic.
In our guide, 8 Things You Need to Know Before a System Design Interview, we said that a common strategy in a system design interview is to start with simple and basic things instead of jumping into details directly. So how would you design the basic data structure of an eCommerce website? And what about the database schema?
I’ll skip the data structure for user models as it should be quite similar to other applications. Let’s focus on the product. In the simplest scenario, we need three major objects: Product, User and Order.
Product defines the basic model for a product in the shopping cart. Some important fields include price, the amount left, name, description, and the category. Category can be tricky here. Of course you can make it a string field in the SQL database, but a better approach is to have a Category table that contains the category ID, name and maybe other information, so that each product just keeps a category ID.
Order stores information about all the orders made by users. Each row contains the product ID, user ID, amount, timestamp, status and so on. When a user proceeds to checkout, we aggregate all the entries associated with this user to display in the shopping cart (of course we should filter out items that were bought in the past, for example by checking the status field).
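To make the schema above a bit more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names and status values ('in_cart', 'purchased') are assumptions made for illustration; the post only lists the fields in prose. The User table is left out, as in the text, so user_id is just a plain integer here.

```python
import sqlite3

# Hypothetical schema: names and status values are illustrative assumptions.
SCHEMA = """
CREATE TABLE category (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE product (
    id          INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    description TEXT,
    price_cents INTEGER NOT NULL,
    amount_left INTEGER NOT NULL,
    category_id INTEGER NOT NULL REFERENCES category(id)
);
CREATE TABLE orders (
    id         INTEGER PRIMARY KEY,
    product_id INTEGER NOT NULL REFERENCES product(id),
    user_id    INTEGER NOT NULL,
    amount     INTEGER NOT NULL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    status     TEXT NOT NULL DEFAULT 'in_cart'
);
"""


def open_store():
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def cart_items(conn, user_id):
    """Return (product name, amount) for entries the user has not bought yet."""
    rows = conn.execute(
        "SELECT p.name, o.amount "
        "FROM orders o JOIN product p ON p.id = o.product_id "
        "WHERE o.user_id = ? AND o.status = 'in_cart' "
        "ORDER BY o.id",
        (user_id,),
    )
    return rows.fetchall()


conn = open_store()
conn.execute("INSERT INTO category (id, name) VALUES (1, 'book')")
conn.execute(
    "INSERT INTO product (id, name, price_cents, amount_left, category_id) "
    "VALUES (1, 'Example Book', 4500, 3, 1)"
)
# One past purchase and one entry still sitting in the cart for user 42.
conn.execute(
    "INSERT INTO orders (product_id, user_id, amount, status) "
    "VALUES (1, 42, 1, 'purchased')"
)
conn.execute("INSERT INTO orders (product_id, user_id, amount) VALUES (1, 42, 2)")

print(cart_items(conn, 42))  # only the 'in_cart' entry is returned
```

Storing the price as an integer number of cents is a choice of this sketch, not something the post prescribes; it simply avoids floating-point rounding in money calculations.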
NoSQL in eCommerce
In case many people don’t know about NoSQL: in layman’s terms, a NoSQL database (a document store in particular) tries to store a bunch of related things in a single row instead of spreading them across multiple tables. For instance, instead of having a separate Order table, we can store all the items a user has bought in the same row of the User table. As a result, when fetching a user, we get not only all the personal information, but also their purchase history.
Why can NoSQL be (slightly) better in this case? Let’s use the Product model as an example. Suppose we are selling books. A product in the book category has tons of attributes like author, publication date, version, number of pages etc., and this SQL table may have 20 columns. That’s fine.
And now, we also want to sell laptops. So a product should also store the attributes of a laptop, including brand name, size, color etc. As you can imagine, with more categories introduced, the Product table can have tons of columns. If each category has 10 attributes on average, it’s going to be 100 columns with only 10 categories supported!
However, a great advantage of a NoSQL database like MongoDB is that it supports a huge number of “columns” like this. Each row can have a large number of columns, but not all of them need to be set. It’s like storing a JSON object as a row (in fact, MongoDB uses a very similar binary format called BSON). As a result, we can just store all those attributes (columns) of a product in a single row, which is exactly what a NoSQL database is good at.
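As a rough illustration of the idea, the sketch below uses plain Python dictionaries as a stand-in for documents in a document store. It does not talk to MongoDB; the field names and the find helper are hypothetical and only meant to show that a book and a laptop can sit in the same collection with different attributes.

```python
import json

# Each product is one "row" (document); only the relevant attributes are set.
products = [
    {"_id": 1, "category": "book", "name": "Example Book", "price": 45.0,
     "author": "A. Writer", "pages": 320},
    {"_id": 2, "category": "laptop", "name": "Example Laptop", "price": 999.0,
     "brand": "ExampleBrand", "screen_inches": 13, "color": "silver"},
]


def find(docs, **criteria):
    """Return the documents whose fields equal every given criterion."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


laptops = find(products, category="laptop")
print(json.dumps(laptops[0], indent=2))

# The book document simply has no laptop-specific fields at all.
print("brand" in products[0])
```

With a single SQL Product table, the same two products would need columns for author, pages, brand, screen size and color, most of them empty for any given row.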
Let’s move on to talk about scaling issues. When scaling an eCommerce website to multiple machines, tons of problems pop up. The most important thing is that an eCommerce website has almost zero tolerance for most of these issues.
Take concurrency as an example. Let’s say there’s only one book left in the store and two people buy it simultaneously. Without any concurrency control mechanism, it’s absolutely possible that both of them buy it successfully. How do you handle concurrency in eCommerce websites?
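The problem can be reproduced without any real threads by spelling out the unlucky interleaving by hand. The sketch below is a simplified simulation, with an in-memory dictionary standing in for the product row and hypothetical function names: both buyers read the amount left before either of them writes it back.

```python
# Hypothetical in-memory stand-in for the product row: one book left.
stock = {"book": 1}


def read_amount(item):
    """Step 1 of a purchase: fetch the amount left."""
    return stock[item]


def complete_purchase(item, amount_seen):
    """Step 2: write back based on the (possibly stale) amount read earlier."""
    if amount_seen <= 0:
        return False
    stock[item] = amount_seen - 1
    return True


# Unlucky interleaving: A and B both read before either one writes.
seen_by_a = read_amount("book")
seen_by_b = read_amount("book")
a_ok = complete_purchase("book", seen_by_a)
b_ok = complete_purchase("book", seen_by_b)

print(a_ok, b_ok, stock["book"])  # True True 0: one copy was sold twice
```

Each buyer's check looks correct in isolation; the bug only appears because the read and the write are not performed as one atomic step.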
Let’s analyze this step by step. From what we learned in OS classes, we know that a lock is the most common technique to protect shared resources. Suppose both user A and user B want to buy the same book. What we can do is, when A fetches the data about this book, place a lock on this row so that no one else can access it. Once A finishes the purchase (decreasing the amount left), we release the lock so that B can access the data. The same approach applies to all the other resources, and it solves the problem completely.
The above solution is called pessimistic concurrency control. Although it prevents all the conflicts caused by concurrency, the downside is that it’s costly. Every data access needs to acquire and release a lock, which may be unnecessary most of the time.
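Here is a minimal single-process sketch of the locking approach described above, using Python's threading module. The Inventory class and its method names are hypothetical, and a per-item threading.Lock stands in for a database row lock (in a real SQL database this role is usually played by something like SELECT ... FOR UPDATE inside a transaction).

```python
import threading


class Inventory:
    """Hypothetical in-memory inventory with one lock per item ("row")."""

    def __init__(self, amounts):
        self._amounts = dict(amounts)
        self._locks = {item: threading.Lock() for item in amounts}

    def buy(self, item):
        """Check and decrement the amount left while holding the item's lock."""
        with self._locks[item]:
            if self._amounts[item] <= 0:
                return False
            self._amounts[item] -= 1
            return True

    def amount_left(self, item):
        with self._locks[item]:
            return self._amounts[item]


inventory = Inventory({"book": 1})
results = []
results_lock = threading.Lock()


def shopper(name):
    ok = inventory.buy("book")
    with results_lock:
        results.append((name, ok))


# Users A and B try to buy the last copy at the same time.
threads = [threading.Thread(target=shopper, args=(n,)) for n in ("A", "B")]
for t in threads:
    t.start()
for t in threads:
    t.join()

successes = [name for name, ok in results if ok]
print(len(successes), inventory.amount_left("book"))  # 1 0
```

Whichever thread gets the lock first wins the book; the other one sees an amount of zero and fails. The cost mentioned above is visible here too: even amount_left, a pure read, has to take the lock.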
Can we solve the problem without a lock?
We’ll talk about a better solution to the concurrency issue, one that doesn’t use a lock, in the next post. There are so many topics I’d like to cover about eCommerce websites.
In fact, many techniques are common across all distributed systems. What’s important is to compare the pros and cons of each approach and select the one that works best for the particular application.
In the next post, we’ll continue discussing concurrency and will also talk about system availability and consistency.
By the way, if you want more guidance from experienced interviewers, you can check out Gainlo, which allows you to have mock interviews with engineers from Google, Facebook etc.