This week, we’re going to discuss a recent Uber interview question – weighted random numbers. This is also one of my favorite questions to ask, and I’m surprised at how many people fail this problem.

In a nutshell, we’ll cover a couple of topics in this post, including sampling, basic data structures, time/space complexity and tips for coding this question.

## Question

Write a function that returns values randomly, according to their weight.

Let me give you an example. Suppose we have 3 elements with their weights: A (1), B (1) and C (2). The function should return A with probability 25%, B with 25% and C with 50% based on the weights.

The answer is not obvious, but it’s not too hard to figure out. Also, writing bug-free code for it trips up the majority of candidates. This makes it a perfect question for coding interviews.

If you haven’t solved this problem before, I’ll give you 20 minutes for both thinking and coding. Please try to solve it before reading the analysis below.

## Analysis

The problem seems quite simple at first glance, but converting your idea into an algorithm is hard. When you get stuck, a practical tip is to try simple examples, which works pretty well for this type of question.

Let’s use the same example above. If we do a random sample regardless of the weight, we’ll end up giving all of them a probability of 1/3. Since we’d like to double the chance of selecting element C, an intuitive approach is to put two copies of C in the “bag” and do the sampling again. In other words, we now have {A, B, C, C} in the set, and we can again do a uniform random sample from it.

With that in mind, a naive solution is to duplicate each element as many times as its weight and then sample uniformly from the new set. There are two potential problems here; maybe you can think about them first.

## A more general approach

Two problems with the above solution:

- This won’t work when the weight is not an integer.
- If the weight is large, it uses too much memory, as it stores the same element many times.

Let’s think about a more general approach.

In essence, we can denote the new set above as a horizontal line divided into segments, where each element occupies a length proportional to its weight: A and B each take one unit, and C takes two.

The sampling is like randomly selecting a point on the line and seeing which area it falls into. Following this idea, we can use the following algorithm:

- Calculate **W**, the sum of all the weights (the length of the horizontal line).
- Get a random number **R** from [0, W] (randomly select a point).
- Go over each element in order and keep the sum of weights of visited elements. Once the sum is larger than **R**, return the current element. This is finding which area includes the point.
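To make the steps concrete, here is a minimal C++ sketch of the linear scan. The `Item` struct and the use of `rand()` are my own illustrative assumptions; any uniform random source over [0, W) works the same way.

```cpp
#include <cstdlib>
#include <vector>

// Minimal sketch of the algorithm above (names are illustrative).
struct Item { char value; double weight; };

// Step 3: find which segment of the horizontal line contains `point`,
// where 0 <= point < W (W = total weight).
char pickAt(const std::vector<Item>& items, double point) {
    double sum = 0;                        // running sum of visited weights
    for (const Item& it : items) {
        sum += it.weight;
        if (point < sum) return it.value;  // the point falls in this segment
    }
    return items.back().value;             // guard for floating-point edges
}

char weightedRandom(const std::vector<Item>& items) {
    double W = 0;
    for (const Item& it : items) W += it.weight;          // step 1: total W
    double point = W * (std::rand() / (RAND_MAX + 1.0));  // step 2: point in [0, W)
    return pickAt(items, point);                          // step 3: locate it
}
```

With weights A (1), B (1) and C (2), points in [0, 1) map to A, [1, 2) to B and [2, 4) to C, giving the 25/25/50 split.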

## Complexity & optimization

The time/space complexity should be straightforward. Since we are not using extra memory, the space complexity is O(1). Time complexity is O(N), because we have to iterate over all the elements in both steps 1 and 3. It’s worth noting that even if we move step 1 into preprocessing, we still have to do the traversal in step 3.

Any way to optimize this?

If making step 3 O(1) is hard, let’s try to make it O(logN). The first thing we should consider is binary search. If we keep an array where each element is the cumulative sum of the weights of all the previous elements, we can do a binary search to locate the element. This improves the algorithm to O(logN) (excluding the preprocessing).
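As a sketch of this idea (the function name is mine, not from the post): build the cumulative-sum array once in preprocessing, then use `std::upper_bound` to find the first prefix sum strictly greater than the random point.

```cpp
#include <algorithm>
#include <vector>

// Sketch of the O(logN) lookup. prefix[i] holds weight[0] + ... + weight[i],
// built once during preprocessing.
int locate(const std::vector<double>& prefix, double point) {
    // The first prefix sum strictly greater than `point` marks the segment
    // whose right edge lies past the point, i.e. the chosen element.
    auto it = std::upper_bound(prefix.begin(), prefix.end(), point);
    return (int)(it - prefix.begin());
}
```

For weights {1, 1, 2} the prefix array is {1, 2, 4}; a point of 2.5 lands at index 2 (element C).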

If we want to make it O(1), we may need to add some constraints. As we’ve mentioned many times, the tradeoff between time and space complexity provides a lot of hints for optimization. Obviously, we’d like to speed things up, so we can think about how to use more memory here.

The most time-consuming part is step 3, and we need an O(1) way to find the corresponding area. Suppose we preprocess all the elements once, keeping an array **M** where **M[i]** stores the element corresponding to point **i**. This allows us to return the area immediately from the array.

In essence, this approach is like storing the result for every single input in a hashmap, and it has space complexity O(W), where W is the sum of all the weights.
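Assuming integer weights (the constraint mentioned above), a sketch of the table construction might look like this; the `buildTable` name is mine:

```cpp
#include <utility>
#include <vector>

// Sketch of the O(1)-lookup table, assuming integer weights. M[i] stores the
// element that owns point i, so a random integer in [0, W) indexes directly.
std::vector<char> buildTable(const std::vector<std::pair<char, int>>& items) {
    std::vector<char> M;
    for (const auto& p : items)
        for (int i = 0; i < p.second; i++)  // repeat each element weight times
            M.push_back(p.first);
    return M;                               // size is W, the sum of weights
}
```

Afterwards, `M[rand() % M.size()]` returns an element in O(1). For {A: 1, B: 1, C: 2} the table is {A, B, C, C} — exactly the duplicated “bag” from the naive solution, so the same caveats about non-integer weights and memory apply.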

## Tips for coding

It should be very straightforward to finish the code, and I would suggest everyone take this as a practice exercise. Here are a couple of tips for this question:

- Be careful about the edge cases. When comparing the current sum to the random number, should you use “>”, “>=”, or does it make no difference?
- If you can define a class, it’s better to put the preprocessing code that calculates the sum of all the weights in the constructor.

An extra question: how do you test your function? You can leave your comment here.
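One plausible answer, sketched below: draw a large number of samples and check that each observed frequency is close to its expected probability. The `sample` function here is a hypothetical stand-in linear-scan implementation; substitute whichever version you wrote.

```cpp
#include <cmath>
#include <cstdlib>
#include <map>
#include <utility>
#include <vector>

// Stand-in weighted sampler (linear scan over integer weights).
char sample(const std::vector<std::pair<char, int>>& items) {
    int W = 0;
    for (const auto& p : items) W += p.second;
    int r = std::rand() % W;                 // random point in [0, W)
    for (const auto& p : items) {
        if (r < p.second) return p.first;    // point falls in this segment
        r -= p.second;
    }
    return items.back().first;
}

// Statistical check: observed frequencies should match the weights.
bool frequenciesLookRight(int trials) {
    std::vector<std::pair<char, int>> items = {{'A', 1}, {'B', 1}, {'C', 2}};
    std::map<char, int> counts;
    for (int t = 0; t < trials; t++) counts[sample(items)]++;
    // Expect roughly 25% / 25% / 50%; allow a couple of points of noise.
    auto near = [&](char c, double p) {
        return std::abs(counts[c] / (double)trials - p) < 0.02;
    };
    return near('A', 0.25) && near('B', 0.25) && near('C', 0.50);
}
```

With 100,000 trials, random noise in each frequency is well under the 2% tolerance, so a correct sampler passes reliably while a broken one (e.g. uniform 1/3 each) fails.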

## Summary

A surprising fact is that most interview questions are easier than people expect. However, what’s more surprising is that so many people fail these “simple” questions.

The weighted random numbers question is a great example. No one thinks it’s hard, but few can actually solve it perfectly. In fact, this is the type of question people should spend the most time on.

By the way, if you want more guidance from experienced interviewers, you can check out Gainlo, which allows you to have mock interviews with engineers from Google, Facebook, etc.

The post is written by **Gainlo** - a platform that allows you to have mock interviews with employees from Google, Amazon, etc.


Simple snippet of code using cumulative frequency calculation:

```cpp
#include <cstdlib>
#include <iostream>
#include <map>
#include <string>
using namespace std;

void weightedrandom(const map<string, int>& data, map<string, int>& cfdata)
{
    // Build cumulative frequencies: cfdata[k] = sum of weights up to k
    int cumfreq = 0;
    for (auto it = data.begin(); it != data.end(); it++)
    {
        cumfreq = it->second + cumfreq;
        cfdata[it->first] = cumfreq;
    }

    int range = cumfreq;
    int randomv = rand() % range;  // random point in [0, range)

    // Find the first element whose cumulative frequency exceeds the point
    for (auto it2 = cfdata.begin(); it2 != cfdata.end(); it2++)
    {
        if (randomv < it2->second)
        {
            cout << it2->first << "\n";
            break;
        }
    }
}
```

```java
final int randomNum = myRandom.nextInt(100);
if (randomNum > 50) { return 3; }
else if (randomNum > 20) { return 2; }
else { return 1; }
```

Problems with your first solution:

“Two problems with the above solution:

- This won’t work when the weight is not an integer.
- If the weight is large, it uses too much memory, as it stores the same element many times.”

And then you arrived at the third solution:

“In essence, this approach is like storing the result for every single input in a hashmap, and it has space complexity O(W), where W is the sum of all the weights.”

I’m afraid your final solution has the same problems as the first one, doesn’t it?