northern-light-984001_640

Find The Longest Substring With K Unique Characters

If it’s your first time to see this series of posts, I’d like to emphasize again that the whole point of the article is not providing something like standard answers, but tell you how to analyze a coding interview question step by step. At the end of each post, I’ll summarize some common techniques that can be reused in other questions.

By reading this post, you can have a clear idea about how I would approach this problem from the beginning. I believe that solutions don’t come from nowhere and I would like to uncover all these hidden patterns that can help you come up with the right approach.

Question

Find the longest substring with K unique characters.

Take string “aabbccdd” as an example.
if K is 1, the longest substring can be “aa”.
If K is 2, the longest substring can be “aabb”.
If K is 3, the longest substring can be “aabbcc”.

This question was asked by Facebook a month ago. It seems that string related problems are quite popular recently among companies like Google, Amazon etc..

Brute force

If you want to come up with a naive solution, brute force should be definitely the number one thing in your head. In this example, it should be extremely straightforward for you to get this approach – Iterate all the substrings and then check if any of them contains K unique characters. After the loop, you can return the longest one.

Let’s say the string has length N, the substring iteration has time complexity of O(N^2). To check if substring contains K unique characters, you have to go over the substring once, which is O(N). Therefore, the overall time complexity would be O(N^3), which is quite slow as expected.

Time complexity analysis

Let’s talk about some high-level strategies first. If we want to make it faster than O(N^3), it should be something like O(N^2), O(NlogN), O(N) or even O(1).

Since we have to iterate the string at least once, the lower bound should be O(N). So O(1) or O(logN) can be ignored. Also, the problem seems unrelated to sorting or search (if you sort the array, the substring order is not kept), so it’s unlikely to have logN. Thus O(NlogN) is out. As a result, we can focus on O(N^2) and O(N).

Optimization

If you have read our previous post – If a String Contains an Anagram of Another String, you should realize how similar these two questions are.

Similarly, you can take use of sliding window. Use a start pointer and an end pointer to represent a substring in a sliding window, if the current substring has K unique characters or less, it means the string can potentially be longer and you just need to move forward the end pointer by one character.

If after moving forward the end pointer the current substring exceeds K unique characters, it indicates that the previous substring is a potential candidate that has K unique characters. In this case, you just need to move forward the start pointer until the current substring has K or less unique characters.

Remember that the reason sliding window works here (and other places as well) is that you never need to move the end pointer back, which makes the time complexity O(N).

Along the way, you can use a hash map to keep track of the character frequency of each substring and a global counter to track current number of unique characters. When moving the sliding window, adjust the hash map accordingly and once new key is inserted or an old one gets cleared, increase/decrease the global counter.

Takeaways

It’s worth to summarize several techniques used in this coding interview question. Also, i highly recommend you compare this question with If a String Contains an Anagram of Another String.

  • To come up with a naive solution, brute force can be the first thing in you head. It works a starting point and you can keep optimizing based on that.
  • You can get a lower bound of the time complexity, which sometimes can give you ideas of what to optimize. In this question, apparently iteration over all substrings is the bottleneck.
  • We use sliding window once again. If the end pointer don’t need to go back, this can make the iteration time to O(N).
  • Remember that the extra hash map only need 256 slots (assuming all characters are ASCII), so the space complexity is still O(1).

The post is written by Gainlo - a platform that allows you to have mock interviews with employees from Google, Amazon etc..

I'd like to learn more

Share on Facebook8Tweet about this on TwitterShare on LinkedIn10Share on Reddit10

9 thoughts on “Find The Longest Substring With K Unique Characters

  1. >Take string “aabbccdd” as an example.
    >if K is 1, the longest substring can be “a”.

    Shouldn’t it be “the longest substring can be ‘aa'” ?

  2. I don’t think you can do better than O(K*N). It’s true that the end pointer moves forward at every step of the algorithm, but you’ve swept the “adjusting the start pointer” part under the rug, which is not necessarily obvious. You can do it by lookback from the end pointer every time you need to, but that lookback can be over approximately N characters, approximately K times, such as in “abcccccccccccccccccccba” with K = 2, as well as over approximately K characters, but approximately N times, such as in “abcabcabcabcabcabc” with K = 2.

    You could also do it by keeping track of the indices of the first occurrence of any letter in the current window immediately following any other letter (such as in a nested hashmap with K elements at the first level and (K*K-1)/2 at the second level), so updating the start pointer would be O(1) whenever a new letter is seen, but that would require up to time K to update the hashmap each time the end pointer moves, leading again to an O(K*N) time complexity.

    1. Thanks for your comment.

      If you keep a counter of how many keys are in the hash map, you can actually achieve O(N). Every time you move forward the window, you only need to update the hash map for the start and end character. If a new key is added or an old key is removed, just update the counter.

      Also, if we assume K <= 256, O(KN) is no different from O(N).

  3. 1. In this hashmap, is the key Substring and value the #of unique characters? or the other way around?
    2. When do we insert the new key? Is it for every substring?
    3. When does the old key cleared?

    1. 1. In the hashmap, the key is each character in the window, the value is its frequency. For instance, if the substring in the window is “abccd”, then the hashmap is {a;1, b:1, c:2, d:1}.

      2. Whenever we move the “sliding window”, we need to update the hashmap to match the current substring. e.g. for [abccd]d, if we move forward to a[bccdd], then the hashmap should be {b:1, c:2, d:2}.

      3. The old key gets cleared while moving the window (same as question 2).

  4. Every time we are just reducing the window till the first character in window gets completely removed.So if we keep track of the last index of the first character cant we just move the window forward after that index ????

  5. Why can’t we achieve O(NlogN). Here is what I have tried, could you please check If I am doing something wrong.
    1. calculate frequency of unique characters at each index of input string in a array (say unique[n])
    obviously this array is Non decreasing sequence of numbers.
    2. Now starting from last Index (say j) of input string do the following
    a. Do a binary search for from index 0 to j for a match whose difference of unique characters are equal to M.
    b. If for some index k the difference in unique characters are same, Continue search for first occurrence of such number in unique array. e.g 1 2 3 3 3 4 and if value at index k is 3 then we should use binary search to find first occurrence of 3 and return substring.

    I think this algorithm time complexity is O(N logN).
    code:
    package strings;

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class LongestSubstringWithKCharacters {
    public static void main(String[] argv ) throws Exception {
    BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
    String s = br.readLine();
    int m = Integer.parseInt(br.readLine());
    char[] sa = s.toCharArray();
    int n = sa.length;
    int[] unique = new int[n];
    unique[0] = 1;
    int[] ha = new int[26];
    ha[sa[0]-‘a’] = 1;
    for(int i = 1; i =0; j– ) {
    l = 0;
    r = j;
    while( l < r ) {
    int mid = (l+r)/2;
    if(unique[j]-(unique[mid]-1) == m ) {
    int l1 = 0, r1= mid;
    while( l1 <= r1 ) {
    int mid1 = (l1+r1)/2;
    if( mid1 == 0 && unique[mid1] == unique[mid]) {
    System.out.println(s.substring(mid1, j+1));
    return;
    }
    if(unique[mid1] == unique[mid] && unique[mid1-1] < unique[mid]) {
    System.out.println(s.substring(mid1, j+1));
    return;
    } else if( unique[mid1] m ) {
    l = m+1;
    } else {
    r = m-1;
    }
    }
    }
    System.out.println(“not found”);
    }
    }

Leave a Reply

Your email address will not be published. Required fields are marked *