Split a String Into the Max Number of Unique Substrings — Algorithms&Visualizations

In this article, we are going to be discussing an algorithm that lends itself to visualization. We will get started from a concrete problem, a code challenge that I’ve recently got to solve, trying to figure out a general approach that can be reused in many other circumstances.

The Problem
This is a medium difficulty LeetCode problem called “Split a String Into the Max Number of Unique Substrings”. The description is the following:

Given a string s, return the maximum number of unique substrings that the given string can be split into.

You can split string s into any list of non-empty substrings, where the concatenation of the substrings forms the original string. However, you must split the substrings such that all of them are unique.

A substring is a contiguous sequence of characters within a string.

1 <= s.length <= 16
contains only lower case English letters

s = “ababccc"
The maximum number of unique substrings forming the original string is 5:
[“a”, “b”, “ab”, “c”, “cc”].
Note that not all the possible splits are valid. [“a”, ”b”, ”a”, ”bc”, “cc”] is not valid, because “a” is not unique.

The approach
The approach that I am going to present might not be the best approach to solve the problem, and most probably you can find more efficient solutions in the comment section on LeetCode. However, the goal of this article is not just solving this specific challenge, in fact providing you with a technique that will come in handy for future situations as well. Let’s check this out.

First off, I want to introduce the concept of Stars and Bars. Quoting Wikipedia, it is a graphical aid for deriving certain combinatorial theorems. I am not going to be wasting other words, if it is a graphical aid, let’s jump into the visualization:

Given a set of elements, how many combinations of splitting are there?
In other words, how many ways
of putting bars between stars do we have?

It’s worth mentioning that for N stars, there are at most N-1 bars (the image shows how a set of 7 stars can be totally split by 6 bars).

Does all of this make any sense to you? If not yet, let me try again:

Here it is, the exact representation of the original problem, by tapping into the stars and bars visualization.

We have a representation of the problem, but as a software developer, how can I represent the bars and their status (up between two stars or down)?
There are surely many ways to do it, but there is one that seems extremely natural to me. Using the binary representation. Stay with me, I promise that everything will make sense.

Look at this picture. Each bar can have only two possible statuses. Up or down.

So, we can represent it using a bit for each: 1 if up, 0 otherwise.

If you don’t like the bits, probably you have been wondering why not using an array of booleans instead. This is surely a good point! Performance aside, if you went for an array of booleans, you would miss one of the coolest capabilities of the bit representation. Let me show you:

I am sure you have immediately caught on to what I was talking about!

All the possible combinations are simply represented by all the possible numbers whose representation goes from:
[000000] to [111111]
where the first one means no bars, and the last one means all the elements are separated.

Even if you were not a combinatorial mathematics guru, you should have got a handle on this problem by now.

It’s time to put all the pieces together to get to a solution to the original problem. We will proceed following these two points:

  1. Find the number whose binary representation involves enough bits in order to represent all the possible bars.
  2. Not all of the combinations are correct, some of them don’t represent a solution. We will filter them out, keeping only the solution with the maximum number of bars eventually.

Let’s get started figuring the first point out:

As the image displays, in order to represent N bars, we need at least N bits.

In this specific case, in order to represent 6 bars, we need (at least) 6 bits.

As you probably already know, there are different ways to represent a number. short int, int, long int, long, long long int, unsigned int…
Which type addresses our need the best? Checking the constraints of the original problem, we learned that the input string is 16 chars at most. In other words, for 16 stars, we need at least 15 bars (bits).
Taking stock of the constraints, let’s pick an INT, because it is most probably represented by 32 bits, unless your data model is LP32, in that case, it will be represented by 16 bits. We were looking for a data type of at least 15 bits, so a 16 bit INT should suffice, right? Absolutely yes, but it might lead to a common mistake. Let me explain why:

Let’s take back the first example, where the input string wasababccc”.
As already mentioned we need 6 bars, meaning at least 6 bits.
A 16-bit represented INT is perfectly fine here, because looping from
[0000 0000 0000 0000] to [000 000 0011 1111] will give us all the possible combinations of bars.
Ok, but from a developer point of view, what does looping from [0000 0000 0000 0000] to [0000 0000 0011 1111] actually mean? It means looping from 0 to 63.
Ok, and how can I deduce 63 from “ababccc”? Simple:
pow(2,string.size()-1)-1 = pow(2,6) -1 = 64–1= 63

Ok, it looks like a successful approach, but can you spot the risk of having only 16 bits?
If you don’t, consider as input string “abcdefghilmnopqr”. The length of the string is 16, which means we need 15 bars, or in other words 15 bits.
INT type has at least 16 bits, which are enough to represent all the bars.
It should e clear by now, that we will be checking status by status from
[0000 0000 0000 0000] to [1111 1111 1111 1111]. Meaning from 0 to pow(2, s.size()-1) -1 = pow(2, 15) -1 = RUNTIME ERROR. Wait, what?

Not all the types are created equals! Some of them are supposed to represent positive and negative quantities, others only positive ones. INT can represent negative numbers, thus its rightmost bit is meant to represent the sign.

In other words, 16 bits can be used to represent a range of only positive numbers:
from [0000 0000 0000 0000] (0) to [1111 1111 1111 1111] (65535)
or a range of positive and negative numbers:
from [1000 0000 0000 0000] (-32768) to [0111 1111 1111 1111] (32767)

(check how the negative numbers are stored in memory)

To sum up, in order to compute pow(2, 15) we need to have at least 16 bits to represent a positive number. A 16-bit represented INT has 16 bits, but only 15 of those can be used to represent a positive number.

Now that I hopefully got the point across, you might be wondering why I didn’t pick a LONG INT, which is surely represented by 32 bits, whatever the data model is. You’re perfectly right! But also an UNSIGNED SHORT INT is fine (16 bits all dedicated to representing positive numbers).

The Code

After reading the article, the code should be rather straightforward. I know that bit operations can be tricky sometimes, thus I’ve decided to add long-winded comments to help the understanding.

That’s all! I hope you enjoyed this article and hopefully learned something valuable. Have a nice day!

“Learning without thought is pointless; thought without learning is dangerous” — Confucius

Software Engineer