Thursday, July 20, 2023

1044. Longest Duplicate Substring | Hard level | Binary search + Rabin-Karp hash

 Given a string s, consider all duplicated substrings: (contiguous) substrings of s that occur 2 or more times. The occurrences may overlap.

Return any duplicated substring that has the longest possible length. If s does not have a duplicated substring, the answer is "".

Approach 1: Binary Search + Rabin-Karp

String Searching Algorithms

The problem is a follow-up of Longest Repeating Substring,
and is typically used to check whether you're comfortable with
string searching algorithms.

The best algorithms have a linear execution time on average.
The most popular ones are
Aho-Corasick,
KMP and
Rabin-Karp:
Aho-Corasick is used by fgrep,
KMP is used for Chinese string searching,
and Rabin-Karp is used for plagiarism detection and in bioinformatics to look for similarities in two
or more proteins.

The first two are optimized for a single pattern search,
and Rabin-Karp for a multiple pattern search,
which is exactly the case here.

Split into two subtasks

Here we have a "two in one" problem:

  1. Perform a search by a substring length in the interval from 1 to N.

  2. Check if there is a duplicate substring of a given length L.


The linear time implementation of this idea is a bit
tricky because of two technical problems:

  1. How to implement a string slice in a constant time?

  2. Hashset memory consumption could be huge for very long strings.
    One could keep the string hash instead of the string itself,
    but hash generation costs O(L) for a string of length L,
    and the complexity of the algorithm would be O((N - L) L):
    N - L for the slices and L for each hash generation.
    Therefore, we should think about how to generate a hash in a constant time.

Let's now address these problems.

String slice in a constant time

That's a very language-dependent problem. For the moment,
neither Java nor Python offers a straightforward solution,
and to move a sliding window in constant time
one has to convert the string to another data structure.

Python already provides memoryview,
which is known to be surprisingly slow,
and there is a lot of discussion about strview.

The simplest solution for both Java and Python is to convert the string to an integer array of ASCII values.
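
A minimal Python sketch of that conversion (variable names are illustrative):

s = "banana"
# convert the string to small integers so the sliding window later
# only needs one value in and one value out at each step
nums = [ord(c) - ord('a') for c in s]
print(nums)  # [1, 0, 13, 0, 13, 0]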

Rolling hash: hash generation in a constant time

To generate the hash of an array of length L, one needs O(L) time.

How do we get constant-time hash generation? Take advantage of the
slice: only one integer comes in, and only one goes out.

That's the idea of rolling hash.
Here we'll implement the simplest one, polynomial rolling hash.
Beware that a polynomial rolling hash is NOT the Rabin fingerprint.

Since one deals here with lowercase English letters, all values
in the integer array are between 0 and 25:

arr[i] = (int)S.charAt(i) - (int)'a'

So one could consider the string abcd -> [0, 1, 2, 3] as a number
in a numeral system with base 26.
Hence abcd -> [0, 1, 2, 3] could be hashed as

h_0 = 0 \times 26^3 + 1 \times 26^2 + 2 \times 26^1 + 3 \times 26^0

Let's write the same formula in a generalized way, where c_i
is an integer array element and a = 26 is the system base.

h_0 = c_0 a^{L - 1} + c_1 a^{L - 2} + ... + c_i a^{L - 1 - i} + ... + c_{L - 2} a^1 + c_{L - 1} a^0

h_0 = \sum_{i = 0}^{L - 1}{c_i a^{L - 1 - i}}

Now let's consider the slice abcd -> bcde. For int arrays that means
[0, 1, 2, 3] -> [1, 2, 3, 4]: remove the number 0 and add the number 4.

h_1 = (h_0 - 0 \times 26^3) \times 26 + 4 \times 26^0

Let's look at what changed piece by piece. First, we subtracted 0 \times 26^3 from h_0; this removed the contribution of the first element in the array from the hash. Then we multiplied the remaining hash value by 26, which increased the power of the base for each of the elements remaining in the array (e.g., (2 \times 26^1) \times 26 = 2 \times 26^2). Finally, we added the contribution of the new element (e) to the hash. This results in:

h_1 = 1 \times 26^3 + 2 \times 26^2 + 3 \times 26^1 + 4 \times 26^0

Thus after applying a constant amount of operations to the hash for abcd, we have obtained the hash for the next substring, bcde.

In general form:

h_1 = (h_0 a - c_0 a^L) + c_L
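
As a small Python sketch of this update on the abcd -> bcde example (overflow is ignored here and handled below; names are illustrative):

a = 26
arr = [0, 1, 2, 3, 4]   # "abcde" as 0-based letter codes
L = 4

# hash of the first window "abcd": sum of c_i * a^(L - 1 - i)
h = 0
for i in range(L):
    h = h * a + arr[i]

# roll the window one step to "bcde":
# remove arr[0] * a^(L - 1), shift the rest up by one power, append arr[L]
h = (h - arr[0] * a ** (L - 1)) * a + arr[L]

# sanity check against a hash of "bcde" computed from scratch
check = 0
for i in range(1, L + 1):
    check = check * a + arr[i]
assert h == check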

Now hash regeneration is perfect and fits in a constant time.
There is one more issue to address: the possible overflow problem.

How to avoid overflow:

a^L could be a large number, and hence
the idea is to set limits to avoid overflow.
To set limits means to bound the hash by a modulus:
instead of using the hash itself, we will use h % modulus.

We should select a modulus that is large enough for our purpose, but how
large is that? You can read more about the topic here.

We must use caution when using a rolling hash to assess the equality of two substrings. The modulus can be thought of as the number of bins that we will use to store the starting index of seen substrings. So there is a higher probability of having two different substrings being stored in the same bin (h % modulus) when the modulus is small.

When two different strings have the same hash value, we call this a collision. In an ideal setting, where every test case is known, this issue could be resolved by adjusting the modulus to avoid collisions. However, in a real-world setting, whenever two substrings have the same hash, we must verify that the substrings are truly equal. This leads to a Rabin-Karp time complexity of O(L(N - L)) in the worst case, when many substring hashes collide.

Fortunately, we can reduce the probability of collisions by selecting a good value for our modulus.

Generally speaking, a good modulus will have two traits:

  1. The modulus is not too big and not too small. That is to say, it is large, which helps reduce the probability of collisions, but it is still small enough to fit in a 32-bit integer.
  2. It is prime. This helps increase the uniformity of our hash values after taking h % modulus, which in turn decreases the probability of collisions occurring. If you would like to know more about why this is, you can read about it here.

Here, we will use our favorite modulus, 10^9 + 7, which satisfies both of these conditions.
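
A rough Python sketch of the modular rolling update (Python's arbitrary-precision integers make overflow a non-issue, but the same formula carries over to Java with longs; names are illustrative):

MOD = 10 ** 9 + 7
a = 26

s = "banana"
L = 3
arr = [ord(c) - ord('a') for c in s]

# precompute a^L % MOD once per length L
aL = pow(a, L, MOD)

# hash of the first window arr[0:L]
h = 0
for i in range(L):
    h = (h * a + arr[i]) % MOD

# rolling update: drop arr[start - 1], append arr[start + L - 1]
# (Python's % always returns a non-negative result here)
for start in range(1, len(s) - L + 1):
    h = (h * a - arr[start - 1] * aL + arr[start + L - 1]) % MOD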

One last note: there is another overflow issue here that is purely Java-related.
While in Python the hash regeneration works fine as-is,
in Java the same computation is better rewritten to avoid
long overflow. Check here
the nice explanation by @hqt.


Binary search algorithm

  • Use binary search over the substring length to check lengths from 1 to N:
    left = 1, right = N. While left != right:

    • L = left + (right - left) / 2

    • If search(L) != -1 (i.e. there is a duplicate substring), left = L + 1

    • Otherwise (no duplicate substring), right = L.

  • Return a duplicate substring of length left - 1, or an empty string if
    there is no such substring (see the sketch below).
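
A minimal Python sketch of this driver, assuming a helper search(L, n, arr) that returns the start index of a duplicate substring of length L, or -1 if there is none (a sketch of that helper follows the Rabin-Karp steps below):

def longestDupSubstring(s: str) -> str:
    n = len(s)
    arr = [ord(c) - ord('a') for c in s]

    left, right = 1, n
    start = -1                      # start index of the best duplicate found
    while left != right:
        L = left + (right - left) // 2
        idx = search(L, n, arr)     # assumed helper: -1 if no duplicate of length L
        if idx != -1:
            left = L + 1
            start = idx
        else:
            right = L

    # left - 1 is the longest length that still had a duplicate (0 if none)
    return s[start:start + left - 1] if start != -1 else ""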

Rabin-Karp algorithm

  • Compute the hash of the substring consisting of the first L characters of string S.

  • We will initialize the hash map of already seen substrings with this hash value as a key and a list containing the starting index of the substring (0) as the value.

    The reason we store the first index of the substring is so that if we see this hash value again, we can compare the current substring to each substring that has the same hash value to see if the two strings actually match or if this is a hash collision.

    Each comparison of two strings costs O(L) time. If we designed a very poor hash function or picked a very weak modulus value (like 1), we could potentially spend O(L \cdot (N - L)^2) time comparing each substring of length L to all previous substrings of length L on each call to search.

    Fortunately, the hash function we are using guarantees that there will not be any collisions between hash values that are less than MOD (before taking the modulus). Furthermore, selecting a large, prime modulus helps create a more uniform distribution of the hash values that are greater than MOD. So the probability of two hash values colliding is very small, and on average, we expect the number of collisions to be negligible. Therefore, we can expect the search function to take O(N) time on average.

  • Iterate over the start position of each substring in S from 1 to N - L. Note we already initialized our hash map with the substring starting at index zero.

    • Compute the rolling hash based on the previous hash value.

    • If the hash is already in the hash map, compare the current substring with each stored substring that has the same hash value; if any of them truly matches, we have found a duplicate substring of length L, so return its start index. Otherwise, add the current start index to the hash map under this hash value.

  • If we finish iterating without finding a match, there is no duplicate substring of length L; return -1.
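
A Python sketch of search along these lines, with a = 26 and MOD = 10^9 + 7 (an illustrative implementation, not the only possible one; it expects arr as the integer array and n = len(arr)):

from collections import defaultdict

MOD = 10 ** 9 + 7
BASE = 26

def search(L, n, arr):
    # return the start index of a duplicate substring of length L, or -1
    aL = pow(BASE, L, MOD)          # BASE^L % MOD, used to drop the left character

    # hash of the first window arr[0:L]
    h = 0
    for i in range(L):
        h = (h * BASE + arr[i]) % MOD
    seen = defaultdict(list)
    seen[h].append(0)

    for start in range(1, n - L + 1):
        # roll: drop arr[start - 1], append arr[start + L - 1]
        h = (h * BASE - arr[start - 1] * aL + arr[start + L - 1]) % MOD
        # on a hash hit, verify the substrings really match to rule out a collision
        for prev in seen[h]:
            if arr[prev:prev + L] == arr[start:start + L]:
                return start
        seen[h].append(start)
    return -1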
