Saturday, March 5, 2016

HackerRank: Bear And Steady Gene - binary search algorithm (V)

March 5, 2016

Problem statement:

https://www.hackerrank.com/contests/hourrank-6/challenges/bear-and-steady-gene  

A gene is represented as a string of length  (where  is divisible by ), composed of the letters , and . It is considered to be steady if each of the four letters occurs exactly  times. For example,  and are both steady genes.

Study code using binary search:

https://gist.github.com/jianminchen/d01faa03ca9b06696db3

https://gist.github.com/jianminchen/395eb9e76fe19cc9338f

Comment:
Binary search is better than linear search using two pointers.

Brute force O(n^2) -> linear search O(n) -> Binary search O(logn)

Let us walk through this binary search algorithm using test case GAAATAAA.
int low = 0;
int high = n; // n = 8
int need = 8/4 = 2
int[] cnt = {6, 0, 1, 1}   // "ACGT"
mid = 4,
so, copy cnt to tmp[],
and then, first half, take it away, see left half meet the standard:
tmp[i]<=2,  i = 0, ..., 3
because first half contains 3 'A', and then, tmp[0] = 3,

deal with second half, go through a loop:
get confused, let us debug through the code.

debug through the code, still confuse! 
Java code:
https://gist.github.com/jianminchen/d01faa03ca9b06696db3
C# code
https://gist.github.com/jianminchen/76dffc51880a80279f25

April 28, 2016
Come back to the problem on binary search solution, figure out the design:
First, go through two test cases: "GTTCCAAA" and "GAAATTCC"

1. "GTTCCAAA",

Binary search is doing this way:
Biggest value of search string length is 8.
First, divide range from 0 to 8 into half, 4
Find first string with length 4,
one is "GTTC", one is "CAAA",
and then, remove count of first half, rest string (two parts, seperated) "CAAA", since A is repeated 3 times, not in the range;
instead of stopping, wrongly conclude that 4 is not high value, go through all the possible substring with length 4, by sliding window of search string: "GTTC" forward from left to right, but keep the same window size
"GTTC" -> "TTCC"->"TCCA"->stop here, since "TCCA" is removed from counting of GENES="ACGT", fit into the requirement.

Next, low=0, high=4, mid = 2,

Because both test cases with one 'A' to replace, Julia figured out through the debugging:

The slide window of fixed length technique,
How to slide?
Which direction to slide?

Kind of clever in design.
Time complexity analysis: length search using binary search, n - length of string, O(log n) times; each search for length m, go over each string once, since using calculated counting array, only do add one/ remove one at a time, one char only does the work once. So, it is O(n) on this.

Total time complexity: O(n logn)
Conclusion: this binary search is not better than linear search is previous blog (IV).



No comments:

Post a Comment