Sunday, March 6, 2016

HackerRank: Bear And Steady Gene - Binary Search (II)

March 6, 2016

Problem statement:

https://www.hackerrank.com/contests/hourrank-6/challenges/bear-and-steady-gene  

A gene is represented as a string of length  (where  is divisible by ), composed of the letters , and . It is considered to be steady if each of the four letters occurs exactly  times. For example,  and are both steady genes.

Study code using binary search:

cannot figure out the design - confused!
https://gist.github.com/jianminchen/d01faa03ca9b06696db3

This one is easy to follow
https://gist.github.com/jianminchen/395eb9e76fe19cc9338f

Julia's C# practice code:
https://gist.github.com/jianminchen/af1b3c064c523444463f

Algorithm talk:
Work on test case GAAATAAA, let me explain the idea using binary search.

            A1 = Math.Max(0, A1 - n / 4);  // test case: GAAATAAA, A1 = 4 
            C1 = Math.Max(0, C1 - n / 4);  // C1 = 0 
            G1 = Math.Max(0, G1 - n / 4);  // G1 = 0
            T1 = Math.Max(0, T1 - n / 4);   // T1 = 0

            if (A1 == 0 && C1 == 0 && G1 == 0 && T1 == 0)
            {
                return 0;   //
            }

            int ans = n;   // default value is maximum value - string length
            for (int i = 0; i < n; i++)  // go through a loop, start from 0 to n-1. 
            {
                if (A[n] - A[i] < A1 || C[n] - C[i] < C1 || G[n] - G[i] < G1 || T[n] - T[i] < T1) // substring position from i to n, check the count of A, C, G, T
                    break;

                int l = i + 1, r = n;  // left, right two pointer
                while (l < r)
                {
                    int mid = (l + r - 1) / 2;
                    if (A[mid] - A[i] < A1 || C[mid] - C[i] < C1 || G[mid] - G[i] < G1 || T[mid] - T[i] < T1) // not enough
                        l = mid + 1;   // set left pointer to mid + 1
                    else
                        r = mid;      
                }

                ans = Math.Min(ans, l - i);   // substring is from i to l,
            }

            return ans;

In other words, GAAATAAA,

i = 0, l = 1, r = 8,
find ans = Math.Min(ans, l-i)   <-  ans = 6, l = 6

i = 1,
find ans = Math.Min(ans, l-i)  <- ans = 5, l = 6, i=1

i = 2,
find ans = Math.Min(ans, l-i)  <- ans = 5, l = 7, i=2

i =3; 
find ans = Math.Min(ans, l-i)  <- ans = 5, l = 8, i=3

i=4
break the for loop
if (A[n] - A[i] < A1 || C[n] - C[i] < C1 || G[n] - G[i] < G1 || T[n] - T[i] < T1)

                    break;
since A[n] - A[i]  =  3 < A1 = 4, TAAA, we need the substring at least 4 of A

comment:
This algorithm uses extra space for ACGT count for each i from 0 to n-1, total space is less than 1MB. 
int[] A = new int[500500]; int[] C = new int[500500]; int[] G = new int[500500]; int[] T = new int[500500];
4 bytes for int, then 500K x 4 x 4 = 8M bytes

No comments:

Post a Comment