[Codility] GenomicRangeQuery

Denken_Y 2019. 6. 2. 10:29

A DNA sequence can be represented as a string consisting of the letters A, C, G and T, which correspond to the types of successive nucleotides in the sequence. Each nucleotide has an impact factor, which is an integer. Nucleotides of types A, C, G and T have impact factors of 1, 2, 3 and 4, respectively. You are going to answer several queries of the form: What is the minimal impact factor of nucleotides contained in a particular part of the given DNA sequence?

The DNA sequence is given as a non-empty string S = S[0]S[1]...S[N-1] consisting of N characters. There are M queries, which are given in non-empty arrays P and Q, each consisting of M integers. The K-th query (0 ≤ K < M) requires you to find the minimal impact factor of nucleotides contained in the DNA sequence between positions P[K] and Q[K] (inclusive).

For example, consider string S = CAGCCTA and arrays P, Q such that:

    P[0] = 2    Q[0] = 4
    P[1] = 5    Q[1] = 5
    P[2] = 0    Q[2] = 6

The answers to these M = 3 queries are as follows:

The part of the DNA between positions 2 and 4 contains nucleotides G and C (twice), whose impact factors are 3 and 2 respectively, so the answer is 2.

The part between positions 5 and 5 contains a single nucleotide T, whose impact factor is 4, so the answer is 4.

The part between positions 0 and 6 (the whole string) contains all nucleotides, in particular nucleotide A whose impact factor is 1, so the answer is 1.

The first algorithm that comes to mind is to scan the range of each of the M queries and find its minimal impact factor directly. That approach has O(N*M) time complexity, however, so we need to think of a more efficient algorithm.
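For reference, the naive approach described above can be sketched as follows. This is only a baseline for comparison, not the solution this post arrives at; the class name BruteForceGenomicRangeQuery and the helper impact are hypothetical names chosen for illustration.

```java
import java.util.Arrays;

// Hypothetical baseline class: answers every query by scanning its range.
public class BruteForceGenomicRangeQuery {

    // Maps a nucleotide letter to its impact factor (A=1, C=2, G=3, T=4).
    static int impact(char nucleotide) {
        switch (nucleotide) {
            case 'A': return 1;
            case 'C': return 2;
            case 'G': return 3;
            default:  return 4; // 'T' is the only remaining letter
        }
    }

    // Scans S[P[k]..Q[k]] for each query, so the total cost is O(N * M).
    static int[] solution(String S, int[] P, int[] Q) {
        int[] result = new int[P.length];
        for (int k = 0; k < P.length; k++) {
            int min = 4; // largest possible impact factor
            for (int i = P[k]; i <= Q[k]; i++) {
                min = Math.min(min, impact(S.charAt(i)));
            }
            result[k] = min;
        }
        return result;
    }

    public static void main(String[] args) {
        int[] answers = solution("CAGCCTA", new int[]{2, 5, 0}, new int[]{4, 5, 6});
        System.out.println(Arrays.toString(answers)); // prints [2, 4, 1]
    }
}
```

The inner loop is what makes this slow: a query covering the whole string touches all N characters, and there can be M such queries.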

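So what should we do? A common way to reduce time complexity in algorithm problems is to keep several running counts, and this problem uses them too: one counter array for each of A, C, G and T. Each array has size S.length() + 1, because entry i holds the number of occurrences among the first i characters, and entry 0 stands for the empty prefix. (T is the largest value and the only remaining case, so it does not need a counter.)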

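Build the arrays by incrementing the matching letter's count each time a new character is read, then use them to find the minimum over the range P[K] ~ Q[K] for each of the M queries.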

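Example: CAGCCTA

With the counters A = [0,0,1,1,1,1,1,2], C = [0,1,1,1,2,3,3,3] and G = [0,0,0,1,1,1,1,1], each range can be checked as follows.

A occurs in the range when A[P[K]] != A[Q[K]+1]; more precisely, when the latter value is larger. To find the minimum, check the counters in the order A -> C -> G -> T and put the value that corresponds to the first letter found into the return array.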
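To make the counter check concrete, here is a small sketch that builds the counters for the example string and tests one query against them. The class name PrefixCountDemo and the helpers prefixCounts and occursIn are hypothetical names used only for this illustration.

```java
import java.util.Arrays;

// Hypothetical demo class: builds one prefix counter per letter and checks a range.
public class PrefixCountDemo {

    // counts[i] = number of occurrences of `letter` in S[0..i-1]; counts[0] = 0.
    static int[] prefixCounts(String S, char letter) {
        int[] counts = new int[S.length() + 1];
        for (int i = 0; i < S.length(); i++) {
            counts[i + 1] = counts[i] + (S.charAt(i) == letter ? 1 : 0);
        }
        return counts;
    }

    // True if the letter occurs at least once in S[p..q] (inclusive).
    static boolean occursIn(int[] counts, int p, int q) {
        return counts[q + 1] > counts[p];
    }

    public static void main(String[] args) {
        String S = "CAGCCTA";
        int[] a = prefixCounts(S, 'A');
        int[] c = prefixCounts(S, 'C');
        System.out.println(Arrays.toString(a)); // prints [0, 0, 1, 1, 1, 1, 1, 2]
        System.out.println(Arrays.toString(c)); // prints [0, 1, 1, 1, 2, 3, 3, 3]
        // Query (2, 4) covers "GCC": no A, but C is present, so the answer is 2.
        System.out.println(occursIn(a, 2, 4)); // prints false
        System.out.println(occursIn(c, 2, 4)); // prints true
    }
}
```

Each check is a single subtraction-style comparison, so a query costs O(1) once the counters exist, and the whole problem costs O(N + M).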

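class Solution {
    public int[] solution(String S, int[] P, int[] Q) {

        // Prefix counters: xCnt[i] is the number of X's among the first i characters of S.
        // T needs no counter because it is the only remaining case.
        int[] aCnt = new int[S.length() + 1];
        int[] cCnt = new int[S.length() + 1];
        int[] gCnt = new int[S.length() + 1];

        // min[j] holds the answer to the j-th query.
        int[] min = new int[P.length];

        for (int i = 0; i < S.length(); i++) {

            // Carry the running totals forward, then count the current character.
            aCnt[i + 1] = aCnt[i];
            cCnt[i + 1] = cCnt[i];
            gCnt[i + 1] = gCnt[i];

            char ch = S.charAt(i);
            if (ch == 'A') {
                aCnt[i + 1]++;
            }
            else if (ch == 'C') {
                cCnt[i + 1]++;
            }
            else if (ch == 'G') {
                gCnt[i + 1]++;
            }
        }

        for (int j = 0; j < P.length; j++) {
            int ptmp = P[j];
            int qtmp = Q[j];

            // A letter occurs in S[ptmp..qtmp] if its count changes across the range.
            // Check from the smallest impact factor upward and stop at the first hit.
            if (aCnt[ptmp] != aCnt[qtmp + 1]) {
                min[j] = 1;
            }
            else if (cCnt[ptmp] != cCnt[qtmp + 1]) {
                min[j] = 2;
            }
            else if (gCnt[ptmp] != gCnt[qtmp + 1]) {
                min[j] = 3;
            }
            else {
                // No A, C or G in the range, so it contains only T.
                min[j] = 4;
            }
        }

        return min;
    }
}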

Attribution, NonCommercial, NoDerivatives