Morris' Blog

2012-05-11 19:56:13Morris

[UVA][SA] 11512 - GATTACA

留言 0 收藏 0 推薦 0

GATTACA

The Institute of Bioinformatics and Medicine (IBM) of your country has been studying the DNA sequences of several organisms, including the human one. Before analyzing the DNA of an organism, the investigators must extract the DNA from the cells of the organism and decode it with a process called ``sequencing''.

A technique used to decode a DNA sequence is the ``shotgun sequencing''. This technique is a method applied to decode long DNA strands by cutting randomly many copies of the same strand to generate smaller fragments, which are sequenced reading the DNA bases (A, C, G and T) with a special machine, and re-assembled together using a special algorithm to build the entire sequence.

Normally, a DNA strand has many segments that repeat two or more times over the sequence (these segments are called ``repetitions''). The repetitions are not completely identified by the shotgun method because the re-assembling process is not able to differentiate two identical fragments that are substrings of two distinct repetitions.

The scientists of the institute decoded successfully the DNA sequences of numerous bacterias from the same family, with other method of sequencing (much more expensive than the shotgun process) that avoids the problem of repetitions. The biologists wonder if it was a waste of money the application of the other method because they believe there is not any large repeated fragment in the DNA of the bacterias of the family studied.

The biologists contacted you to write a program that, given a DNA strand, finds the largest substring that is repeated two or more times in the sequence.

Input

The first line of the input contains an integer T specifying the number of test cases ( 1 T 100 ). Each test case consists of a single line of text that represents a DNA sequence S of length n ( 1 n 1000 ).

You can suppose that each sequence S only contains the letters A, C, G and T.

Output

For each sequence in the input, print a single line specifying the largest substring of S that appears two or more times repeated in S , followed by a space, and the number of ocurrences of the substring in S .

If there are two or more substrings of maximal length that are repeated, you must choose the least according to the lexicographic order.

If there is no repetition in S , print ``No repetitions found!''

Sample Input

6
GATTACA
GAGAGAG
GATTACAGATTACA
TGAC
TGTAC
TTGGAACC

Sample Output

A 3
GAGAG 2
GATTACA 2
No repetitions found!
T 2
A 2

suffix array
建造高度數組, 求最高即可

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
struct SuffixArray {
    int sa[1005], h[1005], n;
    int w[1005], ta[1005], tb[1005];
    char str[1005];
    void sort(int *x, int *y, int m) {
        static int i;
        for(i = 0; i < m; i++)
            w[i] = 0;
        for(i = 0; i < n; i++)
            w[x[y[i]]]++;
        for(i = 1; i < m; i++)
            w[i] += w[i-1];
        for(i = n-1; i >= 0; i--)
            sa[--w[x[y[i]]]] = y[i];
    }
    bool cmp(int *r, int a, int b, int l) {
        if(r[a] == r[b]) {
            if(a+l >= n || b+l >= n)
                return false;
            return r[a+l] == r[b+l];
        }
        return false;
    }
    void build_h() {
        int i, j, k;
        for(i = 0; i < n; i++) ta[sa[i]] = i;
        for(i = 0; i < n; i++) {
            if(ta[i] == 0) {
                h[ta[i]] = 0;
                continue;
            }
            if(i == 0 || h[ta[i-1]] <= 1)
                k = 0;
            else
                k = h[ta[i-1]]-1;
            while(str[sa[ta[i]-1]+k] == str[sa[ta[i]]+k])
                k++;
            h[ta[i]] = k;
        }
    }
    void build() {// x: rank, y: second key
        int i, k, m = 128, p;
        int *x = ta, *y = tb, *z;
        n = strlen(str);
        x[n] = 0;
        for(i = 0; i < n; i++)
            x[i] = str[i], y[i] = i;
        sort(x, y, m);
        for(k = 1, p = 1; p < n; k *= 2, m = p) {
            for(p = 0, i = n-k; i < n; i++)
                y[p++] = i;
            for(i = 0; i < n; i++) {
                if(sa[i] >= k) {
                    y[p++] = sa[i]-k;
                }
            }
            sort(x, y, m);
            z = x, x = y, y = z;
            for(i = 1, p = 1, x[sa[0]] = 0; i < n; i++)
                x[sa[i]] = cmp(y, sa[i-1], sa[i], k) ? p-1 : p++;
        }
    }
};
int main() {
    SuffixArray in;
    int t;
    scanf("%d", &t);
    getchar();
    while(t--) {
        gets(in.str);
        in.build();
        in.build_h();
        int ans = 0, anst = 0;
        for(int i = 0; i < in.n; i++)
            ans = ans > in.h[i] ? ans : in.h[i];
        if(ans == 0) {
            puts("No repetitions found!");
        } else {
            for(int i = 0; i < in.n; i++) {
                if(ans == in.h[i]) {
                    for(int j = 0; j < ans; j++)
                        printf("%c", in.str[in.sa[i]+j]);
                    while(i < in.n && ans == in.h[i])
                        anst++, i++;
                    break;
                }
            }
            printf(" %d\n", anst+1);
        }
    }
    return 0;
}

上一篇：[UVA][字串搜尋] 422 - Word-Search Wonder

下一篇：[UVA][SA] 1223 - Editor

Morris' Blog

[UVA][SA] 11512 - GATTACA

GATTACA

Input

Output

Sample Input

Sample Output

你可能感興趣的文章

完美聲音是否反而令人疲勞？

大家聖誕節快樂

電影巨洪2025 대홍수 2025

柬埔寨事件，180º改變（耐用）

反社會