2013-05-07 10:44:58Morris

[UVA][最大平均值問題+二分答案] 1392 - DNA Regions

A DNA sequence or genetic sequence is a succession of letters representing the primary structure of a real or hypothetical DNA molecule or strand, with the capacity to carry information. The possible letters are A, C, G, and T, representing the four nucleotide subunits of a DNA strand: adenine, cytosine, guanine and thymine bases covalently linked to phospho-backbone.

DNA sequences undergo mutations during the evolution of species, which means that some letters are randomly replaced with others. Therefore, the DNA sequences of two closely related species are very similar, and the difference increases as the distance between the species increases. The mutations do not occur with uniform frequency throughout the sequence; typically there are fewer mutations at the biologically important parts, since even a single mutation can be lethal at such a place. On the other hand, if a part of the sequence does not carry any biologically relevant information, then mutations on this part have no effect. It follows that if we compare the DNA sequences of two species and a particular region of the sequence contains fewer than the average number of mutations, then most probably this part of the sequence plays an important biological role. Therefore, it is of crucial importance to identify such regions. More precisely, a conserved region is a consecutive interval of the DNA sequence such that in this region at most p percent of the letters are different in the two sequences. Your task is to write a program that, given two DNA sequences, finds the longest conserved region.

Input 

The input contains several blocks of test cases. Each case begins with a line containing two integers: 1$ le$n$ le$150000 , the length of the genetic sequences and 1$ le$p$ le$99 , the maximum percentage of mutated letters allowed in a conserved region. This is followed by two lines, each containing a DNA sequence of length n . The sequence contains only the letters `A', `C', `G', and `T'.

The input is terminated by a test case with n = 0 .

Output 

For each test case, you have to output a line containing a single integer: the length of the longest conserved region between the two sequences. If there are no conserved regions in the input, then output `No solution.' (without quotes).

Sample Input 

14 25
ACCGGTAACGTGAA
ACTGGATACGTAAA
14 24
ACCGGTAACGTGAA
ACTGGATACGTAAA
8 1
AAAAAAAA
CCCCCCCC
8 33
AAACAAAA
CCCCCCCC 
0 0

Sample Output 

8
7
No solution.
1

題目意思:
給定兩個字串,然後同時挑同一個區間 [l, r],其不同的字元最多占有 p%。

解法:
知道最大平均值問題後,這題就很好下手了。
首先最大平均值問題是要找某個區間平均值最高為多少,而規定其區間長度至少 L。

藉此我們可以利用二分搜尋這個長度 L,然後調用上方的問題,得到其平均值,然後判斷二分。

總效率是 O(nlogn)

另一個解法:
將 C[i] = (A[i] == B[i]), 之後將 C[i] -= p%, 之後找最長連續和大於零。
效率也是 O(nlogn)


#include <stdio.h>
#include <string.h>
#include <algorithm>
using namespace std;
double get_avg(int l, int r, int sum[]) {
return (double)(sum[r]-sum[l])/(r-l);
}
int q[150005], sum[150005];
double globalCut;
int nextL;
double mxAvgProblem(int n, int L, int sum[]) {
double ans = (double)sum[n]/n;
int front, rear, i, j;
front = 0, rear = -1;
for(i = L; i <= n; i++) {
int tmp = i-L;
while(front < rear && get_avg(q[rear], tmp, sum) <= get_avg(q[rear-1], q[rear], sum))
rear--;
q[++rear] = tmp;
while(front < rear && get_avg(q[front], i, sum) <= get_avg(q[front+1], i, sum))
front++;
double tans = get_avg(q[front], i, sum);
if(tans > ans)
ans = tans;
if(ans >= globalCut) {
nextL = i-q[front];
return ans;
}
}
return ans;
}
int main() {
int n, p;
int i, j;
char A[150005], B[150005];
while(scanf("%d %d", &n, &p) == 2) {
if(n == 0) break;
scanf("%s %s", A+1, B+1);
double s = (100-p)/100.0, f;
for(i = 1; i <= n; i++)
sum[i] = sum[i-1] + (A[i] == B[i]);
int l = 0, r = n, m;
int ret = 0;
globalCut = s;
while(l <= r) {
m = (l+r)/2;
f = mxAvgProblem(n, m, sum);
if(f >= s)
l = nextL+1, ret = max(ret, nextL);
else
r = m-1;
}
if(ret)
printf("%d\n", ret);
else
puts("No solution.");
}
return 0;
}