Practical Linear Space Algorithms for Computing String-Edit Distances

Tony Y.T. Chan
The University of Akureyri, Solborg, Akureyri 600, Iceland
[email protected]

Abstract. String-edit operations consist of inserting a symbol, deleting a symbol, and substituting one symbol for another. String-edit distances have been applied to problems of error correction and pattern recognition. In this paper, two practical algorithms for computing the edit distance between two strings are presented. The space complexity of the first is m + n + O(1), where m and n are the lengths of the input strings. The second requires only min(m, n) + O(1).

Keywords: String-edit distance, time and space complexities, algorithms.

1 Introduction

The string-editing problem that we are dealing with here is well known in the fields of pattern recognition [1–4], file comparison [5], spelling correction, and genome study [6]. A lower bound on its time complexity was established in [7]. We are given a finite set of symbols, called the alphabet, and a finite set of edit operations. Each operation is the insertion, deletion, or substitution of one symbol, and each operation has a cost, which is a non-negative real number. Given two strings generated from the alphabet, we want to find the minimum total cost of a sequence of these edit operations that transforms one string into the other. This minimum total cost is known as the distance between the two input strings. Using a dynamic programming approach, Wagner and Fischer [8] described Algorithm A (see below), which finds the edit distance in O(mn) space as well as time, where m and n are the numbers of symbols in the first and the second input string respectively.

In this paper, two practical linear-space algorithms are presented for finding the distance between two strings. Algorithm B runs in O(m + n) space and O(mn) time, but in actual CPU time units it runs faster than Algorithm A. Algorithm C requires only O(min(m, n)) space, not counting the storage locations for the input strings. It runs in O(mn) time and, in practice, it runs slightly faster than Algorithm A. Edit distances have been applied in spelling correction software to identify a misspelled word's closest neighbors. For example, assuming all edit operations have the same cost of 1, the distance between the strings December and ecema is 4.


It costs 1 to delete the D, 1 to change the b to an a, and 2 to delete er; the total cost of changing December to ecema is therefore 4. In bioinformatics, a gene sequence is often coded simply as a string over the alphabet {A, T, G, C} of the four standard nucleotide bases. Human genes vary in size from a few hundred bases to more than 2 million bases. At such lengths, the saving in space from quadratic to linear is tremendous. The algorithms presented here can be used to find homologous genes. Two sequences are homologous (related to each other through evolution) when they share many similar subsequences. In pattern recognition, chromosomes are often represented by strings. Chan [9] used the edit distance to distinguish between median and telocentric chromosomes.

2 Algorithm A

Let Σ be a finite set of symbols, i.e., the alphabet; let del(a), a ∈ Σ, be a non-negative real number representing the cost of deleting a from a string; let ins(a), a ∈ Σ, be a non-negative real number representing the cost of inserting a into a string; and let sub(a, b), a, b ∈ Σ, be a non-negative real number representing the cost of substituting a for b. Now, given two strings A and B generated from the alphabet, find the minimum total cost of transforming A into B by a sequence of these edit operations. Let A[i] ∈ Σ, 1 ≤ i ≤ m, be the ith symbol of A, and similarly for B[j]. Wagner and Fischer described a quadratic time and space algorithm to find the distance as follows:

Algorithm A
Global inputs: del, ins, sub
Inputs: A, B
Output: D[m, n]
1.  m := length(A)
2.  n := length(B)
3.  D[0, 0] := 0
4.  for j := 1 to n do D[0, j] := D[0, j − 1] + ins(B[j])
5.  for i := 1 to m do D[i, 0] := D[i − 1, 0] + del(A[i])
6.  for i := 1 to m do begin
7.    for j := 1 to n do begin
8.      m1 := D[i − 1, j − 1] + sub(A[i], B[j])
9.      m2 := D[i − 1, j] + del(A[i])
10.     m3 := D[i, j − 1] + ins(B[j])
11.     D[i, j] := min(m1, m2, m3)
12.   end
13. end

In practice, to satisfy the metric axioms, the insertion and deletion costs are the same for the same symbol, so we only need one function, indel, for both ins and del; also sub(a, b) = sub(b, a) and sub(a, a) = 0, i.e., the distance between a and b is the same as the distance between b and a, and the distance between a and a must be zero.


From the implementation point of view, if we represent a symbol by an integer that indexes into the alphabet, we can conveniently implement A and B as 1-dimensional integer arrays of length m and n respectively, and we can implement indel as a real vector and sub as a real matrix. For example, if there are five symbols in the alphabet, then the integers 1, 2, 3, 4, and 5 can represent the five symbols; indel is a 1-dimensional array of length 5 containing non-negative real numbers, while sub is a 5 by 5 symmetric matrix with 0's on the main diagonal. An APL implementation can be found in [10], together with its CPU time summary statistics. Algorithm A takes O(mn) space because the matrix D has exactly (m + 1)(n + 1) cells. Line 3 initializes the cell D[0, 0]. Line 4 is a loop initializing the rest of the cells in the 0th row. Line 5 is a loop initializing the rest of the cells in the 0th column. Lines 6 to 13 contain the double loop that fills out the remaining cells of D one by one in row-major order. Note that at any given point inside the double loop, the calculation of D[i, j] depends only on the cell directly above, the cell to the north-west, and the cell directly to the left. It is precisely because of this observation that linear space algorithms are possible. Fig. 1 shows this.

Fig. 1. Initial cost is 0. Final distance is 3; d is for deletion, s for substitution, and i for insertion.
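For concreteness, the following is a minimal Python sketch of Algorithm A. The function name edit_distance_matrix and the default unit costs are illustrative assumptions made here, not taken from the paper (whose own implementation was in APL [10]).

def edit_distance_matrix(A, B, indel=None, sub=None):
    """Algorithm A (Wagner-Fischer): fill the full (m+1) x (n+1) matrix D."""
    indel = indel or (lambda a: 1)                     # cost of inserting/deleting a
    sub = sub or (lambda a, b: 0 if a == b else 1)     # cost of substituting a with b
    m, n = len(A), len(B)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]        # D[i][j] = dist(A[:i], B[:j])
    for j in range(1, n + 1):                          # 0th row: build B[:j] by insertions
        D[0][j] = D[0][j - 1] + indel(B[j - 1])
    for i in range(1, m + 1):                          # 0th column: delete all of A[:i]
        D[i][0] = D[i - 1][0] + indel(A[i - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j - 1] + sub(A[i - 1], B[j - 1]),   # substitute
                          D[i - 1][j] + indel(A[i - 1]),               # delete A[i]
                          D[i][j - 1] + indel(B[j - 1]))               # insert B[j]
    return D[m][n]

if __name__ == "__main__":
    print(edit_distance_matrix("December", "ecema"))     # 4, as in the introduction
    print(edit_distance_matrix("AATTGGAC", "ACATGGAT"))  # 3, as in Fig. 2 below

With unit costs this is the ordinary Levenshtein distance; arbitrary non-negative cost tables can be supplied through the indel and sub arguments.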

3 Algorithm B

Algorithm B below takes advantage of the ideas that we do not need to store the whole matrix D to calculate the distance, and that insertion and deletion are symmetric operations.

Algorithm B
Global inputs: indel, sub
Inputs: A, B
Output: d[n]
1.  m := length(A)
2.  n := length(B)
3.  q := m
4.  d[q] := 0
5.  for j := 1 to n do begin


6.    q := q + 1
7.    d[q] := d[q − 1] + indel(B[j])
8.  end
9.  p := m
10. for i := 1 to m do begin
11.   p := p − 1
12.   q := p
13.   d[q] := d[q + 1] + indel(A[i])
14.   for j := 1 to n do begin
15.     q := q + 1
16.     m1 := d[q] + sub(A[i], B[j])
17.     m2 := d[q + 1] + indel(A[i])
18.     m3 := d[q − 1] + indel(B[j])
19.     d[q] := min(m1, m2, m3)
20.   end
21. end
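The following is a minimal Python transcription of Algorithm B; the function name edit_distance_diag and the default unit costs are assumptions made here for illustration, and the pseudocode's variable p is folded into the assignment q = m − i. Row i of the Wagner-Fischer matrix D is packed into d[m − i .. m − i + n], and the final distance D[m, n] ends up in d[n].

def edit_distance_diag(A, B, indel=None, sub=None):
    """Algorithm B: O(mn) time with a single array d of m + n + 1 words."""
    indel = indel or (lambda a: 1)
    sub = sub or (lambda a, b: 0 if a == b else 1)
    m, n = len(A), len(B)
    d = [0.0] * (m + n + 1)
    q = m
    d[q] = 0.0                              # statement 4: D[0, 0]
    for j in range(1, n + 1):               # row 0: D[0, j] stored at d[m + j]
        q += 1
        d[q] = d[q - 1] + indel(B[j - 1])
    for i in range(1, m + 1):
        q = m - i                           # row i starts at d[m - i]
        d[q] = d[q + 1] + indel(A[i - 1])   # D[i, 0] from D[i-1, 0]
        for j in range(1, n + 1):
            q += 1                          # q = m - i + j
            m1 = d[q] + sub(A[i - 1], B[j - 1])   # old d[q] still holds D[i-1, j-1]
            m2 = d[q + 1] + indel(A[i - 1])       # D[i-1, j]
            m3 = d[q - 1] + indel(B[j - 1])       # D[i, j-1], computed this pass
            d[q] = min(m1, m2, m3)
    return d[n]

For example, edit_distance_diag("December", "ecema") returns 4, the same value as the quadratic-space sketch above.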

4 Proof of Correctness of Algorithm B

Wagner and Fischer [8] proved that Algorithm A correctly computes the edit distance between strings A and B. The correctness of Algorithm B can be proved by showing that d[q] in statement 19 of Algorithm B has the same value as D[i, j] in statement 11 of Algorithm A.

Lemma. In Algorithm B, q = m − i + j.

Proof. There are four cases to consider. First, when i = j = 0, from statement 3 we have q = m = m − 0 + 0. Second, when i = 0 and j > 0, we can see from the single loop, i.e., statements 5 to 8, that every time j is incremented by 1, q is also incremented by 1, so that q = m + j = m − 0 + j. Third, when i > 0 and j = 0, we can see from the outer loop, specifically statements 9 to 12, that every time i is incremented by 1, q is decremented by 1, so that q = m − i = m − i + 0. Fourth, when i > 0 and j > 0, we can see from statements 14 and 15 that every time j is incremented by 1, q is also incremented by 1. From the third case, we know that in statement 12, q = m − i, so that in statement 15, q = m − i + j. Hence, in Algorithm B, q = m − i + j, as stated in the lemma.

Theorem. For every assignment to D[i, j] in Algorithm A, there is a corresponding assignment to d_i[q] in Algorithm B, where q = m − i + j and d_i[q] is the value of d[q] at the ith pass.

Proof. There are four cases. First, in statement 3 of Algorithm A, when i = j = 0, D[0, 0] := 0. This is translated as statement 4 in Algorithm B, d[q] := 0, where q = m − i + j.


Second, in statement 4 of Algorithm A, when i = 0 and j > 0, we have D[0, j] := D[0, j − 1] + ins(B[j]). This is translated, in the single loop, as statement 7 of Algorithm B, which for i = 0 and j > 0 reads d[q] := d[q − 1] + indel(B[j]). Since the two lines of code here have the same i and j, by the lemma D[0, j] corresponds to d[q]. We know that ins(B[j]) = indel(B[j]). It remains only to show that d_0[q − 1] = D[0, j − 1] in order to establish d_0[q] = D[0, j]. By the lemma, [0, j − 1] maps to m − 0 + j − 1 = m + j − 1 = q − 1, so that d_0[q − 1] = D[0, j − 1]. The correctness of the value in d_0[q − 1] is inherited from the correctness of the value in D[0, j − 1]. Alternatively, the correctness of the value in d[q − 1] can be seen by analyzing the single loop in Algorithm B: each time through the j loop, d[q − 1] is simply the value of the previous calculation.

Third, when i > 0 and j = 0, we compare statement 5 of Algorithm A with statement 13 of Algorithm B. The proof for this case is analogous to case two. We need only show that [i − 1, 0] maps to m − (i − 1) + 0 = m − i + 1 = q + 1.

Lastly, we have the double loops, when i > 0 and j > 0. Comparing statement 8 of Algorithm A with statement 16 of Algorithm B, we can show that [i − 1, j − 1] maps to m − (i − 1) + j − 1 = m − i + 1 + j − 1 = q. Comparing statement 9 of Algorithm A with statement 17 of Algorithm B, we can show that [i − 1, j] maps to m − (i − 1) + j = m − i + 1 + j = q + 1. Comparing statement 10 of Algorithm A with statement 18 of Algorithm B, we can show that [i, j − 1] maps to m − i + j − 1 = q − 1. Now we have proved the correctness of Algorithm B by finding a mapping that relates every [i, j] to q and by proving that for every D[i, j] there is a corresponding identical value d_i[q].
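The lemma can also be checked mechanically. The short Python sketch below is an illustrative aid, not part of the paper: it replays only the cursor bookkeeping of Algorithm B and asserts q = m − i + j at every step.

def check_q_invariant(m, n):
    """Replay Algorithm B's index updates and verify q == m - i + j."""
    q = m                                  # statement 3: i = j = 0
    assert q == m - 0 + 0
    for j in range(1, n + 1):              # single loop, i = 0
        q += 1                             # statement 6
        assert q == m - 0 + j
    for i in range(1, m + 1):              # outer loop
        q = m - i                          # statements 11-12 (p := p - 1; q := p)
        assert q == m - i + 0
        for j in range(1, n + 1):          # inner loop
            q += 1                         # statement 15
            assert q == m - i + j          # the lemma

for m in range(6):
    for n in range(6):
        check_q_invariant(m, n)
print("q = m - i + j holds for all tested m and n")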

5 Space and Time Analysis of Algorithm B

Apart from storage for the two input strings, the dominating data structure is the 1-d array d, which has exactly 1 + m + n words. In fact, the algorithm takes m + n + K words, where K is a constant independent of m and n. So the space complexity is O(m + n), which is substantially better than the O(mn) of Algorithm A. The most time-consuming part of the algorithm is the inner loop, which is executed exactly mn times. So the time complexity is O(mn), the same as that of Algorithm A. A closer analysis, however, reveals that Algorithm B has a lower multiplier in the quadratic term, so that, in practice, Algorithm B actually takes less CPU time than Algorithm A. For Algorithm A, each time through the inner loop there are four 2-d array indexings. For Algorithm B, each time through the inner loop there are four 1-d array indexings plus one extra addition and one extra assignment operation (statement 15).

6 Algorithm C

Observe that in Algorithm B, the single loop basically initializes the 0th row of D and packs it into the right end of d, starting at location d[m] and ending at location d[m + n].


Then, at the ith pass through the double loop, it packs the ith row of D, starting at location d[m − i] and ending at location d[m − i + n]. At any point in time, we need only n + 1 words to store the current row plus one word for temporary storage. This observation gives rise to the following algorithm, which uses a ring structure dd instead of a simple 1-d array d to calculate the values in place.

Algorithm C
Global inputs: indel, sub
Inputs: A, B
Output: z
1.  m := length(A)
2.  n := length(B)
3.  if m < n then begin
4.    m, n := n, m
5.    A, B := B, A
6.  end
7.  dd[0] := 0
8.  for j := 1 to n do dd[j] := dd[j − 1] + indel(B[j])
9.  n2 := n + 2
10. r := n
11. for i := 1 to m do begin
12.   r := mod(r + 1, n2)
13.   dd[r] := dd[mod(r + 1, n2)] + indel(A[i])
14.   for j := 1 to n do begin
15.     r := mod(r + 1, n2)
16.     m1 := dd[r] + sub(A[i], B[j])
17.     m2 := dd[mod(r + 1, n2)] + indel(A[i])
18.     m3 := dd[mod(r − 1, n2)] + indel(B[j])
19.     dd[r] := min(m1, m2, m3)
20.   end
21. end
22. z := dd[mod(r, n2)]

Here mod(x, n2) denotes the non-negative remainder of x on division by n2, so mod(r − 1, n2) wraps around to n2 − 1 when r = 0. The proof of correctness for Algorithm C is again an exercise in index mapping. It is similar to that for Algorithm B and is omitted.
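The following minimal Python sketch transcribes Algorithm C. The function name edit_distance_ring and the default unit costs are assumptions made here for illustration; Python's % operator already returns the non-negative remainder required above.

def edit_distance_ring(A, B, indel=None, sub=None):
    """Algorithm C: O(mn) time with a ring dd of only min(m, n) + 2 words."""
    indel = indel or (lambda a: 1)
    sub = sub or (lambda a, b: 0 if a == b else 1)
    if len(A) < len(B):              # statements 3-6: make B the shorter string
        A, B = B, A
    m, n = len(A), len(B)
    n2 = n + 2
    dd = [0.0] * n2
    for j in range(1, n + 1):        # row 0 of D in dd[0..n]
        dd[j] = dd[j - 1] + indel(B[j - 1])
    r = n
    for i in range(1, m + 1):
        r = (r + 1) % n2
        dd[r] = dd[(r + 1) % n2] + indel(A[i - 1])       # D[i, 0] from D[i-1, 0]
        for j in range(1, n + 1):
            r = (r + 1) % n2
            m1 = dd[r] + sub(A[i - 1], B[j - 1])          # old dd[r] holds D[i-1, j-1]
            m2 = dd[(r + 1) % n2] + indel(A[i - 1])       # D[i-1, j]
            m3 = dd[(r - 1) % n2] + indel(B[j - 1])       # D[i, j-1]
            dd[r] = min(m1, m2, m3)
    return dd[r]                                          # D[m, n]

if __name__ == "__main__":
    print(edit_distance_ring("AATTGGAC", "ACATGGAT"))  # 3, as in Fig. 2

Swapping the inputs is harmless here because indel serves both insertion and deletion and sub is assumed symmetric, as discussed in Section 2.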

7 Space and Time Analysis of Algorithm C

Apart from storage for the two input strings, the only data structure that depends on the input strings is the ring dd, which has exactly n + 2 words. Lines 3 to 6 make sure B is the shorter input string. Thus, the space complexity is O(min(m, n)). By inspection, the time complexity is O(mn). Experiments demonstrate that it takes about the same number of CPU seconds as Algorithm A. Algorithm C takes slightly more CPU time than Algorithm B, but uses less space. As a special case, this ring idea can also be applied to Hirschberg's Algorithm B [11] for calculating the length of maximal common subsequences. The improved version reduces the actual local storage requirement from Hirschberg's 2(n + 1) words to n + 2 while keeping the actual CPU time about the same.


AATTGGAC        A-ATTGGAC
| ||||          | || |||
ACATGGAT        ACAT-GGAT

Fig. 2. Two examples of alignment. The edit operations are insert, delete, and replace; the edit distance is the minimum number of edit operations.

8 Applications

The National Institutes of Health (NIH), US Department of Health and Human Services, provides a service called GenBank, which contains all publicly available DNA sequences. As of April 2004, there were over 38,989,342,565 bases in 32,549,400 sequence records. Petsko [12] mentioned that "Genome sequences now accumulate so quickly that, in less than a week, a single laboratory can produce more bits of data than Shakespeare managed in a lifetime, although the latter make better reading." The following is a sample record, the Saccharomyces cerevisiae TCP1-beta gene. It contains 1510 a's, 1074 c's, 835 g's, and 1609 t's.

   1  gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac gaaccattg
  61  ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca tagtcagct
 121  ctgcatctga agccgctgaa gttctactaa gggtggataa catcatccgt caagaccaa
 181  gaaccgccaa tagacaacat atgtaacata tttaggatat acctcgaaaa aataaaccg
 241  ccacactgtc attattataa ttagaaacag aacgcaaaaa ttatccacta ataattcaa
 . . .
4921  ttttcagtgt tagattgctc taattctttg agctgttctc tcagctcctc atatttttc
4981  tgccatgact cagattctaa ttttaagcta ttcaatttct ctttgatc

We can use the edit distance to calculate the dissimilarity between AATTGGAC and ACATGGAT (Fig. 2). The first alignment requires changing the second A of the first string to a C, a T to an A, and finally a C to a T; a total of 3 edit operations is needed.


The second alignment requires deleting the C from the second string, deleting the second T from the first string, and finally changing the last C to a T. The total cost is also 3 edit operations.

Another application is the parity string problem. Some examples of odd parity strings are 01, 010, and 1011: the total number of 1's in the string is odd. Some examples of even parity strings are 000, 1111, and 010100: the total number of 1's in the string is even. The pattern language is simply the set of bit strings, and there are exactly two classes: even parity and odd parity. The only edit operation we need is the substitution of a bit by another bit, where a bit can be a 1, a 0, or the empty string.

Step 1: For patterns p_1, p_2 ∈ P, let Δ_1(p_1, p_2) be the minimum number of substitutions needed to transform p_1 into p_2. E.g., 1000 can be transformed into 0001 by deleting the 1 at the beginning and inserting a 1 at the end, so that Δ_1(1000, 0001) = 2.

Step 2: The average intra-group distance for Q_1 = {q_1, q_2, ..., q_{n_1}} is

    ρ(Δ_w) = [2 / (n_1(n_1 − 1))] Σ_{i=2}^{n_1} Σ_{j=1}^{i−1} Δ_w(q_i, q_j).

Step 3: The average inter-group distance between groups Q_1 and Q_2 is

    υ(Δ_w) = [1 / (n_1 n_2)] Σ_{i=1}^{n_1} Σ_{j=1}^{n_2} Δ_w(q_i, r_j), where Q_2 = {r_1, r_2, ..., r_{n_2}}.

Step 4: The stability quotient is

    Z(Δ_w) = ρ(Δ_w) / υ(Δ_w).

Stability optimization minimizes Z(Δ_w) subject to the constraint that Σ_{i=1}^{m} w_i = 1.
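As a small illustrative aid (not part of the original paper), the following Python sketch computes ρ, υ, and Z of Steps 2 to 4 for two groups of patterns, given an arbitrary pairwise distance function dist; the function names are assumptions made here.

from itertools import combinations, product

def average_intra(group, dist):
    # Step 2: average distance over all unordered pairs within the group.
    pairs = list(combinations(group, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

def average_inter(group1, group2, dist):
    # Step 3: average distance over all pairs drawn from the two groups.
    return (sum(dist(a, b) for a, b in product(group1, group2))
            / (len(group1) * len(group2)))

def stability_quotient(group1, group2, dist):
    # Step 4: Z = average intra-group distance / average inter-group distance.
    return average_intra(group1, dist) / average_inter(group1, group2, dist)

With the two training sets of Fig. 3 and a weighted distance Δ_w, stability optimization would then search for weights w, summing to 1, that minimize stability_quotient; any of the edit-distance sketches above could serve as the unweighted Δ_1.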

Figure 3 shows that the first training set consists of two examples, 01 and 10101, and the second training set consists of 11 and 00. The distance between the two sets is 1/3. The cost of inserting (or, by symmetry, deleting) a 0 is 0; of a 1, 1/3; of the string 00, 0; of 01, 1/3; of 10, 1/3; and of 11, 0. So we can see that inserting or deleting 0s is free, while inserting or deleting a string that contains a 1 costs 1/3, unless it contains two 1s. In order to distinguish between even and odd parity strings, it makes sense that if there is an even number of 1s, the cost is free. This scheme gives the optimal objective function value of 0, as shown in Fig. 3.


Q_1 = {01, 10101},  Q_2 = {11, 00};  parity answer: 1/3.

S_1 = {0 ↔ θ, 1 ↔ θ, 00 ↔ θ, 01 ↔ θ, 10 ↔ θ, 11 ↔ θ},
w* = (0, 1/3, 0, 1/3, 1/3, 0),  Z* = 0.

Fig. 3. Parity problem: optimal weights

9 Conclusion

Two useful algorithms have been presented that reduce the quadratic space complexity of Wagner and Fischer's edit distance algorithm to linear. One idea was that, instead of calculating the entries of the matrix in a row-after-row order, we do it in a diagonal-after-diagonal order. Another idea, to further reduce the space requirement, was to treat a diagonal not as a linear array but as a ring structure. These ideas were implemented and tested in the APL language, and the resulting programs were applied to the recognition of chromosomes.

References

1. Fu, K. S.: Syntactic Pattern Recognition and Applications. Prentice-Hall, Englewood Cliffs, NJ (1982)
2. Sankoff, D., Kruskal, J. B.: Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, Massachusetts (1983)
3. Chan, T. Y. T.: Unifying Metric Approach to the Triple Parity. Artificial Intelligence, 141 (2002) 123-135
4. Chan, T. Y. T.: Inductive Pattern Learning. IEEE Transactions on Systems, Man, and Cybernetics—Part A: Systems and Humans, 29 (1999) 667-674
5. Myers, E. W.: An O(ND) Difference Algorithm and Its Variations. Algorithmica, 1 (1986) 251-266


6. Myers, E. W., Miller, W.: Optimal Alignments in Linear Space. Computer Applications in the Biosciences, 4 (1988) 11-17
7. Wong, C. K., Chandra, A. K.: Bounds for the String Editing Problem. Journal of the ACM, 23 (1976) 13-16
8. Wagner, R. A., Fischer, M. J.: The String-to-String Correction Problem. Journal of the ACM, 21 (1974) 168-173
9. Chan, T. Y. T.: Unsupervised Classification of Noisy Chromosomes. Bioinformatics, 17 (2001) 438-444
10. Chan, T. Y. T.: Running Parallel Algorithms with APL on a Sequential Machine. APL Quote Quad, 29 (1999) 25-26
11. Hirschberg, D. S.: A Linear Space Algorithm for Computing Maximal Common Subsequences. Communications of the ACM, 18 (1975) 341-343
12. Petsko, G. A.: Biology's Structurally Sound Foundations. Nature, 401 (1999) 115-116
