An Online Algorithm for Finding the Longest Previous Factors

Daisuke Okanohara (1) and Kunihiko Sadakane (2)

(1) Department of Computer Science, University of Tokyo. Hongo 7-3-1, Bunkyo-ku, Tokyo 113-0013, Japan. [email protected]
(2) Department of Computer Science and Communication Engineering, Kyushu University. Motooka 744, Nishi-ku, Fukuoka 819-0395, Japan. [email protected]
Abstract. We present a novel algorithm for finding the longest previous factors in a text, whose working space is proportional to the size of the history text. Moreover, our algorithm is online and exact: unlike the previous batch algorithms [4, 5, 6, 7, 14], which need to read the entire input beforehand, our algorithm reports the longest match just after reading each character. This algorithm can be directly used for data compression, pattern analysis, and data mining. Our algorithm also supports a window buffer, in that we can bound the working space by discarding the history from the oldest character onward. Using the dynamic rank/select dictionary [17], our algorithm requires n log σ + o(n log σ) + O(n) bits of working space and O(log^3 n) time per character, i.e., O(n log^3 n) total time, where n is the length of the history and σ is the alphabet size. We implemented our algorithm and compared it with recent algorithms [4, 5, 14] in terms of speed and working space. We found that our algorithm works with a smaller working space, less than 1/2 of that of the previous methods on real-world data, with a reasonable decline in speed.
1 Introduction
D. Halperin and K. Mehlhorn (Eds.): ESA 2008, LNCS 5193, pp. 696-707, 2008. (c) Springer-Verlag Berlin Heidelberg 2008

The problem of searching for the longest previous factor is as follows: given a history T[0, i-1] and a next character T[i] = c, find the longest substring (or factor) that occurs in the history, T[j, ..., j+l-1] = T[i-l+1, ..., i], and report (j, l), the position and the length of the matched substring. This problem is fundamental in many applications, including data compression and pattern analysis. For example, we can directly use this algorithm for the LZ77 compression methods [27]: we compress the data by replacing a substring of a text with a pointer to its longest previous occurrence in the input. Our algorithm also solves the LZ-factorization problem, which has become interesting because it can be used for succinct indexing and for finding the runs in a string [5]. A straightforward solution to this problem is to search the history on the fly by sequential search. However, this requires O(n) time for each
character, where n is the length of the history, and thus the total complexity becomes O(n^2), which is prohibitive for large n. Therefore, many previous studies use an indexing method, in which an index is constructed over the history, and the search is then performed in sublinear time. In particular, previous studies of this problem [4, 5, 6, 7, 14, 15] take the batch approach: first, the whole input is read beforehand, and an index such as a suffix array (SA) or suffix tree is constructed; the match information is then reported at the end. The time complexity of these algorithms is linear in the text size. For example, the most recent study [4] proposed to search for the factors using the SA: they first build the SA; then, at each position, they perform a binary search on the SA to find the longest matching substring, and check whether the matched factor appeared previously by range minimum queries on the corresponding SA intervals. The working space is about 6 times the original text size.

However, this batch approach has several drawbacks. If the data is larger than the available memory, we need to divide the data into small pieces, and thereby lose some information. Another problem is that if we do not know the length of the input text beforehand, as in the data streaming setting, these algorithms cannot work.

We instead take the online approach, and we also maintain an index to make the search faster. The problem here is that applying the usual data structures (e.g., a trie or hash) would require a very large working space. In this paper, we find the actual longest match. We therefore employ recent compressed full-text indexing methods [22]: data structures that support various string operations efficiently, using space proportional to that of the text itself.
In particular, we use the succinct version of the enhanced suffix array (ESA) [1], whose size is linear in the text size. The ESA supports almost the same operations as suffix trees, and can solve various string problems that cannot be solved with the original suffix arrays. Note that the previous study [1] also solves the longest matching problem using the ESA; however, their method is a batch algorithm, whereas our new method reports the longest matching information online. Moreover, we propose a method to simulate a sliding window, limiting the working space by discarding the history from the oldest character onward. With the recent dynamic rank/select dictionary [17], our algorithm requires n log σ + o(n log σ) + O(n) bits of working space and O(log^3 n) time per character, where n is the length of the history and σ is the alphabet size. Since our algorithm employs well-studied succinct data structures, rank/select and range minimum queries (rmq), it is easy to implement. To measure the practical performance, we implemented our algorithm with simpler data structures and compared it with previous methods. We found that our algorithm requires less than 1/2 of the working space of the previous methods on real-world data, and is about 4 times slower than the previous methods [4, 5, 14].
2 Preliminaries
For the computation model, we use the word RAM with word length Θ(log n), where any arithmetic operation on Θ(log n)-bit numbers and any Θ(log n)-bit memory I/O takes constant time. Let Σ be a finite ordered alphabet, and σ = |Σ|. Let T[0 ... n-1] be an input text of length n drawn from Σ. In this paper we adopt, for technical convenience, the assumption that T is preceded by $ (T[-1] = $), a character from Σ that is lexicographically smaller than all other characters and appears nowhere else in T. Note that although in conventional suffix arrays it is assumed that T is followed by T[n] = $, we prepend the special character to T because we construct prefix arrays instead of suffix arrays.
2.1 Rank and Select
Let us define rank_c(T, i) as the number of occurrences of a character c in T[0, i], and select_c(T, j) as the position of the (j+1)-th occurrence of c in T. We also define pred_c(T, i) as the largest position of an occurrence of c before i in T, and succ_c(T, i) as the smallest position of an occurrence of c after i in T. Both pred and succ can be supported by a constant number of rank and select operations: pred_c(T, i) = select_c(T, rank_c(T, i-1) - 1), and succ_c(T, i) = select_c(T, rank_c(T, i)). Here we define select_c(T, -1) = -1, and select_c(T, k) = n when k+1 is larger than the number of occurrences of c in T. For the dynamic case, let us define the operation insert(T, c, p) as the insertion of a character c at T[p], and the operation delete(T, p) as the deletion of the character at T[p]. To support these operations, we can use the data structure of [17], which supports rank and select in O((1 + log σ / log log n) log n) time, and insert and delete in O((1 + log σ / log log n) log n) amortized time, in n log σ + o(n log σ) bits. If σ < log n, each operation takes O(log n) time.
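These definitions, including the derived pred/succ identities, can be sketched directly. The following is a naive linear-scan version for illustration only; the structure of [17] answers the same queries within the stated time bounds:

```python
# Naive rank/select and the derived pred/succ, following the definitions above.
# Boundary conventions: select_c(T, -1) = -1 and select_c(T, k) = n when there
# are fewer than k+1 occurrences of c.

def rank(T, c, i):
    """Number of occurrences of c in T[0..i] (inclusive)."""
    return T[:i + 1].count(c)

def select(T, c, j):
    """Position of the (j+1)-th occurrence of c in T."""
    if j < 0:
        return -1
    count = -1
    for pos, ch in enumerate(T):
        if ch == c:
            count += 1
            if count == j:
                return pos
    return len(T)  # fewer than j+1 occurrences of c

def pred(T, c, i):
    """Largest position of c strictly before i: select_c(T, rank_c(T, i-1) - 1)."""
    return select(T, c, rank(T, c, i - 1) - 1)

def succ(T, c, i):
    """Smallest position of c strictly after i: select_c(T, rank_c(T, i))."""
    return select(T, c, rank(T, c, i))
```

For example, on T = "abracadabra", pred(T, 'a', 5) = 3 and succ(T, 'a', 5) = 7.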
2.2 Range Minimum Query
The range minimum query (RMQ) problem is as follows: given an array E[0, n-1] of elements from a totally ordered set, rmq_E(l, r) returns the index of the smallest element in E[l, r], i.e., rmq_E(l, r) = arg min_{k in {l,...,r}} E[k], taking the leftmost such element in the case of a tie. The simplest, naive algorithm for this problem scans the array from l to r each time a query is presented, resulting in Θ(n) query time in the worst case. In the static case, we can build an index for RMQ [9, 10] that supports queries in constant time using 2n + o(n) bits. In this paper, we use a dynamic version of the RMQ index.
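For illustration, a static sparse-table RMQ is a common practical stand-in for the 2n + o(n)-bit structures of [9, 10] (O(n log n) words of space, O(1) query; the function names here are ours):

```python
# Sparse-table RMQ: table[k][i] holds the index of the minimum of
# E[i .. i + 2^k - 1]; a query over E[l..r] combines two overlapping
# power-of-two blocks. Ties are broken toward the leftmost index,
# as in the definition in the text.

def build_sparse_table(E):
    n = len(E)
    table = [list(range(n))]  # level 0: blocks of length 1
    k = 1
    while (1 << k) <= n:
        prev = table[k - 1]
        row = []
        for i in range(n - (1 << k) + 1):
            a, b = prev[i], prev[i + (1 << (k - 1))]
            row.append(a if E[a] <= E[b] else b)  # <= keeps the leftmost
        table.append(row)
        k += 1
    return table

def rmq(E, table, l, r):
    """Index of the smallest element in E[l..r] (leftmost on ties)."""
    k = (r - l + 1).bit_length() - 1
    a, b = table[k][l], table[k][r - (1 << k) + 1]
    return a if E[a] <= E[b] else b
```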
2.3 Suffix Arrays, BWT, FM-Index
Let T_i := T[i, ..., n] be a suffix of T. A suffix array [12, 19] of T is a permutation of all the suffixes of T such that the suffixes are lexicographically sorted. Formally, the suffix array of T is an array A[0 ... n] containing a permutation of the interval [0 ... n] such that T_A[i] < T_A[i+1] for all 0 <= i < n, where "<" between strings denotes lexicographic order. The Burrows-Wheeler transform (BWT) [2] is a reversible transformation from a string to a string. Given a string T[0 ... n] and its suffix array A[0 ... n], its BWT B[0 ... n] is defined as B[i] := T[A[i] - 1], except when A[i] = 0, in which case B[i] = T[n] = $. Using B (the BWT of T), we can simulate the functions of suffix arrays; this is called the FM-index. The FM-index [8] is a compressed full-text index that supports various operations, including exact string matching and the operations of compressed suffix arrays. It consists of B and rank operations on it, and uses the LF operation to perform several operations such as exact matching. The operation LF(i) is defined as LF(i) = j such that A[j] = A[i] - 1. The relation between LF and the BWT is LF(i) = rank_c'(B, i) + C(c'), where c' = B[i] and C(c') gives the number of characters smaller than c' in T[0 ... n] [22]. The FM-index also supports the operation SAlookup(i), which returns the i-th value of the suffix array [8, 22]. We sample every d = log_σ n log log n-th SA value, stored in o(n log σ) bits of space; one SAlookup operation then requires at most d LF operations. Since one LF operation requires O((1 + log σ / log log n) log n) time (a rank operation), the total time for an SAlookup operation is O(log^2 n) [17]. Note that while we consider the dynamic case here, these operations can be performed a factor of O(log n) faster in the static case. We consider prefix arrays instead of suffix arrays, and the above discussion applies to prefix arrays similarly.
We define the BWT of the text for prefix arrays as B[i] := T[A[i] + 1], where A is the prefix array: a permutation of all the prefixes of the input text T such that the prefixes are sorted in reverse-lexicographic order. The left table in Figure 1 shows an example of the prefix array for the text T = abaababa.
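As a concrete illustration of these definitions, the following toy code builds the prefix array and its BWT by explicit sorting. It is quadratic-time and meant only for checking small examples such as the one in Figure 1:

```python
# Toy construction of the prefix array and its BWT: prefixes of $T are
# sorted by reverse-lexicographic order (i.e., their reversals are sorted
# lexicographically), and B[i] is the character following the i-th prefix
# ($ for the full text, mirroring the suffix-array case).

def prefix_array_bwt(T):
    S = "$" + T  # T[-1] = $, lexicographically smaller than all characters
    ends = sorted(range(1, len(S) + 1), key=lambda j: S[:j][::-1])
    prefixes = [S[:j] for j in ends]
    B = [S[j] if j < len(S) else "$" for j in ends]
    return prefixes, B
```

For T = abaababa this yields the reverse-lexicographically sorted prefixes $, $a, $abaa, $aba, $abaaba, $abaababa, $ab, $abaab, $abaabab with B = abbab$aaa.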
2.4 Hgt Array
The Hgt array H[0, n-1] for T is defined as H[i] = lcp(T_A[i], T_A[i+1]), where lcp(T_A[i], T_A[i+1]) is the length of the longest common prefix of T_A[i] and T_A[i+1]. That is, H contains the lengths of the longest common prefixes of T's suffixes that are consecutive in lexicographic order. If we store H explicitly, we need O(n log n) bits of space. Sadakane [23] gave a data structure that stores H in only 2n bits of space, using the fact that H[i] >= H[LF(i)] - 1. Let L be a bit array of 2n bits such that L[i] = 1 if i = H[LF^k[p]] + 2(n - k) and L[i] = 0 otherwise, where p = A^{-1}[n], LF^k[p] = LF^{k-1}[LF[p]] for k > 1, and LF^1[p] = LF[p]. Then H[i] is calculated as select_1(L, k) - 2k, where k = SA[i] [23]. In this paper, we consider the Hgt array defined on the prefix array; H[i] is therefore the length of the longest common suffix of T_A[i] and T_A[i+1].

[Figure 1 here: the prefix array, B, and H for T = $abaababa (s = 4, t = 0); after the insertion of a (prefix = $abaababaa, s = 3, t = 0); and after the deletion of the oldest character a = T[0] (s = 2, t = 0).]

Fig. 1. Example of one step in our algorithm

An example of Hgt is shown in the right column of Figure 1. In this figure we store H explicitly for explanation, while our algorithm keeps H in compressed form.
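The Hgt values of Figure 1 can be reproduced with a direct, uncompressed computation. This is illustrative only; the algorithm itself keeps H in the 2n-bit form of [23]:

```python
# Direct computation of the Hgt array on the prefix array: H[i] is the
# length of the longest common suffix of consecutive (reverse-lex sorted)
# prefixes of $T, matching the H column of Figure 1.

def common_suffix_len(a, b):
    k = 0
    while k < min(len(a), len(b)) and a[-1 - k] == b[-1 - k]:
        k += 1
    return k

def hgt_on_prefix_array(T):
    S = "$" + T
    prefixes = sorted((S[:j] for j in range(1, len(S) + 1)),
                      key=lambda p: p[::-1])
    return [common_suffix_len(prefixes[i], prefixes[i + 1])
            for i in range(len(prefixes) - 1)]
```

For T = abaababa this gives H = [0, 1, 1, 3, 3, 0, 2, 2].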
3 Algorithm
We propose a novel algorithm for searching for the longest previous factors in a text. The problem is formally defined as follows: given a history T[0, ..., i-1] and the next character c = T[i], we find the longest substring that matches the current suffix, T[j, ..., j+l-1] = T[i-l+1, ..., i], and report the position and the length of the matched substring. We perform this process for all i in [0, ..., n-1].

Our algorithm relies on the incremental construction of enhanced suffix arrays (ESA) [1], in a way similar to Weiner's suffix tree construction algorithm [26], in which suffixes are inserted from the shortest to the longest. The reason is that appending a single symbol to the end of the text may cause Ω(n) changes in a suffix array, whereas in reverse-order construction this never happens. Our algorithm, however, processes the string from beginning to end, and actually builds the prefix array: we insert prefixes from the shortest to the longest, and at the i-th step our algorithm has built a complete ESA for T[0, ..., i]. For example, if the input text is x1 x2 x3 x4, we insert the prefixes in the order x1, x1 x2, x1 x2 x3, x1 x2 x3 x4.

At each step, our algorithm keeps two arrays, the BWT of T (B) and the Hgt array H, both stored in compressed form. We also store auxiliary data structures to support rank/select operations on B and rmq operations on H. Figure 1 shows an example of B and H for the text T = abaababa. Besides these data structures, we also keep the following variables:
- s: the position of the new prefix in the prefix array.
- lp, ls: the lengths of the longest common prefix between the previous/successor prefix and the new prefix.
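For reference, the problem statement above can be checked against a brute-force solution that uses no index. This is our own naive sketch (quadratic time per text); it requires the earlier occurrence to start strictly before the current suffix:

```python
# Naive online longest-previous-factor: after reading T[i], report the
# length l of the longest suffix of T[0..i] that also occurs starting at
# some earlier position j (j < i - l + 1), together with the leftmost such j.
# Reports (0, -1) when no previous occurrence exists.

def online_longest_previous_factor(T):
    out = []
    for i in range(len(T)):
        best = (0, -1)  # (length l, earlier start position j)
        for l in range(1, i + 2):
            pat = T[i - l + 1:i + 1]
            j = T.find(pat)  # leftmost occurrence in T
            if j != -1 and j < i - l + 1:
                best = (l, j)
            else:
                break  # longer suffixes cannot occur earlier either
        out.append(best)
    return out
```

On T = abaababa, the reported lengths after each character are 0, 0, 1, 1, 2, 3, 2, 3.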
Note that s corresponds to the position where B[s] = $. While the update of B is the same as in previous studies on the incremental construction of compressed suffix arrays [3, 18], the other updates are new. Algorithm 1 shows the pseudocode of our entire algorithm; after reading each character, it reports the position and the length of the longest match. Figure 1 shows an example of one step of our algorithm: the left and center tables show B and H for the text T = abaababa and for the text after the insertion of a, and the right table shows B and H after deleting the oldest character a = T[0].
3.1 B Update
Although we use the same algorithm for updating B (the BWT of T) as described in [3, 18], we explain it here for the sake of clarity. We note again that we construct prefix arrays instead of suffix arrays, and therefore process the text from beginning to end, unlike the previous studies [3, 18], which build suffix arrays from the end of the text to the beginning. We initialize s to 0. At the i-th step, we insert c = T[i] at the s-th position in B (insert(B, c, s)). Then, we update s as

    s = LF(s) = rank_c(B, s) + C(c)    (1)
where C(c) returns the number of characters smaller than c in the current B. We define the operation inc(C, c), which increments C(c') by one for all c' > c, and dec(C, c), which decrements C(c') by one for all c' > c. We keep C using the data structure proposed in [20], which supports inc(C, c), dec(C, c), and the lookup C(c) in O(log σ) time. If we allow sorting the symbols by their frequency in T, the lookup and update time for the s-th most frequent character is O(log s) [20].
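Equation (1) can be exercised with a toy, list-based simulation of the B update, in which a plain Python list stands in for the dynamic rank/select structure of [17]. At each step the old $ row receives the new character, s moves according to LF, and a new $ row is inserted:

```python
# Incremental construction of the prefix-array BWT via equation (1).
# B starts as the BWT of the empty history "$"; after processing all of T,
# B equals the batch-constructed BWT of the prefix array of $T.

def incremental_prefix_bwt(T):
    B = ["$"]  # BWT of the prefix array of "$"
    s = 0      # position with B[s] = "$"
    for c in T:
        B[s] = c                           # old full prefix gains successor c
        rank = B[:s + 1].count(c)          # rank_c(B, s), inclusive
        C = sum(1 for x in B if x < c)     # characters smaller than c
        s = rank + C                       # equation (1): s <- LF(s)
        B.insert(s, "$")                   # row of the new full prefix
    return "".join(B), s
```

For T = abaababa this produces B = abbab$aaa with s = 5, matching the batch construction of Section 2.3.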
3.2 H Update
We now explain how to update the H array. First, lp and ls are initialized to 0. Let h(s1, s2) be the length of the longest common prefix of the substrings s1 and s2 (here we consider the common prefix, not the common suffix, because we build the prefix arrays). At the i-th step we insert the new prefix s_new = T[0, ..., i] into the current prefix array, and we then need to update the corresponding Hgt values, H[s-1] and H[s]. Let s_pre be the previous prefix and s_suc the successor prefix; then H[s-1] = h(s_pre, s_new) and H[s] = h(s_new, s_suc). First, consider the case lp = h(s_pre, s_new). Let c = T[i] be the current character. If the first character of s_pre is not c, then lp = 0. Otherwise, the first characters of s_pre and s_new are both c; let us denote s_pre = c s'_pre and s_new = c s'_new. The value h(s'_pre, s'_new) is the range minimum of H between p_pre and p_new, where p_pre is the position of s'_pre and p_new is the position of s'_new; then lp = 1 + h(s'_pre, s'_new). The position p_pre can be calculated as pred_c(B, p_new), and p_new is the value of s in the previous step. We can calculate ls similarly.
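This update rests on the standard lcp-interval property: for any two rows a < b of the (prefix) array, the longest common suffix of the two prefixes equals the minimum of H[a..b-1]. A brute-force check of that property on small inputs:

```python
# Verify the lcp-interval property on the prefix array of $T: for all rows
# a < b, common_suffix_len(P[a], P[b]) == min(H[a..b-1]).

def common_suffix_len(a, b):
    k = 0
    while k < min(len(a), len(b)) and a[-1 - k] == b[-1 - k]:
        k += 1
    return k

def check_lcp_interval_property(T):
    S = "$" + T
    P = sorted((S[:j] for j in range(1, len(S) + 1)), key=lambda p: p[::-1])
    H = [common_suffix_len(P[i], P[i + 1]) for i in range(len(P) - 1)]
    for a in range(len(P)):
        for b in range(a + 1, len(P)):
            assert common_suffix_len(P[a], P[b]) == min(H[a:b])
    return True
```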
Algorithm 1. The overall algorithm for searching for the longest match. After reading each character, it reports the position and the length of the matched substring.

Input: A text T[0, ..., n-1]
s <- 0        // The position in B for the next character
lp, ls <- 0   // The lcp between the previous/successor prefix and the new prefix
B             // The BWT of T
H             // The Hgt array
for i = 0 to n - 1 do
    c <- T[i]
    insert(B, c, s)
    if lp >= ls then
        Report: (lp + 1, SAlookup(s))
        insert(H, lp + 1, s)
    else
        Report: (ls + 1, SAlookup(s + 1))
        insert(H, ls + 1, s + 1)
    end if
    lp <- rmq(H, pred_c(B, s), s)
    ls <- rmq(H, s, succ_c(B, s))
    s <- rank_c(B, s) + C(c)
    inc(C, c)
end for
If H is stored in the compressed form (Section 2.4), we need one SAlookup operation per character. Figure 1 shows an example of updating H. In the left table, s = 4 and a new character a is inserted at B[4]. Here s_pre and s_suc are $aba and $ab (both corresponding values of B are a). Then lp = 1 + rmq(H, 3, 3) = 4 and ls = 1 + rmq(H, 4, 5) = 1. The new s is 3, and we update H[2] = 4 and H[3] = 1. To support rmq operations on H, we store a balanced search tree over RMQs of blocks of length log n, taking O(n) bits. An rmq then requires O(log n) accesses to H, each of which takes O(log^2 n) time; the total time for an rmq over H is therefore O(log^3 n).
3.3 Simulating the Window Buffer
If the working space is limited, say to 1 MB, we often discard the history from the oldest character and search for the longest match only in the previous 1 MB. This is usually called a sliding window buffer, and is used in many LZ77 compression implementations. Larsson [15, 16] proposed simulating the sliding window with suffix trees; our algorithm instead simulates the sliding window buffer on suffix arrays with Hgt arrays.
Here, we need to update the data structure as the history is discarded from the oldest character; this can be supported in a way very similar to the insertion operation. To achieve this, we keep another variable t, which denotes the position of the oldest character in B. We initialize t to 0 (the position of the first character in B is 0, because it is preceded by $). To apply the delete operation to the current data structure, let c = B[t]. Then, we update t as

    t = select_c(B, t - C(c)).    (2)
We simultaneously perform dec(C, c), and decrease s by 1 if t < s. For the update of H, we set H[t-1] = min(H[t-1], H[t]) and delete H[t]. We can also update t as t = Ψ[t], similar to the SAlookup with sampled SA values, which requires additional space. Note that this operation does not actually delete the oldest character; it just ignores it. Therefore, the order of the prefix array does not change after a deletion. If the character were actually deleted, this could cause Ω(n) changes in B and H. For example, given a text T = $zaaaaa, if the oldest character (z) is deleted, the order of the prefixes is completely reversed. Our algorithm instead preserves the order, and therefore all operations can be done in O(log^2 n) time. The longest-common-suffix information must be upper bounded by the window size w; specifically, the code in Algorithm 1 is changed to insert(H, min(lp + 1, w), s) and insert(H, min(ls + 1, w), s + 1), where w is the window size.
3.4 Output LZ-Factorization
The LZ factorization of T is a factorization into phrases T = w1 w2 ... wk such that each wj, j in 1 ... k, is either (1) a letter that does not occur in the previous history, or (2) the longest substring that occurs at least twice in w1 w2 ... wj. For example, for T = abaababa, w1 = a, w2 = b, w3 = a, w4 = aba, w5 = ba. The LZ factorization is used in many applications, including data compression, pattern analysis, and finding the runs in a text [5]. The difference between LZ factorization and our method is that the former reports the repetition information only at the left end of each phrase, while the latter reports the repetition at every position. The modification of our algorithm to output the LZ factorization is straightforward: when the matched length stops increasing, a phrase boundary is reached, and we report the occurrence of the longest match at that position. Algorithm 2 reports the LZ factorization in an online manner using the output of our algorithm; each factor is specified by a pair of a previous-occurrence position and a length.
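A naive reference implementation of this factorization (our own quadratic-time sketch, allowing the self-referential overlap that LZ77 permits) reproduces the example above:

```python
# Naive LZ factorization: each phrase is either a fresh letter or the
# longest prefix of the remaining text that already occurs starting at an
# earlier position (possibly overlapping the phrase itself).

def lz_factorize(T):
    factors, i = [], 0
    while i < len(T):
        l = 0
        # grow the phrase while T[i..i+l] occurs starting before position i
        while i + l < len(T) and T.find(T[i:i + l + 1], 0, i + l) != -1:
            l += 1
        factors.append(T[i:i + l] if l > 0 else T[i])
        i += max(l, 1)
    return factors
```

For T = abaababa this returns the phrases a, b, a, aba, ba, as stated in the text.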
Algorithm 2. This algorithm outputs the LZ factorization of a text T = T[0, ..., n-1], using the result of our algorithm.

Input: A text T[0, ..., n-1]
i_prev = 0  // The position of the beginning of the next phrase
(len_prev, pos_prev) = (0, 0)
for i = 0 to n - 1 do
    (len, pos) = process(T[i])  // Result of Algorithm 1
    if len <= len_prev then
        len = min(len, i - i_prev)
        Report w_t = (len, pos_prev - len), t = t + 1
        i_prev = i
    end if
    (len_prev, pos_prev) = (len, pos)
end for
Output: w1, ..., wt

4 Overall Analysis

The update of the position s is achieved by one rank and one inc(C, c) operation. A rank operation requires O((1 + log σ / log log n) log n) time [17], and an inc(C, c) operation requires O(log σ) time. The two rmq operations on H require O(log^3 n) time. The insertion of a new character into T[p] requires O((1 + log σ / log log n) log n) time, and the update of the Hgt array requires O(log^2 n) time. Therefore, the bottleneck of our algorithm is the rmq operation on the H array. Note that this can be improved to O(log^2 n) time if we keep the balanced parentheses sequence representing the topology of the suffix tree [24]; due to space limitations, we defer the details to the full paper. For the space analysis, T can be kept in n log σ + o(n log σ) bits of space [17], and the Hgt array can be kept in O(n) bits of space. Summarizing the above results, we obtain the following theorem.

Theorem 1. Let T be an input text of length n drawn from Σ, and let σ = |Σ|. We can solve the online longest previous factor problem using n log σ + O(n) + o(n log σ) bits of space in O(n log^3 n) time.
5 Experiments
In the experiments, we used simpler data structures. We store B and H in a balanced binary tree; each leaf has a fixed-size buffer storing a portion of B and H. After an insertion, we check whether the leaf is full; if so, we split it into two leaves as children of the original leaf. To reduce the space requirement further, when a leaf is full we first check whether the preceding or succeeding leaf is full; if one of them is not full (say r), we move the buffer of the full leaf into r. We do not use the succinct representation for H, because it requires SAlookup operations, which are very slow in practice. We instead use a direct representation, setting the smallest bit width for each node such that all values of H in the node can be represented correctly.
Table 1. Description of the data used in experiments

String  Size (bytes)  Σ    Description
fib35   9227465       2    The 35th Fibonacci string
fib36   14930352      2    The 36th Fibonacci string
fss9    2851443       2    The 9th run rich string of [11]
fss10   12078908      2    The 10th run rich string of [11]
rnd2    8388608       2    Random string, small alphabet
rnd21   8388608       21   Random string, larger alphabet
ecoli   4638690       4    E. coli genome
chr22   34553758      4    Human chromosome 22
bible   4047392       62   King James Bible
howto   39422105      197  Linux Howto files
chr19   63811651      4    Human chromosome 19
Table 2. Peak memory usage in bytes per input symbol

String  OS    CPSa   CPSd   kk-LZ  CPS6n
fib35   6.85  17.00  11.50  19.92  5.75
fib36   6.85  17.00  11.50  20.76  5.75
fss9    6.84  17.00  11.10  21.27  5.73
fss10   6.86  17.00  11.10  22.47  5.50
rnd2    3.14  17.00   9.00  11.83  5.75
rnd21   3.29  17.00   9.00  -      5.75
ecoli   4.13  17.00   9.00  11.11  5.79
chr22   4.28  17.00   9.00  11.03  5.78
bible   3.40  17.00   9.00  -      5.72
howto   4.63  17.00   9.00  -      5.78
chr19   3.78  17.00   9.00  11.07  5.78
Table 3. Runtime in milliseconds for searching the longest previous factors

String  OS      CPSa   kk-LZ  CPS6n
fib35   22446   5093   9225   4068
fib36   41629   8728   15822  7273
fss9    4623    1261   1853   629
fss10   31855   7020   9280   5538
rnd2    15821   3929   5206   10186
rnd21   21787   4360   -      17605
ecoli   7975    1953   1028   6448
chr22   119056  18800  12855  79861
bible   7098    1558   -      3309
howto   156715  19000  -      60568
chr19   256483  38336  29193  166939
We denote our algorithm, implemented as above, by os. We also implemented another SA-based LZ factorization, cps [5]. The implementation kk-lz of Kolpakov and Kucherov's algorithm was obtained from [14], and cps6n [4] was written by its author. Note that while os reports the matching results online, the others report them at the end. All programs were written in C or C++. All running times given are the average of two runs, and do not include the time spent reading input files; there was no large variance between the two trials. Memory usage was recorded with the memusage command, and times with the standard C getrusage function. All experiments were conducted on a 3.0 GHz Xeon processor with 32 GB of main memory, running Linux 2.6.9. The compiler was g++ (gcc version 4.0.3) with the -O3 option. Times for the cps and cps6n implementations include the time required for SA and LCP array construction; we used libdivsufsort [21] for SA construction and the linear-time algorithm of [13] for LCP construction. The implementation of
kk-lz is only suitable for strings over small alphabets (σ <= 4), so its times are given only for some files. Table 1 lists the test data, all taken from [25]. Table 2 shows the peak memory usage of each program; the values in the CPSd column are taken from [5]. These results indicate that our algorithm requires less memory than the other programs, especially when the values in H are small. This is because os dynamically sets the bit width for H in each node so that all values of H in the node can be represented. Table 3 shows the runtime of each program; on almost all inputs, the runtime of os is about 4 times that of cps.
6 Conclusion
In this paper, we presented a novel algorithm for searching for the longest match using a small working space. Our algorithm is online, and can therefore process very large text data and streaming data. The proposed method is based on the incremental construction of enhanced prefix arrays. It builds on the well-studied rank, select, and rmq operations, and is therefore easy to implement. Our approach can also simulate a sliding window buffer, by efficiently updating the index while discarding the history from the oldest character. The experimental results show that our method requires about 1/2 to 1/4 of the working space of the previous methods [5, 14] on real-world data, with a reasonable decline in speed.

Since compressed suffix trees (CST) can be simulated by adding the balanced parentheses tree (BP) to the ESA [24], we can extend our algorithm to build the CST incrementally; due to space limitations, we defer the details to the full paper. As our next step, we would like to further reduce the working space and time. In particular, the data structure for the Hgt array is the bottleneck of our algorithm, and a new succinct representation of the Hgt array with faster operations is desirable.

Acknowledgements. The authors would like to thank Simon J. Puglisi, who provided code and helpful comments. This work was supported in part by the Grant-in-Aid of the Ministry of Education, Science, Sports and Culture of Japan.
References

1. Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms 2(1), 53-86 (2004)
2. Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation (1994)
3. Chan, H., Hon, W.K., Lam, T.W., Sadakane, K.: Compressed indexes for dynamic text collections. ACM Transactions on Algorithms 3(2), 21 (2007)
4. Chen, G., Puglisi, S.J., Smyth, W.F.: LZ factorization in less time and space. Mathematics in Computer Science (MCS), Special Issue on Combinatorial Algorithms (2008)
5. Chen, G., Puglisi, S.J., Smyth, W.: Fast and practical algorithms for computing all the runs in a string. In: Ma, B., Zhang, K. (eds.) CPM 2007. LNCS, vol. 4580, pp. 307-315. Springer, Heidelberg (2007)
6. Crochemore, M., Ilie, L.: LZ factorization in less time and space. Information Processing Letters 106, 75-80 (2008)
7. Crochemore, M., Ilie, L., Smyth, W.F.: A simple algorithm for computing the Lempel-Ziv factorization. In: DCC, pp. 482-488 (2008)
8. Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: Proc. of FOCS (2000)
9. Fischer, J., Heun, V.: Theoretical and practical improvements on the RMQ-problem, with applications to LCA and LCE. In: Lewenstein, M., Valiente, G. (eds.) CPM 2006. LNCS, vol. 4009, pp. 36-48. Springer, Heidelberg (2006)
10. Fischer, J., Heun, V.: A new succinct representation of RMQ-information and improvements in the enhanced suffix array. In: Chen, B., Paterson, M., Zhang, G. (eds.) ESCAPE 2007. LNCS, vol. 4614. Springer, Heidelberg (2007)
11. Franek, F., Simpson, R.J., Smyth, W.F.: The maximum number of runs in a string. In: AWOCA, pp. 26-35 (2003)
12. Gonnet, G.H., Baeza-Yates, R., Snider, T.: New indices for text: PAT trees and PAT arrays. Information Retrieval: Algorithms and Data Structures, 66-82 (1992)
13. Kasai, T., Lee, G., Arimura, H., Arikawa, S., Park, K.: Linear-time longest-common-prefix computation in suffix arrays and its applications. In: Amir, A., Landau, G.M. (eds.) CPM 2001. LNCS, vol. 2089, pp. 181-192. Springer, Heidelberg (2001)
14. Kolpakov, R., Kucherov, G.: Mreps, http://bioinfo.lifl.fr/mreps/
15. Larsson, J.: Extended application of suffix trees to data compression. In: Proc. of DCC, pp. 190-199 (1996)
16. Larsson, J.: Structures of String Matching and Data Compression. PhD thesis, Lund University (1999)
17. Lee, S., Park, K.: Dynamic rank-select structures with applications to run-length encoded texts. In: Ma, B., Zhang, K. (eds.) CPM 2007. LNCS, vol. 4580, pp. 95-106. Springer, Heidelberg (2007)
18. Lippert, R., Mobarry, C., Walenz, B.: A space-efficient construction of the Burrows-Wheeler transform for genomic data. Journal of Computational Biology (2005)
19. Manber, U., Myers, E.W.: Suffix arrays: A new method for on-line string searches. SIAM J. Comput. 22(5), 935-948 (1993)
20. Moffat, A.: An improved data structure for cumulative probability tables. Software: Practice and Experience 29, 647-659 (1999)
21. Mori, Y.: libdivsufsort, http://code.google.com/p/libdivsufsort/
22. Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys 39(1) (2007)
23. Sadakane, K.: Succinct representations of LCP information and improvements in the compressed suffix arrays. In: ACM-SIAM SODA, pp. 225-232 (2002)
24. Sadakane, K.: Compressed suffix trees with full functionality. J. Theory of Computing Systems (2007)
25. Smyth, W.F.: http://www.cas.mcmaster.ca/~bill/strings/
26. Weiner, P.: Linear pattern matching algorithms. In: Proceedings of the 14th IEEE Symposium on Switching and Automata Theory, pp. 1-11 (1973)
27. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Transactions on Information Theory 23(3), 337-343 (1977)