An Online Algorithm for Finding the Longest Previous Factors

Daisuke Okanohara (1) and Kunihiko Sadakane (2)

(1) Department of Computer Science, University of Tokyo. Hongo 7-3-1, Bunkyo-ku, Tokyo 113-0013, Japan. [email protected]
(2) Department of Computer Science and Communication Engineering, Kyushu University. Motooka 744, Nishi-ku, Fukuoka 819-0395, Japan. [email protected]
Abstract. We present a novel algorithm for finding the longest previous factors in a text, whose working space is proportional to the size of the history text. Moreover, our algorithm is online and exact: unlike the previous batch algorithms [4, 5, 6, 7, 14], which need to read the entire input beforehand, our algorithm reports the longest match just after reading each character. This algorithm can be directly used for data compression, pattern analysis, and data mining. Our algorithm also supports a window buffer, in that we can bound the working space by discarding the history from the oldest character onward. Using the dynamic rank/select dictionary [17], our algorithm requires n log σ + o(n log σ) + O(n) bits of working space and O(log^3 n) time per character, i.e., O(n log^3 n) total time, where n is the length of the history and σ is the alphabet size. We implemented our algorithm and compared it with recent algorithms [4, 5, 14] in terms of speed and working space. We found that our algorithm works with a smaller working space, less than 1/2 of that of the previous methods on real-world data, with a reasonable decline in speed.
1 Introduction
D. Halperin and K. Mehlhorn (Eds.): ESA 2008, LNCS 5193, pp. 696-707, 2008. (c) Springer-Verlag Berlin Heidelberg 2008

The problem of searching for the longest previous factor is as follows: given a history T[0, i-1] and a next character T[i] = c, find the longest substring (or factor) that occurs in the history, T[j, ..., j+l-1] = T[i-l+1, ..., i], and report (j, l), the position and the length of the matched substring. This problem is fundamental in many applications, including data compression and pattern analysis. For example, we can directly use this algorithm for the LZ77 compression methods [27]: we compress the data by replacing a substring of a text with a pointer to its longest previous occurrence in the input. Our algorithm also solves the LZ-factorization problem, which has become interesting because it can be used for succinct indexing and for finding the runs in a string [5]. A straightforward solution to this problem is to search the history on the fly by sequential search. However, this requires O(n) time for each
character, where n is the length of the history, and thus the total complexity becomes O(n^2), which is prohibitive for large n. Therefore, many previous studies use an indexing method, in which an index is constructed over the history, and the search is then performed in sublinear time. In particular, previous studies of this problem [4, 5, 6, 7, 14, 15] take the batch approach: first, the whole input is read beforehand, and an index such as a suffix array (SA) or suffix tree is constructed; the match information is then reported at the end. The time complexity of these algorithms is linear in the text size. For example, the most recent study [4] proposed to search for the factors using the SA: they first build the SA; then, at each position, they perform a binary search on the SA to find the longest matching substring, and check whether the matched factor appeared previously by range minimum queries on the corresponding SA intervals. The working space is about 6 times the original text size.

However, this batch approach has several drawbacks. If the data is larger than the available memory, we need to divide the data into small pieces, and thereby lose some information. Another problem is that if we do not know the length of the input text beforehand, as in the data streaming setting, these algorithms cannot work.

We instead take the online approach, and we also maintain an index to make the search faster. The problem here is that applying the usual data structures (e.g., a trie or hash) would require a very large working space. In this paper, we find the actual longest match. We therefore employ recent compressed full-text indexing methods [22]: data structures that support various string operations efficiently, using space proportional to that of the text itself.
In particular, we use the succinct version of the enhanced suffix array (ESA) [1], whose size is linear in the text size. The ESA supports almost the same operations as suffix trees, and can solve various string problems that cannot be solved with the original suffix arrays. Note that the previous study [1] also solves the longest matching problem using the ESA; however, their method is a batch algorithm, whereas our new method reports the longest matching information online. Moreover, we propose a method to simulate a sliding window, limiting the working space by discarding the history from the oldest character onward. With the recent dynamic rank/select dictionary [17], our algorithm requires n log σ + o(n log σ) + O(n) bits of working space and O(log^3 n) time per character, where n is the length of the history and σ is the alphabet size. Since our algorithm employs well-studied succinct data structures, rank/select and range minimum queries (rmq), it is easy to implement. To measure the practical performance, we implemented our algorithm with simpler data structures and compared it with previous methods. We found that our algorithm requires less than 1/2 of the working space of the previous methods on real-world data, and is about 4 times slower than the previous methods [4, 5, 14].
2 Preliminaries
For the computation model, we use the word RAM with word length Θ(log n), where any arithmetic operation on Θ(log n)-bit numbers and any Θ(log n)-bit memory I/O takes constant time. Let Σ be a finite ordered alphabet, and σ = |Σ|. Let T[0 ... n-1] be an input text of length n drawn from Σ. In this paper we adopt, for technical convenience, the assumption that T is preceded by $ (T[-1] = $), a character from Σ that is lexicographically smaller than all other characters and appears nowhere else in T. Note that although in conventional suffix arrays it is assumed that T is followed by T[n] = $, we prepend the special character to T because we construct prefix arrays instead of suffix arrays.
2.1 Rank and Select
Let us define rank_c(T, i) as the number of occurrences of a character c in T[0, i], and select_c(T, j) as the position of the (j+1)-th occurrence of c in T. We also define pred_c(T, i) as the largest position of an occurrence of c before i in T, and succ_c(T, i) as the smallest position of an occurrence of c after i in T. Both pred and succ can be supported by a constant number of rank and select operations: pred_c(T, i) = select_c(T, rank_c(T, i-1) - 1), and succ_c(T, i) = select_c(T, rank_c(T, i)). Here we define select_c(T, -1) = -1, and select_c(T, k) = n when k+1 is larger than the number of occurrences of c in T. For the dynamic case, let us define the operation insert(T, c, p) as the insertion of a character c at T[p], and the operation delete(T, p) as the deletion of the character at T[p]. To support these operations, we can use the data structure of [17], which supports rank and select in O((1 + log σ / log log n) log n) time, and insert and delete in O((1 + log σ / log log n) log n) amortized time, in n log σ + o(n log σ) bits. If σ < log n, each operation takes O(log n) time.
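These definitions, including the derived pred/succ identities, can be sketched directly. The following is a naive linear-scan version for illustration only; the structure of [17] answers the same queries within the stated time bounds:

```python
# Naive rank/select and the derived pred/succ, following the definitions above.
# Boundary conventions: select_c(T, -1) = -1 and select_c(T, k) = n when there
# are fewer than k+1 occurrences of c.

def rank(T, c, i):
    """Number of occurrences of c in T[0..i] (inclusive)."""
    return T[:i + 1].count(c)

def select(T, c, j):
    """Position of the (j+1)-th occurrence of c in T."""
    if j < 0:
        return -1
    count = -1
    for pos, ch in enumerate(T):
        if ch == c:
            count += 1
            if count == j:
                return pos
    return len(T)  # fewer than j+1 occurrences of c

def pred(T, c, i):
    """Largest position of c strictly before i: select_c(T, rank_c(T, i-1) - 1)."""
    return select(T, c, rank(T, c, i - 1) - 1)

def succ(T, c, i):
    """Smallest position of c strictly after i: select_c(T, rank_c(T, i))."""
    return select(T, c, rank(T, c, i))
```

For example, on T = "abracadabra", pred(T, 'a', 5) = 3 and succ(T, 'a', 5) = 7.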
2.2 Range Minimum Query
The range minimum query (RMQ) problem is as follows: given an array E[0, n-1] of elements from a totally ordered set, rmq_E(l, r) returns the index of the smallest element in E[l, r], i.e., rmq_E(l, r) = arg min_{k in {l,...,r}} E[k], taking the leftmost such element in the case of a tie. The simplest, naive algorithm for this problem scans the array from l to r each time a query is presented, resulting in Θ(n) query time in the worst case. In the static case, we can build an index for RMQ [9, 10] that supports queries in constant time using 2n + o(n) bits. In this paper, we use a dynamic version of the RMQ index.
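For illustration, a static sparse-table RMQ is a common practical stand-in for the 2n + o(n)-bit structures of [9, 10] (O(n log n) words of space, O(1) query; the function names here are ours):

```python
# Sparse-table RMQ: table[k][i] holds the index of the minimum of
# E[i .. i + 2^k - 1]; a query over E[l..r] combines two overlapping
# power-of-two blocks. Ties are broken toward the leftmost index,
# as in the definition in the text.

def build_sparse_table(E):
    n = len(E)
    table = [list(range(n))]  # level 0: blocks of length 1
    k = 1
    while (1 << k) <= n:
        prev = table[k - 1]
        row = []
        for i in range(n - (1 << k) + 1):
            a, b = prev[i], prev[i + (1 << (k - 1))]
            row.append(a if E[a] <= E[b] else b)  # <= keeps the leftmost
        table.append(row)
        k += 1
    return table

def rmq(E, table, l, r):
    """Index of the smallest element in E[l..r] (leftmost on ties)."""
    k = (r - l + 1).bit_length() - 1
    a, b = table[k][l], table[k][r - (1 << k) + 1]
    return a if E[a] <= E[b] else b
```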
2.3 Suffix Arrays, BWT, FM-Index
Let T_i := T[i, ..., n] be a suffix of T. A suffix array [12, 19] of T is a permutation of all the suffixes of T such that the suffixes are lexicographically sorted. Formally, the suffix array of T is an array A[0 ... n] containing a permutation of the interval [0 ... n] such that T_A[i] < T_A[i+1] for all 0 <= i < n, where "<" between strings denotes lexicographic order. The Burrows-Wheeler transform (BWT) [2] is a reversible transformation from a string to a string. Given a string T[0 ... n] and its suffix array A[0 ... n], its BWT B[0 ... n] is defined as B[i] := T[A[i] - 1], except when A[i] = 0, in which case B[i] = T[n] = $. Using B (the BWT of T), we can simulate the functions of suffix arrays; this is called the FM-index. The FM-index [8] is a compressed full-text index that supports various operations, including exact string matching and the operations of compressed suffix arrays. It consists of B and rank operations on it, and uses the LF operation to perform several operations such as exact matching. The operation LF(i) is defined as LF(i) = j such that A[j] = A[i] - 1. The relation between LF and the BWT is LF(i) = rank_c'(B, i) + C(c'), where c' = B[i] and C(c') gives the number of characters smaller than c' in T[0 ... n] [22]. The FM-index also supports the operation SAlookup(i), which returns the i-th value of the suffix array [8, 22]. We sample every d = log_σ n log log n-th SA value, stored in o(n log σ) bits of space; one SAlookup operation then requires at most d LF operations. Since one LF operation requires O((1 + log σ / log log n) log n) time (a rank operation), the total time for an SAlookup operation is O(log^2 n) [17]. Note that while we consider the dynamic case here, these operations can be performed a factor of O(log n) faster in the static case. We consider prefix arrays instead of suffix arrays, and the above discussion applies to prefix arrays similarly.
We define the BWT of the text for prefix arrays as B[i] := T[A[i] + 1], where A is the prefix array: a permutation of all the prefixes of the input text T such that the prefixes are sorted in reverse-lexicographic order. The left table in Figure 1 shows an example of the prefix array for the text T = abaababa.
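As a concrete illustration of these definitions, the following toy code builds the prefix array and its BWT by explicit sorting. It is quadratic-time and meant only for checking small examples such as the one in Figure 1:

```python
# Toy construction of the prefix array and its BWT: prefixes of $T are
# sorted by reverse-lexicographic order (i.e., their reversals are sorted
# lexicographically), and B[i] is the character following the i-th prefix
# ($ for the full text, mirroring the suffix-array case).

def prefix_array_bwt(T):
    S = "$" + T  # T[-1] = $, lexicographically smaller than all characters
    ends = sorted(range(1, len(S) + 1), key=lambda j: S[:j][::-1])
    prefixes = [S[:j] for j in ends]
    B = [S[j] if j < len(S) else "$" for j in ends]
    return prefixes, B
```

For T = abaababa this yields the reverse-lexicographically sorted prefixes $, $a, $abaa, $aba, $abaaba, $abaababa, $ab, $abaab, $abaabab with B = abbab$aaa.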
2.4 Hgt Array
The Hgt array H[0, n-1] for T is defined as H[i] = lcp(T_A[i], T_A[i+1]), where lcp(T_A[i], T_A[i+1]) is the length of the longest common prefix of T_A[i] and T_A[i+1]. That is, H contains the lengths of the longest common prefixes of T's suffixes that are consecutive in lexicographic order. If we store H explicitly, we need O(n log n) bits of space. Sadakane [23] gave a data structure that stores H in only 2n bits of space, using the fact that H[i] >= H[LF(i)] - 1. Let L be a bit array of 2n bits such that L[i] = 1 if i = H[LF^k[p]] + 2(n - k) and L[i] = 0 otherwise, where p = A^{-1}[n], LF^k[p] = LF^{k-1}[LF[p]] for k > 1, and LF^1[p] = LF[p]. Then H[i] is calculated as select_1(L, k) - 2k, where k = SA[i] [23]. In this paper, we consider the Hgt array defined on the prefix array; H[i] is therefore the length of the longest common suffix of T_A[i] and T_A[i+1].

[Figure 1 here: the prefix array, B, and H for T = $abaababa (s = 4, t = 0); after the insertion of a (prefix = $abaababaa, s = 3, t = 0); and after the deletion of the oldest character a = T[0] (s = 2, t = 0).]

Fig. 1. Example of one step in our algorithm

An example of Hgt is shown in the right column of Figure 1. In this figure we store H explicitly for explanation, while our algorithm keeps H in compressed form.
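The Hgt values of Figure 1 can be reproduced with a direct, uncompressed computation. This is illustrative only; the algorithm itself keeps H in the 2n-bit form of [23]:

```python
# Direct computation of the Hgt array on the prefix array: H[i] is the
# length of the longest common suffix of consecutive (reverse-lex sorted)
# prefixes of $T, matching the H column of Figure 1.

def common_suffix_len(a, b):
    k = 0
    while k < min(len(a), len(b)) and a[-1 - k] == b[-1 - k]:
        k += 1
    return k

def hgt_on_prefix_array(T):
    S = "$" + T
    prefixes = sorted((S[:j] for j in range(1, len(S) + 1)),
                      key=lambda p: p[::-1])
    return [common_suffix_len(prefixes[i], prefixes[i + 1])
            for i in range(len(prefixes) - 1)]
```

For T = abaababa this gives H = [0, 1, 1, 3, 3, 0, 2, 2].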
3 Algorithm
We propose a novel algorithm for searching for the longest previous factors in a text. The problem is formally defined as follows: given a history T[0, ..., i-1] and the next character c = T[i], we find the longest substring that matches the current suffix, T[j, ..., j+l-1] = T[i-l+1, ..., i], and report the position and the length of the matched substring. We perform this process for all i in [0, ..., n-1].

Our algorithm relies on the incremental construction of enhanced suffix arrays (ESA) [1], in a way similar to Weiner's suffix tree construction algorithm [26], in which suffixes are inserted from the shortest to the longest. The reason is that appending a single symbol to the end of the text may cause Ω(n) changes in a suffix array, whereas in reverse-order construction this never happens. Our algorithm, however, processes the string from beginning to end, and actually builds the prefix array: we insert prefixes from the shortest to the longest, and at the i-th step our algorithm has built a complete ESA for T[0, ..., i]. For example, if the input text is x1 x2 x3 x4, we insert the prefixes in the order x1, x1 x2, x1 x2 x3, x1 x2 x3 x4.

At each step, our algorithm keeps two arrays, the BWT of T (B) and the Hgt array H, both stored in compressed form. We also store auxiliary data structures to support rank/select operations on B and rmq operations on H. Figure 1 shows an example of B and H for the text T = abaababa. Besides these data structures, we also keep the following variables:
- s: the position of the new prefix in the prefix array.
- lp, ls: the lengths of the longest common prefix between the previous/successor prefix and the new prefix.
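For reference, the problem statement above can be checked against a brute-force solution that uses no index. This is our own naive sketch (quadratic time per text); it requires the earlier occurrence to start strictly before the current suffix:

```python
# Naive online longest-previous-factor: after reading T[i], report the
# length l of the longest suffix of T[0..i] that also occurs starting at
# some earlier position j (j < i - l + 1), together with the leftmost such j.
# Reports (0, -1) when no previous occurrence exists.

def online_longest_previous_factor(T):
    out = []
    for i in range(len(T)):
        best = (0, -1)  # (length l, earlier start position j)
        for l in range(1, i + 2):
            pat = T[i - l + 1:i + 1]
            j = T.find(pat)  # leftmost occurrence in T
            if j != -1 and j < i - l + 1:
                best = (l, j)
            else:
                break  # longer suffixes cannot occur earlier either
        out.append(best)
    return out
```

On T = abaababa, the reported lengths after each character are 0, 0, 1, 1, 2, 3, 2, 3.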
Note that s corresponds to the position where B[s] = $. While the update of B is the same as in previous studies on the incremental construction of compressed suffix arrays [3, 18], the other updates are new. Algorithm 1 shows the pseudocode of our entire algorithm; after reading each character, it reports the position and the length of the longest match. Figure 1 shows an example of one step of our algorithm: the left and center tables show B and H for the text T = abaababa and for the text after the insertion of a, and the right table shows B and H after deleting the oldest character a = T[0].
3.1 B Update
Although we use the same algorithm for updating B (the BWT of T) as described in [3, 18], we explain it here for the sake of clarity. We note again that we construct prefix arrays instead of suffix arrays, and therefore process the text from beginning to end, unlike the previous studies [3, 18], which build suffix arrays from the end of the text to the beginning. We initialize s to 0. At the i-th step, we insert c = T[i] at the s-th position in B (insert(B, c, s)). Then, we update s as

    s = LF(s) = rank_c(B, s) + C(c)    (1)
where C(c) returns the number of characters smaller than c in the current B. We define the operation inc(C, c), which increments C(c') by one for all c' > c, and dec(C, c), which decrements C(c') by one for all c' > c. We keep C using the data structure proposed in [20], which supports inc(C, c), dec(C, c), and the lookup C(c) in O(log σ) time. If we allow sorting the symbols by their frequency in T, the lookup and update time for the s-th most frequent character is O(log s) [20].
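Equation (1) can be exercised with a toy, list-based simulation of the B update, in which a plain Python list stands in for the dynamic rank/select structure of [17]. At each step the old $ row receives the new character, s moves according to LF, and a new $ row is inserted:

```python
# Incremental construction of the prefix-array BWT via equation (1).
# B starts as the BWT of the empty history "$"; after processing all of T,
# B equals the batch-constructed BWT of the prefix array of $T.

def incremental_prefix_bwt(T):
    B = ["$"]  # BWT of the prefix array of "$"
    s = 0      # position with B[s] = "$"
    for c in T:
        B[s] = c                           # old full prefix gains successor c
        rank = B[:s + 1].count(c)          # rank_c(B, s), inclusive
        C = sum(1 for x in B if x < c)     # characters smaller than c
        s = rank + C                       # equation (1): s <- LF(s)
        B.insert(s, "$")                   # row of the new full prefix
    return "".join(B), s
```

For T = abaababa this produces B = abbab$aaa with s = 5, matching the batch construction of Section 2.3.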
3.2 H Update
We now explain how to update the H array. First, lp and ls are initialized to 0. Let h(s1, s2) be the length of the longest common prefix of the substrings s1 and s2 (here we consider the common prefix, not the common suffix, because we build the prefix arrays). At the i-th step we insert the new prefix s_new = T[0, ..., i] into the current prefix array, and we then need to update the corresponding Hgt values, H[s-1] and H[s]. Let s_pre be the previous prefix and s_suc the successor prefix; then H[s-1] = h(s_pre, s_new) and H[s] = h(s_new, s_suc). First, consider the case lp = h(s_pre, s_new). Let c = T[i] be the current character. If the first character of s_pre is not c, then lp = 0. Otherwise, the first characters of s_pre and s_new are both c; let us denote s_pre = c s'_pre and s_new = c s'_new. The value h(s'_pre, s'_new) is the range minimum of H between p_pre and p_new, where p_pre is the position of s'_pre and p_new is the position of s'_new; then lp = 1 + h(s'_pre, s'_new). The position p_pre can be calculated as pred_c(B, p_new), and p_new is the value of s in the previous step. We can calculate ls similarly.
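This update rests on the standard lcp-interval property: for any two rows a < b of the (prefix) array, the longest common suffix of the two prefixes equals the minimum of H[a..b-1]. A brute-force check of that property on small inputs:

```python
# Verify the lcp-interval property on the prefix array of $T: for all rows
# a < b, common_suffix_len(P[a], P[b]) == min(H[a..b-1]).

def common_suffix_len(a, b):
    k = 0
    while k < min(len(a), len(b)) and a[-1 - k] == b[-1 - k]:
        k += 1
    return k

def check_lcp_interval_property(T):
    S = "$" + T
    P = sorted((S[:j] for j in range(1, len(S) + 1)), key=lambda p: p[::-1])
    H = [common_suffix_len(P[i], P[i + 1]) for i in range(len(P) - 1)]
    for a in range(len(P)):
        for b in range(a + 1, len(P)):
            assert common_suffix_len(P[a], P[b]) == min(H[a:b])
    return True
```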
Algorithm 1. The overall algorithm for searching for the longest match. After reading each character, it reports the position and the length of the matched substring.

Input: A text T[0, ..., n-1]
s <- 0        // The position in B for the next character
lp, ls <- 0   // The lcp between the previous/successor prefix and the new prefix
B             // The BWT of T
H             // The Hgt array
for i = 0 to n - 1 do
    c <- T[i]
    insert(B, c, s)
    if lp >= ls then
        Report: (lp + 1, SAlookup(s))
        insert(H, lp + 1, s)
    else
        Report: (ls + 1, SAlookup(s + 1))
        insert(H, ls + 1, s + 1)
    end if
    lp <- rmq(H, pred_c(B, s), s)
    ls <- rmq(H, s, succ_c(B, s))
    s <- rank_c(B, s) + C(c)
    inc(C, c)
end for
If H is stored in the compressed form (Section 2.4), we need one SAlookup operation per character. Figure 1 shows an example of updating H. In the left table, s = 4 and a new character a is inserted at B[4]. Here s_pre and s_suc are $aba and $ab (both corresponding values of B are a). Then lp = 1 + rmq(H, 3, 3) = 4 and ls = 1 + rmq(H, 4, 5) = 1. The new s is 3, and we update H[2] = 4 and H[3] = 1. To support rmq operations on H, we store a balanced search tree over RMQs of blocks of length log n, taking O(n) bits. An rmq then requires O(log n) accesses to H, each of which takes O(log^2 n) time; the total time for an rmq over H is therefore O(log^3 n).
3.3 Simulating the Window Buffer
If the working space is limited, say to 1 MB, we often discard the history from the oldest character and search for the longest match only in the previous 1 MB. This is usually called a sliding window buffer, and is used in many LZ77 compression implementations. Larsson [15, 16] proposed simulating the sliding window with suffix trees; our algorithm instead simulates the sliding window buffer on suffix arrays with Hgt arrays.
Here, we need to update the data structure as the history is discarded from the oldest character; this can be supported in a way very similar to the insertion operation. To achieve this, we keep another variable t, which denotes the position of the oldest character in B. We initialize t to 0 (the position of the first character in B is 0, because it is preceded by $). To apply the delete operation to the current data structure, let c = B[t]. Then, we update t as

    t = select_c(B, t - C(c)).    (2)
We simultaneously perform dec(C, c), and decrease s by 1 if t < s. For the update of H, we set H[t-1] = min(H[t-1], H[t]) and delete H[t]. We can also update t as t = Ψ[t], similar to the SAlookup with sampled SA values, which requires additional space. Note that this operation does not actually delete the oldest character; it just ignores it. Therefore, the order of the prefix array does not change after a deletion. If the character were actually deleted, this could cause Ω(n) changes in B and H. For example, given a text T = $zaaaaa, if the oldest character (z) is deleted, the order of the prefixes is completely reversed. Our algorithm instead preserves the order, and therefore all operations can be done in O(log^2 n) time. The longest-common-suffix information must be upper bounded by the window size w; specifically, the code in Algorithm 1 is changed to insert(H, min(lp + 1, w), s) and insert(H, min(ls + 1, w), s + 1), where w is the window size.
3.4 Output LZ-Factorization
The LZ factorization of T is a factorization into phrases T = w1 w2 ... wk such that each wj, j in 1 ... k, is either (1) a letter that does not occur in the previous history, or (2) the longest substring that occurs at least twice in w1 w2 ... wj. For example, for T = abaababa, w1 = a, w2 = b, w3 = a, w4 = aba, w5 = ba. The LZ factorization is used in many applications, including data compression, pattern analysis, and finding the runs in a text [5]. The difference between LZ factorization and our method is that the former reports the repetition information only at the left end of each phrase, while the latter reports the repetition at every position. The modification of our algorithm to output the LZ factorization is straightforward: when the matched length stops increasing, a phrase boundary is reached, and we report the occurrence of the longest match at that position. Algorithm 2 reports the LZ factorization in an online manner using the output of our algorithm; each factor is specified by a pair of a previous-occurrence position and a length.
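A naive reference implementation of this factorization (our own quadratic-time sketch, allowing the self-referential overlap that LZ77 permits) reproduces the example above:

```python
# Naive LZ factorization: each phrase is either a fresh letter or the
# longest prefix of the remaining text that already occurs starting at an
# earlier position (possibly overlapping the phrase itself).

def lz_factorize(T):
    factors, i = [], 0
    while i < len(T):
        l = 0
        # grow the phrase while T[i..i+l] occurs starting before position i
        while i + l < len(T) and T.find(T[i:i + l + 1], 0, i + l) != -1:
            l += 1
        factors.append(T[i:i + l] if l > 0 else T[i])
        i += max(l, 1)
    return factors
```

For T = abaababa this returns the phrases a, b, a, aba, ba, as stated in the text.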
Algorithm 2. This algorithm outputs the LZ factorization of a text T = T[0, ..., n-1], using the result of our algorithm.

Input: A text T[0, ..., n-1]
i_prev = 0  // The position of the beginning of the next phrase
(len_prev, pos_prev) = (0, 0)
for i = 0 to n - 1 do
    (len, pos) = process(T[i])  // Result of Algorithm 1
    if len <= len_prev then
        len = min(len, i - i_prev)
        Report w_t = (len, pos_prev - len), t = t + 1
        i_prev = i
    end if
    (len_prev, pos_prev) = (len, pos)
end for
Output: w1, ..., wt

4 Overall Analysis

The update of the position s is achieved by one rank and one inc(C, c) operation. A rank operation requires O((1 + log σ / log log n) log n) time [17], and an inc(C, c) operation requires O(log σ) time. The two rmq operations on H require O(log^3 n) time. The insertion of a new character into T[p] requires O((1 + log σ / log log n) log n) time, and the update of the Hgt array requires O(log^2 n) time. Therefore, the bottleneck of our algorithm is the rmq operation on the H array. Note that this can be improved to O(log^2 n) time if we keep the balanced parentheses sequence representing the topology of the suffix tree [24]; due to space limitations, we defer the details to the full paper. For the space analysis, T can be kept in n log σ + o(n log σ) bits of space [17], and the Hgt array can be kept in O(n) bits of space. Summarizing the above results, we obtain the following theorem.

Theorem 1. Let T be an input text of length n drawn from Σ, and let σ = |Σ|. We can solve the online longest previous factor problem using n log σ + O(n) + o(n log σ) bits of space in O(n log^3 n) time.
5 Experiments
In the experiments, we used simpler data structures. We store B and H in a balanced binary tree; each leaf has a fixed-size buffer storing a portion of B and H. After an insertion, we check whether the leaf is full; if so, we split it into two leaves as children of the original leaf. To reduce the space requirement further, when a leaf is full we first check whether the preceding or succeeding leaf is full; if one of them is not full (say r), we move the buffer of the full leaf into r. We do not use the succinct representation for H, because it requires SAlookup operations, which are very slow in practice. We instead use a direct representation, setting the smallest bit width for each node such that all values of H in the node can be represented correctly.
Table 1. Description of the data used in experiments

String  Size (bytes)  Σ    Description
fib35   9227465       2    The 35th Fibonacci string
fib36   14930352      2    The 36th Fibonacci string
fss9    2851443       2    The 9th run rich string of [11]
fss10   12078908      2    The 10th run rich string of [11]
rnd2    8388608       2    Random string, small alphabet
rnd21   8388608       21   Random string, larger alphabet
ecoli   4638690       4    E. coli genome
chr22   34553758      4    Human chromosome 22
bible   4047392       62   King James Bible
howto   39422105      197  Linux Howto files
chr19   63811651      4    Human chromosome 19
Table 2. Peak memory usage in bytes per input symbol

String  OS    CPSa   CPSd   kk-LZ  CPS6n
fib35   6.85  17.00  11.50  19.92  5.75
fib36   6.85  17.00  11.50  20.76  5.75
fss9    6.84  17.00  11.10  21.27  5.73
fss10   6.86  17.00  11.10  22.47  5.50
rnd2    3.14  17.00   9.00  11.83  5.75
rnd21   3.29  17.00   9.00  -      5.75
ecoli   4.13  17.00   9.00  11.11  5.79
chr22   4.28  17.00   9.00  11.03  5.78
bible   3.40  17.00   9.00  -      5.72
howto   4.63  17.00   9.00  -      5.78
chr19   3.78  17.00   9.00  11.07  5.78
Table 3. Runtime in milliseconds for searching the longest previous factors

String  OS      CPSa   kk-LZ  CPS6n
fib35   22446   5093   9225   4068
fib36   41629   8728   15822  7273
fss9    4623    1261   1853   629
fss10   31855   7020   9280   5538
rnd2    15821   3929   5206   10186
rnd21   21787   4360   -      17605
ecoli   7975    1953   1028   6448
chr22   119056  18800  12855  79861
bible   7098    1558   -      3309
howto   156715  19000  -      60568
chr19   256483  38336  29193  166939
We denote our algorithm, implemented as above, by os. We also implemented another SA-based LZ factorization, cps [5]. The implementation kk-lz of Kolpakov and Kucherov's algorithm was obtained from [14], and cps6n [4] was written by its author. Note that while os reports the matching results online, the others report them at the end. All programs were written in C or C++. All running times given are the average of two runs, and do not include the time spent reading input files; there was no large variance between the two trials. Memory usage was recorded with the memusage command, and times with the standard C getrusage function. All experiments were conducted on a 3.0 GHz Xeon processor with 32 GB of main memory, running Linux 2.6.9. The compiler was g++ (gcc version 4.0.3) with the -O3 option. Times for the cps and cps6n implementations include the time required for SA and LCP array construction; we used libdivsufsort [21] for SA construction and the linear-time algorithm of [13] for LCP construction. The implementation of
kk-lz is only suitable for strings over small alphabets (σ <= 4), so its times are given only for some files. Table 1 lists the test data, all taken from [25]. Table 2 shows the peak memory usage of each program; the values in the CPSd column are taken from [5]. These results indicate that our algorithm requires less memory than the other programs, especially when the values in H are small. This is because os dynamically sets the bit width for H in each node so that all values of H in the node can be represented. Table 3 shows the runtime of each program; on almost all inputs, the runtime of os is about 4 times that of cps.
6 Conclusion
In this paper, we presented a novel algorithm for searching for the longest match using a small working space. Our algorithm is online, and can therefore process very large text data and streaming data. The proposed method is based on the incremental construction of enhanced prefix arrays. It builds on the well-studied rank, select, and rmq operations, and is therefore easy to implement. Our approach can also simulate a sliding window buffer, by efficiently updating the index while discarding the history from the oldest character. The experimental results show that our method requires about 1/2 to 1/4 of the working space of the previous methods [5, 14] on real-world data, with a reasonable decline in speed.

Since compressed suffix trees (CST) can be simulated by adding the balanced parentheses tree (BP) to the ESA [24], we can extend our algorithm to build the CST incrementally; due to space limitations, we defer the details to the full paper. As our next step, we would like to further reduce the working space and time. In particular, the data structure for the Hgt array is the bottleneck of our algorithm, and a new succinct representation of the Hgt array with faster operations is desirable.

Acknowledgements. The authors would like to thank Simon J. Puglisi, who provided code and helpful comments. This work was supported in part by the Grant-in-Aid of the Ministry of Education, Science, Sports and Culture of Japan.
References

1. Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms 2(1), 53-86 (2004)
2. Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation (1994)
3. Chan, H., Hon, W.K., Lam, T.W., Sadakane, K.: Compressed indexes for dynamic text collections. ACM Transactions on Algorithms 3(2), 21 (2007)
4. Chen, G., Puglisi, S.J., Smyth, W.F.: LZ factorization in less time and space. Mathematics in Computer Science (MCS), Special Issue on Combinatorial Algorithms (2008)
5. Chen, G., Puglisi, S.J., Smyth, W.: Fast and practical algorithms for computing all the runs in a string. In: Ma, B., Zhang, K. (eds.) CPM 2007. LNCS, vol. 4580, pp. 307-315. Springer, Heidelberg (2007)
6. Crochemore, M., Ilie, L.: LZ factorization in less time and space. Information Processing Letters 106, 75-80 (2008)
7. Crochemore, M., Ilie, L., Smyth, W.F.: A simple algorithm for computing the Lempel-Ziv factorization. In: DCC, pp. 482-488 (2008)
8. Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: Proc. of FOCS (2000)
9. Fischer, J., Heun, V.: Theoretical and practical improvements on the RMQ-problem, with applications to LCA and LCE. In: Lewenstein, M., Valiente, G. (eds.) CPM 2006. LNCS, vol. 4009, pp. 36-48. Springer, Heidelberg (2006)
10. Fischer, J., Heun, V.: A new succinct representation of RMQ-information and improvements in the enhanced suffix array. In: Chen, B., Paterson, M., Zhang, G. (eds.) ESCAPE 2007. LNCS, vol. 4614. Springer, Heidelberg (2007)
11. Franek, F., Simpson, R.J., Smyth, W.F.: The maximum number of runs in a string. In: AWOCA, pp. 26-35 (2003)
12. Gonnet, G.H., Baeza-Yates, R., Snider, T.: New indices for text: PAT trees and PAT arrays. Information Retrieval: Algorithms and Data Structures, 66-82 (1992)
13. Kasai, T., Lee, G., Arimura, H., Arikawa, S., Park, K.: Linear-time longest-common-prefix computation in suffix arrays and its applications. In: Amir, A., Landau, G.M. (eds.) CPM 2001. LNCS, vol. 2089, pp. 181-192. Springer, Heidelberg (2001)
14. Kolpakov, R., Kucherov, G.: Mreps, http://bioinfo.lifl.fr/mreps/
15. Larsson, J.: Extended application of suffix trees to data compression. In: Proc. of DCC, pp. 190-199 (1996)
16. Larsson, J.: Structures of String Matching and Data Compression. PhD thesis, Lund University (1999)
17. Lee, S., Park, K.: Dynamic rank-select structures with applications to run-length encoded texts. In: Ma, B., Zhang, K. (eds.) CPM 2007. LNCS, vol. 4580, pp. 95-106. Springer, Heidelberg (2007)
18. Lippert, R., Mobarry, C., Walenz, B.: A space-efficient construction of the Burrows-Wheeler transform for genomic data. Journal of Computational Biology (2005)
19. Manber, U., Myers, E.W.: Suffix arrays: A new method for on-line string searches. SIAM J. Comput. 22(5), 935-948 (1993)
20. Moffat, A.: An improved data structure for cumulative probability tables. Software: Practice and Experience 29, 647-659 (1999)
21. Mori, Y.: libdivsufsort, http://code.google.com/p/libdivsufsort/
22. Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys 39(1) (2007)
23. Sadakane, K.: Succinct representations of LCP information and improvements in the compressed suffix arrays. In: ACM-SIAM SODA, pp. 225-232 (2002)
24. Sadakane, K.: Compressed suffix trees with full functionality. J. Theory of Computing Systems (2007)
25. Smyth, W.F.: http://www.cas.mcmaster.ca/~bill/strings/
26. Weiner, P.: Linear pattern matching algorithms. In: Proceedings of the 14th IEEE Symposium on Switching and Automata Theory, pp. 1-11 (1973)
27. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Transactions on Information Theory 23(3), 337-343 (1977)