Simpler and efficient LZW-compressed multiple pattern matching Paweł Gawrychowski

July 4, 2012

Paweł Gawrychowski

LZW-compressed multiple pattern matching

July 4, 2012

1 / 20

We consider the standard pattern matching problem.

Pattern matching
Given a text t and a pattern p, does p occur in t? If it does, where is the leftmost occurrence?

Find kjfdkasl in rokjfdkjncbvdkojsdlkjsldskjxlkalkjfslakjlkxxcv epofikflskdjflskjvnlmnapodierpereporipojdpdaja kjrtrgdkjfdkaslkdjoretieodflkgjsnlgkjdslgkjldf riudkxdjwoisdoiwlkmssoiwoiosdkjwoixkcjksjdkjws wjnswoislkcxlkqpodskjzlapoqlksdxcmdfepowepofde zirpotdpoitgiouyoewpoiewlkjdklnkjfdkaslldkjgrp oieorisdlkweoidssdlkweoidscxmnosdwoioweiwoiwoi eopripowedkljskljwekljsdldkjsxmcnweioiewdlskjd rotirlekdlsdfdwmcslkcsdpowkdwpodkwpoekwpoporer eporjmkjfdkaslpwiowjsklncxmncsldkwpoeiwpoikwed ojreoijdkmndkjnfekreopreojkslkdjsapowi2poqwiqp
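
As a baseline, the uncompressed problem can be solved by trying every alignment. The following is a minimal Python sketch, not part of the talk: the function name `leftmost_occurrence` is illustrative, and the excerpt is the beginning of the third row of the text above.

```python
def leftmost_occurrence(text, pattern):
    """Return the 0-based index of the leftmost occurrence of pattern in text, or -1."""
    for i in range(len(text) - len(pattern) + 1):
        # Compare the pattern against the window starting at position i.
        if text[i:i + len(pattern)] == pattern:
            return i
    return -1


excerpt = "kjrtrgdkjfdkaslkdjoretie"  # start of the third row of the slide's text
position = leftmost_occurrence(excerpt, "kjfdkasl")
```

This takes time proportional to the product of the two lengths in the worst case; the rest of the talk is about doing much better when the text is given in compressed form.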

And move to its natural generalization.

Compressed pattern matching
Given a compressed representation of a text t and a pattern p, does p occur in t? If it does, where is the leftmost occurrence?

As the title suggests, we will actually consider the multiple pattern version.

Compressed multiple pattern matching
Given a compressed representation of a text t and a collection of patterns p1, p2, ..., pℓ, does any pi occur in t?

Lempel-Ziv-Welch-like (or LZ78-like) compression methods
Text t[1..N] is split into disjoint blocks b1 b2 ... bn. Each block is either a single letter or a previously defined block concatenated with a single letter. Used in compress, GIF, TIFF, PDF!

Example text: ababbababababababababaabbbaa

You can see that n ∈ Ω(√N), so the best possible compression ratio is limited. On the other hand, this method allows very fast (and simple) compression and decompression.
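
The block structure can be illustrated with a small Python sketch. It assumes a plain LZ78-style greedy parse, which is one instance of the definition above; real LZW encoders emit dictionary codes in a slightly different way. The names `lz78_parse` and `expand` are illustrative.

```python
def lz78_parse(text):
    """Split text into blocks, each given as (index of an earlier block, letter).

    Index 0 stands for the empty block, so (0, 'a') is the single letter 'a'.
    """
    trie = {}    # (parent block index, letter) -> block index
    blocks = []
    node = 0     # block matched so far
    for ch in text:
        nxt = trie.get((node, ch))
        if nxt is None:
            # Longest known block plus one new letter: define a new block.
            blocks.append((node, ch))
            trie[(node, ch)] = len(blocks)
            node = 0
        else:
            node = nxt
    if node:
        # The text ended inside a known block; emit that block once more.
        blocks.append(blocks[node - 1])
    return blocks


def expand(blocks):
    """Turn (parent, letter) pairs back into the block strings."""
    out = []
    for parent, ch in blocks:
        out.append((out[parent - 1] if parent else "") + ch)
    return out


example = "ababbababababababababaabbbaa"
parsed = expand(lz78_parse(example))
```

Because every new block extends an earlier one by a single letter, the blocks can grow only one letter at a time, which is where the Ω(√N) lower bound on n comes from.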

t[1..N]: the text, which after compression consists of n blocks
p1, p2, ..., pℓ: the patterns, of total length M

LZW-compressed multiple pattern matching
Input: p1, p2, ..., pℓ and a sequence of n blocks defining the text t
Output: does any pi occur in t?

The first solutions for the single pattern version were given in 1994 by Amir, Benson, and Farach. They developed two algorithms with time complexities O(n log M + M) and O(n + M^2).

A year later the second algorithm was improved by Kosaraju, who developed an O(n + M^(1+ε)) time solution.

Gawrychowski, SODA 2011
The single pattern version can be solved in O(n + M) time.

If we consider more than one pattern, the situation seems significantly more challenging.

Kida, Takeda, Shinohara, Miyazaki, Arikawa, DCC 1998
The multiple pattern version can be solved in O(n + M^2) time.

Is it possible to narrow the gap between the single and multiple pattern versions?

This paper
The multiple pattern version can be solved in O(n log M + M) or O(n + M^(1+ε)) time.

1. It matches the bounds of Amir et al. and Kosaraju.
2. It DOES NOT use any combinatorics on words; it reduces the question to simple-to-state data structure problems.
3. It uses the same high-level idea in both algorithms, so, in a certain sense, it is more uniform than the previously known solutions for a single pattern.

Snippets
A snippet is simply a substring of (any) pattern.

Assume that each block is a snippet. The simplest way to detect an occurrence is to process the blocks from left to right, maintaining the current longest prefix of (any) pattern that ends at the current position (drawn as the red segment in the slide animation).

Algorithm 1 MULTIPLE-PATTERN-MATCHING(s1, s2, ..., sn')
  c ← s1
  for k = 2, 3, ..., n' do
    add (c, sk) to P
    c ← prefixer(c, sk)
  end for
  for all (s, s') ∈ P do
    detector(s, s')
  end for

detector(s1, s2): given two snippets, check if any pattern occurs in their concatenation.

prefixer(s1, s2): find the longest suffix of the concatenation which is a prefix of some pattern.
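
The control flow of the algorithm can be sketched in Python as follows. This is only an illustration under strong simplifying assumptions: snippets are passed as explicit strings, and `prefixer` and `detector` are implemented by brute force, whereas the talk replaces both with geometric data structures. The extra `patterns` argument and the non-empty pattern list are assumptions of this sketch.

```python
def prefixer(s1, s2, patterns):
    """Longest suffix of s1 + s2 that is a prefix of some pattern (brute force)."""
    w = s1 + s2
    longest = max(len(p) for p in patterns)
    for k in range(min(len(w), longest), 0, -1):
        suffix = w[-k:]
        if any(p.startswith(suffix) for p in patterns):
            return suffix
    return ""


def detector(s1, s2, patterns):
    """Does any pattern occur in the concatenation s1 + s2? (brute force)"""
    return any(p in s1 + s2 for p in patterns)


def multiple_pattern_matching(snippets, patterns):
    """Follow the pseudocode: collect the pairs first, then run all detector queries."""
    pairs = []
    c = snippets[0]
    for s in snippets[1:]:
        pairs.append((c, s))
        c = prefixer(c, s, patterns)
    return any(detector(a, b, patterns) for a, b in pairs)


patterns = ["abcab", "bcd"]
found = multiple_pattern_matching(["ab", "c", "ab"], patterns)      # text "abcab"
not_found = multiple_pattern_matching(["ab", "c", "a"], patterns)   # text "abca"
```

Collecting the pairs in P before answering them mirrors the pseudocode and is what later allows all detector queries to be processed together. As in the pseudocode, an input consisting of a single snippet produces no pairs.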

Consider detector(s1, s2). Let P = p1$p2$...$pℓ.

[Figure: s1 and s2 aligned against an occurrence of pi, split as pi[1..j] and pi[j+1..|pi|] and delimited by $ in P.]

Consider the situation in the prefix tree T^r and the suffix tree T.

[Figure: the prefix tree T^r and the suffix tree T, with s1, s2, pi[1..j], and pi[j+1..|pi|] marked.]

By computing the pre- and post-order numbers, this reduces to preprocessing a collection of M rectilinear rectangles so that given a point we can quickly retrieve (any) rectangle containing it.
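
The reduction rests on a standard fact: after numbering the nodes of a tree in preorder, every subtree occupies a contiguous range of numbers. A minimal Python sketch of that step follows; the toy tree and the names `subtree_intervals` and `is_descendant` are illustrative, and the real trees are T^r and T.

```python
def subtree_intervals(children, root):
    """Map each node to (first, last): the preorder numbers covered by its subtree."""
    interval = {}
    counter = 0
    stack = [(root, False)]
    while stack:
        node, finished = stack.pop()
        if finished:
            # All descendants are numbered; close the range.
            interval[node] = (interval[node][0], counter - 1)
            continue
        interval[node] = (counter, None)
        counter += 1
        stack.append((node, True))
        for child in reversed(children.get(node, [])):
            stack.append((child, False))
    return interval


def is_descendant(interval, v, u):
    """True if v lies in the subtree of u (a node counts as its own descendant)."""
    first, last = interval[u]
    return first <= interval[v][0] <= last


toy_tree = {"r": ["a", "b"], "a": ["c", "d"]}
ranges = subtree_intervals(toy_tree, "r")
```

With one such range in T^r and one in T, each way of splitting a pattern into pi[1..j] and pi[j+1..|pi|] becomes a rectangle, and a pair (s1, s2) becomes a point whose coordinates are the preorder numbers of the two corresponding nodes.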

Similarly, for prefixer(s1 , s2 ) we need to preprocess a collection of weighted horizontal segments so that given a vertical segment we can quickly retrieve the heaviest segment it intersects.
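
To make the query concrete, here is a brute-force Python sketch of the geometric problem just stated. The tuple layout `(y, x_left, x_right, weight)` and the function name are assumptions made for illustration; the talk is about answering such queries much faster than this linear scan.

```python
def heaviest_crossing(segments, x, y_low, y_high):
    """Return the heaviest horizontal segment intersected by the vertical
    segment with abscissa x spanning [y_low, y_high], or None if there is none.

    segments: list of (y, x_left, x_right, weight) tuples.
    """
    best = None
    for seg in segments:
        y, x_left, x_right, weight = seg
        crosses = x_left <= x <= x_right and y_low <= y <= y_high
        if crosses and (best is None or weight > best[3]):
            best = seg
    return best


segments = [(1, 0, 5, 2), (2, 3, 8, 7), (4, 0, 9, 9)]
answer = heaviest_crossing(segments, 4, 0, 3)
```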

Now we can completely forget about the words and focus on those simple data structure problems!

We use the usual technique of sweeping from left to right. Whenever a new rectangle appears, we insert an interval into the structure S, and whenever a rectangle ends, we remove its corresponding interval from S. So, S stores a collection of intervals so that we can quickly retrieve (any) interval a given point belongs to.

Trivial solution
Implement S as any balanced search tree to get O(n log M + M log M) total time.
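
The sweep itself can be sketched in Python as below. To keep the example short, S is a plain dictionary scanned linearly, which is a stand-in and not the balanced search tree of the trivial solution; the rectangle format `(x1, x2, y1, y2)` with closed sides and the function name `sweep` are assumptions of this sketch.

```python
def sweep(rectangles, points):
    """For each point (x, y), return the index of some rectangle containing it, or None.

    rectangles: list of (x1, x2, y1, y2), all sides closed.
    """
    OPEN, QUERY, CLOSE = 0, 1, 2  # at equal x: open first, then query, then close
    events = []
    for i, (x1, x2, _, _) in enumerate(rectangles):
        events.append((x1, OPEN, i))
        events.append((x2, CLOSE, i))
    for j, (x, _) in enumerate(points):
        events.append((x, QUERY, j))
    events.sort()

    active = {}  # the structure S: rectangle index -> its y-interval
    answer = [None] * len(points)
    for _, kind, idx in events:
        if kind == OPEN:
            active[idx] = rectangles[idx][2:]
        elif kind == CLOSE:
            del active[idx]
        else:
            y = points[idx][1]
            for i, (y1, y2) in active.items():
                if y1 <= y <= y2:
                    answer[idx] = i
                    break
    return answer


rects = [(0, 4, 0, 4), (1, 2, 1, 2)]
hits = sweep(rects, [(1, 1), (3, 3), (5, 5), (4, 0)])
```

Replacing the dictionary by a search structure over the y-intervals is exactly where the choice of S determines the final running time.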

We want better bounds, though. More precisely, we would like to be linear in either n or M.

O(n log M + M)
The intervals do not cross each other, and we can use this property to replace the balanced search tree with a perfect binary tree, where each update touches just one vertex. Additionally, we exploit the fact that we can process all queries at once.

O(n + M^(1+ε))
We increase the out-degree of the tree to M^ε. Then the updates become more expensive, but the depth (and so the query time) becomes constant.
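
The effect of the larger out-degree on the depth is simple arithmetic: a tree with out-degree at least M^ε needs at most ⌈1/ε⌉ levels to have M leaves. The Python sketch below only checks this on two sizes; `depth_for_degree` is an illustrative helper, not part of the data structure itself.

```python
import math


def depth_for_degree(leaves, degree):
    """Smallest depth d such that a complete tree of this out-degree has >= leaves leaves."""
    depth, capacity = 0, 1
    while capacity < leaves:
        capacity *= degree
        depth += 1
    return depth


M = 2 ** 20
binary_depth = depth_for_degree(M, 2)                     # grows like log M
wide_depth = depth_for_degree(M, math.ceil(M ** 0.25))    # out-degree about M^ε, ε = 1/4
```

Here the binary tree has depth 20 while the wide tree has depth 4 = 1/ε, independent of M. The price is paid at update time, which is consistent with the M^(1+ε) term in the bound.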

Similar ideas work for the second problem, too. To get the whole solution we must fill in some details (for example, we need an efficient way of retrieving the vertices corresponding to the snippets, and, if we do not assume a constant alphabet, a fast implementation of the Aho-Corasick automaton). Nevertheless, all those details boil down to the same ideas as above.

1. Is it possible to achieve O(n + M) time for multiple patterns?
2. What about approximate pattern matching? For example, given k, can we detect an occurrence with at most k mismatches faster than in O(nmk)?

Questions?
