July 4, 2012

LZW-compressed multiple pattern matching

We consider the standard pattern matching problem.

Pattern matching: Given a text t and a pattern p, does p occur in t? If it does, where is the leftmost occurrence?

Find kjfdkasl in rokjfdkjncbvdkojsdlkjsldskjxlkalkjfslakjlkxxcv epofikflskdjflskjvnlmnapodierpereporipojdpdaja kjrtrgdkjfdkaslkdjoretieodflkgjsnlgkjdslgkjldf riudkxdjwoisdoiwlkmssoiwoiosdkjwoixkcjksjdkjws wjnswoislkcxlkqpodskjzlapoqlksdxcmdfepowepofde zirpotdpoitgiouyoewpoiewlkjdklnkjfdkaslldkjgrp oieorisdlkweoidssdlkweoidscxmnosdwoioweiwoiwoi eopripowedkljskljwekljsdldkjsxmcnweioiewdlskjd rotirlekdlsdfdwmcslkcsdpowkdwpodkwpoekwpoporer eporjmkjfdkaslpwiowjsklncxmncsldkwpoeiwpoikwed ojreoijdkmndkjnfekreopreojkslkdjsapowi2poqwiqp

And move to its natural generalization.

As the title suggests, we will actually consider the multiple pattern version.

Compressed multiple pattern matching: Given a compressed representation of a text t and a collection of patterns p_1, p_2, ..., p_ℓ, does any p_i occur in t?

Lempel-Ziv-Welch-like (or LZ78-like) compression methods: Text t[1..N] is split into disjoint blocks b_1 b_2 ... b_n. Each block is either a single letter or a previously defined block concatenated with a single letter. Used in compress, GIF, TIFF, PDF!

Example text: ababbababababababababaabbbaa

You can see that n ∈ Ω(√N) (the i-th block has length at most i, so N ≤ n(n+1)/2), so the best possible compression ratio is limited. On the other hand, this method allows very fast (and simple) compression and decompression.
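A minimal sketch of the block splitting, assuming the plain LZ78 rule (each new block is the longest previously defined block extended by one letter). The function name `lz78_parse` is made up for illustration, and real LZW implementations differ in how they grow the dictionary.

```python
def lz78_parse(text):
    """Split text into LZ78-style blocks (illustrative helper, not from the slides)."""
    seen = set()      # blocks defined so far
    blocks = []
    current = ""
    for ch in text:
        current += ch
        if current not in seen:
            # current = (a previously defined block, or nothing) + one letter
            seen.add(current)
            blocks.append(current)
            current = ""
    if current:
        # the text ended in the middle of a block; the leftover repeats an earlier block
        blocks.append(current)
    return blocks


blocks = lz78_parse("ababbababababababababaabbbaa")
print(len(blocks), blocks)
```

On the 28-letter example text this yields 11 blocks: a, b, ab, ba, bab, aba, baba, babab, aa, bb, baa.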

t[1..N]: text, which after compression consists of n blocks
p_1, p_2, ..., p_ℓ: patterns of total length M

LZW-compressed multiple pattern matching
Input: p_1, p_2, ..., p_ℓ and a sequence of n blocks defining text t
Output: does any p_i occur in t?
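To make the input format concrete, here is a naive baseline that decompresses first and then searches. It assumes a hypothetical encoding in which every block is a pair `(parent, letter)`, where `parent` is the index of an earlier block or `None` for a single letter. It takes time at least proportional to N, which is exactly what the algorithms discussed here avoid.

```python
def decompress(blocks):
    """Rebuild the text from (parent, letter) pairs (hypothetical encoding)."""
    phrases = []
    for parent, letter in blocks:
        # a block is a single letter, or an earlier block plus one letter
        phrases.append(letter if parent is None else phrases[parent] + letter)
    return "".join(phrases)


def any_pattern_occurs_naive(patterns, blocks):
    """Baseline only: decompress, then test every pattern by substring search."""
    text = decompress(blocks)
    return any(p in text for p in patterns)


# blocks a, b, ab, ba define the text "ababba"
example = [(None, "a"), (None, "b"), (0, "b"), (1, "a")]
print(decompress(example), any_pattern_occurs_naive(["bb", "aa"], example))
```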

The first solutions for the single pattern version were given in 1994 by Amir, Benson, and Farach. They developed two algorithms with time complexities O(n log M + M) and O(n + M^2).

A year later the second algorithm was improved by Kosaraju, who developed an O(n + M^{1+ε}) time solution.

Gawrychowski, SODA 2011: The single pattern version can be solved in O(n + M) time.

If we consider more than one pattern, the situation seems significantly more challenging.

Kida, Takeda, Shinohara, Miyazaki, Arikawa, DCC 1998: The multiple pattern version can be solved in O(n + M^2) time.

Is it possible to narrow the gap between single and multiple pattern versions?

This paper: The multiple pattern version can be solved in O(n log M + M) or O(n + M^{1+ε}) time.

1. Matches the bounds of Amir et al. and Kosaraju.
2. DOES NOT use any combinatorics on words; reduces the question to simple-to-state data structure problems.
3. The same high-level idea in both algorithms. So, in a certain sense, more uniform than the previously known solutions for single pattern.

Snippets: A snippet is simply a substring of (any) pattern. Assume that each block is a snippet. The simplest way to detect an occurrence is to process the blocks from left to right, maintaining the longest suffix of the text read so far that is a prefix of (any) pattern (drawn as the red segment in the figure).

Algorithm 1 MULTIPLE-PATTERN-MATCHING(s_1, s_2, ..., s_{n'})
    c ← s_1
    for k = 2, 3, ..., n' do
        add (c, s_k) to P
        c ← prefixer(c, s_k)
    end for
    for all (s, s') ∈ P do
        detector(s, s')
    end for

detector(s_1, s_2): Given two snippets, check if any pattern occurs in their concatenation.

prefixer(s_1, s_2): Find the longest suffix of the concatenation which is a prefix of some pattern.

Paweł Gawrychowski
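The control flow of Algorithm 1 can be sketched with brute-force stand-ins for the two subroutines. Snippets are passed as explicit strings here, whereas the actual solution would represent them implicitly. The single-snippet guard is an addition for this sketch. Nothing below reflects the efficient implementations.

```python
def prefixer(patterns, s1, s2):
    """Longest suffix of s1 + s2 that is a prefix of some pattern (brute force)."""
    w = s1 + s2
    for start in range(len(w)):          # try the longest suffix first
        suffix = w[start:]
        if any(p.startswith(suffix) for p in patterns):
            return suffix
    return ""


def detector(patterns, s1, s2):
    """Does any pattern occur in s1 + s2? (brute force)"""
    w = s1 + s2
    return any(p in w for p in patterns)


def multiple_pattern_matching(patterns, snippets):
    if len(snippets) == 1:               # guard added for this sketch only
        return detector(patterns, snippets[0], "")
    pairs = []                           # the collection P of the pseudocode
    c = snippets[0]
    for s in snippets[1:]:
        pairs.append((c, s))
        c = prefixer(patterns, c, s)
    # all detector queries are known up front, so they can be answered together
    return any(detector(patterns, a, b) for a, b in pairs)


print(multiple_pattern_matching(["kjfdkasl"], ["kjf", "dka", "sl"]))
```

Collecting the pairs first and answering them afterwards mirrors the pseudocode, where the detector queries form a batch.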

Consider detector(s_1, s_2). Let P = p_1 $ p_2 $ ... $ p_ℓ.

[Figure: the concatenation s_1 s_2 with an occurrence of p_i split at the boundary into p_i[1..j] and p_i[j+1..|p_i|].]

Consider the situation in the prefix tree T^r and the suffix tree T.

[Figure: the positions of s_1 and p_i[1..j] in T^r, and of s_2 and p_i[j+1..|p_i|] in T.]

By computing the pre- and post-order numbers, this reduces to preprocessing a collection of M rectilinear rectangles so that given a point we can quickly retrieve (any) rectangle containing it.
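A small sketch of why the numbering helps, on a generic rooted tree given as a hypothetical `children` dictionary: "v lies in the subtree of u" turns into two comparisons on pre- and post-order numbers. A pair of such conditions, one in T^r and one in T, then describes a rectangle, and a pair of snippets describes a point. This illustrates the principle only and does not build the trees used here.

```python
def pre_post_numbers(children, root=0):
    """Iterative DFS returning pre-order and post-order numbers of every node."""
    pre, post = {}, {}
    next_pre = next_post = 0
    stack = [(root, False)]
    while stack:
        v, finished = stack.pop()
        if finished:
            post[v] = next_post
            next_post += 1
        else:
            pre[v] = next_pre
            next_pre += 1
            stack.append((v, True))              # revisit v after its subtree
            for c in reversed(children.get(v, [])):
                stack.append((c, False))
    return pre, post


def in_subtree(pre, post, u, v):
    """True iff v lies in the subtree rooted at u (u itself included)."""
    return pre[u] <= pre[v] and post[v] <= post[u]


tree = {0: [1, 2], 1: [3]}
pre, post = pre_post_numbers(tree)
print(in_subtree(pre, post, 1, 3), in_subtree(pre, post, 1, 2))
```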

Similarly, for prefixer(s1 , s2 ) we need to preprocess a collection of weighted horizontal segments so that given a vertical segment we can quickly retrieve the heaviest segment it intersects.
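To pin down what the query asks for, here is a brute-force version that scans every segment. The tuple layouts are assumptions made for this sketch, and the scan is only a specification of the answer, not an efficient structure.

```python
def heaviest_intersected(horizontal, vertical):
    """horizontal: list of (x1, x2, y, weight); vertical: (x, y1, y2).

    Returns the heaviest horizontal segment crossed by the vertical one, or None.
    """
    x, y1, y2 = vertical
    best = None
    for seg in horizontal:
        sx1, sx2, sy, weight = seg
        # the segments intersect iff x falls in [sx1, sx2] and sy falls in [y1, y2]
        if sx1 <= x <= sx2 and y1 <= sy <= y2:
            if best is None or weight > best[3]:
                best = seg
    return best


segments = [(0, 5, 1, 3), (2, 8, 2, 7), (6, 9, 3, 10)]
print(heaviest_intersected(segments, (4, 0, 3)))
```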

Now we can completely forget about the words and focus on those simple data structure problems! We use the usual left-to-right sweep. Whenever a new rectangle appears, we insert an interval into the structure S, and whenever a rectangle ends, we remove its corresponding interval from S. So, S stores a collection of intervals so that we can quickly retrieve (any) interval a given point belongs to.

Trivial solution: Implement S as any balanced search tree to get O(n log M + M log M) total time.
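The sweep itself can be sketched as follows, assuming closed rectangles `(x1, x2, y1, y2)` and all query points known in advance. For brevity a plain dictionary with a linear scan stands in for S, so this shows the event order only and does not achieve the balanced-search-tree bound.

```python
def stab_queries(rectangles, points):
    """For each point return the index of some rectangle containing it, or None."""
    INSERT, QUERY, REMOVE = 0, 1, 2      # at equal x: insert, then query, then remove
    events = []
    for i, (x1, x2, y1, y2) in enumerate(rectangles):
        events.append((x1, INSERT, i))
        events.append((x2, REMOVE, i))
    for j, (x, y) in enumerate(points):
        events.append((x, QUERY, j))
    events.sort()

    active = {}                          # S: rectangle index -> its y-interval
    answers = [None] * len(points)
    for x, kind, idx in events:
        if kind == INSERT:
            active[idx] = (rectangles[idx][2], rectangles[idx][3])
        elif kind == REMOVE:
            del active[idx]
        else:
            y = points[idx][1]
            for i, (lo, hi) in active.items():   # linear scan stands in for S
                if lo <= y <= hi:
                    answers[idx] = i
                    break
    return answers


print(stab_queries([(0, 4, 0, 4), (5, 9, 1, 2)], [(2, 3), (5, 3), (7, 1), (4, 4)]))
```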

We want better bounds, though. More precisely, we would like to be linear in either n or M.

O(n log M + M)

The intervals do not cross each other, and we can use this property to replace the balanced search tree by a perfect binary tree, where each update touches just one vertex. Additionally, we exploit the fact that we can process all queries at once.

O(n + M^{1+ε})

We increase the out-degree of the tree to M^ε. Then the updates become more expensive, but the depth (and so the query time) becomes constant.
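A short calculation shows where the second bound comes from. It assumes that the tree has O(M) leaves, that an update costs time proportional to the out-degree, and that a query costs time proportional to the depth. These cost assumptions are not stated here and are made for illustration only.

```latex
\text{depth} = \log_{M^{\varepsilon}} M
             = \frac{\log M}{\varepsilon \log M}
             = \frac{1}{\varepsilon} = O(1),
\qquad
\underbrace{O(M) \cdot O(M^{\varepsilon})}_{\text{updates}}
+ \underbrace{O(n) \cdot O(1/\varepsilon)}_{\text{queries}}
= O\!\left(n + M^{1+\varepsilon}\right).
```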

Similar ideas work for the second problem, too. To get the whole solution we must fill in some details (for example, we need an efficient way of retrieving the vertices corresponding to the snippets, and, if we do not assume a constant alphabet, a fast implementation of the Aho-Corasick automaton). Nevertheless, all those details boil down to the same ideas as above.

Questions?

