Faster algorithm for computing the edit distance between SLP-compressed strings

Paweł Gawrychowski

Max-Planck-Institut für Informatik

October 24, 2012

Paweł Gawrychowski

Faster edit distance between SLPs

October 24, 2012


We consider the question of computing the edit distance between two strings.

Edit distance: Given two strings a and b, what is the minimum number of insert/delete/replace operations required to transform a into b?

Observation: If a replacement is charged as a deletion followed by an insertion, computing the edit distance between a and b is equivalent to finding the length of the longest common subsequence of a and b, because the distance then equals |a| + |b| − 2·LCS(a, b).

Example: a = ATGCCGAC, b = CAGACTAGA.

The question is: how quickly can we compute this number?

Use dynamic programming! For each i and j, compute the length of the longest common subsequence of a[1..i] and b[1..j]. This results in a very simple quadratic-time algorithm; more precisely, the running time is O(|a||b|).
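The dynamic program described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function names `lcs_length` and `indel_distance` are hypothetical, and the distance computed here counts insertions and deletions only (a replacement is charged as two operations).

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, in O(|a||b|) time."""
    # prev[j] holds the LCS length of the processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur.append(prev[j - 1] + 1)          # extend a common subsequence
            else:
                cur.append(max(prev[j], cur[j - 1]))  # skip a letter of a or of b
        prev = cur
    return prev[-1]


def indel_distance(a: str, b: str) -> int:
    """Edit distance with insertions and deletions only."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


# A longest common subsequence of these two strings is AGCGA (length 5).
example_lcs = lcs_length("ATGCCGAC", "CAGACTAGA")
```

Only two rows of the table are kept at any time, so the space usage is O(|b|) while the time stays quadratic.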

Is there anything faster? The bound can be improved either by using bit parallelism or by assuming that some parameter (the edit distance, the number of matches, ...) is small.

A natural direction is to consider the case when the input strings a and b are highly compressible. The question is: what method of compression should we assume?

Straight-line programs, or grammar compression: a context-free grammar in Chomsky normal form with exactly one production for each nonterminal, hence generating exactly one string.

Fibonacci words: F0 = 1, F1 = 0, and Fn+2 = Fn+1 Fn for all n ≥ 0.

F2 = 01

F3 = 010

F4 = 01001

F5 = 01001010

F6 = 0100101001001

F7 = 010010100100101001010
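To make the Fibonacci example concrete, here is a small Python sketch of an SLP as a dictionary of rules. The encoding (a terminal rule is a one-letter string, a binary rule is a pair of nonterminal names) and the names `fibonacci_slp`, `derived_lengths` and `expand` are assumptions made for illustration only.

```python
def fibonacci_slp(k: int) -> dict:
    """Rules of an SLP in which nonterminal F{k} derives the k-th Fibonacci word."""
    rules = {"F0": "1", "F1": "0"}               # terminal rules
    for n in range(2, k + 1):
        rules[f"F{n}"] = (f"F{n-1}", f"F{n-2}")  # F_n -> F_{n-1} F_{n-2}
    return rules


def derived_lengths(rules: dict) -> dict:
    """Length of the string derived by each nonterminal, without expanding it."""
    lengths = {}
    for name, rhs in rules.items():              # rules are listed bottom-up
        if isinstance(rhs, str):
            lengths[name] = 1
        else:
            lengths[name] = lengths[rhs[0]] + lengths[rhs[1]]
    return lengths


def expand(rules: dict, name: str) -> str:
    """Fully decompress a nonterminal (exponential output in general)."""
    rhs = rules[name]
    if isinstance(rhs, str):
        return rhs
    return expand(rules, rhs[0]) + expand(rules, rhs[1])


rules = fibonacci_slp(7)
f7 = expand(rules, "F7")
```

With this encoding the grammar for F40 has 41 rules, while the word it derives has more than 10^8 letters, which is the gap between n and N exploited in the rest of the talk.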

Why this compression method?

1. It is as powerful as all Lempel-Ziv schemes (i.e., LZ77 can be converted into an SLP with a small size increase).

2. It is easy to work with!

Problem: We are given two strings a and b described by two SLPs, where n is the total number of rules in both SLPs and N is the total length of both words. How quickly can we compute LCS(a, b)?

Whenever we are working with compressed data, the goal is to achieve running time depending just on the size of the compressed representation. Unfortunately, in our case this is not possible.

Lifshits and Lohrey, MFCS 2006: Checking if a is a subsequence of b, when both are given as SLPs, is NP-hard.

Lifshits, CPM 2007: ...but maybe $O(nN)$ running time would be possible?

Tiskin, Journal of Mathematical Sciences 158(5), 759–769 (2009): $O(nN^{1.5})$ is possible!

Hermelin, Landau, Landau and Weimann, STACS 2009: $O(n^{1.4} N^{1.2})$ is possible!

Tiskin, SODA 2010: Well, actually $O(nN \log N)$ is possible!

Hermelin, Landau, Landau and Weimann, STACS 2009 (full): ...finally, $O(nN \log\frac{N}{n})$ is possible.

This paper: an $O\left(nN\sqrt{\log\frac{N}{n}}\right)$ time algorithm.

The aforementioned papers actually solve a more general weighted version of the problem in the same complexity, for any rational scoring function. The same is true for the improvement, but today we will focus just on the unweighted case.

High-level idea: same as in the previous solutions, look at the alignment dag...

[Figure: alignment dag of two strings, with boundary labels A B C B A A and B A C B B A.]

... and notice that if the strings are highly repetitive, many fragments of the dag look somehow similar, and we can avoid duplicating work.

[Figure: the dag partitioned into blocks of side x, with boundary labels X1, ..., X6 and X′1, ..., X′6.]

To make life simple, we want to partition the whole dag into blocks corresponding to pairs of nonterminals.

Theorem: Given an SLP and a parameter x, we can (quickly) construct a new SLP of roughly the same size with the following two properties:

1. each nonterminal describes a string of length at most x,

2. the original string can be represented as a concatenation of roughly N/x new nonterminals.

x will be chosen later.

We want to compute only the values on the boundaries between blocks. Consider the situation around a single block.

[Figure: a single block with inputs I−6, ..., I6 along its left and top boundary and outputs O0, ..., O12 along its bottom and right boundary.]

We need to take a closer look at how the values on the right and bottom boundary (outputs) depend on the values on the left and top boundary (inputs).

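Let $H_{\alpha,\beta}(i, j)$ be the score of the highest-scoring path from the i-th input to the j-th output of the block corresponding to words α and β. Then the matrix $H_{\alpha,\beta}$ has a very simple structure:

$$H_{\alpha,\beta}(i, j) = j - i - P^{\Sigma}(i, j)$$

where P is a permutation matrix and $P^{\Sigma}(i, j)$ counts the nonzeros of P in a quadrant determined by the point (i, j).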
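For reference, the dominance sum can be written out explicitly. The index convention below is the one used in Tiskin's work on semi-local string comparison and is assumed here; the slide itself does not spell out the index ranges. The nonzeros of P are placed at half-integer positions, while i and j are integers.

```latex
P^{\Sigma}(i, j) \;=\; \sum_{\hat\imath > i,\ \hat\jmath < j} P(\hat\imath, \hat\jmath)
```

Under this convention $P^{\Sigma}(i, j)$ is the number of nonzeros below and to the left of the point (i, j). Because P has at most one nonzero in every row and column, moving to an adjacent entry changes $P^{\Sigma}$ by 0 or 1, so adjacent entries of H differ by at most 1.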

[Figure: a permutation matrix P with its nonzeros marked by 1 and a query point (i, j).]

Hence each $H_{\alpha,\beta}$ can be represented in a succinct form by simply storing the nonzeros of the permutation matrix. The question is: how do we compute this representation efficiently?

Theorem (Tiskin, SODA 2010): Given the representations of $H_{\alpha',\beta}$ and $H_{\alpha'',\beta}$, we can compute the representation of $H_{\alpha'\alpha'',\beta}$ in $O(x \log x)$ time, where $|\alpha'|, |\alpha''|, |\beta| \le x$. Note that $H_{\alpha'\alpha'',\beta}$ is simply the max-product of $H_{\alpha',\beta}$ and $H_{\alpha'',\beta}$.
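To fix what "max-product" means, the sketch below computes it naively on explicit matrices. It reads the term as the (max, +) product, which is an assumption about the slide's terminology, and the function name `max_plus_product` is hypothetical. This is the cubic-time definition only: it is not Tiskin's O(x log x) algorithm, which works directly on the permutation representation, and it ignores the index bookkeeping needed to line up the boundaries of the two blocks.

```python
def max_plus_product(A, B):
    """Naive (max, +) product: C[i][j] = max over k of A[i][k] + B[k][j]."""
    rows, inner, cols = len(A), len(B), len(B[0])
    return [
        [max(A[i][k] + B[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


A = [[0, 1], [2, 3]]
B = [[4, 5], [6, 7]]
C = max_plus_product(A, B)
```

Each entry of the result is the best score of a path that goes through some intermediate boundary point k, which is why composing two blocks corresponds to this product.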

Theorem: Given the representation of $H_{\alpha,\beta}$ and the vector u storing the input values, we can compute the vector v storing the output values in $O(x)$ time, where $|\alpha|, |\beta| \le x$. Note that v is simply the max-product of $H_{\alpha,\beta}$ and u. This assumes the word RAM model with $w \in \Omega(\log n)$.
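As a baseline for comparison, the matrix-vector max-product can be evaluated directly in quadratic time when H is given as an explicit matrix. The sketch assumes the (max, +) reading of "max-product" and uses a hypothetical function name `max_plus_apply`; the theorem above achieves O(x) by working on the succinct representation instead.

```python
def max_plus_apply(H, u):
    """Naive (max, +) matrix-vector product: v[j] = max over i of u[i] + H[i][j]."""
    return [
        max(u[i] + H[i][j] for i in range(len(u)))
        for j in range(len(H[0]))
    ]


H = [[0, 1, 2],
     [-1, 0, 1],
     [-2, -1, 0]]
u = [3, 5, 4]
v = max_plus_apply(H, u)
```

Here u holds the best scores known on the inputs of a block and v the resulting best scores on its outputs.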

Fast matrix-vector multiplication

We reduce the question to a very simple data structure problem, which is to maintain an array of length x under two operations:

1. decreasing all values in some prefix of the array by one,

2. computing the maximum in the whole array.

This can be reduced further to an even simpler problem, which is to maintain a partition of [1, x] into a number of disjoint segments under the following two operations:

1. locating the segment a given element belongs to,

2. merging two adjacent segments.

This is known as the interval-union-find problem, and a (not very complicated) amortized constant time solution is known.
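The interface of the interval-union-find problem can be illustrated with an ordinary union-find restricted to adjacent segments. This sketch is not the amortized constant-time structure the slide refers to: it uses path compression only, which is simpler but gives a weaker (logarithmic amortized) bound. The class and method names are hypothetical.

```python
class IntervalUnionFind:
    """Partition of 1..x into consecutive segments, each named by its right end."""

    def __init__(self, x: int) -> None:
        self.x = x
        # parent[i] == i means that i is the right end of its segment.
        self.parent = list(range(x + 1))  # index 0 is unused

    def find(self, i: int) -> int:
        """Return the right end of the segment containing i."""
        root = i
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited element directly at the root.
        while self.parent[i] != root:
            self.parent[i], i = root, self.parent[i]
        return root

    def merge_with_next(self, i: int) -> None:
        """Merge the segment containing i with the segment to its right, if any."""
        right_end = self.find(i)
        if right_end < self.x:
            self.parent[right_end] = right_end + 1


uf = IntervalUnionFind(5)
uf.merge_with_next(2)   # segments: {1}, {2, 3}, {4}, {5}
uf.merge_with_next(2)   # segments: {1}, {2, 3, 4}, {5}
```

Parent pointers always lead to a larger index, so a segment is identified by its right end and merging never needs to compare segment sizes.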

Now we only have to combine those two theorems and compute the total running time, which turns out to be proportional to:

$$n^2 x \log x + \frac{N^2}{x}$$

Choosing $x = \frac{f}{\sqrt{\log f}}$, where $f = \frac{N}{n}$, gives us $\log x \le \log f$, and hence we get the following bound on the running time:

$$n^2 f \sqrt{\log f} + \frac{N^2}{f}\sqrt{\log f} = O\left(nN\sqrt{\log f}\right) = O\left(nN\sqrt{\log\frac{N}{n}}\right)$$
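The balancing step can be checked numerically. The sketch below plugs x = f / √(log f) into the two cost terms and confirms that both equal nN·√(log f). The function names are hypothetical, constants hidden by the O-notation are ignored, base-2 logarithms are an arbitrary choice, and N > n is assumed so that log f is positive.

```python
import math


def block_size(n: int, N: int) -> float:
    """The choice x = f / sqrt(log f) with f = N / n."""
    f = N / n
    return f / math.sqrt(math.log2(f))


def cost_terms(n: int, N: int, x: float):
    """The two terms of the running time, using the bound log x <= log f."""
    f = N / n
    build = n * n * x * math.log2(f)  # composing matrices for pairs of nonterminals
    sweep = N * N / x                 # (N/x)^2 blocks, O(x) work per block
    return build, sweep


n, N = 2**10, 2**30
build, sweep = cost_terms(n, N, block_size(n, N))
target = n * N * math.sqrt(math.log2(N / n))
```

For these values both terms coincide with the target up to floating-point rounding, which is exactly the point of the chosen x: neither term dominates the other.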

1. Can we avoid bit manipulation?

2. Can we get rid of the annoying $\sqrt{\log\frac{N}{n}}$ factor?

Questions?
