An Algorithmic Treatment of Strong Queries

Ravi Kumar
Yahoo! Research, 701 First Avenue, Sunnyvale, CA 94089
[email protected]

Silvio Lattanzi*
Dipartimento di Informatica, La Sapienza Univ. of Rome, Roma 00198, Italy
[email protected]

Prabhakar Raghavan
Yahoo! Research, 701 First Avenue, Sunnyvale, CA 94089
[email protected]

* Most of this work was done while the author was at Yahoo! Research.

ABSTRACT

A strong query for a target document with respect to an index is the smallest query for which the target document is returned by the index as the top result for the query. The strong query problem was first studied more than a decade ago in the context of measuring search engine overlap. Despite its simple-to-state nature and its longevity in the field, this problem has not been sufficiently addressed in a formal manner. In this paper we provide the first rigorous treatment of the strong query problem. We show an interesting connection between this problem and the set cover problem, and use it to obtain basic hardness and algorithmic results. Experiments on more than 10K documents show that our proposed algorithm performs much better than the widely-used word frequency-based heuristic. En route, our study suggests that fewer than four words on average can be sufficient to uniquely identify web pages.

Categories and Subject Descriptors: H.3.m [Information Storage and Retrieval]: Miscellaneous
General Terms: Algorithms, Experimentation, Theory
Keywords: Strong query, Set cover, Greedy algorithm

1. INTRODUCTION

A strong query for a target document with respect to an index (e.g., a search engine) is the smallest query for which the target document is returned as the top result for the query by the index. Strong queries were first studied in the web context more than a decade ago in the seminal work of Bharat and Broder [2]. They used such queries for measuring the overlap among various search engines. By finding a strong query for a chosen document, it becomes possible to check if a search engine's index contains the document (or a copy of it). While search engines these days support the URL itself as a query, this was not the case a decade ago. But, even in today's world, without using strong queries it is tricky to conclusively ascertain whether a document is present in a search engine's index: URL normalization, CGI- and session-parameters, and duplicate copies are the prime culprits that make this task challenging.

The importance of strong queries goes beyond their initial application to measuring search engine overlap and similar competitive analysis. Strong queries are a lens into the ranking function of the search engine. Understanding the behavior of the search engine on strong queries, and exploring the ability and ease of constructing strong queries, reveals valuable information about the innards of the ranking function. Strong queries can also be used to detect plagiarism, by checking for the existence of copies of a given document. By altering the words chosen to be in the strong query (i.e., varying the lexicon), interesting and unusual aspects of the document and the search engine can be revealed. In addition, a strong query can be seen as a summarizing signature of a document in an index. Lastly, strong queries have a fascinating entertainment angle: the puzzle at Google's booth at WWW 2010 in Raleigh was in fact a version of the strong query problem!

Despite its relatively long existence in the field, the strong query problem has not been sufficiently addressed in a formal manner. The simple and elegant appeal of the strong query problem makes this even more surprising. There have been several heuristics for the problem; many of them use statistical information such as word frequencies or shingle frequencies in order to construct the strong query.

Our results. In this paper we study the strong query problem and its variants from an algorithmic point of view. We first formalize the basic problem using a term-document matrix, where the goal is to find the minimal length query to retrieve a given document as the top result. In this setting we show the following results: (i) when the term-document matrix is binary, finding the minimal length query is similar to the set cover problem, and (ii) when the term-document matrix is weighted, even the decision version of the strong query problem (i.e., is there a query of a certain length) becomes as hard as the independent set problem.

We then turn to algorithms. We first analyze the popularly used frequency-based algorithm: rare words make up the strong query. We show that this algorithm can be quite poor, even in a stochastic setting where entries in the term-document matrix are generated according to an i.i.d. process.

We then propose a new algorithm, based on the greedy algorithm for set cover. While this algorithm already gives a provable guarantee, we show that the guarantee is even better in a stylized stochastic setting. This analysis could be of independent interest, since there are several web mining tasks that are based on set cover-like greedy algorithms and they could benefit from an improved performance bound (for example, see [6] and the references therein).

We then generalize the strong query problem to the k-strong query problem: given a set of k target documents, find the smallest query such that the top k results for it are the target set. This problem is quite hard: even the decision version for k = 2 is as hard as the independent set problem.

We implement these algorithms and study their performance in practice. Our target documents are generated from the results of web search queries. The experiments show that the greedy algorithm performs much better than the frequency-based algorithm, achieving a 15% improvement in the average length of the strong queries. Two nuggets are worth noting from our experimental results: (i) on average, queries with fewer than four words are sufficient to uniquely locate a given target document; this is intriguing, given the tens of billions of documents in a search engine index, and (ii) if we use web queries as the lexicon instead of a standard English lexicon, then the average strong query length goes up by one; this offers a new viewpoint contrasting the standard lexicon and the set of web queries.

2. FORMULATION

Let M be a term-document matrix, where the rows correspond to the set D of documents, the columns correspond to the set T of terms, and M_{dt} denotes the score (or term weight) of the term t in the document d. We consider the case when the entries of M are positive numbers and the case when they are binary; we will refer to the former as the weighted case and to the latter as the binary case. A query is a subset of terms. The score of a query T' ⊆ T on a document d ∈ D is given by

  score(T', d) = Σ_{t ∈ T'} M_{dt}.

The decision version of the k-strong query problem is: given a subset D' ⊆ D of documents with |D'| = k, is there a query T' such that score(T', d') ≥ score(T', d) for all d' ∈ D' and d ∈ D \ D'? The minimization version seeks the smallest such query, i.e., it minimizes |T'|. When k = 1, we call this the strong query problem and we call the single document in D' the target document.¹ In the following we assume that |T| is at most polynomial in |D|. Finally, we recall the decision version of the minimum set cover problem: given subsets S1, ..., Sm of a universe S and an integer k, is there a covering of S by at most k of the given subsets?

1 Note that if the queries are conjunctive, then the k-strong query problem for k > 1 is not interesting since the problem of finding a query to obtain the given k documents is the same as finding a query to obtain the single ‘virtual’ document that is the intersection of the given k documents.
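To make the formulation concrete, the following small sketch computes score(T', d) and checks the k = 1 decision condition. The dict-of-dicts representation of M and the toy corpus are our own illustrative assumptions, not part of the paper.

from typing import Dict, Set

# M[d][t] is the weight of term t in document d (absent terms have weight 0).

def score(M: Dict[str, Dict[str, float]], query: Set[str], d: str) -> float:
    """score(T', d) = sum of M[d][t] over the terms t in the query T'."""
    return sum(M[d].get(t, 0.0) for t in query)

def is_strong_query(M: Dict[str, Dict[str, float]], query: Set[str], target: str) -> bool:
    """Decision version for k = 1: the target must score at least as high
    as every other document, i.e., it is returned as the top result."""
    s = score(M, query, target)
    return all(score(M, query, d) <= s for d in M if d != target)

# Toy binary example: {"set", "cover"} is a strong query for document "d1".
M = {
    "d1": {"strong": 1, "query": 1, "set": 1, "cover": 1},
    "d2": {"strong": 1, "query": 1, "set": 1},
    "d3": {"query": 1, "cover": 1},
}
assert is_strong_query(M, {"set", "cover"}, "d1")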

3. HARDNESS

In this section we show basic hardness results for the strong query problem. We consider both the decision and the minimization versions for the binary and the weighted cases.

3.1 Binary case

First, we consider the simplified case where the matrix M is binary. In this case the decision version of the strong query problem turns out to be trivial, while the minimization version becomes hard.

Claim 1. The decision version of the strong query problem is in polynomial time.

Proof. Let D' = {d'} be the given document. Let T' ⊆ T be the set of terms t such that M_{d't} = 1. Now, if for each d ∈ D \ D' there is a t ∈ T' such that M_{dt} = 0, then the problem is feasible. This condition can be checked easily.

Unfortunately, this is true only for the decision version. We now show that the minimization version is as hard as set cover.

Lemma 2. The minimization version in the binary case of the strong query problem is equivalent to the minimum set cover problem.

Proof. Let {d1, ..., dn} be the set of documents and {t1, ..., tm} be the set of terms. Let d1 be the target document. The minimization version of the binary strong query problem asks the following: is there a query with at most k terms such that the score of d1 is more than the score of any other document? First note that we can restrict our attention to the terms that are contained in the document d1. Our goal is to find the minimum set of terms such that every other document lacks at least one of those terms. Now this problem is a set cover instance where the terms are the objects in S, and a subset Si contains the object tj if and only if the document di does not contain the term tj. Thus a query with at most k terms such that the score of d1 is more than the score of any other document exists if and only if there is a k-covering of S in {S2, ..., Sn}.

Conversely, consider the decision version of the minimum set cover problem. We can create an instance of the binary strong query problem with m + 1 documents d0, ..., dm and n terms as follows. The document d0 contains all the terms, and the document di, for 1 ≤ i ≤ m, contains the term tj if and only if j ∉ Si. In this case a k-covering of S exists if and only if there is a query with at most k terms such that the score of d0 is bigger than the score of any other document.

Using Lemma 2 and the results in [19, 11, 20], we obtain the following.

Corollary 3. The minimization version in the binary case of the strong query problem cannot be approximated in polynomial time within a factor of (1 − ε) log n, for any constant ε > 0, unless NP ⊆ DTIME(n^{log log n}).

Corollary 4. The minimization version in the binary case of the strong query problem is approximable within ln(|S|/opt) + O(log log(|S|/opt)) + O(1).
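For illustration, a minimal sketch of the binary-case viewpoint behind Claim 1 and Lemma 2: a term t of the target "covers" a document d exactly when d lacks t, and the feasibility test checks that every non-target document is covered by some candidate term. The dict-of-sets representation of M is an assumption made for the sketch.

from typing import Dict, Set

# M[d] is the set of terms of document d (binary case).

def build_cover_instance(M: Dict[str, Set[str]], target: str) -> Dict[str, Set[str]]:
    """For each term t of the target, the set S_t of documents that t excludes
    (i.e., documents not containing t), as in the reduction of Lemma 2."""
    others = [d for d in M if d != target]
    return {t: {d for d in others if t not in M[d]} for t in M[target]}

def strong_query_exists(M: Dict[str, Set[str]], target: str) -> bool:
    """Claim 1: a strong query exists iff every other document misses at least
    one term of the target."""
    return all(not M[target] <= M[d] for d in M if d != target)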

3.2 Weighted case

Now we show that if M is weighted, then the decision version of the strong query problem can be solved in polynomial time if and only if NP ⊆ DTIME(n^{log log n}), and the minimization version is as hard to approximate as the maximum independent set problem.

Lemma 5. The decision version of the strong query problem in the weighted case is in polynomial time if and only if NP ⊆ DTIME(n^{log log n}).

Proof. To prove the statement we show that there is a reduction from the decision version of set cover to the decision version of the strong query problem in the weighted case. Given a set cover instance, we construct the documents d1, ..., d_{m+2} and the terms t1, ..., t_{n+1} as follows. For k ≤ n, the term t_k corresponds to the object k, and t_{n+1} is a new term. We construct the documents and the term weights as follows. First, d1 contains all the terms t_i with i ≤ n with weight one, and the term t_{n+1} with weight k + 1. Next, the document d2 contains all the terms t_i with i ≤ n with weight two. Finally, for every j > 2, d_j contains, with weight one, all the terms whose corresponding object is not contained in S_{j−2}, plus the term t_{n+1} with weight k + 1. For clarity, the term-document matrix is shown below:

                      t1 ... tn                 t_{n+1}
  d1                  1 ... 1                   k + 1
  d2                  2 ... 2                   0
  d3 ... d_{m+2}      complemented set cover    k + 1

To see the reduction, first notice that in order for d1 to be the element with the highest score, the term t_{n+1} has to be in the query and at most k terms among t1, ..., tn can be in the query. Furthermore, for every document d_i with i > 2, there should be at least one term t_j in the query that has a score of 0 in the document d_i. Thus a strong query exists if and only if there is a covering composed of at most k subsets of the given set cover instance. The claim follows from the inapproximability of set cover [19, 11].

Lemma 6. The minimization version in the weighted case of the strong query problem cannot be approximated in polynomial time within a factor of (1/2)|T|^{1/2−ε}, for any constant ε > 0, unless P = NP.

Proof. We show an approximation preserving reduction to the minimization version of the strong query problem from the maximum independent set problem. Given a simple graph G(V, E), the decision version of maximum independent set asks: is there an independent set with at least k vertices? Given an instance of the maximum independent set problem, we construct the instance of the strong query minimization problem as follows. Let d1, ..., d_{3n+3} be the documents and t1, ..., t_{2n+1} be the terms. Document d1 has weight one for each term; document d2 has weight one for each term except t_{2n+1}, which has weight zero; document d3 has weight zero for each term except t_{2n+1}, which has weight one. Document d_{i+3}, for 1 ≤ i ≤ n, has weight one for term t_j, with j ≤ n, if and only if (v_i, v_j) ∈ E and zero otherwise; furthermore, it has weight k for the term t_{n+i} and weight one for all the remaining terms. Document d_{n+3+i}, for 1 ≤ i ≤ n, has weight two for term t_i, zero for the terms t_{n+i} and t_{2n+1}, and weight one for all the remaining terms. Document d_{2n+3+i}, for 1 ≤ i ≤ n, has weight zero for the terms t_i and t_{2n+1}, weight two for the term t_{n+i}, and weight one for all the remaining terms. Figure 1 shows the term-document matrix of the reduction.

                          t1 ... tn            t_{n+1} ... t_{2n}   t_{2n+1}
  d1                      1 ... 1              1 ... 1              1
  d2                      1 ... 1              1 ... 1              0
  d3                      0 ... 0              0 ... 0              1
  d4 ... d_{n+3}          adjacency matrix     J + kI               0
  d_{n+4} ... d_{2n+3}    J + I                J − I                0
  d_{2n+4} ... d_{3n+3}   J − I                J + I                0

Figure 1: Term-document matrix for the hardness of the weighted strong query problem. Here, I is the n × n identity matrix and J is the n × n all-ones matrix.

The decision version of the strong query minimization problem in the weighted setting asks: is there a query with at most 2k + 1 terms such that the score of d1 is bigger than the score of any other document? To see the reduction, first notice that d2 forces t_{2n+1} to be in the query and d4 forces the query to also contain some other terms. Furthermore, in order for d1 to have a score higher than the documents d_{n+4}, ..., d_{3n+3}, we have that if the term t_i is in the query, then the term t_{i+n} has to be in the query if i ≤ n, and the term t_{i−n} has to be in the query if n < i ≤ 2n. Now, in order for d1 to have a score higher than all the other documents, at least one term t_i with n < i ≤ 2n must be in the query. But this implies that the document d_{i+3} has weight zero for at least k + 1 terms of the query, so it has k zeros for the selected terms in the adjacency matrix. Thus the query contains at least k terms t_{i1}, ..., t_{ik} from the first n terms, and all those terms are not contained in the documents d_{i1+3}, ..., d_{ik+3}. But this happens if and only if v_{i1}, ..., v_{ik} form an independent set in G. The claim follows from the inapproximability of the maximum independent set problem [12].
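The block structure of Figure 1 is mechanical to build; the sketch below (our own illustration, assuming numpy and a 0/1 adjacency matrix A) constructs the weighted term-document matrix of the Lemma 6 reduction.

import numpy as np

# Rows are documents d1..d_{3n+3}, columns are terms t1..t_{2n+1}, following
# the block layout of Figure 1; k is the target independent-set size.

def weighted_instance(A: np.ndarray, k: int) -> np.ndarray:
    n = A.shape[0]
    J, I = np.ones((n, n)), np.eye(n)
    M = np.zeros((3 * n + 3, 2 * n + 1))
    M[0, :] = 1                                   # d1: all ones
    M[1, :] = 1; M[1, 2 * n] = 0                  # d2: ones, except t_{2n+1}
    M[2, 2 * n] = 1                               # d3: zeros, except t_{2n+1}
    M[3:n + 3, :n] = A                            # d4..d_{n+3}: adjacency | J + kI | 0
    M[3:n + 3, n:2 * n] = J + k * I
    M[n + 3:2 * n + 3, :n] = J + I                # d_{n+4}..d_{2n+3}: J + I | J - I | 0
    M[n + 3:2 * n + 3, n:2 * n] = J - I
    M[2 * n + 3:, :n] = J - I                     # d_{2n+4}..d_{3n+3}: J - I | J + I | 0
    M[2 * n + 3:, n:2 * n] = J + I
    return M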

4. ALGORITHMS

In this section we study the performance of a frequency-based algorithm and of a greedy algorithm in an adversarial and a stochastic setting. In the former setting the term-document matrix is given by an adversary, whereas in the latter setting all the documents are generated independently and the probability that a term is in a document is the same for every document and is specified by the adversary. In both cases, we are interested in finding a good approximation for the minimization problem, assuming the worst target document. Note that the notion of approximation is the ratio of the number of terms in the strong query obtained by the algorithm to that of the strong query in the optimum solution.

We need the following form of the Chernoff bound [10].

Theorem 7 (Chernoff bound). Let X = Σ_{i=1}^{n} X_i, where the X_i are independently distributed random variables in [0, 1]. Then, for any ε > 0,

  Pr[|X − E[X]| > ε E[X]] ≤ exp(−(ε²/3) E[X]).
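The stochastic setting above is easy to simulate. The following sketch is our own illustration: term t_i is placed in each document independently with a probability p[i] chosen by the adversary; the example probabilities are loosely patterned after the construction in Lemma 9 below, scaled down.

import random
from typing import Dict, List, Set

def sample_corpus(m: int, p: List[float], seed: int = 0) -> Dict[str, Set[str]]:
    """Generate m documents; term t_i appears in each document independently
    with probability p[i]."""
    rng = random.Random(seed)
    return {
        f"d{j}": {f"t{i}" for i, pi in enumerate(p) if rng.random() < pi}
        for j in range(m)
    }

# Example: 1000 documents, one rare term and a few very common ones.
corpus = sample_corpus(1000, [2 / 1000] + [1 - 1 / 1000] * 5)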

4.1 Frequency-based algorithm

Bharat and Broder [2] introduced a frequency-based algorithm for the strong query problem. In their own words:

  "The strong query is constructed as follows. The page is fetched from the Web and we analyze the contents to compute a conjunctive query composed of the k (say 8) most significant terms in the page. Significance is taken to be inversely proportional to frequency in the lexicon. Words not found in the lexicon are ignored since they may be typographical errors . . ."

In the following we will refer to this algorithm as the frequency algorithm, denoted Freq. Our goal is to study the performance guarantees of Freq. In the following, we show that Freq can perform arbitrarily poorly in both the adversarial and the stochastic settings.

Lemma 8. The approximation ratio of Freq can be ℓ/2, where ℓ is the number of terms in the target document.

Proof. First notice that we are interested only in the terms that are contained in the target document d* and therefore we can restrict our attention to those terms. Consider the following instance of the problem: the target document d* contains ℓ terms and the corpus contains d* and ℓ + 1 other documents. Term t1 is contained only in d* and in one other document, say di. Terms t2, ..., t_{ℓ−2} have increasing frequencies and they are all contained in di. The last term t_{ℓ−1} has frequency ℓ − 1 and is not contained in di. In this case the optimal solution is the query t1 ∧ t_{ℓ−1}. However, it can be verified that Freq would output the query t1 ∧ · · · ∧ t_{ℓ−1}. The claim follows.

Lemma 9. The expected approximation ratio of Freq in the stochastic setting can be Ω(m / log m), where m is the number of documents in the instance.

Proof. Consider the following instance. There are Cm log m + 1 terms such that t1 appears in a document with probability 2/m and each of the other terms appears in a document with probability 1 − 1/m. Note that with probability (m choose 2)(2/m)²(1 − 2/m)^{m−2} = Θ(1), only two documents contain t1; call those documents di and dj. Furthermore, for any k > 1, the probability that di but not dj contains t_k is (1 − 1/m)/m. Thus, in expectation, there are C(1 − 1/m) log m terms contained in di but not in dj. Using the Chernoff bound, with high probability there is at least one term t* contained in di but not in dj. Note that the setting described above happens with constant probability; let di be the target document. The optimal query is t1 ∧ t*. However, Freq selects terms t1 ∧ · · · ∧ t_s until it finds a term that is not contained in dj. Now, by the Chernoff bound, with high probability no term besides t1 appears fewer than m − 10 log m times. Thus, by selecting the term with the smallest global frequency, the event of finding a term not contained in dj happens with probability at most (10 log m)/(m − 1); thus the algorithm will select s = (m − 1)/(10 log m) terms in expectation. The claim follows.
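For concreteness, a minimal sketch of the Freq heuristic as described above: rank the target's terms by increasing lexicon frequency and keep the rarest ones, skipping words absent from the lexicon. The function name, the frequency map, and the default k = 8 follow the quoted description but are otherwise our own assumptions.

from typing import Dict, List, Set

def freq_query(target_terms: Set[str], lexicon_freq: Dict[str, int], k: int = 8) -> List[str]:
    """Return the k least frequent lexicon words of the target document."""
    in_lexicon = [t for t in target_terms if t in lexicon_freq]   # ignore out-of-lexicon words
    return sorted(in_lexicon, key=lambda t: lexicon_freq[t])[:k]

In the analyses of Lemmas 8 and 9, Freq is viewed as adding terms in this order until the query becomes strong.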

4.2 Greedy algorithm

In this section we study the performance of a greedy algorithm inspired by the algorithm introduced by Johnson [13] and Lovász [16] to approximate the set cover problem. The greedy algorithm (Greedy) for the strong query problem works as follows:

1. Add to the query q the term t_i of the target document with the minimum frequency in the current corpus.
2. Delete from the corpus all documents that do not contain t_i, and compute the new frequencies of the terms.
3. If the corpus contains only the target document, then output q; otherwise return to step 1.

It is easy to see that if a strong query exists, Greedy always returns it, since the target document is never discarded. Using Lemma 2 and the analysis of the greedy set cover algorithm by Slavík [20], we get the following worst-case result.

Corollary 10. Greedy approximates the minimization version in the binary case of the strong query problem within a factor of ln(|S|/opt) + O(log log(|S|/opt)) + O(1).

Observe that this is already a huge improvement over the approximation guarantee of Freq. Next, we will improve this further under the stochastic setting.
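A short sketch of Greedy in the binary case, following the three steps above; the dict-of-sets representation of the corpus is an assumption made for the sketch, not the paper's implementation.

from typing import Dict, List, Set

def greedy_query(M: Dict[str, Set[str]], target: str) -> List[str]:
    """Repeatedly add the target's rarest term in the surviving corpus and
    discard the documents that do not contain it; the target is never discarded."""
    corpus = {d: terms for d, terms in M.items() if d != target}
    query: List[str] = []
    while corpus:
        # frequency of each unused target term among the surviving documents
        freq = {t: sum(t in terms for terms in corpus.values())
                for t in M[target] if t not in query}
        if not freq or min(freq.values()) == len(corpus):
            raise ValueError("no strong query exists for this target")
        t = min(freq, key=freq.get)               # rarest remaining term of the target
        query.append(t)
        # keep only the documents that still tie with the target, i.e. contain t
        corpus = {d: terms for d, terms in corpus.items() if t in terms}
    return query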

4.2.1 An improved analysis in the stochastic setting

In this section we show that Greedy in the stochastic setting achieves an approximation ratio of O(log m / log log m), with high probability. We also note that there are cases where the approximation ratio of the greedy algorithm is more than 3/2 with constant probability; we leave the problem of improving this bound as a future research direction. We will use the following result [13, 16].

Theorem 11. Let S be the largest set in a set cover instance. Then, Greedy obtains an O(log |S|) approximation.

Now, we state and prove the main result.

Theorem 12. In the stochastic setting of the strong query problem, Greedy gives an approximation of O(log m / log log m) with high probability, where m is the number of documents in the corpus.

Proof. Let opt be the set of terms in the optimal query. We use the following proof strategy. We first prove that, after we select O(|opt|) terms, with high probability we either reduce our problem to a problem that contains terms with frequencies at least m − log^4 m, or to an instance on 2m/log m documents. In the former case, we use Lemma 2 and Theorem 11 to complete the proof. In the latter case, we iterate the argument, noting that this can happen at most log m / log log m times, since in each iteration we reduce the size of our instance by a factor of Ω(log m). This leads to an O(log m / log log m) approximation.

Let p(t_i) be the probability that the term t_i appears in a document. Let sol_i = {t'_1, ..., t'_i} be the terms selected by Greedy up to the ith step, and let sol be the final set of terms selected by Greedy. Let D_{sol_i} be the set of documents that are not covered by the terms in sol_i, and let X_i^{t} be the number of documents that are covered by the term t at the ith step of Greedy. From the definition of Greedy, it follows that X_i^{t'_i} ≥ X_i^{t_j} for all t_j ∉ sol_i. Furthermore, note that if X_i^{t'_i} ≤ log^4 m, then the problem has been reduced to a problem that contains terms with frequencies at least m − log^4 m; so in the following we can assume that X_i^{t'_i} ≥ log^4 m.

Now we can upper bound the expected contribution of a term t_j ∉ sol_{i−1} as

  Pr[ E[X_i^{t_j}] ≥ (1 + 1/log m) X_i^{t'_i} ] ≤ Pr[ E[X_i^{t_j}] − X_i^{t_j} > (1/log m) X_i^{t'_i} ].

Note that since X_i^{t'_i} ≥ log^4 m, we have

  Pr[ E[X_i^{t_j}] − X_i^{t_j} > (1/log m) X_i^{t'_i} ]
    ≤ Pr[ E[X_i^{t_j}] − X_i^{t_j} > max{ log^3 m, E[X_i^{t_j}] − log^4 m } ]
    ≤ Pr[ |X_i^{t_j} − E[X_i^{t_j}]| > max{ log^3 m, E[X_i^{t_j}] − log^4 m } ]
    < exp( −(1/3) max{ (log^3 m)² / E[X_i^{t_j}], (1 − log^4 m / E[X_i^{t_j}])² E[X_i^{t_j}] } ),

which is m^{−Ω(log m)}. Now, recalling that |T| is polynomial in |D| = m, we have for every t_j ∉ sol_i and every i ≤ m, with high probability,

  p(t_j) ≤ (1 + 1/log m) X_i^{t'_i} / |D_{sol_{i−1}}|.

By a similar reasoning,

  p(t'_i) ≥ (1 − 1/log m) X_i^{t'_i} / |D_{sol_{i−1}}|.

So, for every j with t_j ∉ sol_{i−1}, we have

  p(t_j) ≤ ((1 + 1/log m)/(1 − 1/log m)) p(t'_i) ≤ (1 + 3/log m) p(t'_i),   (1)

with probability at least 1 − m^{−Ω(log m)}.

Now let opt = {t*_1, ..., t*_s} be the optimal solution for the problem and let opt_k = {t*_1, ..., t*_k}, for k ≤ s. Let Y_j be the random variable counting the number of documents covered by t*_j and not covered by any term t*_r with r < j. By definition, Σ_{j≤s} Y_j = |D|. Furthermore, note that Y_j is a sum of independent Bernoulli trials with probability p(t*_j).

Now we compare the number of documents covered by sol and by opt. Let Î be the smallest index such that, for all j > Î, we have |D_{sol_j}| ≤ |D_{opt_j}|. Let D̂ = D_{opt_Î} and let µ = Σ_{Î<j≤s} E[Y_j]. If |D̂| ≤ log^4 m, then we can use Lemma 2 and Theorem 11 to finish the proof; therefore, we can assume that |D̂| > log^4 m.

We next bound the probability that µ(1 + C) ≤ |D̂| for C > 0, and then use it to find a lower bound on the number of documents covered by sol_{|opt|}. By applying the Chernoff bound we have

  Pr[ µ(1 + C) ≤ |D̂| ] ≤ Pr[ µ(1 + C) ≤ Σ_{Î<j≤s} Y_j ] = Pr[ Σ_{Î<j≤s} Y_j − µ ≥ Cµ ]
    ≤ exp( −(C²/3) µ ) = exp( −(|D̂| − µ)² / (3µ) ).

This yields

  Pr[ µ > (1 − log|D̂| / √|D̂|) |D̂| ] ≥ 1 − m^{−Ω(log m)}.

Thus, combining the concentration of the Y_j and of the X_j^{t'_j} with (1) and with the bounds on p(t_j) and p(t'_j) above, we have, with probability 1 − m^{−Ω(log m)},

  Σ_{j≤Î} Y_j + µ ≤ (1 + log|D̂| / √|D̂|) (1 + 3/log m) Σ_{j≤Î} X_j^{t'_j}.   (2)

Now combining (1) and (2), we get that with probability at least 1 − m^{−Ω(log m)},

  |D| = Σ_{j≤Î} Y_j + Σ_{Î<j≤s} Y_j = Σ_{j≤Î} Y_j + |D̂| ≤ (1 + 2 log|D̂| / √|D̂|) ( Σ_{j≤Î} Y_j + µ ),

and hence, since |D̂| > log^4 m implies log|D̂| / √|D̂| ≤ 1/log m,

  Σ_j X_j^{t'_j} ≥ (1 − 2/log m) |D|.

That is, after |opt| steps Greedy reduces the number of uncovered documents by a factor of (log m)/2, with probability 1 − m^{−Ω(log m)}; therefore, we can iterate the argument on the subset D_{sol_{|opt|}}. After iterating O(log m / log log m) times, the instance contains at most log^4 m documents; the probability of this happening is 1 − O(log m / log log m) · m^{−Ω(log m)} = 1 − m^{−Ω(log m)}. At this point, we can apply Theorem 11 to get that |sol| ≤ |opt| · O(log m / log log m + log log m) = |opt| · O(log m / log log m). This finishes the proof.



5. K-STRONG QUERY PROBLEM

In this section we study the complexity of the k-strong query problem. We focus on the decision version of the binary 2-strong query problem and show that it is as hard as set cover.

Lemma 13. The decision version of the 2-strong query problem in the binary case is in polynomial time if and only if NP ⊆ DTIME(n^{log log n}).

Proof. As in Lemma 5, we show a reduction from the decision version of set cover to the decision version of the binary 2-strong query problem. We construct an instance of the strong query problem: let d1, ..., d_{m+3} be the documents and t1, ..., t_{n+k+1} be the terms. The terms t_i, for i ≤ n, correspond to the objects of the set cover instance. We construct the documents as follows: d1 contains all the terms t_i such that i > n, the document d2 contains all the terms t_i such that i ≤ n + 1, and the document d3 contains all the terms t_i such that i ≤ n. For j > 3, the document d_j contains the term t_h, for h ≤ n, if and only if the corresponding object is not contained in S_{j−3}; furthermore, the document d_j also contains the term t_{n+1}. The term-document matrix is shown below for clarity:

                      t1 ... tn                 t_{n+1}   t_{n+2} ... t_{n+k+1}
  d1                  0 ... 0                   1         1 ... 1
  d2                  1 ... 1                   1         0 ... 0
  d3                  1 ... 1                   0         0 ... 0
  d4 ... d_{m+3}      complemented set cover    1         0 ... 0

Recall that the decision version of the binary 2-strong query problem asks: is there a query for which the closest documents to the query are d1 and d2? We now show the reduction. First, note that if d1 is to be one of the top two scoring documents, then the query can contain at most k terms from t1, ..., tn. Furthermore, if d2 is to be one of the top two scoring documents, then every document d_i, with i > 3, should not contain at least one of the terms in the query. Thus to solve the problem we have to solve the instance of the decision version of set cover.

Thus, the 2-strong query setting is quite hard. We leave the complexity of the minimization version, both in the adversarial and the stochastic settings, as interesting directions for further work.
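The verification side of the k-strong query problem is straightforward; the sketch below (our own illustration, reusing the dict-of-dicts representation from Section 2) checks whether a given query places the k target documents above all others.

from typing import Dict, Set

def is_k_strong_query(M: Dict[str, Dict[str, float]], query: Set[str],
                      targets: Set[str]) -> bool:
    """Every document in the target set must score at least as high as every
    document outside it."""
    def score(d: str) -> float:
        return sum(M[d].get(t, 0.0) for t in query)
    worst_target = min(score(d) for d in targets)
    best_other = max((score(d) for d in M if d not in targets), default=float("-inf"))
    return worst_target >= best_other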

6. EXPERIMENTS

In this section, we describe the experiments and results for the strong query problem. First, we begin with the data description. Next, we describe the experimental methodology and an analysis of the results.

6.1 Experimental setup

Document selection. Queries from the Yahoo! search log were used to derive the documents for testing the algorithms. We use a sample from the Yahoo! query log in order to construct our dataset. These queries were restricted to occur at least 50 times to mitigate noisy queries or queries with typos or spelling errors. We discarded queries with punctuation, numeric, and unicode characters.
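The query-selection rules above are simple to express in code; the following sketch is our own illustration (the exact patterns, the threshold of 50 occurrences, and the function name are assumptions, not the production filter).

import re

def keep_query(query: str, count: int, min_count: int = 50) -> bool:
    """Keep a query only if it is frequent enough and contains no punctuation,
    digits, or non-ASCII (unicode) characters."""
    if count < min_count:
        return False                      # mitigate noisy or misspelled queries
    if not query.isascii():
        return False                      # drop queries with unicode characters
    return re.fullmatch(r"[a-z ]+", query.lower()) is not None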

We then created sample buckets from the resulting queries. There are two kinds of buckets: one that depends on the query length and another that depends on the popularity of the query. In the first set of buckets, we restricted to queries with length from one to seven words, where the tokenization was on whitespace. The second set of buckets was created using query frequency: the ith bucket in this set, for i = 1, ..., 5, consists of queries that occur at least 50 × 10^i times. Thus we have 11 buckets, with about 100 queries in each bucket, giving us a total of 1091 queries. For each of these queries, we obtained the top ten results using Yahoo!. This gave us 10,769 documents that form the basis of our study. We used the Yahoo! search API in all our algorithms, invoking the option to return only HTML documents; the number of results sought is ten. For crawling webpages, we use wget with a timeout of ten seconds and use lynx to parse and tokenize the page, after removing HTML artifacts.

Lexicon. To avoid misspelt words in the text getting selected as part of the strong query, we base our experiments on a lexicon. For the main study, we use an English lexicon based on /usr/share/dict/words in Linux. The lexicon consists of 483K words. Of these, we discarded hyphenated and punctuated words, so that the tokenization remains clean. We ended up with 415K words and this forms our basic lexicon. We compute the frequency of the words in this lexicon by using a sample of over 10M web documents.

Algorithms. We implement two algorithms and compare their performance. The first is the frequency-based algorithm (Freq). This algorithm uses the lexicon described above and will serve as our baseline. The second algorithm is a variant of the greedy algorithm (Greedy) that we described in Section 4.2. Recall that this algorithm requires the global frequencies to be updated in each new iteration when the target document is not obtained as the top result. However, this is not feasible or efficient in practice. Therefore, we modify the algorithm as follows. Let d be the target document. If d1, ..., d10 are the documents returned for the current query, and if d1 ≠ d, then we make the global frequencies of all the words in the documents in {d1, ..., d10} \ {d} zero. That is, the next iteration of the algorithm will not use any words that are present in d1, ..., d10. This modification, though a practical convenience, has two issues: (i) we cannot prove any performance bound with this modification; in fact, this variant can be shown to have poor guarantees, and (ii) the modification may be too drastic in some situations; for example, all good candidate words might be discarded. Our experiments, however, show that the latter case does not happen too many times in practice.

Metrics. For the purpose of comparing the output of the algorithms, we use a content-based match rather than a syntactic URL match. This is to handle the case when the search engine returns a duplicate of the target document or the case when the search engine has indexed a slightly different version of the target document. We implement the content-based match by using the weighted Jaccard similarity of the bags of words in the respective documents: we say two documents A and B match if the weighted Jaccard similarity

  J(A, B) = Σ_w min{f_w(A), f_w(B)} / Σ_w max{f_w(A), f_w(B)},

where f_w(A) denotes the frequency of the word w in document A, is at least 0.9. We say a strong query algorithm fails on a target document if it cannot produce, with a strong query of at most ten words, a top result that matches the target document. The performance metrics of an algorithm are its success rate, which is the fraction of documents for which it does not fail, and its average strong query length (avg. sqlen), which is the number of words in the strong query, conditioned upon the algorithm finding a document (as the top result) that matches the target document.
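A minimal sketch of the content-based match described above, computing the weighted Jaccard similarity of two bags of words with the 0.9 threshold; the function names and the Counter-based representation are our own assumptions.

from collections import Counter
from typing import Iterable

def weighted_jaccard(doc_a: Iterable[str], doc_b: Iterable[str]) -> float:
    """Sum of per-word minimum frequencies over sum of per-word maximum frequencies."""
    fa, fb = Counter(doc_a), Counter(doc_b)
    words = set(fa) | set(fb)
    num = sum(min(fa[w], fb[w]) for w in words)
    den = sum(max(fa[w], fb[w]) for w in words)
    return num / den if den else 1.0

def documents_match(doc_a: Iterable[str], doc_b: Iterable[str], threshold: float = 0.9) -> bool:
    return weighted_jaccard(doc_a, doc_b) >= threshold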

6.2 Basic results

Table 1 shows the success rate and the average strong query length of the two algorithms.

  algorithm   failure   avg. sqlen
  Freq        0.328     3.848
  Greedy      0.370     3.257

  Table 1: Performance metrics of the algorithms.

From the results, it is clear that the Greedy algorithm obtains more than a 15% improvement in strong query length over the baseline Freq algorithm. As we speculated earlier, the success rate of the Greedy algorithm suffers a bit. It is surprising to note that under four words (not necessarily a phrase) are sufficient to, in a sense, uniquely identify a document from an index of more than a few billion documents. This can be viewed as a query analog of the highly compressible nature of the web graph [4, 5].

As an aside, we note that the average size of the target document is 221 words. The average number of words per document deleted from consideration by the Greedy algorithm is 18.

Next, we study the measure of similarity between the target document and the best matching document in the top 10 results, as a function of the iteration of the algorithm. Figure 2 shows the plot. The most interesting aspect of this is the shape of the curve. Both curves increase initially and then start declining. This indicates that the lowest frequency terms can be misleading, but after three words things start to correct, and finally they deteriorate again. From the figure, the Freq algorithm comes closer to the target than the Greedy algorithm.

  Figure 2: Average Jaccard distance between the target document and the best matching document at each step of the iteration.

6.3 Original query vs strong query overlap

Figure 3 shows the overlap between the original query that generated the document (recall our method of selecting the documents) and the strong query produced by the algorithm, as a function of the length of the query that generated the document. The measure of similarity is Jaccard, as before. The figure shows that the Greedy algorithm produces strong queries that resemble the original queries somewhat less than the Freq algorithm does. In both cases, the overlap is quite minimal, suggesting that the algorithms uncover a unique combination of words, going well beyond the query. Also, the overlap seems to decrease with query length but starts to increase after a query length of four, suggesting that longer queries tend to uniquely identify documents, which is reflected in the selection of the strong query.

  Figure 3: Overlap of original query and the strong query, as a function of the query length.

Figure 4 shows the corresponding plot for queries based on their frequency buckets. While the relative behavior between Freq and Greedy is similar, the shape of the curves is intriguing. The overlap is lower for very unpopular or very popular queries. The former happens because unpopular queries tend to hit less popular documents, but there could be other unique ways of identifying the documents. Likewise, popular queries hit popular documents, but popular documents can also be hit by many other queries (not necessarily by words contained in the document, but by anchortext, querylog, clicklog, hand-tuning, and other extraneous mechanisms).

6.4 Effect of document result position

Next we study the effect of the position of the document on the failure rate of the algorithms and on the average strong query length. Figure 5 shows the plots. The failure rate is largely constant. The average query length, on the other hand, has a distinct increase. It shows that documents that are ranked lower in search results require longer queries to uniquely identify them (for both algorithms). This is an interesting artifact of the search engine: one has to work a lot to 'beat out' the top results of a search engine in order to get the lower-ranked target document.


Figure 4: Overlap of original query and the strong query, as a function of the query frequency.

  length   Freq failure   Freq avg. sqlen   Greedy failure   Greedy avg. sqlen
  1        0.3379         3.5824            0.3729           2.9519
  2        0.3750         3.8766            0.4053           3.3087
  3        0.3842         3.7810            0.4252           3.2500
  4        0.3765         3.9392            0.4241           3.4088
  5        0.3837         3.9009            0.4428           3.3617
  6        0.3934         3.9316            0.4436           3.2946

  frequency   Freq failure   Freq avg. sqlen   Greedy failure   Greedy avg. sqlen
  1           0.3523         3.8172            0.4031           3.2951
  2           0.3868         3.9181            0.4321           3.2894
  3           0.3574         3.8310            0.4156           3.1740
  4           0.3750         3.7397            0.4236           3.2065
  5           0.4168         3.6276            0.4576           3.1402

  Table 2: Effect of query properties on the performance.

6.5 Effect of query properties

In this section, we analyze the role of the properties of the original query in determining the failure rate and the average strong query length. We take each document and study the performance of the algorithms as a function of the length and the frequency of the original query that produced this document; Table 2 shows the results. We can see that the failure rate increases as a function of the query length and of the query popularity. Documents corresponding to longer queries and to more popular queries are the ones on which the algorithms fail most easily. Such documents are perhaps either less popular documents or documents that have many other 'competitors' and hence may need more words to be identified. The average strong query length, on the other hand, does not show any significant trend.

6.6 Query-based lexicon

In this experiment, we vary the lexicon. Instead of using the standard English lexicon, we choose the words in the queries as the lexicon. For this purpose, we work with a day's worth of querylog. We collect the queries, apply the usual filtering rules to discard numeric and punctuated words, and end up with about 137K words. This constitutes our lexicon.


Figure 5: Failure and average strong query length as a function of the document position in search results.

Table 3 shows the basic performance of the algorithms with this lexicon. We can see that longer queries are needed in this case. Surprisingly, the failure rate did not change much from the other case.

  algorithm   failure   avg. sqlen
  Freq        0.326     4.868
  Greedy      0.375     4.157

  Table 3: Performance metrics of the algorithms for the query-based lexicon.

The rest of the numbers are mostly qualitatively similar to the earlier ones. The values, however, are different: for example, the overlap between the original query and the strong query is significantly higher than in the word-lexicon case, and the overlap between the target document and the documents produced during the iterations is significantly lower. All of this goes to show the robustness of the algorithms and, in fact, of the problem itself. For brevity, we do not repeat these plots.

7. RELATED WORK

In 1957, Luhn [17] introduced a method for document summarization that is based on short queries; this can be thought of as the genesis of the strong query problem. Much later, the problem was considered in the web search context by Bharat and Broder [2]. In both of those papers, the proposed heuristic is to use the term frequencies to construct the query: low-frequency terms are selected, but terms with very low frequency are not considered since they can be misspellings. Since then, the problem has surfaced in several settings; for example, Pereira and Ziviani [18] considered the problem in the context of detecting similar pages. Dasdan et al. [7] suggested a method for the strong query problem that is based on selecting a random sample of adjacent terms. Even though they claim an improvement over previous heuristics, their method loses the spirit of the original problem, since the queries are made up of phrases instead of words. Yang et al. [23] introduced a heuristic that is based on the frequency and an entropy-based score.

A strong query can be viewed as a minimal description of a document, and hence our problem has some relationship to the problem of finding a short summary or a snippet for a web page. In [9] the authors use a technique based on selecting phrases from a document to obtain a good summary. Snippet-finding approaches are typically based on machine learning, relevance feedback, and linguistic analysis [1, 15, 22]; however, the emphasis in this line of work is not on obtaining provable performance guarantees.

There is an extensive literature on the minimum set cover problem. Minimum set cover is NP-complete [14] and also hard to approximate to within a factor of Ω(log n) [19, 11]. Johnson and Lovász [13, 16] independently proved that the greedy algorithm that selects at each step the set with the maximum number of uncovered elements achieves an O(log n) approximation. A tighter analysis of the greedy algorithm was presented later by Slavík [20]. The literature on the stochastic analysis of set cover is relatively limited. In [3] the authors studied the problem where the ratio between the number of elements and the number of sets is constant and every possible subset is selected with uniform probability. In [21] the authors considered the setting where every element has the same probability of appearing in any set. In [8] the authors showed that by selecting the sets in a random order until all the elements are covered, one can obtain a 2-approximation algorithm. We study a generalization of this last probabilistic setting, where the probability that an element is in a set is different for each element, and we prove that the greedy algorithm is an O(log n / log log n)-approximation.

8. CONCLUSIONS

In this paper we studied the classical strong query problem from an algorithmic point of view. We formalized the problem in a term-document matrix setting and showed that the basic problem is hard, but still amenable to greedy solutions. Experiments showed that our algorithm improves upon the widely-used heuristics that are based on word frequency. By providing a formal treatment, we have opened the problem for algorithmic attack; but we have only scratched the surface and a lot more needs to be done.

We also proposed the k-strong query problem, where we are given a set of k target documents and the goal is to find a strong query that will retrieve all the k documents as the top k results. A more challenging version of the problem is to demand the k documents in a particular order. A very special case of this problem could be particularly helpful in understanding the working of a ranking function: given a query and the top two documents d1 and d2, what is the smallest query for which the order of d1 and d2 is reversed?

Interesting future work involves obtaining good heuristics for these problems. It would also be interesting to try our algorithms using bigrams and phrases; we leave this as an interesting future direction.

Acknowledgments We thank Andrei Broder for helpful discussions.

9. REFERENCES

[1] E. Amigó, J. Gonzalo, V. Peinado, A. Peñas, and F. Verdejo. Using syntactic information to extract relevant terms for multi-document summarization. In Proc. 20th COLING, 2004.
[2] K. Bharat and A. Broder. Estimating the relative size and overlap of public web search engines. In Proc. 7th WWW, pages 512–523, 1998.
[3] J. Blot, W. F. de la Vega, V. T. Paschos, and R. Saad. Average case analysis of greedy algorithms for optimisation problems on set systems. TCS, 147(1&2):267–298, 1995.
[4] P. Boldi and S. Vigna. The webgraph framework I: Compression techniques. In Proc. 13th WWW, pages 595–602, 2004.
[5] G. Buehrer and K. Chellapilla. A scalable pattern mining approach to web graph compression with communities. In Proc. 1st WSDM, pages 95–106, 2008.
[6] F. Chierichetti, R. Kumar, and A. Tomkins. Max-cover in map-reduce. In Proc. 19th WWW, pages 231–240, 2010.
[7] A. Dasdan, P. D'Alberto, S. Kolay, and C. Drome. Automatic retrieval of similar content using search engine query interface. In Proc. 18th CIKM, pages 701–710, 2009.
[8] J. Davila and S. Rajasekaran. A note on the probabilistic analysis of the minimum set cover problem. Computing Letters, 2006.
[9] J.-Y. Delort, B. Bouchon-Meunier, and M. Rifqi. Web document summarization by context. In Proc. 12th WWW (Posters), 2003.
[10] D. Dubhashi and A. Panconesi. Concentration of Measure for the Analysis of Randomised Algorithms. Cambridge University Press, 2009.
[11] U. Feige. A threshold of ln n for approximating set cover. JACM, 45(4):634–652, 1998.
[12] J. Håstad. Clique is hard to approximate within n^{1−ε}. Acta Mathematica, 182:105–142, 1999.
[13] D. S. Johnson. Approximation algorithms for combinatorial problems. In Proc. 5th STOC, pages 38–49, 1973.
[14] R. Karp. Reducibility among combinatorial problems. In Complexity of Computer Computations, pages 85–103. Plenum Press, 1972.
[15] Y. Ko, H. An, and J. Seo. An effective snippet generation method using the pseudo relevance feedback technique. In Proc. 30th SIGIR, pages 711–712, 2007.
[16] L. Lovász. On the ratio of optimal integral and fractional covers. Discrete Mathematics, 13:383–390, 1975.
[17] H. Luhn. A statistical approach to mechanized encoding and searching of literary information. IBM J. Research and Development, 1(4):309–317, 1957.
[18] A. Pereira and N. Ziviani. Retrieving similar documents from the web. J. Web Engineering, 2(4):247–261, 2004.
[19] R. Raz and S. Safra. A sub-constant error-probability low-degree test, and sub-constant error-probability PCP characterization of NP. In Proc. 28th STOC, pages 475–484, 1997.
[20] P. Slavík. A tight analysis of the greedy algorithm for set cover. In Proc. 29th STOC, pages 435–441, 1996.
[21] O. Telelis and V. Zissimopoulos. Absolute o(log m) error in approximating random set covering: An average case analysis. IPL, 94(4):171–177, 2005.
[22] A. Turpin, Y. Tsegay, D. Hawking, and H. E. Williams. Fast generation of result snippets in web search. In Proc. 30th SIGIR, pages 127–134, 2007.
[23] Y. Yang, N. Bansal, W. Dakka, P. G. Ipeirotis, N. Koudas, and D. Papadias. Query by document. In Proc. 2nd WSDM, pages 34–43, 2009.

Loading… Page 1. Whoops! There was a problem loading more pages. Queries - High School of Athens.pdf. Queries - High School of Athens.pdf. Open. Extract.