2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology

Unsupervised Discovery of Coordinate Terms for Multiple Aspects from Search Engine Query Logs

Masashi Yamaguchi, Hiroaki Ohshima, Satoshi Oyama, Katsumi Tanaka
Graduate School of Informatics, Kyoto University
Yoshida-Honmachi, Sakyo, Kyoto 606-8501, Japan
{yamaguti,ohshima,oyama,tanaka}@dl.kuis.kyoto-u.ac.jp

978-0-7695-3496-1/08 $25.00 © 2008 IEEE DOI 10.1109/WIIAT.2008.135

Abstract

A method is described for discovering coordinate terms, such as "Honda" and "Nissan," for a given term, such as "Toyota," as well as their common topic terms, from the query logs of a Web search engine. Coordinate terms are good candidates for use in making comparisons. A HITS-based algorithm is applied to a bipartite graph between coordinate term candidates and co-occurrence patterns to identify coordinate and topic terms. Spectral analysis is used to distinguish coordinate terms corresponding to different aspects of the search term. As a result, we can discover terms related to the terms in a search engine query that reflect the needs and interests of the user.

1 Introduction

Comparing one thing to another is an important step in decision making. For example, a person who is going to buy a car usually compares the features of several candidates. A person who is going to take a vacation usually considers several destinations. An investor compares the performance of various companies before buying stock in any of them. Comparison is also useful for deepening one's understanding of the target. In school, students compare countries to better understand one in particular. Comparing one historical person to another is a good way to better understand the characteristics of the target person. Nowadays, many people use a search engine to gather information from the Web for use in making comparisons. The Web contains myriad kinds of information (as well as the most up-to-date information), and the information can be gathered quickly at low cost.

We usually compare things that are similar to each other in some way. Comparing things that are totally different is not particularly useful and, in some cases, it may be nonsensical. For example, comparing a compact car to a luxury private jet as alternatives for purchase makes no sense. Generally, it is much more useful to compare the target with something similar than to compare it with something quite different. One problem in comparison is that a person may not always be aware of all the possibly good candidates for comparison. For example, a person buying a car may not know all the similar car types. A system that could present good candidates for comparison would be helpful to a user who is making a decision. Such a system could also help deepen a user's understanding of a certain target by suggesting other things that are comparable to the target.

Linguistically speaking, coordinate terms would be useful in such a system. Coordinate terms are terms that have identical hypernyms, which are words with a general meaning that includes the meanings of other particular words. For example, "Toyota," "Honda," and "Nissan" are coordinate terms because their hypernyms are the same: "car manufacturer." Since comparisons among things that are somewhat similar are more useful, coordinate terms are of particular interest. Because they share the same hypernym and share several common topics, they are good candidates for use in making comparisons.

Coordinate terms can have several common topics. For example, the coordinate terms "Toyota" and "Honda" share common topics such as dealer, used car, and rental car. The basic assumption underlying our approach to discovering coordinate terms is that terms representing these common topics tend to co-occur with coordinate terms in search engine query logs. We call these common terms topic terms. In fact, the queries "Toyota dealer," "Toyota used car," and "Toyota rental car" as well as "Honda dealer," "Honda used car," and "Honda rental car" frequently appear in query logs. Conversely, if we find terms that frequently co-occur with the topic terms in query logs, those terms can be regarded as candidate coordinate terms. For example, if "Nissan dealer," "Nissan rental car," and "Nissan used car" also frequently appear in query logs, we can say that "Nissan" is a candidate coordinate term for "Toyota."


We have developed a method for discovering terms that might be useful for comparison against a target given as a query term by a Web search engine user. It identifies coordinate terms for a given term by using only search engine query logs, which are histories of user queries on a general Web search engine. Most Web search engines save these logs, and from them we can determine which terms frequently co-occur with a certain term. Search engine query logs clearly reflect the intentions of search engine users, so they are a very useful accumulation of many people's information needs. It is therefore reasonable to expect that mining such logs will enable us to better understand the information needs of search engine users.

It is important to consider the order of the terms in a query when mining query logs. While many search engines process a query composed of several keywords as a conjunction (AND) of the keywords and do not explicitly distinguish their order, we found that the frequency of queries comprising the same keywords but in a different order varies considerably. The ordering seems to reflect a tendency to compose a query by starting with a keyword representing the main subject and then adding keywords to narrow down the topic. Since our method distinguishes queries comprising the same keywords but in a different order, it can better identify good candidates for coordinate and topic terms.

Many terms have multiple aspects (i.e., they have multiple hypernyms), and there is a set of coordinate terms for each aspect. For example, "honda" has not only the aspect of "car manufacturer" but also the aspect of "motorcycle manufacturer." The respective coordinate terms include "toyota," "ford," and "gm" for the former and "yamaha," "kawasaki," and "suzuki" for the latter. Distinguishing the coordinate terms by aspect when presenting them to a user is desirable because doing so would help him or her choose an appropriate coordinate term to use for comparison.
It can also uncover an unexpected aspect of the target entity and help the user obtain a comprehensive picture of the target. We thus extended our method to enable it to distinguish coordinate terms for different aspects.

The remainder of this paper is organized as follows: We review related studies in Section 2. The central idea of our method is described in Section 3. Section 4 describes our basic method. Section 5 describes the extension to our method that enables coordinate terms for multiple aspects to be obtained. Section 6 shows the experimental results. We conclude with a brief summary in Section 7.

2 Related Work

Several approaches to extracting ontological knowledge from search engine query logs have recently been reported. Among them, the most relevant to our work is discovering entities belonging to a certain class [5, 8, 10]. Given several example entities in a certain class, the methods using this approach discover other entities belonging to the same class. The main difference between this approach and ours is that, with this approach, the target class needs to be determined beforehand and multiple instances are needed, while, with our approach, we need not assume a single target class, and entities for each aspect can be discovered from only one given entity. Ours can thus be considered an unsupervised approach to discovering entities for multiple aspects.

There has also been work on discovering attributes of entities in a certain class by using query logs [9]. This basically corresponds to the discovery of topic terms in our approach. However, as with discovering entities, these methods can discover the attributes of only a single class (aspect) while our method can discover different topic terms for each aspect.

There has also been research on discovering related words such as coordinate terms, and there is even an online service for obtaining coordinate terms. Google Sets (http://labs.google.com/sets) provides a set of coordinate terms for terms entered by the user. Given a small subset of coordinate terms, Google Sets returns additional terms belonging to the set. For example, if a query to Google Sets consists of "Versace" and "Armani," additional terms such as "Gucci," "Chanel," "Prada," and "Calvin Klein" are returned. While the details of the algorithm used have not been disclosed, Ghahramani and Heller [3] noted that it is a large-scale clustering algorithm that uses many millions of data instances extracted from Web data. Inspired by Google Sets, Ghahramani and Heller [3] developed the "Bayesian Sets" method, which uses Bayesian estimation to identify coordinate terms. It takes a query consisting of a small set of items and returns additional items belonging to the set. The algorithm computes a score for each item by comparing the posterior probability of that item given the query set with the prior probability of that item. The algorithm is simple and low cost, and it has been applied to large data sets from a movie rating service, a collection of research papers, and an encyclopedia.

Shinzato et al. [11] described a method for identifying coordinate terms in HTML documents by focusing on the HTML structure. Terms at the same level in the structure, such as terms in a list, can be candidate coordinate terms. If the degrees of their mutual information and of co-occurrence are high, the terms are regarded as coordinate terms. Ohshima et al. [7] described a method for discovering coordinate terms by using a Web search engine. Given user keyword x, it submits two queries, "x or" and "or x," and extracts from the search result snippets the preceding and following terms as candidate coordinate terms. It ranks the candidates on the basis of their frequencies in the search results.
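The posterior-versus-prior scoring used by Bayesian Sets can be sketched for binary feature vectors. This is a minimal sketch assuming independent Bernoulli features with Beta(α, β) priors, from which the score p(x | seed set) / p(x) has a simple closed form; the items and feature vectors below are invented for illustration and are not from the paper's experiments.

```python
# Hedged sketch of Bayesian Sets-style scoring for binary features.
# Assumes independent Bernoulli features with Beta(ALPHA, BETA) priors;
# all items and feature vectors are invented for illustration.
ALPHA, BETA = 2.0, 2.0

# feature vectors: which (invented) co-occurrence contexts each term appears in
items = {
    "versace": [1, 1, 1, 0],
    "armani":  [1, 1, 0, 0],
    "gucci":   [1, 1, 1, 0],
    "toyota":  [0, 0, 0, 1],
}

def bayes_sets_score(x, seed_names):
    """Score p(x | D) / p(x) for one candidate vector x given seed set D."""
    seed = [items[s] for s in seed_names]
    N = len(seed)
    score = 1.0
    for j, xj in enumerate(x):
        nj = sum(v[j] for v in seed)  # successes for feature j in the seed set
        if xj:
            post = (ALPHA + nj) / (ALPHA + BETA + N)
            prior = ALPHA / (ALPHA + BETA)
        else:
            post = (BETA + N - nj) / (ALPHA + BETA + N)
            prior = BETA / (ALPHA + BETA)
        score *= post / prior
    return score

seeds = ["versace", "armani"]
ranked = sorted((n for n in items if n not in seeds),
                key=lambda n: bayes_sets_score(items[n], seeds), reverse=True)
print(ranked)
```

With the fashion brands as the seed set, "gucci" (whose contexts overlap the seeds) scores above 1 and "toyota" scores below 1, mirroring the intended behavior of the method.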


Church et al. [2] described a method for identifying terms semantically related to a certain term by using mutual information. While not all the identified terms are coordinate terms, some of them are. Subsequent research has exploited this idea: terms with high mutual information are more likely to be semantically related. Lin [6] described a method that generates clusters of similar terms, along with a similarity measure in which the similarity between two objects is defined as the amount of common information in their descriptions. Using modification relations, it calculates term similarities and generates clusters of similar terms; a large corpus with modification relations is therefore required. Turney [12] and Baroni et al. [1] described methods for discovering synonyms. They use co-occurrence or mutual information calculated using the estimated total number of term occurrences in search engine results. They demonstrated that data collected using a Web search engine can be used in place of data collected using conventional large-corpus analysis.


Figure 1. Process for obtaining co-occurrence patterns.

3 Basic Idea for Discovering Coordinate Terms from Search Engine Query Logs

Our proposed method is based on the co-occurrence of terms in query logs. Query logs are histories of users' queries on a Web search engine, and Web search engines generally retain such logs. A search engine query is likely to contain several terms separated by spaces. In this paper, the combination of terms in a query, such as "Toyota used car," is called a log record. The collection of log records for a search using "Toyota used car" is called the log collection for "Toyota used car."

Queries containing "Toyota" produce a variety of co-occurrent terms in the query log: Toyota product names such as "Corolla" and "Lexus," "used car," "dealer," etc. Queries containing "Honda" or "Nissan" produce many co-occurrent terms that differ from those for "Toyota." However, there are some common ones, such as "car," "used car," "rental car," and "dealer." Similarly, for "tennis," "golf," and "soccer," the common co-occurrent terms include "goods," "practice," and "school."

While the basic idea of our proposed method is that coordinate terms have common co-occurrent terms in search engine query logs, the order of the terms in a query is significant. Namely, the query "Toyota used car" differs from "used car Toyota." For example, during April 2006, the query "Toyota used car" was entered 75,200 times into Overture, while "used car Toyota" was entered 3019 times. The coordinate terms for "Toyota" are generally the same when it is used with "used car" in a query. For example, the frequency of the query "Honda used car" was 9013 while that of "used car Honda" was 1759. As mentioned above, the first word in a multiword query often is the main subject for the search, and the subsequent words generally refine the search. We can identify such term order tendencies in the query logs and exploit them.

In our method, we obtain terms that match the co-occurrence pattern, which enables us to express both co-occurrence and order (Figure 1). For "Toyota" in "Toyota used car," we acquire "x used car" as a co-occurrence pattern. If there is a "Honda used car" query in the log, "Honda" matches the co-occurrence pattern, so "Honda" is a candidate term. If there is a "used car dealer" query, "dealer" does not match the co-occurrence pattern, so "dealer" is not considered a candidate because of the difference in term order. Let LC(q) denote a log collection consisting of log records that match keyword or co-occurrence pattern q. For example,

LC("Toyota") = {"Toyota dealer", "Corolla Toyota", "Toyota rental car", . . .}
LC("x dealer") = {"Honda dealer", "Nissan dealer", . . .}.

Note that we limit the size of LC(q) due to computational cost and search engine restrictions. Moreover, in most query log databases, we cannot use co-occurrence patterns including variable x to search for logs. Therefore, if we want to obtain LC("x dealer"), for example, we first find logs containing "dealer" in the query and then eliminate logs like "dealer inspection" that do not match the co-occurrence pattern. Our method for finding candidate coordinate and topic terms comprises five steps.


1. Receive query q from user.


2. Obtain log collection LC(q).


3. Find co-occurrence patterns {p_1, p_2, . . . , p_k} for query q from LC(q) (k is the number of patterns found). These co-occurrence patterns are candidate topic terms.


4. Obtain log collections LC(p1 ), LC(p2 ), . . . , LC(pk ).


5. Find terms that match x in co-occurrence pattern p_i from each LC(p_i) (1 ≤ i ≤ k). They are the candidate coordinate terms of q.


For example, the query “Toyota” is processed as follows.


1. Receive “Toyota” query from user. 2. Obtain LC(“Toyota”) as shown in Figure 1.

Figure 2. Bipartite graph for “Toyota.”

3. Find co-occurrence patterns that are co-occurrent with “Toyota” in each log in LC(“Toyota”). They consist of patterns such as “x Corolla,” “Aichi x,” and “x used car.”

Authorities are pages containing authoritative and useful resources, and good authorities have many links from good hub pages. Hubs are pages with links, and good hubs have many links to good authorities. The HITS algorithm proposed by Kleinberg [4] is very effective for finding good authorities and hubs on the basis of link information. If we consider co-occurrence as a link structure, we can say that representative coordinate terms and topic terms form a community, as shown in Figure 2. To discover such communities, we can use the HITS algorithm to identify coordinate terms.

4. Obtain query logs for each of the co-occurrence patterns. For LC("x Corolla"), "new Corolla" and "used car Corolla" are obtained. For the pattern "Aichi x," logs such as "Aichi university" and "Aichi bank" are obtained. For the pattern "x used car," logs such as "Nissan used car" and "Honda used car" are obtained.

5. Find terms that match x in the pattern from the logs. For example, we obtain terms such as "new," "used car," "bank," "university," "Nissan," and "Honda" as candidate coordinate terms.
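The five steps above can be sketched in code. This is a minimal sketch over an invented in-memory toy log standing in for the query log DB; note that term order is respected when matching, so "used car dealer" does not match the pattern "x used car."

```python
# Minimal sketch of the five-step candidate extraction.
# TOY_LOG is invented illustration data, not real query-log records.
TOY_LOG = [
    "toyota used car", "toyota dealer", "toyota corolla",
    "honda used car", "honda dealer", "nissan used car",
    "nissan dealer", "used car dealer", "aichi toyota",
]

def patterns_for(term, log):
    """Step 3: turn each log record containing `term` into a
    co-occurrence pattern by replacing `term` with the variable x."""
    pats = set()
    for record in log:
        words = record.split()
        if term in words:
            pats.add(" ".join("x" if w == term else w for w in words))
    return pats

def matches(pattern, record):
    """Return the term bound to x if `record` matches `pattern`
    word for word (term order matters), else None."""
    p, r = pattern.split(), record.split()
    if len(p) != len(r):
        return None
    bound = None
    for pw, rw in zip(p, r):
        if pw == "x":
            bound = rw
        elif pw != rw:
            return None
    return bound

def candidate_coordinate_terms(term, log):
    """Steps 1-5: candidates are the terms that fill x in some pattern."""
    cands = set()
    for pat in patterns_for(term, log):          # steps 2-3
        for record in log:                       # step 4
            t = matches(pat, record)             # step 5
            if t is not None and t != term:
                cands.add(t)
    return cands

print(sorted(candidate_coordinate_terms("toyota", TOY_LOG)))
```

On this toy log the candidates for "toyota" are "honda" and "nissan"; "dealer" is correctly rejected for "used car dealer" because the term order differs from "x used car."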

4.1 Hyperlink-Induced Topic Search (HITS)

HITS was designed for locating dense bipartite communities in a link structure. That is, the central idea is that authoritative pages can be identified as belonging to dense bipartite communities in link structures. For the query "car manufacturer," the home pages of Toyota, Honda, and other car makers would be considered good authorities, while Web pages that list these home pages would be good hubs. The HITS algorithm can be run on an arbitrary set of hyperlinked pages, and such a set is represented as a directed graph G = (V, E), where V is the set of pages in the environment, and a directed edge (p, q) ∈ E represents the existence of a link from p to q. Each page p ∈ V is associated with a non-negative authority weight a_p and a non-negative hub weight h_p. Their values are updated using the algorithm described below. Note that the weights of each type are normalized so that their squares sum to 1, i.e., Σ_{p∈V} a_p^2 = 1 and Σ_{p∈V} h_p^2 = 1. The updating algorithm is intuitively understandable. If

4 HITS-Based Selection Method

To select representative coordinate/topic terms from the candidate terms, we use a hyperlink-induced topic search (HITS) based method. As described in Section 3, coordinate terms have common co-occurrence patterns in query logs. For example, "x dealer," "x used car," and "x rental car" are widely co-occurrent with the coordinate terms "Toyota," "Honda," and "Nissan." We define "representative coordinate terms" as terms that share many representative topic terms (co-occurrence patterns) and "representative topic terms" as terms that match many representative coordinate terms. To resolve these mutually dependent definitions, we introduce a technique used in link analysis. One major objective in link analysis is finding communities in linked entities such as Web pages. Kleinberg [4] modeled Web communities using hubs and authorities.


p points to many pages with large a-values, it receives a large h-value; if p is pointed to by many pages with large h-values, it receives a large a-value. For this reason, Kleinberg defined two operations on the weights: I and O. The I operation updates the a-values:

    a_p ← Σ_{q:(q,p)∈E} h_q .    (1)

The O operation updates the h-values:

    h_p ← Σ_{q:(p,q)∈E} a_q .    (2)

2. Make a bipartite graph by pointing from each candidate topic term (co-occurrence pattern) to the candidate coordinate terms if there is a log record in which both the pattern and terms appear, i.e., if there is a log record that matches the result of replacing variable x in the co-occurrence pattern with the candidate coordinate term. (Figure 2 shows the bipartite graph used for discovering coordinate terms of "Toyota.")

3. Apply the HITS algorithm to the bipartite graph.

4. Output the candidate coordinate terms with high a-values as final coordinate terms, and output the candidate topic terms (co-occurrence patterns) with high h-values as final topic terms.

As a result, the a- and h-values have a mutually reinforcing relationship: the pages with larger a-values are considered better authorities, and the pages with larger h-values are considered better hubs. In terms of spectral analysis of matrices, the HITS algorithm can be rephrased as follows. Given a set of n Web pages, we can define the n × n adjacency matrix A, whose (i, j)-element is 1 if page i links to page j and 0 otherwise. Let a be the vector whose p-th element is a_p and h be the vector whose p-th element is h_p. Operations (1) and (2) can then be written as

5 Discovering Coordinate Terms for Multiple Aspects

As described in Section 4, we apply the HITS algorithm to the bipartite graph between co-occurrence patterns and candidate terms to identify coordinate terms. We extended our method so that it can also discover coordinate terms for other aspects. As described in Section 1, terms can have several aspects, and the correct coordinate terms depend on the term's meaning. For example, the term "piano" has the aspect of musical instrument and the aspect of lesson. Similarly, the term "mitsubishi" has the aspect of car and the aspect of electronics.

In his original HITS paper [4], Kleinberg used non-principal eigenvectors of A^T A and AA^T to find multiple Web communities. The i-th eigenvectors of A^T A and AA^T are denoted x_i^a and x_i^h, respectively, and λ_i is their common i-th eigenvalue. He extracted the Web pages for which the corresponding element in x_i^a had a large absolute value and designated them as authorities of the i-th community. The pages for which the corresponding element in x_i^h had a large absolute value were designated as hubs of the i-th community.

Non-principal eigenvectors can be found by eigendecomposition of A^T A and AA^T. However, linear algebra says that they can be found directly using singular value decomposition (SVD) of A without forming the two matrices A^T A and AA^T. Suppose M is an m × n real matrix. Its SVD is a factorization of the form

a^(t+1) = A^T h^(t) = (A^T A) a^(t)
h^(t+1) = A a^(t+1) = (AA^T) h^(t).

Linear algebra says that a* = lim_{t→∞} a^(t) and h* = lim_{t→∞} h^(t) converge to the principal eigenvectors of A^T A and AA^T and satisfy

(A^T A) a* = λ a*
(AA^T) h* = λ h*,

where λ is the common principal eigenvalue of A^T A and AA^T. That is, HITS is equivalent to finding the principal eigenvectors of A^T A and AA^T.
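This iteration can be illustrated with a small power-iteration sketch on a toy bipartite graph, in which co-occurrence patterns act as hubs and candidate terms as authorities. The edges below are invented for illustration, not taken from real logs.

```python
# Sketch of HITS updates (1) and (2) with L2 normalization on a toy
# bipartite graph (pattern -> candidate term). Edges are invented.
import math

edges = [
    ("x dealer", "honda"), ("x dealer", "nissan"),
    ("x used car", "honda"), ("x used car", "nissan"),
    ("x rental car", "honda"), ("aichi x", "university"),
]

a = {q: 1.0 for _, q in edges}   # authority weights (candidate terms)
h = {p: 1.0 for p, _ in edges}   # hub weights (co-occurrence patterns)

def normalize(w):
    s = math.sqrt(sum(v * v for v in w.values()))
    for k in w:
        w[k] /= s

for _ in range(50):  # power iteration
    for q in a:      # I operation: a_q <- sum of h over in-links
        a[q] = sum(h[p] for p, qq in edges if qq == q)
    normalize(a)
    for p in h:      # O operation: h_p <- sum of a over out-links
        h[p] = sum(a[q] for pp, q in edges if pp == p)
    normalize(h)

print(max(a, key=a.get))
```

"honda," which matches three of the shared patterns, ends with the largest authority weight; "university," connected only through the isolated "aichi x" pattern, is driven toward zero, which is how the dominant community is selected.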

4.2 Applying HITS to Coordinate/Topic Term Discovery

M = U Σ V^T .

As described in Section 3, our method uses the co-occurrence pattern to discover candidate terms. The HITS algorithm is applied to a bipartite graph between the co-occurrence patterns and the corresponding candidates.

If r is the rank of M, then U is an m × r matrix, V is an n × r matrix, and Σ is an r × r diagonal matrix whose elements are non-negative. The column vectors of U are orthonormal and are called the left singular vectors of M; the column vectors of V are likewise orthonormal and are called the right singular vectors of M.
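How non-principal eigenvectors of A^T A separate aspects can be illustrated with plain power iteration plus deflation in place of an SVD routine. The tiny binary pattern-by-term matrix below is invented: two patterns belong to a car aspect and two to a motorcycle aspect.

```python
# Sketch: aspect separation via non-principal eigenvectors of A^T A,
# computed by power iteration with deflation. The matrix is invented
# illustration data (rows = co-occurrence patterns, columns = terms).
import math

terms = ["toyota", "nissan", "ford", "yamaha", "kawasaki"]
A = [
    [1, 1, 1, 0, 0],   # "x dealer"           (car aspect)
    [1, 1, 1, 0, 0],   # "x used car"         (car aspect)
    [0, 0, 0, 1, 1],   # "x motorcycle parts" (motorcycle aspect)
    [0, 0, 0, 1, 1],   # "x atvs"             (motorcycle aspect)
]

n = len(terms)
# M = A^T A: term-by-term pattern co-membership counts
M = [[sum(A[r][i] * A[r][j] for r in range(len(A))) for j in range(n)]
     for i in range(n)]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]

def normalize(v):
    s = math.sqrt(sum(x * x for x in v))
    return [x / s for x in v]

def power_iter(M, steps=200):
    v = normalize([1.0 + 0.01 * i for i in range(n)])  # break symmetry
    for _ in range(steps):
        v = normalize(matvec(M, v))
    lam = sum(mv * vi for mv, vi in zip(matvec(M, v), v))  # Rayleigh quotient
    return lam, v

lam1, v1 = power_iter(M)
# deflate the principal component, then iterate again for the 2nd vector
M2 = [[M[i][j] - lam1 * v1[i] * v1[j] for j in range(n)] for i in range(n)]
lam2, v2 = power_iter(M2)

aspect1 = {terms[i] for i in range(n) if abs(v1[i]) > 0.1}
aspect2 = {terms[i] for i in range(n) if abs(v2[i]) > 0.1}
print(aspect1, aspect2)
```

The principal eigenvector concentrates on the denser car block, and the second eigenvector concentrates on the motorcycle block, which is exactly the behavior the extended method reads off the non-principal singular vectors.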

1. Obtain candidates for coordinate terms and topic terms (co-occurrence patterns) using the method described in Section 3.


Table 1. Results for query "honda."

Position 1
  Coordinate terms: used 0.2269, cheap 0.1920, kawasaki 0.1679, bmw 0.1658,
    yamaha 0.1615, suzuki 0.1533, electric 0.1410, john deere 0.1359,
    toyota 0.1341, ford 0.1211
  Co-occurrence patterns: x parts 0.4109, x lawn mowers 0.3495,
    x dealers 0.3429, x cars 0.3171, x lawnmowers 0.3001

Position 2+
  Coordinate terms: snapper 0.1351, zero turn 0.1319, riding 0.1319,
    murray 0.1319, toro 0.1319, reel 0.1319, mtd 0.1319, dixon 0.1319,
    simplicity 0.1319, scotts 0.1319
  Co-occurrence patterns: x lawn mowers 0.5159, x lawnmowers 0.5061,
    x mowers 0.4030, x generators 0.0349, x pilot 0.0059

Position 2-
  Coordinate terms: bmw -0.1281, toyota -0.1224, ford -0.1075, gm -0.1075,
    chevy -0.1075, nissan -0.1075, chevrolet -0.1075, dodge -0.1075,
    car -0.0937, jeep -0.0937
  Co-occurrence patterns: x parts -0.3446, x dealers -0.2815,
    x auto parts -0.2212, x engines -0.1651, x cars -0.1484

Position 3+
  Coordinate terms: kubota 0.1145, toyota 0.1050, chrysler 0.1003,
    car 0.0981, jeep 0.0981, auto 0.0921, boat 0.0854, ford 0.0847,
    gm 0.0847, chevy 0.0847
  Co-occurrence patterns: x parts 0.3504, x dealers 0.2602,
    x engines 0.1636, x mowers 0.1570, x lawnmowers 0.1286

Position 3-
  Coordinate terms: cheap -0.2231, ebay -0.2075, yamaha -0.1956,
    used -0.1804, suzuki -0.1709, mini -0.1520, cobra -0.1349,
    kawasaki -0.1310, used honda -0.1185, salvage -0.1110
  Co-occurrence patterns: x motorcycles -0.4404, x motorcycle parts -0.3018,
    x atvs -0.2931, x generators -0.2690, x atv parts -0.2576

If we apply SVD to adjacency matrix A (A = U Σ V^T), the following two relationships hold:

A^T A = V Σ^T U^T U Σ V^T = V Σ^2 V^T
AA^T = U Σ V^T V Σ^T U^T = U Σ^2 U^T.

The first relationship shows the eigendecomposition of A^T A; the column vectors of V are the eigenvectors of A^T A. Similarly, the column vectors of U are the eigenvectors of AA^T. Thus, we can deduce that HITS for multiple communities can be solved by SVD: the authority score vectors are equivalent to the right singular vectors, and the hub score vectors are equivalent to the left singular vectors.

Our extended method for discovering coordinate terms for multiple aspects uses SVD. First, candidate coordinate terms and topic terms (co-occurrence patterns) are collected as described in Section 3. Then, matrix A is constructed; its rows are the candidate topic terms (co-occurrence patterns), and its columns are the candidate coordinate terms. Cell a_ij takes a non-zero value if there is a log record that corresponds to replacing x in co-occurrence pattern p_i with candidate coordinate term q_j. Otherwise, it is set to zero. There are several ways we could set the non-zero values for a matched pair p_i and q_j. The simplest is to set them to a fixed value, say 1. Other options include using the number of matched log records. In our experiments, we simply used binary values for a_ij.

6 Experimental Results and Discussion

We used public AOL query log data (http://www.gregsadetsky.com/aol-data/) in the experiments. We used the 200 queries with the highest frequencies for each query term and co-occurrence pattern. We tested our extended method by applying SVD to matrices constructed as described above. Table 1 presents the results for query term "honda." Honda is an example of an entity with multiple aspects: it is a well-known car manufacturer as well as a famous motorcycle manufacturer. The table shows the results for the three right singular vectors and the three left singular vectors corresponding to the three largest singular values. It shows the ten coordinate terms with the largest positive scores and the ten coordinate terms with the largest negative scores. Position i means the i-th singular value, and +/- means the positive/negative side. For the first singular vector, the scores were either positive or negative, so we show only one set of coordinate terms. We also show the five corresponding co-occurrence patterns with the largest scores for each set of coordinate terms.

Both car manufacturers and motorcycle manufacturers appear in the first set of coordinate terms. On the negative side of the second singular vector, the results mainly consist of car manufacturers, while on the negative side of the third singular vector they mainly consist of motorcycle manufacturers. On the positive side of the second singular vector, we find various unfamiliar (at least to the authors) names. These are the names of companies making lawnmowers. Honda also makes lawnmowers, which the authors did not realize. This illustrates how the second or later singular vectors are useful for finding unknown aspects of the target entities. On the positive side of the third singular vector, we again see the names of car manufacturers, and they overlap those on the negative side of the second singular vector.

The ability to discover interesting or unexpected aspects of the target is difficult to evaluate using traditional performance measures such as the precision and recall used in information retrieval. Therefore, even though they are somewhat subjective, we describe several examples with various interesting/unexpected aspects and discuss them. The queries and discovered coordinate terms and co-occurrence patterns are listed in Table 2. For each query, we checked up to three of the strongest right singular vectors for candidate terms and up to three left singular vectors for co-occurrence patterns. Among the five sets (1, 2+, 2-, 3+, 3-) of coordinate terms and co-occurrence patterns for each query, we show the set for the first singular vectors and a few sets representing interesting/unexpected aspects. The other sets, which are omitted due to space limitations, typically represent aspects similar to those of singular vectors with higher singular values, or general terms irrelevant to the query terms. We focus on eight findings in particular.

Coordinate terms with mixed aspects appear in the first singular vector. For example, in the first singular vector for "gucci," we find famous fashion brands dealing with various products ranging from handbags to shoes. In the first singular vector for "sony," we find electronics companies ranging from computer-specific companies to general home-electronics companies.
The first singular vector sometimes represents a more general aspect than expected. As coordinate terms for "cnn" we would expect to find the names of other broadcasting companies such as "abc" and "bbc," which appear on the positive side of the second singular vector. In the first singular vector, however, we find terms representing online news sites and local newspapers. For query term "picasso," we would expect to find the names of painters such as "leonardo da vinci" and "salvador dali," which appear on the negative side of the second singular vector as coordinate terms. However, general terms representing various kinds of art, such as "mexican," "abstract," and "renaissance," appear in the first singular vector.

In the second or later singular vectors, we are likely to find coordinate terms for specific aspects. For query term "gucci," which is a major brand of various fashion items, we found other brand names specific to shoes

and brand names specific to watches as coordinate terms on the positive side of the second singular vector and on the negative side of the third singular vector, respectively. While "sony" is the name of a general electronics company, we find companies producing digital cameras, computers, and televisions on the negative side of the second singular vector, the positive side of the third singular vector, and the negative side of the third singular vector, respectively. For query term "hawaii," we find general place names in the first singular vector, while the names of resort areas appear on the negative side of the second singular vector.

We sometimes find unexpected aspects of target entities. "Piano," of course, has an aspect of music, but it also has an aspect of a thing to practice or take lessons in. For this aspect, we find terms such as "typing," "golf," and "spanish" on the positive side of the second singular vector. Other than the aspect of painter, "picasso" has an aspect of famous person. For this aspect, we find names like "william shakespeare" and "miles davis" on the negative side of the third singular vector.

Expected aspects do not always appear. Mitsubishi is a large company group that includes a car manufacturer, an electronics maker, and a bank. We find coordinate terms for the car manufacturer and the electronics maker but not for the bank. This is because the bank does not do much business in the United States, so it is not well known to AOL users. Even though we checked all the candidate coordinate terms and co-occurrence patterns, we were unable to find any terms representing this aspect. This is a natural consequence of using query logs, since they strongly depend on users' information needs.

Co-occurrence patterns help us understand users' needs. Examining the co-occurrence patterns may uncover hidden needs of users related to the target entities.
For example, while many users are interested in eating or shopping at "mcdonalds," some users consider McDonald's a place to work. This is evident from the co-occurrence patterns on the positive side of the second singular vector.

Co-occurrence patterns also help us understand the meaning of coordinate terms. Especially for an unexpected aspect, the meanings of coordinate terms can be difficult to understand. For example, the user may not know the names of lawnmower makers, as in the case described above, or may not understand why "spanish" is a coordinate term for "piano." Seeing co-occurrence patterns can help a user understand the relationships between the query term and the coordinate terms.

The scores for co-occurrence patterns tend to be concentrated on a small number of patterns, while the weights for coordinate terms tend to be dispersed across many terms. This indicates that the needs of users can be represented by

255 251

a relatively small number of terms even if we have many comparative entities.
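The per-aspect decomposition discussed above can be sketched with a toy example. This is a minimal illustration, not the paper’s implementation: the terms, patterns, and counts below are invented, and `top` is a hypothetical helper for reading off the dominant side of each singular vector.

```python
import numpy as np

# Toy co-occurrence matrix M: rows = candidate coordinate terms,
# columns = topic-term patterns; M[i, j] = how often term i appears
# with pattern j in the query log.  All values are illustrative,
# not counts from the actual AOL log.
terms = ["toyota", "honda", "nissan", "sony", "samsung", "sharp"]
patterns = ["x dealer", "x parts", "x cars", "x tv", "x lcd", "x plasma"]
M = np.array([
    [9, 8, 7, 0, 0, 0],   # toyota
    [8, 7, 6, 0, 0, 0],   # honda
    [7, 6, 5, 0, 0, 0],   # nissan
    [0, 0, 0, 5, 4, 3],   # sony
    [0, 0, 0, 4, 3, 2],   # samsung
    [0, 0, 0, 3, 2, 2],   # sharp
], dtype=float)

# Each singular vector pair (u_k, v_k) gives one aspect: u_k weights
# coordinate terms and v_k weights topic patterns.  The first left
# singular vector is the dominant eigenvector of M @ M.T, i.e. the
# HITS authority vector of the bipartite term/pattern graph, so the
# basic (single-aspect) method corresponds to k = 0.
U, s, Vt = np.linalg.svd(M)

def top(labels, vec, n=3):
    """Labels with the largest weights on the vector's dominant side."""
    sign = np.sign(vec[np.argmax(np.abs(vec))])  # SVD signs are arbitrary
    order = np.argsort(-sign * vec)
    return [labels[i] for i in order[:n]]

for k in range(2):
    print(f"aspect {k}: terms={top(terms, U[:, k])}, "
          f"patterns={top(patterns, Vt[k])}")
```

With this block-structured matrix the first aspect groups the car terms with the car-related patterns and the second groups the electronics terms with the television patterns.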

7 Conclusion

We have described a method for discovering coordinate terms and their common topics from search engine query logs. Coordinate terms are good candidates for use in making comparisons since comparisons among things that are somewhat similar are more useful than comparisons among things that are quite different. The basic idea is that coordinate terms co-occur with common topic terms in search engine query logs, and these coordinate/topic term pairs have strong tendencies in their order of appearance within queries. Kleinberg’s HITS algorithm, which was developed for finding useful Web pages corresponding to “hubs” and “authorities,” is used to identify coordinate and topic terms simultaneously. If the query term has several aspects, finding only the set of coordinate/topic terms representing one aspect is not sufficient; the method also needs to discover coordinate terms for the other aspects. We therefore extended our basic method with spectral analysis of the co-occurrence matrix: singular value decomposition is applied to the matrix to find multiple sets of coordinate terms and topic terms. Testing showed that the extended method works as intended.

Acknowledgments

This work was supported in part by the following projects and institutions: Grants-in-Aid for Scientific Research (Nos. 18049041, 18049073, and 19700091) from MEXT of Japan, a MEXT project entitled “Software Technologies for Search and Integration across Heterogeneous Media Archives,” a Kyoto University GCOE Program entitled “Informatics Education and Research for Knowledge-Circulating Society,” a Microsoft IJARC CORE4 project entitled “Toward Spatio-Temporal Object Search from the Web,” and the National Institute of Information and Communications Technology.

References

[1] M. Baroni and S. Bisi. Using cooccurrence statistics and the web to discover synonyms in a technical language. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004), pages 1725–1728, 2004.

[2] K. W. Church and P. Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990.

[3] Z. Ghahramani and K. Heller. Bayesian sets. In Proceedings of the Nineteenth Annual Conference on Neural Information Processing Systems (NIPS 2005), 2005.

[4] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999.

[5] M. Komachi and H. Suzuki. Minimally supervised learning of semantic knowledge from query logs. In Proceedings of the 3rd International Joint Conference on Natural Language Processing (IJCNLP 2008), pages 358–365, 2008.

[6] D. Lin. Automatic retrieval and clustering of similar words. In Proceedings of the 17th International Conference on Computational Linguistics and 36th Annual Meeting of the Association for Computational Linguistics (COLING-ACL 1998), pages 768–774, 1998.

[7] H. Ohshima, S. Oyama, and K. Tanaka. Searching coordinate terms with their context from the web. In Proceedings of the 7th International Conference on Web Information Systems Engineering (WISE 2006), pages 40–47, 2006.

[8] M. Paşca. Weakly-supervised discovery of named entities using web search queries. In Proceedings of the 16th ACM Conference on Information and Knowledge Management (CIKM 2007), pages 683–690, 2007.

[9] M. Paşca and B. V. Durme. What you seek is what you get: Extraction of class attributes from query logs. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI 2007), pages 2832–2837, 2007.

[10] S. Sekine and H. Suzuki. Acquiring ontological knowledge from query logs. In Proceedings of the 16th International Conference on World Wide Web (WWW 2007), pages 1223–1224, 2007.

[11] K. Shinzato and K. Torisawa. A simple www-based method for semantic word class acquisition. In Proceedings of the Recent Advances in Natural Language Processing (RANLP 2005), pages 493–500, 2005.

[12] P. D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In Proceedings of the 12th European Conference on Machine Learning (ECML 2001), pages 491–502, 2001.

Table 2. Results with multiple aspects.

Query: cnn
  Position 1
    Coordinate terms: aol, yahoo, msn, local, las vegas, houston, florida, google, israel, san diego
    Co-occurrence patterns: x news, x weather, x jobs, x careers, x radio
  Position 2+
    Coordinate terms: abc, bbc, fox, espn, nbc, american idol, sports, msnbc, cbs, nascar
    Co-occurrence patterns: x news, x sports, x polls, x radio, x world news

Query: gucci
  Position 1
    Coordinate terms: coach, designer, chanel, wholesale, prada, louis vuitton, discount, replica, cheap, guess
    Co-occurrence patterns: x handbags, x purses, x shoes, x sunglasses, x bags
  Position 2+
    Coordinate terms: girls, nike, silver, mens, platform, converse, mbt, kids, wedge, adidas
    Co-occurrence patterns: x shoes, x sandals, x sneakers, x dresses, x watches
  Position 3-
    Coordinate terms: cartier, swiss army, rolex, movado, omega, citizen, breitling, timex, pocket, seiko
    Co-occurrence patterns: x watches, x watch, x sunglasses, x bags, cheap x bags

Query: hawaii
  Position 1
    Coordinate terms: florida, colorado, north carolina, texas, las vegas, tennessee, georgia, california, arizona, chicago
    Co-occurrence patterns: x newspapers, map of x, x real estate, x vacations, x map
  Position 2-
    Coordinate terms: bermuda, cheap, disney, discount, grand canyon, luxury, cancun, maui, myrtle beach, disney world
    Co-occurrence patterns: x vacations, x hotels, x cruises, x weather, x vacation

Query: mcdonalds
  Position 1
    Coordinate terms: subway, olive garden, red lobster, burger king, starbucks, applebees, walmart, sears, home depot, sonic
    Co-occurrence patterns: x coupons, x menu, x store locator, x locations, x nutrition facts
  Position 2+
    Coordinate terms: home depot, walmart, sears, blockbuster, best buy, target, old navy, staples, office depot, babiesrus
    Co-occurrence patterns: x coupons, x store locator, x job application, x locations, x jobs
  Position 2-
    Coordinate terms: applebees, chili’s, mexican, sonic, taco bell, outback, wendy’s, denny’s, chinese food, dairy queen
    Co-occurrence patterns: x menu, x restaurant, x resturant, x nutrition facts, x nutritional information

Query: mitsubishi
  Position 1
    Coordinate terms: toyota, ford, honda, nissan, chevy, dodge, kia, bmw, car, chevrolet
    Co-occurrence patterns: x parts, x dealers, x cars, x accessories, x autos
  Position 2+
    Coordinate terms: sony, samsung, sharp, plasma, lcd, best buy, cheap, panasonic, projection, flat panel
    Co-occurrence patterns: x televisions, x tvs, x television, x tv, x electronics

Query: piano
  Position 1
    Coordinate terms: christian, guitar, free, anime, easter, aol, bass guitar, myspace, horse, music
    Co-occurrence patterns: x wallpaper, x pictures, x music, x books, x lessons
  Position 2+
    Coordinate terms: guitar, typing, golf, singing, spanish, bass, art, french, free guitar, bass guitar
    Co-occurrence patterns: free x lessons, x lessons, x tabs, online x lessons, x chords

Query: picasso
  Position 1
    Coordinate terms: mexican, nude, african american, tattoo, dragon, golf, mermaid, abstract, renaissance, angel
    Co-occurrence patterns: x art, x paintings, x art work, x woman, x lithographs
  Position 2-
    Coordinate terms: leonardo da vinci, dwight eberly, salvador dali, acrylic, art, gustav klimpt, suppressed, van gogh, bible, antique
    Co-occurrence patterns: x paintings, x biography, biography of x, x paintings for sale, x lures
  Position 3-
    Coordinate terms: chris brown, robert frost, william shakespeare, lauren bacall, miles davis, alexander von wuthenau, alicia keys, pablo picasso, leonardo da vinci, original
    Co-occurrence patterns: x biography, biography of x, x art work, x lithographs, x woman

Query: sony
  Position 1
    Coordinate terms: panasonic, cheap, samsung, hp, discount, best, dell, sharp, toshiba, best buy
    Co-occurrence patterns: x televisions, x tvs, x digital cameras, x laptops, x computers
  Position 2-
    Coordinate terms: canon, cannon, nikon, kodak, olympus, fuji, pentax, disposable, minolta, underwater
    Co-occurrence patterns: x cameras, x camera, x digital cameras, x digital camera, x camcorder
  Position 3+
    Coordinate terms: gateway, apple, acer, compaq, wireless, ibm, gaming, mac, systemax, dell
    Co-occurrence patterns: x computers, x laptops, x notebooks, x laptop, x products
  Position 3-
    Coordinate terms: sharp, mitsubishi, panasonic, samsung, lcd, hd, plasma, flat screen, digital, big screen
    Co-occurrence patterns: x televisions, x tvs, x cameras, x television, x tv
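The patterns in Table 2 all take the form of a query with the target entity replaced by a placeholder x. As a minimal sketch of how such (candidate term, pattern) pairs could be pulled from raw query strings, the following assumes queries where a known topic term appears as a suffix; the query list, topic terms, and the `extract_pairs` helper are all illustrative, not actual AOL log data or the paper’s implementation.

```python
from collections import Counter

def extract_pairs(queries, topic_terms):
    """Map each query to a (candidate term, 'x <topic>' pattern) pair."""
    pairs = []
    for q in queries:
        words = q.lower().split()
        for topic in topic_terms:
            t = topic.split()
            # topic term as a suffix; the remainder is the candidate term
            if len(words) > len(t) and words[-len(t):] == t:
                candidate = " ".join(words[:-len(t)])
                pairs.append((candidate, "x " + topic))
    return pairs

# Hypothetical query log entries
log = ["Toyota dealer", "Honda dealer", "Toyota used car",
       "Honda used car", "Nissan dealer", "cnn news"]
counts = Counter(extract_pairs(log, ["dealer", "used car", "news"]))
print(counts)
# each (term, pattern) count becomes one cell of the co-occurrence matrix
```

Aggregating these counts over a large log yields the term-by-pattern matrix to which HITS or singular value decomposition is then applied.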
