
Query Transformation by Visualizing and Utilizing Information about What Users Are or Are Not Searching

Taiga Yoshida, Satoshi Nakamura, Satoshi Oyama, and Katsumi Tanaka

Department of Social Informatics, Graduate School of Informatics, Kyoto University
Yoshida-Honmachi, Sakyo, Kyoto 606-8501 Japan
{yoshida,nakamura,oyama,tanaka}@dl.kuis.kyoto-u.ac.jp

Abstract. Search engines have become a popular way to obtain information from the WWW. However, it is hard for users to formulate proper queries when the information they want is ambiguous or concerns topics they are not familiar with. In this paper, we propose a system that helps users retrieve information by visualizing effective keywords. The system extracts important terms from search results and plots them on a two-dimensional graph, so that the user can get an overview of the tendency of the search results. The system enables the user to re-rank search results and to search again dynamically by moving terms displayed on the graph. In addition, we attempt to extend the search area by presenting complementary information related to the query. We verified the usefulness of our method by applying it to web page search and to a digital library.

Keywords: information retrieval, visualization, interactive operation, query transformation.

1 Introduction

If we want to find web pages that contain the information we want, we usually use one of two methods. One is to follow links from a Links-page, a collection of links to domain-specific pages such as those provided by social bookmark services. The other is to retrieve information with a web search engine. Many Links-pages are constructed by hand and usually include descriptions of the target web pages, so we can retrieve information from them efficiently. However, we rarely know about suitable Links-pages beforehand, so we usually use a web search engine to find them.

If we use a single keyword as a query, the search results will cover many topics. We must therefore construct a query that confines the topics of the result pages. However, it is usually difficult to choose keywords and to construct a query that connects keywords with OR or NOT operators. For example, when a user issues the query “iPod” in order to buy an iPod, the result consists of many pages about learning about the iPod and only a few pages about buying one. This is because the query “iPod” is not sufficiently effective at confining the target topic. To steer the query toward the target topic, the user should add keywords such as “buy”, “order”, or “shopping”. However, the topics of the result pages change drastically once the query changes, so the user must try many queries in order to find appropriate pages.

G. Buchanan, M. Masoodian, S.J. Cunningham (Eds.): ICADL 2008, LNCS 5362, pp. 124–133, 2008. © Springer-Verlag Berlin Heidelberg 2008


Google Suggest[1] is a system that attempts to resolve this problem. It supports users in searching for their ideal pages by suggesting additional keywords that are frequently used together with a query. However, this method depends on query logs and does not consider the topics in the result pages. The system therefore cannot identify which topics will be retrieved after a query transformation, and it often suggests keywords that do not help users find appropriate pages efficiently.

We therefore attempt to help users understand the topics in search results by extracting important keywords from the result pages as topic terms and plotting these topic terms and their relations on a graph. In addition, we enable users to re-rank search results and to search again by moving topic terms.

2 Related Work

There are many systems that visualize web pages or information[2]. KeyGraph[3] visualizes the important words in text data as a network graph in which the important words are nodes connected to each other. However, KeyGraph is intended for a single text file, not for a set of documents such as web search results. Natto View[4] visualizes web space with 3D graphics. In this system, if a user lifts a node with the mouse, related nodes are lifted together, so the user can easily see which nodes are strongly connected. However, Natto View does not focus on filtering information or on re-ranking search results through graph operations.

Clustering[5][6] is a method for categorizing documents. Clusty[7] is one of the clustering search engines. When a user executes a query, the system presents the search results together with labeled groups of categorized results. By looking at these labels, the user can narrow down the search results. With a clustering search engine, however, the user cannot decide how the system classifies the search results.

Yahoo! Mindset[8] is a system that reflects the user's intention. It enables a user to re-rank web pages according to whether they are for shopping or for researching. 121r[9] is a system with which a user can re-rank search result pages by manipulating axes presented in a radar chart; the system re-ranks the result pages according to the values of the axes. These systems resemble ours in that search result pages are re-ranked by user operations. However, they do not generate new axes dynamically, and they are not designed for searching again with another query.

Matsuike et al.[10] built a system that assists web search by presenting terms extracted from result pages. It supports users' knowledge discovery and query transformation by visualizing keywords in a tree structure. However, the types of queries the system can generate are limited. Our system differs from it in that it reflects users' operations flexibly.

3 Overview

In our research, we propose a system that helps users understand the topics in search result pages and thus find appropriate pages. Instead of a list of search result items, the system displays the important terms of the result pages in a two-dimensional graph named a “keyword map”, because this lets the user understand the results at a glance. In this paper, we define the important terms in result pages as topic terms. The user can transform queries and re-rank result pages by manipulating the topic term nodes displayed on the keyword map.

3.1 Topic Terms and Co-occurring Terms

A search result page returned by a search engine takes the form of a textual list of search results. Each search result consists of the title of the web page, the URL of the page, and a snippet that summarizes the page. A search result page covers many topics, and a main topic is referred to in many snippets. However, a user must read almost all the snippets in the search result page to understand which topics are described in the result pages. We propose a system with which a user can find out the tendency of the topics in result pages without looking through a list of search result items.

Among the terms that appear in snippets, some are strongly associated with a topic in the search result pages. For example, when we execute the query “jaguar” with a search engine and a result page contains keywords such as “animal” or “zoo”, we realize that the page concerns the animal “jaguar”. If instead it contains keywords such as “car” or “ford”, we realize that the page concerns the car “jaguar”. We define these discriminating keywords as “topic terms” because they are involved in forming topics. We visualize the relationship between query keywords and topic terms on a keyword map. Moreover, a user can find the topics in a search result page more easily by knowing which keywords are closely connected with the topic terms. We therefore define these surrounding keywords as “co-occurring terms” and visualize them along with the topic terms.

3.2 External Topic Terms

When a user searches for certain information with a search engine, he/she may also be interested in related information that he/she did not intend to search for. For example, a user who executes the query “iPod” may be interested in other portable mp3 players such as “ZUNE” or “walkman”. However, there are very few pages about “ZUNE” or “walkman” among the search results for the query “iPod”. Our system automatically searches for complementary information by constructing a new query that consists of topic terms, and it extracts keywords about the complementary information from the snippets in that search result. We define these keywords as “external topic terms”.

3.3 Visualization and Manipulation

It is difficult to grasp the relationships between topic terms and the weight of each topic merely by looking at a list of search results. A tag cloud[11] shows the importance of terms by changing their font size. However, a tag cloud does not take the relationships between terms into account. For this reason, we chose a network structure for the system interface. The system plots query keywords, topic terms, and co-occurring terms and connects them to each other with lines. In this paper, we call this network structure a keyword map. We call the terms on a keyword map “nodes” and the lines between nodes “edges”.


On the keyword map, the system plots the query as a “query node” and the topic terms as “topic nodes”. The query node and the topic nodes are connected by edges, and a user can move every node by drag-and-drop. Terms that co-occur with topic terms are plotted around the topic nodes as “co-occurrence nodes”. An image of the nodes on a keyword map is shown in Fig. 1.

With a conventional search engine, a user must construct a multi-keyword query in order to reduce the number of useless pages, and desirable pages will not be obtained unless the user chooses appropriate keywords. If a user uses OR operators in a query, he/she might be able to find appropriate pages more flexibly. However, it is difficult to construct effective queries with OR operators, as described by White and Morris[12]. We propose a system with which a user can weight each keyword smoothly by manipulating, with the mouse, keyword candidates presented on the screen. Our system not only re-ranks results but also transforms the query if the user weights a keyword strongly. A user can generate a new query with AND or NOT operators and re-rank result pages by changing the distance between the query node and a topic node on the keyword map. The behavior of our system depends on the distance between the query node and a topic node as follows:

・ AND search if the distance is shorter than a threshold L1
・ NOT search if the distance is longer than a threshold L2
・ Re-ranking according to the distance if it is between L1 and L2

Fig. 2 shows how the system re-ranks search results or reconstructs queries according to the layout of the nodes on a keyword map. When a query changes, the topics in the search result pages also change. So if a query changes, our system calculates new topic terms and co-occurrence terms and plots the keyword map again. If a user wants to find information about “external topic terms”, the user can use an “external topic node” as a normal “topic node” by dragging and dropping the node into the keyword map.
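The distance rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `transform_query`, the concrete threshold values, and the `-term` syntax for NOT are hypothetical, since the source specifies only the three distance ranges.

```python
def transform_query(base_query, distances, l1, l2):
    """Split topic terms into AND terms, NOT terms, and re-ranking terms.

    distances maps each topic term to its distance from the query node.
    l1 and l2 play the role of the thresholds L1 < L2 in the text.
    """
    and_terms, not_terms, rerank = [], [], {}
    for term, d in distances.items():
        if d < l1:        # closer than L1: require the term (AND search)
            and_terms.append(term)
        elif d > l2:      # farther than L2: exclude the term (NOT search)
            not_terms.append(term)
        else:             # between L1 and L2: used only for re-ranking
            rerank[term] = d
    # Assumed query syntax: a space means AND, a leading "-" means NOT.
    parts = [base_query] + and_terms + ["-" + t for t in not_terms]
    return " ".join(parts), rerank


# Hypothetical layout: "buy" was dragged close, "review" far away.
query, rerank = transform_query(
    "iPod", {"buy": 0.2, "price": 1.0, "review": 3.0}, l1=0.5, l2=2.0
)
```

Only a change in the AND or NOT term lists would require a new search; terms that stay between the two thresholds affect the ordering of the current results only.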

Fig. 1. Visualization of terms

Fig. 2. Query transformation

4 Design and Implementation

The steps of using the system are given below.

1. A user inputs a query to the system.
2. The system presents a query node, topic nodes and co-occurrence terms of a page listing search result items on a keyword map.
3. The user operates nodes on the keyword map according to the user's intention.
4. The system re-ranks a list of search result items according to the layout of keyword map.
5. The user repeats operations until appropriate pages are found.

Fig. 3 shows the system after a user has executed a query. A keyword map is presented on the left side of the window. The system shows the topic terms of the query as topic nodes on the keyword map. Query transformation and re-ranking are performed by moving important topic terms closer to the query node and moving unnecessary topic terms away from it. A list of re-ranked pages, with the titles, snippets, and URLs of the result pages, is shown on the right side of the window.

4.1 Extracting Topic Terms

When a user inputs a query and presses the “Search” button, the system executes the query with Yahoo web search and obtains a set of result pages. The system extracts terms from the snippets of the result pages and calculates their DF (document frequency) values. The system then chooses topic terms from these extracted terms. We define topic terms as terms that can confine the topics of the result pages when added to the query. The system extracts topic terms by the method shown below.

1. Sort the terms by DF value in descending order and name them Ti (i = 1 ~ n) in that order.
2. Repeat steps 3 to 5 for each term Ti until 10 topic terms have been extracted.
3. Divide the result pages into two groups Pi and Ni. If a page contains the term Ti, assign the page to Pi; otherwise assign it to Ni.
4. Calculate the DF values over the pages in Pi for all terms and name the resulting list DFP = { DFPi | i = 1 ~ n }. In the same way, make a DF list from the pages in Ni and name it DFN = { DFNi | i = 1 ~ n }.
5. Calculate the cosine similarity between DFP and DFN. If the cosine similarity is less than the threshold (= 0.6), add the term to the keyword map.
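The topic term extraction procedure of Section 4.1 can be sketched in Python as follows. This is an illustrative sketch rather than the authors' code: each result page is assumed to be already reduced to the set of terms in its snippet, the function names are hypothetical, and skipping terms that occur in every page or in none (for which one of the two groups is empty) is an added assumption that the paper does not state.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def extract_topic_terms(pages, threshold=0.6, max_terms=10):
    """pages: list of sets, each holding the terms of one snippet."""
    df = Counter(term for page in pages for term in page)
    # Step 1: sort terms by DF, descending (ties broken alphabetically).
    ranked = sorted(df, key=lambda t: (-df[t], t))
    topics = []
    for term in ranked:                      # Step 2: loop over terms
        if len(topics) >= max_terms:
            break
        # Step 3: split the pages by whether they contain the term.
        pos = [p for p in pages if term in p]
        neg = [p for p in pages if term not in p]
        if not pos or not neg:
            continue                         # assumption: cannot split pages
        # Step 4: DF lists over all terms within each group.
        dfp = [sum(t in p for p in pos) for t in ranked]
        dfn = [sum(t in p for p in neg) for t in ranked]
        # Step 5: dissimilar term distributions mark a topic term.
        if cosine(dfp, dfn) < threshold:
            topics.append(term)
    return topics


pages = [
    {"jaguar", "car", "ford"},
    {"jaguar", "car", "dealer"},
    {"jaguar", "animal", "zoo"},
    {"jaguar", "animal", "cat"},
]
topics = extract_topic_terms(pages)
```

In this toy input the query term “jaguar” appears in every snippet and is skipped, while “animal” and “car” split the pages into groups with clearly different term distributions.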

Fig. 3. System image


4.2 Extracting Terms that Co-occur with Topic Terms

When a user confines the result pages to those that contain the query keywords and an extra topic term, the topics of the pages change radically, and so do the terms in the snippets. The system presents some terms around each topic term on the keyword map. These are terms whose frequency of appearance increases when the pages are confined to those that contain both the query keywords and the topic term. By looking at the terms around a topic term on the keyword map, a user can decide whether or not to emphasize that topic term. Terms co-occurring with topic terms are extracted by the method below.

1. For each topic term, perform steps 2 to 6.
2. Divide the result pages into two groups. Pages whose snippets contain the topic term form the “Positive group”; pages that do not contain the topic term form the “Negative group”.
3. Count the number of pages in the “Positive group” and call it nP. Then count the number of pages in the “Negative group” and call it nN.
4. For all Ti (i = 1 ~ n), calculate the DF value over the pages in the “Positive group” and call it DFPTi. In the same way, calculate the DF value over the pages in the “Negative group” and call it DFNTi.
5. Calculate the chi-square value using nP, nN, DFPTi, and DFNTi.
6. If the chi-square value is more than the threshold, adopt the term Ti as a co-occurrence term.

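The chi-square value Si is calculated by the formula below, which is the chi-square statistic of the 2×2 table formed by group membership (Positive or Negative) and occurrence of the term Ti. If the value is more than the threshold, the occurrence rate of the term in the “Positive group” can be said to differ from its occurrence rate in the “Negative group”.

$$ S_i = \frac{(n_P + n_N)\,\{DFPT_i\,(n_N - DFNT_i) - DFNT_i\,(n_P - DFPT_i)\}^2}{n_P\, n_N\, (DFPT_i + DFNT_i)\,(n_P + n_N - DFPT_i - DFNT_i)} \qquad (1) $$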
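As a worked illustration of the co-occurrence test, the sketch below reads Equation (1) as the standard Pearson chi-square statistic for a 2×2 contingency table. That reading, the threshold of 3.84 (the 5% critical value for one degree of freedom), the extra check that the term is more frequent in the Positive group, and the function names are all assumptions; the paper does not state its threshold value.

```python
def chi_square(n_p, n_n, dfp, dfn):
    """Pearson chi-square for the 2x2 table of group vs. term occurrence.

    n_p, n_n: number of pages in the Positive and Negative groups.
    dfp, dfn: number of those pages whose snippet contains the term.
    """
    a, b = dfp, n_p - dfp      # Positive group: with term, without term
    c, d = dfn, n_n - dfn      # Negative group: with term, without term
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:             # a row or column is empty: no evidence
        return 0.0
    return (n_p + n_n) * (a * d - b * c) ** 2 / denom


def is_cooccurring(n_p, n_n, dfp, dfn, threshold=3.84):
    """Adopt the term only if it is significantly MORE frequent in the
    Positive group (the direction check is an added assumption)."""
    more_frequent = dfp * n_n > dfn * n_p   # compares dfp/n_p with dfn/n_n
    return more_frequent and chi_square(n_p, n_n, dfp, dfn) > threshold


# Term appears in 8 of 10 positive pages and 2 of 10 negative pages.
s = chi_square(10, 10, 8, 2)
```

The direction check reflects the sentence in Section 4.2 that co-occurring terms are those whose frequency increases when pages are confined to the topic term; the chi-square value alone is symmetric and would also flag terms that become rarer.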

4.3 Extracting External Topic Terms

To help a user find information that he/she did not intend to search for, the system presents keywords relevant to complementary information about the query and the topic terms. We call these terms “external topic terms”. The system extracts external topic terms by the method shown below.

1. Choose three terms T1, T2 and T3 by selecting the most frequently occurring terms among the topic terms.
2. Construct a query Q = (T1 or T2 or T3) not Qo, where Qo is the query input by the user.
3. Execute the query Q and extract terms by a method similar to that used for extracting topic terms.
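Steps 1 and 2 of this procedure amount to building a query string. A minimal sketch follows, assuming that topic terms come with their DF counts and that the target search engine accepts upper-case `OR` and `NOT` with parentheses; the function name and the operator syntax are hypothetical and would need adapting to a real search API.

```python
from collections import Counter


def external_query(topic_term_df, original_query, k=3):
    """Build Q = (T1 OR T2 OR T3) NOT Qo from the k most frequent topic terms.

    topic_term_df: dict mapping each topic term to its DF value.
    original_query: the query Qo that the user entered.
    """
    top_terms = [term for term, _ in Counter(topic_term_df).most_common(k)]
    # Parenthesize Qo so that a multi-word query is negated as a whole.
    return f"({' OR '.join(top_terms)}) NOT ({original_query})"


# Hypothetical DF values for topic terms of the query "iPod".
q = external_query({"apple": 9, "nano": 7, "music": 5, "case": 2}, "iPod")
```

Executing `q` and running the topic term extraction on the returned snippets would then yield candidates for external topic terms, as described in step 3.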


4.4 Weighting Topic Terms

The system re-ranks search results according to the layout of the topic nodes on the keyword map, so a user can re-rank pages by moving topic nodes with the mouse. Each page has its own score. The score Sj of page j is calculated by the formula shown below, where di is the distance between the query node and the term node Ti, xji = 1 if the snippet of page j contains the term Ti and xji = 0 otherwise, and θ is a threshold.

$$ S_j = \sum_{i=1}^{n} \left( \frac{x_{ji}}{d_i} - \theta \right) \qquad (2) $$
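Equation (2) can be illustrated with a short sketch that reads it as S_j = Σ_i (x_ji / d_i − θ). The function names, the example distances, and the value of θ are hypothetical; distances are assumed to be strictly positive.

```python
def page_score(snippet_terms, distances, theta):
    """Score S_j of one page.

    snippet_terms: set of terms found in the page's snippet.
    distances: dict mapping each topic term T_i to its distance d_i (> 0)
               from the query node on the keyword map.
    theta: the threshold subtracted for every topic term.
    """
    score = 0.0
    for term, d in distances.items():
        x = 1.0 if term in snippet_terms else 0.0   # x_ji
        score += x / d - theta
    return score


def rank_pages(pages, distances, theta):
    """pages: list of (title, snippet_terms); returns titles, best first."""
    scored = sorted(
        pages, key=lambda p: page_score(p[1], distances, theta), reverse=True
    )
    return [title for title, _ in scored]


# Hypothetical layout: "car" was moved close to the query node, "zoo" far.
distances = {"car": 0.5, "zoo": 2.0}
pages = [("zoo page", {"jaguar", "zoo"}), ("car page", {"jaguar", "car"})]
ranking = rank_pages(pages, distances, theta=0.1)
```

With these values the page mentioning “car” scores 1/0.5 − 0.1 − 0.1 = 1.8 and the page mentioning “zoo” scores 1/2.0 − 0.1 − 0.1 = 0.3, so the former is ranked first.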

If the user moves a topic node toward the center of the keyword map, the scores of the pages that contain the topic term in their snippets increase. The retrieved pages are sorted in descending order of Sj, and the pages with the highest scores are presented as a list in the window.

4.5 Extension for Google Scholar

We also designed a version of the system for Google Scholar[13]. Google Scholar is a web site for searching papers. A user can search for papers by entering free keywords, author names, publication names, or publication dates as a query. Titles, summaries, conference information, and other details about papers are returned as search results. Although Google Scholar lets a user retrieve papers by many attributes, many users find it hard to make full use of them.

With our system, a user inputs keywords in a text box as a query. If he/she wants to specify an author name, publication name, or publication date, he/she can use those attributes as part of the query. The user executes the query by clicking the “Search” button. The search results from Google Scholar are then presented on the right side of the system. At the same time, the system extracts terms from the authors, titles, summaries, conference names, years, and publication names of the search results and presents them on a keyword map. Common short words such as “is” or “in” are excluded from the extracted terms. The user can change which attribute's terms are plotted by clicking a button below the keyword map.

On the keyword map, term nodes are plotted so as to reflect the frequency of the terms in the search result. We can therefore regard a paper that contains many of the terms plotted near the center of the keyword map as a typical paper for the query. The system calculates a score for every paper based on the frequency of each term and the distance from the position of each term to the center of the keyword map, and re-ranks the papers in descending order of score. In this way, the retrieved papers are ordered from those strongly related to the query to those not related to it. The most highly ranked papers are presented as a list on the right side of the window.

A user can grasp the tendency of the terms in the search results by looking at the keyword map. If there is a term that he/she thinks is related to the topic he/she wants to browse, he/she can find appropriate papers by moving the term toward the center of the keyword map. Conversely, if a user does not want to browse papers that contain a certain term, he/she can remove such papers by moving the term toward the outer edge of the keyword map. When a user moves a term node, the system automatically recalculates the scores of the papers and re-ranks them.

“External topic nodes” are plotted around a circle on the keyword map. They normally have no effect on the search results, but they act like normal “topic nodes” once they are dragged and dropped into the circle. Fig. 4 shows the system for Google Scholar.

5 Evaluation and Discussion

To validate the feasibility of the system, we performed an experiment. The task was to distinguish between persons who share the same name. We examined whether the system could present pages about a specified person when we moved topic terms related to that person on the keyword map. We performed the experiment for 10 names. In each trial, we selected one person from among those sharing the name and chose 3 terms from the topic term nodes on the keyword map. We then counted the number of correct pages in the top 20 pages after re-ranking. Before any topic term was moved, the average number of correct pages in the top 20 was 6.6. It increased to 11.0 after moving 1 node, 12.8 after moving 2 nodes, and 13.1 after moving 3 nodes. This experiment shows that the system can provide users with pages that match their intention through simple operations.

One difference between our system and a conventional search engine is the method of displaying search results. A conventional search engine presents search results as a list of titles and snippets, whereas our system visualizes the keywords in the search results as a keyword map. By viewing the keyword map, a user can understand the tendency of the topics in the search results. Our system is therefore well suited to searching for information when the search results cover many topics.

Fig. 4. System for Google scholar


Searching for papers with Google Scholar differs from searching for web pages with a search engine in several respects. First, each result paper has a large number of attributes. When a user searches for papers, attributes such as conference, year, and publisher are important if the user knows the attribute values of the papers he/she wants to find. In Google Scholar, however, a user can communicate these attribute values to the system only as query keywords, so the user cannot specify attribute values in detail. Our system resolves this problem. The user can grasp the tendency of the attribute values by looking at the keyword map and can specify attribute values even when he/she decides to emphasize a keyword only after looking at the search results.

Second, the topics of papers are more concrete than those of web pages and usually differ from paper to paper. It is therefore difficult for a user to construct queries when he/she wants to find papers whose attribute values (titles, authors, etc.) he/she does not know well. Our system is suited to searching for papers when the user wants to indicate attribute values only vaguely. For example, when a user wants to find papers about databases, the papers obtained by the query “database” cover many topics. In this case, the user can pick out the papers in which he/she is interested by manipulating the keyword map. The system is also suited to situations in which the user wants to specify some keywords as author names but is not sure whether all the names are in the author list of the paper.

The system for Google Scholar is one instance of our method. The applicability of the method does not depend on what kind of information is searched, so the method can be applied to other search engines. For example, a user can find the desired information with our method when searching for products at online shopping sites. When the method is applied to social bookmarking services, a user can search for pages intuitively and flexibly by moving topic nodes that correspond to tags.

6 Conclusion

In this paper, we proposed a system that visualizes search results and re-ranks them according to user operations. The system extracts topic terms from search results and presents them in a two-dimensional graph named a keyword map. A user can find appropriate results just by moving terms with the mouse.

Generally, a user uses nouns as query keywords, so our system mainly extracts and plots nouns on the keyword map. However, there are times when we want to know the reputation of a certain product, or want to learn about something but cannot think of concrete keywords. In these cases, we may want to use adjectives as query keywords. However, it is hard to find results that meet this purpose because results rarely contain the adjectives themselves, even when their contents are related to them. We plan to improve the system so that it presents adjective keywords to the user and enables him/her to re-rank search results by moving them. For this purpose, we have to devise a method of associating adjective terms with search results that do not contain those terms.

When a system proposes topic terms, it is important to increase the coverage of topics. In order to present a variety of keywords, we will extend the system to change the presented terms dynamically according to user operations. For example, when a user moves a certain keyword to the outer edge of the keyword map, it means that he/she does not want to browse pages or papers about that keyword. The system would therefore remove keywords related to it and present new keywords in their place. The user could then browse many topics by removing the keywords in which he/she is not interested.

Our system enables a user to construct queries and re-rank search results by manipulating a keyword map. With our system, a user does not need to construct complicated queries, so he/she can find appropriate results through a simple process. To make the system more intuitive to operate, we will also work on presenting information about search results more intelligibly.

Acknowledgments

This work was supported in part by "Informatics Education and Research Center for Knowledge-Circulating Society" (MEXT Global COE Program, Kyoto University), and by MEXT Grant-in-Aid for Scientific Research in Priority Areas entitled: "Content Fusion and Seamless Search for Information Explosion" (A01-00-02, Grant#: 18049041) and "Design and Development of Advanced IT Research Platform for Information" (Y00-01, Grant#: 18049073).

References

1. Google suggest, http://www.google.com/webhp?hl=en&complete=1
2. Chen, C.: Information Visualization. Springer (2004)
3. Ohsawa, Y., Benson, N.E., Yachida, M.: KeyGraph: automatic indexing by co-occurrence graph based on building construction metaphor. Research and Technology Advances in Digital Libraries (1998)
4. Shiozawa, H., Nishiyama, H., Matsushita, Y.: The Natto View: An Architecture for Interactive Information Visualization. IPSJ Journal (1997)
5. Jain, A.K., Dubes, R.C.: Algorithms for clustering data. Advanced Reference Series. Prentice-Hall, Englewood Cliffs (1988)
6. Cutting, D.R., Pedersen, J.O., Karger, D., Tukey, J.W.: Scatter/gather: A cluster-based approach to browsing large document collections. In: Proc. of SIGIR 1992, pp. 318–329 (1992)
7. Clusty, http://www.clusty.com/
8. Yahoo! Mindset, http://mindset.research.yahoo.com/
9. 121r (one to one ranking system), http://www.kbmj.com/service/products/121r.html
10. Matsuike, Y., Zettu, K., Oyama, S., Tanaka, K.: Approximate Intentional Representation showing the Outline and Surrounding of Web Search Results and its Visualization, DBWS 2004 (2004)
11. Sinclair, J., Cardew-Hall, M.: The folksonomy tag cloud: when is it useful? Journal of Information Science (2008)
12. White, R.W., Morris, D.: Investigating the querying and browsing behavior of advanced search engine users. In: Proc. of SIGIR 2007, pp. 255–262 (2007)
13. Google scholar, http://scholar.google.co.jp/schhp?hl=en
