
Search As You Think AND Think As You Search: Semantic Search Interface for Entity/Fact Retrieval

Sofia J. Athenikos
Drexel University, College of Info Science & Technology, Philadelphia, PA 19104, USA
(1) 212-785-5285
[email protected]

Xia Lin
Drexel University, College of Info Science & Technology, Philadelphia, PA 19104, USA
(1) 215-895-2482
[email protected]

ABSTRACT
The mode of information retrieval on the Web in general remains that of conventional keyword-based page/document retrieval. The project presented in this paper, called PanAnthropon FilmWorld, aims at demonstrating direct, sophisticated entity/fact retrieval by using semantic knowledge extracted from Wikipedia. To this end, a semantic knowledge base containing the extracted data and a semantic search interface demonstrating the proposed retrieval capability have been constructed. The focus of this paper is on the design and performance of the intelligent, interactive interface. The results of evaluation confirm the effectiveness of the interface for information retrieval by ordinary users with no prior exposure.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval – query formulation, retrieval models, search process; H.3.5 [Information Storage and Retrieval]: Online Information Services – web-based services; I.2.4 [Artificial Intelligence]: Knowledge Representation Formalisms and Methods – relation systems, semantic networks.

General Terms
Design, Experimentation, Human Factors, Performance

Keywords
Semantic Search, Faceted Search, Entity Retrieval, Fact Retrieval

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. HCIR’11, October 20, 2011, Mountain View, CA, USA. HCIR’11, Copyright held by author.

1. INTRODUCTION
Wikipedia (http://www.wikipedia.org/) has recently become an important semantic knowledge resource. The mode of information retrieval on Wikipedia, as on the general Web, however, remains that of conventional keyword-based page/document retrieval. The main objective of the research project presented in this paper, PanAnthropon FilmWorld, is to demonstrate the capability of retrieving entities and related facts that directly match a user’s query by using the semantic knowledge extracted from Wikipedia.

The rest of this paper is organized as follows: Section 2 describes the background behind the research. Section 3 discusses related work. Section 4 discusses the conceptual framework that underlies the knowledge base and the search interface. Section 5 briefly describes the semantic knowledge base. Section 6 describes the semantic search interface. Section 7 discusses the evaluation of the effectiveness of the interface. Section 8 concludes the paper.

2. BACKGROUND
Humans think about things and make sense of things largely by virtue of classifying things into distinct classes and interpreting the characteristics of things and the relations between things based on the classification. The ontological structure of the world in human thinking thus constitutes the basis of the semantic structure of human sense-making. A significant kind of information-seeking activity is concerned with finding entities (things) or facts concerning entities. When we seek such information, we (usually) already know what kinds of things we are looking for, and our knowledge of the kinds (classes) of things (usually) already informs us of the kinds of facts, attributes (properties), or relations to look for.

The keyword-based search interfaces commonly found on the Web alienate the process of information-seeking from the process of thinking and sense-making by uniformly collapsing entities, attributes, relations, and facts into a sequence of individual words devoid of coherent meaning.

3. RELATED WORK
The task of extracting semantic knowledge from Wikipedia, thus enabling entity retrieval via structured queries, has been attempted by Suchanek et al. [4,5] and Auer et al. [2,3], with the YAGO project (http://www.mpi-inf.mpg.de/yago-naga/yago/index.html) and the DBpedia project (http://www.dbpedia.org/), respectively. Both YAGO and DBpedia provide demonstrative query interfaces for searching their semantic knowledge bases by using SPARQL (http://www.w3.org/TR/rdf-sparql-query/) patterns composed of a set of conditions, each in the form of a subject-predicate-object triple. The YAGO query form provides a dropdown menu containing all available predicates to choose from. Since YAGO is a general-domain or multi-domain knowledge base, and since the query form does not impose any restrictions as to what types of entities can occupy the subject field, the menu contains all predicates, regardless of whether or not a given predicate may be applicable to the subject entity.

The DBpedia query form and query input format are quite similar to those of the YAGO interface, except that the predicate field provides suggestions using look-ahead technology and that query results are presented in a table format rather than in a list format. The semantic search interface constructed in this project is similar to those of YAGO and DBpedia in appearance. However, it provides multiple semantic search/retrieval functions, which are facilitated by explicit specification of entity type/subtype as well as by highly interactive, intelligent menu option presentation.
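To make the notion of a query composed of several triple conditions concrete, the following minimal Python sketch evaluates a conjunction of triple patterns over a small in-memory list of triples. It is not SPARQL itself and does not reproduce the YAGO or DBpedia interfaces; the entity names, predicate names, and the convention that variables start with "?" are all assumptions made for illustration.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]   # (subject, predicate, object)
Bindings = Dict[str, str]       # variable name -> bound term

# Hypothetical toy triples; none of these names come from YAGO or DBpedia.
TRIPLES: List[Triple] = [
    ("Film A", "directedBy", "Person X"),
    ("Film A", "hasGenre", "Drama"),
    ("Film B", "directedBy", "Person X"),
    ("Film B", "hasGenre", "Comedy"),
]


def is_var(term: str) -> bool:
    """Assumed convention: a term starting with '?' is a variable."""
    return term.startswith("?")


def unify(pattern: Triple, triple: Triple, bindings: Bindings) -> Optional[Bindings]:
    """Extend bindings so that pattern matches triple, or return None on conflict."""
    new = dict(bindings)
    for p, t in zip(pattern, triple):
        if is_var(p):
            if new.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return new


def match(patterns: List[Triple], triples: List[Triple]) -> Iterator[Bindings]:
    """Yield every binding that satisfies all patterns (a conjunctive query)."""
    def solve(i: int, bindings: Bindings) -> Iterator[Bindings]:
        if i == len(patterns):
            yield bindings
            return
        for triple in triples:
            extended = unify(patterns[i], triple, bindings)
            if extended is not None:
                yield from solve(i + 1, extended)
    yield from solve(0, {})


# Two conditions sharing the variable ?film.
QUERY: List[Triple] = [("?film", "directedBy", "Person X"), ("?film", "hasGenre", "Drama")]
print(list(match(QUERY, TRIPLES)))
```

Because every pattern is free-form, nothing in such a query prevents a predicate from being paired with a subject to which it does not apply, which is the limitation of the unrestricted predicate menu noted above.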

4. CONCEPTUAL FRAMEWORK
4.1 Classification of Entities
In this project, “entities” are conceived of as things of all kinds that can be classified into different “classes” and that have certain “attributes”. The kinds of classes and attributes that are relevant depend on the domain at issue (i.e., the ontological space). This project therefore takes a domain-oriented approach to ontology construction as well as knowledge extraction and retrieval. The ontology constructed for the current FilmWorld version of the PanAnthropon project, which is used to classify entities that are relevant to the film domain and are of interest to the project, is at: http://dlib.ischool.drexel.edu:8080/sofia/PA/Ontology.pdf. Each column in the ontology table corresponds to a distinct level in the subsumption hierarchy, from the top level to level 5.

The entities extracted or derived in this project are semantically typed according to the film-domain ontology above. Specifically, the “type” of an entity refers to the level-1 class, whereas the “subtype” refers to the leaf class subsumed by the former. A simplified entity type/subtype classification scheme is at: http://dlib.ischool.drexel.edu:8080/sofia/PA/Ontology_Simple.pdf. The simplified scheme is used for the entity type/subtype menu presentation on the search interface.
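The type/subtype convention can be illustrated with a minimal Python sketch. The class names and the example path are hypothetical placeholders rather than entries from the project's ontology, and treating the first path element as the top level (level 0) is an assumption; the only rule carried over from the text is that the type is the level-1 class and the subtype is the leaf class on the same path.

```python
from typing import NamedTuple, Sequence


class EntityTyping(NamedTuple):
    entity_type: str  # level-1 class
    subtype: str      # leaf class subsumed by the level-1 class


def typing_from_path(path: Sequence[str]) -> EntityTyping:
    """Derive (type, subtype) from a subsumption path.

    Assumes path[0] is the top-level class, path[1] the level-1 class,
    and path[-1] the leaf class.
    """
    if len(path) < 2:
        raise ValueError("path needs a top-level class and a level-1 class")
    return EntityTyping(entity_type=path[1], subtype=path[-1])


# Hypothetical path, for illustration only.
EXAMPLE = typing_from_path(["Entity", "Person", "Filmmaker", "Director"])
print(EXAMPLE)
```

When a level-1 class has no subclasses, this sketch returns the same class as both type and subtype; whether the project handles that case this way is not stated in the text.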

4.2 Classification of Attributes & Categories
As indicated above, different attributes apply to different entities, depending on their types/subtypes. As in the case of classes, new attributes were extracted or derived as direct knowledge extraction and indirect knowledge derivation progressed. A table containing the list of 190 attributes, along with information on the applicable types of entities, values, and value entities, is at: http://dlib.ischool.drexel.edu:8080/sofia/PA/Attributes.html.

Apart from the hierarchical and non-hierarchical classification schemes used to classify entities and attributes, another hierarchical scheme was used to classify the extracted Wikipedia categories. The taxonomy consisting of 215 super-categories is partially shown at: http://dlib.ischool.drexel.edu:8080/sofia/PA/SuperCategories.pdf. Unlike in Wikipedia, only one leaf super-category was assigned to a given regular category.

4.3 Representation of Facts
In this project, a “fact” concerning an entity refers to a tuple consisting of an entity, an attribute, a value, and a note, where “value” can consist of a literal, an entity, a class, or a Wikipedia category. (When another entity occupies the value position, it represents a relation between the two entities.) The “note” field is used to store contextual information relevant to a given fact, which is not possible in a strict subject-predicate-object triple model.
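A fact carrying a contextual note can be sketched as a small Python data structure. The field names (entity, attribute, value, note), the EntityRef wrapper, and all example values are assumptions introduced for illustration; class and category values are omitted for brevity, and the project itself stores its facts in a MySQL database rather than in objects like these.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class EntityRef:
    """Hypothetical wrapper marking a value that is itself an entity."""
    name: str


@dataclass(frozen=True)
class Fact:
    entity: str
    attribute: str
    value: Union[str, int, EntityRef]  # literal or entity reference
    note: Optional[str] = None         # contextual information for this fact

    def is_relation(self) -> bool:
        """A fact whose value is another entity expresses a relation."""
        return isinstance(self.value, EntityRef)


# Placeholder facts, not taken from the knowledge base.
FACTS = [
    Fact("Film A", "director", EntityRef("Person X")),
    Fact("Film A", "release year", 1999),
    Fact("Person X", "award won", "Best Director", note="for Film A"),
]

for fact in FACTS:
    print(fact.entity, fact.attribute, fact.value, fact.note, fact.is_relation())
```

The third fact shows the role of the note: the same attribute-value pair could hold for several films, and the note records which one a given fact refers to.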

5. SEMANTIC KNOWLEDGE BASE
Semantic knowledge concerning the film domain has been directly extracted and indirectly derived from Wikipedia by using film pages and additional pages on the Academy Awards and the Golden Globe Awards. The semantic knowledge base constructed (using a MySQL database) contains 209,266 distinct entities and 2,345,931 distinct entity-centric facts. The details of knowledge extraction/derivation/organization are the focus of another paper [1]. The summary statistics on the extraction/derivation results are at: http://dlib.ischool.drexel.edu:8080/sofia/PA/Statistics.pdf.

6. SEMANTIC SEARCH INTERFACE
The Web interface for the PanAnthropon FilmWorld project was implemented by using HTML, JavaScript, and JSP (in connection with the MySQL database) on the Tomcat server. The interface (Figure 1) is at: http://dlib.ischool.drexel.edu:8080/sofia/PA/.

6.1 Interface Functions
The interface currently provides the following semantics-based entity/fact/relation retrieval functions: (1) General Entity Retrieval Query (GERQ); (2) Specific Entity-Centered Query (SECQ); (3) Entity Commonality Finder Query (ECFQ); (4) Direct Relation Finder Query (DRFQ); (5) Indirect Relation Finder Query (IRFQ); and (6) Category-Based Entity Browsing (CBEB).

The GERQ function corresponds to the main research problem of this project, namely, to demonstrate the capability of retrieving entities and related facts that directly match a query that specifies the entity type/subtype and conditions (i.e., attribute–value pairs) to be satisfied by the entities. The results of GERQ consist of tuples.

The SECQ function refers to the capability of retrieving all entity-centric facts, given the type, subtype, and name of an entity. The results of SECQ consist of tuples.

The ECFQ function refers to retrieving commonalities between two specified entities of the same entity type and subtype. The results of ECFQ consist of tuples, where attribute and value represent the commonly-shared attribute–value pairs, and note_1 and note_2 denote contextual notes for entity 1 and entity 2, respectively.

The DRFQ function allows retrieving direct relations between two specified entities, regardless of their respective entity types and subtypes. The results of DRFQ consist of tuples.

The IRFQ function allows retrieving 1-degree indirect relations between two specified entities. The results of IRFQ consist of tuples, where e1 and e2 stand for the two specified entities, e3 stands for a third, intermediary entity, and e1-e3_rel and e3-e2_rel stand for the relation between entity 1 and entity 3 and the relation between entity 3 and entity 2.

Finally, the CBEB function refers to retrieving (only) film entities by using the taxonomy of super-categories and categories. (In addition, the interface has a Slide function, which provides an image and brief introductory information for each film.)
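The query functions described in this section can be approximated over a small in-memory fact list, as in the following Python sketch. This is an illustration under stated assumptions, not the project's JSP/MySQL implementation: the toy data, the type and subtype names, the function signatures, the shape of each result, and the "inverse of" label for relations read in the reverse direction are all hypothetical.

```python
from typing import Dict, List, Tuple

Fact = Tuple[str, str, str, str]  # assumed layout: (entity, attribute, value, note)

# Hypothetical toy data; placeholders, not knowledge-base content.
TYPES: Dict[str, Tuple[str, str]] = {
    "Film A": ("Work", "Film"),
    "Film B": ("Work", "Film"),
    "Person X": ("Person", "Director"),
    "Person Y": ("Person", "Actor"),
}
FACTS: List[Fact] = [
    ("Film A", "director", "Person X", ""),
    ("Film A", "genre", "Drama", ""),
    ("Film A", "cast member", "Person Y", "lead role"),
    ("Film B", "director", "Person X", ""),
    ("Film B", "genre", "Drama", "listed first"),
    ("Film B", "country", "France", ""),
]


def gerq(etype: str, subtype: str, conditions: List[Tuple[str, str]]) -> List[str]:
    """Entities of the given type/subtype that satisfy every (attribute, value) condition."""
    hits = []
    for name, typing_ in TYPES.items():
        if typing_ != (etype, subtype):
            continue
        pairs = {(a, v) for e, a, v, _ in FACTS if e == name}
        if all(cond in pairs for cond in conditions):
            hits.append(name)
    return hits


def secq(name: str) -> List[Tuple[str, str, str]]:
    """All (attribute, value, note) facts about one entity."""
    return [(a, v, n) for e, a, v, n in FACTS if e == name]


def ecfq(e1: str, e2: str) -> List[Tuple[str, str, str, str]]:
    """Shared attribute-value pairs with each entity's own note."""
    if TYPES[e1] != TYPES[e2]:
        raise ValueError("ECFQ expects two entities of the same type and subtype")
    notes1 = {(a, v): n for e, a, v, n in FACTS if e == e1}
    notes2 = {(a, v): n for e, a, v, n in FACTS if e == e2}
    return [(a, v, notes1[(a, v)], notes2[(a, v)]) for (a, v) in notes1 if (a, v) in notes2]


def drfq(e1: str, e2: str) -> List[Tuple[str, str, str]]:
    """Direct relations: facts of one entity whose value is the other entity."""
    pairs = {(e1, e2), (e2, e1)}
    return [(e, a, v) for e, a, v, _ in FACTS if (e, v) in pairs]


def _links(x: str) -> List[Tuple[str, str]]:
    """(relation label, neighbouring entity) pairs, read in either direction."""
    out = [(a, v) for e, a, v, _ in FACTS if e == x and v in TYPES]
    out += [("inverse of " + a, e) for e, a, v, _ in FACTS if v == x]
    return out


def irfq(e1: str, e2: str) -> List[Tuple[str, str, str, str, str]]:
    """1-degree indirect relations: (e1, e1-e3 relation, e3, e3-e2 relation, e2)."""
    results = []
    for rel1, e3 in _links(e1):
        if e3 in (e1, e2):
            continue  # skip self-links and direct relations
        for rel2, other in _links(e3):
            if other == e2:
                results.append((e1, rel1, e3, rel2, e2))
    return results


print(gerq("Work", "Film", [("director", "Person X"), ("genre", "Drama")]))
print(ecfq("Film A", "Film B"))
print(irfq("Film A", "Film B"))
```

CBEB and the Slide function are left out because they depend on the category taxonomy and on images, neither of which is modeled here.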

6.2 Search Process
Figure 2 shows a flowchart illustrating the search process using each of the six semantic entity/fact/relation retrieval functions.

6.3 Intelligent Interaction
The interface is designed and implemented to be highly interactive and intelligent. The query form for each function is presented in a step-by-step manner, according to the user action in the previous step. For example, in GERQ: Given the user selection of an entity type, only relevant entity subtypes are displayed. In turn, given the selection of a particular entity subtype, only relevant attributes are presented. Finally, given the selection of an attribute, only relevant values are provided for user selection. Figure 3 illustrates progressive steps of a sample GERQ input process.
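The step-by-step narrowing of menu options can be sketched as a lookup in a nested mapping, where each selection restricts what is offered next. The menu contents and the function name below are hypothetical; the actual interface derives its options from the knowledge base and is implemented in JavaScript and JSP.

```python
from typing import Dict, List

# Hypothetical menu data: type -> subtype -> attribute -> values.
MENU: Dict[str, Dict[str, Dict[str, List[str]]]] = {
    "Work": {
        "Film": {"genre": ["Comedy", "Drama"], "country": ["France", "Japan"]},
    },
    "Person": {
        "Director": {"nationality": ["French", "Japanese"]},
        "Actor": {"nationality": ["French"], "gender": ["female", "male"]},
    },
}


def next_options(selection: List[str]) -> List[str]:
    """Return the options for the next step, given the selections made so far.

    selection is a prefix of [type, subtype, attribute]; an empty list
    yields the entity types.
    """
    node = MENU
    for choice in selection:
        if not isinstance(node, dict) or choice not in node:
            raise KeyError(f"invalid selection: {choice!r}")
        node = node[choice]
    return sorted(node) if isinstance(node, dict) else list(node)


print(next_options([]))                              # entity types
print(next_options(["Person"]))                      # subtypes of Person
print(next_options(["Person", "Actor"]))             # attributes of Actor
print(next_options(["Person", "Actor", "gender"]))   # values of gender
```

Because the options at each step are read from the branch selected so far, an attribute that does not apply to the chosen subtype is never offered.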

Figure 1. PanAnthropon FilmWorld interface.

Figure 2. Flowchart of semantic search process.

Figure 3. Progress of sample GERQ input process.

7. EVALUATION
Two types of evaluation have been performed in order to evaluate (1) the effectiveness of the semantic knowledge extraction system (http://dlib.ischool.drexel.edu:8080/sofia/PA/EvalSumm_IE.pdf), and (2) the effectiveness of the semantic search/retrieval interface (http://dlib.ischool.drexel.edu:8080/sofia/PA/EvalSumm_IR.pdf). Here only a summary of the second evaluation is given.

7.1 Experimental Design
The evaluation was performed by conducting an experiment using human subjects, which required each subject to perform a retrieval task on the PanAnthropon FilmWorld interface and the Internet Movie Database (IMDb) (http://www.imdb.com/) interface. All subjects were given the same set of 10 questions, divided into two subsets of 5 questions each. One half of the subjects (N=12) first answered Subset 1 using IMDb and then Subset 2 using (the GERQ function of) PanAnthropon (PA); the other half (N=12) first answered Subset 1 using PanAnthropon and then Subset 2 using IMDb. Three variations of question ordering were used for each subset, as shown in Table 1. (The questions were re-labeled when presented to each subject, in order to prevent any bias.)

Table 1. Experimental task design

7.2 Main Task Results
For each subject, precision and recall scores of each task question response were computed, based on the weighted correctness score of each answer item in the response. Per-subject average precision and recall scores were then computed, for each of the two subsets of questions answered by using either IMDb or PanAnthropon. Finally, average precision and recall scores for subjects as a whole (N=24) were computed. Table 2 shows the summary results. Table 3 shows the number of subjects with average precision and recall greater than 90%. Table 4 shows that all 24 subjects had higher average precision/recall on the PanAnthropon subset.

Table 2. Summary of main task results

Table 3. Number of subjects w/ average precision/recall > 90%

Table 4. Number of subjects w/ higher precision/recall on PA

7.3 Post-Task Questionnaire Responses
A post-task questionnaire consisting of 8 questions was filled out by each of 33 subjects who participated in the experiment (incl. 9 subjects whose main task results were excluded from analysis). Table 5 shows the summary results on Yes/Maybe/No questions. All 33 subjects agreed on the superior efficacy of PanAnthropon. The reasons for its effectiveness, given by the subjects, included: no need to guess right keywords; step-by-step search process; easy comprehensibility of entity types/subtypes/attributes; ability to search for specific entities; ability to specify multiple conditions; no need to browse multiple pages to find answers; ease of making comparisons; absence of extraneous information in query results.

Table 5. Summary of post-task questionnaire responses

8. CONCLUSION
The main contribution of this project consists in demonstrating the utility, feasibility, and effectiveness of entity/fact retrieval. The results of evaluation have shown that most subjects found the PanAnthropon interface to be not only effective but also easily understandable and highly usable, despite their limited exposure.

9. ACKNOWLEDGMENTS
This research has been partially supported by the 2011 Eugene Garfield Doctoral Dissertation Fellowship awarded to the first author by Beta Phi Mu, the International Library & Information Studies Honor Society.

10. REFERENCES
[1] Athenikos, S.J., and Lin, X. 2011. Enabling type/condition-specified entity/fact retrieval using semantic knowledge extracted from Wikipedia. To be presented at (and published in Proceedings of) the First International Workshop on Search & Mining Entity-Relationship Data (SMER’11) (Glasgow, UK, 28 October 2011), co-located with the 20th ACM Conference on Information and Knowledge Management (CIKM 2011).

[2] Auer, S., and Lehmann, J. 2007. What have Innsbruck and Leipzig in common?: extracting semantics from wiki content. In Proceedings of 4th European Semantic Web Conference (ESWC 2007) (Innsbruck, Austria, 3–7 June 2007).

[3] Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., and Ives, Z. 2007. DBpedia: a nucleus for a Web of open data. In LNCS 4825: Proceedings of the 6th International Semantic Web Conference (ISWC 2007) and the 2nd Asian Semantic Web Conference (ASWC 2007) (Busan, South Korea, 11–15 November 2007). Springer-Verlag, Berlin/Heidelberg, 722-735.

[4] Suchanek, F.M., Kasneci, G., and Weikum, G. 2007. YAGO: a core of semantic knowledge unifying WordNet and Wikipedia. In Proceedings of the 16th Int’l World Wide Web Conference (WWW 2007) (Banff, Alberta, Canada, 8-12 May 2007). ACM Press, New York, NY, 2007, 697-706.

[5] Suchanek, F.M., Kasneci, G., and Weikum, G. 2008. YAGO: a large ontology from Wikipedia and WordNet. Web Semantics: Science, Services and Agents on the World Wide Web 6, 3 (September 2008), 203-207.
