The University of Tokyo Graduate School of Information Science and Technology

COLLECTIVE SEMANTIC ANNOTATION FOR WEB TEXT: TRIPLE TAGGING AND TRIPLE EXTRACTION

A Thesis in the Department of Creative Informatics by Jie Yang

© 2008 Jie Yang

Submitted in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

June 2008

Abstract

Semantic annotations are machine-understandable metadata attached to web resources. They represent the information contained in text documents in a structured format that is more amenable to applications in data mining, question answering, and the Semantic Web. Considerable research has been done in the field of semantic annotation. Classified by where the semantics of the annotations come from, existing studies fall into two categories: the “ontology-centric” class, which depends on a priori vocabularies (generally known as ontologies) to annotate web text, and the more recent “user-centric” class, which avoids pre-defined vocabularies and allows ordinary web users to annotate web text with few or no constraints.

This research on “collective semantic annotation” takes a user-centric approach. The goal of the work is to explore how semantic annotations for web text can be generated by exploiting the strengths of both ordinary web users and computers. Specifically, two questions are addressed. First, what user-centric support can be provided to encourage ordinary web users to annotate web text? Second, how can the annotation process be automated?

In answer to the first question, a user-centric annotation diagram, the triple tagging diagram, is proposed. I identify eight dimensions for describing annotation frameworks, review the existing literature in terms of these dimensions, and discuss the features and novelties of the triple tagging diagram. The diagram consists of three parts: the concept model, which defines annotation primitives; the collaboration model, which addresses information collection and navigation possibilities; and the ontology model, which provides a common definition for triple annotations so that they can be exchanged, re-used, and extended on the Web. A model evaluation comprising both qualitative and quantitative analysis is carried out. The evaluation demonstrates the expressive power of the triple tagging diagram and its advantages over existing work.

Regarding the second question, I propose an interactive approach that generates semantic annotations for web text automatically. In this approach, the annotation generation problem is defined as a binary relation extraction problem, and linguistic and machine learning techniques are exploited to solve it. Specifically, I propose the penalty tree similarity algorithm, an extension of the tree kernels widely used in the field of Information Extraction. A triple tagging corpus is created and used in experiments. The results show that the extended tree similarity algorithm achieves better performance.

As a result of this research, a triple tagging system, Triple-Note, is implemented with a client-server architecture. On the client side, an extension for the Firefox browser supports users’ annotating actions. On the server side, automatic extraction, annotation storage, and other service modules are implemented.
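The framing of annotation generation as binary relation extraction can be illustrated with a minimal sketch. It assumes that a triple has a subject, a predicate, and an object, in the style of RDF. The names `Triple`, `candidate_pairs`, `label_pairs`, and `toy_classifier` are hypothetical and do not come from the thesis, and the lookup-table classifier merely stands in for a learned model such as the tree similarity approach mentioned above.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Triple:
    """One annotation: a binary relation between two terms (illustrative field names)."""
    subject: str
    predicate: str
    obj: str


def candidate_pairs(terms):
    """Enumerate ordered pairs of distinct positions as candidate relation arguments."""
    return list(permutations(terms, 2))


def label_pairs(pairs, classify):
    """Turn each pair into a triple when `classify` returns a label (None means no relation)."""
    triples = []
    for left, right in pairs:
        label = classify(left, right)
        if label is not None:
            triples.append(Triple(left, label, right))
    return triples


def toy_classifier(left, right):
    """Hypothetical stand-in for a trained relation classifier: a fixed lookup table."""
    known = {("Tokyo", "Japan"): "capital_of"}
    return known.get((left, right))


terms = ["Tokyo", "Japan", "city"]
triples = label_pairs(candidate_pairs(terms), toy_classifier)
```

Under these assumptions, extraction reduces to deciding, for each ordered pair of terms in a sentence, whether a relation holds and what its label is. Only that decision step would change if a different classifier were plugged in.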


Table of Contents

List of Figures
List of Tables
Acknowledgments

Chapter 1 Introduction
  1.1 Context
    1.1.1 Annotation and Semantic Annotation
    1.1.2 Ontology-centric Semantic Annotation
    1.1.3 User-centric Semantic Annotation
  1.2 Research Questions
  1.3 Approach and Contribution
  1.4 Structure of the Thesis

Chapter 2 Review of Annotation Systems
  2.1 Eight Dimensions
  2.2 Review of Literature Work
    2.2.1 Group I: Ontological Approach
    2.2.2 Group II: Social Approach
    2.2.3 Group III: Bridging Approach
  2.3 Our Diagram

Chapter 3 Triple Tagging Model
  3.1 Conceptual Model
    3.1.1 Triple Tagging Primitives
    3.1.2 Tag Graph
    3.1.3 Mapping to RDF
  3.2 Exploiting Triple Tagging Model
    3.2.1 Collaboration
    3.2.2 Triple Query
    3.2.3 Augmented Navigation
  3.3 Triple Tagging Ontology
    3.3.1 Triple Tagging Ontology
    3.3.2 An Example
  3.4 Discussion

Chapter 4 Model Evaluation
  4.1 Model Expressiveness
    4.1.1 Triple Tagging and Semantic Wikis
    4.1.2 Triple Tagging and Social Tagging
    4.1.3 Triple Tagging and Google Base
  4.2 User Evaluation

Chapter 5 Sentence-Based Triple Extraction
  5.1 Definition of the Problem
    5.1.1 Motivation
    5.1.2 Binary Relation Extraction
  5.2 Related Work
  5.3 Syntactic Representation of Sentence
  5.4 Tree Kernels
    5.4.1 Kernel Methods
    5.4.2 Dependency Tree Kernel
  5.5 Penalty Tree Similarity
  5.6 Discussion

Chapter 6 An Interactive Approach for Triple Extraction
  6.1 Definition of the Problem
  6.2 Related Work
  6.3 Overview of the Process
  6.4 Pre-processing
    6.4.1 Dependency Parsing
    6.4.2 Semantic Tagging
  6.5 Word Pair Detecting
    6.5.1 POS Filtering
    6.5.2 Word Pair Filtering
  6.6 Relation Labeling and Triple Filtering
  6.7 Discussion

Chapter 7 Experiments
  7.1 Create a Corpus
  7.2 Does Hypothesis One Hold? Semantic Convergence
  7.3 Experiments
    7.3.1 Setup
    7.3.2 Precision and Recall
    7.3.3 Results

Chapter 8 Implementation System: Triple-Note
  8.1 User Interface
    8.1.1 Triple tagging web contents
    8.1.2 Triple recommendation
    8.1.3 Triple graph browsing
  8.2 Architecture

Chapter 9 Conclusions and Future Research Directions
  9.1 Research Justification
  9.2 Future Work
  9.3 Future Research Directions

Appendix A Triple Tagging Guideline
  A.1 Process
  A.2 Examples and Explanations

Appendix B OWL File of the Triple Tagging Model

Bibliography
Index
List of Abbreviations
Publications
