

2008 Workshop on Knowledge Discovery and Data Mining

The Application and Research of Ontology Construction Technology

Wenwen Yi, Yong Sun, Shukui Zhang, Yingfeng Wu and Zhenhua Chu
(Department of Computer Science and Technology, SooChow University, Suzhou 15006, China)
{210513078,suny,zhangsk,210513068, 064227065005}@suda.edu.cn

Abstract

In the field of search, the application of ontology is an important research topic. Introducing ontology technology into a retrieval system with massive data can make the search results more comprehensive. However, ontologies are nowadays constructed by domain experts, which has many shortcomings, such as a complex process, long project duration, and difficulty of updating. Therefore, after analyzing a variety of existing methods and techniques, this paper proposes a method of building ontology semiautomatically. The building process, which is based on user interests, mines not only concepts but also the potential relationships between concepts from texts by means of concept clustering. On the basis of this research, a unique Patent Information Retrieval System based on ontology has been completed.

KeyWords: Building Ontology Semiautomatically, Patent Retrieval, WordNet, Obtaining Users' Interests

0-7695-3090-7/08 $25.00 © 2008 IEEE DOI 10.1109/WKDD.2008.108

1. Introduction

Today, science and technology are the primary productive force, and according to authoritative data from the World Intellectual Property Organization, 90% of the latest technologies first appear in the form of patents. Nearly 1 million patents are granted worldwide each year. Patent information is enormous, and it is still increasing and changing. The bottleneck for the performance of a patent information retrieval system is how to provide users with the fastest retrieval service and the most comprehensive information over such massive and dynamically changing data resources. Traditional keyword queries cannot give users comprehensive search results, because of the restrictions of the knowledge domain and the inherent deficiencies of keyword queries. For example, when a user enters the keyword "computer", the retrieval system returns only results that mention "computer" and not results that mention "PC", although in most cases "PC" and "computer" have the same meaning.

Although the introduction of ontology can improve this situation, most application ontologies are built manually by domain experts. The manual method has many disadvantages, such as a complicated process, heavy workload, difficulty of updating, and over-dependence on experts. This paper presents a semi-automatic, dictionary-based ontology building scheme and gives a comprehensive description of the system model for semiautomatic ontology building, which is designed for the Patent Information Retrieval System.

2. Technologies of ontology construction

Technologies of ontology construction can be divided into four categories: extracting ontology from free text, constructing ontology based on a dictionary, extracting ontology from a knowledge base, and constructing ontology from a relational model. We focus on extracting ontology from free text through natural language processing technologies. According to the method used, this technology can be divided into concept-based clustering, association-based, and model-based methods.

Considering the characteristics of the Patent Information Retrieval System, this paper adopts the concept clustering method to construct the ontology. The construction process is bottom-up. However, it is very difficult to establish a precise ontology fully automatically. Three problems impede automatic ontology construction: the acquisition of concepts, the acquisition of relationships between concepts, and the description of ontology terms. Some researchers abroad have studied ontology construction, but an automatic construction process that starts from nothing is very difficult. To address these problems, this paper presents a semiautomatic construction method with intermediate expansion. The method is guided by user interests and is based on a dictionary.

2.1. Concept-based clustering methods

D. Faure adopted a hierarchical clustering method based on concepts [1], whose basic clustering unit is a fixed collocation of terms composed of a verb and a preposition. The method consists of two steps: conceptualization and clustering. L. Khan and others created ontology from text files by using clustering and WordNet [2]. The creation process is bottom-up. First, they build a hierarchical structure of documents by using clustering technologies and determine the position of each cluster in the overall structure. Then they assign an appropriate concept to each document cluster by using WordNet and a theme tracking algorithm, in order to create the ontology.

2.2. The method based on association rules

Maedche and others developed an ontology construction tool, Text-To-Onto, which is based on association rules [3]. The tool is an integrated environment that produces an initial version of the required domain ontology. The concepts in the created domain ontology include not only the professional concepts of the specific area but also concepts unrelated to it. The unrelated concepts are removed so that the terms in the domain ontology meet the requirements of the application. This method needs expert supervision, and the learning process has to be repeated in cycles.

2.3. The method based on pattern extraction

M. A. Hearst provided a method based on lexical patterns [4], which is used to find relationships between concepts. The method searches for new concepts related to existing concepts and judges whether a lexical pattern relationship holds between them; such an association is taken as the relationship between the concepts. The disadvantage is that the error rate is so high that the generated results need to be verified by experts.

3. The WordNet-based semiautomatic ontology construction model and tools

3.1. System Model Framework

For the Patent Information Retrieval System, semiautomatic construction is not an end in itself. The goal is to guide the search and serve users better. Patent information has a special structure and content, including Application Number, Application Date, Publish Date, International Publish Number, Main Class Number, Deputy Class Number, Patent Applicant, Patent Inventor, Applicant Address, Invention Title, Priority, Certification/Publish Date, Agent, Agency, Abstract, and other information. From these, we select the fields that best characterize a patent and use them as the data source for construction.

Design idea: after obtaining the users' interests through the Interests Access Model, we can determine the domain in which to construct the ontology. We then extend and develop the existing rudimentary ontology until the ontology construction is complete. The existing ontology is used to guide retrieval and provides more comprehensive and more intelligent information to users [6]. The framework, shown in Figure 1, can be divided into three parts:

(1) Interests Access Model: obtaining information about the domains the users are interested in.
(2) Information Mining Model: mining the data in the users' domain of interest.
(3) Ontology Construction Model: matching concepts, ontology construction, and extraction.

Throughout the process, the dictionary is the basis of the semiautomatic construction. Our experiments are mostly on English patent information retrieval, so we select the well-known English dictionary WordNet, which is one of the most authoritative dictionaries and provides many valuable resources for natural language processing and machine translation.

function words. In the lexical memory, nouns are organized into topical hierarchies, verbs are organized by various entailment relations, and adjectives and adverbs are organized in N-dimensional hyperspaces. In WordNet, a word form is represented by its familiar spelling, and a word meaning is represented by a set of synonyms, the Synset. Each Synset in WordNet has a unique ID and represents a single concept with a clear meaning. The semantic relations defined in WordNet can be regarded as pointers between Synsets.

3.2.3. Ontology document parsing tool. Expanding the ontology embryo requires parsing ontology documents. We use Jena as the ontology document parsing tool. Jena provides a Java-based, knowledge-based ontology access interface, including interfaces that can read and write classes in RDFS, DAML, and OWL form. An ontology parsed by Jena is represented as Statement objects.

Fig 1. System Frame

4. Design of all the function models

4.1. Interest access model

The users' interest access model saves the retrieval records as TXT files. The system uses ROST4.0 (an English word frequency statistics tool) to obtain the absolute frequency of terms in the users' search records. From this we get the set of keywords whose absolute frequency is highest, and from these keywords we determine the corresponding domain in the IPC (International Patent Classification). For example, we used this model to obtain the keywords related to 'elevator', and then decided to construct the ontology in the 'elevator' domain, whose IPC number is 'B66B1'. This is because most users are interested in 'elevator' and want to know more about it. Of the 2381 'elevator' records, 1576 are invention patent records and 814 are applied patent records.
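The absolute-frequency step of the interest access model can be illustrated with a minimal Python sketch. It does not reproduce ROST4.0; the search log, the stop-word list, and the function names `absolute_term_frequency` and `top_keywords` are hypothetical and serve only to show how the most frequent keywords might be picked from TXT search records.

```python
from collections import Counter
import re

# Hypothetical search log: one query per line, as it might be stored in a TXT file.
SEARCH_LOG = """\
elevator door control
elevator car safety brake
engine cooling
elevator speed control
display panel for elevator
"""

# Tiny illustrative stop-word list (an assumption, not taken from the paper).
STOP_WORDS = {"for", "the", "of", "and", "a", "in"}


def absolute_term_frequency(text: str) -> Counter:
    """Count how many times each term occurs in the search records."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def top_keywords(text: str, k: int = 3) -> list:
    """Return the k (term, count) pairs with the highest absolute frequency."""
    return absolute_term_frequency(text).most_common(k)


if __name__ == "__main__":
    for term, count in top_keywords(SEARCH_LOG):
        print(term, count)
```

In this toy log the term 'elevator' dominates, which is the kind of signal the model would use to choose the 'elevator' domain.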

3.2. System developing tools

3.2.1. Ontology editing tools. There are many ontology editing tools, such as the Protégé series, OntoEdit, OilEd, Ontolingua, OntoSaurus, and so on. This paper focuses on Protégé, a well-known ontology editing tool developed by Stanford University that provides a free and open-source ontology editing platform. Protégé is based on Java, so it can be used on different OS platforms, and it supports functional extension. It can edit classes, instances, and attributes directly. Because of these advantages, we adopt Protégé as the ontology editing tool in our system.

3.2.2. WordNet. WordNet is an online lexical reference system based on psycholinguistics. It divides the English vocabulary into five categories: nouns, verbs, adjectives, adverbs and

4.2. Information mining model


be constructed. Through knowledge mining, the ontology content is extended by extracting domain concepts from the data source. The extension process is repeated until the ontology is complete. Figure 2 shows the ontology embryo in the patent information domain.

The key of this model is concept detection; its main function is concept identification. Although the concepts in the domain are nouns, different types of fields require different operations in the mining process. For example, Date and Number fields are handled differently. As another example, Abstract information is of String type (Http://www.w3.org/TR/xmlschema-2/#string), so we cannot perform concept matching on it directly. We must first complete word segmentation and data cleaning; after that we can mine many potential concepts and the relationships among them. This process also requires computing the term frequency, which is different from the absolute TF introduced in Section 4.1. The TF computed in this section is called the relative frequency or normalized frequency. Formula (1) is TF-IDF [5] (Term Frequency Inverse Document Frequency).

$$W_{ik} = \frac{tf_{ik} \cdot \log\left(\frac{N}{n_k} + nf\right)}{\sqrt{\sum_{k=1}^{n} (tf_{ik})^2 \cdot \log^2\left(\frac{N}{n_k} + nf\right)}} \qquad (1)$$

[Figure 2: nodes IC, TI, AN, AD, AB, AGT, AGC, PA, AA; Agent, Patent Assignee; Key1, ……, Key2]

Fig 2. the Frame of Ontology Embryo

In the figure, rectangles stand for concepts that already exist in the embryo, and ellipses stand for new concepts extracted from the data. For patent information, the ontology shown in Figure 2 includes IC (Invention-Class), TI (Title), AN (Application-Number), AD (Application-Date), AB (Abstract), AGT (Agent), AGO (Agent-Organization), PA (Patent-Assignee), and AA (Assignee-Address).

When computing the term weight, we must consider three points:
(1) Term Frequency: the frequency of the given term. TF is a statistical weight of the term with respect to the text and is used to measure the importance of the term in the text.
(2) Inverse Document Frequency: it expresses how the given term is distributed over the text set and is usually computed as log(N/nk + 0.01), where N is the total number of texts in the training set and nk is the number of training texts that contain the given term. The smaller the idf, the more common the term. If all the texts contain the term, its idf is close to zero. This agrees with our usual experience that the more widespread a term is, the smaller its contribution.
(3) Normalization Factor: since the length of a text influences the term weights, the weights should be normalized so that the weight of each term lies in [0, 1].
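The three points above can be made concrete with a short Python sketch of a normalized TF-IDF weight. It assumes the common form in which each raw weight tf * log(N/nk + 0.01) is divided by the Euclidean norm of all raw weights in the same text; the constant 0.01 is taken from the idf expression in the text, and the function name `tfidf_weights` and the sample texts are hypothetical. The natural logarithm is used, and because the same logarithm appears in the numerator and the denominator, the choice of base does not change the normalized weights.

```python
import math
from collections import Counter

NF = 0.01  # assumed smoothing constant, taken from the log(N/nk + 0.01) form in the text


def tfidf_weights(docs):
    """Return, for each tokenized text, a dict mapping term -> normalized TF-IDF weight."""
    n_texts = len(docs)
    # n_k: number of texts that contain term k
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        # Raw weight: term frequency times smoothed inverse document frequency
        raw = {t: tf[t] * math.log(n_texts / doc_freq[t] + NF) for t in tf}
        # Normalization factor: Euclidean length of the raw weight vector
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weights.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return weights


if __name__ == "__main__":
    texts = [
        "elevator door control device".split(),
        "elevator car brake device".split(),
        "engine cooling device".split(),
    ]
    for w in tfidf_weights(texts):
        print({t: round(v, 3) for t, v in sorted(w.items())})
```

In the sample texts, 'device' occurs in every text and therefore receives a weight near zero, while a term that occurs in only one text receives the largest weight.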

4.3. Ontology construction model

4.3.1. Ontology Embryo. The ontology embryo must be created first, and on this basis the mature ontology can

4.3.2. Concepts matching. Before a newly detected concept is added to the ontology, a concept closely related to it must be found in the ontology embryo, because such a relationship is precise. In this system, WordNet-based computation of semantic similarity is used to detect the relationships between concepts. The vocabulary in WordNet is organized into Synsets, so the WordNet-based semantic similarity of two terms can be reduced to the similarity of two synsets: it is the maximum similarity over all pairs of synsets of the two given terms. The computation is shown in formula (2):

$$Sim(W_1, W_2) = \max_{S_{1i} \in S_1,\; S_{2j} \in S_2} Sim(S_{1i}, S_{2j}) \qquad (2)$$

In formula (2), Sim(W1, W2) is the similarity of W1 and W2, Sim(S1i, S2j) is the similarity of synset S1i and synset S2j, and S1 (S2) stands for the set of synsets of W1 (W2) in WordNet. Since a concept has a unique meaning in the given ontology, the quality of the instances is greatly improved if the similarity is computed after the Synset has been determined. This paper therefore proposes the following method to improve the process. First, we obtain some instances of the important concepts and relationships manually. Then the similarity between each instance and the concept is computed according to formula (2), and we save the ID of the synset with the maximum similarity. The synset ID that appears most frequently is taken as the meaning of the given keyword. When computing the similarity between training texts and keywords, formula (3) is used:

$$Sim(Key, W) = \max_{S_i \in S} Sim(K, S_i) \qquad (3)$$

In this formula, K stands for the meaning of the keyword, and S stands for the set of synsets of W, the term compared with the keyword. We adopt formula (4) to compute the similarity between two synsets:

$$Sim(S_i, S_j) = -\log \frac{\alpha\, Dis(S_i, S_j) + \beta\, \Delta Depth}{K} \qquad (4)$$

Dis(Si, Sj) is the distance from Si to Sj in the semantic tree of WordNet. ΔDepth is the difference between the shortest distances from Si and from Sj to their common superior Synset in the semantic tree of WordNet. In formula (4), α, β, and K are constants, with α + β = 1. As shown in Figure 3, if the flag is true, a synonym already exists in the ontology prototype. If the flag is false, the concept that has the MaxSimilarity value with Ci can be considered the parent concept of Ci.

There are many WordNet-based methods for calculating semantic similarity. The computation of the similarity of two Synsets has developed from considering only the distance between the two synsets in the WordNet semantic tree to also considering information such as depth and density in the semantic tree. In this paper, the meaning of the keyword in the ontology (see formula (3)) has already been determined. Its depth and density are therefore also determined, and they are not considered in our paper.

Fig 3. Individual Concept match with ontology algorithm

4.3.3. Ontology extraction. When new concepts are created, we add to the existing ontology those new concepts that have the maximum semantic similarity to the original concepts. The process of automatic or semiautomatic ontology construction needs continuous improvement. The original domain ontology is just an embryo ontology and does not contain comprehensive domain knowledge. The construction process continuously refines the domain ontology.
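The distance-based similarity of formula (4) and the flag / MaxSimilarity decision described for Figure 3 can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the taxonomy `PARENT` is a hypothetical toy tree standing in for the WordNet semantic tree, every term is given a single synset (so the maximum over synset pairs in formula (2) is trivial), and the constants α = 0.7, β = 0.3, K = 20 are arbitrary values chosen only to satisfy α + β = 1.

```python
import math

# Hypothetical toy taxonomy (child -> parent); a stand-in for the WordNet semantic tree.
PARENT = {
    "entity": None,
    "device": "entity",
    "conveyance": "device",
    "elevator": "conveyance",
    "escalator": "conveyance",
    "engine": "device",
    "motor": "engine",
    "document": "entity",
    "abstract": "document",
}

# Assumed constants; the paper only states that alpha + beta = 1 and K is constant.
ALPHA, BETA, K = 0.7, 0.3, 20.0


def ancestors(node):
    """Return [node, parent, grandparent, ..., root]."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def distances_to_common_ancestor(a, b):
    """Number of steps from a and from b up to their closest common ancestor."""
    chain_a, chain_b = ancestors(a), ancestors(b)
    for steps_a, node in enumerate(chain_a):
        if node in chain_b:
            return steps_a, chain_b.index(node)
    raise ValueError("no common ancestor")


def synset_similarity(a, b):
    """Formula (4): -log((alpha * Dis + beta * delta_depth) / K)."""
    steps_a, steps_b = distances_to_common_ancestor(a, b)
    dis = steps_a + steps_b                # path length between the two nodes
    delta_depth = abs(steps_a - steps_b)   # difference of the distances to the common ancestor
    if dis == 0:
        return math.inf                    # identical synsets
    return -math.log((ALPHA * dis + BETA * delta_depth) / K)


def match_concept(new_concept, embryo):
    """Return (flag, concept).

    flag is True when the concept already exists in the embryo; otherwise the
    returned concept is the embryo concept with maximum similarity, which is
    treated as the candidate parent of the new concept.
    """
    if new_concept in embryo:
        return True, new_concept
    best = max(embryo, key=lambda c: synset_similarity(new_concept, c))
    return False, best


if __name__ == "__main__":
    embryo = ["elevator", "engine", "abstract"]
    for concept in ["escalator", "motor", "elevator"]:
        print(concept, match_concept(concept, embryo))
```

With this toy tree, 'escalator' is attached under 'elevator' and 'motor' under 'engine', because those pairs have the shortest path and therefore the largest similarity.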

5. Experimental results

We select some keywords randomly and then obtain the retrieval results and the related results according to the general correlation between them. We compare retrieval with and without the scheme proposed in this paper; the results are shown in Table 1.

Table 1: Contrast of use and nonuse of the program

Key          Relevant   Before    After
Elevator        15268     8967    13582
Automatic        5864     1090     3671
◆Display        21662    10905    11773
◆Car            48678    24266    25619
Engine          15664     9876    12466
Axletree         9867     5554     7841
Equipment       49117    41553    46014
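The figures in Table 1 can be turned into recall values with a few lines of Python. The sketch assumes that the 'Relevant' column is the number of relevant records for a keyword and that every record counted under 'Before' and 'After' is relevant, so that recall is simply retrieved / relevant; this reading of the columns is an assumption, and the names `TABLE_1` and `recall_gains` are hypothetical.

```python
# Rows of Table 1: keyword -> (relevant, retrieved before, retrieved after)
TABLE_1 = {
    "Elevator": (15268, 8967, 13582),
    "Automatic": (5864, 1090, 3671),
    "Display": (21662, 10905, 11773),
    "Car": (48678, 24266, 25619),
    "Engine": (15664, 9876, 12466),
    "Axletree": (9867, 5554, 7841),
    "Equipment": (49117, 41553, 46014),
}


def recall_gains(table):
    """Return {keyword: (recall_before, recall_after, gain)} under the stated assumption."""
    result = {}
    for key, (relevant, before, after) in table.items():
        r_before = before / relevant
        r_after = after / relevant
        result[key] = (r_before, r_after, r_after - r_before)
    return result


if __name__ == "__main__":
    gains = recall_gains(TABLE_1)
    # Print keywords from the smallest to the largest recall gain
    for key in sorted(gains, key=lambda k: gains[k][2]):
        r_before, r_after, gain = gains[key]
        print(f"{key:10s} before={r_before:.3f} after={r_after:.3f} gain={gain:+.3f}")
```

Sorting by gain puts the two keywords marked with ◆ in the table at the bottom of the list.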

The table shows that the retrieval quantities for all keywords except 'Display' and 'Car' increased markedly. This is because the ontology we constructed belongs to the 'elevator' domain, so the retrieval results for keywords closely related to 'elevator' improve markedly. In contrast, keywords such as 'Display' and 'Car', which are less related to 'elevator', improve only slightly.

6. Conclusions and further research

This paper presents a semiautomatic, WordNet-based ontology construction scheme and its implementation in the Patent Information Retrieval System. The method, based on concept clustering, mines new concepts and new relationships between them from the abstract information of patents and constructs the ontology from them. Compared with construction by experts, this method greatly reduces the complexity of manual operation and the dependence on experts, and makes updating simple and convenient. Each function model and the implementation process of the system are described in detail in the paper. The experimental results show that the introduction of ontology effectively improves the recall rate.

In future work, this method will be applied to the retrieval of Chinese information. We will use ICTCLAS1.0, developed by the Institute of Computing Technology of the Chinese Academy of Sciences, for Chinese word segmentation, and introduce the dictionary HowNet. The system will then support both Chinese and English ontology construction and improve retrieval in both languages. Furthermore, we will build a semantic network on the system so that users can get exact results from fuzzy queries.

References

[1] Faure D. and Poibeau T., "First experiments of using semantic knowledge learned by ASIUM for information extraction task using INTEX [C]", in Proc. of the Workshop on Ontology Learning, 14th European Conference on Artificial Intelligence (ECAI'00), Berlin, 2000.
[2] Khan L. and Luo F., "Ontology construction for information selection [C]", in Proc. of 14th IEEE International Conference on Tools with Artificial Intelligence, Washington D.C., 2002.
[3] Maedche A. and Volz R., "The text-to-onto ontology extraction and maintenance environment [C]", in Proc. of the ICDM Workshop on integrating data mining and knowledge management, California, 2001.
[4] Hearst M. A., WordNet: an electronic lexical database [M], MIT Press, Cambridge, 1998.
[5] Maron M. E., "On relevance probabilistic indexing and information retrieval [J]", in Journal of the ACM, 1960, 7(3).
[6] Mike Uschold, "Ontologies: principle, methods and application", in The Knowledge Engineering Review, 1996, 11(2): 93-136.
[7] Enrico Motta, "Trends in knowledge modeling: report on the 7th KEML Workshop", in The Knowledge Engineering Review, 1997, 12(2): 202-217.
[8] Gruber, "Ontolingua: A Mechanism to Support Portable Ontologies Version 3.0", Technical report, KSL, Stanford University, 1994.
[9] Zhang H. and Song H.-t, "Fuzzy Related Classification Approach Based on Semantic Measurement for Web Document", in The International Conference on Data Mining, Hong Kong, 2006.
[10] Choi K.-S and Lee C.-H, "Document ontology based personalized filtering system", in Proc. of ACM Multimedia, pp. 362–364, 2000.
