SeRT - a tool for knowledge extraction from text

Caroline Barrière
School of Information Technology and Engineering
University of Ottawa
Ottawa, Ontario, Canada

CLiNE - May 24 2002

A few questions...
- Why knowledge extraction from text?
  For building a Knowledge Base...
- What's a Knowledge Base?
  It depends on who defines it...
  - From a terminological standpoint: a static repository of domain-specific knowledge, giving the important concepts and their relations.
- What kind of relations?
  Hyperonymy (is-a), meronymy (part-of), synonymy, function, definition, causality
- Why start from text? What are the alternatives?

Semantic Relations in Text (SeRT)

- Goal: starting from a corpus of texts on a specific domain, capture and store the important concepts (terms) of that domain, as well as their relations.

- Hypotheses
  - definitions can be derived from text analysis
  - text is used as language and meta-language
  - paradigmatic relations can be found in texts by pattern search
  - present knowledge representation formalisms allow the representation of this information

Example of a pattern search for hyperonymy (Corpus on Composting)

In clay soils, organic materials such as compost and pine bark increase drainage and air space.
Some yard wastes, such as wood chips, are very difficult to compost fully and are therefore not suitable for incorporation into the soil.
Grass clippings and other green vegetation tend to have a higher proportion of nitrogen (and therefore a lower C/N ratio) than brown vegetation such as dried leaves or wood chips.
To help meet that requirement, North Carolina passed a law that prohibits depositing organic yard wastes such as leaves, grass clippings, or tree trimmings in the state's landfills.

Table 2. Semantic relation hypernym found through the patterns "such as" and "and other"

SeRT - Features
- parallel search of terms and relations
  - term extraction
  - search for surface patterns leading to semantic relations
- focus on user interaction (nothing fully automatic)
  - term selection and validation
  - user definition of surface patterns corresponding to semantic relations
  - user selection of concepts involved (tuple) in the semantic relation
- raw text used (no preprocessing necessary)
- easy access to KB: save and retrieval
- to be used in "bootstrapping" mode

Term extraction
- Usage of a stop list
  a, able, about, above, according, accordingly, across, actually …
- appropriate method for English (but maybe not for French)
  satellite link - liaison par satellite
  laser printer - imprimante au laser
  communication network - réseau de communication
- no syntactic analysis
  - different from:
    Daille 1994: linguistic patterns (French)
    Bourigault 1994: morpho-syntactic markers (French)
- lemmatization
  'moving quickly' → 'mov[ing] quick[ly]' → 'mov* quick*'
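The stop-list idea can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the SeRT implementation: the class name `StopListTerms`, the use of punctuation and stop words as term boundaries, and the counting of every word sequence of up to `maxLen` words between two boundaries are all choices made for this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class StopListTerms {
    /**
     * Counts candidate terms without any syntactic analysis.
     * Assumption of this sketch: stop words and punctuation act as boundaries,
     * and every sequence of 1..maxLen words between boundaries is a candidate.
     */
    static Map<String, Integer> extract(String text, Set<String> stopList, int maxLen) {
        Map<String, Integer> counts = new HashMap<>();
        // Split on punctuation first so that no candidate spans a comma or a period.
        for (String clause : text.toLowerCase().split("[^a-z0-9' -]+")) {
            List<String> run = new ArrayList<>();
            for (String token : clause.trim().split("\\s+")) {
                if (token.isEmpty() || stopList.contains(token)) {
                    countNGrams(run, maxLen, counts);
                    run.clear();
                } else {
                    run.add(token);
                }
            }
            countNGrams(run, maxLen, counts);
        }
        return counts;
    }

    private static void countNGrams(List<String> run, int maxLen, Map<String, Integer> counts) {
        for (int start = 0; start < run.size(); start++) {
            int limit = Math.min(run.size(), start + maxLen);
            for (int end = start + 1; end <= limit; end++) {
                counts.merge(String.join(" ", run.subList(start, end)), 1, Integer::sum);
            }
        }
    }

    public static void main(String[] args) {
        Set<String> stop = Set.of("a", "the", "is", "to");
        String text = "The compost pile is a heap. Add water to the compost pile.";
        System.out.println(extract(text, stop, 2));
    }
}
```

The candidates produced this way still include noise (for example "add water"), which is consistent with the slides' emphasis on term selection and validation by the user.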

Results - Corpus on "composting" - Terms

Freq.  Term            Freq.  Term            Freq.  Term
503    compost         402    compost         402    compost
373    pile            369    pile            369    pile
258    composting      199    soil            295    materi*
202    soil            187    composting      260    compost*
170    materials       149    material        199    soil
155    material        146    materials       133    nitrogen
142    nitrogen        133    nitrogen        105    compost pile
110    compost pile    105    compost pile    105    temperatur*
103    water           102    bin             102    bin
102    bin             96     time            96     time
100    time            95     water           95     leav*
92     leaves          94     Compost         95     water
83     bacteria        85     leaves          94     Compost

Search for patterns indicating semantic relations
- pre-encoded patterns (earlier work - Barrière 1997)
  - find list from all other authors
- pattern search has multiple possibilities:
  - string matching
  - lemmatized token matching
  - part of speech matching
    - inclusion of a dictionary look-up (derived from Collins + morphological rules added)
- possibility of searching for a pattern around one term
  - usually what Computational Terminologists want to do
- display limited or enlarged context
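As an illustration of the simplest case, string matching with a limited or enlarged context, here is a minimal Java sketch. The class name `PatternContext` and the idea of measuring context as a number of words on each side are assumptions of this example; the slides do not say how SeRT delimits context.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PatternContext {
    /** Returns each occurrence of the pattern with up to `window` words on each side. */
    static List<String> find(String text, String pattern, int window) {
        String[] words = text.trim().split("\\s+");
        String[] pat = pattern.trim().split("\\s+");
        List<String> hits = new ArrayList<>();
        for (int i = 0; i + pat.length <= words.length; i++) {
            if (matchesAt(words, pat, i)) {
                int from = Math.max(0, i - window);
                int to = Math.min(words.length, i + pat.length + window);
                hits.add(String.join(" ", Arrays.copyOfRange(words, from, to)));
            }
        }
        return hits;
    }

    // Plain, case-insensitive string matching of the pattern words at position i.
    private static boolean matchesAt(String[] words, String[] pat, int i) {
        for (int j = 0; j < pat.length; j++) {
            if (!words[i + j].equalsIgnoreCase(pat[j])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String sentence = "Some yard wastes, such as wood chips, are very difficult to compost fully";
        System.out.println(find(sentence, "such as", 2)); // limited context
        System.out.println(find(sentence, "such as", 6)); // enlarged context
    }
}
```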

Example of search patterns

Hyperonymy
  such as                                                   (string matching)
  and other *|n                                             (string + POS)
  includ* *|n                                               (lemmatized string + POS)
  *|n is a *|a of [~part]                                   (negative filter)
  *|y organic materi* [mostly, especially, specifically]    (positive filter) + (search with specific term)

Synonymy
  known as                                                  (string matching)
  also called                                               (string matching)

Meronymy
  contains *|n                                              (string + POS)
  is a *|a part of                                          (string + POS)
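One possible reading of these pattern elements can be sketched as a token matcher in Java. Everything here is an assumption made for illustration: the class name `TokenPattern`, the interpretation of a trailing `*` as a prefix wildcard, and the interpretation of `*|n` as "any word whose dictionary entry lists the POS code n". The filters in square brackets are not handled, and the dictionary lines in the example are hypothetical entries written in a `word,codes` form.

```java
import java.util.HashMap;
import java.util.Map;

final class TokenPattern {
    private final Map<String, String> posCodes = new HashMap<>();

    /** Loads dictionary lines written as "word,codes", for example "ground,anv". */
    TokenPattern(String... dictionaryLines) {
        for (String line : dictionaryLines) {
            String[] parts = line.split(",", 2);
            if (parts.length == 2) {
                posCodes.put(parts[0], parts[1]);
            }
        }
    }

    /** Matches one token against one pattern element such as "other", "includ*" or "*|n". */
    boolean matches(String element, String token) {
        String word = token.toLowerCase();
        int bar = element.indexOf('|');
        if (bar < 0) {
            return matchesString(element, word);
        }
        // String part before the bar, POS code after it (looked up in the dictionary).
        String codes = posCodes.getOrDefault(word, "");
        return matchesString(element.substring(0, bar), word)
                && codes.contains(element.substring(bar + 1));
    }

    private static boolean matchesString(String element, String word) {
        if (element.equals("*")) {
            return true;
        }
        if (element.endsWith("*")) {
            return word.startsWith(element.substring(0, element.length() - 1));
        }
        return word.equals(element);
    }

    /** True if all pattern elements match the tokens beginning at position `start`. */
    boolean matchesAt(String[] pattern, String[] tokens, int start) {
        if (start + pattern.length > tokens.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (!matches(pattern[i], tokens[start + i])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        TokenPattern tp = new TokenPattern("vegetation,n", "green,anv");
        String[] tokens = "Grass clippings and other green vegetation".split(" ");
        System.out.println(tp.matchesAt(new String[] {"and", "other", "*|n"}, tokens, 2));
    }
}
```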

regular dictionary:
  aback,y  abactinally,y  abashedly,y  abdominally,y  abed,y  abhorrently,y

irregular dictionary:
  a',a  ablebodied,a  ablebodieder,a  ablebodiedest,a  abranchial,a  abranchialer,a  abranchialest,a

entries with multiple POS:
  roughcast,nv  huggermugger,anvy  broadcast,anvy  ground,anv  like,acnrvy  cut,anv  draft,nv

Sizes, in the order given on the slide: 77,000 (1046 kb); 26,000 (387 kb); 94,000 (333 kb)
TOTAL: 197,000 entries (1766 kb)

public String[][] inflect =
{
    // plural nouns
    { "", "s" }, { "", "es" }, { "y", "ies" }, { "an", "en" }, { "um", "a" }, { "", "e" }, { "us", "i" },
    // ... (further rules elided on the slide)

    // comparative adjectives
    { "", "er" }, { "e", "er" }, { "y", "ier" }, { "c", "caler" }, { "", "der" },
    // ... (the slide is cut off here)
};
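The slide does not show how the rule table is applied. A plausible reading, and it is only an assumption, is that each pair gives the ending of the base form and the ending of the inflected form, so that candidates are generated by swapping one for the other. The sketch below follows that reading; the class and method names are hypothetical, and only the plural-noun pairs visible on the slide are copied.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class Inflector {
    // Pairs copied from the slide, read here as { base ending, inflected ending }.
    static final String[][] PLURAL_NOUNS = {
        { "", "s" }, { "", "es" }, { "y", "ies" }, { "an", "en" },
        { "um", "a" }, { "", "e" }, { "us", "i" }
    };

    /** Generates every form obtained by replacing a matching base ending with the inflected ending. */
    static Set<String> candidates(String base, String[][] rules) {
        Set<String> forms = new LinkedHashSet<>();
        for (String[] rule : rules) {
            if (base.endsWith(rule[0])) {
                String stem = base.substring(0, base.length() - rule[0].length());
                forms.add(stem + rule[1]);
            }
        }
        return forms;
    }

    public static void main(String[] args) {
        System.out.println(candidates("bacterium", PLURAL_NOUNS));
        System.out.println(candidates("city", PLURAL_NOUNS));
    }
}
```

Rules of this kind over-generate: for "city" they propose "citys" and "cityes" alongside "cities", so some later filtering or validation step would be needed.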

Information storage in the TKB
- transfer of info found at previous step
  - user selects the terms (concepts) around the pattern
  - semantic relation / pattern / tuple are stored in the TKB
- an uncertainty factor can also be added to the tuple
  - research on the causal relation has led us to realize the necessity of this information
  - applies to different relations
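To make the stored unit concrete, here is a small Java sketch of a relation / pattern / tuple entry with an optional uncertainty factor. The class names, the field names, and the 0.0 to 1.0 scale are hypothetical; the in-memory list merely stands in for the relational database mentioned later in the slides.

```java
import java.util.ArrayList;
import java.util.List;

/** One stored finding: the relation, the pattern that revealed it, and the two concepts. */
final class TkbEntry {
    final String relation;    // e.g. "meronym"
    final String pattern;     // e.g. "contain"
    final String place1;      // first concept of the tuple
    final String place2;      // second concept of the tuple
    final double uncertainty; // hypothetical scale: 0.0 = certain, 1.0 = very uncertain

    TkbEntry(String relation, String pattern, String place1, String place2, double uncertainty) {
        this.relation = relation;
        this.pattern = pattern;
        this.place1 = place1;
        this.place2 = place2;
        this.uncertainty = uncertainty;
    }
}

/** In-memory stand-in for the TKB, used only for this illustration. */
final class Tkb {
    private final List<TkbEntry> entries = new ArrayList<>();

    void store(TkbEntry entry) {
        entries.add(entry);
    }

    List<TkbEntry> byRelation(String relation) {
        List<TkbEntry> found = new ArrayList<>();
        for (TkbEntry entry : entries) {
            if (entry.relation.equals(relation)) {
                found.add(entry);
            }
        }
        return found;
    }
}
```

Keeping the pattern next to the tuple records which surface form produced each relation, which is what a later confidence estimate per pattern would need.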

Semantic relation extraction

Results - semantic relations
- Exploration of a few patterns
  - contain? (meronymy)
  - such as & and other (hypernymy)

Fresh, young weeds from your irrigated garden can contain 60-70% moisture - no need to add water to them.
Leaves from eucalyptus, walnuts, and laurel trees contain tannins.
Every piece of organic material contains carbon and nitrogen in differing ratios.
Most compost also contains as much as 2 percent calcium.

Table 1. Semantic relation meronym found through the pattern "contain"

In clay soils, organic materials such as compost and pine bark increase drainage and air space.
Some yard wastes, such as wood chips, are very difficult to compost fully and are therefore not suitable for incorporation into the soil.
Grass clippings and other green vegetation tend to have a higher proportion of nitrogen (and therefore a lower C/N ratio) than brown vegetation such as dried leaves or wood chips.
To help meet that requirement, North Carolina passed a law that prohibits depositing organic yard wastes such as leaves, grass clippings, or tree trimmings in the state's landfills.

Table 2. Semantic relation hypernym found through the patterns "such as" and "and other"

relation      tuple (place 1)      tuple (place 2)
<meronym>     60-70% moisture      young weeds
<meronym>     tannins              leaves from eucalyptus tree
<meronym>     tannins              leaves from walnut tree
<meronym>     tannins              leaves from laurel tree
<meronym>     carbon               organic material
<meronym>     nitrogen             organic material
<meronym>     calcium              compost

relation      tuple (place 1)      tuple (place 2)
<hypernym>    compost              organic material
<hypernym>    pine bark            organic material
<hypernym>    wood chips           yard wastes
<hypernym>    grass clippings      green vegetation
<hypernym>    dried leaves         brown vegetation
<hypernym>    wood chips           brown vegetation
<hypernym>    leaves               organic yard wastes
<hypernym>    grass clippings      organic yard wastes
<hypernym>    tree trimmings       organic yard wastes

Table 3. Possible relations extracted from a text

Could we infer is-a relations and extend the type hierarchy?
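The slides leave this question open. One way to explore it, sketched below in Java, is to follow stored hypernym tuples transitively and, as a naive extra assumption of this example, to treat a compound term as a kind of the same term without its first modifier (for instance "organic yard wastes" is-a "yard wastes"). The class name and the heuristic are hypothetical, and the heuristic would be wrong for terms such as "leaves from eucalyptus tree", where the head noun comes first.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class IsAInference {
    /**
     * Collects every term reachable from `term` through stored is-a links
     * and through the modifier-dropping heuristic described above.
     */
    static Set<String> ancestors(String term, Map<String, Set<String>> isA) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> todo = new ArrayDeque<>();
        todo.push(term);
        while (!todo.isEmpty()) {
            String current = todo.pop();
            Set<String> direct = isA.getOrDefault(current, Set.of());
            List<String> parents = new ArrayList<>(direct);
            int space = current.indexOf(' ');
            if (space >= 0) {
                // Naive heuristic: drop the first modifier of a compound term.
                parents.add(current.substring(space + 1));
            }
            for (String parent : parents) {
                if (!parent.equals(term) && seen.add(parent)) {
                    todo.push(parent);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> isA = Map.of(
            "leaves", Set.of("organic yard wastes"),
            "dried leaves", Set.of("brown vegetation"));
        System.out.println(ancestors("dried leaves", isA));
    }
}
```

Any link inferred this way would be a suggestion for the user to validate rather than an established fact.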

SeRT use
- Parallel mode
  - searching on patterns can suggest terms to be explored
  - search on terms can suggest patterns around them
- Bootstrapping mode for relations
  - start with one pattern: enhance
  - tuple compost/soil found, used to find other patterns
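The second bootstrapping step, going from a known tuple back to the corpus, can be sketched as a search for sentences that mention both terms. This Java example is a simplification under assumptions of its own: the class name `TupleSearch` is hypothetical, sentences are split naively on end punctuation, and terms are matched as substrings, so "compost" also matches "composting".

```java
import java.util.ArrayList;
import java.util.List;

final class TupleSearch {
    /** Returns the sentences that mention both terms of a tuple, in either order. */
    static List<String> sentencesWith(String corpus, String term1, String term2) {
        String first = term1.toLowerCase();
        String second = term2.toLowerCase();
        List<String> hits = new ArrayList<>();
        // Naive sentence split: whitespace that follows '.', '!' or '?'.
        for (String sentence : corpus.split("(?<=[.!?])\\s+")) {
            String lower = sentence.toLowerCase();
            if (lower.contains(first) && lower.contains(second)) {
                hits.add(sentence.trim());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String corpus = "How does compost help soil structure? "
                + "Leaves from walnut trees contain tannins. "
                + "Composting will enhance your garden soil.";
        for (String sentence : sentencesWith(corpus, "compost", "soil")) {
            System.out.println(sentence);
        }
    }
}
```

The words found between or around the two terms in the returned sentences (here "help" and "enhance") are the candidates for new patterns.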

As a soil amendment, compost is thought to enhance the physical, chemical, and biological properties of soils.
When worm compost is added to soil, it boosts the nutrients available to plants and enhances soil structure and drainage.
This discussion is an attempt to enhance your understanding of the conditions which can lead to odor formation, in the hopes that they can be avoided or at least minimized in the future.
No matter your soil type, your climatic zone, or your choice of crops, composting will enhance your garden soil, resulting in stronger plants and healthier produce.

Table 4. Sentences containing the verbal pattern "enhance"

Before using compost, be sure to study a copy of any soil or waste chemical nutrient analyses, pesticide and heavy metal analyses, and stability tests that the producer of the compost performed.
When worm compost is added to soil, it boosts the nutrients available to plants and enhances soil structure and drainage.
How does compost help soil structure?
Some people get around the problem of nitrogen loss by adding bloodmeal to the soil before they bury the compost materials.
Composting is really quite simple, inexpensive, ecologically sound, and utterly failproof - no matter what you do, your pile will eventually rot into soil-enriching compost!
While compost is a panacea for all garden soils, poor soils especially will benefit from consistent applications.

Table 5. Some examples of the tuple "compost/soil" in the corpus

Future work

Short term (tool itself)

- Add list of predefined relations & patterns
- Add flexibility in pattern search
  - toward a mix of semantic and syntactic search
- Construction of a graphical representation of the semantic network built

Future work

Long term (tool + theoretical background)

- Work on compound nouns
  - much implicit information that could be put explicitly in the KB
- Work on representational scheme
  - the relational database is too limiting
  - causal relation requires a different type of representation
    - contexts for expressing the relation (possibly nested)
    - uncertainty factors
    - inferencing
- Explore pattern search in French
- Batch mode extraction (no user)
  - automatic selection of terms around patterns
  - after certain terms and patterns have been identified
  - need an integration of confidence levels on patterns
