Evolutionary Learning of Syntax Patterns for Genic Interaction Extraction

Alberto Bartoli, Andrea De Lorenzo, Eric Medvet, Fabiano Tarlao, Marco Virgolin

UNIVERSITÀ DEGLI STUDI DI TRIESTE DIPARTIMENTO DI INGEGNERIA E ARCHITETTURA

Evolutionary Learning of Syntax Patterns

DIA - UniTs

Problem

➔ Identifying sentences that contain interactions between genes and proteins
  ◆ from biomedical literature
➔ Available data:
  ◆ dictionary of genes, proteins and interactors
  ◆ example sentences


Why?

➔ Biomedical literature is:
  ◆ vast
  ◆ rapidly growing
➔ Challenging problem: automatic extraction of knowledge from natural-language text
  ◆ information is “diluted” in the text
  ◆ discovering relations between entities is especially challenging


Goal

➔ Generation of a classifier C in order to identify sentences containing interactions between genes and proteins
  ◆ automatically
  ◆ based on recurring syntactic patterns


Our approach

➔ Classifier C is a set of regular expressions (regexes): C = {r1, r2, ...}
➔ Each regex is a sentence classifier (“accepts” or “does not accept”)
  ◆ C accepts the sentences accepted by at least one regex
➔ Regexes are applied to a semantic representation of the text
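A minimal sketch of how such a set of regexes can act as one classifier. The two patterns are hypothetical toy examples over single-character annotations, not patterns from this work, and treating "accepts" as "matches somewhere in the string" is an assumption.

```python
import re


def make_classifier(patterns):
    """Build a sentence classifier C from a collection of regex strings."""
    compiled = [re.compile(p) for p in patterns]

    def accepts(x):
        # C accepts x if at least one regex accepts it.
        # Assumption: a regex "accepts" x when it matches anywhere in x.
        return any(r.search(x) is not None for r in compiled)

    return accepts


# Hypothetical toy patterns, for illustration only.
C = make_classifier([r"G.*J.*G", r"JiG"])
print(C("GB0if6JifJiG"))
print(C("B0if"))
```

Adding a regex to the set can only enlarge the set of accepted sentences, which is why each new regex has to keep its false positives low.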


Our approach (II)

➔ Regexes are generated automatically
  ◆ by means of Genetic Programming (GP)
  ◆ starting from examples
    ● strings which must be accepted
    ● strings which must not be accepted


Sentence preprocessing

Mapping of a sentence s into a ɸ-string x:
a. substitution of the words in s with “annotations”
   i. gene, protein, interactor, or
   ii. Part-Of-Speech
b. mapping of annotations to Unicode characters
c. concatenation


Sentence preprocessing (II)

Example:
s = YfhP may act as a negative regulator for the transcription of yfhQ
↓
[YfhP] [may] [act] [as] [a] [negative] [regulator] [for] [the] [transcription] [of] [yfhQ]
↓
[GENEPTN] [MD] [VB] [IN] [DT] [JJ] [INOUN] [IN] [DT] [INOUN] [IN] [GENEPTN]
↓
x = GB0if6JifJiG
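Steps b and c of the preprocessing can be sketched as follows. The symbol table is read off the single example above (one character per annotation, in order); the complete table used in this work is not shown on the slides, so `SYMBOLS` and `to_phi_string` are illustrative names only. Step a (tagging the words) is assumed to have been done already.

```python
# Annotation-to-character table inferred from the example on this slide.
# Assumption: the real table covers many more annotations.
SYMBOLS = {
    "GENEPTN": "G",
    "MD": "B",
    "VB": "0",
    "IN": "i",
    "DT": "f",
    "JJ": "6",
    "INOUN": "J",
}


def to_phi_string(annotations):
    """Map each annotation to its character and concatenate the result."""
    return "".join(SYMBOLS[a] for a in annotations)


annotations = ["GENEPTN", "MD", "VB", "IN", "DT", "JJ",
               "INOUN", "IN", "DT", "INOUN", "IN", "GENEPTN"]
x = to_phi_string(annotations)
print(x)  # GB0if6JifJiG
```

Because every annotation becomes exactly one character, a regex character class such as `[^f]` corresponds to "any annotation except DT".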


Generation of C: GP

➔ We used tree-based GP
➔ In this work, a candidate solution is a regex


Key aspects

➔ Multi-objective fitness:
  ◆ f = (Accuracy, FPR, Regex length)
  ◆ we purposefully avoided including any problem-specific knowledge (gene/protein/…)
➔ Problem handled by means of separate-and-conquer
➔ Final output: set of regular expressions C = {r1, r2, ...}
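The fitness tuple listed under Key aspects could be computed along these lines. This is a sketch under assumptions: a regex is taken to accept a ɸ-string when it matches anywhere in it, and regex length is taken to be the length of the pattern string. The slides do not say how the three objectives are ranked against each other, so the function only returns the tuple.

```python
import re


def fitness(regex, positives, negatives):
    """Return (accuracy, FPR, regex length) for one candidate regex."""
    r = re.compile(regex)
    tp = sum(1 for x in positives if r.search(x))  # positives accepted
    fp = sum(1 for x in negatives if r.search(x))  # negatives accepted
    tn = len(negatives) - fp                       # negatives rejected
    accuracy = (tp + tn) / (len(positives) + len(negatives))
    fpr = fp / len(negatives)
    return (accuracy, fpr, len(regex))


# Toy ɸ-strings, for illustration only.
f = fitness("G.G", positives=["GiG", "GJG"], negatives=["if", "Gi"])
print(f)  # (1.0, 0.0, 3)
```

Presumably accuracy is to be maximised while FPR and length are minimised; note that none of the three objectives refers to genes, proteins or interactors.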


Separate-and-conquer

➔ Each regex ri ∈ C makes an independent and parallel classification
➔ Each regex is tailored for a sub-problem
  ◆ the problem is solved “step-by-step”
➔ Final output = logical OR of the classifications


Separate-and-conquer

● C = ∅
● we execute a GP search over the examples, obtaining r*
● if the FPR of r* < threshold
  ○ C = C ∪ {r*}
● else
  ○ terminate
● remove from the positive examples those which were classified correctly by r*
● repeat from the GP search on the remaining examples
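The loop above can be sketched as follows. The GP search itself is not reproduced: `toy_search` is a hypothetical stand-in that simply returns the first remaining positive example as a literal pattern, and the "no progress" guard is an addition made for this sketch so that the loop always terminates.

```python
import re


def false_positive_rate(regex, negatives):
    """Fraction of negative examples accepted by the regex."""
    r = re.compile(regex)
    return sum(1 for x in negatives if r.search(x)) / len(negatives)


def separate_and_conquer(search, positives, negatives, threshold):
    """Grow C one regex at a time, removing the positives already covered."""
    C = []
    remaining = list(positives)
    while remaining:
        r_star = search(remaining, negatives)  # stand-in for one GP run
        if false_positive_rate(r_star, negatives) >= threshold:
            break                              # terminate
        C.append(r_star)
        matcher = re.compile(r_star)
        uncovered = [x for x in remaining if not matcher.search(x)]
        if len(uncovered) == len(remaining):
            break                              # guard: r* covered nothing new
        remaining = uncovered
    return C


def toy_search(positives, negatives):
    # Hypothetical placeholder, NOT genetic programming.
    return re.escape(positives[0])


C = separate_and_conquer(toy_search, ["GiG", "GJG", "GiG"], ["if", "fi"], 0.5)
print(C)  # ['GiG', 'GJG']
```

Each pass only has to explain the positives that earlier regexes missed, which is what makes every regex specialised for a sub-problem.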


Classifier example

C = {r1, r2}
r1 = GENEPTN[^RB][^NNS VBN GENEPTN]++
r2 = . INOUN IN GENEPTN . [^DT NN]


Experimental evaluation: the data

➔ Dataset: 456 sentences from biomedical papers
  ◆ ½ with interactions and ½ without
  ◆ manually labelled by experts
➔ Dataset split into Learning and Testing
  ◆ ≈80% of the examples in Learning
  ◆ ≈20% of the examples in Testing
➔ 5 randomly generated folds
  ◆ with Testing_i ≠ Testing_j


Baseline 1, 2: problem-specific knowledge

➔ Annotations-Co-Occurrence
  ◆ it is tightly tailored to this specific problem
  ◆ a sentence is positive if it contains
    ● at least 2 genes/proteins
    ● at least 1 interactor
➔ Annotations-LLL05-Patterns
  ◆ 10 patterns generated in “LLL'05 Challenge: Genic Interaction Extraction with Alignments and Finite State Automata” (J. Hakenberg et al.)
  ◆ built over >90% of the dataset (including the testing data!)
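The Annotations-Co-Occurrence rule is simple enough to sketch directly. The tag names `GENEPTN` and `INOUN` are taken from the preprocessing example; treating `INOUN` as the only interactor tag is an assumption made for this sketch, so the interactor tags are left as a parameter.

```python
def co_occurrence_positive(annotations,
                           entity_tag="GENEPTN",
                           interactor_tags=("INOUN",)):
    """Positive if the sentence has >= 2 genes/proteins and >= 1 interactor."""
    n_entities = sum(1 for a in annotations if a == entity_tag)
    n_interactors = sum(1 for a in annotations if a in interactor_tags)
    return n_entities >= 2 and n_interactors >= 1


sentence = ["GENEPTN", "MD", "VB", "IN", "DT", "JJ",
            "INOUN", "IN", "DT", "INOUN", "IN", "GENEPTN"]
print(co_occurrence_positive(sentence))
print(co_occurrence_positive(["GENEPTN", "VB", "DT"]))
```

The rule ignores word order entirely, so it accepts any sentence that merely mentions two entities and an interactor.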


Baseline 3: ɸ-SSLEA

➔ Based on Smart State Labeling Algorithm
  ◆ algorithm for DFA learning
  ◆ works well in the presence of noise
➔ Hill-Climbing
➔ Generates a DFA which accepts or rejects a ɸ-string x
  ◆ if x is accepted ⇒ x contains an interaction between genes/proteins
  ◆ otherwise, it does not


Baseline 4, 5: Words-NaiveBayes and Words-SVM

➔ Standard for text classification
  ◆ supervised Machine Learning methods
➔ Features based on word occurrences
➔ Preprocessing
  ◆ stemming
  ◆ feature selection


Results

Averaged over the 5 folds

Classifier                   Accuracy (%)   FPR (%)   FNR (%)
Annotations-Co-Occurrence    77.8           40.0      4.5
Annotations-LLL05-Patterns   82.3           25.0      10.5
Words-NaiveBayes             51.3           25.0      95.0
Words-SVM                    73.8           29.0      23.5
ɸ-SSLEA                      59.8           44.0      33.5
C                            73.7           23.5      22.5


Results (II)

➔ C performs as well as Words-SVM and better than the other learning approaches
➔ the accuracies of C and Annotations-Co-Occurrence (which exploits the domain knowledge of an expert) are very close
  ◆ Pro: C is composed of readable patterns (regexes)
  ◆ Con: time to generate C (hours) ≫ time to generate the other methods (minutes)
    ● but the time taken for classifying is about the same (seconds)


Conclusions

We proposed:
➔ a method for the automatic synthesis of a classifier for natural language sentences
  ◆ based on syntactic patterns
  ◆ by means of GP
  ◆ separate-and-conquer
➔ results are highly promising
