Evolutionary Learning of Syntax Patterns for Genic Interaction Extraction
Alberto Bartoli, Andrea De Lorenzo, Eric Medvet, Fabiano Tarlao, Marco Virgolin
UNIVERSITÀ DEGLI STUDI DI TRIESTE DIPARTIMENTO DI INGEGNERIA E ARCHITETTURA
Evolutionary Learning of Syntax Patterns
DIA - UniTs
Problem
➔ Identifying sentences that contain interactions between genes and proteins
  ◆ from biomedical literature
➔ Available data:
  ◆ dictionary of genes, proteins and interactors
  ◆ example sentences
Why?
➔ Biomedical literature is:
  ◆ vast
  ◆ rapidly growing
➔ Challenging problem: automatic extraction of knowledge from natural language text
  ◆ information is “diluted” in the text
  ◆ discovering relations between entities is especially hard
Goal
➔ Generate a classifier C that identifies sentences containing interactions between genes and proteins
  ◆ automatically
  ◆ based on recurring syntactic patterns
Our approach
➔ Classifier C is a set of regular expressions (regexes): C = {r1, r2, ...}
➔ Each regex is a sentence classifier (“accepts” or “does not accept”)
  ◆ C accepts the sentences accepted by at least one regex
➔ Regexes are applied to a semantic representation of the text
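The acceptance rule above can be sketched in Python as follows. This is only an illustration, not the authors' implementation: the single-character toy alphabet (G, I, w), the two patterns, and the use of `re.search` (a match anywhere in the string counts as acceptance) are assumptions made for the example.

```python
import re

def make_classifier(patterns):
    """Build a sentence classifier from a set of regexes.

    The classifier accepts a string if at least one regex matches it,
    i.e. it is the logical OR of the individual regex decisions.
    """
    compiled = [re.compile(p) for p in patterns]

    def accepts(phi_string):
        return any(r.search(phi_string) for r in compiled)

    return accepts

# Hypothetical toy alphabet: G = gene/protein, I = interactor, w = any other word.
C = make_classifier([r"G[^G]*I[^G]*G", r"IwG"])

print(C("GwwIwG"))  # accepted by the first regex
print(C("GwwG"))    # accepted by no regex
```

Adding a regex to the set can only enlarge the set of accepted sentences, which is why the false positive rate of each individual regex matters.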
Our approach (II)
➔ Regexes are generated automatically
  ◆ by means of Genetic Programming (GP)
  ◆ starting from examples
    ● strings which must be accepted
    ● strings which must not be accepted
Sentences preprocessing
Mapping of a sentence s into a ɸ-string x:
a. substitution of the words in s with “annotations”
   i. gene, protein, interactor, or
   ii. Part-Of-Speech
b. mapping of annotations to Unicode characters
c. concatenation
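The three steps above can be sketched as follows. The dictionary entries, the hand-written part-of-speech table, the fallback tag, and the choice of Private Use Area code points are all hypothetical; a real pipeline would use the provided dictionary and a POS tagger. The annotation names GENEPTN and INOUN are taken from the classifier example in these slides, and reading INOUN as an interactor noun is an assumption.

```python
# Hypothetical dictionary entries and a toy POS table (assumptions for this sketch).
GENES_PROTEINS = {"YfhP", "yfhQ"}
INTERACTORS = {"regulator": "INOUN"}
TOY_POS = {"may": "MD", "act": "VB", "as": "IN", "a": "DT", "negative": "JJ",
           "for": "IN", "the": "DT", "transcription": "NN", "of": "IN"}

# Step b: one Unicode character per annotation (Private Use Area, arbitrary choice).
ANNOTATIONS = sorted({"GENEPTN", *INTERACTORS.values(), *TOY_POS.values()})
SYMBOL = {a: chr(0xE000 + i) for i, a in enumerate(ANNOTATIONS)}

def annotate(word):
    """Step a: replace a word with its annotation."""
    if word in GENES_PROTEINS:
        return "GENEPTN"
    if word in INTERACTORS:
        return INTERACTORS[word]
    return TOY_POS.get(word, "NN")  # toy fallback for unknown words

def to_phi_string(sentence):
    """Steps a to c: annotate each word, map to characters, concatenate."""
    annotations = [annotate(w) for w in sentence.split()]
    return annotations, "".join(SYMBOL[a] for a in annotations)

s = "YfhP may act as a negative regulator for the transcription of yfhQ"
annotations, x = to_phi_string(s)
print(" ".join(annotations))
```

Because every annotation becomes exactly one character, a regex over the ɸ-string can treat each word position as a single symbol.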
Sentences preprocessing (II)
Example:
s = YfhP may act as a negative regulator for the transcription of yfhQ
↓
[YfhP] [may] [act] [as] [a] [negative] [regulator] [for] [the] [transcription] [of] [yfhQ]
Generation of C: GP
➔ We used a tree-based GP
➔ In this work, candidate solution = regex
Key aspects
➔ Multi-objective fitness:
  ◆ f = (Accuracy, FPR, Regex length)
  ◆ we purposely avoided including any problem-specific knowledge (gene/protein/…)
➔ Problem handled by means of separate-and-conquer
➔ Final output: set of regular expressions C = {r1, r2, ...}
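The fitness tuple above can be computed for a candidate regex roughly as shown here. This is a sketch under assumptions: the toy strings are invented, a match anywhere in the string counts as acceptance, and comparing candidates by Pareto dominance is one common way to handle several objectives, not necessarily the selection scheme used in this work.

```python
import re

def fitness(regex, positives, negatives):
    """Return (accuracy, FPR, regex length) for one candidate regex."""
    r = re.compile(regex)
    tp = sum(1 for x in positives if r.search(x))  # positives accepted
    fp = sum(1 for x in negatives if r.search(x))  # negatives accepted
    tn = len(negatives) - fp
    accuracy = (tp + tn) / (len(positives) + len(negatives))
    fpr = fp / len(negatives)
    return accuracy, fpr, len(regex)

def dominates(f, g):
    """Pareto dominance: higher accuracy, lower FPR and shorter regex are better."""
    a = (-f[0], f[1], f[2])
    b = (-g[0], g[1], g[2])
    return all(u <= v for u, v in zip(a, b)) and a != b

# Invented toy examples over the alphabet G / I / w.
positives = ["GIG", "GwIwG"]
negatives = ["GwG", "www"]
print(fitness("G.*I.*G", positives, negatives))
print(fitness("G", positives, negatives))
```

Note that none of the three objectives looks at what the symbols mean, which is consistent with the stated choice of keeping problem-specific knowledge out of the fitness.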
Separate-and-conquer
➔ Each regex ri ∈ C makes an independent and parallel classification
➔ Each regex is tailored for a sub-problem
  ◆ the problem is solved “step-by-step”
➔ Final output = logic OR of classifications
Separate-and-conquer
● C = ∅
● we execute a GP search over the examples, obtaining r*
● if FPR < threshold
  ○ C = C ∪ {r*}
● else
  ○ terminate
● remove from the positive examples those which were classified correctly by r*
● repeat the GP search on the remaining examples
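The loop above can be sketched as follows. The GP search itself is out of scope here, so `toy_search` is a hypothetical stand-in that picks the best regex from a fixed pool. Stopping when no positive examples remain, or when r* covers none of them, is an assumption added so that the sketch always terminates.

```python
import re

def false_positive_rate(regex, negatives):
    return sum(1 for x in negatives if re.search(regex, x)) / len(negatives)

def separate_and_conquer(search, positives, negatives, fpr_threshold):
    """Build C one regex at a time, removing covered positives after each step."""
    C = []
    remaining = list(positives)
    while remaining:  # assumption: stop when every positive is covered
        r_star = search(remaining, negatives)
        if false_positive_rate(r_star, negatives) >= fpr_threshold:
            break  # the "terminate" branch
        covered = {x for x in remaining if re.search(r_star, x)}
        if not covered:
            break  # assumption: guard against a search that makes no progress
        C.append(r_star)
        remaining = [x for x in remaining if x not in covered]
    return C

def toy_search(positives, negatives, pool=("GIG", "GwIG", "w")):
    """Hypothetical stand-in for the GP search: best regex from a fixed pool."""
    def score(p):
        tp = sum(1 for x in positives if re.search(p, x))
        fp = sum(1 for x in negatives if re.search(p, x))
        return tp - fp
    return max(pool, key=score)

positives = ["GIG", "wGIG", "GwIG"]
negatives = ["www", "GwG"]
print(separate_and_conquer(toy_search, positives, negatives, fpr_threshold=0.5))
```

In the toy run the first regex covers two positives and the second search only has to explain the one that is left, which is the "step-by-step" behaviour described on the previous slide.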
Classifier example
C = {r1, r2}
r1 = GENEPTN[^RB][^NNS VBN GENEPTN]++
r2 = . INOUN IN GENEPTN . [^DT NN]
Experimental evaluation: the data
➔ Dataset: 456 sentences from biomedical papers
  ◆ ½ with interactions and ½ without
  ◆ manually labelled by experts
➔ Dataset split into Learning and Testing
  ◆ ≈80% of the examples in Learning
  ◆ ≈20% of the examples in Testing
➔ 5 randomly generated folds
  ◆ with Testing_i ≠ Testing_j
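One way to obtain such folds is sketched below. It makes a stronger assumption than the slide states: the five Testing sets are built to be pairwise disjoint (standard k-fold), which implies they are different. The seed and the function name are arbitrary choices for the example.

```python
import random

def make_folds(n_examples, k=5, seed=0):
    """Return k (learning, testing) index splits with pairwise disjoint testing sets."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    folds = []
    for i in range(k):
        testing = indices[i::k]  # every k-th shuffled index, offset i
        testing_set = set(testing)
        learning = [j for j in indices if j not in testing_set]
        folds.append((learning, testing))
    return folds

folds = make_folds(456)
for learning, testing in folds:
    print(len(learning), len(testing))
```

With 456 sentences this gives Testing sets of 91 or 92 examples, i.e. roughly the 80/20 proportion quoted above.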
Baseline 1, 2: problem-specific knowledge
➔ Annotations-Co-Occurrence
  ◆ it is tightly tailored to this specific problem
  ◆ a sentence is positive if it contains
    ● at least 2 genes/proteins
    ● at least 1 interactor
➔ Annotations-LLL05-Patterns
  ◆ 10 patterns generated in “LLL'05 Challenge: Genic Interaction Extraction with Alignments and Finite State Automata” - J. Hakenberg et al.
  ◆ built over >90% of the dataset (including the testing data!)
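The Annotations-Co-Occurrence rule is simple enough to write down directly. The sketch below works on a list of annotations for one sentence; the tag names GENEPTN and INOUN come from the classifier example in these slides, while treating INOUN as the only interactor tag is an assumption.

```python
# Assumption: INOUN is the only interactor annotation; a real tag set may have more.
INTERACTOR_TAGS = {"INOUN"}

def co_occurrence_positive(annotations):
    """Positive if the sentence has at least 2 genes/proteins and at least 1 interactor."""
    genes = sum(1 for a in annotations if a == "GENEPTN")
    interactors = sum(1 for a in annotations if a in INTERACTOR_TAGS)
    return genes >= 2 and interactors >= 1

print(co_occurrence_positive(["GENEPTN", "MD", "VB", "INOUN", "IN", "GENEPTN"]))
print(co_occurrence_positive(["GENEPTN", "VB", "DT", "NN"]))
```

The rule ignores word order entirely, so it accepts any sentence that merely mentions the right kinds of entities.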
Baseline 3: ɸ-SSLEA
➔ Based on the Smart State Labeling Evolutionary Algorithm (SSLEA)
  ◆ algorithm for DFA learning
  ◆ works well in the presence of noise
➔ Hill-climbing
➔ Generates a DFA which accepts or rejects a ɸ-string x
  ◆ if x is accepted ⇒ x contains an interaction between genes/proteins
  ◆ otherwise, it does not
Baseline 4, 5: Words-NaiveBayes and Words-SVM
➔ Standard for text classification
  ◆ supervised Machine Learning methods
➔ Features based on word occurrences
➔ Preprocessing
  ◆ stemming
  ◆ feature selection
Results
Averaged over the 5 folds

Classifier                   Accuracy (%)   FPR (%)   FNR (%)
Annotations-Co-Occurrence    77.8           40.0      4.5
Annotations-LLL05-Patterns   82.3           25.0      10.5
Words-NaiveBayes             51.3           25.0      95.0
Words-SVM                    73.8           29.0      23.5
ɸ-SSLEA                      59.8           44.0      33.5
C                            73.7           23.5      22.5
Results (II)
➔ C performs as well as Words-SVM and better than the other learning approaches
➔ the accuracies of C and Annotations-Co-Occurrence (which exploits the domain knowledge of an expert) are very close
  ◆ Pro: C is composed of readable patterns (regexes)
  ◆ Con: time to generate C (hours) ≫ time to train the other methods (minutes)
    ● but the time taken for classifying is comparable (seconds)
Conclusions
We proposed:
➔ a method for the automatic synthesis of a classifier for natural language sentences
  ◆ based on syntactic patterns
  ◆ by means of GP
  ◆ separate-and-conquer
➔ results are highly promising
classroom use is granted without fee provided that copies are not made or distributed for profit or commercial ... GECCO '15, July 11-15, 2015, Madrid, Spain.
Jul 15, 2015 - ABSTRACT. There is an increasing interest in the development of techniques for automatic relation extraction from unstructured text. The biomedical domain, in particular, is a sector that may greatly benefit from those techniques due