Gene Ontology Hierarchy-Based Feature Selection

Cen Wan
Alex A. Freitas
FEAST 2014
Classification Task in Data Mining

"Classification task builds a model or classifier for predicting the class of an instance, based on its attributes (features)." - Han et al., 2012

White-Box Classifiers
- Decision Tree
- Bayesian Classifiers
- K-Nearest Neighbours

Black-Box Classifiers
- Neural Networks
- Support Vector Machines

Selected Classifier: Bayesian Classifiers
Feature Selection in Data Mining

"Feature selection is a data pre-processing step for filtering out redundant or irrelevant features before classification." - Liu & Motoda, 1998

Hierarchical feature selection selects a subset of features by exploiting the pre-defined hierarchical information retained in the data.
Hierarchy Structure

[Figure: example of a hierarchy structure with multiple paths; each node is a term labelled with a binary value (0/1)]
Hierarchy Structure

[Figure: example hierarchy with binary-valued nodes]

Property of the Hierarchy Structure for GO
- If the value of a GO term equals "1", then all of its ancestor GO terms' values equal "1";
- If the value of a GO term equals "0", then all of its descendant GO terms' values equal "0".
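The two properties above can be checked mechanically. A minimal Python sketch (illustrative names, not the talk's code) that verifies a gene's annotation vector is consistent with the hierarchy:

```python
# Sketch (illustrative names): verify the GO-hierarchy property that
# every ancestor of an annotated ("1") term is itself annotated.
# values: dict GO term -> 0/1; ancestors: dict GO term -> set of ancestor terms.
def is_consistent(values, ancestors):
    for term, v in values.items():
        # a "1" term with any "0" ancestor violates the property
        if v == 1 and any(values.get(a, 0) == 0 for a in ancestors[term]):
            return False
    return True

ancestors = {"A": set(), "B": {"A"}, "C": {"A", "B"}}
print(is_consistent({"A": 1, "B": 1, "C": 1}, ancestors))  # True
print(is_consistent({"A": 0, "B": 1, "C": 0}, ancestors))  # False: B=1 but ancestor A=0
```

Note that the second property (a "0" term has all-"0" descendants) is the logical contrapositive of the first, so one check covers both.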
Related Works on Hierarchy-Based Feature Selection

Least Absolute Shrinkage and Selection Operator (LASSO)
- P. Zhao, G. Rocha, and B. Yu, "The composite absolute penalties family for grouped and hierarchical variable selection," The Annals of Statistics;
- R. Jenatton, J. Y. Audibert, and F. Bach, "Structured variable selection with sparsity-inducing norms," Journal of Machine Learning Research;
- J. Ye and J. Liu, "Sparse methods for biomedical data," ACM SIGKDD Explorations Newsletter;
- A. F. T. Martins, N. A. Smith, P. M. Q. Aguiar, and M. A. T. Figueiredo, "Structured sparsity in structured prediction," in Proc. of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP 2011).
The Gene Ontology (GO)

"The Gene Ontology project aims to provide dynamic, structured, unified/controlled vocabularies for the annotation of genes." - Gene Ontology Consortium, 2004
Hierarchy Structure in Gene Ontology

[Figure: GO hierarchy example from Gharib et al., 2011; visualized by AmiGO (Carbon et al., 2009)]
Naïve Bayes (NB) and Bayesian Network Augmented Naïve Bayes (BAN)

Naïve Bayes:

P(y | x1, x2, ..., xn) ∝ P(y) ∏_{i=1}^{n} P(xi | y)

[Figure: topology of NB - the Class node is the sole parent of every feature node X1, ..., X5]

Bayesian Network Augmented Naïve Bayes:

P(y | x1, x2, ..., xn) ∝ P(y) ∏_{i=1}^{n} P(xi | Pa(xi), y)

[Figure: topology of BAN - each feature may additionally depend on its feature parents Pa(xi)]
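The NB posterior above can be sketched in a few lines. A toy Python implementation for binary GO features (illustrative only, not the authors' code; the Laplace-corrected estimate from a later slide is folded in):

```python
from collections import Counter, defaultdict

# Toy Naïve Bayes for binary features: P(y | x) ∝ P(y) * prod_i P(x_i | y).
def train_nb(X, y):
    classes = Counter(y)           # class counts, used for the prior P(y)
    counts = defaultdict(Counter)  # counts[c][i] = #instances of class c with x_i = 1
    for xs, c in zip(X, y):
        for i, v in enumerate(xs):
            counts[c][i] += v
    return classes, counts

def predict_nb(model, xs):
    classes, counts = model
    n = sum(classes.values())
    best, best_p = None, -1.0
    for c, nc in classes.items():
        p = nc / n
        for i, v in enumerate(xs):
            p1 = (counts[c][i] + 1) / (nc + 2)  # Laplace-corrected P(x_i = 1 | c)
            p *= p1 if v == 1 else 1 - p1
        if p > best_p:
            best, best_p = c, p
    return best

X = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]]
y = ["Pro", "Pro", "Anti", "Anti"]
print(predict_nb(train_nb(X, y), [1, 1, 0]))  # Pro
```

BAN would replace P(xi | y) with P(xi | Pa(xi), y), i.e. condition each feature on its learned feature parents as well as the class.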
GO Hierarchy-Based Feature Selection for NB (HNB)

GO Term Relevance Value Measurement (adapted from the formula proposed by Stanfill and Waltz, 1986):

Relevance(GO) = (P(Class = Pro | GO = Yes) − P(Class = Pro | GO = No))² + (P(Class = Anti | GO = Yes) − P(Class = Anti | GO = No))²

Laplace Correction:

P(y | xi) = (C(y | xi) + 1) / (C(xi) + Z)
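The relevance measure can be computed directly from class-conditional frequencies. A small Python sketch of the formula on this slide (data and names are illustrative):

```python
# Relevance(GO) from the slide, estimated from a list of (go_value, class) pairs.
def relevance(pairs):
    """pairs: list of (go_value in {0, 1}, class in {"Pro", "Anti"})."""
    def p(cls, go_val):
        # empirical P(Class = cls | GO = go_val)
        sel = [c for v, c in pairs if v == go_val]
        return sel.count(cls) / len(sel) if sel else 0.0
    return ((p("Pro", 1) - p("Pro", 0)) ** 2
            + (p("Anti", 1) - p("Anti", 0)) ** 2)

pairs = [(1, "Pro"), (1, "Pro"), (0, "Anti"), (0, "Pro")]
# P(Pro|Yes)=1.0, P(Pro|No)=0.5, P(Anti|Yes)=0.0, P(Anti|No)=0.5
print(relevance(pairs))  # 0.5
```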
Pseudocode of HNB (Lazy Learning) - Part 1

Algorithm 1 Hierarchy-Based Feature Selection for NB
 1: Initialize DAG with all GO terms in Dataset;
 2: Initialize Dataset;
 3: Initialize Dataset;
 4: for each GOi in DAG do
 5:   Initialize Ancestor(GOi) in DAG;
 6:   Initialize Descendant(GOi) in DAG;
 7:   Initialize Status(GOi) ← "Select";
 8:   Calculate Relevance(GOi) in Dataset;
 9: end for
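The Ancestor/Descendant initialization in lines 5-6 amounts to a transitive closure over the DAG. A hedged Python sketch (the DAG encoding and term names are illustrative assumptions):

```python
# Sketch: precompute Ancestor(GO_i) for a DAG given as a dict
# child -> set of direct parents; descendants follow by inverting the map.
def ancestors_of(dag, term, memo=None):
    memo = {} if memo is None else memo
    if term not in memo:
        result = set()
        for parent in dag.get(term, set()):
            # each ancestor set is the union of parents and their ancestors
            result |= {parent} | ancestors_of(dag, parent, memo)
        memo[term] = result
    return memo[term]

dag = {"C": {"B"}, "B": {"A"}, "D": {"B", "A"}}
print(sorted(ancestors_of(dag, "C")))  # ['A', 'B']
anc = {t: ancestors_of(dag, t) for t in ["A", "B", "C", "D"]}
desc = {t: {u for u in anc if t in anc[u]} for t in anc}  # invert for descendants
print(sorted(desc["A"]))  # ['B', 'C', 'D']
```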
Pseudocode of HNB (Lazy Learning) - Part 2

Algorithm 2 Hierarchy-Based Feature Selection for NB
 1: for each Instance ∈ Dataset do
 2:   Conduct feature selection based on the hierarchy structure;
 3:   Rebuild the testing instance using the selected GO terms;
 4:   Classify the rebuilt testing instance with Naïve Bayes;
 5:   Re-assign each GOi: Status(GOi) ← "Select";
 6: end for
Pseudocode of Hierarchy-Based Feature Selection

Algorithm 3 Hierarchy-Based Feature Selection
 1: for each GOi ∈ DAG do
 2:   if Value of GOi in Instance = 1 then
 3:     for each Aij ∈ Ancestor(GOi) do
 4:       if Relevance(Aij) ≤ Relevance(GOi) then
 5:         Status(Aij) ← "Remove";
 6:       end if
 7:     end for
 8:   else
 9:     for each Dij ∈ Descendant(GOi) do
10:       if Relevance(Dij) ≤ Relevance(GOi) then
11:         Status(Dij) ← "Remove";
12:       end if
13:     end for
14:   end if
15: end for
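Algorithm 3 can be sketched in Python as follows (a minimal illustration with made-up relevance values, not the authors' implementation): for each term observed as "1", ancestors with no higher relevance are marked "Remove" (they carry redundant information by the hierarchy property); symmetrically for "0"-valued terms and their descendants.

```python
# Sketch of Algorithm 3: per-instance removal of hierarchically redundant terms.
# instance: dict GO term -> 0/1; relevance: dict GO term -> float;
# ancestors/descendants: dict GO term -> set of terms.
def select_terms(instance, relevance, ancestors, descendants):
    status = {t: "Select" for t in instance}
    for t, v in instance.items():
        # a "1" term makes its ancestors redundant; a "0" term its descendants
        targets = ancestors[t] if v == 1 else descendants[t]
        for u in targets:
            if relevance[u] <= relevance[t]:
                status[u] = "Remove"
    return {t for t, s in status.items() if s == "Select"}

ancestors = {"A": set(), "B": {"A"}, "C": {"A", "B"}}
descendants = {"A": {"B", "C"}, "B": {"C"}, "C": set()}
relevance = {"A": 0.2, "B": 0.5, "C": 0.3}
# B=1 removes its lower-relevance ancestor A; C=0 has no descendants to remove.
print(sorted(select_terms({"A": 1, "B": 1, "C": 0}, relevance, ancestors, descendants)))
# ['B', 'C']
```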
Example Feature Selection Process of HNB

[Figure: example DAG of GO terms A-R, each node labelled with its binary value in the current instance (0/1) and its relevance score (e.g. 0.23-0.44); shown step by step across six slides as redundant ancestors of "1"-valued terms and redundant descendants of "0"-valued terms are removed]
Experiment Dataset

Gene\GO  Gene_1          Gene_2          Gene_3          ...  Gene_n
GO_1     1               0               0               ...  1
GO_2     0               1               0               ...  0
GO_3     0               0               0               ...  1
GO_4     1               0               1               ...  0
...      ...             ...             ...             ...  ...
GO_n     0               1               1               ...  0
Class    Pro-Longevity   Anti-Longevity  Pro-Longevity   ...  Pro-Longevity
Experiment Dataset

Number of GO Terms in the Corresponding Datasets

Threshold (user-defined parameter) for filtering GO terms:  4    5    6    7    8    9    10
Number of GO terms left in dataset
(GO terms with frequency < threshold removed):              586  515  465  426  392  373  361
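The frequency-threshold filter described above is a one-liner in practice. A small sketch (the matrix encoding is an illustrative assumption):

```python
# Sketch: keep GO terms whose annotation frequency across genes meets a
# user-defined threshold; lower-frequency terms are filtered out.
def filter_terms(matrix, threshold):
    """matrix: dict GO term -> list of 0/1 annotation values across genes."""
    return {t: vals for t, vals in matrix.items() if sum(vals) >= threshold}

matrix = {"GO_1": [1, 0, 1, 1], "GO_2": [0, 1, 0, 0], "GO_3": [1, 1, 1, 0]}
print(sorted(filter_terms(matrix, 3)))  # ['GO_1', 'GO_3']
```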
Classification and Feature Selection Methods

Detailed Information about Classifier and Feature Selection Methods

Aliases  Name of Algorithm              Feature Selection Criteria                        Learning Approach
BAN      Bayesian Network Augmented NB  All GO terms                                      Eager
NB       Naïve Bayes                    All GO terms                                      Eager
RNB      Naïve Bayes                    Relevance-based top-k GO terms                    Lazy
HNB−s    Naïve Bayes                    GO terms after redundant-attribute removal        Lazy
HNB      Naïve Bayes                    Top-k GO terms after redundant-attribute removal  Lazy

*Predictive performance is evaluated by 10-fold stratified cross-validation.
Experiment Results

Thr.  K    BAN           NB            RNB           HNB−s         HNB
           Acc.  S.×S.   Acc.  S.×S.   Acc.  S.×S.   Acc.  S.×S.   Acc.  S.×S.
T4    30   66.8  39.9    60.0  32.2    66.4  26.7    63.4  33.9    63.6  33.6
      40   67.0  40.7    62.5  35.8    63.8  26.1    66.0  37.7    66.4  35.5
      50   65.5  39.3    62.1  35.4    64.2  31.7    63.4  35.2    68.1  37.4
T5    30   66.4  39.5    60.8  33.3    63.0  27.5    63.6  35.3    63.0  34.1
      40   65.1  38.4    61.7  34.7    64.9  35.3    64.5  36.2    65.5  35.8
      50   67.7  41.2    62.5  35.9    64.9  34.5    64.2  36.5    65.3  36.7
T6    30   65.3  38.3    62.1  35.6    62.7  33.4    63.2  36.1    63.4  35.6
      40   64.2  36.9    58.0  31.3    62.5  32.8    60.8  32.3    63.6  34.6
      50   64.2  37.6    59.3  32.1    63.0  34.5    63.4  35.8    64.2  37.4
T7    30   66.3  40.0    59.9  33.0    62.2  32.1    62.9  34.5    63.9  34.6
      40   63.5  35.5    58.8  31.4    64.8  37.7    62.7  35.2    64.4  39.0
      50   64.8  36.3    59.2  31.1    63.3  30.8    62.0  35.1    66.1  35.4
T8    30   65.2  37.6    60.1  33.7    63.5  35.3    62.7  36.1    66.3  39.9
      40   63.3  35.6    58.8  31.6    63.5  35.5    60.7  32.2    63.1  36.4
      50   65.9  38.8    60.7  33.9    61.4  36.7    62.0  34.5    66.3  37.5
T9    30   65.7  38.9    59.4  33.0    62.4  37.9    59.7  32.2    63.5  36.4
      40   65.2  38.5    59.4  32.9    62.2  37.3    60.9  35.0    66.7  41.8
      50   65.9  38.8    59.7  32.2    65.5  39.7    60.3  32.1    64.4  36.4
T10   30   64.4  36.6    60.1  33.2    61.8  35.7    61.2  33.6    66.7  41.1
      40   64.6  37.1    58.4  31.6    65.5  40.7    59.4  32.5    63.9  36.3
      50   65.9  39.1    59.2  32.5    62.9  37.0    58.2  30.3    65.0  36.6

Wilcoxon's Signed-Rank Test
- Perf(BAN) > Perf(NB)
- Perf(HNB) > Perf(NB)
- Perf(RNB) > Perf(HNB−s)
- Perf(HNB) > Perf(HNB−s)
- Perf(HNB) > Perf(RNB)
- Perf(BAN) = Perf(HNB)

Comparison between Highest Values and Baseline Values

          Sensitivity  Specificity
Baseline  38.8%        61.2%
HNB       57.5%        72.6%
Most Relevant Ageing-Related GO Terms

Relevance(GO) = (P(Class = Pro | GO = Yes) − P(Class = Pro | GO = No))² + (P(Class = Anti | GO = Yes) − P(Class = Anti | GO = No))²

Ranking of Ageing-Related GO Terms

Order  ID          Name                                     Value
1      GO:0009314  response to radiation                    0.59
2      GO:0031667  response to nutrient levels              0.52
3      GO:0009991  response to extracellular stimulus       0.52
4      GO:0044262  cellular carbohydrate metabolic process  0.52
5      GO:0042127  regulation of cell proliferation         0.41
6      GO:0051726  regulation of cell cycle                 0.36
7      GO:0048598  embryonic morphogenesis                  0.33
8      GO:0018193  peptidyl-amino acid modification         0.32
9      GO:0006952  defense response                         0.32
10     GO:0032880  regulation of protein localization       0.32
Conclusion & Future Work

Conclusion
- Hierarchical information in the Gene Ontology is valuable for selecting features for predicting the effects of ageing-related genes on longevity;
- Removing redundant terms from the Gene Ontology hierarchy enhances the performance of the Naïve Bayes classifier;
- The proposed attribute (GO term) relevance measure is helpful for ranking ageing-related GO terms according to their relevance for predicting longevity.

Future Work
- Develop new feature selection approaches for redundancy removal and GO hierarchy information representation.
References

1. C. Wan and A. A. Freitas, "Prediction of the Pro-longevity or Anti-longevity Effect of Caenorhabditis elegans Genes Based on Bayesian Classification Methods," in Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine, 2013, pp. 373-380.
2. J. P. de Magalhaes, A. Budovsky, G. Lehmann, J. Costa, Y. Li, V. Fraifeld, and G. M. Church, "The Human Ageing Genomic Resources: online databases and tools for biogerontologists," Aging Cell, vol. 8, no. 1, pp. 65-72, Feb. 2009.
3. The Gene Ontology Consortium, "Gene Ontology: tool for the unification of biology," Nature Genetics, vol. 25, no. 1, pp. 25-29, May 2000.
Acknowledgements
University of Kent 50th Anniversary Research Scholarships
Dr. João Pedro de Magalhães, Principal Investigator of Integrative Genomics of Ageing Group, University of Liverpool