Gene Ontology Hierarchy-Based Feature Selection

Cen Wan
Alex A. Freitas
FEAST 2014
Classification Task in Data Mining

"Classification task builds a model or classifier for predicting the class of an instance, based on its attributes (features)." - Han et al., 2012

White-Box Classifiers
- Decision Tree
- Bayesian Classifiers
- K-Nearest Neighbours

Black-Box Classifiers
- Neural Networks
- Support Vector Machines

Selected Classifier: Bayesian Classifiers
Feature Selection in Data Mining

"Feature selection is a data pre-processing step for filtering out redundant or irrelevant features before classification." - Liu & Motoda, 1998

Hierarchical feature selection selects a subset of features by exploiting the pre-defined hierarchical information retained in the data.
Hierarchy Structure

[Figure: example of a hierarchy structure with multiple paths; each node is a term labelled with a binary value (0/1)]
Hierarchy Structure

[Figure: example hierarchy with binary-valued nodes]

Property of the Hierarchy Structure for GO
- If the value of a GO term equals "1", then all of its ancestor GO terms' values equal "1";
- If the value of a GO term equals "0", then all of its descendant GO terms' values equal "0".
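The two properties above can be checked mechanically. A minimal Python sketch (illustrative names, not the talk's code) that verifies a gene's annotation vector is consistent with the hierarchy:

```python
# Sketch (illustrative names): verify the GO-hierarchy property that
# every ancestor of an annotated ("1") term is itself annotated.
# values: dict GO term -> 0/1; ancestors: dict GO term -> set of ancestor terms.
def is_consistent(values, ancestors):
    for term, v in values.items():
        # a "1" term with any "0" ancestor violates the property
        if v == 1 and any(values.get(a, 0) == 0 for a in ancestors[term]):
            return False
    return True

ancestors = {"A": set(), "B": {"A"}, "C": {"A", "B"}}
print(is_consistent({"A": 1, "B": 1, "C": 1}, ancestors))  # True
print(is_consistent({"A": 0, "B": 1, "C": 0}, ancestors))  # False: B=1 but ancestor A=0
```

Note that the second property (a "0" term has all-"0" descendants) is the logical contrapositive of the first, so one check covers both.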
Related Works on Hierarchy-Based Feature Selection

Least Absolute Shrinkage and Selection Operator (LASSO)
- P. Zhao, G. Rocha, and B. Yu, "The composite absolute penalties family for grouped and hierarchical variable selection," The Annals of Statistics;
- R. Jenatton, J. Y. Audibert, and F. Bach, "Structured variable selection with sparsity-inducing norms," Journal of Machine Learning Research;
- J. Ye and J. Liu, "Sparse methods for biomedical data," ACM SIGKDD Explorations Newsletter;
- A. F. T. Martins, N. A. Smith, P. M. Q. Aguiar, and M. A. T. Figueiredo, "Structured sparsity in structured prediction," in Proc. of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP 2011).
The Gene Ontology (GO)

"The Gene Ontology project aims to provide dynamic, structured, unified/controlled vocabularies for the annotation of genes." - Gene Ontology Consortium, 2004
Hierarchy Structure in Gene Ontology

[Figure: GO hierarchy example from Gharib et al., 2011; visualized by AmiGO (Carbon et al., 2009)]
Naïve Bayes (NB) and Bayesian Network Augmented Naïve Bayes (BAN)

Naïve Bayes:

P(y | x1, x2, ..., xn) ∝ P(y) ∏_{i=1}^{n} P(xi | y)

[Figure: topology of NB - the Class node is the sole parent of every feature node X1, ..., X5]

Bayesian Network Augmented Naïve Bayes:

P(y | x1, x2, ..., xn) ∝ P(y) ∏_{i=1}^{n} P(xi | Pa(xi), y)

[Figure: topology of BAN - each feature may additionally depend on its feature parents Pa(xi)]
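The NB posterior above can be sketched in a few lines. A toy Python implementation for binary GO features (illustrative only, not the authors' code; the Laplace-corrected estimate from a later slide is folded in):

```python
from collections import Counter, defaultdict

# Toy Naïve Bayes for binary features: P(y | x) ∝ P(y) * prod_i P(x_i | y).
def train_nb(X, y):
    classes = Counter(y)           # class counts, used for the prior P(y)
    counts = defaultdict(Counter)  # counts[c][i] = #instances of class c with x_i = 1
    for xs, c in zip(X, y):
        for i, v in enumerate(xs):
            counts[c][i] += v
    return classes, counts

def predict_nb(model, xs):
    classes, counts = model
    n = sum(classes.values())
    best, best_p = None, -1.0
    for c, nc in classes.items():
        p = nc / n
        for i, v in enumerate(xs):
            p1 = (counts[c][i] + 1) / (nc + 2)  # Laplace-corrected P(x_i = 1 | c)
            p *= p1 if v == 1 else 1 - p1
        if p > best_p:
            best, best_p = c, p
    return best

X = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]]
y = ["Pro", "Pro", "Anti", "Anti"]
print(predict_nb(train_nb(X, y), [1, 1, 0]))  # Pro
```

BAN would replace P(xi | y) with P(xi | Pa(xi), y), i.e. condition each feature on its learned feature parents as well as the class.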
GO Hierarchy-Based Feature Selection for NB (HNB)

GO Term Relevance Value Measurement (adapted from the formula proposed by Stanfill and Waltz, 1986):

Relevance(GO) = (P(Class = Pro | GO = Yes) − P(Class = Pro | GO = No))² + (P(Class = Anti | GO = Yes) − P(Class = Anti | GO = No))²

Laplace Correction:

P(y | xi) = (C(y | xi) + 1) / (C(xi) + Z)
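The relevance measure can be computed directly from class-conditional frequencies. A small Python sketch of the formula on this slide (data and names are illustrative):

```python
# Relevance(GO) from the slide, estimated from a list of (go_value, class) pairs.
def relevance(pairs):
    """pairs: list of (go_value in {0, 1}, class in {"Pro", "Anti"})."""
    def p(cls, go_val):
        # empirical P(Class = cls | GO = go_val)
        sel = [c for v, c in pairs if v == go_val]
        return sel.count(cls) / len(sel) if sel else 0.0
    return ((p("Pro", 1) - p("Pro", 0)) ** 2
            + (p("Anti", 1) - p("Anti", 0)) ** 2)

pairs = [(1, "Pro"), (1, "Pro"), (0, "Anti"), (0, "Pro")]
# P(Pro|Yes)=1.0, P(Pro|No)=0.5, P(Anti|Yes)=0.0, P(Anti|No)=0.5
print(relevance(pairs))  # 0.5
```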
Pseudocode of HNB (Lazy Learning) - Part 1

Algorithm 1 Hierarchy-Based Feature Selection for NB
 1: Initialize DAG with all GO terms in Dataset;
 2: Initialize Dataset;
 3: Initialize Dataset;
 4: for each GOi in DAG do
 5:   Initialize Ancestor(GOi) in DAG;
 6:   Initialize Descendant(GOi) in DAG;
 7:   Initialize Status(GOi) ← "Select";
 8:   Calculate Relevance(GOi) in Dataset;
 9: end for
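The Ancestor/Descendant initialization in lines 5-6 amounts to a transitive closure over the DAG. A hedged Python sketch (the DAG encoding and term names are illustrative assumptions):

```python
# Sketch: precompute Ancestor(GO_i) for a DAG given as a dict
# child -> set of direct parents; descendants follow by inverting the map.
def ancestors_of(dag, term, memo=None):
    memo = {} if memo is None else memo
    if term not in memo:
        result = set()
        for parent in dag.get(term, set()):
            # each ancestor set is the union of parents and their ancestors
            result |= {parent} | ancestors_of(dag, parent, memo)
        memo[term] = result
    return memo[term]

dag = {"C": {"B"}, "B": {"A"}, "D": {"B", "A"}}
print(sorted(ancestors_of(dag, "C")))  # ['A', 'B']
anc = {t: ancestors_of(dag, t) for t in ["A", "B", "C", "D"]}
desc = {t: {u for u in anc if t in anc[u]} for t in anc}  # invert for descendants
print(sorted(desc["A"]))  # ['B', 'C', 'D']
```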
Pseudocode of HNB (Lazy Learning) - Part 2

Algorithm 2 Hierarchy-Based Feature Selection for NB
 1: for each Instance ∈ Dataset do
 2:   Conduct feature selection based on the hierarchy structure;
 3:   Rebuild the testing instance using the selected GO terms;
 4:   Classify the rebuilt testing instance with Naïve Bayes;
 5:   Re-assign each GOi: Status(GOi) ← "Select";
 6: end for
Pseudocode of Hierarchy-Based Feature Selection

Algorithm 3 Hierarchy-Based Feature Selection
 1: for each GOi ∈ DAG do
 2:   if Value of GOi in Instance = 1 then
 3:     for each Aij ∈ Ancestor(GOi) do
 4:       if Relevance(Aij) ≤ Relevance(GOi) then
 5:         Status(Aij) ← "Remove";
 6:       end if
 7:     end for
 8:   else
 9:     for each Dij ∈ Descendant(GOi) do
10:       if Relevance(Dij) ≤ Relevance(GOi) then
11:         Status(Dij) ← "Remove";
12:       end if
13:     end for
14:   end if
15: end for
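Algorithm 3 can be sketched in Python as follows (a minimal illustration with made-up relevance values, not the authors' implementation): for each term observed as "1", ancestors with no higher relevance are marked "Remove" (they carry redundant information by the hierarchy property); symmetrically for "0"-valued terms and their descendants.

```python
# Sketch of Algorithm 3: per-instance removal of hierarchically redundant terms.
# instance: dict GO term -> 0/1; relevance: dict GO term -> float;
# ancestors/descendants: dict GO term -> set of terms.
def select_terms(instance, relevance, ancestors, descendants):
    status = {t: "Select" for t in instance}
    for t, v in instance.items():
        # a "1" term makes its ancestors redundant; a "0" term its descendants
        targets = ancestors[t] if v == 1 else descendants[t]
        for u in targets:
            if relevance[u] <= relevance[t]:
                status[u] = "Remove"
    return {t for t, s in status.items() if s == "Select"}

ancestors = {"A": set(), "B": {"A"}, "C": {"A", "B"}}
descendants = {"A": {"B", "C"}, "B": {"C"}, "C": set()}
relevance = {"A": 0.2, "B": 0.5, "C": 0.3}
# B=1 removes its lower-relevance ancestor A; C=0 has no descendants to remove.
print(sorted(select_terms({"A": 1, "B": 1, "C": 0}, relevance, ancestors, descendants)))
# ['B', 'C']
```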
Example Feature Selection Process of HNB

[Figure: example DAG of GO terms A-R, each node labelled with its binary value in the current instance (0/1) and its relevance score (e.g. 0.23-0.44); shown step by step across six slides as redundant ancestors of "1"-valued terms and redundant descendants of "0"-valued terms are removed]
Experiment Dataset

Gene\GO  Gene_1          Gene_2          Gene_3          ...  Gene_n
GO_1     1               0               0               ...  1
GO_2     0               1               0               ...  0
GO_3     0               0               0               ...  1
GO_4     1               0               1               ...  0
...      ...             ...             ...             ...  ...
GO_n     0               1               1               ...  0
Class    Pro-Longevity   Anti-Longevity  Pro-Longevity   ...  Pro-Longevity
Experiment Dataset

Number of GO Terms in the Corresponding Datasets

Threshold (user-defined parameter) for filtering GO terms:  4    5    6    7    8    9    10
Number of GO terms left in dataset
(GO terms with frequency < threshold removed):              586  515  465  426  392  373  361
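The frequency-threshold filter described above is a one-liner in practice. A small sketch (the matrix encoding is an illustrative assumption):

```python
# Sketch: keep GO terms whose annotation frequency across genes meets a
# user-defined threshold; lower-frequency terms are filtered out.
def filter_terms(matrix, threshold):
    """matrix: dict GO term -> list of 0/1 annotation values across genes."""
    return {t: vals for t, vals in matrix.items() if sum(vals) >= threshold}

matrix = {"GO_1": [1, 0, 1, 1], "GO_2": [0, 1, 0, 0], "GO_3": [1, 1, 1, 0]}
print(sorted(filter_terms(matrix, 3)))  # ['GO_1', 'GO_3']
```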
Classification and Feature Selection Methods

Detailed Information about Classifier and Feature Selection Methods

Aliases  Name of Algorithm              Feature Selection Criteria                        Learning Approach
BAN      Bayesian Network Augmented NB  All GO terms                                      Eager
NB       Naïve Bayes                    All GO terms                                      Eager
RNB      Naïve Bayes                    Relevance-based top-k GO terms                    Lazy
HNB−s    Naïve Bayes                    GO terms after redundant-attribute removal        Lazy
HNB      Naïve Bayes                    Top-k GO terms after redundant-attribute removal  Lazy

*Predictive performance is evaluated by 10-fold stratified cross-validation.
Experiment Results

Thr.  K    BAN           NB            RNB           HNB−s         HNB
           Acc.  S.×S.   Acc.  S.×S.   Acc.  S.×S.   Acc.  S.×S.   Acc.  S.×S.
T4    30   66.8  39.9    60.0  32.2    66.4  26.7    63.4  33.9    63.6  33.6
      40   67.0  40.7    62.5  35.8    63.8  26.1    66.0  37.7    66.4  35.5
      50   65.5  39.3    62.1  35.4    64.2  31.7    63.4  35.2    68.1  37.4
T5    30   66.4  39.5    60.8  33.3    63.0  27.5    63.6  35.3    63.0  34.1
      40   65.1  38.4    61.7  34.7    64.9  35.3    64.5  36.2    65.5  35.8
      50   67.7  41.2    62.5  35.9    64.9  34.5    64.2  36.5    65.3  36.7
T6    30   65.3  38.3    62.1  35.6    62.7  33.4    63.2  36.1    63.4  35.6
      40   64.2  36.9    58.0  31.3    62.5  32.8    60.8  32.3    63.6  34.6
      50   64.2  37.6    59.3  32.1    63.0  34.5    63.4  35.8    64.2  37.4
T7    30   66.3  40.0    59.9  33.0    62.2  32.1    62.9  34.5    63.9  34.6
      40   63.5  35.5    58.8  31.4    64.8  37.7    62.7  35.2    64.4  39.0
      50   64.8  36.3    59.2  31.1    63.3  30.8    62.0  35.1    66.1  35.4
T8    30   65.2  37.6    60.1  33.7    63.5  35.3    62.7  36.1    66.3  39.9
      40   63.3  35.6    58.8  31.6    63.5  35.5    60.7  32.2    63.1  36.4
      50   65.9  38.8    60.7  33.9    61.4  36.7    62.0  34.5    66.3  37.5
T9    30   65.7  38.9    59.4  33.0    62.4  37.9    59.7  32.2    63.5  36.4
      40   65.2  38.5    59.4  32.9    62.2  37.3    60.9  35.0    66.7  41.8
      50   65.9  38.8    59.7  32.2    65.5  39.7    60.3  32.1    64.4  36.4
T10   30   64.4  36.6    60.1  33.2    61.8  35.7    61.2  33.6    66.7  41.1
      40   64.6  37.1    58.4  31.6    65.5  40.7    59.4  32.5    63.9  36.3
      50   65.9  39.1    59.2  32.5    62.9  37.0    58.2  30.3    65.0  36.6

Wilcoxon's Signed-Rank Test
- Perf(BAN) > Perf(NB)
- Perf(HNB) > Perf(NB)
- Perf(RNB) > Perf(HNB−s)
- Perf(HNB) > Perf(HNB−s)
- Perf(HNB) > Perf(RNB)
- Perf(BAN) = Perf(HNB)

Comparison between Highest Values and Baseline Values

          Sensitivity  Specificity
Baseline  38.8%        61.2%
HNB       57.5%        72.6%
Most Relevant Ageing-Related GO Terms

Relevance(GO) = (P(Class = Pro | GO = Yes) − P(Class = Pro | GO = No))² + (P(Class = Anti | GO = Yes) − P(Class = Anti | GO = No))²

Ranking of Ageing-Related GO Terms

Order  ID          Name                                     Value
1      GO:0009314  response to radiation                    0.59
2      GO:0031667  response to nutrient levels              0.52
3      GO:0009991  response to extracellular stimulus       0.52
4      GO:0044262  cellular carbohydrate metabolic process  0.52
5      GO:0042127  regulation of cell proliferation         0.41
6      GO:0051726  regulation of cell cycle                 0.36
7      GO:0048598  embryonic morphogenesis                  0.33
8      GO:0018193  peptidyl-amino acid modification         0.32
9      GO:0006952  defense response                         0.32
10     GO:0032880  regulation of protein localization       0.32
Conclusion & Future Work

Conclusion
- Hierarchical information in the Gene Ontology is valuable for selecting features for predicting the effects of ageing-related genes on longevity;
- Removing redundant terms from the Gene Ontology hierarchy enhances the performance of the Naïve Bayes classifier;
- The proposed attribute (GO term) relevance measure is helpful for ranking ageing-related GO terms according to their relevance for predicting longevity.

Future Work
- Develop new feature selection approaches for redundancy removal and GO hierarchy information representation.
References

1. C. Wan and A. A. Freitas, "Prediction of the Pro-longevity or Anti-longevity Effect of Caenorhabditis elegans Genes Based on Bayesian Classification Methods," in Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine, 2013, pp. 373-380.
2. J. P. de Magalhaes, A. Budovsky, G. Lehmann, J. Costa, Y. Li, V. Fraifeld, and G. M. Church, "The Human Ageing Genomic Resources: online databases and tools for biogerontologists," Aging Cell, vol. 8, no. 1, pp. 65-72, Feb. 2009.
3. The Gene Ontology Consortium, "Gene Ontology: tool for the unification of biology," Nature Genetics, vol. 25, no. 1, pp. 25-29, May 2000.
Acknowledgements
University of Kent 50th Anniversary Research Scholarships
Dr. João Pedro de Magalhães, Principal Investigator of Integrative Genomics of Ageing Group, University of Liverpool