
Pattern Anal Applic (2011) 14:1–8 DOI 10.1007/s10044-010-0180-z

THEORETICAL ADVANCES

A new hybrid method for gene selection

Ruichu Cai • Zhifeng Hao • Xiaowei Yang • Han Huang

Received: 15 July 2008 / Accepted: 25 February 2010 / Published online: 15 July 2010
© Springer-Verlag London Limited 2010

Abstract Gene selection is an important preprocessing step in the discriminant analysis of microarray data. Classical gene selection methods fall into three categories: filters, wrappers and embedded methods. In this paper, a novel hybrid gene selection method (HGSM) is proposed that exploits both the mutual information criterion (a filter) and the leave-one-out-error criterion (a wrapper) within the framework of an improved ant algorithm. Extensive experiments on three benchmark datasets confirm the effectiveness and efficiency of HGSM.

Keywords Microarray data • Gene selection • Hybrid method • Ant algorithm

R. Cai (✉) • Z. Hao
Faculty of Computer Science, Guangdong University of Technology, Guangzhou 510006, People’s Republic of China
e-mail: [email protected]

X. Yang
College of Mathematics Science, South China University of Technology, Guangzhou 510640, People’s Republic of China

H. Huang
School of Software Engineering, South China University of Technology, Guangzhou 510006, People’s Republic of China

H. Huang
Department of Management Sciences, College of Business, City University of Hong Kong, Hong Kong, People’s Republic of China

1 Introduction

DNA microarrays can simultaneously measure the expression levels of thousands of genes in a single experiment, which makes them well suited to comparing gene expression levels in tissues under different conditions, for example, healthy versus diseased [1]. Discriminant analysis of microarray data has been widely used to assist diagnosis [1, 2]. Gene selection is a necessary step before the discriminant analysis of microarray data, because many genes in the original gene set are irrelevant or redundant for a specific discriminant problem. From the point of view of discriminant analysis, gene selection can increase the generalization ability of classifiers and reduce the computational complexity of the learning procedure. For biologists, gene selection provides more compact gene sets, which reduce diagnosis costs and facilitate the understanding of related gene functions.

In existing research, genes are usually treated as features and the gene selection problem is considered as a feature selection problem. Generally speaking, feature selection methods can be classified into three categories: filters, wrappers and embedded methods [3]. Filters employ only intrinsic properties of a feature, without considering its interaction with the classifier. In a wrapper method, by contrast, a classifier is built and employed as the evaluation criterion. If the feature selection criterion is derived from the intrinsic properties of a classifier, the corresponding method belongs to the embedded category.

Pearson’s correlation coefficient [4, 5], Fisher’s ratio [4] and mutual information [6] are classical filter methods. Recently, the correlation coefficient has been combined with the false discovery rate to pick out differentially expressed genes [7, 8]. Although filter algorithms are computationally efficient, most of them generate less compact feature sets than the wrappers and the embedded methods. Thus the filters are mostly used as a preprocessing step in feature selection.

The wrappers are widely used gene selection algorithms. A typical wrapper method contains two components: the search scheme and the evaluation procedure. Sequential forward selection (SFS) [9], sequential floating forward selection (SFFS) [9], particle swarm optimization (PSO) [10] and genetic algorithms (GA) [11] are typical search methods. LS-Bound [9] and LOOC [8] are two feature selection criteria based on the support vector machine (SVM), which is a commonly used classifier. By integrating these two criteria with the SFS scheme, LS-Bound SFS [9] and LOOC-SFS [8] become competitive gene selection methods.

The SVM Recursive Feature Elimination (SVM-RFE) algorithm is a typical embedded method [12]. In SVM-RFE, the features are eliminated recursively according to a criterion derived from the SVM. SVM-RFE is often considered one of the best gene selection algorithms [9], but it is computationally costly. Many methods have been proposed to alleviate this problem by eliminating chunks of features at a time, such as Furlanello’s entropy-based SVM-RFE [13] and Ding’s simulated annealing-based SVM-RFE [14]. The recently proposed ant algorithm-based gene ranking method, ACA, can also be regarded as an embedded method [15]. In ACA, the genes are ranked according to the pheromone, which is updated according to the classifier’s accuracy.

We found that the filters are complementary to the wrappers and the embedded methods: the filters are computationally efficient, but the generated feature set is not as compact as that of the wrappers or the embedded methods. A proper framework for combining these methods would therefore be a meaningful step towards improving the efficiency of existing gene selection methods. A novel hybrid gene selection method (HGSM) is proposed accordingly.

The rest of this paper is organized as follows. Section 2 gives a brief introduction to the gene selection methods used in this study. In Sect. 3, HGSM is proposed. In Sect. 4, HGSM is evaluated on three benchmark gene expression datasets. Conclusions and discussions are given in Sect. 5.

2 Preliminaries

2.1 Problem definition

$\{x_k, y_k\}$ $(k = 1, \ldots, m)$ is a gene expression data set with m samples, where $x_k$ is an n-dimensional vector that represents the kth sample’s expression profile, $y_k \in \{-1, 1\}$ is the label of the sample, and $x_{k,i}$ is the expression level of the ith gene in the kth sample. It is very common to have m = 100 and n = 10,000, so gene selection is an important preprocessing step in gene expression data analysis. In this work, we study the gene selection problem under the classification framework, which aims to select the gene subset S that is most relevant to the classifier.

2.2 Filters

The filters make use of a scoring function S(i) for gene i. It is assumed that a higher score indicates a more valuable variable. In a filter method, the genes are ranked according to the scoring function in descending order, and the top t genes are selected.

According to Pearson’s correlation coefficient [5], the discriminative power of each feature is evaluated independently as:

$$S(i) = \frac{\sum_{k=1}^{m} (x_{k,i} - u_i)(y_k - u_y)}{\sqrt{\sum_{k=1}^{m} (x_{k,i} - u_i)^2 \sum_{k=1}^{m} (y_k - u_y)^2}} \tag{1}$$

where $u_i$ and $u_y$ are the averages of gene i and label y, respectively. The method is intuitively appealing because a higher correlation coefficient means a closer relation between the corresponding feature and the label.

Fisher’s ratio is based on the Fisher linear discriminant [16], and the following criterion is used:

$$S(i) = \frac{(u_{1,i} - u_{-1,i})^2}{\sigma_{1,i}^2 + \sigma_{-1,i}^2} \tag{2}$$

where $u_{c,i}$ $(c \in \{1, -1\})$ is the average of gene i among the samples belonging to class c, and $\sigma_{c,i}^2$ $(c \in \{1, -1\})$ is the corresponding variance.

Mutual information is a quantity that measures the mutual dependence of two variables [5]. The scoring function of gene i is the empirical estimate of the mutual information between gene i and the label:

$$S(i) = \sum_{x_{\cdot,i}} \sum_{y} P(X = x_{\cdot,i}, Y = y) \log \frac{P(X = x_{\cdot,i}, Y = y)}{P(X = x_{\cdot,i}) P(Y = y)} \tag{3}$$

where $P(X = x_{\cdot,i})$ and $P(Y = y)$ are the probability density functions of $x_{\cdot,i}$ and y, respectively.

The filters are computationally efficient, but most of them generate less compact feature sets than the wrappers and the embedded methods. There are two main reasons. First, the selected features may be highly correlated, so many of them can be removed without affecting the performance of the classifier. Second, the scoring function does not take the combined effect of features into account: a feature with a low score may be important for the classifier when it is combined with other features.

2.3 Wrappers

The wrappers assess subsets of variables according to their usefulness to a given classifier. A typical wrapper method consists of two components: the search scheme and the evaluation procedure.

SFS and SFFS are two greedy search schemes used in existing gene selection methods. SFS starts from an empty set and iteratively adds features to the selected set. In each iteration, the feature that leads to the greatest improvement of the classifier is selected. SFS is computationally costly: for a dataset with D features (genes) and d selected features, SFS requires $(2D - d + 1)d/2$ evaluations. SFFS is similar to SFS, but a floating step is added, so SFFS requires even more evaluations.

The Least Squares Support Vector Machine (LS-SVM) is a commonly used classifier for evaluating the fitness of a feature set in the wrappers. LS-SVM separates one class from the other with a hyperplane, which is determined by solving the corresponding convex optimization problem. LOOC [8] is an approximate estimate of the leave-one-out error of LS-SVM, which can be calculated efficiently according to Eq. 4 without training the LS-SVM m - 1 times:

$$\mathrm{LOOC} = \frac{1}{2m} \left( m - \sum_{i=1}^{m} \operatorname{sign}\left( 1 - \frac{\alpha_i}{(H^{-1})_{ii}} \right) \right) \tag{4}$$

2.4 Embedded methods

The SVM Recursive Feature Elimination (SVM-RFE) algorithm is a typical embedded method [12]. In SVM-RFE, the features are eliminated recursively according to a criterion derived from the intrinsic properties of the SVM. The change of the objective function (Eq. 5) when one feature is removed is taken as the ranking criterion [12]:

$$\min \phi(w, b, e) = w^T w + C \sum_{i=1}^{m} e_i \tag{5}$$

When feature i is removed from the feature set, the change of the objective function $\phi$ is approximately $\Delta\phi(i) = w_i^2$, so $w_i^2$ is used as the feature ranking criterion in SVM-RFE.

3 A hybrid gene selection method (HGSM)

The difficulty of the wrapper method lies in the huge number of evaluations and the high computational complexity of each evaluation. There are two ways to alleviate this problem: cutting down the number of evaluations or reducing the computational complexity of each evaluation. LS-Bound and LOOC belong to the second type. In this study, a hybrid method is proposed to cut down the number of evaluations.

The hybrid method is inspired by the following observations. On the one hand, the search strategies of existing wrapper methods have low efficiency, because little information is used to guide the generation of new subsets. For example, SFS needs to try all possible choices before adding a gene to the selected set. On the other hand, a lot of prior knowledge can be used to guide the search, such as Fisher’s ratio and the mutual information criterion. The efficiency of the search procedure can therefore be improved by exploiting prior knowledge of the problem. Thus, an improved ant algorithm is devised as the search scheme, which generates new gene subsets by exploiting both prior knowledge of the problem and the knowledge accumulated in prior iterations.

3.1 The framework of HGSM

The ant algorithms [17] are inspired by colonies of real ants, which use heuristic information and pheromone (a chemical substance deposited along the way) to search for the shortest path between the nest and the food source. The ant algorithms have been successfully used to solve some permutation optimization problems, such as the TSP [17]. HGSM, however, aims to find an optimal subset for the classifier, which cannot be handled by the classical ant algorithm. In HGSM, an improved ant algorithm is proposed to solve the gene selection problem by modifying the encoding of the ant algorithm: the pheromone $\tau_{g_i}$ and the heuristic information $\eta_{g_i}$ are laid on the genes (illustrated in Fig. 1), while in the standard ant algorithm the pheromone and heuristic information are laid on the edges between the genes.

Fig. 1 The framework of generating a new subset

HGSM is an iterative random search algorithm. Each iteration consists of two main steps: solution construction and pheromone update. The artificial ants take the following steps to construct a new solution: first, the selected gene set S is set to be empty and the unselected set $\tilde{S}$ is set to be the universal set; then the ant generates a new gene subset by repeatedly selecting a gene from $\tilde{S}$ and adding it to S according to the pseudo-random-proportional rule. Once all ants have completed their subsets, the pheromone is updated according to the quality of the selected gene sets, which is evaluated according to the LOOC criterion. A simple framework of HGSM may look as follows:

Step 1. Global initialization: initialize the pheromone and the heuristic information.
Step 2. Solution construction. For each ant, do the following steps:
  Step 2.1. Set S to be the empty set, and $\tilde{S}$ to be the universal set.
  Step 2.2. Repeatedly apply the pseudo-random-proportional rule to select a gene from $\tilde{S}$ and add it to S until enough genes have been selected.
Step 3. Pheromone update.
  Step 3.1. Evaluate the fitness of the selected subsets according to LOOC.
  Step 3.2. Update the pheromone.
Step 4. Check the stop condition. Repeat Steps 2 and 3 until the stop condition is satisfied.

The selection of heuristic information, solution construction and pheromone update are the three most important aspects of HGSM, and they are discussed in Sects. 3.2, 3.3 and 3.4, respectively.

3.2 The heuristic information

Many types of prior knowledge can be used to guide the search, and an appropriate choice is very important to the efficiency of HGSM. In this study, four types of heuristic information are considered: the correlation criterion, Fisher’s ratio, the mutual information criterion and the SVM-based criterion.

Among these four types, Pearson’s correlation and mutual information are expected to be the most effective, because they are complementary to LOOC and can be combined with it to improve the efficiency of the selection method. Both measure the correlation between a feature and the label, which is entirely different from LOOC, whose key concept is the maximum-margin theory. Fisher’s ratio and the SVM-based criterion are not expected to improve the efficiency of the selection procedure significantly, because these two criteria are similar, rather than complementary, to LOOC. In the Fisher linear discriminant, the samples are projected onto a line and the goodness of the projection is based on the distance between the projected means of the two classes, which is very similar to the maximum-margin principle in LOOC. In the SVM-based criterion, $w_i^2$ is the ith feature’s contribution to the margin, which is also very similar to LOOC. These expectations are supported by the experiments in Sect. 4.1.

3.3 Solution construction

The following steps are taken by each ant to generate a new gene set S: first, S is set to be empty and $\tilde{S}$ is set to be the universal set; then genes in $\tilde{S}$ are randomly selected and added to S. The probability of selecting gene i is given by the pseudo-random-proportional rule:

$$p(i) = \begin{cases} \dfrac{\tau_i \eta_i^{\beta}}{\sum_{j \in \tilde{S}} \tau_j \eta_j^{\beta}} & \text{if } i \in \tilde{S} \\ 0 & \text{otherwise} \end{cases} \tag{6}$$

where $\tau_j$ and $\eta_j$ are the pheromone and the heuristic information laid on gene j, respectively, and $\beta$ is a parameter that controls the relative importance of the heuristic information (prior knowledge) and the pheromone (knowledge accumulated in prior iterations).

The pseudo-random-proportional rule (Eq. 6) is slightly different from that of standard ant algorithms [17]: in this study the pheromone is deposited on the genes, not on the edges between genes. The effect is similar, however: the ants favor the genes that have strong heuristic information and a greater amount of pheromone.

3.4 Pheromone update

Once all the ants have generated feasible gene sets, the quality of the selected gene sets is evaluated according to the LOOC criterion (Eq. 4). The ant that obtains the lowest LOOC value is allowed to deposit pheromone. The update procedure is as follows:

$$\tau(g_i) = \begin{cases} (1 - \rho)\,\tau(g_i) + \rho\,(1 - L_{best})^2 & \text{if } g_i \in \text{best subset} \\ (1 - \rho)\,\tau(g_i) + \rho\,\tau_0 & \text{if } g_i \notin \text{best subset} \end{cases} \tag{7}$$

where $L_{best}$ is the LOOC of the best gene set, $\rho$ is the pheromone decay parameter, and $\tau_0$ is a very low initial pheromone level. The pheromone update procedure tends to allocate a large amount of pheromone to the best gene subset, which provides a way to accumulate the knowledge discovered in previous iterations.

Unlike the standard ant algorithm, only the global pheromone update is applied and only the best ant is allowed to deposit pheromone in this study. This modification accelerates the convergence of the search procedure and reduces the computational cost.
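The solution construction of Sect. 3.3 and the pheromone update of Sect. 3.4 can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `evaluate` callback (standing in for the LOOC criterion) and the choice of depositing pheromone for the best ant of the current iteration are all hypothetical. The heuristic values `eta` are assumed to be strictly positive.

```python
import random


def construct_subset(tau, eta, beta, size, rng):
    """One ant builds a gene subset with the selection rule of Eq. (6)."""
    unselected = list(range(len(tau)))  # S~ starts as the universal set
    selected = []                       # S starts empty
    while len(selected) < size and unselected:
        # Selection weight of each remaining gene: tau_i * eta_i^beta
        weights = [tau[i] * eta[i] ** beta for i in unselected]
        gene = rng.choices(unselected, weights=weights, k=1)[0]
        unselected.remove(gene)
        selected.append(gene)
    return selected


def update_pheromone(tau, best_subset, l_best, rho, tau0):
    """Global update in the form of Eq. (7); modifies and returns tau."""
    best = set(best_subset)
    for i in range(len(tau)):
        deposit = (1.0 - l_best) ** 2 if i in best else tau0
        tau[i] = (1.0 - rho) * tau[i] + rho * deposit
    return tau


def hgsm_search(eta, evaluate, size, n_ants=10, n_iter=100,
                beta=2.0, rho=0.1, tau0=1e-4, seed=0):
    """Iterate construction and update; `evaluate` returns an error in [0, 1]."""
    rng = random.Random(seed)
    tau = [tau0] * len(eta)
    best_subset, best_score = None, float("inf")
    for _ in range(n_iter):
        subsets = [construct_subset(tau, eta, beta, size, rng)
                   for _ in range(n_ants)]
        scores = [evaluate(s) for s in subsets]
        k = min(range(n_ants), key=scores.__getitem__)
        # Assumption: the best ant of this iteration deposits pheromone.
        update_pheromone(tau, subsets[k], scores[k], rho, tau0)
        if scores[k] < best_score:
            best_subset, best_score = subsets[k], scores[k]
    return best_subset, best_score
```

The default arguments mirror the parameter values reported in Sect. 4 (10 ants, 100 iterations, β = 2, ρ = 0.1, τ0 = 0.0001). In practice `evaluate` would compute LOOC (Eq. 4) for the candidate subset and `eta` would hold the mutual information scores of Eq. 3.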

4 Experiments and results

In this section, all experiments are conducted on three public microarray datasets: Leukemia [1], Breast-LN [18] and Colon Cancer [19], which are preprocessed using the techniques described in [20]. After thresholding, filtering and logarithmic transformation, the microarray data are standardized to zero mean and unit variance across genes. In order to reduce the computational cost, the top 1,000 genes are pre-selected from each dataset according to Fisher’s ratio, and all the experiments (except the experiment that tests the computational time versus the size of the gene set) are based on the preprocessed and pre-selected datasets (Table 1).

Table 1 Basic information of the three microarray datasets

Dataset         Number of samples   Number of features
Leukemia        72                  7,129
Breast-LN       49                  7,129
Colon Cancer    62                  2,000

In the discriminant analysis of microarray data, the generalization ability is an important aspect of the selected gene set. In our study, we use the external B.632+ error to assess the performance of different gene selection algorithms. B.632+ is a cross-validation-like evaluation scheme, and it is considered to have lower variance than other cross-validation methods in the small-sample case [21]. In B.632+, balanced bootstrap samples are generated k times, and the samples not contained in the training set are added to the corresponding testing set. In order to reduce the variance of the algorithms’ performance, the bootstrap is repeated k = 200 times. Moreover, an SVM with a Gaussian kernel is used as the classifier in the experiments. We use libSVM version 2.70 [22] for the SVM classifier.

In the following experiments, the maximum number of selected genes is 50. However, the 50 genes are not the final result, because a subset of them may achieve the same or a satisfactory performance. In our opinion, the B.632+ error reflects the generalization performance of the selected gene subset. The final gene subset can be decided according to whether the B.632+ error approaches its minimum, or whether adding more genes results in an insignificant change. For example, on the Leukemia dataset (in Sect. 4.2), 23 genes are selected because HGSM obtains the lowest B.632+ error when 23 genes are selected.

The other parameters of HGSM are as follows: the number of ants is m = 10; the number of iterations is 100; the weight parameter is β = 2; the initial level of the pheromone is τ0 = 0.0001; and the pheromone decay rate is ρ = 0.1. The setting of these parameters is well studied in Dorigo’s work [17] and is not discussed here for brevity.

Three aspects of HGSM are studied in the following experiments: the selection of heuristic information, the effectiveness of HGSM and the efficiency of HGSM. All the experiments are implemented in the Visual C++ 6.0 environment and conducted on a PC with a 2.8 GHz P4 CPU and 512 MB RAM.

4.1 Comparison of different heuristic information

The proper selection of heuristic information is very important to the efficiency of HGSM. In this section, the performance of four types of heuristic information is experimentally studied on the Leukemia dataset: Pearson’s correlation, Fisher’s ratio, the mutual information criterion and the SVM-based criterion.

From Fig. 2 we can see that mutual information is the most effective heuristic information; correlation is slightly inferior to mutual information; and both Fisher’s ratio and the SVM-based criterion perform worse than mutual information and correlation. The result further confirms the analysis of Sect. 3.2: mutual information and LOOC are complementary to each other and can be combined to improve the effectiveness of the algorithm. Therefore, mutual information is chosen as the heuristic information of HGSM in all the following experiments.

Fig. 2 Comparison of the performance of the heuristic information (combined with HGSM). x-axis: number of selected genes; y-axis: external B.632+ error rate; curves: mutual information, correlation criterion, Fisher’s ratio, SVM-based criterion

4.2 The performance of HGSM

In order to check the performance (including effectiveness and efficiency) of the proposed method, HGSM is compared with five other prevailing gene selection algorithms: the mutual information based filter [6], LS-Bound SFS [9], SVM-RFE [12], PSO-SVM [10] and ACA [15]. Among them, mutual information is the heuristic information used in HGSM, LS-Bound SFS is one of the most efficient wrapper methods, SVM-RFE is a well-known embedded method, PSO-SVM is a typical hybrid method for the gene selection problem, and ACA is another ant algorithm-based gene ranking method.

4.2.1 The effectiveness

Figures 3, 4 and 5 illustrate the external B.632+ errors on the Leukemia, Breast-LN and Colon Cancer datasets, respectively. For the Leukemia dataset, HGSM is superior to the other five methods when the number of selected genes is larger than 3, and HGSM obtains the lowest B.632+ error when 23 genes are selected. For the Breast-LN dataset, HGSM is the best gene selection method when the number of selected genes is larger than 2. Eleven genes are selected for this dataset, because there is no significant reduction in the B.632+ error when more genes are selected. For the Colon Cancer dataset, HGSM is superior to the other five methods when the number of selected genes is larger than 3, and HGSM obtains the lowest B.632+ error when 13 genes are selected.

Fig. 4 The external B.632+ error for the Breast-LN dataset. x-axis: number of selected genes; y-axis: external B.632+ error rate; curves: HGSM, ACA, SVM-RFE, LS-Bound SFS, Mutual Information, PSO-SVM

As shown in the figures, HGSM generally obtains the lowest B.632+ error among the compared gene selection algorithms. When very few genes are to be selected, however, HGSM is slightly inferior to LS-Bound SFS or SVM-RFE, because both are greedy search methods, which can achieve good results when the number of selected genes is very small. As the number of selected genes increases, HGSM becomes the better choice.
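The rule used above for fixing the final number of genes (take the point where the B.632+ error reaches its minimum, or stop earlier if adding genes changes the error only insignificantly) can be sketched as below. This is an illustrative reading rather than the authors' procedure: the function name and the tolerance `tol`, which encodes what counts as an "insignificant change", are assumptions, and the error values in the usage note are made up.

```python
def choose_subset_size(errors, tol=0.0):
    """Return the smallest number of genes whose error is within `tol`
    of the minimum of the error curve.

    errors[k] is the estimated error when k + 1 genes are selected.
    With tol = 0 this returns the first size that attains the minimum.
    """
    best = min(errors)
    for k, err in enumerate(errors):
        if err <= best + tol:
            return k + 1
    raise ValueError("errors must be a non-empty sequence")
```

For a hypothetical error curve `[0.30, 0.12, 0.05, 0.049, 0.05]`, a tolerance of 0 picks four genes (the exact minimum), while a tolerance of 0.005 picks three genes, since the fourth gene improves the error only marginally.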

Fig. 3 The external B.632+ error for the Leukemia dataset. x-axis: number of selected genes; y-axis: external B.632+ error rate; curves: HGSM, ACA, SVM-RFE, LS-Bound SFS, Mutual Information, PSO-SVM

Fig. 5 The external B.632+ error for the Colon Cancer dataset. x-axis: number of selected genes; y-axis: external B.632+ error rate; curves: HGSM, ACA, SVM-RFE, LS-Bound SFS, Mutual Information, PSO-SVM

4.2.2 The computational complexity

The computational complexity is an important aspect of a gene selection method. We discuss it from two aspects: the computational time in terms of the number of selected genes, and the computational time in terms of the size of the whole gene set.

Figure 6 illustrates the runtime of the six algorithms for different numbers of selected genes. HGSM is generally the most efficient method when the mutual information filter is not considered. Although LS-Bound SFS costs less time than HGSM when the number of selected genes is smaller than 15, its computational cost increases much faster than that of HGSM as the number of selected genes increases.

Fig. 6 The relationship between the computational time and the number of selected genes on the Leukemia dataset. x-axis: number of selected genes; y-axis: natural logarithm of computational time (seconds); curves: HGSM, ACA, SVM-RFE, LS-Bound SFS, Mutual Information, PSO-SVM

Fig. 7 The relationship between the computational time and the size of the gene set on the Leukemia dataset. x-axis: size of gene set; y-axis: natural logarithm of computational time (seconds); curves: HGSM, ACA, SVM-RFE, LS-Bound SFS, Mutual Information, PSO-SVM

Table 2 The selected genes for the Leukemia dataset

Access no.   Selected times   Description
X95735       200              Zyxin
J03779       200              CD10
U88667       189              ABCA4
M92287       192              Cyclin D3 (CCND3)
M84526       200              Adipsin (D component of complement)
D26156       196              SMARCA4 (SNF2)
L15326       197              PTGS2 (COX2)
M31994       184              ALDH1
S57212       192              MEF2C
U40343       181              CDKN2D

Figure 7 illustrates the computational time of the algorithms in terms of the size of the gene set. As shown in this figure, the computational cost of HGSM depends only weakly on the size of the gene set, while the computational costs of the other five methods increase significantly as the size of the gene set increases. When the size of the gene set is larger than 1,000, HGSM is the most efficient algorithm (except for the mutual information filter). In the gene expression context, the size of the gene set is typically between 5,000 and 10,000. This shows that HGSM remains the most efficient method even for large gene sets, a property that is especially important for microarray data analysis.
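One way to see why the cost of HGSM depends only weakly on the size of the gene set is to count classifier evaluations. The sketch below uses the SFS count $(2D - d + 1)d/2$ from Sect. 2.3 and, as an assumption not stated explicitly in the text, one LOOC evaluation per ant per iteration for HGSM (Step 3.1), ignoring the one-time cost of computing the heuristic information. The function names are illustrative.

```python
def sfs_evaluations(D, d):
    """Evaluations needed by SFS to select d of D genes: (2D - d + 1) d / 2."""
    return (2 * D - d + 1) * d // 2


def hgsm_evaluations(n_ants, n_iterations):
    """Assumed HGSM count: one subset evaluation per ant per iteration."""
    return n_ants * n_iterations


# Selecting 50 genes, with the HGSM parameters reported in Sect. 4
for D in (1000, 7129):
    print(D, sfs_evaluations(D, 50), hgsm_evaluations(10, 100))
```

Under these assumptions SFS needs 48,775 evaluations for D = 1,000 and 355,225 for D = 7,129, whereas the HGSM count stays at 1,000 regardless of D. The counts say nothing about the cost of a single evaluation, so they indicate the trend only and are not a prediction of the runtimes in Figs. 6 and 7.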

4.3 Biological analysis of the selected genes

Another important criterion for evaluating a gene selection method is the biological relation between the selected genes and the disease. The function of the selected genes is given for the Leukemia dataset, which is the one of most concern to biologists. We use the number of times a gene is selected in the 200 repeated experiments to evaluate its importance, and the top 10 genes are listed in Table 2.

From Table 2 we can see that all of the 10 genes are selected very robustly: even the 10th gene, U40343, is selected 181 times in the 200 repeated experiments. Most of the 10 genes have been reported to be related to leukemia. X95735 (Zyxin) encodes a protein important for cell adhesion and is highly correlated with acute myelogenous leukemia (AML) [23]. J03779 (CD10) was identified as the common acute lymphoblastic leukemia antigen as early as 1989 [24]. U88667 (ABCA4) is recognized as being associated with drug resistance [25]. M92287 (Cyclin D3) encodes a protein important for controlling the physiological progression of the cell cycle; in ALL cells, M92287 (Cyclin D3) prevents glucocorticoid-induced cell cycle G1 arrest. M84526 (Adipsin) is associated with myeloid cell differentiation [26], and the myeloid cell is strongly related to leukemia. L15326 (PTGS2, COX2) is identified as one of the up-regulated genes related to leukemia [27]. S57212 (MEF2C) is activated by multiple mechanisms in a subset of T-acute lymphoblastic leukemia cell lines [28]. Although there is some research on the genes D26156, M31994 and U40343, the relationship between these genes and leukemia is still not clear. Important knowledge might be discovered if biologists pay more attention to these genes.

5 Conclusions

In this paper, a novel gene selection method, HGSM, is proposed by hybridizing LOOC and mutual information under the framework of an improved ant algorithm. LOOC and mutual information are complementary to each other, and HGSM benefits from both of them. In HGSM, the knowledge accumulated in previous iterations is also exploited in the form of pheromone, which further improves the efficiency of the algorithm. Experimental results show that HGSM performs better than SVM-RFE, LS-Bound SFS, PSO-SVM and GA-SVM, not only in effectiveness but also in efficiency. In particular, HGSM handles datasets with a large number of genes better, which is a valuable property in the context of gene expression analysis. Moreover, HGSM provides a new framework for the gene selection problem, into which new feature selection criteria can be easily integrated.

Although HGSM has shown good performance, much further work can be done, such as adaptive parameter setting, more theoretical analysis, and extension to the multi-class case.

Acknowledgments This work is partially supported by the National Natural Science Foundation of China (60873078), the Key Natural Science Foundation of Guangdong Province (9251009001000005, 9151600301000001), the Key Technology Research and Development Programs of Guangdong Province (2008B080701005, 2009B010800026), the Social Science Foundation of Guangdong Province (08O-01), the Open Foundation of the State Key Laboratory of Information Security (04-01), the Technology Research and Development Program of Huizhou (08-117), the Doctoral Program of the Ministry of Education (20090172120035), and the Fundamental Research Funds for the Central Universities, SCUT (2009ZM0052).

References

1. Golub TR et al (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439):531–537
2. Nutt CL et al (2003) Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. Cancer Res 63(7):1602–1607
3. Kohavi R, John GH (1997) Wrappers for feature subset selection. Artif Intell 97(1–2):273–324
4. Cherkassky VS, Mulier F (1998) Learning from data: concepts, theory, and methods. Wiley, New York
5. Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182
6. Chang CF, Wai KM, Patterton HG (2004) Calculating the statistical significance of physical clusters of co-regulated genes in the genome: the role of chromatin in domain-wide gene regulation. Nucleic Acids Res 32(5):1798–1807
7. Reiner A, Yekutieli D, Benjamini Y (2003) Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics 19(3):368–375
8. Tang EK, Suganthan PN, Yao X (2006) Gene selection algorithms for microarray data-based on least squares support vector machine. BMC Bioinforma 7:85
9. Zhou X, Mao KZ (2005) LS Bound based gene selection for DNA microarray data. Bioinformatics 21(8):1559–1564
10. Chuang LY et al (2008) Improved binary PSO for feature selection using gene expression data. Elsevier, pp 29–37
11. Ooi CH, Tan P (2003) Genetic algorithms applied to multi-class prediction for the analysis of gene expression data. Bioinformatics 19(1):37–44
12. Guyon I et al (2002) Gene selection for cancer classification using support vector machines. Mach Learn 46(1–3):389–422
13. Furlanello C et al (2003) Entropy-based gene ranking without selection bias for the predictive classification of microarray data. BMC Bioinforma 4:54
14. Ding YY, Wilkins D (2006) Improving the performance of SVM-RFE to select genes in microarray data. BMC Bioinforma 7(Suppl 2):S12
15. Robbins KR et al (2007) The ant colony algorithm for feature selection in high-dimension gene expression data for disease classification. Math Med Biol 24(4):413–426
16. Duda RO, Hart PE, Stork DG (2001) Pattern classification, 2nd edn. Wiley
17. Dorigo M, Gambardella LM (1997) Ant colony system: a cooperative learning approach to the traveling salesman problem. IEEE Trans Evol Comput 1(1):53–66
18. Allinen M et al (2004) Molecular characterization of the tumor microenvironment in breast cancer. Cancer Cell 6(1):17–32
19. Alon U et al (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci USA 96(12):6745–6750
20. Dudoit S, Fridlyand J, Speed TP (2002) Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc 97(457):77–87
21. Ambroise C, McLachlan GJ (2002) Selection bias in gene extraction on the basis of microarray gene-expression data. Proc Natl Acad Sci USA 99(10):6562–6566
22. http://www.csie.ntu.edu.tw/~cjlin/libsvm/. [cited
23. Kelly L, Clark J, Gilliland DG (2002) Comprehensive genotypic analysis of leukemia: clinical and therapeutic implications. Curr Opin Oncol 14(1):10–18
24. LeBien TW, McCormack RT (1989) The common acute lymphoblastic leukemia antigen (CD10)--emancipation from a functional enigma. Blood 73(3):625–635
25. Raaijmakers M (2007) ATP-binding-cassette transporters in hematopoietic stem cells and their utility as therapeutical targets in acute and chronic myeloid leukemia. Leukemia 21(10):2094–2102
26. Wong ETL et al (1999) Changes in chromatin organization at the neutrophil elastase locus associated with myeloid cell differentiation. Blood 94(11):3730
27. Secchiero P et al (2005) Potential pathogenetic implications of cyclooxygenase-2 overexpression in B chronic lymphoid leukemia cells. Am J Pathol 167(6):1599–1607
28. Debernardi S et al (2003) Genome-wide analysis of acute myeloid leukemia with normal karyotype reveals a unique pattern of homeobox gene expression distinct from those with translocation-mediated fusion events. Genes Chromosom Cancer 37(2):149–158
