Proceedings of International Joint Conference on Neural Networks, Orlando, Florida, USA, August 12-17, 2007

A Novel Gene Ranking Algorithm Based on Random Subspace Method

Ruichu Cai, Zhifeng Hao, and Wen Wen

Abstract— Gene selection is the task of selecting the most informative genes from the whole gene set. It is an important preprocessing procedure for the discriminant analysis of microarray data, because many of the genes are irrelevant or redundant to the discriminant problem. In this paper, the gene selection problem is treated as a gene ranking problem, and a Random Subspace Method based gene ranking (RSM-GR) algorithm is proposed. In RSM-GR, subsets of the genes are first generated at random; Support Vector Machines are then trained on each subset, producing an importance factor for each gene; finally, the importance values a gene obtains over these randomly selected subsets are combined into its final importance. Experiments on two public datasets show that RSM-GR selects gene sets that lead to more accurate classification results than other gene selection methods, while demanding less computational time. RSM-GR also scales better to datasets with a large number of genes and to large numbers of genes to be selected.

Manuscript received January 25, 2007. This work has been supported by the National Natural Science Foundation of China (10471045, 60433020), the Program for New Century Excellent Talents in University (NCET-05-0734), the Natural Science Foundation of Guangdong Province (031360, 04020079), the Excellent Young Teachers Program of the Ministry of Education of China, the Fok Ying Tong Education Foundation (91005), the Social Science Research Foundation of MOE (2005-241), the Key Technology Research and Development Program of Guangdong Province (2005B10101010, 2005B70101118), and the Open Research Fund of the Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education (93K-17-2006-03).

Ruichu Cai is with the College of Computer Science and Engineering, South China University of Technology, Guangzhou, 510640 China (phone: 86-013570393001; e-mail: cairuichu@163.com).
Zhifeng Hao is with the College of Computer Science and Engineering, South China University of Technology, Guangzhou, 510640 China (e-mail: mazfhao@scut.edu.cn).
Wen Wen is with the College of Computer Science and Engineering, South China University of Technology, Guangzhou, 510640 China (e-mail: mathww@126.com).

I. INTRODUCTION

DNA microarray is a technology that can simultaneously measure the expression levels of thousands of genes in a single experiment. It is commonly used for comparing gene expression levels in tissues under different conditions, for example, healthy versus diseased [1]. Recently, discriminant analysis of microarray data has been widely used to assist diagnosis [1], [2].

Gene selection is a necessary preprocessing step before the discriminant analysis of microarray data, for several reasons. Firstly, many genes in the original gene set are irrelevant, insignificant or redundant to a specific discriminant problem; the cost of clinical diagnosis can be dramatically reduced with gene selection, since it is


much cheaper to focus on a few gene expressions instead of the whole gene set. Secondly, if more redundant genes are included in the classifier, the generalization error will increase, because the generalization ability of a classifier partly depends on the ratio between the number of samples and the number of features. Thirdly, if redundant genes are removed from the classifier, the storage requirement and computational complexity can be reduced. Finally, gene selection provides a more compact gene set, which can help in understanding the functions of particular genes and in designing the diagnosis process.

In previous research, genes are usually treated as features, so the gene selection problem is fundamentally a feature selection problem. Generally, the methods can be classified into three categories: filter, wrapper and embedded methods [3]. A filter method employs intrinsic properties of a feature without considering its interaction with other features, so the selection procedure is independent of the classifier. In a wrapper method, a classifier is built and employed as the evaluation criterion. If the criterion is derived from the intrinsic properties of the classifier, the corresponding feature selection method is categorized as an embedded approach.

Fisher's ratio, the Mahalanobis distance [4] and the t-statistic [5] are three classical filter criteria. Recently, the false discovery rate has been used to pick differentially expressed genes in the gene selection problem [6], and several improvements to this approach have been proposed [7]. Filter algorithms are computationally efficient, but most of them generate a less compact feature set than wrapper and embedded methods, so filters are mostly used as a preprocessing step for the feature selection problem. Fisher's ratio is used to reduce the feature set in our study.

The wrapper method is widely used for the gene selection problem. A typical wrapper method contains two components: the search procedure and the evaluation criterion. Sequential forward selection (SFS) [8], sequential floating forward selection (SFFS) [8] and genetic algorithms [9] are typical search methods used in the context of the gene selection problem. The Support Vector Machine (SVM) is a commonly used classifier for the gene selection problem. LS-Bound [8] and LOOC [10] are two criteria based on the SVM; both are combined with SFS schemes to obtain competitive results, leading to the LS-Bound SFS [8] and LOOC-SFS [10] methods.

The SVM Recursive Feature Elimination (SVM-RFE) algorithm is a typical embedded method [11]. In SVM-RFE, features are eliminated recursively according to a criterion derived from the intrinsic properties of the SVM

classifier. SVM-RFE is often considered one of the best gene selection algorithms in the literature [8], but it is time consuming. Many methods have been proposed to alleviate this problem by eliminating chunks of features at a time, such as Furlanello's entropy-based SVM-RFE [12] and Ding's simulated-annealing-based SVM-RFE [13]. A hybrid between the Random Subspace Method (RSM) and SVM-RFE has also been proposed to improve the robustness and accuracy of the algorithm [14].

The rest of this paper is organized as follows. Section 2 gives a brief introduction to least squares support vector machines. In Section 3, after analyzing the characteristics of the gene selection problem, a novel gene ranking algorithm based on the Random Subspace Method (RSM-GR) is proposed. In Section 4, RSM-GR is evaluated on two well-known gene expression datasets; both the performance of the selected subsets and the efficiency of the algorithm are encouraging compared with two of the most widely used gene selection algorithms. Conclusions and discussions are given in Section 5.

II. LEAST SQUARES SUPPORT VECTOR MACHINES

The SVM was proposed by Vapnik et al. [15]. When used for classification, an SVM separates one class from the other with a

hyperplane. Determining the hyperplane, that is, the training procedure, involves solving a quadratic programming (QP) problem, which is rather complex. Suykens and Vandewalle [16] proposed the least squares support vector machine (LS-SVM) to simplify this problem: the LS-SVM solves a linear system rather than a QP problem to determine the hyperplane.

Consider a classification problem with m training sample pairs {x_i, y_i}, i = 1, ..., m, where x_i is an n-dimensional vector representing the i-th sample and y_i is the corresponding class label, which is either +1 or -1. The LS-SVM can be formulated as follows:

$$\min_{w,b,e} \; \Phi(w,b,e) = \frac{1}{2} w^T w + \frac{\gamma}{2} \sum_{i=1}^{m} e_i^2 \qquad (1)$$

subject to

$$y_i \left[ w^T \varphi(x_i) + b \right] = 1 - e_i, \quad i = 1, \ldots, m$$

where e = [e_1, e_2, ..., e_m]^T, e_i denotes the error for sample x_i, and γ is a given positive value that penalizes errors; the role of γ is to balance the generalization ability and the empirical error of the classifier. The solution to the optimization problem is given by the saddle point of the Lagrangian:

$$L(w,b,e;\alpha) = \Phi(w,b,e) - \sum_{i=1}^{m} \alpha_i \left\{ y_i \left[ w^T \varphi(x_i) + b \right] - 1 + e_i \right\} \qquad (2)$$

The saddle point of formula (2) can be obtained by solving the linear system (3):

$$\begin{bmatrix} 0 & Y^T \\ Y & \Omega + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ \mathbf{1} \end{bmatrix} \qquad (3)$$

where Y = [y_1, y_2, ..., y_m]^T, 1 is the all-ones vector, α = [α_1, ..., α_m]^T, Ω_{ij} = y_i y_j K(x_i, x_j), and K(x_i, x_j) = φ(x_i)^T φ(x_j) is the kernel function.
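As an illustration of how LS-SVM training reduces to a linear solve, the following Python/NumPy sketch assembles and solves system (3) for the linear kernel. This is our own minimal sketch, not code from the paper; the function name train_ls_svm and the parameter gamma (the penalty γ above) are our choices.

import numpy as np

def train_ls_svm(X, y, gamma=1.0):
    # Solve the LS-SVM KKT system (3) for a linear kernel.
    # X: (m, n) sample matrix; y: (m,) labels in {-1, +1}.
    m = X.shape[0]
    K = X @ X.T                        # linear kernel K(x_i, x_j) = x_i^T x_j
    Omega = np.outer(y, y) * K         # Omega_ij = y_i * y_j * K(x_i, x_j)
    A = np.zeros((m + 1, m + 1))       # [[0, Y^T], [Y, Omega + I/gamma]]
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(m) / gamma
    rhs = np.concatenate(([0.0], np.ones(m)))
    sol = np.linalg.solve(A, rhs)
    b, alpha = sol[0], sol[1:]
    w = X.T @ (alpha * y)              # w = sum_i alpha_i * y_i * x_i
    return w, b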

In microarray data analysis, the linear kernel K(x_i, x_j) = x_i^T x_j is most frequently used, because the data are "linearly separable" [1].

III. GENE RANKING ALGORITHM BASED ON RANDOM SUBSPACE METHOD

In our method, the gene selection problem is considered as a gene ranking problem: the genes are ranked according to their importance to the classifier, and the L most important genes are selected. The basic idea is to evaluate the importance of genes on randomly selected subsets, which we call the Random Subspace Method based gene ranking (RSM-GR) algorithm. In RSM-GR, the criterion used to evaluate a gene's importance is similar to that of SVM-RFE; the difference is that the final importance of each gene is obtained by combining its respective importance in each randomly selected gene subset.

In SVM-RFE, the change in the objective function when one feature is removed is used as the ranking criterion [11]. The objective function of the standard SVM is:

$$\min_{w,b,\xi} \; \Phi(w,b,\xi) = \frac{1}{2} w^T w + C \sum_{i=1}^{m} \xi_i \qquad (4)$$

Hence, when feature i is removed from the feature set, the change in the objective function is approximated by DΦ(i) = w_i^2, so w_i^2 is used as the feature ranking criterion in SVM-RFE. Ranking all the genes at once from a single trained SVM is seriously local-optimal, so the following Recursive Feature Elimination procedure is applied to alleviate this problem:
1) Train the standard SVM classifier (optimize the weights w with respect to Φ).
2) Compute the ranking criterion DΦ(i) = w_i^2 for all features.
3) Remove the feature with the smallest ranking criterion.
In SVM-RFE, the SVM classifier needs to be trained O(n) times, and moreover the SVM is trained on a very large feature set. This high computational complexity is the major disadvantage of SVM-RFE. Many methods have been proposed to mitigate it [12], [13], but the framework of SVM-RFE is unchanged, which means the computation of the ranking criterion is unchanged. In fact, the difficulty of SVM-RFE lies in the characteristics of gene expression datasets: a high-dimensional feature set and a small number of training samples. A ranking of genes obtained from such a dataset has low generalization ability [15], so the way SVM-RFE evaluates the ranking criterion on the full feature set at each step is rather risky. In our method, a divide-and-conquer strategy is introduced to lower this risk. The feature ranking criterion used in RSM-GR is evaluated as follows: each time, the importance of the genes is evaluated on a small randomly selected subset of genes rather than on the full set; then the importance values of the genes over the randomly selected subsets are combined to produce the final ranking criterion.
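For concreteness, here is a minimal Python sketch of the baseline SVM-RFE loop described above, eliminating one feature per iteration. It is our illustration, not the authors' code; train_svm stands for any routine returning the weight vector of a linear SVM (for example the LS-SVM sketch in Section II).

import numpy as np

def svm_rfe_ranking(X, y, train_svm):
    # Recursive feature elimination with the w_i^2 criterion [11].
    # Returns feature indices ordered from least to most important.
    remaining = list(range(X.shape[1]))
    eliminated = []
    while remaining:
        w, _ = train_svm(X[:, remaining], y)   # retrain on surviving features
        worst = int(np.argmin(w ** 2))         # smallest ranking criterion
        eliminated.append(remaining.pop(worst))
    return eliminated

Note that the classifier is retrained once per eliminated feature, which is exactly the O(n) training cost discussed above.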

RSM-GR can be roughly divided into two parts: (I) use the LS-SVM to evaluate the individual importance of genes on each randomly selected subset; (II) combine these importance values to get the final ranking criterion for the genes. The framework of RSM-GR can be described as follows.

The proposed algorithm:

/* S_i = {g_i1, g_i2, ..., g_ik}: the randomly generated gene subset */
/* w_ij: the importance of gene g_ij in S_i */
/* W_j: the combined importance of gene j */
/* error_i: the empirical error rate of the LS-SVM on S_i */
/* C_j: the number of times gene j was selected in the t subsets */

Step 1. Initialize: set C and W to n-dimensional zero vectors.
Step 2. for i = 1 : t
    Step 2.1 randomly generate a new gene subset S_i = {g_i1, g_i2, ..., g_ik}
    Step 2.2 update C: C_j = C_j + 1, for j ∈ {g_i1, g_i2, ..., g_ik}
    Step 2.3 train the LS-SVM on S_i, obtaining w_ij
    Step 2.4 evaluate the empirical error rate error_i
    Step 2.5 normalize:

$$w_{ij}^2 = w_{ij}^2 \Big/ \sum_{l=1}^{k} w_{il}^2, \quad (j = 1, \ldots, k) \qquad (5)$$

    Step 2.6 combine the importance:

$$W_{g_{ij}} = W_{g_{ij}} + w_{ij}^2 (1 - error_i), \quad (j = 1, \ldots, k) \qquad (6)$$

end for
Step 3. average W_j across C_j:

$$W_j = W_j / C_j, \quad (j = 1, \ldots, n) \qquad (7)$$

Step 4. rank the genes according to their weights W, and select the top L genes.

In the i-th iteration, a subset of genes S_i = {g_i1, g_i2, ..., g_ik} is first randomly sampled from the full gene set without replacement, where k is the size of the subset and g_ij (j = 1, 2, ..., k) is a gene selected in this iteration. Then a linear LS-SVM is trained on subset S_i, and w_ij is obtained, which represents the importance of gene g_ij in this subset. An internal normalization is then applied within each subset according to formula (5); this is necessary because only the relative importance within the subspace is meaningful. Next, the importance values of the genes in each subset are combined to constitute the ranking criterion for the whole gene set according to formula (6), where W_{g_ij} is the combined importance of the gene and error_i is the empirical error of the LS-SVM on the subset S_i; (1 - error_i) is used as the combining weight, because a low empirical error means that the genes in S_i have an important influence on the classifier. Additionally, the average weight of each gene over its appearances is computed according to formula (7) and constitutes the final ranking criterion, where W_j is the combined importance of gene j and C_j is the number of times gene j was selected over all subsets. Finally, the genes are sorted according to this criterion, and the top L genes are selected.
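The following Python sketch implements the steps above under stated assumptions: train_svm is a linear LS-SVM trainer returning (w, b), such as the sketch in Section II, and the empirical error of Step 2.4 is measured by the sign of the decision function. It is our minimal illustration, not the authors' implementation.

import numpy as np

def rsm_gr(X, y, train_svm, k=None, t=None, L=50, seed=None):
    # RSM-GR: rank genes by importance combined over random subspaces.
    rng = np.random.default_rng(seed)
    m, n = X.shape
    k = k or m                                   # paper's setting: k = m
    t = t or n                                   # paper's setting: t = n
    W = np.zeros(n)                              # combined importance W_j
    C = np.zeros(n)                              # selection counts C_j
    for _ in range(t):
        S = rng.choice(n, size=k, replace=False)          # Step 2.1
        C[S] += 1                                         # Step 2.2
        w, b = train_svm(X[:, S], y)                      # Step 2.3
        error = np.mean(np.sign(X[:, S] @ w + b) != y)    # Step 2.4
        w2 = w ** 2 / (np.sum(w ** 2) + 1e-12)            # Step 2.5, Eq. (5)
        W[S] += w2 * (1.0 - error)                        # Step 2.6, Eq. (6)
    W /= np.maximum(C, 1)                                 # Step 3, Eq. (7)
    return np.argsort(W)[::-1][:L]                        # Step 4: top-L genes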

IV. EXPERIMENTS AND RESULTS

In this section, RSM-GR is evaluated on two public microarray datasets: Colon cancer and Leukemia. Both datasets are pre-processed using the procedure described in [17]: after thresholding, filtering and logarithmic transformation, the microarray data were standardized to zero mean and unit standard deviation across genes.

TABLE I
BASIC INFORMATION OF THE TWO MICROARRAY DATASETS

Dataset         Number of samples    Number of genes
Colon cancer    62                   2000
Leukaemia       72                   7129

Furthermore, Fisher's ratio, f = (μ_1 - μ_2)^2 / (σ_1^2 + σ_2^2), is used as a pre-selection procedure to reduce the number of features and the computational time. For each dataset, the top 1000 genes are selected based on Fisher's ratio. All the simulations and comparisons in this study (except the experiment testing computational time versus the size of the gene set) are based on the pre-processed and pre-selected data. This technique is also used in [8], [10].

In order to check the validity of the proposed method, RSM-GR is compared with two of the most commonly used gene selection algorithms: LS-Bound SFS and SVM-RFE. The parameters of RSM-GR are set as follows: the size of each subset equals the number of samples (k = m), and the number of repetitions equals the size of the gene set (t = n). All of the algorithms are developed in a Visual C++ 6.0 environment on a computer with a 2.8 GHz P4 CPU and 512 MB RAM.
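A minimal sketch of the Fisher-ratio pre-selection is shown below. It is our illustration (the small epsilon guarding against zero variance is our addition): compute f for each gene between the two classes and keep the top-ranked genes.

import numpy as np

def fisher_preselect(X, y, top=1000):
    # Fisher's ratio f = (mu1 - mu2)^2 / (sigma1^2 + sigma2^2) per gene.
    X1, X2 = X[y == 1], X[y == -1]
    f = (X1.mean(0) - X2.mean(0)) ** 2 / (X1.var(0) + X2.var(0) + 1e-12)
    return np.argsort(f)[::-1][:top]    # indices of the top-ranked genes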

A. The performance of the selected gene subset

In the context of discriminant analysis of microarray data, it is very important whether the selected gene subset brings good generalization ability to the classifier. There are different methods to evaluate a gene selection algorithm. One of them is external cross validation: gene selection and validation are performed on different subsets of the samples to produce an unbiased estimate; due to the small sample size of gene expression data, however, this approach is often not advisable. Another method is internal cross validation: the entire dataset is used for gene selection, and the performance of the selected genes is tested using k-fold cross validation; this produces a biased estimate. Ambroise and McLachlan suggested the techniques of external (10-fold) cross validation and the external .632+ bootstrap [18]; external (10-fold) cross validation is considered to have higher variance than the external .632+ bootstrap. In our study, we use the external B.632+ estimate to assess the

performance of the different gene selection algorithms. Balanced bootstrap samples are generated 200 times to reduce the variance; samples not contained in a training set are added to the corresponding testing set.
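The evaluation protocol can be sketched as follows. This is our hedged reconstruction of the external B.632+ procedure of [18] (following Efron and Tibshirani's .632+ estimator), not the authors' code; select_and_train is assumed to run gene selection and classifier training on the bootstrap training samples only, returning a prediction function, so that selection bias is avoided.

import numpy as np

def external_b632plus(X, y, select_and_train, B=200, seed=None):
    rng = np.random.default_rng(seed)
    m = len(y)
    # Balanced bootstrap: each sample occurs exactly B times in total.
    pool = rng.permutation(np.tile(np.arange(m), B)).reshape(B, m)
    oob_errors = []
    for b in range(B):
        train = pool[b]
        test = np.setdiff1d(np.arange(m), train)     # out-of-bag samples
        if test.size == 0:
            continue
        predict = select_and_train(X[train], y[train])
        oob_errors.append(np.mean(predict(X[test]) != y[test]))
    err1 = np.mean(oob_errors)                       # bootstrap (out-of-bag) error
    predict_all = select_and_train(X, y)
    err_bar = np.mean(predict_all(X) != y)           # resubstitution error
    p1 = np.mean(y == 1)                             # class-1 prior
    q1 = np.mean(predict_all(X) == 1)                # class-1 prediction rate
    gamma = p1 * (1 - q1) + (1 - p1) * q1            # no-information error rate
    R = np.clip((err1 - err_bar) / max(gamma - err_bar, 1e-12), 0.0, 1.0)
    w = 0.632 / (1 - 0.368 * R)                      # relative-overfitting weight
    return (1 - w) * err_bar + w * err1              # B.632+ estimate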

Fig. 1. The external B.632+ error for the colon cancer dataset (B.632+ error versus the number of selected genes, 5-50, for SVM-RFE, LS-Bound SFS and RSM-GR).

Fig. 2. The external B.632+ error for the leukemia dataset (B.632+ error versus the number of selected genes for SVM-RFE, LS-Bound SFS and RSM-GR).

Fig. 1 and Fig. 2 illustrate the external B.632+ errors on the colon cancer dataset and the leukemia dataset respectively. As shown in the figures, RSM-GR generally obtains the lowest B.632+ error among the three gene selection algorithms. In Fig. 1, RSM-GR is superior to the other two methods when the number of selected genes is larger than 15; in Fig. 2, RSM-GR is the best gene selection method when the number of selected genes is larger than 5. When very few genes are selected, RSM-GR is slightly inferior to LS-Bound or SVM-RFE. This is reasonable: both LS-Bound and SVM-RFE are greedy search methods, and when the number of selected genes is very small a greedy method can achieve a preferable result; as the number of selected genes increases, RSM-GR becomes the better choice.

B. Computational complexity

Few reports of computational time can be found in the previous literature on gene selection algorithms. However, computational complexity is in fact an important aspect of an algorithm. We discuss the computational complexity in two respects: the computational time of the algorithm with respect to the number of genes to be selected, and with respect to the size of the whole gene set.

Fig. 3 illustrates the computational time of the three algorithms for different numbers of selected genes. As shown in this figure, RSM-GR is the most efficient method. LS-Bound SFS costs less time than RSM-GR when the number of selected genes is smaller than 15, but its computational cost increases significantly as the number of selected genes grows. The computational time of SVM-RFE and RSM-GR does not change significantly with the number of selected genes, which is a good characteristic of both.

Fig. 3. The relationship between the computational time and the number of selected genes on the Leukaemia dataset.

Fig. 4. The relationship between the computational time and the size of the gene set (1000-7000 genes) on the Leukaemia dataset.

Fig. 4 illustrates the relationship between the computational time of the algorithms and the size of the gene set. Although the computational cost of all three algorithms increases significantly as the size of the gene set grows, the growth rate of RSM-GR is the lowest of the three. This demonstrates that RSM-GR remains the most efficient regardless of the size of the gene set, a property that is especially important for microarray data analysis.
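A rough operation count (ours, not the paper's) makes these trends plausible. With t = n subsets of size k = m, RSM-GR solves n linear systems of size (m+1) × (m+1), costing about O(n·m^3) regardless of how many genes L are finally selected. SVM-RFE also ranks all genes whatever L is, which is why neither curve in Fig. 3 varies much with L; but each of its O(n) retrainings works on a feature set of size up to n, so its cost grows much faster with the size of the gene set, consistent with Fig. 4. LS-Bound SFS, in contrast, performs one forward-selection step per selected gene, which is why its time climbs with L in Fig. 3.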

V. CONCLUSIONS

In this paper, a new gene ranking algorithm based on the Random Subspace Method is proposed for the gene selection problem. RSM-GR employs a ranking criterion similar to that of SVM-RFE, but evaluates it in a divide-and-conquer way to improve the generalization ability and reduce the computational time. Experiments show that RSM-GR outperforms SVM-RFE and LS-Bound SFS not only in accuracy but also in computational time; in particular, RSM-GR scales better to datasets with a large number of genes and to large numbers of selected genes. Moreover, RSM-GR provides a new framework for evaluating ranking criteria: new ranking criteria and new classifiers can easily be combined with this framework.

Although RSM-GR has shown good performance, it still needs further study. For example, the parameters used in RSM-GR are not explored in this study, and they may influence the efficiency of the algorithm; the reasons for the good characteristics of RSM-GR also demand further discussion and explanation.

REFERENCES

[1] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander, "Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring," Science, vol. 286, pp. 531-537, Oct 15 1999.
[2] C. L. Nutt, D. R. Mani, R. A. Betensky, P. Tamayo, J. G. Cairncross, C. Ladd, U. Pohl, C. Hartmann, M. E. McLaughlin, T. T. Batchelor, P. M. Black, A. von Deimling, S. L. Pomeroy, T. R. Golub, and D. N. Louis, "Gene expression-based classification of malignant gliomas correlates better with survival than histological classification," Cancer Research, vol. 63, pp. 1602-1607, Apr 1 2003.
[3] R. Kohavi and G. H. John, "Wrappers for feature subset selection," Artificial Intelligence, vol. 97, pp. 273-324, Dec 1997.
[4] R. M. C. R. de Souza, F. A. T. de Carvalho, and C. P. Tenorio, "Two partitional methods for interval-valued data using mahalanobis distances," Advances in Artificial Intelligence - Iberamia 2004, vol. 3315, pp. 454-463, 2004.
[5] C. F. Chang, K. M. Wai, and H. G. Patterton, "Calculating the statistical significance of physical clusters of co-regulated genes in the genome: the role of chromatin in domain-wide gene regulation," Nucleic Acids Research, vol. 32, pp. 1798-1807, Mar 2004.
[6] A. Reiner, D. Yekutieli, and Y. Benjamini, "Identifying differentially expressed genes using false discovery rate controlling procedures," Bioinformatics, vol. 19, pp. 368-375, Feb 12 2003.
[7] J. J. Yang and M. C. K. Yang, "An improved procedure for gene selection from microarray experiments using false discovery rate criterion," BMC Bioinformatics, vol. 7, Jan 11 2006.
[8] X. Zhou and K. Z. Mao, "LS Bound based gene selection for DNA microarray data," Bioinformatics, vol. 21, pp. 1559-1564, Apr 15 2005.
[9] L. B. Li, W. Jiang, X. Li, K. L. Moser, Z. Guo, L. Du, Q. J. Wang, E. J. Topol, Q. Wang, and S. Rao, "A robust hybrid between genetic algorithm and support vector machine for extracting an optimal feature gene subset," Genomics, vol. 85, pp. 16-23, Jan 2005.
[10] E. K. Tang, P. N. Suganthan, and X. Yao, "Gene selection algorithms for microarray data based on least squares support vector machine," BMC Bioinformatics, vol. 7, Feb 27 2006.
[11] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, pp. 389-422, 2002.
[12] C. Furlanello, M. Serafini, S. Merler, and G. Jurman, "Entropy-based gene ranking without selection bias for the predictive classification of microarray data," BMC Bioinformatics, vol. 4, Nov 6 2003.
[13] Y. Y. Ding and D. Wilkins, "Improving the performance of SVM-RFE to select genes in microarray data," BMC Bioinformatics, vol. 7, Sep 26 2006.
[14] C. Lai, M. J. T. Reinders, and L. Wessels, "Random subspace method for multivariate feature selection," Pattern Recognition Letters, vol. 27, pp. 1067-1076, Jul 15 2006.
[15] V. N. Vapnik, Statistical Learning Theory. New York: John Wiley and Sons, 1998.
[16] J. A. K. Suykens and J. Vandewalle, "Least squares support vector machine classifiers," Neural Processing Letters, vol. 9, pp. 293-300, Jun 1999.
[17] S. Dudoit, J. Fridlyand, and T. P. Speed, "Comparison of discrimination methods for the classification of tumors using gene expression data," Journal of the American Statistical Association, vol. 97, pp. 77-87, Mar 2002.
[18] C. Ambroise and G. J. McLachlan, "Selection bias in gene extraction on the basis of microarray gene-expression data," Proceedings of the National Academy of Sciences of the United States of America, vol. 99, pp. 6562-6566, May 2002.
