Simulated Annealing based Automatic Fuzzy Clustering combined with ANN Classification for Analyzing Microarray Data

Ujjwal Maulik
Department of Computer Science and Engineering, Jadavpur University, Kolkata - 700032, India
Email: [email protected]

Anirban Mukhopadhyay∗
Department of Computer Science and Engineering, University of Kalyani, Kalyani - 741235, India
Email: [email protected]

∗ Corresponding author. Email: [email protected], Fax: +91 33 25828282

Abstract

Microarray technology has made it possible to monitor the expression levels of many genes simultaneously across a number of experimental conditions. Fuzzy clustering is an important tool for analyzing microarray gene expression data. In this article, a real-coded variable configuration length Simulated Annealing (VSA) based fuzzy clustering method is developed and combined with a popular Artificial Neural Network (ANN) based classifier. The idea is to refine the clustering produced by VSA using the ANN classifier to obtain improved clustering performance. The proposed technique is used to cluster three publicly available real-life microarray data sets. Its superior performance is demonstrated by comparison with some widely used existing clustering algorithms. A statistical significance test has also been conducted to establish that the superior performance of the proposed clustering algorithm is statistically significant. Finally, the biological relevance of the clustering solutions is established.

Keywords: Microarray gene expression data, fuzzy clustering, cluster validity indices, variable configuration length simulated annealing, Artificial Neural Network, gene ontology.

1 Introduction

With the advancement of microarray technology, it is now possible to measure the expression levels of a huge number of genes across different experimental conditions simultaneously [1]. Microarray technology

in recent years has had major impacts in many fields such as medical diagnosis and biomedicine, characterizing various gene functions, and understanding different molecular biological processes [2, 3, 4, 5]. Due to its large volume, computational analysis is essential for extracting knowledge from microarray gene expression data. Clustering is one of the primary approaches to analyzing such a large amount of data. Clustering [6, 7, 8] is a popular exploratory pattern classification technique which partitions the input space into K regions {C1, C2, ..., CK} based on some similarity/dissimilarity metric, where the value of K may or may not be known a priori. The main objective of any clustering technique is to produce a K × n partition matrix U(X) of the given data set X, consisting of n patterns, X = {x1, x2, ..., xn}. The partition matrix may be represented as U = [ukj], k = 1, ..., K and j = 1, ..., n, where ukj is the membership of pattern xj to cluster Ck. In crisp partitioning, ukj = 1 if xj ∈ Ck, otherwise ukj = 0. On the other hand, for fuzzy partitioning of the data, the following conditions hold on U (representing non-degenerate clustering): $0 < \sum_{j=1}^{n} u_{kj} < n$ for $k = 1, \ldots, K$, $\sum_{k=1}^{K} u_{kj} = 1$ for $j = 1, \ldots, n$, and $\sum_{k=1}^{K} \sum_{j=1}^{n} u_{kj} = n$.

Some early works dealt with visual analysis of gene expression patterns to group the genes into functionally relevant classes [2, 3, 9]. However, as these methods were very subjective, standard clustering methods, such as K-means [10], fuzzy C-means [11], hierarchical methods [4], Self Organizing Maps (SOM) [12], graph theoretic approaches [13], simulated annealing based approaches [14, 15] and genetic algorithm (GA) based clustering methods [16, 17, 18], have been utilized for clustering gene expression data.

In this article, a two-stage fuzzy clustering algorithm has been proposed that combines supervised classification with unsupervised clustering of gene expression data. A variable configuration length Simulated Annealing (VSA) based fuzzy clustering algorithm that minimizes the Xie-Beni (XB) validity index [23] is utilized in the first stage for generating the fuzzy partition matrix as well as the number of clusters. Thereafter, the high-membership points of each cluster are identified and used to train an Artificial Neural Network (ANN) based classifier [19, 20]. Finally, in the second stage, the trained ANN is applied to classify the remaining points. The proposed two-stage technique is named VSA-ANN. The superiority of the proposed VSA-ANN clustering method, as compared to other popular methods for clustering gene expression data, namely fuzzy C-means (FCM) [21], average linkage [6], Self Organizing Map (SOM) [12] and the recently proposed SiMM-TS clustering [18], is established for three real-life gene expression data sets, viz., Yeast Sporulation, Human Fibroblasts Serum and Rat CNS. Moreover, statistical tests have been carried out to establish that the proposed technique produces results that are statistically significant and do not come by chance. Finally, a biological significance test has been conducted to establish that the clusters identified by the proposed technique are biologically relevant.

2 Motivation and Contribution

Fuzzy clustering of microarray gene expression data has an inherent advantage over crisp partitioning. While clustering the genes, it is often the case that a gene has an expression pattern similar to more than one class of genes. For example, in the MIPS (Munich Information Center for Protein Sequences) categorization of data, several genes belong to more than one category [22]. Hence it is evident that a great amount of imprecision and uncertainty is associated with gene expression data, and it is therefore natural to apply fuzzy clustering methods for partitioning expression data.

For defuzzifying a fuzzy clustering solution, each gene is assigned to the cluster to which it has the highest membership degree. It has been observed that, for a particular cluster, some of the genes belonging to it have a high membership degree to that cluster and can be considered properly clustered. On the contrary, other genes of the same cluster may have lower membership degrees; genes in the latter case are not assigned to that cluster with high confidence. Therefore it would be better if we could identify the low-confidence points (genes) in each cluster and reassign them properly. This observation motivates us to refine the clustering result using an Artificial Neural Network (ANN) based probabilistic classifier [19, 20], which is trained on the points having high membership degree in a cluster. The trained ANN classifier can thereafter be used to classify the remaining points. A variable configuration length Simulated Annealing (VSA) based fuzzy clustering algorithm that minimizes the Xie-Beni (XB) cluster validity index [23] is utilized in the first stage for generating the fuzzy partition matrix as well as the number of clusters. In the subsequent stage, the ANN is applied to classify the points with lower membership degree.

In [18], we proposed a two-stage clustering technique (SiMM-TS) that first identifies the points having significant membership to multiple clusters (SiMM points) using variable string length GA based clustering (VGA) minimizing the XB index [23]. The SiMM points are then excluded from the data set and the remaining points are clustered through a multiobjective clustering method [24]. Finally, the SiMM points are assigned to the nearest clusters. SiMM-TS depends heavily on the choice of a threshold parameter P, which had to be fixed through several iterations and thus takes time. Moreover, SiMM-TS has no concept of using supervised learning tools to improve the clustering. The clustering technique (VSA-ANN) proposed in this article is different from SiMM-TS. Here we first evolve the number of clusters and the fuzzy membership matrix through VSA based fuzzy clustering minimizing the XB index. Thereafter, the high-confidence points (core points) of each cluster are identified and used to train the ANN classifier. The remaining points are then classified using the trained classifier. This method also uses a membership threshold parameter; however, it is evolved automatically. In SiMM-TS, clustering is used in both stages, whereas in VSA-ANN, clustering is used in the first stage only; in the second stage, ANN classification is used. As supervised classification is known to perform better than unsupervised clustering, in this article an effort has been made to harness the strength of supervised classification for unsupervised clustering of gene expression data, which is the main novelty of the proposed approach. Thus VSA-ANN is expected to perform better than SiMM-TS, and this is also established experimentally.

3 Microarray Data

A microarray is a small chip onto which a large number of DNA molecules (probes) are attached in fixed grids. The chip is made of chemically coated glass, nylon, membrane or silicon. Each grid cell of a microarray chip corresponds to a DNA sequence. For a cDNA microarray experiment, the first step is to extract RNA from a tissue sample and amplify it. Thereafter, two mRNA samples are reverse-transcribed into cDNA (targets) labelled with different fluorescent dyes (red-fluorescent dye Cy5 and green-fluorescent dye Cy3). Due to the complementary nature of the base-pairs, the cDNA binds to the specific oligonucleotides on the array. In the subsequent stage, the dye is excited by a laser so that the amount of cDNA can be quantified by measuring the fluorescence intensities [4, 25]. The log ratio of the intensities of the two dyes is used as the gene expression level:

$$\text{gene expression level} = \log_2 \frac{\text{Intensity(Cy5)}}{\text{Intensity(Cy3)}}. \qquad (1)$$

A microarray experiment typically measures the expression levels of a large number of genes across different experimental conditions or time points. A microarray gene expression data set consisting of n genes and m conditions can be expressed as a real-valued n × m matrix M = [gij], i = 1, 2, ..., n, j = 1, 2, ..., m. Here each element gij represents the expression level of the ith gene at the jth experimental condition or time point (Figure 1). The raw gene expression data may contain noise and also suffer from variations arising from biological experiments and missing values. Hence, before applying any clustering algorithm, some preprocessing of the data is required. Two widely used preprocessing techniques are missing value estimation and normalization. Normalization is a statistical tool for transforming data into a format that can be used for meaningful cluster analysis [26, 27]. Among the various kinds of normalization techniques, the most used is the one by which each row of the matrix M is standardized to have mean 0 and variance 1.
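As a simple illustration of this preprocessing step, the following sketch (Python with NumPy, an assumed environment, not part of the original paper) standardizes each gene (row) of an expression matrix to zero mean and unit variance:

    import numpy as np

    def z_normalize_rows(M):
        """Standardize each gene (row) of the expression matrix M
        to have mean 0 and variance 1, as described above."""
        M = np.asarray(M, dtype=float)
        mean = M.mean(axis=1, keepdims=True)
        std = M.std(axis=1, keepdims=True)
        std[std == 0] = 1.0  # guard against constant rows (an added safeguard)
        return (M - mean) / std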

4 SA based Fuzzy Clustering with Variable Length Configuration

Simulated Annealing (SA) [28, 29] is a popular search algorithm that utilizes the principles of statistical mechanics, regarding the behavior of a large number of atoms at low temperature, for finding minimal-cost solutions to large optimization problems by minimizing the associated energy. In statistical mechanics, investigating the ground states or low-energy states of matter is very important. These states are achieved at very low temperatures. However, it is not sufficient to lower the temperature alone, since this results in unstable states. In the annealing process, the temperature is first raised, then decreased gradually to a very low value (Tmin), while ensuring that sufficient time is spent at each temperature value. This process yields stable low-energy states. In [30], a proof that SA converges if annealed sufficiently slowly is given. Being based on strong theory, SA has been successfully applied in diverse areas [31, 32, 33]. In this section, an improved variable configuration length SA based fuzzy clustering algorithm is described.


4.1 Solution Representation

A solution to the clustering problem is a set of K cluster centers, K being the number of clusters. Here a solution (configuration) is represented by a string of real numbers which represent the coordinates of the cluster centers. If configuration i encodes the centers of Ki clusters in m-dimensional space, then its length li will be m × Ki. For example, in four-dimensional space, the configuration <1.3 11.4 53.8 2.6 10.1 21.4 0.4 5.3 35.6 0.0 10.3 17.6> encodes 3 cluster centers, (1.3, 11.4, 53.8, 2.6), (10.1, 21.4, 0.4, 5.3) and (35.6, 0.0, 10.3, 17.6). Each center is considered to be indivisible.
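A minimal sketch of this encoding (Python/NumPy; the helper names are assumptions) that packs a set of centers into a flat configuration and recovers them again:

    import numpy as np

    def encode_centers(centers):
        """Flatten K centers in m-dimensional space into a
        configuration string of length K * m."""
        return np.asarray(centers, dtype=float).ravel()

    def decode_centers(config, m):
        """Recover the K = len(config) / m indivisible centers
        from a flat configuration."""
        return np.asarray(config, dtype=float).reshape(-1, m)

    # Example from the text: 3 centers in four-dimensional space.
    config = encode_centers([[1.3, 11.4, 53.8, 2.6],
                             [10.1, 21.4, 0.4, 5.3],
                             [35.6, 0.0, 10.3, 17.6]])
    print(decode_centers(config, m=4))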

4.2 Initial Configuration

The string corresponding to the initial configuration encodes the centers of K clusters, where K = (rand() % K∗) + 2, rand() is a function returning a random integer, and K∗ is a soft estimate of the upper bound of the number of clusters. Therefore, the number of clusters varies from 2 to K∗ + 1. The K centers encoded in the initial configuration are randomly selected distinct points from the input data set.
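A sketch of this initialization (Python/NumPy; names are assumptions) that draws K in [2, K∗ + 1] and picks K distinct data points as the initial centers:

    import numpy as np

    def initial_configuration(X, K_star, rng=np.random.default_rng()):
        """Build an initial configuration: K in [2, K*+1] distinct
        data points of X, flattened into a configuration string."""
        K = int(rng.integers(0, K_star)) + 2       # (rand() % K*) + 2
        idx = rng.choice(len(X), size=K, replace=False)
        return X[idx].ravel()                      # length K * m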

4.3 Computation of Energy

The energy of a configuration indicates the degree of goodness of the solution it represents. The goal is to minimize the energy to obtain the lowest energy state. In this article, the Xie-Beni (XB) cluster validity index [23] is used as the energy function. Let X = {x1, x2, ..., xn} be the set of n data points to be clustered. For computing the energy, the centers encoded in the configuration are first extracted. Let these be denoted as Z = {z1, z2, ..., zK}. The membership values uik, i = 1, 2, ..., K and k = 1, 2, ..., n, are computed as follows [21]:

$$u_{ik} = \frac{1}{\sum_{j=1}^{K} \left( \frac{D(z_i, x_k)}{D(z_j, x_k)} \right)^{\frac{2}{m-1}}}, \quad 1 \le i \le K;\ 1 \le k \le n, \qquad (2)$$

where D(·, ·) is a distance function, m is the weighting coefficient and K is the number of clusters encoded in the configuration. (Note that while computing uik using Eqn. 2, if D(zj, xk) is equal to zero for some j, then uik is set to zero for all i = 1, ..., K, i ≠ j, while ujk is set equal to one.) Subsequently, the centers encoded in the configuration are updated using the following equation [21]:

$$z_i = \frac{\sum_{k=1}^{n} (u_{ik})^m x_k}{\sum_{k=1}^{n} (u_{ik})^m}, \quad 1 \le i \le K, \qquad (3)$$

and the cluster membership values are recomputed as per Eqn. 2.

The XB index is defined as a function of the ratio of the total variation σ to the minimum separation sep of the clusters. Here σ and sep can be written as

$$\sigma(U, Z; X) = \sum_{i=1}^{K} \sum_{k=1}^{n} u_{ik}^2 D^2(z_i, x_k), \qquad (4)$$

and

$$sep(Z) = \min_{i \ne j} \{ D^2(z_i, z_j) \}, \qquad (5)$$

where U, Z and X represent the partition matrix, the set of cluster centers and the data set, respectively. The XB index is then written as

$$XB(U, Z; X) = \frac{\sigma(U, Z; X)}{n \times sep(Z)} = \frac{\sum_{i=1}^{K} \sum_{k=1}^{n} u_{ik}^2 D^2(z_i, x_k)}{n \times \left( \min_{i \ne j} \{ D^2(z_i, z_j) \} \right)}. \qquad (6)$$

Note that when the partitioning is compact and good, the value of σ should be low while sep should be high, thereby yielding lower values of the XB index. The objective is therefore to minimize the XB index for achieving proper clustering.
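To make Eqns. 2-6 concrete, the following sketch (Python/NumPy, an assumed environment; the Euclidean distance is used for D purely as an illustration) computes the fuzzy memberships, updates the centers, and evaluates the XB energy:

    import numpy as np

    def memberships(Z, X, m=2.0):
        """Eqn. 2: fuzzy membership of each point to each center."""
        D = np.linalg.norm(X[None, :, :] - Z[:, None, :], axis=2)  # K x n
        D = np.fmax(D, np.finfo(float).eps)  # handles D = 0 in the limit
        power = 2.0 / (m - 1.0)
        ratio = (D[:, None, :] / D[None, :, :]) ** power           # K x K x n
        return 1.0 / ratio.sum(axis=1)                             # K x n

    def update_centers(U, X, m=2.0):
        """Eqn. 3: weighted-mean update of the cluster centers."""
        W = U ** m
        return (W @ X) / W.sum(axis=1, keepdims=True)

    def xb_index(U, Z, X):
        """Eqn. 6: Xie-Beni index = sigma / (n * sep)."""
        n = len(X)
        D2 = np.linalg.norm(X[None, :, :] - Z[:, None, :], axis=2) ** 2
        sigma = np.sum(U ** 2 * D2)                          # Eqn. 4
        C2 = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2) ** 2
        sep = np.min(C2[~np.eye(len(Z), dtype=bool)])        # Eqn. 5
        return sigma / (n * sep)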

4.4 Perturbation

In this article, three perturbation operations have been used to obtain a new configuration from the previous one. The three operations are applied with equal probability and are as follows:

4.4.1 Perturb Center

In this method, a random center of the configuration is chosen to be perturbed. A random number δ in the range [-1, 1] is generated with uniform distribution. If the value of the center in the dth dimension is zd, after perturbation it becomes (1 ± 2·δ·p)·zd when zd ≠ 0, and ±2·δ·p when zd = 0. The '+' or '-' sign occurs with equal probability. Here p denotes the perturbation rate, which is taken to be 0.01 in this article.

4.4.2 Split Center

The biggest cluster encoded in the configuration is split. To do this, first the size Si of each cluster i is computed as follows:

$$S_i = \sum_{j=1}^{n} u_{ij}, \quad 1 \le i \le K, \qquad (7)$$

where K is the number of clusters encoded in the configuration chosen for perturbation. Thereafter, the center of the biggest cluster is selected and substituted by two new centers that are created as follows. A reference point p is found that has membership value closest to the mean of the membership values above 0.5 to the center of the biggest cluster. The distance between the reference point p and the selected center in the dth dimension (distd) is computed as

$$dist_d = |z_d - p_d|. \qquad (8)$$

Subsequently the values of the dth dimension of the two new centers that replace the currently selected center are given by zd ± distd .

4.4.3 Delete Center

In the delete center operation, the smallest cluster is identified as per Eqn. 7 and its center is deleted from the configuration. If the delete center operation would result in a single cluster, the operation is not performed.
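A compact sketch of the three operators (Python/NumPy; the helper names are assumptions, and memberships is the function sketched above) may clarify how a new configuration is derived:

    import numpy as np

    rng = np.random.default_rng()

    def perturb_center(Z, p=0.01):
        """Perturb a random center; a single delta and sign are applied
        to every dimension here for simplicity (a sketch)."""
        Z = Z.copy()
        i = rng.integers(len(Z))
        factor = rng.choice([-1.0, 1.0]) * 2.0 * rng.uniform(-1.0, 1.0) * p
        Z[i] = np.where(Z[i] == 0, factor, (1.0 + factor) * Z[i])
        return Z

    def split_center(Z, U, X):
        """Split the biggest cluster (Eqns. 7-8): replace its center
        by z +/- dist, dist taken from a high-membership reference point."""
        i = np.argmax(U.sum(axis=1))             # biggest cluster, Eqn. 7
        high = U[i] > 0.5
        target = U[i][high].mean() if high.any() else U[i].max()
        ref = X[np.argmin(np.abs(U[i] - target))]
        dist = np.abs(Z[i] - ref)                # Eqn. 8
        new = [Z[i] + dist, Z[i] - dist]
        return np.vstack([np.delete(Z, i, axis=0), new])

    def delete_center(Z, U):
        """Delete the center of the smallest cluster, unless only
        two clusters remain."""
        if len(Z) <= 2:
            return Z
        return np.delete(Z, np.argmin(U.sum(axis=1)), axis=0)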

4.5 Acceptance of the New Configuration

Suppose the current configuration curr has energy value Ecurr and the new configuration new, obtained by perturbing the current configuration, has energy value Enew. If Enew ≤ Ecurr, then the new configuration is accepted and becomes the current configuration. If Enew is greater than Ecurr, then the probability pacc of accepting the new configuration is given by

$$p_{acc} = \exp\left( -\frac{E_{new} - E_{curr}}{T_t} \right), \qquad (9)$$

where Tt is the current temperature. This means that the probability of accepting a comparatively bad solution decreases with increasing badness of the new solution and with decreasing temperature.
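A one-function sketch of this Metropolis-style criterion (Python; names are assumptions):

    import math, random

    def accept(E_curr, E_new, T):
        """Eqn. 9: always accept improvements; accept a worse
        configuration with probability exp(-(E_new - E_curr) / T)."""
        if E_new <= E_curr:
            return True
        return random.random() < math.exp(-(E_new - E_curr) / T)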

4.6 Annealing Schedule

Starting from the initial high temperature T1 = Tmax, the temperature is decreased as per some annealing schedule at each generation t, such that T1 ≥ T2 ≥ ... ≥ Tt ≥ ... ≥ Tmin ≈ 0, where Tmin is the minimum temperature. The asymptotic convergence (i.e., as t → ∞) of SA is guaranteed for a logarithmic annealing schedule of the form Tt = Tmax / (1 + ln t), where t ≥ 1. However, in practice, logarithmic annealing is far too slow and hence we have used a geometric schedule of the form Tt = Tmax · (1 − α)^t, where α is a positive real number close to zero. At each temperature, to obtain a stable state, the process of perturbation and acceptance of the new configuration is repeated Iter times. The annealing process terminates when the current temperature reaches Tmin. The different steps of the VSA algorithm are shown in Fig. 2.
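Putting the pieces together, a skeleton of the VSA loop under the geometric schedule (Python; it reuses the hypothetical accept, perturbation, membership and energy helpers sketched above) might look like:

    def vsa(X, K_star, T_max=100.0, T_min=1.0, alpha=0.1, n_iter=200):
        """Skeleton of the VSA loop (cf. Fig. 2): geometric cooling with
        Iter perturbation/acceptance trials per temperature."""
        Z = decode_centers(initial_configuration(X, K_star), X.shape[1])
        U = memberships(Z, X)
        E = xb_index(U, Z, X)
        T = T_max
        while T > T_min:
            for _ in range(n_iter):
                choice = rng.integers(3)          # pick an operator uniformly
                if choice == 0:
                    Z_new = perturb_center(Z)
                elif choice == 1:
                    Z_new = split_center(Z, U, X)
                else:
                    Z_new = delete_center(Z, U)
                U_new = memberships(Z_new, X)
                Z_new = update_centers(U_new, X)  # Eqn. 3
                U_new = memberships(Z_new, X)     # recompute via Eqn. 2
                E_new = xb_index(U_new, Z_new, X)
                if accept(E, E_new, T):
                    Z, U, E = Z_new, U_new, E_new
            T *= (1.0 - alpha)                    # geometric schedule
        return Z, U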

5 Artificial Neural Network based Classifier

The ANN classifier used in this article (Figure 3) implements a three-layer feed-forward neural network with a hyperbolic tangent activation function for the hidden layer and the softmax function [34] for the output layer. Using softmax, the output of the ith output neuron is given by

$$p_i = \frac{e^{q_i}}{\sum_{j=1}^{K} e^{q_j}}, \qquad (10)$$

where qi is the net input to the ith output neuron and K is the number of output neurons. The use of softmax makes it possible to interpret the outputs as probabilities. The number of neurons in the input layer is d, where d is the number of features of the input data set. The number of neurons in the output layer is K, where K is the number of classes. The ith output neuron provides the class membership degree of the input pattern to the ith class. The number of hidden layer neurons is taken as 2·d. The weights

are optimized with a maximum a posteriori (MAP) approach, using a cross-entropy error function augmented with a Gaussian prior over the weights. The amount of regularization is determined by MacKay's ML-II scheme [20]. The outlier probability of the training examples is also estimated [35]. Figure 3 shows the feed-forward ANN classifier model.
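A minimal sketch of the forward pass of such a d → 2d → K network (Python/NumPy; weight training via the MAP / ML-II scheme of [20] is omitted, and all names are assumptions):

    import numpy as np

    def softmax(q):
        """Eqn. 10: convert net inputs to class probabilities."""
        q = q - q.max()              # shift for numerical stability
        e = np.exp(q)
        return e / e.sum()

    def forward(x, W1, b1, W2, b2):
        """Three-layer feed-forward pass: tanh hidden layer (2d units),
        softmax output layer (K units)."""
        h = np.tanh(W1 @ x + b1)     # hidden activations, shape (2d,)
        return softmax(W2 @ h + b2)  # class probabilities, shape (K,)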

6 Proposed VSA-ANN Clustering Technique

As discussed earlier, a fuzzy clustering technique such as VSA generates a fuzzy partition matrix U = [uik], i = 1, ..., K and k = 1, ..., n, where K and n are the number of clusters evolved and the number of data points, respectively. The solution can be defuzzified by assigning each point to the cluster to which it has the highest membership degree. Hence, for each cluster, the points belonging to it may have membership degrees ranging from high (higher confidence) to low (lower confidence). The points having lower membership degrees may be considered not to be assigned to that cluster with a reasonable confidence level, while the points having high membership values can be regarded as properly classified. This motivates us to design a clustering method where the points having high membership values in each cluster are used to train a classifier, and the class labels of the remaining points are thereafter predicted using the trained classifier. In this article, we have used the VSA based fuzzy clustering algorithm to evolve the fuzzy membership matrix as well as the number of clusters, and subsequently an ANN based probabilistic classifier to perform the classification task. The method is named VSA-ANN and its steps are as follows:

Step 1: Cluster the input data set X = {x1, x2, ..., xn} using the VSA based fuzzy clustering algorithm to evolve the fuzzy membership matrix U = [uik], i = 1, ..., K and k = 1, ..., n, where K and n are the number of clusters (evolved automatically) and the number of data points, respectively.

Step 2: Assign each point k, (k = 1, ..., n), to some cluster j (1 ≤ j ≤ K) such that ujk = maxi=1,...,K {uik}.

Step 3: For each cluster i (i = 1, ..., K), select all the points j of that cluster for which uij ≥ Ti, where Ti (0 < Ti < 1) is a threshold on the membership degree for cluster i. These points act as training points for cluster i. Combine the training points of all the clusters to form the complete training set. Keep the remaining points as the test set.

Step 4: Train the probabilistic ANN classifier using the training set created in the previous step.

Step 5: Generate the conditional membership probabilities for the remaining points (test points) using the trained ANN classifier.

Step 6: Obtain the new membership matrix U∗ = [u∗ik]K×n by combining the memberships of the training points (obtained using VSA) and the test points (produced by the trained ANN).


Step 7: Assign each point k, (k = 1, ..., n), to some cluster j (1 ≤ j ≤ K) such that u∗jk = maxi=1,...,K {u∗ik}.

Note that the size and the confidence of the training set depend on the choice of the membership thresholds Ti, i = 1, ..., K. If the Ti values are large, the size of the training set decreases; however, the training set will then contain only the points having high membership degrees to their respective clusters, and hence more confidence. On the contrary, for small values of Ti, the size of the training set increases at the expense of the confidence level. Therefore the choice of the threshold parameters Ti, i = 1, ..., K, has a significant effect on the performance of VSA-ANN. During experimentation, it has been noticed that the best clustering performance is achieved if, for each cluster, the points that have membership degrees greater than the mean membership degree of that cluster are selected for training. Taking this into account, after several experiments we have fixed Ti as follows:

$$T_i = \frac{1}{n_i} \sum_{j \in C_i} u_{ij}, \quad i = 1, \ldots, K, \qquad (11)$$

where ni is the size of cluster i (denoted by Ci). This implies that, for each cluster, the points having membership degrees greater than the mean of the membership degrees of all the points of that cluster are chosen as the training points. Thus the membership threshold value can vary from one cluster to another. For the purpose of illustration, Figure 4(a) shows a two-dimensional artificial data set containing five clusters, and Figure 4(b) and Figure 4(c) show the training set and test set of points obtained, respectively. This example indicates that the points in the test set are usually situated in the overlapping regions of the clusters and thus involve a large amount of confusion regarding their class assignment.
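A sketch of Steps 2-3 with the automatic threshold of Eqn. 11 (Python/NumPy; names are assumptions):

    import numpy as np

    def split_train_test(U):
        """Given the K x n membership matrix U, return boolean masks of
        training points (membership above the cluster's mean membership,
        Eqn. 11) and test points, plus the defuzzified labels.
        Assumes every cluster receives at least one point."""
        labels = U.argmax(axis=0)                  # Step 2: defuzzify
        train = np.zeros(U.shape[1], dtype=bool)
        for i in range(U.shape[0]):
            members = labels == i                  # cluster C_i
            if members.any():
                T_i = U[i, members].mean()         # Eqn. 11
                train |= members & (U[i] >= T_i)   # Step 3
        return train, ~train, labels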

7 Distance Measures

The choice of distance measure plays a great role in the context of microarray clustering. In this article, the Pearson correlation-based distance measure has been used, as this is the most commonly used distance metric for clustering gene expression data. A gene expression data set consisting of n genes and m time points is usually expressed as a real-valued n × m matrix E = [gij], i = 1, 2, ..., n, j = 1, 2, ..., m. Here each element gij represents the expression level of the ith gene at the jth time point.

Pearson Correlation: Given two feature vectors gi and gj, the Pearson correlation coefficient Cor(gi, gj) between them is computed as

$$Cor(g_i, g_j) = \frac{\sum_{l=1}^{m} (g_{il} - \mu_{g_i})(g_{jl} - \mu_{g_j})}{\sqrt{\sum_{l=1}^{m} (g_{il} - \mu_{g_i})^2} \sqrt{\sum_{l=1}^{m} (g_{jl} - \mu_{g_j})^2}}. \qquad (12)$$

Here µgi and µgj represent the arithmetic means of the components of the feature vectors gi and gj, respectively. The Pearson correlation coefficient defined in Eqn. 12 is a measure of similarity between two objects in the feature space. The distance between two objects gi and gj is computed as 1 − Cor(gi, gj), which represents the dissimilarity between those two objects.
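A direct transcription of Eqn. 12 and the derived distance (Python/NumPy):

    import numpy as np

    def pearson_distance(gi, gj):
        """1 - Cor(gi, gj): the correlation-based dissimilarity of
        two expression profiles (Eqn. 12)."""
        gi = gi - gi.mean()
        gj = gj - gj.mean()
        cor = (gi @ gj) / np.sqrt((gi @ gi) * (gj @ gj))
        return 1.0 - cor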

8 Complexity Analysis

In this section, we analyze the time and space complexity of the proposed VSA-ANN clustering and compare it with those of the other clustering methods considered here.

8.1 Time Complexity

Since the time taken by VSA dominates the training and testing of the ANN, the worst-case time complexity of VSA-ANN is dominated by the complexity of VSA only. The complexity of VSA can be computed as follows:

1. The time required for the initialization of the configuration is proportional to the length of the configuration. As the length of the configuration is proportional to K∗ × d (K∗ = soft estimate of the upper bound of the number of clusters, d = data dimension), the time complexity of initialization is O(K∗ × d).

2. One of the following perturbation operations is performed randomly:

(a) Perturb center: this can be performed in O(d) time.

(b) Split center: this can be performed in O(n × K∗ × d) time, where n is the number of data points.

(c) Delete center: this can be performed in O(n × K∗ × d) time.

Hence, in the worst case, the complexity of perturbation is O(n × K∗ × d).

3. Energy computation is composed of three steps:

(a) The complexity of computing the membership of n points to K∗ clusters is O(n × K∗ × d).

(b) For updating K∗ cluster centers, the complexity is O(K∗ × d).

(c) The complexity of computing the energy function is O(n × K∗ × d).

Hence the total complexity of energy computation is O(n × K∗ × d).

Summing up the above complexities, the total time complexity becomes O(n × K∗ × d) per iteration. If the number of different temperatures is t and the number of iterations at each temperature is Iter, then the overall time complexity of VSA, and hence of VSA-ANN, becomes O(t × Iter × n × K∗ × d).

The time complexities of the other algorithms considered in this article are as follows. The time complexity of FCM for k clusters is O(I × n × k × d), where I is the number of iterations. As FCM needs to be run for each number of clusters from 2 to K∗ to find the number of clusters, the total time complexity of FCM becomes $O(\sum_{k=2}^{K^*} I \times n \times k \times d)$.

The time complexity of the average linkage algorithm is O(n² × log n × d). SOM clustering has a time complexity of O(I × n × k × d) for k map elements and I iterations. To find the number of clusters, SOM is executed for each value of k from 2 to K∗, k being even. Hence the total time complexity of SOM becomes $O(\sum_{k=2,4,\ldots,K^*} I \times n \times k \times d)$.

The SiMM-TS algorithm has two stages. In the first stage, it uses VGA based clustering, which has a time complexity of O(G × P × n × K∗ × d), where G and P are the number of generations and the population size, respectively. To identify the SiMM points, it takes O(n log n) time. In the second stage, it uses multiobjective GA based clustering, which has a time complexity of O(G × P × n × K∗ × d) with two objective functions. However, the second-stage clustering algorithm is executed T times with different values of the threshold parameter P. Hence the overall time complexity of the second stage, and thus of the SiMM-TS algorithm, is O(G × P × T × n × K∗ × d).

8.2 Space Complexity

VSA has a worst-case space complexity of O(K∗ × n), i.e., the size of the membership matrix. The ANN has a space complexity of O(N × H + H × F) (the size of the weight matrices), where N, H and F are the number of neurons in the input layer, hidden layer and output layer, respectively. As discussed in Section 5, N = d, H = 2·d and F = K∗ in the worst case. Hence the space complexity of the ANN is O(d² + d × K∗), and the space complexity of VSA-ANN is O(K∗ × n), assuming n >> d. The space complexity of FCM is O(K∗ × n). The average linkage algorithm has a space complexity of O(n²). The space complexity of the SOM algorithm is also O(K∗ × n) (the size of the distance matrix at each iteration). Finally, the space complexity of SiMM-TS is O(P × K∗ × n), P being the population size. Note that the input data set of size O(n × d) must also be kept in memory for all the above algorithms for faster performance.

9 Data Sets and Pre-processing

In this article, three real-life gene expression data sets, viz., Yeast Sporulation, Human Fibroblasts Serum and Rat CNS, have been considered for experiments. The data sets and the pre-processing techniques used are described below.

9.1 Yeast Sporulation

Microarray data on the transcriptional program of sporulation in budding yeast have been considered here. The data set [3] is publicly available at http://cmgm.stanford.edu/pbrown/sporulation. DNA microarrays containing 97% of the known and predicted genes are used; the total number of genes is 6118. During the sporulation process, the mRNA levels were obtained at seven time points: 0, 0.5, 2, 5, 7, 9 and 11.5 hours. The ratio of each gene's mRNA level (expression) to its mRNA level in vegetative cells before transfer to the sporulation medium is measured, and the ratio data are log2-transformed. Among the 6118 genes, those whose expression levels did not change significantly during the harvesting were excluded from further analysis, as determined by a threshold level of 1.4 for the root mean squares of the log2-transformed ratios. The resulting set consists of 690 genes.


9.2 Human Fibroblasts Serum

This data set contains the expression levels of 8613 human genes [36]. The data set was obtained as follows: first, human fibroblasts were deprived of serum for 48 hours and then stimulated by the addition of serum. After the stimulation, the expression levels of the genes were computed over twelve time points, and an additional data point was obtained from a separate unsynchronized sample; hence the data set has 13 dimensions. A subset of 517 genes whose expression levels changed substantially across the time points was chosen, and the data were then log2-transformed. This data set can be downloaded from http://www.sciencemag.org/feature/data/984559.shl.

9.3 Rat CNS

The Rat CNS data set was obtained by reverse transcription-coupled PCR, examining the expression levels of a set of 112 genes during rat central nervous system development over 9 time points [37]. This data set is available at http://faculty.washington.edu/kayee/cluster.

Each data set is normalized so that each row has mean 0 and variance 1 (Z normalization) [27].

10 Experimental Results

This section first describes the performance metrics used to evaluate the various algorithms. Thereafter, a comparative study is made among the algorithms in terms of these metrics. Finally, a statistical significance test is carried out to establish that the superior performance of VSA-ANN is statistically significant.

10.1 Performance Metrics

For evaluating the performance of the clustering algorithms, the silhouette index [38] is used. In addition, two cluster visualization tools, namely the Eisen plot and the cluster profile plot, have been utilized.

10.1.1 Silhouette Index

The silhouette index [38] is a cluster validity index that is used to judge the quality of a clustering solution C. Suppose a represents the average distance of a point from the other points of the cluster to which it is assigned, and b represents the minimum of the average distances of the point from the points of the other clusters. The silhouette width s of the point is then defined as

$$s = \frac{b - a}{\max\{a, b\}}. \qquad (13)$$

The silhouette index s(C) is the average silhouette width over all the data points (genes) and reflects the compactness and separation of the clusters. Its value varies from -1 to 1, and a higher value indicates a better clustering result.
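A straightforward transcription of Eqn. 13 (Python/NumPy; pairwise Euclidean distances are used as an illustrative choice):

    import numpy as np

    def silhouette_index(X, labels):
        """Average silhouette width s(C) over all points (Eqn. 13).
        Assumes at least two clusters, each with at least two points."""
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        s = np.zeros(len(X))
        for k, lab in enumerate(labels):
            same = labels == lab
            a = D[k, same & (np.arange(len(X)) != k)].mean()
            b = min(D[k, labels == other].mean()
                    for other in set(labels) if other != lab)
            s[k] = (b - a) / max(a, b)
        return s.mean()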

12

10.1.2 Eisen Plot

In an Eisen plot [4] (see Figure 5(a) for an example), the expression value of a gene at a specific time point is represented by coloring the corresponding cell of the data matrix with a color similar to the original color of its spot on the microarray. Shades of red represent higher expression levels, shades of green represent low expression levels, and colors toward black represent the absence of differential expression. In our representation, the genes are ordered before plotting so that the genes belonging to the same cluster are placed one after another. The cluster boundaries are identified by white blank rows.

10.1.3 Cluster Profile Plot

The cluster profile plot (see Figure 5(b) for an example) shows, for each cluster, the normalized gene expression values (light green) of the genes of that cluster with respect to the time points. The average expression values of the genes of the cluster over the different time points are also shown as a black line, together with the standard deviation within the cluster at each time point.

10.2 Input Parameters for VSA-ANN

The parameters for VSA are as follows: Tmax = 100, Tmin = 1, Iter = 200 and α = 0.1. The fuzzy exponent m is chosen to be 2.0. The value of K∗, i.e., the soft estimate of the upper bound of the number of clusters, is taken to be 15 for all the data sets.

10.3 Comparative Study

In order to establish the effectiveness of the proposed VSA-ANN clustering scheme, its performance has been compared with the Average Linkage, SOM [12] and SiMM-TS [18] clustering algorithms. Moreover, VSA and an iterated version of the fuzzy C-means (IFCM) algorithm have been applied independently. FCM [21] is a widely used partitional clustering technique whose objective is to use the principles of fuzzy sets to evolve a partition matrix U(X). It minimizes the measure

$$J_m = \sum_{j=1}^{n} \sum_{k=1}^{K} u_{kj}^m D^2(z_k, x_j). \qquad (14)$$

It is known that the FCM algorithm sometimes gets stuck at a suboptimal solution [39]. In the iterated FCM (IFCM), the FCM algorithm is run for different values of K from 2 to K∗, where K∗ is the soft estimate of the upper bound of the number of clusters. For each K, it is executed 10 times from different initial configurations and the run giving the best Jm value is taken. Among these best solutions for the different K values, the solution producing the minimum XB index (Eqn. 6) value is chosen as the best partitioning, and the corresponding K and partition matrix are taken as the solution. For average linkage and SOM, the algorithms are executed for values of K ranging from 2 to K∗ and the K value that provides the best silhouette index score is reported. As SOM can only produce an even number of clusters due to its grid structure, to produce the SOM result for k clusters (k being odd), we have merged the two closest clusters (minimum distance between the cluster centers) in the SOM clustering result for k + 1 clusters.

10.3.1 Results for Yeast Sporulation Data

Table 1 shows the average silhouette index values obtained over 50 consecutive runs by the algorithms VSA-ANN, IFCM, VSA, average linkage and SOM. It can be noted from the table that VSA (and thus VSA-ANN), average linkage, SOM and SiMM-TS determined the number of clusters as 8, whereas IFCM found 7 clusters in the data set. From the s(C) values, it is evident that the performance of the proposed VSA-ANN clustering method is superior to that of the other methods.

Table 2 reports the number of points included in the training and test sets by the proposed method for all the data sets. The table also reports the s(C) index scores for the test points before and after the application of the ANN classifier, and the percentage of test points that changed their class labels after its application. It can be noted from the table that for the Sporulation data, before the application of the ANN classifier, the s(C) score of the test points was 0.1703. This low value of s(C) for the test data (compared to the overall silhouette index of 0.4872, as can be seen in Table 1) indicates that these data points had not been clustered properly and needed to be refined. After the application of the ANN classifier in the second stage, 41.2% of the test points changed their class labels and the s(C) score for the test data improved from 0.1703 to 0.2493. This indicates the utility of the proposed VSA-ANN method.

To demonstrate the result of VSA-ANN clustering visually, Figure 5 shows the Eisen plot and cluster profile plots corresponding to the best result (in terms of silhouette index) provided by VSA-ANN on the Yeast data set. The Eisen plot (Figure 5(a)) clearly shows the 8 prominent and distinguished clusters of the Yeast data. As evident from the figure, the genes of a cluster exhibit similar expression profiles, i.e., they produce similar color patterns. The cluster profile plots (Figure 5(b)) also demonstrate how the expression profiles of the different groups of genes differ from each other, while the profiles within a group are reasonably similar.

10.3.2 Results for Human Fibroblasts Serum Data

The average s(C) values obtained by the different clustering algorithms over 50 consecutive runs on the Human Fibroblasts Serum data are reported in Table 1. For this data set, all the algorithms except IFCM determined the number of clusters as 6, whereas IFCM found 8 clusters. For this data set also, the proposed VSA-ANN method outperforms all the other algorithms in terms of s(C). It is evident from Table 2 that 33.16% of the test points changed their class labels and the s(C) score of the test points improved from 0.1063 to 0.1539 after the application of ANN classification in the second stage.

Figure 6 shows the Eisen plot and cluster profile plots for the clustering solution obtained by the VSA-ANN technique on the Serum data set. It is evident from the figure that the genes of each cluster are highly co-regulated and thus have similar expression profiles.

10.3.3 Results for Rat CNS Data

Table 1 reports the average s(C) values for the clustering results obtained by the different algorithms over 50 consecutive runs on the Rat CNS data. For this data set also, VSA (and thus VSA-ANN), average linkage and SiMM-TS give the number of clusters as 6, similar to that found in [37], while IFCM and SOM identified 5 clusters. Again, the proposed VSA-ANN clustering method provides a much improved value of s(C) compared to all the other algorithms. As is evident from Table 2, 20.41% of the test points changed their class labels and the s(C) score of the test points improved from 0.1142 to 0.2083 after the application of ANN classification. For illustration, the Eisen plot and cluster profile plots for this data set are shown in Figure 7.

As discussed above, the results indicate a significant improvement in clustering performance using the proposed VSA-ANN approach compared to the other algorithms. A statistical significance test has been carried out next to establish that the superior results obtained by VSA-ANN are statistically significant.

10.4 Statistical Significance Test

To judge the statistical significance of the clustering results, a non-parametric statistical significance test called Wilcoxon's rank sum test for independent samples [40] has been conducted at the 1% significance level. Six groups, corresponding to the six algorithms (1. VSA-ANN, 2. IFCM, 3. VSA, 4. Average linkage, 5. SOM, 6. SiMM-TS), have been created for each data set. Each group consists of the s(C) index scores produced by 50 consecutive runs of the corresponding algorithm. The median values of each group for all the data sets are shown in Table 3. It is evident from Table 3 that the median s(C) index scores for VSA-ANN are better than those of the other algorithms. To establish that this goodness is statistically significant, Table 4 reports the P-values produced by Wilcoxon's rank sum test for the comparison of two groups at a time (the group corresponding to VSA-ANN and a group corresponding to some other algorithm). The null hypothesis assumes that there is no significant difference between the median values of the two groups, whereas the alternative hypothesis is that there is a significant difference. All the P-values reported in the table are less than 0.01 (1% significance level). For example, the rank sum test between VSA-ANN and IFCM on the Sporulation data set provides a P-value of 2.33E-07, which is very small. This is strong evidence against the null hypothesis, indicating that the better median values of the performance metrics produced by VSA-ANN are statistically significant and have not occurred by chance. Similar results are obtained for all the other data sets and algorithms compared with VSA-ANN, establishing the significant superiority of the VSA-ANN algorithm.
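Such a comparison can be reproduced with SciPy's rank-sum test; a minimal sketch (the score arrays below are synthetic placeholders, not the paper's data):

    import numpy as np
    from scipy.stats import ranksums

    rng = np.random.default_rng(0)
    # Placeholder s(C) scores from 50 runs of two algorithms.
    scores_a = rng.normal(0.52, 0.01, 50)   # e.g. VSA-ANN-like scores
    scores_b = rng.normal(0.40, 0.02, 50)   # e.g. IFCM-like scores

    stat, p = ranksums(scores_a, scores_b)
    print(f"P-value: {p:.2e}")  # reject the null hypothesis at 1% if p < 0.01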

11 Biological Significance

The biological relevance of a cluster can be verified based on the statistically significant GO annotation database (http://db.yeastgenome.org/cgi-bin/GO/goTermFinder). This is used to test the functional enrichment of a group of genes in terms of three structured, controlled vocabularies (ontologies), viz., associated biological processes, molecular functions and cellular components. The p-value of a statistical significance test gives the probability of obtaining values of a test statistic at least as extreme in magnitude as the observed one. The degree of functional enrichment (p-value) is computed using a cumulative hypergeometric distribution, which measures the probability of finding the observed number of genes annotated to a given GO term (i.e., function, process, component) within a cluster. For a given GO category, the probability p of getting k or more genes within a cluster of size n can be defined as [41]:

$$p = 1 - \sum_{i=0}^{k-1} \frac{\binom{f}{i} \binom{g-f}{n-i}}{\binom{g}{n}}, \qquad (15)$$

where f and g denote the total number of genes within the category and within the genome, respectively. Statistical significance is evaluated for the genes in a cluster by computing the p-value for each GO category; this signifies how well the genes in the cluster match the different GO categories. If the majority of genes in a cluster have the same biological function, then it is unlikely that this happened by chance, and the p-value of the category will be close to 0.

The biological significance test has been conducted at the 1% significance level. The number of clusters for which the most significant GO terms have a p-value less than 0.01 (1% significance level) is as follows: VSA-ANN - 8, IFCM - 6, VSA - 8, Average linkage - 5, SOM - 6, and SiMM-TS - 8. Note that only for VSA, VSA-ANN and SiMM-TS are all the produced clusters significantly enriched with some GO categories. In Table 5, the most significant p-values of the functionally enriched clusters of the Yeast Sporulation data, as obtained by the different algorithms, are reported. The clusters are sorted according to significance level; the lower the p-value, the better the significance. For visual inspection, Figure 8 plots the most significant p-values of the functionally enriched clusters (sorted by significance level) of this data set for the different algorithms. The p-values are log-transformed for better readability. It is clear from the figure that the curve corresponding to VSA-ANN lies below all the other curves. This indicates that all the 8 clusters found by VSA-ANN are more significantly enriched than the clusters obtained by the other algorithms.

For the purpose of illustration, Table 6 reports the three most significant GO terms shared by the genes of each of the 8 clusters identified by the VSA-ANN technique (Figure 5). The most significant GO terms for these 8 clusters are microtubule organizing center (p-value: 6.235E-9), nucleotide metabolic process (p-value: 1.320E-4), cytosolic part (p-value: 1.4E-45), spore wall assembly (sensu Fungi) (p-value: 8.976E-25), glycolysis (p-value: 2.833E-14), M phase of meiotic cell cycle (p-value: 1.714E-25), ribosome biogenesis and assembly (p-value: 1.4E-45) and organic acid metabolic process (p-value: 1.858E-4), respectively. It is evident from the table that all the clusters produced by the VSA-ANN clustering scheme are significantly enriched with some GO categories, since all the p-values are less than 0.01 (1% significance level). This establishes that the proposed VSA-ANN clustering scheme is able to produce biologically relevant and functionally enriched clusters.
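Eqn. 15 is the survival function of a hypergeometric distribution, so it can be computed directly with SciPy; a small sketch (the counts in the example call are illustrative placeholders, not the paper's values):

    from scipy.stats import hypergeom

    def go_enrichment_pvalue(k, n, f, g):
        """Eqn. 15: probability of observing k or more genes from a
        GO category of size f (genome size g) in a cluster of size n."""
        # sf(k - 1) = P(X >= k) for X ~ Hypergeom(M=g, n=f, N=n)
        return hypergeom.sf(k - 1, g, f, n)

    # Illustrative numbers only:
    print(go_enrichment_pvalue(k=12, n=80, f=40, g=6118))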


12 Discussion and Conclusions

In this article, a clustering algorithm (VSA-ANN) for clustering microarray gene expression data, which combines a VSA based fuzzy clustering method with a probabilistic ANN classifier, has been proposed. The number of clusters in a gene expression data set is evolved automatically by the proposed technique. The results demonstrate how an improvement in clustering performance is obtained by refining the clustering solution produced by VSA using the ANN classifier. The performance of the proposed clustering method has been compared with the average linkage, SOM, VSA, IFCM and recently proposed SiMM-TS clustering algorithms to show its effectiveness on three real-life gene expression data sets. It has been found that the VSA-ANN clustering scheme significantly outperforms all the other clustering methods. Moreover, VSA performs reasonably well in determining the appropriate number of clusters of the gene expression data sets. The clustering solutions are evaluated both quantitatively (i.e., using the silhouette index) and using gene expression visualization tools. Statistical tests have also been conducted to establish the statistical significance of the results produced by the proposed technique. Finally, a biological significance test has been carried out to establish the biological relevance of the clusters produced by the VSA-ANN clustering method as compared to the other algorithms.

Acknowledgement

References

[1] R. Sharan, M.-K. Adi, and R. Shamir, "CLICK and EXPANDER: a system for clustering and visualizing gene expression data," Bioinformatics, vol. 19, pp. 1787-1799, 2003.

[2] A. A. Alizadeh, M. B. Eisen, R. Davis, C. Ma, I. Lossos, A. Rosenwald, J. Boldrick, R. Warnke, R. Levy, W. Wilson, M. Grever, J. Byrd, D. Botstein, P. O. Brown, and L. M. Straudt, "Distinct types of diffuse large B-cell lymphomas identified by gene expression profiling," Nature, vol. 403, pp. 503-511, 2000.

[3] S. Chu, J. DeRisi, M. Eisen, J. Mulholland, D. Botstein, P. O. Brown, and I. Herskowitz, "The transcriptional program of sporulation in budding yeast," Science, vol. 282, pp. 699-705, October 1998.

[4] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein, "Cluster analysis and display of genome-wide expression patterns," in Proc. Natl. Acad. Sci. USA, pp. 14863-14868, 1998.

[5] S. Bandyopadhyay, U. Maulik, and J. T. Wang, Analysis of Biological Data: A Soft Computing Approach. World Scientific, 2007.

[6] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Englewood Cliffs, NJ: Prentice-Hall, 1988.

[7] J. T. Tou and R. C. Gonzalez, Pattern Recognition Principles. Reading: Addison-Wesley, 1974.

[8] J. A. Hartigan, Clustering Algorithms. Wiley, 1975.

[9] R. J. Cho, M. J. Campbell, E. A. Winzeler, L. Steinmetz, A. Conway, L. Wodicka, T. G. Wolfsberg, et al., "A genome-wide transcriptional analysis of the mitotic cell cycle," Mol. Cell, vol. 2, pp. 65-73, 1998.

[10] R. Herwig, A. Poustka, C. Meuller, H. Lehrach, and J. O'Brien, "Large-scale clustering of cDNA fingerprinting data," Genome Research, vol. 9, no. 11, pp. 1093-1105, 1999.

[11] D. Dembele and P. Kastner, "Fuzzy c-means method for clustering microarray data," Bioinformatics, vol. 19, no. 8, pp. 973-980, 2003.

[12] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. Lander, and T. Golub, "Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation," in Proc. Natl. Acad. Sci. USA, vol. 96, pp. 2907-2912, 1999.

[13] E. Hartuv and R. Shamir, "A clustering algorithm based on graph connectivity," Information Processing Letters, vol. 76, pp. 175-181, 2000.

[14] U. Alon, N. Barkai, D. A. Notterman, et al., "Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays," in Proc. Natl. Acad. Sci. USA, vol. 96, pp. 6745-6750, 1999.

[15] A. V. Lukashin and R. Fuchs, "Analysis of temporal gene expression profiles: clustering by simulated annealing and determining the optimal number of clusters," Bioinformatics, vol. 17, no. 5, pp. 405-414, 2001.

[16] U. Maulik and S. Bandyopadhyay, "Genetic algorithm based clustering technique," Pattern Recognition, vol. 33, pp. 1455-1465, 2000.

[17] A. Mukhopadhyay, U. Maulik, and S. Bandyopadhyay, Multiobjective Evolutionary Approach to Fuzzy Clustering of Microarray Data, ch. 13, pp. 303-326. World Scientific, 2007.

[18] S. Bandyopadhyay, A. Mukhopadhyay, and U. Maulik, "An improved algorithm for clustering gene expression data," Bioinformatics, vol. 23, no. 21, pp. 2859-2865, 2007.

[19] C. Bishop, Neural Networks for Pattern Recognition. Oxford University Press, 1996.

[20] D. J. C. MacKay, "The evidence framework applied to classification networks," Neural Computation, vol. 4, no. 5, pp. 720-736, 1992.

[21] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. New York: Plenum, 1981.

[22] H. W. Mewes, K. Albermann, K. Heumann, S. Liebl, and F. Pfeiffer, "MIPS: A database for protein sequences, homology data and yeast genome information," Nucleic Acids Research, vol. 25, pp. 28-30, 1997.

[23] X. L. Xie and G. Beni, "A validity measure for fuzzy clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, pp. 841-847, 1991.

[24] S. Bandyopadhyay, U. Maulik, and A. Mukhopadhyay, "Multiobjective genetic clustering for pixel classification in remote sensing imagery," IEEE Transactions on Geoscience and Remote Sensing, vol. 45, no. 5, pp. 1506-1511, 2007.

[25] E. Domany, "Cluster analysis of gene expression data," J. Statistical Physics, vol. 110, no. 3-6, pp. 1117-1139, 2003.

[26] W. Shannon, R. Culverhouse, and J. Duncan, "Analyzing microarray data using cluster analysis," Pharmacogenomics, vol. 4, no. 1, pp. 41-51, 2003.

[27] S. Y. Kim, J. W. Lee, and J. S. Bae, "Effect of data normalization on fuzzy clustering of DNA microarray data," BMC Bioinformatics, vol. 7, no. 134, 2006.

[28] S. Kirkpatrick, C. Gelatt, and M. Vecchi, "Optimization by simulated annealing," Science, vol. 220, pp. 671-680, 1983.

[29] P. J. M. van Laarhoven and E. H. L. Aarts, Simulated Annealing: Theory and Applications. Kluwer Academic Publishers, 1987.

[30] S. Geman and D. Geman, "Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 6, no. 6, pp. 721-741, 1984.

[31] R. Caves, S. Quegan, and R. White, "Quantitative comparison of the performance of SAR segmentation algorithms," IEEE Trans. on Image Proc., vol. 7, no. 11, pp. 1534-1546, 1998.

[32] U. Maulik, S. Bandyopadhyay, and J. Trinder, "SAFE: An efficient feature extraction technique," Journal of Knowledge and Information Systems, vol. 3, pp. 374-387, 2001.

[33] S. Bandyopadhyay, U. Maulik, and M. K. Pakhira, "Clustering using simulated annealing with probabilistic redistribution," Int. J. Pattern Recognition and Artificial Intelligence, vol. 15, no. 2, pp. 269-285, 2001.

[34] L. N. Andersen, J. Larsen, L. K. Hansen, and M. Hintz-Madsen, "Adaptive regularization of neural classifiers," in Proc. IEEE Workshop on Neural Networks for Signal Processing VII, New York, USA, pp. 24-33, 1997.

[35] S. Sigurdsson, J. Larsen, and L. Hansen, "Outlier estimation and detection: Application to skin lesion classification," in Proc. Int. Conf. on Acoustics, Speech and Signal Processing, 2002.

[36] V. R. Iyer, M. B. Eisen, D. T. Ross, G. Schuler, T. Moore, J. Lee, J. M. Trent, L. M. Staudt, J. J. Hudson, M. S. Boguski, D. Lashkari, D. Shalon, D. Botstein, and P. O. Brown, "The transcriptional program in the response of human fibroblasts to serum," Science, vol. 283, pp. 83-87, 1999.

[37] X. Wen, S. Fuhrman, G. S. Michaels, D. B. Carr, S. Smith, J. L. Barker, and R. Somogyi, "Large-scale temporal gene expression mapping of central nervous system development," in Proc. Natl. Acad. Sci. USA, vol. 95, pp. 334-339, 1998.

[38] P. Rousseeuw, "Silhouettes: a graphical aid to the interpretation and validation of cluster analysis," J. Comp. App. Math., vol. 20, pp. 53-65, 1987.

[39] L. Groll and J. Jakel, "A new convergence proof of fuzzy c-means," IEEE Transactions on Fuzzy Systems, vol. 13, no. 5, pp. 717-720, 2005.

[40] M. Hollander and D. A. Wolfe, Nonparametric Statistical Methods, 2nd ed., 1999.

[41] S. Tavazoie, J. Hughes, M. Campbell, R. Cho, and G. Church, "Systematic determination of genetic network architecture," Nature Genet., vol. 22, pp. 281-285, 1999.

Figures

Figure 1: Gene Expression Matrix


Figure 2: The VSA based fuzzy clustering algorithm


Figure 3: 3-layer feed-forward ANN classifier model


Figure 4: (a) A two-dimensional artificial data set having 5 clusters, (b) Training data set, (c) Test data set


Figure 5: Yeast Sporulation data clustered using the VSA-ANN clustering method. (a) Eisen plot, (b) Cluster profile plots


Figure 6: Human Fibroblasts Serum data clustered using VSA-ANN clustering method. (a) Eisen plot, (b) Cluster profile plots



Figure 7: Rat CNS data clustered using VSA-ANN clustering method. (a) Eisen plot, (b) Cluster profile plots


Figure 8: Plot of functional enrichment significance score (p-value) for the significant clusters of Yeast Sporulation data as obtained by different algorithms. The p-values have been log-transformed (base 10) for better readability. The clusters are sorted according to significance level.


Tables

Table 1: Average values of s(C) index scores over 50 consecutive runs of various algorithms for different data sets

                     Sporulation        Serum              Rat CNS
Algorithm            K     s(C)         K     s(C)         K     s(C)
VSA-ANN              8     0.5103       6     0.4543       6     0.5318
IFCM                 7     0.3719       8     0.2933       5     0.4135
VSA                  8     0.4872       6     0.3571       6     0.4662
Average linkage      8     0.4852       6     0.3092       6     0.3601
SOM                  8     0.3812       6     0.3287       5     0.4121
SiMM-TS              8     0.4982       6     0.4289       6     0.4423
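The s(C) values reported here are silhouette index scores: the silhouette width of a point contrasts its average distance to the members of its own cluster with its average distance to the nearest other cluster, and s(C) averages these widths over all points. The following minimal Python sketch shows how such a score can be computed, assuming Euclidean distance and scikit-learn; the data matrix and the clustering call are illustrative stand-ins, not the experimental pipeline used here.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Stand-in for a genes x time-points expression matrix (illustrative only).
X = np.random.rand(100, 7)

# Any hard clustering works here; K-means is used purely for illustration.
labels = KMeans(n_clusters=6, n_init=10).fit_predict(X)

# Mean silhouette width over all points, in [-1, 1]; larger is better.
s_C = silhouette_score(X, labels, metric="euclidean")
print(f"s(C) = {s_C:.4f}")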

Table 2: The change in s(C) scores of the test points and the percentage of test points that changed their class labels after application of the ANN in the second stage of VSA-ANN, for different data sets

                         Size of          Size of       s(C) score for test set       % of test points
Data Set      Size       training set     test set      Before ANN     After ANN      changed class label
Sporulation   690        474              216           0.1703         0.2493         41.20%
Serum         517        330              187           0.1063         0.1539         33.16%
Rat CNS       112        63               49            0.1142         0.2083         20.41%
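Table 2 quantifies the effect of the second stage: points whose maximum fuzzy membership exceeds a threshold form the training set, the ANN is trained on their defuzzified labels, and the remaining (test) points are relabeled by the trained network. The sketch below illustrates this refinement step in Python; the MLPClassifier, the 0.7 threshold and the hidden-layer size are illustrative assumptions standing in for the paper's ANN configuration.

import numpy as np
from sklearn.neural_network import MLPClassifier

def refine_with_ann(X, U, threshold=0.7):
    """X: n x d data matrix; U: K x n fuzzy membership matrix from stage 1 (VSA)."""
    hard_labels = U.argmax(axis=0)          # defuzzify: each point goes to its max-membership cluster
    confident = U.max(axis=0) >= threshold  # high-membership points form the training set
    ann = MLPClassifier(hidden_layer_sizes=(20,), max_iter=1000)
    ann.fit(X[confident], hard_labels[confident])
    refined = hard_labels.copy()
    refined[~confident] = ann.predict(X[~confident])  # stage 2: relabel the test points
    return refined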

Table 3: Median values of s(C) index scores over 50 consecutive runs of various algorithms for different data sets

Algorithm            Sporulation    Serum      Rat CNS
VSA-ANN              0.5211         0.4591     0.5307
IFCM                 0.3982         0.3013     0.4215
VSA                  0.4891         0.3525     0.4692
Average linkage      0.4852         0.3092     0.3601
SOM                  0.3793         0.3278     0.4122
SiMM-TS              0.4982         0.4233     0.4468


Table 4: P-values produced by Wilcoxon's rank sum test comparing the medians of VSA-ANN with those of the other algorithms for different data sets

Data Sets     IFCM         VSA          Average Linkage    SOM          SiMM-TS
Sporulation   2.33E-07     4.87E-06     3.56E-05           3.26E-08     3.22E-03
Serum         3.88E-08     2.42E-10     1.39E-16           3.72E-13     7.19E-03
Rat CNS       1.06E-07     6.82E-07     4.42E-17           4.71E-10     1.36E-04
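Each entry in Table 4 compares the distribution of s(C) scores of VSA-ANN against that of a competing algorithm over the repeated runs. A minimal sketch with SciPy's rank-sum test follows; the score arrays are placeholders rather than the experimental values.

from scipy.stats import ranksums

# Placeholder s(C) scores from repeated runs of two algorithms (illustrative only).
sC_vsa_ann = [0.52, 0.51, 0.53, 0.50, 0.52]
sC_other = [0.40, 0.39, 0.42, 0.41, 0.38]

# Wilcoxon rank-sum test; a small p-value indicates that the two
# score distributions differ significantly.
stat, p_value = ranksums(sC_vsa_ann, sC_other)
print(f"rank-sum statistic = {stat:.3f}, p-value = {p_value:.3e}")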

Table 5: The functional enrichment significance score (p-value) for the significant clusters of Yeast Sporulation data as obtained by different algorithms. The clusters are sorted according to significance level

Clusters    VSA-ANN      IFCM         VSA          Avg link     SOM          SiMM-TS
1           1.400E-45    1.400E-45    1.400E-45    1.400E-45    8.124E-45    1.400E-45
2           1.400E-45    8.823E-41    1.325E-42    7.284E-32    1.332E-28    8.527E-44
3           1.714E-25    1.373E-22    7.652E-25    8.811E-11    7.362E-25    1.102E-24
4           8.976E-25    1.263E-08    1.145E-23    1.282E-08    1.635E-21    1.095E-23
5           2.833E-14    1.211E-08    1.223E-12    1.613E-04    1.434E-07    1.057E-12
6           6.235E-09    1.761E-06    1.032E-06    -            1.710E-06    7.093E-08
7           1.320E-04    -            1.445E-04    -            -            1.664E-04
8           1.858E-04    -            1.823E-03    -            -            1.208E-03
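Enrichment scores of this kind are typically obtained from a one-sided hypergeometric test: given a cluster of n genes of which k carry a GO annotation shared by K of the N background genes, the p-value is the probability of observing at least k annotated genes in the cluster by chance. Whether the tool used here follows exactly this formulation is an assumption, and the counts in the sketch below are illustrative.

from scipy.stats import hypergeom

N = 6000  # background genes (e.g., the whole genome)
K = 200   # background genes annotated with the GO term
n = 100   # genes in the cluster
k = 30    # cluster genes annotated with the GO term

# P(X >= k) for X ~ Hypergeometric(N, K, n): survival function at k - 1.
p_value = hypergeom.sf(k - 1, N, K, n)
print(f"enrichment p-value = {p_value:.3e}")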


Table 6: The three most significant GO terms and the corresponding p-values for each of the 8 clusters of Yeast data as found by the VSA-ANN clustering technique

Clusters     Significant GO term                                                   p-value
Cluster 1    cytosolic part - GO:0044445                                           1.400E-45
             cytosol - GO:0005829                                                  1.400E-45
             ribosomal large subunit assembly and maintenance - GO:0000027         7.418E-08
Cluster 2    ribosome biogenesis and assembly - GO:0042254                         1.400E-45
             intracellular non-membrane-bound organelle - GO:0043232               1.386E-23
             organelle lumen - GO:0043233                                          9.460E-21
Cluster 3    M phase of meiotic cell cycle - GO:0051327                            1.714E-25
             M phase - GO:0000279                                                  1.287E-23
             meiosis I - GO:0007127                                                5.101E-22
Cluster 4    spore wall assembly (sensu Fungi) - GO:0030476                        8.976E-25
             sporulation - GO:0030435                                              2.024E-24
             cell division - GO:0051301                                            7.923E-16
Cluster 5    glycolysis - GO:0006096                                               2.833E-14
             cytosol - GO:0005829                                                  3.138E-04
             cellular biosynthetic process - GO:0044249                            5.380E-04
Cluster 6    microtubule organizing center - GO:0005815                            6.235E-09
             spore wall assembly (sensu Fungi) - GO:0030476                        1.016E-07
             microtubule cytoskeleton organization and biogenesis - GO:0000226     1.672E-07
Cluster 7    nucleotide metabolic process - GO:0009117                             1.320E-04
             glucose catabolic process - GO:0006007                                2.856E-04
             external encapsulating structure - GO:0030312                         3.392E-04
Cluster 8    organic acid metabolic process - GO:0006082                           1.858E-04
             amino acid and derivative metabolic process - GO:0006519              4.354E-04
             external encapsulating structure - GO:0030312                         6.701E-04

