
IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, VOL. 13, NO. 6, NOVEMBER 2009


Finding Multiple Coherent Biclusters in Microarray Data Using Variable String Length Multiobjective Genetic Algorithm Ujjwal Maulik, Senior Member, IEEE, Anirban Mukhopadhyay, and Sanghamitra Bandyopadhyay, Senior Member, IEEE

Abstract—Microarray technology enables the simultaneous monitoring of the expression pattern of a huge number of genes across different experimental conditions. Biclustering in microarray data is an important technique that discovers a group of genes that are coregulated in a subset of conditions. Biclustering algorithms are required to identify coherent and nontrivial biclusters, i.e., the biclusters should have low mean squared residue and high row variance. A multiobjective genetic biclustering technique is proposed here that optimizes these objectives simultaneously. A novel encoding scheme that uses variable chromosome length is developed. Moreover, a new quantitative measure to evaluate the goodness of the biclusters is proposed. The performance of the proposed algorithm has been evaluated on both simulated and real-life gene expression datasets, and compared with some other well-known biclustering techniques.

Index Terms—Biclustering, mean squared residue (MSR), multiobjective genetic algorithm (GA), row variance, variable string length.

I. INTRODUCTION

THE ADVANCEMENT of microarray technology has facilitated the study of the expression levels of a large number of genes across different experimental conditions simultaneously. Microarray technology has applications in the areas of medical diagnosis, biomedicine, gene expression profiling, etc. Clustering [1], an important microarray analysis tool, is used to identify the sets of genes with similar expression profiles. In some early works, visual analysis was successfully done for grouping genes into functionally relevant classes in Yeast cell cycle [2], [3] and human large B-cell lymphoma [4] datasets. However, as these methods were very subjective, standard clustering methods, such as K-means [5], fuzzy C-means [6], hierarchical methods [7], self-organizing maps (SOMs) [8], graph theoretic approach [9], simulated-annealing-based approach [10], and genetic algorithm (GA) based clustering methods [11], have been

Manuscript received October 19, 2007; revised March 7, 2008. Current version published November 4, 2009. The work of S. Bandyopadhyay was supported by the Swarnajayanti Fellowship Scheme of the Department of Science and Technology, Government of India, under Grant DST/SJF/ET-02/2006-07. U. Maulik is with the Department of Computer Science and Engineering, Jadavpur University, Kolkata 700032, India (e-mail: [email protected]). A. Mukhopadhyay is with the Department of Computer Science and Engineering, University of Kalyani, Kalyani 741235, India (e-mail: [email protected]). S. Bandyopadhyay is with the Machine Intelligence Unit, Indian Statistical Institute, Kolkata 700108, India (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TITB.2009.2017527

utilized for clustering microarray data. Clustering techniques, which aim to find the clusters of genes over all experimental conditions, may fail to discover the genes having similar expression pattern over a subset of conditions. In contrast, biclustering algorithms discover a subset of genes that are coregulated for a subset of experimental conditions. Hence, they provide a better reflection of the biological reality [12], [13].

Biclustering was first introduced in [14] under the name of direct clustering. One of the earlier works on biclustering can be found in [15], where the mean squared residue (MSR) measure was used to compute the coherence among a group of genes, and the algorithm followed a greedy search technique guided by a heuristic. A coupled two-way clustering (CTWC) method has been proposed in [16] that uses hierarchical clustering in both dimensions. Other well-known biclustering techniques are random-walk-based biclustering (RWB) [17], genetic biclustering [18], a bipartite-graph-based model called statistical algorithmic method for bicluster analysis (SAMBA) [19], simulated-annealing-based biclustering [20], order preserving submatrix algorithm (OPSM) [21], iterative signature algorithm (ISA) [22], xMotif [23], BiVisu [24], etc.

Many real-world situations require simultaneous optimization of several objectives to solve a certain problem. In multiobjective optimization (MOO) [25]–[27], search is performed over a number of, often conflicting, objective functions. This paper presents a multiobjective GA-based (MOGAB) biclustering algorithm employing a novel variable string length encoding scheme, where each string encodes a number of possible biclusters. Nondominated sorting genetic algorithm-II (NSGA-II) [25], a popular multiobjective algorithm, is used as the underlying MOO strategy. Two objective functions, namely, MSR and row variance of the biclusters, are optimized simultaneously. Moreover, a new quantitative measure to evaluate the goodness of the biclusters is proposed. The performance of the proposed algorithm has been compared with some well-known biclustering algorithms for both simulated and real microarray datasets. A biological significance test has also been carried out.

II. BICLUSTERING MODEL

A microarray dataset is considered as a G × C matrix M representing the expression levels of a set of G genes $G = \{I_1, I_2, \ldots, I_G\}$ over a set of C conditions $C = \{J_1, J_2, \ldots, J_C\}$. Each element $m_{ij}$ of matrix M represents the expression level of the ith gene at the jth condition, where

1089-7771/$26.00 © 2009 IEEE


i ∈ G and j ∈ C. A bicluster is a submatrix B = (I, J) of matrix M, where I ⊆ G and J ⊆ C. The volume vol(I, J) of a bicluster B = (I, J) is the total number of elements in the bicluster, i.e., vol(I, J) = |I| × |J|. The MSR of a bicluster B = (I, J), denoted MSR(I, J), is defined as

$$\mathrm{MSR}(I, J) = \frac{1}{\mathrm{vol}(I, J)} \sum_{i \in I,\, j \in J} \left(a_{ij} - a_{iJ} - a_{Ij} + a_{IJ}\right)^2 \qquad (1)$$

where $a_{ij}$ denotes the element $m_{ij}$ of M, $a_{iJ} = \frac{1}{|J|} \sum_{j \in J} a_{ij}$, $a_{Ij} = \frac{1}{|I|} \sum_{i \in I} a_{ij}$, and $a_{IJ} = \frac{1}{|I| \times |J|} \sum_{i \in I,\, j \in J} a_{ij}$, i.e., $a_{iJ}$, $a_{Ij}$, and $a_{IJ}$ denote the ith row's mean, the jth column's mean, and the mean of the elements in the bicluster, respectively. The MSR score of a bicluster represents the level of coherence among the elements of the bicluster. A lower residue score means larger coherence, and thus, better quality of the bicluster. For a given threshold value δ ≥ 0, a bicluster B(I, J) is called a δ-bicluster if MSR(I, J) < δ.

Fig. 1. Six MOGAB biclusters of SD_200_100 data. For each bicluster, #genes, #conditions, MSR, var, and BI are given. (a) 38, 11, 2.1003, 3.7407, and 0.4430. (b) 52, 4, 1.0224, 2.5519, and 0.2878. (c) 42, 15, 1.6321, 3.5932, and 0.3553. (d) 59, 24, 1.5124, 2.5529, and 0.4257. (e) 41, 25, 1.3584, 2.4389, and 0.3950. (f) 58, 21, 1.3206, 2.4333, and 0.3846.

TABLE I PERCENTAGE MATCH SCORES FOR SD_200_100 DATASET

The row variance var(I, J) of a bicluster B = (I, J) is defined as

$$\mathrm{var}(I, J) = \frac{1}{\mathrm{vol}(I, J)} \sum_{i \in I,\, j \in J} \left(a_{ij} - a_{iJ}\right)^2. \qquad (2)$$
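The definitions in (1) and (2) can be illustrated with a minimal NumPy sketch. The function names `msr` and `row_variance` and the 0-based index lists are assumptions made for illustration; they are not taken from the authors' MATLAB implementation.

```python
import numpy as np

def msr(A, rows, cols):
    """Mean squared residue of the bicluster (rows, cols) of A, as in (1)."""
    B = A[np.ix_(rows, cols)]                  # submatrix of the bicluster
    row_mean = B.mean(axis=1, keepdims=True)   # a_iJ
    col_mean = B.mean(axis=0, keepdims=True)   # a_Ij
    all_mean = B.mean()                        # a_IJ
    residue = B - row_mean - col_mean + all_mean
    return float((residue ** 2).mean())        # mean divides by vol(I, J)

def row_variance(A, rows, cols):
    """Row variance of the bicluster (rows, cols) of A, as in (2)."""
    B = A[np.ix_(rows, cols)]
    row_mean = B.mean(axis=1, keepdims=True)   # a_iJ
    return float(((B - row_mean) ** 2).mean())

# Toy data: a purely additive pattern (row offset + column offset).
additive = np.array([[1.0, 2.0, 3.0],
                     [2.0, 3.0, 4.0],
                     [4.0, 5.0, 6.0]])
constant = np.full((3, 3), 7.0)
idx = [0, 1, 2]
msr_additive = msr(additive, idx, idx)
var_additive = row_variance(additive, idx, idx)
msr_constant = msr(constant, idx, idx)
var_constant = row_variance(constant, idx, idx)
```

In this toy case both matrices have zero MSR, but only the additive one has nonzero row variance, which is why MSR alone cannot separate a coherent pattern from a trivial constant bicluster.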

The objective is to find biclusters with high row variance and with MSR below a threshold δ. This is required to avoid trivial or constant-value biclusters.

III. PROPOSED ALGORITHM

The proposed NSGA-II-based biclustering (MOGAB) technique is discussed here.

A. String Representation and Initial Population

Each string has two parts: one for clustering the genes, and another for clustering the conditions. The first M positions represent the M gene cluster centers, and the remaining N positions represent the N condition cluster centers. Thus, a string looks like $\{gc_1\ gc_2 \ldots gc_M\ cc_1\ cc_2 \ldots cc_N\}$, where each $gc_i$, i = 1, ..., M, represents the index of a gene acting as a cluster center of a set of genes, and each $cc_j$, j = 1, ..., N, represents the index of a condition acting as a cluster center of a set of conditions.

For a dataset having n points, it is usual to assume that the dataset may contain at most $\sqrt{n}$ clusters [11] in the absence of any domain-specific information. Taking this into account, for a G × C microarray, the maximum number of gene clusters and the maximum number of condition clusters are $\sqrt{G}$ and $\sqrt{C}$, respectively. Therefore, the value of M varies from 2 to $\sqrt{G}$, and the value of N varies from 2 to $\sqrt{C}$. Hence, the length of the strings will vary from 4 to $\sqrt{G} + \sqrt{C}$. However, if some domain-specific information is available, it is always better to incorporate it to estimate the upper bound of the number of clusters. The first M positions can have values from {1, 2, ..., G} and the next N positions can have values from {1, 2, ..., C}, i.e., the gene and condition cluster centers are represented by indexes of the genes and conditions, respectively.

Fig. 2. Six MOGAB biclusters of Yeast data. For each bicluster, #genes, #conditions, MSR, var, and BI are given. (a) 71, 7, 165.6892, 766.8410, and 0.2158. (b) 11, 8, 223.3569, 996.4375, and 0.2239. (c) 21, 8, 125.9566, 506.4115, and 0.2482. (d) 8, 7, 222.3437, 669.3858, and 0.2492. (e) 36, 10, 222.7392, 892.7143, and 0.3317. (f) 27, 8, 269.7345, 645.2176, and 0.4174.

Fig. 3. Plot of BI scores of 100 best biclusters for Yeast data.

A string that encodes A gene clusters and B condition clusters represents a set of A × B biclusters, taking each pair of gene and condition clusters. Each pair $\langle gc_i, cc_j \rangle$, i = 1, ..., M and j = 1, ..., N, represents a bicluster that consists of all genes of the gene cluster centered at gene $gc_i$ and all conditions of the condition cluster centered at condition $cc_j$. The initial population contains randomly generated individuals. Each gene or condition is equally probable to become the center for a gene or a condition cluster, respectively.
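The variable-length encoding described above can be sketched as follows. This is an illustrative sketch only: the names `random_string` and `encoded_biclusters` are hypothetical, the upper bounds are taken as the integer part of the square roots (the text does not state how they are rounded), and centers are drawn without repetition so that the initial strings are valid, which is an assumption beyond "randomly generated".

```python
import math
import random

def random_string(G, C, rng):
    """Draw one individual for a G x C dataset.

    Returns (gene_centers, cond_centers) with 1-based indexes, where
    2 <= M <= floor(sqrt(G)) and 2 <= N <= floor(sqrt(C)).
    """
    max_m = max(2, math.isqrt(G))   # assumed rounding: floor of sqrt(G)
    max_n = max(2, math.isqrt(C))   # assumed rounding: floor of sqrt(C)
    m = rng.randint(2, max_m)       # number of gene clusters M
    n = rng.randint(2, max_n)       # number of condition clusters N
    gene_centers = rng.sample(range(1, G + 1), m)   # distinct gene indexes
    cond_centers = rng.sample(range(1, C + 1), n)   # distinct condition indexes
    return gene_centers, cond_centers

def encoded_biclusters(gene_centers, cond_centers):
    """List the M x N (gene center, condition center) pairs of a string."""
    return [(g, c) for g in gene_centers for c in cond_centers]

rng = random.Random(0)
genes, conds = random_string(200, 100, rng)   # sizes of SD_200_100
pairs = encoded_biclusters(genes, conds)
```

For the 200 × 100 simulated dataset this gives strings with between 2 and 14 gene centers and between 2 and 10 condition centers under the assumed rounding.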

MAULIK et al.: FINDING MULTIPLE COHERENT BICLUSTERS IN MICROARRAY DATA


TABLE II COMPARISON OF THE BICLUSTERS OF DIFFERENT ALGORITHMS FOR Yeast DATA

B. Fitness Computation

Given a valid string (i.e., the string contains no repetition of gene or condition indexes, and $2 \le M \le \sqrt{G}$ and $2 \le N \le \sqrt{C}$), first all the gene and condition clusters encoded in it are extracted, and each gene and condition is assigned to its least-distant cluster center. Subsequently, each cluster center (for both genes and conditions) is updated by selecting the most centrally located point, i.e., the point from which the sum of the distances to the other points of that cluster is minimum. The strings are updated accordingly. As most distance functions are known to perform equally well on normalized data [28], any distance function such as Euclidean, Pearson correlation, Manhattan, etc., can be used here. In this paper, the Euclidean distance measure has been adopted. Next, we find all the δ-biclusters denoted by some ⟨gene cluster, condition cluster⟩ pair encoded in the updated string.

Fig. 4. Six MOGAB biclusters of Human data. For each bicluster, #genes, #conditions, MSR, var, and BI are given. (a) 19, 59, 996.8422, 2315.0660, and 0.4304. (b) 8, 50, 1005.4557, 1976.0596, and 0.5086. (c) 12, 50, 995.8271, 1865.1578, and 0.5336. (d) 10, 59, 1006.7489, 1825.6937, and 0.5511. (e) 10, 35, 1054.5312, 1895.1600, and 0.5561. (f) 23, 36, 1194.3213, 1917.9915, and 0.6224.

The two objective functions are the MSR (1) and the row variance var (2). MSR is to be minimized and var is to be maximized to obtain good-quality biclusters. As the proposed multiobjective technique is posed as a minimization algorithm, the two objective functions ($f_1$ and $f_2$) of a bicluster B(I, J) are taken as follows:

$$f_1(I, J) = \frac{\mathrm{MSR}(I, J)}{\delta} \quad \text{and} \quad f_2(I, J) = \frac{1}{1 + \mathrm{var}(I, J)}.$$
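The assignment step, the center update, and the two objectives above can be sketched in NumPy as follows. The function names are hypothetical, indexes are 0-based here (the strings in the text are 1-based), and the handling of an empty cluster is an assumption, since the text does not discuss that case. The same two clustering helpers would apply to genes (rows of the data matrix) and to conditions (rows of its transpose).

```python
import numpy as np

def assign_to_centers(X, centers):
    """Label each row of X with the position of its nearest center row
    (Euclidean distance); `centers` holds row indexes into X."""
    d = np.linalg.norm(X[:, None, :] - X[centers][None, :, :], axis=2)
    return d.argmin(axis=1)

def update_centers(X, centers, labels):
    """Replace each center by the most centrally located member of its
    cluster: the point with the smallest summed distance to the others."""
    new_centers = []
    for k, old in enumerate(centers):
        members = np.flatnonzero(labels == k)
        if members.size == 0:          # assumption: keep the old center
            new_centers.append(int(old))
            continue
        sub = X[members]
        d = np.linalg.norm(sub[:, None, :] - sub[None, :, :], axis=2)
        new_centers.append(int(members[d.sum(axis=1).argmin()]))
    return new_centers

def objectives(A, rows, cols, delta):
    """Return (f1, f2) = (MSR / delta, 1 / (1 + var)) for one bicluster."""
    B = A[np.ix_(rows, cols)]
    row_mean = B.mean(axis=1, keepdims=True)
    col_mean = B.mean(axis=0, keepdims=True)
    msr = ((B - row_mean - col_mean + B.mean()) ** 2).mean()
    var = ((B - row_mean) ** 2).mean()
    return float(msr / delta), float(1.0 / (1.0 + var))

# Toy points forming two well-separated groups.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0], [1.0, 0.0]])
labels = assign_to_centers(X, [1, 3])
centers = update_centers(X, [1, 3], labels)

A = np.array([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [4.0, 5.0, 6.0]])
f1, f2 = objectives(A, [0, 1, 2], [0, 1, 2], delta=2.5)
```

In the toy example the first center moves from point 1 to point 0, which has the smallest summed distance to the other members of its group.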

The denominator of $f_2$ is chosen so as to avoid division by zero when the row variance is 0. Both $f_1$ and $f_2$ are to be minimized to obtain highly coherent yet “interesting” biclusters. For each encoded δ-bicluster, the fitness vector f = {$f_1$, $f_2$} is computed. The fitness vector of a string is then the mean of the fitness vectors of all δ-biclusters encoded in it. Due to the randomness of the genetic operators, invalid strings may arise at any point of the algorithm. The invalid strings are given the fitness vector f = {X, X}, where X is an arbitrarily large number. Thus, the invalid strings are automatically eliminated from the competition in subsequent generations. From the nondominated solutions produced in the final population, all the δ-biclusters are extracted from each nondominated string to produce the final biclusters.

C. Genetic Operators

1) Selection: MOGAB uses the crowded binary tournament selection method as used in NSGA-II [25].

2) Crossover: Single-point crossover is used. The gene and condition parts of the string undergo crossover separately. For the gene part, two crossover points are chosen on the two parent chromosomes, respectively, and the portions of the chromosomes beyond these crossover points are exchanged. Crossover is performed for the condition parts similarly. If crossover generates invalid strings, i.e., strings with repeated gene or condition indexes, or with a number of gene or condition clusters less than 2 or greater than the maximum number of clusters, they are given the fitness vector f = {X, X}, where X is an arbitrarily large number, so that they drop out of the competition in subsequent generations. The crossover probability $p_c$ is 0.8.

Fig. 5. Plot of BI scores of 100 best biclusters for Human data.

3) Mutation: Suppose G and C are the total number of genes and conditions, respectively. The mutation is done as follows. A random position is chosen from the first M positions, and its value is replaced by an index randomly chosen from {1, 2, ..., G}. Similarly, to mutate the condition portion of the string, a random position is selected from the next N positions, and its value is substituted by a randomly selected index from {1, 2, ..., C}. A string undergoes mutation with mutation probability $p_m$ = 0.1.

Elitism has been incorporated in MOGAB to track the nondominated Pareto-optimal solutions found so far. MOGAB has been executed for 100 generations with a population size of 50.

TABLE III COMPARISON OF THE BICLUSTERS OF DIFFERENT ALGORITHMS FOR Human DATA

IV. RESULTS AND DISCUSSION

The performance of a MATLAB implementation of MOGAB is compared with that of Cheng and Church (CC) [15], RWB [17], OPSM [21], ISA [22], and BiVisu [24]. RWB has been implemented in MATLAB as per [17]. For CC, OPSM, and ISA, the implementation in BicAT [29] is used, and for BiVisu, the original MATLAB code is used. Comparison is also made with a single-objective genetic biclustering technique that uses the same encoding as MOGAB and minimizes $f_1 \times f_2$. A simulated dataset SD_200_100 and three real-life datasets, viz., Yeast, Human, and Leukemia, are used for experiments on an Intel Core 2 Duo 2.0-GHz, Windows XP machine.
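For reference, the crossover, mutation, and validity rules of Section III-C, together with the string-level fitness of Section III-B, can be sketched in Python as below. All names are hypothetical, the cut points are assumed to leave at least one index on each side, the bounds use the integer part of the square roots, and assigning the penalty to a valid string with no δ-bicluster is an assumption, since the text does not cover that case. Selection by crowded binary tournament is left to an NSGA-II implementation and is not shown.

```python
import math
import random

P_C, P_M = 0.8, 0.1      # crossover and mutation probabilities from the text
PENALTY = 1e9            # stands in for the "arbitrarily large number" X

def crossover_part(p1, p2, rng):
    """Single-point crossover on one part (genes or conditions) of two
    parents; each parent gets its own cut point, so lengths can change."""
    c1 = rng.randint(1, len(p1) - 1)
    c2 = rng.randint(1, len(p2) - 1)
    return p1[:c1] + p2[c2:], p2[:c2] + p1[c1:]

def mutate(genes, conds, G, C, rng):
    """Replace one random gene center and one random condition center."""
    genes, conds = list(genes), list(conds)
    genes[rng.randrange(len(genes))] = rng.randint(1, G)
    conds[rng.randrange(len(conds))] = rng.randint(1, C)
    return genes, conds

def is_valid(genes, conds, G, C):
    """No repeated indexes, and 2 <= M <= sqrt(G), 2 <= N <= sqrt(C)."""
    return (2 <= len(genes) <= math.isqrt(G)
            and 2 <= len(conds) <= math.isqrt(C)
            and len(set(genes)) == len(genes)
            and len(set(conds)) == len(conds))

def string_fitness(bicluster_fitness, valid):
    """Mean (f1, f2) over the delta-biclusters of a string, or the penalty
    vector for an invalid string (or, by assumption, an empty list)."""
    if not valid or not bicluster_fitness:
        return (PENALTY, PENALTY)
    n = len(bicluster_fitness)
    return (sum(f[0] for f in bicluster_fitness) / n,
            sum(f[1] for f in bicluster_fitness) / n)

rng = random.Random(1)
child_a, child_b = crossover_part([1, 2, 3, 4], [5, 6, 7], rng)
mut_genes, mut_conds = mutate([3, 7], [1, 2], 200, 100, rng)
fit = string_fitness([(0.2, 0.4), (0.4, 0.6)], valid=True)
```

In a full loop, crossover would be applied to a selected pair with probability `P_C` and mutation to a string with probability `P_M`, for 100 generations with a population of 50 as stated above.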

Fig. 6. Six MOGAB biclusters of Leukemia data. For each bicluster, #genes, #conditions, MSR, var, and BI are given. (a) 126, 8, 428.9371, 1865.9349, and 0.2298. (b) 29, 8, 432.0511, 1860.1433, and 0.2321. (c) 27, 7, 396.8000, 2462.4681, and 0.1611. (d) 21, 8, 441.5094, 5922.4800, and 0.0745. (e) 17, 7, 358.1648, 2378.7116, and 0.1505. (f) 16, 7, 377.4619, 4561.3842, and 0.0827.

A. Datasets and Data Preprocessing

1) Simulated Data: A simulated dataset SD_200_100 having 200 genes and 100 conditions is generated. The dataset contains 12 artificial biclusters.

2) Yeast Cell Cycle Data: This dataset [15] contains 2884 genes and 17 conditions. The rows with missing values are omitted to form a data matrix of size 2882 × 17. The dataset is publicly available at http://arep.med.harvard.edu/biclustering.

3) Human Large B-cell Lymphoma Data: There are 4026 genes and 96 conditions in this dataset [15]. The rows with missing values have been removed to reduce the matrix to a size of 854 × 96. This dataset is also publicly available at http://arep.med.harvard.edu/biclustering.

4) Acute Lymphoblastic Leukemia (ALL) and Acute Myeloid Leukemia (AML) Data: This dataset [30] provides the RNA values of 7129 probes of human genes with 47 ALL samples and 25 AML samples. This dataset is publicly available at http://sdmc.lit.org.sg/GEDatasets/Datasets.html.

The δ values for the aforementioned datasets are taken to be 2.5, 300, 1200, and 500, respectively. The missing values are omitted to avoid random interference. The datasets are normalized to zero mean and unit variance. The detailed description of the datasets is available on the supplementary website.
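The two preprocessing steps named above (dropping rows with missing values and normalizing) might look as follows in NumPy. This is a sketch under stated assumptions: missing entries are assumed to be encoded as NaN, and normalization is assumed to be applied per gene (row), which the text does not specify. The function name `preprocess` is hypothetical.

```python
import numpy as np

def preprocess(raw):
    """Drop rows containing missing values, then scale each remaining row
    to zero mean and unit variance (row-wise scaling is an assumption)."""
    raw = np.asarray(raw, dtype=float)
    complete = raw[~np.isnan(raw).any(axis=1)]      # keep fully observed rows
    mean = complete.mean(axis=1, keepdims=True)
    std = complete.std(axis=1, keepdims=True)
    std[std == 0] = 1.0                             # constant rows stay at zero
    return (complete - mean) / std

raw = [[1.0, 2.0, 3.0],
       [4.0, float("nan"), 6.0],    # dropped: contains a missing value
       [2.0, 2.0, 2.0],             # constant row
       [10.0, 20.0, 30.0]]
data = preprocess(raw)
```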

Fig. 7. Plot of BI scores of 100 best biclusters for Leukemia data.

B. Validation Indexes

1) Match Score: For the simulated data, the match score as defined in [12] is used to compare the performance of different algorithms. This ranges from 0 to 1 with a larger score indicating better matching with the implanted biclusters.

2) BI Index: This is used for the real datasets. If a bicluster has MSR H and row variance R, we define the biclustering index BI for that bicluster as

$$BI = \frac{H}{1 + R}. \qquad (3)$$

A low BI value implies high coherence and nontriviality.
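As a small worked check of (3), the BI values quoted in the caption of Fig. 1 can be recomputed from the MSR and row-variance values listed there. The function name `bi_index` is hypothetical; the numbers are those given in the caption.

```python
def bi_index(msr, row_var):
    """Biclustering index BI = H / (1 + R) for MSR H and row variance R."""
    return msr / (1.0 + row_var)

# (MSR, var) pairs of the six simulated-data biclusters listed in Fig. 1.
fig1 = {
    "a": (2.1003, 3.7407),
    "b": (1.0224, 2.5519),
    "c": (1.6321, 3.5932),
    "d": (1.5124, 2.5529),
    "e": (1.3584, 2.4389),
    "f": (1.3206, 2.4333),
}
scores = {name: bi_index(h, r) for name, (h, r) in fig1.items()}
best = min(scores, key=scores.get)   # lowest BI is the best bicluster
```

Rounded to four decimals, the recomputed scores for (a) and (b) match the caption values 0.4430 and 0.2878, and (b) has the lowest BI.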


TABLE IV COMPARISON OF THE BICLUSTERS OF DIFFERENT ALGORITHMS FOR Leukemia DATA

The detailed description of the validation indexes is available on the supplementary website.

TABLE V p-values FOR t-TEST COMPARING MEAN VALUE OF BI INDEX OF MOGAB WITH THAT OF OTHER ALGORITHMS FOR REAL DATASETS

C. Results on Simulated Dataset

A single run of MOGAB on the simulated dataset takes about 40 s. Fig. 1 shows the biclusters found by MOGAB for the simulated data. The figure indicates that the biclusters obtained using MOGAB are highly coherent (i.e., have low MSR) and nontrivial (i.e., have high row variance var). Hence, the biclusters are “interesting” in nature. In terms of the BI index, bicluster (b) is the best, with the minimum BI score of 0.2878. These results demonstrate the effectiveness of the proposed multiobjective algorithm. Table I reports the percentage match score for the simulated data for all the algorithms. MOGAB produces the maximum match score of 91.3%. This indicates that MOGAB is able to discover the implanted biclusters from the background reasonably well.

D. Results on Real Datasets

1) Results on Yeast Dataset: A single run of MOGAB on the Yeast dataset takes about 140 s. In Fig. 2, six example biclusters found by MOGAB on Yeast data are shown. Visual inspection reveals that MOGAB discovers interesting biclusters having high row variance and low BI scores. Fig. 3 shows the plot of BI scores of the 100 best biclusters, sorted in ascending order, produced by all the algorithms. OPSM and ISA provided only 16 and 23 biclusters, respectively. A comparative study of all the algorithms is reported in Table II. It appears from the table that, on average, the MOGAB biclusters simultaneously have lower MSR and higher row variance. Also, MOGAB provides the lowest average BI score. Low standard deviations (shown within brackets) of the BI scores indicate that all the biclusters are of similar interest. Although CC, Bimax, OPSM, and ISA beat MOGAB in terms of the lowest BI score, Fig. 3 shows that MOGAB provides a low BI score for most of the biclusters. This means that MOGAB produces a set of biclusters with similar quality, whereas the other algorithms generate some uninteresting biclusters with higher BI scores.

2) Results on Human Dataset: A single run of MOGAB on the Human dataset takes around 180 s. Fig. 4 shows six example biclusters generated by MOGAB for Human data. The figure indicates that all six biclusters are highly coherent and have high row variance as well as low BI scores. Fig. 5 shows the BI scores of the best 100 biclusters found by various algorithms for Human data. OPSM, ISA, and BiVisu produced 14, 65, and 93 biclusters, respectively. For most of the biclusters, MOGAB provides smaller BI scores compared to the other algorithms. The values in Table III also confirm that the MOGAB biclusters have lower average MSR and higher average variance. Moreover, the average BI score of MOGAB is the best. The lower standard deviation indicates that the scores do not vary much from one bicluster to another, i.e., MOGAB provides a set of biclusters of similar quality.

3) Results on Leukemia Dataset: MOGAB takes about 570 s for a single run on Leukemia data. As evident from Fig. 6, the six biclusters found by MOGAB for Leukemia data have low residue and high row variance, and thus low BI scores. OPSM, ISA, and BiVisu provide 16, 21, and 54 biclusters, respectively. From Fig. 7, it appears that for most of the biclusters, MOGAB provides lower BI scores compared to the other algorithms and its performance is more stable. Table IV indicates that the MOGAB biclusters are generally characterized by lower residue and higher row variance, and thus have lower BI scores.

4) Statistical Significance: A statistical significance test based on the t-statistic is conducted at the 5% significance level to establish that the better average BI scores produced by MOGAB compared to the other algorithms are statistically significant. Table V shows the p-values provided by the t-test comparing the mean BI scores of MOGAB with those of the other algorithms for the real datasets. The p-values are less than 0.05 (5% significance level), with only one exception, where the t-test gives a p-value larger than 0.05. This is the case of comparing MOGAB with OPSM for the Human dataset. However, this is acceptable as OPSM produces only 14 biclusters for this dataset. Hence,


TABLE VI RESULT OF BIOLOGICAL SIGNIFICANCE TEST: THE TOP FIVE FUNCTIONALLY ENRICHED SIGNIFICANT BICLUSTERS PRODUCED BY EACH ALGORITHM FOR Yeast Data

this test indicates that the performance of MOGAB is significantly better than that of the other methods.

E. Biological Significance Test

The biological relevance of the biclusters can be verified based on the gene ontology (GO) annotation database (http://db.yeastgenome.org/cgi-bin/GO/goTermFinder). This is used to test the functional enrichment of a group of genes in terms of three structured, controlled vocabularies (ontologies), viz., biological processes, molecular functions, and cellular components. The degree of functional enrichment (p-value) is computed using a cumulative hypergeometric distribution that measures the probability of finding the number of genes involved in a given GO term within a bicluster. From a given GO category, the probability p of getting k or more genes within a cluster of size n can be defined as [12], [19]

$$p = 1 - \sum_{i=0}^{k-1} \frac{\binom{f}{i} \binom{g-f}{n-i}}{\binom{g}{n}}$$

where f and g denote the total number of genes within a category and within the genome, respectively. This signifies how well the genes in the bicluster match with the different GO categories. If the majority of genes in a bicluster have the same biological function, then it is unlikely that this takes place by chance, and the p-value of the category will be close to 0.

The biological significance test for the Yeast dataset was conducted at the 1% significance level. For the different algorithms, the number of biclusters for which the most significant GO terms have a p-value less than 0.01 is as follows: MOGAB—32, SGAB—19, CC—10, RWB—7, Bimax—14, OPSM—6, ISA—6, and BiVisu—17. Table VI reports the GO terms along with their p-values and the percentage of genes associated with the GO term in the bicluster for the top five significant biclusters. For example, the five most significant biclusters for MOGAB are mostly enriched with the GO terms cytosolic part (p-value = 1.4E−45), ribosomal subunit (p-value = 1.6E−45), translation (p-value = 3.8E−41), RNA metabolic process (p-value = 8.4E−25), and DNA metabolic process (p-value = 3.1E−21). MOGAB beats all other algorithms in terms of the p-values for the top five functionally enriched biclusters. Also note that the significant GO terms have very low p-values (much less than 0.01). Moreover, the percentages of genes associated with the GO terms in the biclusters for MOGAB are better than those for the other algorithms. This indicates that the MOGAB biclusters are biologically significant.

V. CONCLUSION

An NSGA-II-based multiobjective biclustering technique (MOGAB) that simultaneously optimizes the MSR and row variance of the biclusters to discover nontrivial biclusters from microarray data is developed. A novel variable string length encoding scheme is proposed in this regard. The performance of MOGAB has been compared with some popular biclustering methods both visually and using the proposed BI index on simulated and real microarray datasets. Finally, a biological significance test is carried out to show MOGAB’s ability to identify biologically significant biclusters.
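The enrichment p-value of Section IV-E can be evaluated with exact integer binomials, as in the sketch below. The function name `enrichment_p_value` is hypothetical. The sketch sums the upper tail (i = k, ..., min(n, f)) directly, which is algebraically the same as one minus the lower-tail sum in the formula but avoids cancellation when p is very small; that choice is an implementation assumption, not something stated in the paper.

```python
from math import comb

def enrichment_p_value(k, n, f, g):
    """Probability of seeing k or more genes of a GO category in a cluster
    of n genes, when f of the g genes in the genome belong to the category."""
    total = comb(g, n)
    # Upper tail of the hypergeometric distribution, summed exactly.
    tail = sum(comb(f, i) * comb(g - f, n - i)
               for i in range(k, min(n, f) + 1))
    return tail / total

# Toy numbers: genome of 10 genes, 4 in the category, cluster of 3 genes.
p_two_or_more = enrichment_p_value(2, 3, 4, 10)
lower_sum = sum(comb(4, i) * comb(6, 3 - i) for i in range(2)) / comb(10, 3)
p_any = enrichment_p_value(0, 3, 4, 10)
```

In the toy case, the upper tail is (6·6 + 4·1)/120 = 1/3, and it agrees with one minus the lower-tail sum.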


REFERENCES

[1] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Englewood Cliffs, NJ: Prentice-Hall, 1988.
[2] R. J. Cho, M. J. Campbell, E. A. Winzeler, L. Steinmetz, A. Conway, L. Wodica, T. G. Wolfsberg, A. E. Gabrielian, D. Landsman, D. J. Lockhart, and R. W. Davis, “A genome-wide transcriptional analysis of mitotic cell cycle,” Mol. Cell., vol. 2, pp. 65–73, 1998.
[3] S. Chu, J. DeRisi, M. Eisen, J. Mulholland, D. Botstein, P. O. Brown, and I. Herskowitz, “The transcriptional program of sporulation in budding yeast,” Science, vol. 282, pp. 699–705, Oct. 1998.
[4] A. A. Alizadeh, M. B. Eisen, R. Davis, C. Ma, I. Lossos, A. Rosenwald, J. Boldrick, R. Warnke, R. Levy, W. Wilson, M. Grever, J. Byrd, D. Botstein, P. O. Brown, and L. M. Straudt, “Distinct types of diffuse large B-cell lymphomas identified by gene expression profiling,” Nature, vol. 403, pp. 503–511, 2000.
[5] R. Herwig, A. Poustka, C. Meuller, H. Lehrach, and J. O’Brien, “Large-scale clustering of cDNA fingerprinting data,” Genome Res., vol. 9, no. 11, pp. 1093–1105, 1999.
[6] D. Dembele and P. Kastner, “Fuzzy c-means method for clustering microarray data,” Bioinformatics, vol. 19, no. 8, pp. 973–980, 2003.
[7] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein, “Cluster analysis and display of genome-wide expression patterns,” Proc. Nat. Acad. Sci. USA, vol. 95, pp. 14 863–14 868, 1998.
[8] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. Lander, and T. Golub, “Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation,” Proc. Nat. Acad. Sci. USA, vol. 96, pp. 2907–2912, 1999.
[9] E. Hartuv and R. Shamir, “A clustering algorithm based on graph connectivity,” Inf. Process. Lett., vol. 76, no. 200, pp. 175–181, 2000.
[10] A. V. Lukashin and R. Fuchs, “Analysis of temporal gene expression profiles: Clustering by simulated annealing and determining the optimal number of clusters,” Bioinformatics, vol. 17, no. 5, pp. 405–414, 2001.
[11] S. Bandyopadhyay, A. Mukhopadhyay, and U. Maulik, “An improved algorithm for clustering gene expression data,” Bioinformatics, vol. 23, no. 21, pp. 2859–2865, 2007.
[12] A. Prelic, S. Bleuler, P. Zimmermann, A. Wille, P. Buhlmann, W. Gruissem, L. Hennig, L. Thiele, and E. Zitzler, “A systematic comparison and evaluation of biclustering methods for gene expression data,” Bioinformatics, vol. 22, no. 9, pp. 1122–1129, 2006.
[13] A. Tanay, R. Sharan, and R. Shamir, Biclustering Algorithms: A Survey. London, U.K.: Chapman & Hall, 2006.
[14] J. Hartigan, “Direct clustering of a data matrix,” J. Amer. Stat. Assoc., vol. 67, no. 337, pp. 123–129, 1972.
[15] Y. Cheng and G. M. Church, “Biclustering of gene expression data,” in Proc. Int. Conf. Intell. Syst. Molecular Biol. (ISMB 2000), pp. 93–103.
[16] G. Getz, E. Levine, and E. Domany, “Coupled two-way cluster analysis of gene microarray data,” Proc. Nat. Acad. Sci., vol. 97, pp. 12 079–12 084, 2000.
[17] F. Angiulli, E. Cesario, and C. Pizzuti, “Gene expression biclustering using random walk strategies,” presented at the 7th Int. Conf. Data Warehousing Knowl. Discovery (DAWAK 2005), Copenhagen, Denmark.
[18] S. Bleuler, A. Prelic, and E. Zitzler, “An EA framework for biclustering of gene expression data,” in Proc. Congr. Evol. Comput., 2004, pp. 166–173.
[19] A. Tanay, R. Sharan, and R. Shamir, “Discovering statistically significant biclusters in gene expression data,” Bioinformatics, vol. 18, pp. S136–S144, 2002.
[20] K. Bryan, P. Cunningham, and N. Bolshakova, “Biclustering of expression data using simulated annealing,” in Proc. 18th IEEE Symp. Comput.-Based Med. Syst. (CBMS 2005), Dublin, Ireland, pp. 383–388.
[21] A. Ben-Dor, B. Chor, R. Karp, and Z. Yakhini, “Discovering local structure in gene expression data: The order preserving sub-matrix problem,” in Proc. 6th Annu. Int. Conf. Comput. Biol., 2002, vol. 1-58113-498-3, pp. 49–57.
[22] J. Ihmels, S. Bergmann, and N. Barkai, “Defining transcription modules using large-scale gene expression data,” Bioinformatics, vol. 20, pp. 1993–2003, 2004.
[23] T. M. Murali and S. Kasif, “Extracting conserved gene expression motifs from gene expression data,” in Proc. Pacific Symp. Biocomput., 2003, vol. 8, pp. 77–88.
[24] L. Teng and L.-W. Chan, “Biclustering gene expression profiles by alternately sorting with weighted correlated coefficient,” in Proc. IEEE Int. Workshop Mach. Learning Signal Process., 2006, pp. 289–294.
[25] K. Deb, A. Pratap, S. Agrawal, and T. Meyarivan, “A fast and elitist multiobjective genetic algorithm: NSGA II,” IEEE Trans. Evol. Comput., vol. 6, no. 2, pp. 182–197, Apr. 2002.
[26] C. A. Coello Coello, “A comprehensive survey of evolutionary-based multiobjective optimization techniques,” Knowl. Inform. Syst., vol. 1, no. 3, pp. 129–156, 1999.
[27] S. Bandyopadhyay, S. Saha, U. Maulik, and K. Deb, “A simulated annealing-based multiobjective optimization algorithm: AMOSA,” IEEE Trans. Evol. Comput., vol. 12, no. 3, pp. 269–283, Jun. 2008.
[28] W. Shannon, R. Culverhouse, and J. Duncan, “Analyzing microarray data using cluster analysis,” Pharmacogenomics, vol. 4, no. 1, pp. 41–51, 2003.
[29] S. Barkow, S. Bleuler, A. Prelic, P. Zimmermann, and E. Zitzler, “BicAT: A biclustering analysis toolbox,” Bioinformatics, vol. 22, no. 10, pp. 1282–1283, 2006.
[30] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gassenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, D. D. Bloomfield, and E. S. Lander, “Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring,” Science, vol. 286, pp. 531–537, 1999.

Ujjwal Maulik (M’99–SM’05) received the Ph.D. degree from Jadavpur University, Kolkata, India, in 1997. He was with the Center for Adaptive Systems Application, New Mexico, the University of New South Wales, Australia, the University of Texas at Arlington, Arlington, the University of Maryland at Baltimore, Baltimore, Fraunhofer Institute AiS, Germany, Tsinghua University, China, and the University of Rome, Italy. He is currently a Professor in the Department of Computer Science and Engineering, Jadavpur University. He has authored or coauthored four books and more than 140 papers. His current research interests include evolutionary computing, pattern recognition, data mining, bioinformatics, and distributed systems.

Anirban Mukhopadhyay received the M.E. and Ph.D. degrees in computer science from Jadavpur University, Kolkata, India, in 2004 and 2009, respectively. He is currently a Faculty Member in the Department of Computer Science and Engineering, University of Kalyani, Kalyani, India. He has authored or coauthored more than 40 papers. His current research interests include soft and evolutionary computing, data mining, bioinformatics, and optical networks. Mr. Mukhopadhyay is the recipient of the University Gold Medal and Amitava Dey Memorial Gold Medal from Jadavpur University in 2004. His biography has been included in the 2009 Edition of Marquis Who’s Who in the World.

Sanghamitra Bandyopadhyay (M’99–SM’05) received the Ph.D. degree in computer science from Indian Statistical Institute, Kolkata, India, in 1998. She was with Los Alamos National Laboratory, Los Alamos, NM, the University of New South Wales, Sydney, Australia, the University of Texas at Arlington, Arlington, the University of Maryland at Baltimore County, Baltimore, Fraunhofer Institute, Germany, Tsinghua University, China, and the University of Rome, Rome, Italy. She is currently a Professor at Indian Statistical Institute. She has authored or coauthored four books and more than 150 papers. Her current research interests include computational biology and bioinformatics, soft and evolutionary computation, pattern recognition, and data mining.
