IJRIT International Journal of Research in Information Technology, Volume 2, Issue 3, March 2014, Pg: 526-531

International Journal of Research in Information Technology (IJRIT) www.ijrit.com

ISSN 2001-5569

A Comparison of Scalability on Subspace Clustering Algorithms in High Dimensional Data Set

Dr P Jaganathan 1, T Kamalavalli 2 & R Kuppuchamy 3

1 Professor & Head, PSNA College of Engineering & Technology, Dindigul, Tamilnadu, India, [email protected]

2 Associate Professor, PSNA College of Engineering & Technology, Dindigul, Tamilnadu, India, [email protected]

3 Associate Professor, PSNA College of Engineering & Technology, Dindigul, Tamilnadu, India, [email protected]

Abstract

Compared to just a few years ago, huge databases are now in daily use in medical research, imaging, financial analysis, and many other domains. Not only are new fields open to data analysis, but it has also become easier and cheaper to collect large amounts of data. One problem arising from this rapid evolution is that analyzing such data becomes more and more difficult and requires new techniques, better adapted than those used in the past. Subspace clustering is an extension of traditional clustering that seeks to find clusters in different subspaces within a dataset. In high dimensional data, many dimensions are often irrelevant and can mask existing clusters in noisy data. Feature selection removes irrelevant and redundant dimensions by analyzing the entire dataset. This paper presents a comparative study of two subspace clustering algorithms, a cell-based approach (CLIQUE) and a clustering-oriented approach (PROCLUS), with respect to their scalability. By comparing the results of the original and the new approach, it was found that the time taken to process the data was substantially reduced.

Keywords: Clustering, Subspace clustering, Scalability, Clique, Proclus.

1. Introduction

Cluster analysis seeks to discover groups, or clusters, of similar objects. The objects are usually represented as a vector of measurements, or a point in multidimensional space. The similarity between objects is often determined using distance measures over the various dimensions of the data [8, 9]. Subspace clustering is the task of detecting all clusters in all subspaces; subspace clustering algorithms attempt to find clusters that exist only in subsets of the dimensions. This means that a point may be a member of multiple clusters, each existing in a different subspace. Subspaces can be either axis-parallel or affine. Traditional clustering algorithms consider all of the dimensions of an input dataset in an attempt to learn as much as possible about each object described. In high dimensional data, however, many of the dimensions are often irrelevant. These irrelevant dimensions confuse clustering algorithms by hiding clusters in noisy data. In very high dimensions it is common for all of the objects in a dataset to be nearly equidistant from each other, completely masking the clusters. Feature selection methods have been used somewhat successfully to improve cluster quality. These algorithms find a subset of dimensions on which to perform clustering by removing irrelevant and redundant dimensions. The problem with feature selection arises when the clusters in the dataset exist in multiple, possibly overlapping, subspaces. Scalability remains a significant issue for large-scale datasets [13]. Data mining applications place two primary requirements on clustering algorithms: scalability to large datasets [14] and no presumption of canonical data properties such as convexity. Many clustering algorithms generate accurate clusters on small datasets with limited dimensions, because those algorithms were initially developed for applications where accuracy was more important than speed. However, the scalability of data mining techniques is very important due to the rapid growth in the amount of data.
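The "nearly equidistant" effect mentioned above can be demonstrated with a small experiment (an illustrative sketch, not part of the original study): for uniformly random points, the ratio between the largest and smallest pairwise distances collapses as the dimensionality grows.

```python
# Illustrative sketch: the "nearly equidistant" effect in high dimensions.
# For uniform random points, the max/min pairwise distance ratio shrinks
# toward 1 as the number of dimensions grows.
import math
import random

def distance_spread(dim, n_points=100, seed=0):
    """Return the ratio of max to min pairwise Euclidean distance."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [
        math.dist(pts[i], pts[j])
        for i in range(n_points)
        for j in range(i + 1, n_points)
    ]
    return max(dists) / min(dists)

for dim in (2, 10, 100):
    print(f"dim={dim:>3}  max/min distance ratio={distance_spread(dim):.2f}")
```

The ratio shrinks markedly with growing dimensionality, which is why full-space distance measures become uninformative and clusters hidden in low-dimensional subspaces are masked.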

2. Literature Survey

Several algorithms for improving the scalability of subspace clustering have been reported in the literature, as given below. Mahdi et al. [21] introduced an algorithm inspired by sparse subspace clustering (SSC) [17] to cluster noisy data, and developed novel theory demonstrating its correctness. In particular, the theory uses ideas from geometric functional analysis to show that the algorithm can accurately recover the underlying subspaces under minimal requirements on their orientation and on the number of samples per subspace. Rahmat et al. [19] proposed that the density of each object's neighbours be calculated with MinPoints; cluster membership changes in accordance with changes in the density of each object's neighbours. The neighbours of each object are typically determined using a distance function, for example the Euclidean distance. Their method, however, does not address the preprocessing, dimension reduction, and outlier detection steps of subspace clustering. Emmanuel et al. [20] proposed a systematic approach to evaluate the major paradigms in a common framework. They studied representative clustering algorithms to characterize the different aspects of each paradigm, gave a detailed comparison of their properties, and provided a benchmark set of results on a large variety of real-world and synthetic data sets using different evaluation measures.

Corresponding Author: T Kamalavalli

3. Preprocessing

Principal Component Analysis (PCA) is used as a preprocessing stage for data mining and machine learning; dimension reduction not only decreases computational complexity, but can also significantly improve the accuracy of models learned from large data sets. PCA [22] is a classical multivariate data analysis method that is useful for linear feature extraction. Without class labels, it compresses most of the information in the original data space into a few new features, i.e., principal components. Handling high dimensional data with clustering techniques is obviously a difficult task in terms of the large number of variables involved. To improve efficiency and minimize execution time, the noisy and outlier data may be removed and the number of variables in the original data set reduced. The central idea of PCA is to reduce the dimensionality of a data set consisting of a large number of variables. It is a statistical technique for determining the key variables in a high dimensional data set that explain the differences in the observations, and it can be used to simplify the analysis and visualization of high dimensional data sets.
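As a rough illustration of the idea (a minimal sketch using power iteration, not the Weka-based PCA setup used in the experiments), the first principal component can be found by repeatedly applying the covariance operator to a vector and normalizing:

```python
# Minimal PCA sketch via power iteration (illustrative only; assumes the
# first eigenvalue of the covariance matrix is dominant).
import math
import random

def first_component(data, iters=200, seed=1):
    """Return (projections, direction) for the first principal component."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    X = [[row[j] - means[j] for j in range(d)] for row in data]  # centre data
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in range(d)]  # random start, never all-zero
    for _ in range(iters):
        scores = [sum(x[j] * v[j] for j in range(d)) for x in X]       # X v
        w = [sum(scores[i] * X[i][j] for i in range(n)) / n            # X^T X v / n
             for j in range(d)]
        norm = math.sqrt(sum(c * c for c in w)) or 1.0
        v = [c / norm for c in w]                                      # normalize
    return [sum(x[j] * v[j] for j in range(d)) for x in X], v
```

Projecting the data onto the top few such directions reduces the number of variables while retaining most of the variance, which is exactly the role PCA plays as a preprocessing step here.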


3.1 Subspace Clustering

Subspace clustering is a method for determining the clusters that form in different subspaces; it handles multidimensional data better than methods that use all dimensions at once. Fig. 1 (Wikipedia) shows two-dimensional data whose clusters lie in different subspaces. In the one-dimensional subspaces, cluster ca (in the subspace {x}) and clusters cb, cc, and cd (in the subspace {y}) can be found; cc, however, does not belong to any cluster in the full two-dimensional space. In two dimensions, cab and cad are identified as clusters. The main problem in subspace clustering is that clusters may reside in different subspaces, formed from different combinations of dimensions; the higher the number of dimensions, the more difficult it is to find the clusters. Subspace clustering methods automatically find the units clustered in each subspace.

Fig. 1 Subspace clustering

A. Cell-Based Approach

Cell-based clustering is based on a cell approximation of the data space, built in a bottom-up fashion. Cells of width w are used to describe clusters. For all cell-based approaches, a clustering result R consists of a set of cells, each of them containing more than a threshold number of objects (|Oi| for i = 1, ..., k). These cells describe the objects of the clusters either by a hypercube of variable width w [4, 5] or by a fixed grid of cells per dimension [6]. Fixed grids can be seen as a discretization of the data space performed in preprocessing. In contrast, variable hypercubes are arbitrarily positioned to delimit a region containing many objects. The first cell-based clustering approach was introduced by CLIQUE [3]. CLIQUE defines a cluster as a connected set of grid cells, each containing more than a threshold number of objects. Grid cells are defined by a fixed grid splitting each dimension into equal-width cells. Arbitrary-dimensional cells are formed by simple intersection of the one-dimensional cells. Early enhancements of CLIQUE adapted the grid to a variable cell width [7].

a. CLIQUE Algorithm

The CLIQUE (Clustering in Quest) algorithm [3] was one of the first subspace clustering algorithms. The algorithm combines density- and grid-based clustering and uses an Apriori-style search technique to find dense subspaces. Once the dense subspaces are found, they are sorted by coverage, defined as the fraction of the dataset that the dense units in the subspace represent. The subspaces with the greatest coverage are kept and the rest are pruned. The algorithm then finds adjacent dense grid units in each of the selected subspaces using a depth-first search. Clusters are formed by combining these units using a greedy growth scheme: the algorithm starts with an arbitrary dense unit and greedily grows a maximal region in each dimension until the union of all the regions covers the entire cluster.
Redundant regions are removed by a repeated procedure in which the smallest redundant regions are discarded until no further maximal region can be removed. The hyper-rectangular clusters are then described by a Disjunctive Normal Form (DNF) expression. CLIQUE finds clusters of arbitrary shape, in any number of dimensions, and clusters may be found in the same, overlapping, or disjoint subspaces. The DNF expressions used to represent clusters are often very interpretable and can describe overlapping clusters, meaning that instances can belong to more than one cluster. This is often advantageous in subspace clustering, since the clusters often exist in different subspaces and thus represent different relationships.
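The bottom-up dense-unit search at the heart of this procedure can be sketched as follows (a toy illustration under assumed parameters — xi grid cells per dimension and density threshold tau; this is not the authors' implementation and omits coverage pruning and region growing):

```python
# Toy sketch of CLIQUE's bottom-up dense-unit search (assumed parameters:
# xi grid cells per dimension, density threshold tau; coordinates in [0, 1]).
from collections import defaultdict
from itertools import combinations

def dense_units(points, xi=5, tau=3):
    """Return (dense 1-d units, dense 2-d units) as {unit: point indices}."""
    d = len(points[0])
    cells = defaultdict(set)
    for idx, p in enumerate(points):
        for dim in range(d):
            cell = min(int(p[dim] * xi), xi - 1)  # equal-width grid cell
            cells[(dim, cell)].add(idx)
    # dense 1-d units: grid cells holding at least tau points
    dense1 = {u: ids for u, ids in cells.items() if len(ids) >= tau}
    # Apriori step: a 2-d unit can only be dense if both 1-d projections are
    dense2 = {}
    for (u1, ids1), (u2, ids2) in combinations(sorted(dense1.items()), 2):
        if u1[0] != u2[0]:  # units from different dimensions
            common = ids1 & ids2
            if len(common) >= tau:
                dense2[(u1, u2)] = common
    return dense1, dense2

# Three points share the first x-cell, so (0, 0) is the only dense unit.
d1, d2 = dense_units([(0.10, 0.90), (0.12, 0.50), (0.15, 0.20), (0.90, 0.10)])
print(d1, d2)
```

The same intersection step generalizes to higher-dimensional candidate units, each pruned by the Apriori principle before its density is counted.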


CLIQUE consists of the following three steps: 1) identification of the subspaces that contain clusters; 2) identification of the clusters; 3) generation of a minimal description for the clusters.

b. Clustering-Oriented Approach

In contrast to the previous paradigms, clustering-oriented approaches focus on the clustering result R itself by directly specifying objective functions, such as the number of clusters to be detected or the average dimensionality of the clusters, as in PROCLUS [1], the first approach for this paradigm. PROCLUS, which extends k-means [16], partitions the data into k clusters with a given average dimensionality. Instead of a cluster definition, clustering-oriented approaches define properties of the set of resulting clusters, and each object is assigned to the cluster it fits best. P3C uses statistical tests and the expectation-maximization algorithm to find a more sophisticated partitioning [17, 18]. By defining a statistically significant density, STATPC aims at choosing the best non-redundant clustering; although it defines cluster properties, it aims at an overall optimization of the clustering result R. A clustering-oriented result with respect to an objective function f(R), which is based on the entire clustering result R and an optimal value parameter optF (e.g. numClusters(R) = k and avgDim(R) = l in PROCLUS), is a result set R with f(R) = optF. The most important property of clustering-oriented approaches is their global optimization of the clustering: the occurrence of a cluster depends on the remaining clusters in the result. Based on this idea, these approaches are parameterized by specifying objective functions for the resulting set of clusters. Clustering-oriented approaches directly control the resulting clusters, e.g. the number of clusters; other paradigms do not control such properties, as they report every cluster that fulfills their cluster definition.
Both the cell-based and density-based paradigms provide a cluster definition; every set of objects O and set of dimensions S fulfilling this definition is reported as a subspace cluster (O, S). There is no optimization process to select clusters. On the other side, clustering-oriented approaches do not influence the individual clusters to be detected. For example, keeping the number of clusters fixed and partitioning the data optimizes the overall coverage of the clustering, as in PROCLUS or P3C, but includes noise in the clusters. As these approaches optimize the overall clustering, they try to assign each object to a cluster, which can result in clusters containing highly dissimilar objects (noise). Both approaches are aware of such effects and use outlier detection mechanisms to remove noise from the detected clusters.
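A heavily simplified sketch of the PROCLUS-style assignment step (assumed toy code, not the original algorithm: medoid selection, locality sampling by radius, and iterative refinement are all omitted) might look like this:

```python
# Toy sketch of PROCLUS-style assignment: each medoid keeps the l dimensions
# along which its local neighbourhood is tightest, and points are assigned by
# Manhattan distance measured in that subspace only.
def proclus_assign(points, medoids, l, locality=2):
    d = len(points[0])
    subspaces = []
    for m in medoids:
        # locality: the points closest to this medoid in the full space
        near = sorted(points,
                      key=lambda p: sum(abs(p[j] - m[j]) for j in range(d)))[:locality]
        # keep the l dimensions with the smallest average spread around m
        spread = [(sum(abs(p[j] - m[j]) for p in near) / len(near), j)
                  for j in range(d)]
        subspaces.append(sorted(j for _, j in sorted(spread)[:l]))
    labels = []
    for p in points:
        # per-medoid Manhattan distance restricted to its chosen dimensions
        dists = [sum(abs(p[j] - m[j]) for j in dims) / len(dims)
                 for m, dims in zip(medoids, subspaces)]
        labels.append(dists.index(min(dists)))
    return labels, subspaces

# Four points separated along the first axis, noisy in the second:
pts = [(0.0, 0.9), (0.1, 0.1), (1.0, 0.8), (0.9, 0.2)]
labels, subs = proclus_assign(pts, [(0.05, 0.5), (0.95, 0.5)], l=1)
print(labels, subs)  # [0, 0, 1, 1] [[0], [0]]
```

Both medoids select dimension 0 as their subspace, so the points split into two clusters along the informative axis; real PROCLUS additionally refines the medoids iteratively and moves poorly fitting points to an outlier set.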

4. Performance Evaluation

In this paper, the high dimensional data sets are preprocessed by Principal Component Analysis (PCA) using OpenSubspace, the Weka subspace-clustering integration [15]. The reduced data set is given to each algorithm for better clustering in terms of time. The scalability in terms of time is measured for various data sets obtained from the University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia. The experiment was repeated a number of times and the results are tabulated below [Table 1]. Fig. 2 and Fig. 3 show the scalability of the Clique and Proclus algorithms in terms of time.

Table 1: Performance Analysis

Dataset     Without PCA (time in ms)    With PCA (time in ms)
            Clique     Proclus          Clique     Proclus
Diabetes    46         94               5          47
Breast      32         33               7          19
Shapes      29         37               6          15
Liver       37         43               16         40

Fig. 2. Scalability in Clique (time in ms per dataset, with and without PCA)

Fig. 3. Scalability in Proclus (time in ms per dataset, with and without PCA)

5. Conclusion

In this paper, we provided a thorough evaluation and comparison of clustering in subspace projections of high dimensional data. We gave an overview of two major paradigms (cell-based and clustering-oriented). The Clique method improves scalability by means of preprocessing techniques, particularly PCA (Principal Component Analysis). This approach reduces the total clustering time without loss of correctness of the clusters. Research on subspace clustering methods has a lot of potential to be developed further. We plan to study the clustering of categorical data sets in future work.

References

[1] Lance Parsons, Huan Liu and Ehtesham Haque, "Subspace clustering for high dimensional data: a review", ACM SIGKDD Explorations Newsletter - Special Issue on Learning from Imbalanced Datasets, 6(1), pp. 90-105, June 2004.
[2] Hans-Peter Kriegel, Peer Kröger and Arthur Zimek, "Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering", ACM Transactions on Knowledge Discovery from Data, 3(1), pp. 1-58, 2009.
[3] Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos and Prabhakar Raghavan, "Automatic subspace clustering of high dimensional data for data mining applications", In SIGMOD, pp. 94-105, 1998.
[4] C. M. Procopiuc et al., "A Monte Carlo algorithm for fast projective clustering", In SIGMOD, pp. 418-427, 2002.
[5] M. L. Yiu and N. Mamoulis, "Frequent-pattern based iterative projected clustering", In ICDM, pp. 689-692, 2003.
[6] K. Sequeira and M. Zaki, "SCHISM: A new approach for interesting subspace mining", In ICDM, pp. 186-193, 2004.
[7] H. Nagesh, S. Goil and A. Choudhary, "Adaptive grids for clustering massive data sets", In SDM, 2001.
[8] A. K. Jain, M. N. Murty and P. J. Flynn, "Data clustering: a review", ACM Computing Surveys (CSUR), 31(3), pp. 264-323, 1999.
[9] Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", pp. 335-393, Morgan Kaufmann Publishers, 2001.
[10] Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu, "A density-based algorithm for discovering clusters", In 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96), 1996.
[11] Hans-Peter Kriegel, Peer Kröger, Matthias Renz and Sebastian Wurst, "A generic framework for efficient subspace clustering of high-dimensional data", In Fifth IEEE International Conference on Data Mining (ICDM'05), 2005.
[12] Rahmat Widia Sembiring and Jasni Mohamad Zain, "Cluster evaluation of density based subspace clustering", Journal of Computing, 2(11), November 2010.
[13] Huajing Li, Zaiqing Nie and Wang-Chien Lee, "Scalable community discovery on textual data with relations". [http://www.ics.uci.edu/~mlearn/MLRepository.html] Irvine, CA: University of California, Department of Information and Computer Science.
[14] Tian Zhang, Raghu Ramakrishnan and Miron Livny, "BIRCH: An efficient data clustering method for very large databases", 1996.
[15] E. Müller, S. Günnemann, I. Assent and T. Seidl, "Evaluating clustering in subspace projections of high dimensional data", In Proc. 35th International Conference on Very Large Data Bases (VLDB 2009), Lyon, France, 2009.
[16] G. Moise and J. Sander, "Finding non-redundant, statistically significant regions in high dimensional data: a novel approach to projected and subspace clustering", In KDD, pp. 533-541, 2008.
[17] G. Moise, J. Sander and M. Ester, "P3C: A robust projected clustering algorithm", In ICDM, pp. 414-425, 2006.
[18] E. Müller, I. Assent, S. Günnemann, T. Jansen and T. Seidl, "OpenSubspace: An open source framework for evaluation and exploration of subspace clustering algorithms in WEKA", In Open Source in Data Mining Workshop at PAKDD, pp. 2-13, 2009.
[19] Rahmat Widia Sembiring and Jasni Mohamad Zain, "Cluster evaluation of density based subspace clustering", Journal of Computing, 2(1), November 2010.
[20] Emmanuel Müller, Stephan Günnemann, Ira Assent and Thomas Seidl, "Evaluating clustering in subspace projections of high dimensional data", In VLDB '09, August 24-28, 2009, Lyon, France.
[21] Mahdi Soltanolkotabi, Ehsan Elhamifar and Emmanuel J. Candès, "Robust subspace clustering", arXiv:1301.2603v2 [cs.LG], 1 Feb 2013.
[22] I. T. Jolliffe, "Principal Component Analysis", Springer, Second edition, 2002.

