

IJRIT International Journal of Research in Information Technology, Volume 2, Issue 2, February 2014, Pg: 11-13

International Journal of Research in Information Technology (IJRIT) www.ijrit.com

ISSN 2001-5569

An Efficient Approach for Subspace Clustering By Using Cat Seeker

G. Rajasekar#1, T. Aravind*2

# Research scholar, Department of Computer Science and Engineering, Muthayammal Engineering College, Namakkal
* Assistant Professor, Department of Computer Science and Engineering, Muthayammal Engineering College, Namakkal

1 [email protected] , [email protected]

Abstract

Subspace clustering addresses clustering problems that require mining actionable subspaces defined by objects and attributes at the same time. Such subspaces help decision makers identify profitable options. The CAT Seeker algorithm performs subspace clustering on three-dimensional databases of the form object-attribute-time, such as financial or biological databases, and finds the most profitable objects based on their centroid values. In this paper we extend the CAT Seeker algorithm with an optimal centroid value, an approach we call optimal centroid, in order to reduce running time and to retrieve more profitable objects from the database. The optimal centroid allows the user to move the centroids according to the profitable objects in the database.

Keywords- Clustering, Data mining, Optimal centroid, Subspace clustering

1. INTRODUCTION

In the past few years, a variety of databases have been used to store information such as financial and biological data. These databases hold large data sets that maintain accounts and documents for users. The process of collecting and extracting the information a user requires from such databases is known as data mining. Clustering plays an important role in data mining, and a lot of work has been done in the area [1]. Clustering aims to retrieve similar data (objects) from a large database. It is a technique for grouping together items that share similar values, and it can be applied to a large number of variables.


Fig 1: Clustering

In simple terms, clustering is the process of grouping physical or abstract objects into classes of similar objects [2]. The quality of a clustering result depends on both the similarity measure used by the method and its implementation. Clustering typically assumes that each instance is given a “hard” assignment to exactly one cluster.

APPLICATION

Clustering is a difficult combinatorial problem, and differences in assumptions and contexts in different communities have made the transfer of useful generic concepts and methodologies slow to occur. Clustering also has important applications [3] such as image segmentation, object recognition and information retrieval. The main applications of clustering in data mining are listed below.

• WWW: document classification, clustering Weblog data to discover groups of similar access patterns.
• Economic science: particularly market investigation.
• Natural language processing: linguistic analysis, parsing, learning languages, hyphenation patterns.
• Image recognition and processing: segmentation, object recognition, texture recognition.
• Signal processing: adaptive filters, real-time signal analysis, radar, sonar, seismic, USG, EKG, EEG and other medical signals.
• Optimization: configuration of telephone connections, VLSI design, time series prediction, scheduling algorithms.

2. RELATED WORK

The usefulness and usability of subspace clusters are important issues in subspace clustering [4]. Subspace clustering is a branch of clustering that is able to find low-dimensional clusters in very high-dimensional datasets. This approach to clustering allows a system to find groups of users who share a common interest in a particular field or sub-field, regardless of differences in other fields. In high-dimensional datasets, the number of potential subspaces is enormous. For example, if the data has N dimensions, the number of possible subspaces is 2^N [5]. In this paper, we identify real-world problems that motivate subspace clustering with actionability and with the user's domain knowledge supplied via centroids. The approach can be used to compare and find datasets in marketing, land use, insurance, city planning and many other areas.
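To make the 2^N growth concrete, the following minimal sketch enumerates every subset of a small attribute list. The function name and the attribute names are hypothetical and chosen only for illustration; the count includes the empty subspace.

```python
from itertools import combinations

def enumerate_subspaces(attributes):
    """Return every subset of the attribute list, including the empty one."""
    subspaces = []
    for size in range(len(attributes) + 1):
        subspaces.extend(combinations(attributes, size))
    return subspaces

# Hypothetical attribute names, not taken from the paper.
attributes = ["price", "volume", "earnings"]
subspaces = enumerate_subspaces(attributes)
print(len(subspaces))  # 8, i.e. 2 ** 3
```

Adding one more attribute doubles the count, which is why exhaustive search over subspaces quickly becomes impractical and pruning is needed.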

3. PROBLEM STATEMENT

In pattern-based subspace clustering, the values in a subspace cluster must satisfy distance- or similarity-based functions, and these functions normally require thresholds to be set. The existing approach relies on fixed centroids to handle and prune the datasets. Formulating subspace clustering with fixed centroids mitigates the threshold-sensitivity problem, because the clustering results are not sensitive to the optimization parameter. However, such work focuses on two-dimensional datasets and is therefore not suitable for subspace clustering on three-dimensional datasets. For this reason we use the 3D subspace clustering algorithm CAT Seeker with fixed centroids to mine CATSs (Centroid-based Actionable 3D Subspace clusters). It operates on three-dimensional (3D) datasets of the form object-attribute-time, for example ‘stock-ratio-year’ data in the financial domain.

Fixed centroids form a homogeneous model, i.e., the same type of data is managed and returned. The approach assumes full knowledge of the domain, so with only partial domain knowledge it does not produce a proper result. The algorithm partitions objects into separate groups in order to maintain the dataset information. Its main drawback is that an object belonging to multiple groups cannot be handled properly.
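The object-attribute-time layout and the role of a fixed centroid can be illustrated with a small sketch. It is a simplification under stated assumptions: the data values, the function names and the plain distance threshold are all hypothetical, and the threshold test stands in for the optimization that CAT Seeker actually performs.

```python
# Hypothetical 3D dataset indexed as data[object][attribute][time].
data = [
    [[1.0, 1.1, 1.2], [5.0, 5.0, 5.0]],  # object 0, used as the fixed centroid
    [[1.0, 1.2, 1.2], [9.0, 8.0, 7.0]],  # object 1: similar on attribute 0 only
    [[1.1, 1.1, 1.3], [5.1, 4.9, 5.0]],  # object 2: similar on both attributes
    [[4.0, 4.0, 4.0], [5.0, 5.0, 5.0]],  # object 3: similar on attribute 1 only
]

def mean_abs_deviation(data, centroid, attrs, times):
    """Mean absolute difference between each object and the centroid object,
    restricted to the chosen attributes and timestamps."""
    cells = [(a, t) for a in attrs for t in times]
    ref = data[centroid]
    return [
        sum(abs(obj[a][t] - ref[a][t]) for a, t in cells) / len(cells)
        for obj in data
    ]

def cluster_around(data, centroid, attrs, times, threshold):
    """Indices of objects that stay within the threshold of the centroid."""
    devs = mean_abs_deviation(data, centroid, attrs, times)
    return [i for i, d in enumerate(devs) if d <= threshold]

print(cluster_around(data, 0, attrs=[0], times=[0, 1, 2], threshold=0.2))
print(cluster_around(data, 0, attrs=[0, 1], times=[0, 1, 2], threshold=0.2))
```

The first call keeps objects 0, 1 and 2, while the second keeps only objects 0 and 2: which objects group around the centroid depends on the chosen subset of attributes, which is the essence of a subspace cluster.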


4. PROPOSED WORK

We propose a new algorithm, CAT Seeker with optimal centroids, for handling multiple groups at the same time. The optimal technique works on a heterogeneous model, i.e., it compares multiple groups of datasets and provides the appropriate result. Partial knowledge of the domain is enough to obtain results. CATSeeker uses SVD to prune the search space: the SVDpruning algorithm detects values with high homogeneity. CATSs allow users to incorporate their domain knowledge by selecting their preferred objects as centroids of the actionable subspace clusters. Such clusters are denoted centroid-based, actionable 3D subspace clusters (CATSs), and utility denotes a function measuring the profits or benefits of the objects.

GS-search [7] and MASC [8] ‘flatten’ the continuous-valued 3D dataset into a dataset with a single timestamp. They require the clusters to occur in every timestamp, so it is hard to find clusters in a dataset that has a large number of timestamps. CATSeeker, TRICLUSTER [9] and MIC [10] have the concept of subspace in all three dimensions, that is, they mine 3D subspace clusters that are subsets of attributes and subsets of timestamps.
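The general idea of using SVD to discard objects that do not share the dominant pattern can be sketched as follows. This is not the SVDpruning procedure itself, whose details are not given here; it is a generic rank-1 illustration in which the function names, the data and the threshold are assumptions. It takes one attribute's object x time slice, computes the best rank-1 approximation with `numpy.linalg.svd`, and keeps the objects whose rows are well explained by it.

```python
import numpy as np

def rank1_residuals(matrix):
    """Per-row distance between a matrix and its best rank-1 approximation."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    approx = s[0] * np.outer(u[:, 0], vt[0])
    return np.linalg.norm(matrix - approx, axis=1)

def prune_rows(matrix, threshold):
    """Indices of rows whose residual is at most the threshold."""
    return [i for i, r in enumerate(rank1_residuals(matrix)) if r <= threshold]

# Hypothetical object x time slice for one attribute. Rows 0-2 follow the
# same trend (scaled copies of [1, 2, 3]); row 3 is orthogonal to that trend.
slice_ = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [3.0, 6.0, 9.0],
    [3.0, 0.0, -1.0],
])

print(prune_rows(slice_, threshold=1.0))  # [0, 1, 2]
```

Rows that follow the shared trend have a residual close to zero and survive, while the row that deviates from it is removed before any more expensive search is run.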

5. CONCLUSION AND FUTURE WORK

Subspace clustering with optimal centroids is expected to improve efficiency and reduce running time. It allows domain knowledge to be incorporated in a sensible way. The CAT Seeker algorithm supports three-dimensional databases only. In future work, we plan to develop an algorithm that supports four-dimensional datasets of the form object-attribute-time-place.

REFERENCES

[1] Karin Kailing, Hans-Peter Kriegel and Peer Kroger, “Density Connected Subspace Clustering for High dimensional Data,” SIAM, pp. 246-256.
[2] Jerzy Stefanowski, “Data Mining – Clustering,” Institute of Computing Sciences, Poznan University of Technology, Poznan, Poland, Lecture 7, SE Master Course, 2008/2009.
[3] A.K. Jain, M.N. Murty and P.J. Flynn, “Data Clustering - A Review,” Michigan State University, 2008.
[4] H.P. Kriegel, P. Kroger and A. Zimek, “Clustering high dimensional data: A survey on subspace clustering, pattern based clustering, and correlation clustering,” ACM Transactions on Knowledge Discovery from Data, 3, pp. 1–58, 2009.
[5] Nitin Agarwal, Ehtesham Haque, Huan Liu and Lance Parsons, “A Subspace Clustering Framework for Research Group Collaboration,” Department of Computer Science Engineering, Arizona State University, Tempe, AZ 85281.
[6] Kelvin Sim, Ghim-Eng Yap, David R. Hardoon, Vivekanand Gopalkrishnan, Gao Cong and Suryani Lukman, “Centroid-based Actionable 3D Subspace Clustering,” IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 6, June 2013.
[7] D. Jiang, J. Pei, M. Ramanathan, C. Tang and A. Zhang, “Mining coherent gene clusters from gene-sample-time microarray data,” In KDD, pp. 430–439, 2004.
[8] K. Sim, A.K. Poernomo and V. Gopalkrishnan, “Mining actionable subspace clusters in sequential data,” In SDM, pp. 442–453, 2010.
[9] L. Zhao and M.J. Zaki, “TRICLUSTER: An effective algorithm for mining coherent clusters in 3D microarray data,” In SIGMOD, pp. 694–705, 2005.
[10] K. Sim, Z. Aung and V. Gopalkrishnan, “Discovering correlated subspace clusters in 3D continuous-valued data,” In ICDM, pp. 471–480, 2010.
