IJRIT International Journal of Research in Information Technology, Volume 3, Issue 2, February 2015, Pg. 77-81

International Journal of Research in Information Technology (IJRIT) www.ijrit.com

ISSN 2001-5569

An Approach to Data Mining: Clustering

Archana M. Badge, Prof. Ms. S.W. Ahmad

Dept. of Computer Science and Engineering, Prof. Ram Meghe Institute of Technology & Research, Badnera 444701

[email protected], [email protected]

Abstract: Data mining is the process of automatically searching large stores of data to discover patterns and trends that go beyond simple analysis. Data mining uses sophisticated mathematical algorithms to segment the data and evaluate the probability of future events. Data mining is also known as Knowledge Discovery in Data (KDD). There are several types of data mining, such as Text Mining, Web Mining, Multimedia Mining, Spatial Mining and Object Mining. Multimedia is the combination of text, image, graphics, animations, audio and video. Cluster analysis divides data into groups that are meaningful, useful, or both. This seminar surveys the current state of multimedia data mining and knowledge discovery, data mining efforts aimed at multimedia data, and current approaches and well-known techniques for mining multimedia data.

Index Terms: Data mining, Multimedia database, Clustering, Hierarchical, K-Means

1. Introduction

Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups (clusters) [6]. It is a main task of exploratory data mining and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics. Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to find clusters efficiently. Cluster analysis has wide applications in data mining, information retrieval, biology, medicine, marketing, and image segmentation. With the help of clustering algorithms, a user is able to understand the natural clusters or structures underlying a data set.

2. Literature Review

Lior Rokach and Oded Maimon state that clustering groups data instances into subsets in such a manner that similar instances are grouped together, while different instances belong to different groups. The instances are thereby organized into an efficient representation that characterizes the population being sampled. Formally, the clustering structure is represented as a set of subsets $C = \{C_1, \ldots, C_k\}$ of $S$, such that $S = \bigcup_{i=1}^{k} C_i$ and $C_i \cap C_j = \emptyset$ for $i \neq j$. Consequently, any instance in $S$ belongs to exactly one subset.

Clustering of objects is as ancient as the human need for describing the salient characteristics of men and objects and identifying them with a type. Therefore, it embraces various scientific disciplines, from mathematics and statistics to biology and genetics, each of which uses different terms to describe the topologies formed using this analysis. From biological “taxonomies” to medical “syndromes” and genetic “genotypes” to manufacturing “group technology”, the problem is identical: forming categories of entities and assigning individuals to the proper groups within them.

A. How Clustering Works

A clustering task involves the following steps:
• pattern representation (including feature abstraction and/or selection),
• definition of a pattern proximity measure appropriate to the data domain,
• clustering,
• data abstraction, and
• assessment of output.

B. Types of clustering

Fig 1: Classification of clustering techniques

3. Clustering Algorithms

A. Forgy’s Method

The earliest method to initialize K-means was proposed by Forgy in 1965. Forgy’s method chooses the initial centroids randomly from the database. This approach relies on the fact that a randomly chosen point is more likely to lie near a cluster centre, because that is where the density of points is highest [4]. M.E. Celebi et al. showed that cluster centroid initialization methods such as Forgy, MacQueen, and max-min often perform poorly, and that other methods with the same computational requirements can give better results.

B. Simple Cluster Seeking Method

The Simple Cluster-Seeking (SCS) method was suggested by Tou and Gonzalez. This method initializes the first seed with the first value in the database. It then calculates the distance between the chosen seed and the next point in the database. If this distance is greater than some threshold, the point is chosen as the second seed; otherwise the method moves to the next instance in the database and repeats the process.

C. Partitioning methods

Partitioning clustering algorithms, such as K-means, K-medoids (PAM), CLARA and CLARANS, assign objects to k clusters (k is a predefined number of clusters) and iteratively reallocate objects to improve the quality of the clustering result. K-means is the most popular and easy-to-understand clustering algorithm.
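The two initialization methods and the basic K-means loop described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the Euclidean distance, and the `max_iter` cap are assumptions, and the rule used by `scs_init` for the third and later seeds (a point must be farther than the threshold from every seed chosen so far) is an assumed generalisation of the two-seed description in the text.

```python
import math
import random


def forgy_init(points, k, rng=random):
    """Forgy's method: choose k data points at random as initial centroids."""
    return rng.sample(points, k)


def scs_init(points, k, threshold):
    """Simple Cluster-Seeking: the first point is the first seed; a later
    point becomes a seed if it is farther than `threshold` from all seeds
    chosen so far (assumed generalisation beyond the second seed)."""
    seeds = [points[0]]
    for p in points[1:]:
        if len(seeds) == k:
            break
        if all(math.dist(p, s) > threshold for s in seeds):
            seeds.append(p)
    return seeds


def kmeans(points, centroids, max_iter=100):
    """Basic K-means: assign every point to its nearest centroid, then move
    each centroid to the mean of its points, until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(axis) / len(c) for axis in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters
```

Either `forgy_init(points, k)` or `scs_init(points, k, threshold)` can supply the `centroids` argument of `kmeans`; points are expected as tuples of numbers.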

Fig 2: K-means clustering

D. Hierarchical methods

Hierarchical clustering algorithms assign objects to tree-structured clusters, i.e., a cluster can contain data points or representatives of lower-level clusters [7]. Hierarchical clustering algorithms can be classified into two categories according to their clustering process: agglomerative and divisive.


Fig 3: Hierarchical Clustering
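As an illustration of the agglomerative (bottom-up) category, the following Python sketch starts with every point in its own cluster and repeatedly merges the two closest clusters. The paper does not specify a linkage criterion, so single linkage (distance between the closest pair of members) and Euclidean distance are assumptions of this sketch, as is the function name.

```python
import math


def agglomerative(points, k):
    """Merge the two closest clusters repeatedly until k clusters remain.
    Single linkage is an assumption; other linkages change only `d` below."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    while len(clusters) > k:
        best = None  # (distance, index_i, index_j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # j > i, so removing cluster j does not shift cluster i.
        clusters[i].extend(clusters.pop(j))
    return clusters
```

Recording the sequence of merges instead of stopping at k clusters would give the full tree; a divisive method would proceed in the opposite direction, starting from one cluster and splitting.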


6. Bisecting K-Means algorithm

Bisecting K-means is like a combination of K-means and hierarchical clustering. It starts with all objects in a single cluster. The pseudocode of the algorithm is given below.

Basic bisecting K-means algorithm for finding K clusters:
1. Pick a cluster to split.
2. Find 2 sub-clusters using the basic K-means algorithm (bisecting step).
3. Repeat step 2, the bisecting step, ITER times and take the split that produces the clustering with the highest overall similarity.
4. Repeat steps 1, 2 and 3 until the desired number of clusters is reached.

The critical part is choosing which cluster to split. There are different ways to proceed: for example, one can choose the biggest cluster, the cluster with the worst quality, or a combination of both.

Let us apply the K-means clustering algorithm to the example below and obtain four clusters.

Food item    Protein content, P    Fat content, F
F1           1.1                   60
F2           8.2                   20
F3           4.2                   35
F4           1.5                   21
F5           7.6                   15
F6           2.0                   55
F7           3.9                   39

Table 1: Food Items

Let us plot these points to get a better understanding of the problem. We can also select the four points which are farthest apart.
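The four bisecting K-means steps listed earlier in this section can be sketched in Python using the food items of Table 1 as input. This is a hedged illustration under stated assumptions: the cluster to split is always the biggest one (one of the options mentioned in the text), "highest overall similarity" is approximated by the lowest sum of squared errors (SSE) of the two halves, and all function names, the `iters` default and the fixed random seed are hypothetical.

```python
import math
import random

FOODS = [(1.1, 60), (8.2, 20), (4.2, 35), (1.5, 21),
         (7.6, 15), (2.0, 55), (3.9, 39)]  # (P, F) pairs from Table 1


def mean(points):
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def sse(cluster):
    """Sum of squared distances from each point to the cluster centroid."""
    c = mean(cluster)
    return sum(math.dist(p, c) ** 2 for p in cluster)


def two_means(points, rng, max_iter=50):
    """Bisecting step: basic K-means with k = 2 and random initial centroids."""
    centroids = rng.sample(points, 2)
    halves = [list(points), []]
    for _ in range(max_iter):
        halves = [[], []]
        for p in points:
            side = int(math.dist(p, centroids[0]) > math.dist(p, centroids[1]))
            halves[side].append(p)
        if not halves[0] or not halves[1]:
            break  # degenerate split; the caller discards it
        new = [mean(h) for h in halves]
        if new == centroids:
            break
        centroids = new
    return halves


def bisecting_kmeans(points, k, iters=5, seed=0):
    rng = random.Random(seed)
    clusters = [list(points)]  # start with all objects in a single cluster
    while len(clusters) < k:
        target = max(clusters, key=len)  # step 1: pick the biggest cluster
        if len(target) < 2:
            break  # nothing left to split
        # Steps 2-3: bisect `iters` times and keep the split with lowest SSE.
        trials = [two_means(target, rng) for _ in range(iters)]
        trials = [t for t in trials if t[0] and t[1]]
        if not trials:
            break
        best = min(trials, key=lambda t: sse(t[0]) + sse(t[1]))
        clusters.remove(target)
        clusters.extend(best)  # step 4: loop until k clusters exist
    return clusters


clusters = bisecting_kmeans(FOODS, k=4)
```

Because the initial centroids are random, the four clusters returned need not match the hand-built grouping in this section; changing `seed` or `iters` may change the result.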


Fig 4: Graph

We see from the graph that the distances between points 1 and 2, 1 and 3, 1 and 4, 1 and 5, 2 and 3, 2 and 4, and 3 and 4 are the largest. Thus, the four clusters chosen are:

Cluster number    Protein content, P    Fat content, F
C1                1.1                   60
C2                8.2                   20
C3                4.2                   35
C4                1.5                   21

Table 2: Clustered items

We also observe that point 1 is close to point 6, so both can be taken as one cluster, called C16. The value of P for the C16 centroid is (1.1 + 2.0)/2 = 1.55 and the value of F is (60 + 55)/2 = 57.50. Upon closer observation, point 2 can be merged with point 5. The resulting cluster is called C25. The value of P for the C25 centroid is (8.2 + 7.6)/2 = 7.9 and the value of F is (20 + 15)/2 = 17.50. Point 3 is close to point 7, so they can be merged into cluster C37. The value of P for the C37 centroid is (4.2 + 3.9)/2 = 4.05 and the value of F is (35 + 39)/2 = 37. Point 4 is not close to any other point, so it is assigned to cluster number 4, i.e., C4, with centroid values P = 1.5 and F = 21. Finally, four clusters with their centroids have been obtained.

Cluster number    Protein content, P    Fat content, F
C16               1.55                  57.50
C25               7.9                   17.5
C37               4.05                  37
C4                1.5                   21

Table 3: Finally obtained clusters
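The centroid arithmetic behind Table 3 can be checked with a few lines of Python. The dictionary and function names below are illustrative only; the data are the P and F values of Table 1 and the grouping is the one described in the text.

```python
# Protein (P) and fat (F) content of each food item, from Table 1.
FOOD_ITEMS = {
    "F1": (1.1, 60), "F2": (8.2, 20), "F3": (4.2, 35), "F4": (1.5, 21),
    "F5": (7.6, 15), "F6": (2.0, 55), "F7": (3.9, 39),
}

# Grouping described in the text: 1 with 6, 2 with 5, 3 with 7, 4 alone.
GROUPS = {
    "C16": ["F1", "F6"],
    "C25": ["F2", "F5"],
    "C37": ["F3", "F7"],
    "C4": ["F4"],
}


def centroid(members):
    """Centroid of a cluster = per-attribute mean of its members."""
    pts = [FOOD_ITEMS[name] for name in members]
    return tuple(round(sum(axis) / len(pts), 2) for axis in zip(*pts))


CENTROIDS = {name: centroid(members) for name, members in GROUPS.items()}
for name, (p, f) in CENTROIDS.items():
    print(name, p, f)
```

The printed P and F values should agree with the rows of Table 3.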


6. Conclusion

Cluster analysis is still an active field of development. A lot of work is still being done in the areas of statistics (mixture models), computer science (data mining, machine learning, nearest neighbor search), pattern recognition, and vector quantization. Many cluster analysis techniques do not have a strong formal basis. While some techniques make use of formal mathematical methods, they often do not work better than more informal methods. Cluster analysis is a rather ad hoc field. Almost all techniques have a number of arbitrary parameters that can be “adjusted” to improve results. It remains to be seen whether this represents a temporary situation or an unavoidable use of problem- and domain-specific heuristics.

7. References

[1] L. Abul, R. Alhajj, F. Polat and K. Barker, “Cluster Validity Analysis Using Subsampling,” in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, Washington DC, Oct. 2003, Volume 2, pp. 1435-1440.
[2] M. Ankerst, M.M. Breunig, H.-P. Kriegel, J. Sander, “OPTICS: Ordering points to identify the clustering structure,” in Proceedings of the ACM SIGMOD Conference, 1999, pp. 49-60.
[3] P. Berkhin, “A Survey of Clustering Data Mining Techniques,” in J. Kogan, C. Nicholas, M. Teboulle (Eds.), Grouping Multidimensional Data, Springer, 2006, pp. 25-72.
[4] C. Baumgartner, C. Plant, K. Kailing, H.-P. Kriegel, P. Kröger, “Subspace Selection for Clustering High-Dimensional Data,” Proc. of the Fourth IEEE International Conference on Data Mining (ICDM’04), 2004, pp. 11-18.
[5] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” Second International Conference on Knowledge Discovery and Data Mining, 1996.
[6] S. Guha, R. Rastogi, K. Shim, “CURE: An efficient clustering algorithm for large databases,” Proc. of the ACM SIGMOD Conference, 1998.
[7] J. Han and M. Kamber, “Data Mining: Concepts and Techniques,” Morgan Kaufmann Publishers, 2001.
[8] M. Halkidi, Y. Batistakis, M. Vazirgiannis, “On Clustering Validation Techniques,” Journal of Intelligent Information Systems, Volume 17 (2/3), 2001, pp. 107-145.
[9] M. Halkidi, Y. Batistakis, M. Vazirgiannis, “Cluster validity methods: Part I and II,” SIGMOD Record, 31, 2002.
[10] Z. Huang, D. W. Cheung and M. K. Ng, “An Empirical Study on the Visual Cluster Validation Method with Fastmap,” Proceedings of DASFAA01, Hong Kong, April 2001, pp. 84-91.
