January 23, 2009

Abstract

Clustering is a popular data mining technique that places data elements into groups of similar behaviour. The traditional clustering algorithm is the so-called k-means algorithm. However, k-means has some well-known problems: it does not work well on clusters without well-defined centers, the number k of clusters has to be chosen upfront, and different initial centers can lead to different final clusters. In recent years, spectral clustering has become popular and widely used since its results often outperform those of the k-means algorithm. Spectral clustering is a more advanced algorithm than k-means, as it uses several mathematical concepts (degree matrices, weight matrices, similarity matrices, similarity graphs, graph Laplacians, eigenvalues and eigenvectors) in order to place similar data points in the same group and dissimilar data points in different groups. This report describes how spectral clustering works: it explains the individual steps and shows a possible implementation. Furthermore, it shows the behaviour of the algorithm on a few selected datasets. Finally, it draws a conclusion based on the observations obtained from the experiments.


Spectral Clustering

Contents

1 Introduction
2 Implementation
3 Strengths and weaknesses
  3.1 Spherical, well separated clusters
    3.1.1 Clustering visualization
    3.1.2 Reasoning
    3.1.3 Impact of changing the parameter k
  3.2 Arbitrary shaped clusters
    3.2.1 Spiral rings
    3.2.2 3D lines
    3.2.3 Two “parabolas”
  3.3 Mixture of different types of clusters
    3.3.1 Two spheres and a line
    3.3.2 Two spheres and a spiral ring
    3.3.3 Two lines, a sphere and a spiral
  3.4 Impact of outliers
    3.4.1 Clustering visualization
    3.4.2 Reasoning
  3.5 Impact of noise
    3.5.1 Clustering visualization
    3.5.2 Reasoning
4 Conclusion
5 Appendix
  5.1 Prerequisites
  5.2 Installation instructions
  5.3 User manual
References


List of Figures

1 Visualization of the spherical clusters example
2 Spherical clusters: outcomes with k = 1 . . . 5
3 Spiral rings: clustering visualization of the k-means algorithm
4 Spiral rings: clustering visualization of the spectral clustering algorithm
5 Clustering outcome of the 3D lines of the k-means algorithm
6 Clustering 3D lines with k = 2 and σ = 12 (left), σ = 14 (right)
7 3D lines: computed eigenvalues with σ = 14 (left) and σ = 16 (right)
8 Parabolas, clustering outcomes of the k-means algorithm
9 Parabolas, clustering outcomes. From left to right: with σ = 1, 20, 28
10 Parabolas, computed eigenvalues with σ = 20 (left) and σ = 28 (right)
11 Mixture of spheres and line: Clustering visualization of the k-means algorithm
12 Mixture of spheres and line: Clustering visualization of the spectral clustering algorithm
13 Mixture of spheres and spiral ring: Clusters detected by using only k-means
14 Mixture of spheres and spiral ring: Clusters detected by the spectral clustering algorithm
15 Mixture of lines, sphere and spiral: Resulting clusters of the spectral clustering algorithm
16 Clusters with outliers
17 Spheres with 0% noise
18 Spheres with 5% noise
19 Spheres with 10% noise
20 Spheres with 40% noise
21 Spheres with 80% noise
22 Spheres with 100% noise
23 Initial startup window of the XVDM system loading the spectral clustering module
24 Example of loading the spiral rings dataset
25 Spectral clustering module perspective
26 User’s dataset entry when loading a custom dataset from the file system

1 Introduction

There exist several versions of the spectral clustering algorithm. The most common ones take as input a similarity matrix $S \in \mathbb{R}^{n \times n}$ (where n denotes the number of data points in the set) and the number k of clusters to construct. The similarity matrix holds the similarities $s_{ij} \geq 0$ between all pairs of points $x_i$ and $x_j$ in the dataset. The similarities can be computed by using a similarity function (e.g. the Gaussian similarity function). However, one could also start from a distance function (e.g. Euclidean distance), since a small distance corresponds to a high similarity and vice versa.

Based on the similarity matrix, the spectral clustering algorithm first builds a similarity graph. The most common similarity graphs are the following:

• ε-neighborhood graph: This kind of graph connects all points whose pairwise distances (dissimilarities) are smaller than ε. The ε-neighborhood graph is an undirected graph.
• k-nearest neighbor graph: In this graph, a vertex $v_i$ is connected with vertex $v_j$ when $v_j$ is among the k nearest neighbors of $v_i$. This definition leads to a directed graph; however, there exist two ways to convert it into an undirected graph. The simplest method is to ignore the directions of the edges. This kind of graph is denoted as k-nearest neighbor graph. The other possibility is to connect two vertices $v_i$ and $v_j$ only if $v_i$ is among the k nearest neighbors of $v_j$ and vice versa. This kind of graph is denoted as mutual k-nearest neighbor graph.
• Fully connected graph: In this graph, all points with positive similarities are connected.

Once the similarity graph has been constructed, the spectral clustering algorithm computes its corresponding graph Laplacian matrix. Graph Laplacians are used in spectral graph theory to derive many properties of a graph (e.g. the number of connected components). There exist two types of graph Laplacians: unnormalized graph Laplacians and normalized graph Laplacians.
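The three kinds of similarity graph described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the implementation used in this report: the function names are hypothetical, the ε- and k-nearest-neighbor graphs are returned as unweighted 0/1 adjacency matrices, self-loops are removed, and the fully connected graph uses the Gaussian similarity function as its weight.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance between every pair of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def epsilon_graph(X, eps):
    """Connect all points whose pairwise distance is smaller than eps."""
    A = (pairwise_distances(X) < eps).astype(float)
    np.fill_diagonal(A, 0.0)  # assumption: no self-loops
    return A

def knn_graph(X, k, mutual=False):
    """(Mutual) k-nearest neighbor graph as a symmetric 0/1 matrix."""
    D = pairwise_distances(X)
    np.fill_diagonal(D, np.inf)  # a point is not its own neighbor
    nearest = np.argsort(D, axis=1)[:, :k]
    n = X.shape[0]
    directed = np.zeros((n, n), dtype=bool)
    directed[np.repeat(np.arange(n), k), nearest.ravel()] = True
    if mutual:
        undirected = directed & directed.T  # both directions required
    else:
        undirected = directed | directed.T  # ignore edge directions
    return undirected.astype(float)

def fully_connected_graph(X, sigma):
    """Gaussian-weighted graph connecting every pair of points."""
    D = pairwise_distances(X)
    S = np.exp(-D ** 2 / (2.0 * sigma ** 2))
    np.fill_diagonal(S, 0.0)  # assumption: no self-loops
    return S
```

The mutual variant can only remove edges relative to the plain k-nearest neighbor graph, which is why it tends to keep regions of different density apart.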
The unnormalized graph Laplacian L is obtained as follows:

$$L = D - S \qquad (1)$$

where D is the degree matrix, defined as the diagonal matrix with the degrees $d_i = \sum_{j=1}^{n} s_{ij}$ on its diagonal, and S is the similarity matrix. One important property of the unnormalized graph Laplacian is that the multiplicity of the eigenvalue 0 of L corresponds to the number k of connected components $A_1, \ldots, A_k$ of the graph. In addition, the eigenspace of the eigenvalue 0 is spanned by the indicator vectors of those components (so all eigenvectors are piecewise constant). On the other hand, there exist two matrices that are defined as normalized graph Laplacians:

$$L_{sym} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} S D^{-1/2} \qquad (2)$$

$$L_{rw} = D^{-1} L = I - D^{-1} S \qquad (3)$$

The first matrix (equation 2) is obtained through a symmetric normalization, while the second one (equation 3) is the result of performing a row sum (random walk) normalization on the unnormalized Laplacian. These two matrices have spectral properties similar to those of the unnormalized graph Laplacian, except that for $L_{sym}$ the eigenspace of the eigenvalue 0 is spanned by $D^{1/2}$ multiplied with the indicator vectors (meaning that the eigenvectors are not piecewise constant).

After the computation of a graph Laplacian, the spectral clustering algorithm derives its first k eigenvectors (i.e. the eigenvectors corresponding to the k smallest eigenvalues) and builds a new matrix $V \in \mathbb{R}^{n \times k}$ with the eigenvectors as columns. Finally, the rows of V are interpreted as new data points $z_i \in \mathbb{R}^k$. This change of representation enhances the cluster properties in the data and allows the k-means clustering algorithm to easily detect the clusters afterwards.

Nevertheless, there are several choices that must be made and parameters that must be set when using spectral clustering. These choices, which may have an impact on the final clusters, are the following:

• Choosing a similarity function
• Choosing the similarity graph
• Defining the parameter k when using the k-nearest neighbor graph, or the parameter ε when using the ε-neighborhood graph
• Selecting the number k of clusters
• Selecting one of the three graph Laplacians
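The three graph Laplacians of equations 1 to 3 can be sketched as follows. The function name `graph_laplacians` and the toy similarity matrix are assumptions made for illustration; the sketch also assumes a symmetric S in which every vertex has a positive degree. The toy graph has two connected components, so the eigenvalue 0 of L is expected to have multiplicity 2.

```python
import numpy as np

def graph_laplacians(S):
    """Return (L, L_sym, L_rw) for a symmetric similarity matrix S."""
    d = S.sum(axis=1)                      # degrees d_i = sum_j s_ij
    identity = np.eye(len(d))
    L = np.diag(d) - S                     # equation 1
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = identity - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]  # equation 2
    L_rw = identity - S / d[:, None]       # equation 3
    return L, L_sym, L_rw

# Toy graph with two connected components: {0, 1} and {2, 3, 4}
S = np.array([
    [0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 0, 1],
    [0, 0, 1, 1, 0],
], dtype=float)

L, L_sym, L_rw = graph_laplacians(S)
eigenvalues = np.linalg.eigvalsh(L)        # L is symmetric
num_zero = int(np.sum(np.abs(eigenvalues) < 1e-10))
print(num_zero)  # prints 2, the number of connected components
```

`np.linalg.eigvalsh` is used because L and L_sym are symmetric; L_rw is generally not symmetric, so it would need `np.linalg.eig` instead.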

2 Implementation

For the spectral clustering implementation, the normalized spectral clustering according to Ng, Jordan and Weiss (2002) has been used. The algorithm takes as input a similarity matrix $S \in \mathbb{R}^{n \times n}$ and a number k indicating the number of resulting clusters. It can then be decomposed into the following steps:

• Construction of the similarity graph
  As already mentioned, there exist different kinds of similarity graphs. For this specific project, the fully connected graph with the Gaussian similarity function

  $$s(x_i, x_j) = e^{-\frac{\|x_i - x_j\|^2}{2\sigma^2}} \qquad (4)$$

  has been used, where σ controls the width of the neighborhoods and $\|x_i - x_j\|$ is the Euclidean distance, which between two points $P = (p_1, p_2, \ldots, p_n)$ and $Q = (q_1, q_2, \ldots, q_n)$ is defined as

  $$\sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \ldots + (p_n - q_n)^2} = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \qquad (5)$$

• Computation of the normalized Laplacian matrix $L_{sym}$
  The normalized Laplacian matrix is computed according to equation 2, where
  – I denotes the identity matrix
  – D denotes the degree matrix, which is defined as the diagonal matrix with degrees $d_1, d_2, \ldots, d_n$ and

    $$d_i = \sum_{j=1}^{n} s_{ij} \qquad (6)$$

    with $s_{ij} \in S$
  – S denotes the similarity matrix

• Computation of the first k eigenvectors $v_1, \ldots, v_k$ of $L_{sym}$

• Normalization step
  Let $V \in \mathbb{R}^{n \times k}$ be the matrix containing the computed eigenvectors $v_1, \ldots, v_k$ of $L_{sym}$ as columns. The matrix $U \in \mathbb{R}^{n \times k}$ is then constructed from V by normalizing each row to have norm 1, such that

  $$u_{ij} = \frac{v_{ij}}{\left(\sum_{k} v_{ik}^2\right)^{1/2}} \qquad (7)$$

• Clustering using k-means
  The k-means algorithm takes the rows of the matrix U as input data points and computes the k clusters.

The output of the spectral clustering algorithm are the clusters $A_1, \ldots, A_k$.
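The steps listed above can be put together in a compact NumPy sketch. It is a minimal illustration under stated assumptions and not the XVDM module itself: all function names are hypothetical, the k-means step is a plain Lloyd iteration with a deterministic farthest-point initialization (chosen here only to make the example reproducible), and the test data are two concentric rings generated on the spot.

```python
import numpy as np

def gaussian_similarity(X, sigma):
    """Fully connected similarity matrix according to equation 4."""
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm with farthest-point initialization (an assumption)."""
    centers = [points[0]]
    for _ in range(1, k):
        dist = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def spectral_clustering(X, k, sigma):
    S = gaussian_similarity(X, sigma)                  # similarity graph
    d = S.sum(axis=1)                                  # degrees, equation 6
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = np.eye(len(X)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    _, eigenvectors = np.linalg.eigh(L_sym)            # eigenvalues ascending
    V = eigenvectors[:, :k]                            # first k eigenvectors
    U = V / np.linalg.norm(V, axis=1, keepdims=True)   # row normalization, eq. 7
    return kmeans(U, k)                                # cluster the rows of U

# Hypothetical test data: two concentric rings with 50 points each
angles = np.linspace(0.0, 2.0 * np.pi, 50, endpoint=False)
ring = np.column_stack([np.cos(angles), np.sin(angles)])
X = np.vstack([1.0 * ring, 5.0 * ring])
labels = spectral_clustering(X, k=2, sigma=0.5)
print(labels[:50], labels[50:])
```

With these particular values the between-ring similarities are vanishingly small, so each ring should end up with a single label; for other datasets σ would have to be tuned, as discussed in section 3.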

3 Strengths and weaknesses

In this section the strengths and weaknesses of the spectral clustering algorithm will be explained on a series of examples.

3.1 Spherical, well separated clusters

This section shows a simple 3-dimensional dataset that falls in the application area of the spectral clustering algorithm, i.e. the algorithm performs well and identifies all of the dataset’s clusters. The dataset consists of 27 data points and contains neither outliers nor noise points.

3.1.1 Clustering visualization

The following figure shows the resulting clusters by running the algorithm on the dataset (with k = 3 and σ = 1):


Figure 1: Visualization of the spherical clusters example

3.1.2 Reasoning

Even visually, it can be seen that the dataset can be grouped into three clusters. The clusters are clearly separated from each other and the distances between the data points within the same cluster are very small. As a consequence, the Gaussian similarity between the data points within the same cluster is relatively high. This allows the spectral clustering algorithm to clearly identify the different clusters. In fact, on this “optimal” dataset even just running the k-means algorithm gives the same result (this has been tested with the XVDM system).

3.1.3 Impact of changing the parameter k

This section shows the impact on the clustering performance when the parameter k (number of clusters) changes. Figure 2 demonstrates the behaviour of the spectral clustering algorithm when increasing k, starting from k = 1 (top left figure) up to k = 5 (bottom most figure).


Figure 2: Spherical clusters: outcomes with k = 1 . . . 5

The number k corresponds to the number of clusters that have to be identified in the dataset, i.e. the number of centroids that are placed on the data. Therefore, in the simplest case, where k = 1, only one cluster will be identified (see the top-left plot of figure 2). Increasing k to 2 improves the quality of the clustering, since the algorithm now correctly identifies one of the three visually recognizable clusters (the cluster in the bottom left corner). The other two clusters are still recognized as a single one, since the second centroid is placed in between the two. Setting k = 3 matches all the clusters correctly and produces the expected result. Increasing k further decreases the clustering quality again, since additional groupings within the clusters are now identified.

3.2 Arbitrary shaped clusters

The datasets presented in this section are examples of datasets where the spectral clustering algorithm works, but where just running the k-means algorithm would usually fail to identify the correct clusters.

3.2.1 Spiral rings

This 2-dimensional dataset consists of 100 data points and contains two spiral rings. Intuitively, the outer and the inner ring should be grouped into separate clusters. In contrast to the example shown in section 3.1, this is a case where just running the k-means algorithm fails to produce the correct clusters.

Clustering visualization
The following figure shows the results of using only the k-means algorithm (with k = 2):

Figure 3: Spiral rings: clustering visualization of the k-means algorithm

The results of the spectral clustering algorithm (with k = 2 and σ = 9) are as expected:


Figure 4: Spiral rings: clustering visualization of the spectral clustering algorithm

Reasoning
The k-means algorithm is not able to recognize this rather complex cluster shape because it positions the two centroids in the centers of the left and right halves (or of the upper and lower halves, depending on where the centroids have initially been positioned) of the Cartesian space. As a consequence, the upper (left) halves of the two rings are grouped into one cluster, whereas the lower (right) halves of the two rings are grouped into the other cluster. The spectral clustering algorithm overcomes this problem by constructing the Laplacian matrix from a similarity graph that has been created for the original data points. Afterwards, the two eigenvectors belonging to the two smallest eigenvalues are computed. These two eigenvectors characterize the connected components of the similarity graph, i.e. each element $f_i \in \mathbb{R}$ of an eigenvector indicates whether the i-th original data point belongs to a certain connected component. The two eigenvectors can be assembled into an n x k matrix, and the rows of this matrix can be treated as new data points to be clustered with k-means. The following figure shows the plot of the two normalized eigenvectors for the two spiral rings:

As one can see, this new representation of the data points can now easily be clustered by using k-means.

3.2.2 3D lines

This section shows the clustering of two (nearly crossing) 3D lines.

Clustering visualization
Figure 5 shows the clustering outcome of running just the k-means algorithm on the given dataset.

Figure 5: Clustering outcome of the 3D lines of the k-means algorithm

The following figures show the clustering outcomes of the spectral clustering algorithm depending on the different configurations of σ.

Figure 6: Clustering 3D lines with k = 2 and σ = 12 (left), σ = 14 (right)


Reasoning
As can be seen from figure 5, the k-means algorithm has problems detecting the correct clusters. The spectral clustering algorithm is not immediately able to clearly separate the two clusters either (see the left clustering visualization of figure 6). The points where the two lines overlap on the z-axis are not associated with the correct cluster, i.e. in figure 6 they are associated with the green line instead of the red one. Plotting the computed eigenvectors shows the reason for this behaviour:

Looking at this chart, one is not able to clearly identify two distinct clusters. Likewise, the k-means algorithm is not able to detect any clusters when run on these eigenvectors. Points around y = 0 are associated “randomly” with one of the clusters, producing an invalid clustering (see the left outcome of figure 6). This problem can be solved by “tuning” the σ parameter. The effect of increasing σ on the computed eigenvectors can be seen below.

Figure 7: 3D lines: computed eigenvalues with σ = 14 (left) and σ = 16 (right)

The two charts show how the two clusters start to become visible when increasing the σ parameter, so that the final clustering quality increases as well. In fact, the k-means step on the transformed data already returns the right result with σ = 14, as can be seen in the right outcome of figure 6. As already mentioned in section 2, σ controls the width of the neighbourhoods: the similarity of two data points decays exponentially with their squared distance, scaled by σ. A small value of σ therefore heavily penalises points that are far away from each other, whereas a larger value lets more distant points still count as similar.

3.2.3 Two “parabolas”

This section shows two “parabola”-like figures. It is again an example where the σ parameter has to be adjusted carefully in order to achieve an acceptable clustering result.

Clustering visualization
The following figure demonstrates the outcome when just using the k-means algorithm for performing the clustering:

Figure 8: Parabolas, clustering outcomes of the k-means algorithm

Figure 9 instead shows the outcome of using the spectral clustering algorithm.


Figure 9: Parabolas, clustering outcomes. From left to right: with σ = 1, 20, 28

Reasoning
As in the previous section 3.2.2, the spectral clustering algorithm is not able to cluster the given dataset immediately. Only with a σ value larger than 26 can the two clusters be separated successfully.
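To get a feeling for why the choice of σ matters so much, one can tabulate the Gaussian similarity of equation 4 for a few distances. The sketch below is only a numerical illustration: the three distances are hypothetical and are not taken from the parabola dataset, while the σ values are the ones shown in figure 9.

```python
import numpy as np

def gaussian_similarity(distance, sigma):
    """Gaussian similarity of equation 4 as a function of the distance."""
    return np.exp(-distance ** 2 / (2.0 * sigma ** 2))

distances = np.array([1.0, 10.0, 30.0])   # hypothetical point distances
for sigma in (1.0, 20.0, 28.0):           # sigma values used in figure 9
    similarities = gaussian_similarity(distances, sigma)
    print(f"sigma = {sigma:5.1f}:", np.round(similarities, 4))
```

With σ = 1, only points at a distance of about 1 or less keep a noticeable similarity, so the fully connected graph effectively links each point to its immediate neighbours only. With σ = 20 or 28, points that are 10 units apart are still strongly connected, which changes the structure the eigenvectors can pick up.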


Figure 10: Parabolas, computed eigenvalues with σ = 20 (left) and σ = 28 (right)

3.3 Mixture of different types of clusters

The datasets presented in this section mix clusters of different shapes (e.g. spiral rings, spheres, lines, cones). As we will see, the spectral clustering algorithm has no difficulties in recognizing the correct clusters.

3.3.1 Two spheres and a line

This 3-dimensional dataset consists of 150 data points and contains two spheres and one line. Visually, it should be grouped into three clusters, i.e. one for each sphere and one for the line. Again, this is a case where just running the k-means algorithm fails to produce the correct clusters.

Clustering visualization
The following figure shows the results of using only the k-means algorithm (with k = 3):


Figure 11: Mixture of spheres and line: Clustering visualization of the k-means algorithm

The results returned by the spectral clustering algorithm (with k = 3 and σ = 1) are as expected:

Figure 12: Mixture of spheres and line: Clustering visualization of the spectral clustering algorithm

Reasoning
The k-means algorithm is not able to recognize this cluster shape. Some points at the left end of the line are assigned to the cluster formed by the left sphere. The reason for this is that the centroid of that cluster is nearer to these data points than the centroid that attracts most points of the line. The spectral clustering algorithm, on the other hand, is able to cluster this dataset correctly. Again, this is achieved by transforming the dataset into a new representation (i.e. a 3-dimensional representation, since we are grouping the data points into 3 clusters) that exhibits the characteristics of the original dataset. This new representation of the dataset can afterwards be clustered more easily by using the k-means algorithm.

3.3.2 Two spheres and a spiral ring

This 3-dimensional dataset consists of 200 data points and contains two spheres and one spiral ring. Visually, it should be grouped into three clusters, i.e. one for each sphere and one for the spiral ring. As one will see, the results of the spectral clustering algorithm are as expected, whereas the k-means algorithm fails to recognize the correct clusters.

Clustering visualization
The following figure shows the results of using only the k-means algorithm (with k = 3):

Figure 13: Mixture of spheres and spiral ring: Clusters detected by using only k-means

On the other hand, the spectral clustering algorithm groups the data points into the correct clusters (with k = 3 and σ = 14):


Figure 14: Mixture of spheres and spiral ring: Clusters detected by the spectral clustering algorithm

Reasoning
Due to the spiral ring, the k-means algorithm is not able to produce the correct clusters. Again, the spectral clustering algorithm is able to detect the correct clusters by using the transformed dataset.

3.3.3 Two lines, a sphere and a spiral

This 3-dimensional dataset consists of 200 data points and contains two lines, one sphere and one spiral. Visually, one is able to group the data points into 4 clusters, i.e. one for each line, one for the sphere and one for the spiral. Both the spectral clustering and the k-means algorithm are able to detect the correct clusters contained in this dataset.

Clustering visualization
The following figure shows the results of the spectral clustering algorithm (with k = 4 and σ = 2):


Figure 15: Mixture of lines, sphere and spiral: Resulting clusters of the spectral clustering algorithm

Reasoning
In this case, both clustering algorithms are able to detect the correct clusters, since the distances between the clusters are quite large and, as a consequence, the similarities between data points in different clusters are quite low.

3.4 Impact of outliers

We have seen that the spectral clustering algorithm works well on certain cluster shapes when neither noise nor outliers are present. It is also able to recognize complex cluster shapes which cannot be recognized by using only the k-means algorithm. This section now shows a simple 3-dimensional dataset that consists of 15 data points, one of which is an outlier.

3.4.1 Clustering visualization

The following figure shows the resulting clusters by running the algorithm on the dataset (with k = 2 and σ = 1):


Figure 16: Clusters with outliers

3.4.2 Reasoning

The spectral clustering algorithm is not able to cluster this dataset correctly: the outlier point makes up a separate cluster. The reason for this is that the transformed representation of the data points still contains the outlier. When the k-means algorithm is then run on this transformed representation, one centroid lies exactly on the outlier point. The following figure shows the plot of the two normalized eigenvectors for this dataset:

3.5 Impact of noise

This section shows a 3-dimensional dataset that consists of 200 data points and contains uniform noise. Visually, it can be grouped into two clusters that are made up of the two spheres. It is an example where the k-means algorithm as well as the spectral clustering algorithm fail to properly identify the clusters.

3.5.1 Clustering visualization

The analysis uses a dataset consisting of two spheres which are clearly separated and can be identified by the spectral clustering algorithm without any problems. Then noise is added in the form of 50 uniformly distributed data points. Starting from a situation with 0% of noise, the percentage is increased gradually in order to analyse the effect on the clustering quality. The following figures demonstrate this progression: for each percentage level of noise, the clustering outcome is shown together with the computed normalized eigenvectors. The clustering parameters are set as follows: k = 2, σ = 2.

Figure 17: Spheres with 0% noise

Figure 18: Spheres with 5% noise


Figure 19: Spheres with 10% noise

Figure 20: Spheres with 40% noise


Figure 21: Spheres with 80% noise

Figure 22: Spheres with 100% noise

3.5.2 Reasoning

Figure 17 shows that the algorithm has no problems in identifying the two clusters. The computed eigenvectors are clearly separated from each other, so the spectral clustering algorithm can easily handle this case. In fact, even just running the k-means algorithm alone on the original dataset would produce the same result. Adding 5% of noise to the dataset (figure 18) has the effect that the noise data points are associated with the nearest cluster (in this specific case with the green cluster). The two main spheres are however still correctly recognized as separate clusters. With 40% of noise this does not change much either (see figure 20). Increasing the noise percentage further (see figures 21 and 22), however, has a dramatic impact on the clustering quality. The algorithm is no longer able to clearly separate the clusters, and even the two main sphere clusters get split up, since some points of the original “green” cluster are now associated with the “red” one. This is due to the fact that the added noise data points are not eliminated when the spectral clustering algorithm transforms the original dataset. They “blur” the dataset such that the transformation by computing the eigenvectors does not give any improvement. As can be seen from the corresponding illustrations of the computed eigenvectors (on the right of the figures above), the two clusters start to disappear with increasing noise. From this it can be stated that if the dataset contains a high proportion of noise, running the spectral clustering algorithm gives no improvement compared to just running the k-means algorithm directly on the original dataset. They both produce the same output and fail to correctly identify the clusters. When changing the parameter k (i.e. the number of computed clusters, which corresponds to the number of placed centroids), the noise points may even make up separate clusters.
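An experiment of this kind could be set up along the following lines. The sketch rests on several assumptions that are not confirmed by the report: the helper name `add_uniform_noise` is hypothetical, a noise level of 100% is taken to mean all 50 noise points, and the noise is drawn uniformly from the bounding box of the clean data.

```python
import numpy as np

def add_uniform_noise(X, fraction, max_noise_points=50, seed=0):
    """Append uniformly distributed noise points to the dataset X.

    fraction is the noise level between 0.0 and 1.0; a level of 1.0 adds
    max_noise_points points (an assumption about how the percentages in
    figures 17 to 22 are meant).
    """
    rng = np.random.default_rng(seed)
    n_noise = int(round(fraction * max_noise_points))
    low, high = X.min(axis=0), X.max(axis=0)   # bounding box of the clean data
    noise = rng.uniform(low, high, size=(n_noise, X.shape[1]))
    return np.vstack([X, noise]), n_noise

# Hypothetical clean data: two well separated 3-dimensional blobs
rng = np.random.default_rng(1)
clean = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(75, 3)),
    rng.normal(loc=5.0, scale=0.5, size=(75, 3)),
])

for level in (0.0, 0.05, 0.10, 0.40, 0.80, 1.00):
    noisy, n_noise = add_uniform_noise(clean, level)
    print(f"{level:4.0%} noise: {n_noise} noise points, {len(noisy)} points in total")
```

Each noisy dataset can then be passed both to k-means and to spectral clustering with k = 2 and σ = 2 in order to compare the two outcomes per noise level.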

4 Conclusion

After conducting a series of experiments on selected datasets, we can conclude that the spectral clustering algorithm works well in the ideal case where neither noise nor outliers are present. The algorithm also works on certain cluster shapes (i.e. non-convex clustering structures) which cannot be recognized by using only the k-means algorithm. This has been shown with the datasets presented in sections 3.2 and 3.3.

Besides these good results, the algorithm also has some drawbacks. First of all, it is not robust enough against noise and outliers, since it is not able to filter them out. As a consequence, a small set of outlier and noise points may make up a cluster of its own, which prevents the algorithm from returning the correct clusters. This has been demonstrated with the datasets presented in sections 3.4 and 3.5. Furthermore, we experienced that the choice of σ can have a significant impact on the performance of the spectral clustering algorithm. For instance, with a value of σ less than 9, the algorithm was not able to recognize the two spiral rings presented in section 3.2; instead, its performance was equivalent to that of the k-means algorithm. Finding the best value for σ is therefore not a trivial task, since it may differ from dataset to dataset. In the experiments, the best value for σ has been chosen visually, by trying out different values and then selecting the one that produced the best clustering results. Another disadvantage is that the number of clusters (i.e. k) must be specified upfront. Quite often, the parameter k cannot be determined in advance and, as we have seen in section 3.1, it affects the quality of the resulting clusters.

5 Appendix

This section includes the installation instructions of the system as well as a short manual on how to use it.

5.1 Prerequisites

The Spectral Clustering application is a module written for the XVDM system¹. In order to run XVDM on Linux (Ubuntu or Debian based distribution), the following packages have to be installed: libtool, automake, subversion, libgsl0-dev, libglut3-dev, libgl1-mesa-dev, libncurses5-dev. The following command should install these necessary packages:

    sudo apt-get install g++ libtool automake1.9 subversion libgsl0-dev libglut3-dev libgl1-mesa-dev libncurses5-dev

The XVDM system can be downloaded from its official website², where further installation instructions can be found.

5.2 Installation instructions

Unzip the archive containing the XVDM system including the spectral clustering module by using the following command:

    tar xvfz xvdm_spectral.tar.gz

Step into the extracted folder “xvdm_spectral” by typing

    cd xvdm_spectral

and execute first

    autoreconf

followed by

    ./configure

After that, type

    make all

to compile the system. For running the XVDM application, refer to the user manual in section 5.3.

5.3 User manual

Once the XVDM system together with the spectral clustering module is installed, it can be launched from an arbitrary shell. To do so, navigate (within your shell) to the installation directory of XVDM and then into the “spectral” folder, e.g. if XVDM has been unzipped in the home folder, the following command has to be executed:

    cd xvdm_spectral/spectral

The folder contains an executable named “SpectralClustering” which has to be launched as follows:

    ./SpectralClustering

¹ High-dimensional visual data mining tool developed at the faculty of Computer Science at the Free University of Bolzano/Bozen
² https://www.inf.unibz.it/dis/wiki/doku.php?id=xvdm:xvdm

Figure 23: Initial startup window of the XVDM system loading the spectral clustering module

As a result, the window shown in figure 23 should pop up. On the left side it shows the navigation panel, containing first a set of data primitives that can be used and then a section with predefined datasets for the spectral clustering module. Most of these datasets have been used in section 3 for the analysis of the strengths and weaknesses of the algorithm. When choosing one of these datasets, e.g. the “spiral rings” dataset, the corresponding data points will be shown in the central area, together with an additional menu on the right, which allows specifying some further properties of the dataset (x,y orientation, the number of data points, etc.).


Figure 24: Example of loading the spiral rings dataset

Once all of the dataset’s properties have been adjusted accordingly, the spectral clustering menu can be opened by clicking on the “Spectral Clustering” button on the left navigation panel. As a consequence, the left navigation panel changes and now shows the specific controls for the spectral clustering module.

Figure 25: Spectral clustering module perspective

It allows adjusting different input parameters and the output behaviour of the application:

• Show Database: Shows the original data points.
• Number of clusters: The number of clusters that should be identified by the spectral clustering algorithm.
• Sigma: The σ input parameter to be used by the algorithm. See section 2 for more details.
• Show debug info: If activated, prints the different steps of the computation on the shell which has been used to launch XVDM. Attention: do not use this with larger datasets, since it slows down the computation process.
• Initialize: This button has to be pressed when loading a new dataset and before actually running the clustering algorithm. It initializes the new data points and removes the old ones.
• Spectral clustering: Computes the given number of clusters out of the given dataset by using the spectral clustering algorithm.
• k-Means: Computes the given number of clusters out of the given dataset by using the k-means algorithm.

Besides composing datasets from the predefined XVDM data primitives or using the given spectral clustering datasets, there is also the possibility to load a custom dataset from the file system by launching XVDM as shown in this example:

    ./SpectralClustering ../myDataFiles/testDataSet1.csv

This will launch the XVDM system as usual, with the difference that there will be an additional dataset entry in the “Spectral Clustering” section of the left navigation panel called “User’s dataset” (see figure 26).


Figure 26: User’s dataset entry when loading a custom dataset from the file system

The input data file should have the following format:

    "1","2","3"
    0.462828,0.429717,0.496856
    0.50166,0.43534,0.501993
    0.494636,0.447637,0.518405
    0.467981,0.434828,0.529833
    0.451618,0.441443,0.544336
    0.436667,0.457352,0.548572
    0.426574,0.472266,0.5686
    0.416835,0.45343,0.566875
    ..
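For checking a custom data file before loading it into XVDM, the format above can be read with a few lines of Python. This is a sketch only: the function name `load_dataset` is hypothetical, and it assumes that the first line holds the quoted column names and every following non-empty line holds comma-separated floating point values.

```python
import csv
import io

import numpy as np

def load_dataset(source):
    """Read a header line with column names followed by rows of floats."""
    reader = csv.reader(source)
    header = next(reader)                                  # e.g. ['1', '2', '3']
    rows = [[float(value) for value in row] for row in reader if row]
    return header, np.array(rows)

# In-memory sample using the first two data rows shown above
sample = io.StringIO(
    '"1","2","3"\n'
    "0.462828,0.429717,0.496856\n"
    "0.50166,0.43534,0.501993\n"
)
header, data = load_dataset(sample)
print(header, data.shape)  # prints ['1', '2', '3'] (2, 3)
```

For a file on disk, the same function can be called on an open file handle, e.g. `with open(path, newline="") as f: header, data = load_dataset(f)`.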

References

[1] Von Luxburg, U., A Tutorial on Spectral Clustering, Max-Planck-Institute for Biological Cybernetics, TR-149, August 2006.
[2] Von Luxburg, U., Lecture slides on Clustering, Max-Planck-Institute for Biological Cybernetics, http://velblod.videolectures.net/2007/pascal/bootcamp07_vilanova/luxburg_von_ulrike/luxburg_clustering_lectures.pdf, July 2007.
