Consensus Spectral Clustering in Near-Linear Time
Dijun Luo, Chris Ding, Heng Huang, Feiping Nie
Department of Computer Science and Engineering, The University of Texas at Arlington
701 S. Nedderman Drive, Arlington, Texas, USA

Abstract—This paper addresses the scalability issue in spectral analysis, which has been widely used in data management applications. Spectral analysis techniques enjoy powerful clustering capability but suffer from high computational complexity. In most previous research, the bottleneck of spectral analysis stems from the construction of the pairwise similarity matrix among objects, which costs at least O(n^2), where n is the number of data points. In this paper, we propose a novel estimator of the similarity matrix using a K-means accumulative consensus matrix, which is intrinsically sparse. The computational cost of the accumulative consensus matrix is O(n log n). We further develop a Non-negative Matrix Factorization approach to derive the clustering assignment. The overall complexity of our approach remains O(n log n). In order to validate our method, we (1) theoretically show the local-preserving and convergence properties of the similarity estimator, (2) validate it on a large number of real-world datasets and compare the results to other state-of-the-art spectral analysis methods, and (3) apply it to large-scale data clustering problems. Results show that our approach uses much less computational time than other state-of-the-art clustering methods while providing comparable clustering quality. We also successfully apply our approach to a 5-million-point dataset on a single machine in reasonable time. Our techniques open a new direction for high-quality large-scale data analysis.

I. Introduction

Clustering is one of the most widely used techniques for data analysis, with applications ranging from statistics, computer science, and biology to social sciences and psychology, and it is typically the first step of unsupervised data analysis. Clustering is a procedure of partitioning objects into groups (clusters) such that objects in the same group are similar while objects in different groups are dissimilar, according to certain subjective or objective criteria [1]. Among all clustering approaches, graph-based methods are considered effective ones due to their rich theoretical foundations, sound empirical performance, and easy implementation [2], [3], [4], [5], [6], [7], [8]. Because of its state-of-the-art clustering performance, spectral clustering has been applied in various areas [9], [2], [10]. Unfortunately, when the number of data instances (denoted as n) is large, spectral clustering approaches encounter a quadratic resource bottleneck in computing the pairwise similarities among the n data instances [5], [11], and in storing the large similarity matrix. Typically, graph-based spectral clustering includes the following four steps: (1) construct the similarity matrix, (2) calculate the Laplacian matrix [12], (3) compute the eigenvectors of the Laplacian matrix, and (4) perform K-means clustering on the eigenvectors of the Laplacian matrix. Although the eigenvector

decomposition of a sparse matrix can be computed efficiently [13], i.e., the cost of step (3) can be lowered, the intrinsic difficulty of these approaches still lies in the construction of the similarity matrix, which costs at least O(n^2) in time. However, in recent years many applications involve intrinsically large-scale data, e.g., protein family detection [14], information retrieval [15], [16], and other large machine learning and data mining applications [11]. Spectral clustering approaches are prohibitive on such very large-scale datasets due to their high computational complexity (O(n^2)). Considering the sound quality of spectral clustering and its computational difficulty, many researchers are interested in more efficient spectral clustering algorithms, especially for solving large-scale problems. This paper, for the first time in spectral clustering research, offers a feasible solution.

To overcome the computational complexity of spectral clustering, we focus on the key time-consuming issue, i.e., similarity matrix construction. We propose a novel estimator that approximates the similarity matrix using the accumulative consensus matrix of a series of K-means clusterings with random initializations. Intuitively, our assumption is that two data points are more similar if they are clustered into the same class with higher probability. To be more specific, we perform a large number of trials of K-means clustering on the data objects using random initializations. If two objects are often clustered into the same group, the accumulative consensus of these two objects is high; if two objects are often clustered into different groups, the consensus is low. The derived estimator has the following advantages: (1) The overall computational complexity is O(n log n). Compared to traditional spectral methods, which cost at least O(n^2), our approach is substantially faster on large-scale data. (2) The obtained estimator of the similarity matrix is intrinsically sparse. The sparsity is useful in the second and third steps of spectral clustering (i.e., the computation of the Laplacian matrix and of its eigenvectors), because the computational complexity is linearly proportional to the number of non-zeros in the similarity matrix. (3) Our estimator of the similarity matrix preserves locality, i.e., two objects in the same manifold tend to have stronger similarity than those in different manifolds. We will demonstrate the local-preserving property using both a toy example and theoretical analysis. We further develop a Non-negative Matrix Factorization

Fig. 1. An illustrative example of the consensus matrix on four nodes. (a): Three partitions (Π1, Π2, Π3) of the four nodes v1, v2, v3, and v4. (b): The membership of each node under each partition. (c): The resulting consensus matrix, which counts how often each pair of nodes is grouped together.

(NMF) algorithm to obtain clustering results based on the accumulative consensus matrix. The total running time of our whole clustering algorithm remains O(n log n). We first test our algorithm on various real-world datasets; then three large-scale datasets (with 40,960, 2,621,440, and 5,242,880 samples, respectively) are used to verify the scalability of our algorithm. Results indicate that our approach is comparable to other state-of-the-art clustering methods while its computational complexity is much lower than that of traditional spectral clustering algorithms on large-scale data.

II. Clustering Consensus Estimation

A. Accumulative Consensus Matrix

The key idea of the accumulative consensus matrix is to estimate the similarities of objects by measuring how often they are clustered into the same group when multiple clustering procedures are applied. Figure 1 demonstrates a toy example in which only four data points are considered. Figure 1 (a) and (b) show three partitions (Π1, Π2, Π3) and the corresponding membership of the four nodes v1, v2, v3, v4. For any pair of data points, if they are clustered into the same group, the accumulative consensus value between these two data points increases by 1. The accumulative consensus matrix thus counts the number of times they are clustered into the same group; see Figure 1 (c). In our study, we use K-means (which can also be substituted by other methods) as the clustering procedure to obtain different partitions of the data points. K-means clustering has several desirable properties: (1) The K-means algorithm is only guaranteed to converge to a local solution (this is also the reason not to run K-means just once). (2) There are a large number of local solutions. One consequence is that two data points which are near each other in the same manifold always have a chance to be clustered into the same group. (3) The computational time of K-means is linear in the number of data points. These properties of K-means offer a fast way to estimate the similarity matrix among objects.

To be more formal, assume that we have n data points X = {x_1, x_2, ..., x_n} in a p-dimensional vector space: x_i ∈ R^p, i =

1, 2, ..., n. Let V = {1, 2, ..., n}; then a partition Π of the data points can be represented as Π = {C_1, C_2, ..., C_K}, where C_k ∩ C_l = ∅ for k ≠ l, ∪_{k=1}^K C_k = V, and K is the number of clusters. For convenience of discussion, we also use the following notations to represent the partition Π:

    Q^Π_{ik} = 1 if i ∈ C_k, and 0 otherwise,    (1)

or, using the membership indicator,

    c^Π_i = k if i ∈ C_k, i = 1, 2, ..., n.    (2)

Notice that Q^Π is an n × K matrix and c^Π is an n × 1 column vector. Given a partition Π, the consensus matrix is defined as

    S^Π_{ij} = 1 if c^Π_i = c^Π_j, and 0 otherwise, i, j = 1, 2, ..., n.    (3)

Given a set of partitions Π_1, Π_2, ..., Π_T, the accumulative consensus matrix C is defined as

    C = Σ_{t=1}^T S^{Π_t}.    (4)
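As a concrete illustration of Eqs. (1)-(4), the following minimal Python sketch estimates the accumulative consensus matrix with repeated K-means. It assumes scikit-learn's KMeans and uses a dense n × n array, so it is only suitable for small n; the sparse, hierarchical variant used in practice is described in §II-B and §III-B.

```python
import numpy as np
from sklearn.cluster import KMeans

def accumulative_consensus(X, K=30, T=100, seed=0):
    """Eqs. (1)-(4): over T random K-means trials with K clusters, count how
    often each pair of points falls into the same cluster."""
    n = X.shape[0]
    rng = np.random.RandomState(seed)
    C = np.zeros((n, n))
    for _ in range(T):
        # Membership indicator c^Pi of Eq. (2) from one random K-means trial.
        c = KMeans(n_clusters=K, n_init=1,
                   random_state=rng.randint(1 << 30)).fit_predict(X)
        # Consensus matrix S^Pi of Eq. (3), accumulated as in Eq. (4).
        C += (c[:, None] == c[None, :]).astype(float)
    return C
```

With T = 100 and K = 30, as in the Figure 2 example below, C_ij / T approximates the probability that points i and j are grouped together.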

Figure 2 shows a dataset drawn from a mixture of three Gaussians as well as the accumulative consensus values of four data points A, B, C, and D. In this example, we set T = 100 and K = 30. The Euclidean distance between A and B is d_AB = 3.64, which is larger than d_AC = 1.62. However, the accumulative consensus value between A and B is 79, which is much higher than that between A and C, which is 0. A similar phenomenon occurs among A, D, and C. This indicates that the accumulative consensus matrix preserves the local connectivity among data points in the same manifold. More theoretical analysis can be found in the discussion of Theorem 1.

B. Implementation Details

In large-scale problems, the size of the accumulative consensus matrix can be very large. For example, in one of our experimental datasets, n = 5,242,880, so the full size of C is 5,242,880 × 5,242,880, which is impossible to store in the memory of a single machine. Fortunately, the accumulative consensus matrix is intrinsically sparse. Therefore, in our implementation, we only store the non-zero values of C.

In order to obtain the accumulative consensus matrix, we run the K-means algorithm with random initializations. We repeat the K-means algorithm T times to obtain T partitions.

C. Properties of Accumulative Consensus Matrix

We first notice the following property of local solutions of K-means.

Fig. 2. Left: the dataset drawn from three Gaussians with four marked points A, B, C, and D. Right: their accumulative consensus values (T = 100). The distance between A and C is less than that between A and B, but the accumulative consensus value between A and C is zero, which is much lower than that between A and B (79). A similar phenomenon occurs among A, D, and C.

Theorem 1: If Π is a local K-means solution of data X and c^Π_i = c^Π_j = k, then any point x' on the segment [x_i, x_j] also belongs to cluster k, i.e., c^Π_{x'} = k. By points x' on the segment [x_i, x_j], we mean the points which can be represented as

    x' = α x_i + (1 − α) x_j, 0 ≤ α ≤ 1.    (5)

Proof. Since Π = {C_1, C_2, ..., C_K} is a local solution of K-means and c^Π_i = c^Π_j = k, the following holds:

    ||x_i − μ_k|| < ||x_i − μ_l||, ||x_j − μ_k|| < ||x_j − μ_l||, ∀ l ≠ k,

where μ_k and μ_l are the centers of clusters k and l. Thus

    α ||x_i − μ_k||^2 < α ||x_i − μ_l||^2, (1 − α) ||x_j − μ_k||^2 < (1 − α) ||x_j − μ_l||^2,

for any 0 ≤ α ≤ 1, i.e.,

    α (x_i^T x_i − 2 x_i^T μ_k + μ_k^T μ_k) < α (x_i^T x_i − 2 x_i^T μ_l + μ_l^T μ_l),
    (1 − α)(x_j^T x_j − 2 x_j^T μ_k + μ_k^T μ_k) < (1 − α)(x_j^T x_j − 2 x_j^T μ_l + μ_l^T μ_l).

Adding them together, we have

    μ_k^T μ_k − 2 α x_i^T μ_k − 2 (1 − α) x_j^T μ_k < μ_l^T μ_l − 2 α x_i^T μ_l − 2 (1 − α) x_j^T μ_l.

Adding ||α x_i + (1 − α) x_j||^2 to both sides gives

    μ_k^T μ_k − 2 α x_i^T μ_k − 2 β x_j^T μ_k + α^2 x_i^T x_i + β^2 x_j^T x_j + 2 α β x_i^T x_j
        < μ_l^T μ_l − 2 α x_i^T μ_l − 2 β x_j^T μ_l + α^2 x_i^T x_i + β^2 x_j^T x_j + 2 α β x_i^T x_j,

where β = 1 − α. Thus we have ||α x_i + β x_j − μ_k|| < ||α x_i + β x_j − μ_l||, ∀ l ≠ k, indicating that the closest center to x' = α x_i + β x_j is μ_k. Thus c^Π_{x'} = k. □

Theorem 1 suggests that if two points are clustered into the same group, then all the points between them lie in the same class; hence points that are far away from each other should be clustered into different groups. Thus, the accumulative consensus matrix defined in Eq. (4) is theoretically an approximation of a similarity matrix. Theorem 1 also indicates that K-means tends to cluster two points together if the density between the two points is high. Please see Figure 2 as an example. Points A and B are clustered together 79 times out of 100 random trials, because the density between A and B is high. On the other hand, points A and C are closer to each other, but the density between these two points is low. Thus, they are never clustered into the same group in 100 random trials.

Furthermore, we will show that the accumulative consensus matrix converges. Formally speaking, we have the following theorem.

Theorem 2: Let C^T be the accumulative consensus matrix over T partitions obtained using Eq. (4) and assume the T partitions are independent. Then for any ε > 0, there exists a constant matrix C^0 such that

    lim_{T→∞} P( ||C^T / T − C^0||_F^2 > ε^2 ) = 0,    (6)

where ||·||_F denotes the Frobenius norm.

Proof. Since each partition is a converged solution of K-means with a random initialization, S_ij is a random variable taking values in {0, 1}. Without loss of generality, we assume S_ij ∼ Ber(p_ij), where Ber(p_ij) is the Bernoulli distribution with success probability p_ij. The variance of C^T_ij / T is

    Var( C^T_ij / T ) = p_ij (1 − p_ij) / T.

By applying the Chebyshev inequality, we have

    P( |C^T_ij / T − p_ij| > ε ) ≤ p_ij (1 − p_ij) / (T ε^2).

Let C^0_ij = p_ij. Then

    P( ||C^T / T − C^0||_F^2 > ε^2 ) ≤ n^2 max_{ij} p_ij (1 − p_ij) / (T ε^2).

Since n^2 max_{ij} p_ij (1 − p_ij) / ε^2 is a constant with respect to T, we have

    0 ≤ lim_{T→∞} P( ||C^T / T − C^0||_F^2 > ε^2 ) ≤ 0,

which completes the proof. □

Fig. 3. Convergence test of the accumulative consensus matrix among three data points of the dataset shown in the left panel of Figure 2. The three points are a: (−2.7609, 0.7717), b: (−2.5783, 0.9635), and c: (−2.6913, 1.0913). The accumulative consensus values are close to their convergent values at around T = 200.

We use three points (a, b, c) of the dataset shown in the left panel of Figure 2 to demonstrate the convergence property of the accumulative consensus matrix. We run a total of 2000 random K-means trials, evaluate the normalized accumulative consensus matrix C^T / T, and plot the three accumulative consensus values between a and b, a and c, and b and c as functions of T in Figure 3. One can observe that after around 200 random trials, the accumulative consensus values are very close to the convergent results. For this reason, in our real-world experiments we always use T = 200.

We are also interested in the statistical relationship between the accumulative consensus value C_ij and the similarity W_ij (defined in Eq. (7); see §III for details) for any data point pair i and j. We randomly generate 3000 two-dimensional Gaussian data points as shown in Figure 4 (a). After that, we perform 50 K-means trials with random initializations to get C using Eq. (4). For each pair (i, j), we plot C_ij versus W_ij in the top right panel of Figure 4. The average W_ij corresponding to the same C_ij value is also plotted (black line). These results indicate that, on average, the more similar two data points are, the larger the accumulative consensus value they have.

D. A Toy Example

Here we demonstrate the accumulative process of the consensus matrix on the Gaussian synthetic data of Figure 2. The clustering memberships and the accumulative consensus matrices are shown in Figure 5, where the first and third rows are 10 different K-means clustering results, and the second and bottom rows are the corresponding accumulative consensus matrices C. The data points are ordered according to their membership to the Gaussians, i.e., the first, second, and third 50 points are drawn from the first, second, and third Gaussian, respectively. From the last accumulative consensus matrix (bottom right panel of Figure 5), we can see that the data points are well separated by the accumulative consensus matrix.

III. Spectral Clustering via Accumulative Consensus Matrix

A. Overview of Spectral Clustering

Spectral clustering can be interpreted from different points of view, e.g., graph cut [17] and random walk [18]. There are various theoretical foundations which provide different understandings of spectral clustering. Luxburg [19] and Ding [20] independently gave overviews of these clustering techniques in terms of theories, interpretations, and implementation details.

Even though spectral clustering can be interpreted in various ways, the interpretations are in fact theoretically equivalent. For convenience of discussion, we give a brief introduction to spectral clustering from the point of view of graph cuts in this section. In particular, we use the Normalized Cut [21] as an example, which is also used in our experimental comparisons.

Given n data points X = {x_1, x_2, ..., x_n}, we first construct the similarity matrix W ∈ R^{n×n} as

    W_ij = exp( −||x_i − x_j||^2 / (2σ^2) ), i, j = 1, 2, ..., n,    (7)

where σ = α r̄, r̄ is the average pairwise distance of the data points, r̄ = Σ_{ij} ||x_i − x_j|| / (n(n − 1)), and α is a parameter that needs to be determined. Alternatively, one can use an adaptive construction [22]:

    W_ij = exp( −||x_i − x_j||^2 / (2σ_ij^2) ) if x_i and x_j are neighbors, and 0 otherwise,    (8)

where σ_ij is an adaptive Gaussian parameter: σ_ij = sqrt(m_i m_j), with m_i and m_j the average distances from i and from j to their neighbors, respectively. Here we regard W as a weighted graph on n nodes V in which W_ij represents the weight between nodes i and j. For convenience of discussion, we set V = {1, 2, ..., n}.

From the graph cut point of view, the clustering task is to partition the data points into K groups, Π = {C_1, C_2, ..., C_K}. The objective function of multi-way graph spectral clustering is

    J(Π) = Σ_{1 ≤ p < q ≤ K} [ s(C_p, C_q)/ρ(C_p) + s(C_p, C_q)/ρ(C_q) ] = Σ_{k=1}^K s(C_k, C̄_k)/ρ(C_k),    (9)

where ρ(C_k) is a normalization term. If ρ(C_k) is a constant, minimizing J(Π) is equivalent to the Min Cut problem [23]. When ρ(C_k) = Σ_{i ∈ C_k} d_i, J(Π) becomes the Normalized Cut [21] objective. Here C̄_k is the complement of the subset C_k in the graph W, s(A, B) = Σ_{i ∈ A} Σ_{j ∈ B} W_ij, and d_i = Σ_j W_ij. Let q_k (k = 1, 2, ..., K) be the cluster indicators, where the i-th element of q_k is 1 if the i-th data point x_i belongs to cluster k, and 0 otherwise. For example, suppose data points within each cluster are adjacent; then

    q_k = (0, ..., 0, 1, ..., 1, 0, ..., 0)^T,    (10)

with n_k ones. One can easily see that s(C_k, C̄_k) = Σ_{i ∈ C_k} Σ_{j ∈ C̄_k} W_ij = q_k^T (D − W) q_k, Σ_{i ∈ C_k} d_i = q_k^T D q_k, and s(C_k, C_k) = q_k^T W q_k. We have

    J_ncut = Σ_{k=1}^K q_k^T (D − W) q_k / (q_k^T D q_k) = Σ_{k=1}^K q_k^T L q_k / (q_k^T D q_k),    (11)

or, equivalently,

    J_ncut = Tr(Q^T L Q) / Tr(Q^T D Q).    (12)

The matrix L = D − W is called the graph Laplacian matrix [12]. Minimizing the normalized cut objective in Eq. (12) is equivalent to solving the following eigenvector problem:

    L q_k = λ_k D q_k.    (13)

Fig. 4. Relationship between the accumulative consensus matrix C and the Gaussian similarity matrix W. (a): 1-Gaussian toy dataset. (b): the relationship between C_ij and W_ij on the 1-Gaussian dataset shown in (a). (c): the relationship between C_ij and W_ij on the 3-Gaussian dataset shown in Figure 2. Shown are the average and standard deviation of W_ij corresponding to the same C_ij. Statistically speaking, higher accumulative consensus values correspond to higher similarity.

Fig. 5. Accumulative process of the consensus matrix for the first 10 random trials. The first and third rows are 10 different K-means clustering results, and the second and bottom rows are the corresponding accumulative consensus matrices C; e.g., the first figure in the second row is S^1, and the last figure in the bottom row is Σ_{t=1}^{10} S^t, where S^t is computed using Eq. (3).

Spectral clustering then performs K-means on the first K eigenvectors Q = [q_1, q_2, ..., q_K] to obtain the clustering indicators.
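For reference, the following is a minimal sketch of this baseline pipeline (Eq. (7), Eq. (13), then K-means on the eigenvectors). It assumes SciPy and scikit-learn and builds the dense similarity matrix, so it is O(n^2) in time and memory and intended only to make the steps concrete.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def normalized_cut_clustering(X, K, alpha=0.5):
    """Baseline spectral clustering of Sec. III-A: Gaussian similarity (Eq. (7)),
    generalized eigenproblem L q = lambda D q (Eq. (13)), K-means on the
    first K eigenvectors."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared pairwise distances
    rbar = np.sqrt(d2).sum() / (n * (n - 1))              # average pairwise distance
    W = np.exp(-d2 / (2.0 * (alpha * rbar) ** 2))         # Eq. (7) with sigma = alpha * rbar
    D = np.diag(W.sum(axis=1))
    L = D - W                                             # graph Laplacian
    w, V = eigh(L, D)                                     # generalized eigenvectors, ascending
    Q = V[:, :K]                                          # first K eigenvectors (Eq. (13))
    return KMeans(n_clusters=K, n_init=10).fit_predict(Q)
```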

B. Hierarchical K-means

The bottleneck of the computational cost of spectral clustering stems from the construction of the similarity matrix. As the major contribution of this paper, we use the consensus matrix to

approximate the similarity matrix. However, the complexity of the eigenvector problem depends on the sparseness of the Laplacian matrix, i.e., the complexity is linearly proportional to the number of non-zero entries in the Laplacian matrix. To theoretically reduce the overall computational complexity, we need to bound the size of the groups. Specifically, we first observe the following upper bound on the sparsity of the accumulative consensus matrix.

Theorem 3: The number of non-zero entries in the accumulative consensus matrix C is at most κh^2 T, where C is constructed using Eq. (4) with T K-means random trials, K-means divides the data into κ clusters in each random trial, and h = max_k |C_k| is the number of data points in the largest cluster. (Each of the κ clusters in a trial has at most h points and therefore contributes at most h^2 non-zero pairs, so each trial adds at most κh^2 non-zeros and T trials add at most κh^2 T.)

In order to obtain a sparse consensus matrix, we need to control the maximum size of the groups. On the other hand, if we fix the maximum size of the groups, the number of groups becomes large when the number of data points n is large. Notice that the computational complexity of K-means is O(nκp), where κ is the number of groups into which we want to cluster the data. Thus, if we fix h = max_k |C_k| and κ ≈ n/h, the K-means algorithm takes O(nκp) = O(n^2 p/h) time, which is still an O(n^2) algorithm. To solve this problem, we present a hierarchical version of K-means. We first partition the input data into Km groups (Km is set to 20 in all our experiments). For groups whose size is larger than h, we partition them further. Please see Algorithm 1 for details; a Python sketch is given below.

Algorithm 1 HierarchicalKmeans(X, h, Km)
Input: Data X, maximum group size h, maximum clusters Km.
Output: Clustering partition Π.
Initialization: Π = {C_1}, C_1 = [1, 2, ..., n], N(1) = n, K ← 1, K̃ ← 2.
while true do
    if N(k) ≤ h, ∀k ≤ K then break end if
    for k = 1 : K do
        if N(k) > h then
            κ ← min(Km, ⌈N(k)/h⌉)
            π ← Kmeans(X restricted to C_k, κ)
            C_k ← π_1, N(k) ← |π_1|
            for l = 2 : κ do
                C_K̃ ← π_l, N(K̃) ← |π_l|, K̃ ← K̃ + 1
            end for
        end if
    end for
    K ← K̃ − 1
end while
Output: Π = {C_1, C_2, ..., C_K}

One can see that in the output of Algorithm 1, the largest group size is at most h. According to Theorem 3, the number of non-zero entries in the consensus matrix is then at most κh^2 T, which is linear in the total number of data points n.
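A minimal Python sketch of Algorithm 1, assuming scikit-learn's KMeans; the function name and interface are ours, and groups are returned as index arrays rather than in the partition notation used above.

```python
import numpy as np
from sklearn.cluster import KMeans

def hierarchical_kmeans(X, h=20, Km=20, random_state=None):
    """Recursively split any group larger than h with K-means (cf. Algorithm 1).
    Returns a list of index arrays, one per final group."""
    rng = np.random.RandomState(random_state)
    groups = [np.arange(X.shape[0])]
    while True:
        oversized = [g for g in groups if len(g) > h]
        if not oversized:
            break
        kept = [g for g in groups if len(g) <= h]
        for g in oversized:
            kappa = min(Km, int(np.ceil(len(g) / h)))
            labels = KMeans(n_clusters=kappa, n_init=1,
                            random_state=rng.randint(1 << 30)).fit_predict(X[g])
            kept.extend(g[labels == l] for l in range(kappa))
        groups = kept
    return groups
```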

C. Algorithm for Accumulative Consensus Matrix Construction

We summarize the algorithm for accumulative consensus matrix construction in Algorithm 2. Notice that Q is sparse, so QQ^T can be computed efficiently; a detailed computational complexity analysis can be found in §V. Here (QQ^T)_{ij} = Σ_k (q_k)_i (q_k)_j = S_ij, indicating whether data points i and j are grouped into the same cluster. A sparse implementation sketch is given below.

Algorithm 2 ConsensusConstruct(X, h, Km, T)
Input: Data X, maximum group size h, maximum clusters Km, the number of K-means random trials T.
Output: Accumulative consensus matrix C.
Initialization: C ← 0.
for t = 1 : T do
    Π_t ← HierarchicalKmeans(X, h, Km)
    Compute Q from Π_t using Eq. (1).
    C ← C + QQ^T
end for
Output: C.
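A sketch of Algorithm 2 that stores only the non-zero entries of C, assuming SciPy's sparse matrices and the `hierarchical_kmeans` sketch above.

```python
import numpy as np
from scipy import sparse

def consensus_construct(X, h=20, Km=20, T=200):
    """Accumulate C = sum_t Q_t Q_t^T over T hierarchical K-means trials
    (cf. Algorithm 2), keeping C sparse throughout."""
    n = X.shape[0]
    C = sparse.csr_matrix((n, n))
    for _ in range(T):
        groups = hierarchical_kmeans(X, h=h, Km=Km)
        labels = np.empty(n, dtype=int)
        for k, g in enumerate(groups):
            labels[g] = k
        # Sparse indicator matrix Q of Eq. (1); Q Q^T is this trial's consensus S.
        Q = sparse.csr_matrix((np.ones(n), (np.arange(n), labels)),
                              shape=(n, len(groups)))
        C = C + Q @ Q.T
    return C
```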

IV. Non-negative Matrix Factorization (NMF) Solutions

In typical spectral clustering, we compute the eigenvectors of the graph Laplacian. However, in multi-way clustering, it remains difficult to obtain group membership assignments from the eigenvectors of the graph Laplacian. As a result, we still need to perform K-means again to obtain grouping results. In this section, as another main contribution of this paper, we develop a robust clustering algorithm using the Non-negative Matrix Factorization (NMF) technique, which solves for the clustering assignment directly from the similarity matrix.

A. NMF Algorithm for Consensus Clustering

Notice that for any positive α, if we replace q_k by αq_k, Eq. (11) gives the same objective. Thus we can relax Eq. (11) and enforce q_k^T D q_k = 1, which leads to

    min J_ncut(H) = Σ_{k=1}^K h_k^T L h_k   s.t. h_k^T D h_k = 1, h_k ≥ 0, ∀k,    (14)

where H = [h_1, h_2, ..., h_K]. Here the clustering membership indicator vector q_k, which takes values in {0, 1}, is relaxed to a non-negative vector. Instead of solving the eigenvector problem in Eq. (13), we minimize the objective in Eq. (14) with the explicit non-negativity constraints. The optimal solution then directly yields the clustering assignment:

    C_k = {i : arg max_l H_il = k}.

Noticing that L = D − C, where the consensus matrix C serves as the approximation of the similarity matrix, and that Tr(H^T D H) is a constant, we can further rewrite Eq. (14) as

    max J_ncut(H) = Tr(H^T C H)   s.t. H^T D H = I.    (15)

Our algorithm starts with an initial guess H. It then iteratively updates H until convergence using the updating rule

    H_ik ← H_ik sqrt( [C H + D H Λ^−]_ik / [D H Λ^+]_ik ),    (16)

where

    Λ = H^T C H,    (17)

and Λ^+, Λ^− are the positive and negative parts of Λ, respectively. We will show that the updating algorithm of Eq. (16) converges to a correct solution. Notice that the feasible domain of Eq. (15) is non-convex, indicating that our algorithm can only reach local solutions. However, we show in the empirical study that the whole algorithm yields reasonable results compared with Normalized Cut spectral clustering. Since C is a sparse matrix and D is a diagonal matrix, the computational cost of the updating algorithm is low; please see §V for a detailed discussion.

B. Analysis of NMF Algorithm

Here we show the correctness and convergence of the above algorithm. By correctness, we mean that the update yields a correct solution at convergence; the correctness of our algorithm is assured by the following theorem.

Theorem 4: Fixed points of Eq. (16) satisfy the KKT condition of the optimization problem of Eq. (15).

Proof. We begin with the Lagrangian function

    L = Tr[ H^T C H − Λ (H^T D H − I) − Σ H ],    (18)

where the Lagrange multiplier Λ enforces the orthogonality condition H^T D H = I and the Lagrange multiplier Σ enforces the non-negativity of H. The KKT complementary slackness condition (∂L/∂H_ik) H_ik = 0 becomes

    [C H − D H Λ]_ik H_ik = 0.    (19)

Clearly, a fixed point of the update rule Eq. (16) satisfies [C H − D H Λ]_ik H_ik^2 = 0, which is mathematically identical to Eq. (19). From Eq. (19), summing over i, we obtain Λ_kk = [H^T C H]_kk. To find the off-diagonal elements of Λ, we ignore the non-negativity requirement and set ∂L/∂H = 0, which leads to Λ = H^T C H. Combining these immediately leads to Eq. (17). □

The convergence of our algorithm is assured by the following theorem.

Theorem 5: Under the update rule of Eq. (16), the Lagrangian function

    L = Tr[ H^T C H − Λ (H^T D H − I) ]    (20)

increases monotonically.

Proof. We use the auxiliary function approach [24]. An auxiliary function G(H, H̃) of a function L(H) satisfies G(H, H) = L(H) and G(H, H̃) ≤ L(H). We define

    H^(t+1) = arg max_H G(H, H^(t)).    (21)

Then by construction we have

    L(H^(t)) = G(H^(t), H^(t)) ≤ G(H^(t+1), H^(t)) ≤ L(H^(t+1)).    (22)

This proves that L(H^(t)) is monotonically increasing. The key steps in the remainder of the proof are: (1) find an appropriate auxiliary function; (2) calculate the global maximum of the auxiliary function. We write Eq. (20) as L = Tr[H^T C H + Λ^− H^T D H − Λ^+ H^T D H]. We can show that one auxiliary function of L is

    G(H, H̃) = Σ_{ijk} C_ij H̃_ik H̃_jk ( 1 + log( H_ik H_jk / (H̃_ik H̃_jk) ) )
             + Σ_{ilk} (Λ^−)_kl d_i H̃_ik H̃_il ( 1 + log( H_ik H_il / (H̃_ik H̃_il) ) )
             − Σ_{ik} d_i (H̃ Λ^+)_ik H_ik^2 / H̃_ik,    (23)

using the inequality z ≥ 1 + log z with z = H_ik H_jk / (H̃_ik H̃_jk), and the generic inequality

    Σ_{i=1}^n Σ_{p=1}^k (A S' B)_ip S_ip^2 / S'_ip ≥ Tr(S^T A S B),    (24)

where A, B, S, S' > 0, A = A^T, B = B^T. We now calculate the global maximum of G(H, H̃) as a function of H. The gradient is

    ∂G(H, H̃)/∂H_ik = 2 [C H̃]_ik H̃_ik / H_ik + 2 (D H̃ Λ^−)_ik H̃_ik / H_ik − 2 (D H̃ Λ^+)_ik H_ik / H̃_ik.

The second derivative is

    ∂^2 G(H, H̃) / (∂H_ik ∂H_jl) = −2 Y_ik δ_ij δ_kl,

where

    Y_ik = [C H̃]_ik H̃_ik / H_ik^2 + (D H̃ Λ^−)_ik H̃_ik / H_ik^2 + (D H̃ Λ^+)_ik / H̃_ik,

so the Hessian is negative definite. Thus G(H, H̃) is a concave function of H and has a unique global maximum. This maximum is obtained by setting the first derivative to zero, yielding

    H_ik^2 = H̃_ik^2 [C H̃ + D H̃ Λ^−]_ik / [D H̃ Λ^+]_ik.    (25)

According to Eq. (21), with H^(t+1) = H and H^(t) = H̃, we see that Eq. (25) is exactly the update rule of Eq. (16). Therefore, Eq. (22) always holds. □

C. Initialization

In order to obtain more robust results, we seek a more reasonable initialization instead of a random one. Since calculating a single eigenvector of a sparse matrix is an O(E) algorithm [13], where E is the number of non-zeros of the sparse matrix, and the second eigenvector is a good approximation of the Normalized Cut optimal solution [2], we employ a hierarchical approach to obtain an initialization for our NMF algorithm. More explicitly, we partition the data as in Algorithm 3; a sketch follows the pseudocode.

Algorithm 3 ConsensusInit(C, K)
Input: Consensus matrix C, the desired number of clusters K.
Output: Partition Π.
Initialization: Π = {C_1}, C_1 = [1, 2, ..., n], N(1) = n.
for m = 1 : K − 1 do
    k̂ = arg max_{k ≤ m} N(k).
    Ĉ = C(C_k̂, C_k̂).
    Compute the Laplacian matrix of Ĉ, and compute the second eigenvector v of the Laplacian matrix.
    π_1 = {i : v_i ≤ 0}, π_2 = {i : v_i > 0},
    C_k̂ = π_1, C_{m+1} = π_2,
    N(k̂) = |π_1|, N(m + 1) = |π_2|.
end for
Output: Π = {C_1, C_2, ..., C_K}.
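A rough sketch of Algorithm 3 using SciPy's sparse eigensolver; it bipartitions by the sign of the second-smallest eigenvector of the unnormalized Laplacian of each consensus submatrix, and assumes every selected group keeps more than two points so the eigensolver is applicable.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import eigsh

def consensus_init(C, K):
    """Hierarchical 2-way splits (cf. Algorithm 3): repeatedly bipartition the
    largest group by the sign of the second eigenvector of the Laplacian of
    the corresponding consensus submatrix."""
    C = sparse.csr_matrix(C, dtype=float)
    groups = [np.arange(C.shape[0])]
    for _ in range(K - 1):
        idx = max(range(len(groups)), key=lambda i: len(groups[i]))
        g = groups.pop(idx)                        # split the largest group
        sub = C[g][:, g]
        L = sparse.diags(np.asarray(sub.sum(axis=1)).ravel()) - sub
        _, vecs = eigsh(L, k=2, which='SM')        # two smallest eigenpairs
        v = vecs[:, 1]                             # second (Fiedler) eigenvector
        groups.append(g[v <= 0])
        groups.append(g[v > 0])
    return groups
```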

D. Consensus Clustering

We summarize our consensus clustering algorithm in Algorithm 4.

Algorithm 4 ConsensusClustering(X, K, h, Km, T)
Input: Data X, desired number of clusters K, maximum group size h, maximum clusters Km, and the number of K-means random trials T.
Output: Partition Π.
C = ConsensusConstruct(X, h, Km, T).
Π = ConsensusInit(C, K).
Compute Q from Π using Eq. (1).
H ← Q + 0.2.
while not converged do
    H_ik ← H_ik sqrt( [C H + D H Λ^−]_ik / [D H Λ^+]_ik ), Λ = H^T C H.
end while
for k = 1 : K do
    C_k = {i : arg max_l H_il = k}.
end for
Output: Π = {C_1, C_2, ..., C_K}.

Notice that there are several parameters in the whole algorithm, but consensus clustering is not sensitive to them. For example, when n is large, h has no effect on the local structure. For the number of K-means random trials T, larger is better, but when T is large enough,

T becomes irrelevant; please see Theorem 2 for the theoretical analysis.

V. Complexity Analysis

In this section, we explore the computational complexity of our algorithms and of related work.

A. Complexity Analysis of Spectral Clustering

Similarity matrix construction. In large-scale spectral clustering, the sparsity of the similarity matrix can decrease the computation time dramatically. We reduce the matrix W to a sparse one by ignoring data points that are far away and only considering those neighbors which have large enough similarities. This is useful for both efficiency and accuracy. Typically one keeps only W_ij where j is among the r nearest neighbors of i, e.g., [25], [20]; r can be set to a small number, e.g., 10. Another approach is to make W sparse by thresholding: if W_ij is smaller than a certain threshold, we set it to zero. While these techniques effectively conquer the memory difficulty, they still have to compute all possible pairs of data points, and hence the computational time is still high.

Here we focus on the r-nearest-neighbor approach. A typical implementation is as follows. Keeping a max heap of size r, we insert each distance that is smaller than the maximal value of the heap and then restructure the heap, one point at a time. Since restructuring a max heap is on the order of log r, the complexity of generating a sparse matrix W is O(n^2 p) + O(n^2 log r) in time and O(nr) in storage. The O(n^2 p) cost can be reduced using techniques such as KD-trees [26] and metric trees [27]; however, these techniques are less suitable if p is large. To further reduce the cost, one can find neighbors which are close but not the closest (approximate nearest neighbors), for example using techniques such as spill-trees [28] and LSH (Locality-Sensitive Hashing) [29]. The complexity depends on the level of approximation, i.e., they trade clustering quality for running time. In this paper, we focus only on an exact method to find the r nearest neighbors.

Computation of the first K eigenvectors of sparse matrices. A typical efficient approach for computing the first K eigenvectors of a sparse matrix is Lanczos/Arnoldi factorization. Once we have obtained a sparse similarity matrix S and its Laplacian matrix L, we can use sparse eigensolvers; more explicitly, we seek a solver that can quickly obtain the first K eigenvectors of L. Example solvers are SLEPc [30] and ARPACK [31]. Most existing approaches are variants of the Lanczos/Arnoldi factorization and have similar time complexity. The overall cost of ARPACK is (O(m^3) + (O(nm) + O(nr)) × O(m − K)) × #Arnoldi restarts, where m is the number of steps in each Arnoldi restart; m is often set to 2K.

K-means on eigenvectors. After obtaining the first K eigenvectors of the Laplacian matrix L, the K-means algorithm is applied.

Since the eigenvectors are typically dense, the computational time for this step is O(nK × KT) = O(nK^2 T), where T is the number of K-means iterations.

Total complexity. To sum up, the total complexity of spectral clustering is O(n^2 p).

B. Computational Complexity of Consensus Clustering

Algorithm 4 is the whole process of consensus clustering, so we explore the complexity of this pseudo-code line by line.

Construction of the consensus matrix. In the construction of the consensus matrix, we run T hierarchical K-means, each of which costs O(n log n); hence the total complexity is O(n log n · T).

Consensus initialization. In the consensus clustering initialization, we need to calculate the second eigenvector of a Laplacian K times. For the first split, we need to compute it for the whole Laplacian matrix, which needs (O(m^3) + (O(nm) + O(E)) × O(m − 1)) × #Arnoldi restarts, where E is the number of non-zeros in the Laplacian matrix, which is O(n); see §II-C.

NMF updating. Since there are O(n) non-zeros in C, the computation of H^T C H, CH, DHΛ^−, and DHΛ^+ costs O(nK) time.

Total complexity. When n is large, all other factors become constants, so the bottleneck is still the construction of the consensus matrix. The total complexity of consensus clustering is O(n log n).

VI. Experimental Results

We design several experiments to evaluate the consensus clustering algorithm and compare our results to state-of-the-art approaches.

A. Comparing Algorithms

K-means. We use the standard K-means algorithm with the batch-updating strategy.

Normalized Cut. We compare our algorithm to two versions of Normalized Cut. One uses the adaptive similarity matrix defined in Eq. (8) with 10 nearest neighbors (marked as NCutA in Figure 7). The second uses the global-bandwidth Gaussian similarity matrix defined in Eq. (7) with α = 0.5 (marked as NCut in Figure 7). We use the clustering implementation provided by the authors of [2], which is available at http://www.cis.upenn.edu/~jshi/software/.

Consensus Clustering. For our method, we set the parameters in Algorithm 4 to h = 20, Km = 20, T = 200.

B. Clustering Quality

Evaluation metrics. In this experiment, we evaluate the quality of clustering with four metrics: clustering accuracy, normalized mutual information, K-means error, and clustering consistency.

1) Clustering Accuracy: Clustering accuracy (ACC) is defined as

    ACC = Σ_{i=1}^n δ(l_i, map(c_i)) / n,

where l_i is the true class label and c_i is the obtained cluster label of x_i, δ(x, y) is the delta function, and map(·) is the best mapping function. Note that δ(x, y) = 1 if x = y, and δ(x, y) = 0

otherwise. The mapping function map(·) matches the true class labels and the obtained cluster labels, and the best mapping is solved by the Kuhn-Munkres algorithm. A larger ACC indicates a better performance.

2) Normalized Mutual Information: Normalized mutual information (NMI) is calculated by

    NMI(Π, Π') = MI(Π, Π') / max(H(Π), H(Π')),    (26)

where Π is the set of clusters obtained from the true labels and Π' is the set of clusters obtained from the clustering algorithm. MI(Π, Π') is the mutual information metric, and H(Π) and H(Π') are the entropies of Π and Π', respectively. NMI is between 0 and 1. Again, a larger NMI value indicates a better performance.

3) K-means Error: The K-means error of a solution is defined as

    J_Kmeans(Π) = Σ_{k=1}^K Σ_{i ∈ C_k} ||x_i − μ_k||^2,    (27)

where μ_k is the center of the k-th cluster: μ_k = Σ_{i ∈ C_k} x_i / |C_k|. A lower K-means error is better.

4) Clustering Consistency: Clustering Consistency (CC) evaluates how well solutions agree with each other. Since K-means starts from random initializations and converges to different local solutions, we compare how much these solutions deviate:

    CC = Σ_{i ≠ j} NMI(Π^i, Π^j) / (N(N − 1)),    (28)

where N is the number of random trials and Π^i is the clustering result of the i-th random trial. N is set to 256 in our experiments. A larger Clustering Consistency value indicates a better performance.

For clustering accuracy, normalized mutual information, and K-means error, we perform 256 random trials for all clustering methods and report the average values and standard deviations. Notice that higher clustering accuracy and normalized mutual information are better, while a lower K-means error is better.

Datasets Descriptions. For the first set of experiments, we use 10 real-world datasets, including 5 UCI datasets (Dermatology, Ecoli, Glass, Segment, and Vehicle), 4 image datasets (BinAlpha, JAFFE, MNIST, and UMIST), and one gene expression dataset (LUNG Cancer). All UCI datasets are downloaded from the UCI repository (http://archive.ics.uci.edu/ml/). No further pre-processing is performed.

MNIST Hand-written Digit Dataset. The MNIST hand-written digits dataset consists of 60,000 training and 10,000 test digits in 10 classes, digits "0" to "9"; it can be downloaded from http://yann.lecun.com/exdb/mnist/. In our experiments, we randomly pick 15 images from the training set for each digit. The size of the images is 28 × 28.

BinAlpha Hand-written English Letter Dataset. In the original Binary Alphadigits dataset, there are 1404 20 × 16 binary images, including the digits "0" through "9" and the capital letters "A" through "Z". Each category has 39 images. In our experiment, we use BinAlpha, which is the part of this dataset consisting of the capital English letters "A"-"Z". The dataset can be downloaded from http://www.cs.toronto.edu/~roweis/data.html. The size of the images is 28 × 16.

UMIST faces is for multi-view face recognition, which is challenging in computer vision because the variations between images of the same face due to viewing direction are almost always larger than image variations due to face identity. A robust face recognition system should be able to recognize a person even though the testing image and training images have quite different poses. This dataset contains 20 persons with 18 images each. All images of the UMIST database are cropped and resized into 28 × 23 images.

JAFFE. The Japanese Female Facial Expression (JAFFE) database contains 213 images of 7 facial expressions (6 basic facial expressions + 1 neutral) posed by 10 Japanese female models. Each image has been rated on 6 emotion adjectives by 60 Japanese subjects. The database was planned and assembled by Miyuki Kamachi, Michael Lyons, and Jiro Gyoba, and the photos were taken at the Psychology Department of Kyushu University. The dataset can be downloaded from http://www.kasrl.org/jaffe.html. The size of the images is 32 × 32.

LUNG contains in total 203 samples in five classes: adenocarcinomas, squamous cell lung carcinomas, pulmonary carcinoids, small-cell lung carcinomas, and normal lung, with 139, 21, 20, 6, and 17 samples, respectively. Each sample has 12,600 genes. The genes with standard deviations smaller than 50 expression units were removed, yielding a dataset with 203 samples and 3,312 genes.

KDDCup98 is the dataset used for the Second International Knowledge Discovery and Data Mining Tools Competition, held in conjunction with KDD-98, The Fourth International Conference on Knowledge Discovery and Data Mining. The competition task is a regression problem where the goal is to estimate the return from a direct mailing in order to maximize donation profits. The dataset can be downloaded from http://kdd.ics.uci.edu/databases/kddcup98/kddcup98.html.

DNA is one of the datasets of the Pascal Large Scale Learning Challenge, which can be downloaded from http://largescale.first.fraunhofer.de/instructions/.

MNIST8M is downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.

Table I summarizes the details of the datasets used in the experiments.

Experimental Results. We report the results in Figure 7 for all four measurements and four methods on the 10 datasets. For clustering accuracy and Normalized Mutual Information, we also perform the one-way ANOVA test to check the

TABLE I
Datasets summary.

Dataset       Size        Dimensions  Classes
Binalpha      1404        320         36
Dermatology   366         34          6
Ecoli         336         343         8
Glass         214         9           6
JAFFE         213         1024        10
LUNG          203         3312        5
MNIST         150         784         10
Segment       2310        19          7
Umist         575         644         20
Vehicle       1440        1024        20
KDDCup98      95413       57          20
DNA           50000000    200         10
MNIST8M       8100000     784         10

significance of the difference in performance among the four approaches. Our method is significantly better than the other approaches (with p < 0.001) on 7 out of 10 datasets in terms of both clustering accuracy and NMI. This suggests that the consensus matrix preserves more of the local connectivity within dense manifolds than the Gaussian similarity measurement.

Fig. 7. Box plots of the clustering results for the four measurements (clustering accuracy, Normalized Mutual Information (NMI), consistency, and K-means error). K-means (KM) is the standard K-means method, NCut Adaptive (NCutA) is Normalized Cut using the similarity matrix defined in Eq. (8), NCut is Normalized Cut with the similarity matrix defined in Eq. (7), and CSC is our method (Consensus Spectral Clustering). One-way ANOVA is also performed to test the significance of accuracy and NMI among the four approaches. '***' means the corresponding method is significantly better than all the other methods (p < 10^-3). Out of the 10 datasets, our method is significantly better than the other methods on 7 datasets for accuracy and NMI.

C. Large Scale Experiments

In this experiment, we apply our algorithm to three large-scale datasets: KDDCup98, DNA, and MNIST8M. We run the algorithms on subsets with different numbers of data points, compute the running time and K-means error for each subset, and plot them versus the number of data points in Figure 6. The largest sizes we try for our method (marked as CSC in Figure 6) on the different datasets are 40,960 (KDDCup98), 2,621,440 (DNA), and 5,242,880 (MNIST8M). For Normalized Cut (marked as NCut in Figure 6), we try 40,960 (KDDCup98), 81,920 (DNA), and 163,840 (MNIST8M), due to the high computational cost of this method. The reported CPU running time is the total time of the two algorithms on a single personal computer (3 GHz CPU with 8 GB memory). A detailed complexity analysis can be found in §V. In this experiment, we use Eq. (8) with 10 nearest neighbors to compute a sparse similarity matrix for Normalized Cut. From Figure 6, we can see that CSC is significantly faster than Normalized Cut, and the two methods are comparable in terms of K-means error.

Fig. 6. CPU time and K-means errors for Normalized Cut (NCut) and Consensus Spectral Clustering (CSC) on subsets of different sizes of KDDCup98 (left), DNA (middle), and MNIST8M (right). The largest sizes we try for our method (CSC) are 40,960 (KDDCup98), 2,621,440 (DNA), and 5,242,880 (MNIST8M), respectively.

VII. Conclusions

In this paper, we proposed both an efficient consensus matrix construction algorithm and an effective NMF-based consensus clustering algorithm for large-scale data clustering, which decreases the usual computational cost of spectral clustering from O(n^2) to O(n log n). Our algorithm is comparable with state-of-the-art clustering approaches in terms of clustering quality, while using much less running time. We successfully applied our algorithm to several large-scale datasets (up to 5 million data points). Because the construction time of the similarity matrix is O(n log n), our consensus matrix can also

9

9

10

12

10

10

8

11

10

10

7

10

Kmeans Error

Kmeans Error

Kmeans Error

8

10

7

10

6

9

10 CSC NCut

6

10 2 10

3

4

10

10

10

CSC NCut

5

10 2 10

5

10

10

4

5

10 10 # data points

6

10

7

10

10

10

5

10 # data points

6

7

10

10

10

4

3

10

2

10

1

10

Total CPU time (s)

Total CPU time (s)

Total CPU time (s)

4

5

10

10

4

10

2

10

3

10

2

10

1

0

10

10

CSC NCut

−1

10

3

10

6

4

CSC NCut

8

3

# data points 10

10

10

2

10

3

4

10

10 # data points

CSC NCut

0

5

10

10 2 10

3

10

4

5

10 10 # data points

6

10

CSC NCut

0

7

10

10 3 10

4

10

5

10 # data points

6

10

7

10

Fig. 6. CPU Computational time and K-means errors for Normalized Cut (NCut) and Consensus Spectral Clustering (CSC) in different sizes of subset on KDDCup98 (left), DNA (middle), and MNIST8M(right). The largest sizes we try for our method (CSC) are 40,960 (KDDCup98), 2,621,440 (DNA), and 5,242,880 (MNIST8M), respectively.

be applied to any other graph/kernel-based unsupervised or semi-supervised approaches. Therefore, our algorithm opens a new direction for high-quality large-scale data analysis.

References
[1] M. Halkidi, D. Gunopulos, M. Vazirgiannis, N. Kumar, and C. Domeniconi, "A clustering framework based on subjective and objective validity criteria," TKDD, vol. 1, no. 4, 2008.
[2] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888–905, 2000.
[3] L. Hagen and A. Kahng, "New spectral methods for ratio cut partitioning and clustering," IEEE Trans. on Computer-Aided Design, vol. 11, pp. 1074–1085, 1992.
[4] C. Ding, X. He, H. Zha, M. Gu, and H. Simon, "A min-max cut algorithm for graph partitioning and data clustering," ICDM, pp. 107–114, 2001.
[5] C. Fowlkes, S. Belongie, F. R. K. Chung, and J. Malik, "Spectral grouping using the Nyström method," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 2, pp. 214–225, 2004.
[6] D. Luo, C. Ding, H. Huang, and T. Li, "Non-negative Laplacian embedding," in Data Mining, 2009. ICDM'09. Ninth IEEE International Conference on. IEEE, 2009, pp. 337–346.
[7] D. Luo, H. Huang, C. Ding, and F. Nie, "On the eigenvectors of p-Laplacian," Machine Learning, pp. 1–15, 2010.
[8] F. Nie, D. Xu, I. Tsang, and C. Zhang, "Spectral embedded clustering," in Proceedings of the 21st International Joint Conference on Artificial Intelligence, 2009, pp. 1181–1186.
[9] W. Xu and Y. Gong, "Document clustering by concept factorization," SIGIR, pp. 202–209, 2004.
[10] S. X. Yu and J. Shi, "Multiclass spectral clustering," Int'l Conf. on Computer Vision, 2003.
[11] R. Liu and H. Zhang, "Segmentation of 3D meshes through spectral clustering," in Pacific Conference on Computer Graphics and Applications, 2004, pp. 298–305.
[12] F. Chung, Spectral Graph Theory. Amer. Math. Society, 1997.
[13] W. Gao, X. S. Li, C. Yang, and Z. Bai, "An implementation and evaluation of the AMLS method for sparse eigenvalue problems," ACM Transactions on Mathematical Software, vol. 34, no. 4, pp. 1–27, Jul. 2008.
[14] A. J. Enright, S. V. Dongen, and C. A. Ouzounis, "An efficient algorithm for large-scale detection of protein families," Nucleic Acids Research, vol. 30, pp. 1575–1584, 2002.
[15] C. Wang, M. Zhang, L. Ru, and S. Ma, "Automatic online news topic ranking using media focus and user attention based on aging theory," in CIKM, 2008, pp. 1033–1042.
[16] W. Xu, X. Liu, and Y. Gong, "Document clustering based on non-negative matrix factorization," in SIGIR, 2003, pp. 267–273.
[17] W. E. Donath and A. J. Hoffman, "Lower bounds for the partitioning of graphs," IBM Journal of Research and Development, vol. 17, pp. 420–425, 1973.
[18] I. S. Dhillon, Y. Guan, and B. Kulis, "A random walks view of spectral segmentation," in International Workshop on Artificial Intelligence and Statistics, 2001.
[19] U. von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.
[20] C. Ding, "A tutorial on spectral clustering," ICML, 2004.
[21] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 22, pp. 888–905, 2000.
[22] L. Zelnik-Manor and P. Perona, "Self-tuning spectral clustering," in NIPS, 2004.
[23] V. Kolmogorov and Y. Y. Boykov, "An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision," in CVPR, 2001, pp. 359–374.
[24] D. Lee and H. Seung, "Algorithms for non-negative matrix factorization," in Advances in Neural Information Processing Systems 13. Cambridge, MA: MIT Press, 2001.
[25] F. R. Bach and M. I. Jordan, "Learning spectral clustering," Neural Info. Processing Systems 16 (NIPS 2003), 2003.
[26] J. L. Bentley, "Multidimensional binary search trees used for associative searching," Communications of the ACM, vol. 18, no. 9, pp. 509–517, Sep. 1975.
[27] J. K. Uhlmann, "Satisfying general proximity/similarity queries with metric trees," Inf. Process. Lett., vol. 40, no. 4, pp. 175–179, 1991.
[28] T. Liu, A. W. Moore, A. G. Gray, and K. Yang, "An investigation of practical approximate nearest neighbor algorithms," in NIPS, 2004.
[29] A. Gionis, P. Indyk, and R. Motwani, "Similarity search in high dimensions via hashing," in VLDB. Morgan Kaufmann Publishers, 1999.
[30] V. Hernández, J. E. Román, and V. Vidal, "SLEPc: A scalable and flexible toolkit for the solution of eigenvalue problems," ACM Trans. Math. Softw., vol. 31, no. 3, pp. 351–362, 2005.
[31] K. J. Maschhoff and D. C. Sorensen, "A portable implementation of ARPACK for distributed memory parallel computers," in Proc. Copper Mountain Conf. on Iterative Methods, 1996.

