
A Simple Algorithm for Clustering Mixtures of Discrete Distributions

Pradipta Mitra

In this paper, we propose a simple, rotationally invariant algorithm for clustering mixtures of distributions, including discrete distributions. This resolves a conjecture of McSherry (2001). It also substantially generalizes the class of distributions (both discrete and continuous) for which spectral clustering is known to work.

1  Introduction

In many applications in Information Retrieval, Data Mining and other fields, a collection of m "objects" is analyzed, where each object is a vector in n-space. The input can thus be represented as an m × n matrix A, each row representing an object and each column representing a "feature". A common example of this would be Term-Document matrices (where the entries may stand for the number of occurrences of a term in a document). Given such a matrix, an important question is to cluster the data, that is, to partition the objects into a (small) number of sets according to some notion of closeness. A versatile way to model such clustering problems is to use probabilistic mixture models. Here we start with k "simple" probability distributions D_1, D_2, ..., D_k, where each is a distribution on R^n. With each D_r we can associate a center µ_r of the distribution, defined as the expectation µ_r = E_{v∼D_r}[v]. A mixture of these probability distributions is a density of the form w_1 D_1 + w_2 D_2 + ... + w_k D_k, where the w_r ≥ 0 and Σ_{r∈[k]} w_r = 1. A clustering question in this case would then be: what is the required condition on the probability distributions so that, given a matrix A generated from the mixture, we can group the objects into k clusters, where each cluster consists precisely of the objects picked according to one of the component distributions of the mixture?

A related model is the "planted partition model" for graphs, where the well-known Erdos-Renyi random graphs are generalized to model clustering problems for unweighted graphs. These models are special cases of the mixture models, except for the added requirement of symmetry. Symmetry only adds very mild dependence to the model, and can be handled easily. For our purposes, then, we will restrict ourselves to mixture models.

In this paper, we present and analyze a very natural algorithm that works for quite general mixture models. The algorithm is rotation-invariant, in the sense that it is not coordinate specific in any way, and is based on ordinary geometric projections onto natural subspaces. Our algorithm is the first such algorithm shown to work for discrete distributions. The existence of such an algorithm was first conjectured in [12] by McSherry, and later in [6]. In fact, McSherry [12] proposed a specific algorithm as a candidate solution. Our algorithm is not that exact algorithm, but is equally natural.

Why is this interesting? To answer this question, we need to introduce the common techniques used in the literature, and their limitations. The method very commonly used to cluster mixture models (discrete and continuous) is "spectral clustering", which refers to a large class of algorithms where the singular vectors of A are used in some fashion. Works on spectral clustering for continuous distributions have focused on high-dimensional Gaussians and their generalizations [1, 9, 16]. On the other hand, the use of spectral methods for discrete, graph-based models was pioneered by Boppana [4]. The essential aspects of many techniques used now first appeared in [2, 3]. These results were generalized by McSherry [12], and a number of works followed ([5, 6] etc.). Despite their variety, these algorithms have a common first step. Starting with well-known spectral norm bounds on ‖A − E(A)‖ (e.g. [17]), a by-now standard analysis can be used to show that projection onto the best k-dimensional approximation yields a good approximate partitioning of the data. The major effort, often, goes into "cleaning up" this approximate partitioning. It is here that algorithms for discrete and continuous models diverge. For continuous models, spectral clustering is often followed by projecting new samples onto the spectral subspace obtained. This works, essentially, because a random Gaussian vector is (almost) orthogonal to a fixed vector. Unfortunately, such a statement is simply not true for discrete distributions. This has resulted in rather ad-hoc methods for cleaning up mixtures of discrete distributions.

A most pertinent example would be "combinatorial projections", proposed by McSherry [12]; all other algorithms for discrete distributions share the same central features we are concerned with. Here, one counts the number of edges from each vertex to the approximate clusters obtained. Though successful in that particular context, the idea of combinatorial projections is unsatisfactory for a number of inter-related reasons. First, such algorithms are not robust under many natural transformations of the data for which we expect the clustering to remain the same. The most important case is perhaps rotation. It is natural to expect that if we change the orthogonal basis the vectors are expressed in, the clustering should remain the same. And it is not totally unusual for the orthogonal basis to be changed; for example, various data processing techniques often involve a change of basis. Known algorithms for continuous distributions would continue to work in these cases, whereas combinatorial methods might not work, or might not even make sense. This also means that many of these combinatorial algorithms will not work when presented with a mixture of continuous and discrete data. A related issue is this: the main idea used to cluster discrete distributions is that the feature space (in addition to the object space) has clearly delineated clusters. One way to state this condition is that E(A) can be divided into a small number of submatrices, each of which contains the same value for each entry. This is implicit in graph-based models, as the objects and features are the same things: vertices. These results will extend to rectangular matrices as long as some generalization of that condition holds, but for general cases where the centers do not necessarily have such strong structure it is not clear how to extend these ideas.

Our algorithm solves this problem. Happily, the algorithm is very natural. Our technical tools are well-known spectral norm bounds and a Chernoff-type bound; we just have to apply them a bit carefully. In this paper, we will focus on discrete distributions as they are the "hard" distributions in the present context. Our algorithm will in fact work for quite general distributions, such as distributions with subgaussian tails, and distributions with limited independence.

2  Model

There are k probability distributions D_r, r = 1 to k, on R^n, and with each distribution a weight w_r ≥ 0 is associated. We assume Σ_{r∈[k]} w_r = 1. Let the center of D_r be µ_r. Then there is a value 0 ≤ σ² ≤ 1 such that µ_r(i) ≤ σ² for all r, i.

For each distribution D_r, a set T_r of w_r·m samples is chosen from it, adding up to Σ_{r∈[k]} w_r·m = m total samples. Each n-dimensional sample v is generated by setting v(i) = 1 with probability µ_r(i) and 0 otherwise, independently for all i ∈ [n]. The m samples are arranged as the rows of the m × n matrix A, which is presented as the data. We will use A to mean both the matrix and the set of vectors that are rows of A; the particular usage will be clear from the context. Let M_i be the i-th row of a matrix M. Then E(A) is defined by the rule: if A_i ∈ T_r, then E(A)_i = µ_r. The algorithmic problem is: given A, partition the vectors into sets P_r, r = 1 to k, such that there exists a permutation π on [k] so that P_i = T_{π(i)}. Let m_r = |T_r|, m_min = min_r m_r and w_min = min_r w_r = m_min/m.

Separation condition: We assume, for all r, s ∈ [k], r ≠ s,

    ‖µ_r − µ_s‖² ≥ 1632·c·k·σ² ( (1/w_min)(1 + n/m) + log m )    (1)

for some constant c.

The following can be proved easily using Chernoff-type bounds. For a sample v ∈ T_r, with high probability,

    |(v − µ_r)·(µ_s − µ_r)| ≤ (1/10) ‖µ_s − µ_r‖²    (2)

for all s ≠ r. This is a quantitative version of the fact that samples are closer to their own center than to other centers, but this particular representation helps in our analysis. Intuitively, this is the "justifiability" property of the mixture: it means that if we were given the real centers, we would be able to justify that all samples belonged to the appropriate D_r (see Figure 1). This is a property of our model, but might as well have been an assumption.
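
To make the model concrete, here is a minimal NumPy sketch (our own illustration, not part of the paper; the parameter choices are arbitrary and far below the scale the constant in (1) demands) that generates a planted instance as described above and reports the worst empirical ratio appearing in inequality (2).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy planted instance: k Bernoulli-product centers with mu_r(i) <= sigma^2.
k, n, m, sigma2 = 3, 4000, 1500, 0.5
mu = np.full((k, n), 0.05 * sigma2)            # small background probability
block = n // k
for r in range(k):
    mu[r, r * block:(r + 1) * block] = sigma2  # well-separated centers

w = np.ones(k) / k                             # mixture weights, sum to 1
sizes = (w * m).astype(int)                    # |T_r| = w_r * m
labels = np.repeat(np.arange(k), sizes)

# Each row v of A has v(i) = 1 with probability mu_r(i), independently.
A = (rng.random((labels.size, n)) < mu[labels]).astype(float)

# Inequality (2): under separation (1), this ratio is at most 1/10 w.h.p.
ratios = [abs((v - mu[r]) @ (mu[s] - mu[r])) / np.linalg.norm(mu[s] - mu[r]) ** 2
          for v, r in zip(A, labels) for s in range(k) if s != r]
print("worst |(v - mu_r).(mu_s - mu_r)| / ||mu_s - mu_r||^2 =", max(ratios))
```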

Related Work  Let us discuss two examples to further illustrate the dichotomy between discrete and continuous distributions in the literature. Two related papers, [1] and [6], employ a very natural linkage algorithm to perform the clean-up phase, and share much of their algorithms. However, these papers respectively address discrete and continuous distributions, and the difference influences the algorithms used in telling ways.


Figure 1: Justifiability: if v ∈ T_r, the projection of v − µ_r on µ_r − µ_s will be closer to µ_r than it is to µ_s.

In [1], the authors generalize high-dimensional Gaussians by adopting the notion of f-concentrated distributions. Consequently, a simple projection algorithm suffices. In [6], on the other hand, the clean-up procedure uses the "combinatorial projection" method pioneered in [12]. In [5], an attempt was made to present an algorithm that would work in a more general case. This paper assumed "limited independence", i.e. that the samples are independent, but the entries in each sample need not be. Though a generalization in some ways, the algorithm unfortunately has the same limitations as other algorithms for discrete data: its description is tied to the coordinate system, and for discrete models it is not clear that it would work for centers that are not block structured.

3  The Algorithm

Our algorithm, reduced to its essentials, is simple and natural. First, we randomly divide the rows into two equal parts, A_1 and A_2. We will use information from one part to partition the other part, as this allows us to utilize the independence of the two parts. For A_1, we will find its best rank-k approximation A_1^(k) (by computing the Singular Value Decomposition of the matrix). A greedy algorithm on A_1^(k) will give us approximately correct centers. Now a distance comparison of the rows of A_2 (the other part) to the centers thus computed will reveal the real clusters. We represent this last step as projection onto the affine span of the centers for technical reasons.

In fact, in this paper, we will replace the greedy algorithm mentioned above (and used in [12]) by a solution to the l2²-clustering problem (also known as the "k-means problem"), though the greedy algorithm would work perfectly well. The use of this approach allows us to get rid of the algorithmic randomness used in the greedy procedure. This is not a terribly important aspect of this paper.

l2² clustering problem: Given a set of vectors S = {v_1, ..., v_l} in R^d and a positive integer k, the problem is to find k points f_1, ..., f_k ∈ R^d (called "centers") so as to minimize the sum of squared distances from each vector v_i to its closest center. This defines a natural partitioning of the l points into k clusters. Quite efficient constant-factor deterministic approximation algorithms (even PTASs) are available for this problem (see [7, 8, 10] and references therein). We will simply assume that we can find an exact solution; this assumption affects the bound by at most a constant.

Algorithm 1 Cluster(A, k)
1: Randomly divide the rows of A into two equal parts A_1 and A_2
2: (θ_1, ..., θ_k) = Centers(A_1, k)
3: (ν_1, ..., ν_k) = Centers(A_2, k)
4: (P_1^1, ..., P_k^1) = Project(A_1, ν_1, ..., ν_k)
5: (P_1^2, ..., P_k^2) = Project(A_2, θ_1, ..., θ_k)
6: return (P_1^1 ∪ P_1^2, ..., P_k^1 ∪ P_k^2)

Algorithm 2 Centers(A, k)
1: Find A^(k), the best rank-k approximation of the matrix A.
2: Solve the l2² clustering problem for k centers, where
   • the input vectors are the rows of A^(k), and
   • the distance between points i and j is ‖A_i^(k) − A_j^(k)‖.
3: Let the clusters computed by the l2² algorithm be P_1, ..., P_k
4: For all r, compute µ*_r = (1/|P_r|) Σ_{v∈P_r} v
5: return (µ*_1, ..., µ*_k)

Algorithm 3 Project(A, µ*_1, ..., µ*_k)
1: For each v ∈ A
2:   For each r ∈ [k]
3:     If |(v − µ*_r)·(µ*_s − µ*_r)| < |(v − µ*_s)·(µ*_r − µ*_s)| for all s ≠ r
4:       Put v in P_r
5: return (P_1, ..., P_k)

Discussion  This discussion will probably be most useful to readers who have some familiarity with the literature in this field. Looking at the algorithm, it is reasonable to ask whether computing the centers at the end of Centers is necessary. After all, the solution to the l2²-clustering problem itself produces a number of centers around which the clusters are formed. This is an intriguing proposition, but we do not know how to prove the correctness of that algorithm. Indeed, this algorithm is close to an algorithm proposed by McSherry [12] as a candidate rotation-invariant algorithm. The difficulty in proving the correctness of both these algorithms is the same: we only seem to be able to control the ‖·‖_2 error of the centers computed, which is not enough (as pointed out in the introduction). Proving the correctness of these algorithms is an open question, and it seems that some interesting questions need to be answered along the way. Let us hazard a guess and say that the smallest eigenvalues of random matrices might prove useful in this regard.
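
For readers who prefer code, the following is a minimal NumPy sketch of Algorithms 1-3 (our own, not the paper's reference implementation). The function names are ours, an off-the-shelf k-means routine stands in for the exact l2² solver assumed in the text, and the final merge matches centers across the two halves, as in the proof of Theorem 9.

```python
import numpy as np
from sklearn.cluster import KMeans  # stand-in for the exact l2^2 solver

def centers(A, k):
    """Algorithm 2: rank-k projection of A followed by l2^2 clustering."""
    # Best rank-k approximation A^(k) via the SVD.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    # Cluster the rows of A^(k); k-means is a constant-factor surrogate.
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(Ak)
    # Empirical centers are averages of the original rows in each cluster.
    return np.array([A[labels == r].mean(axis=0) for r in range(k)])

def project(A, mu):
    """Algorithm 3: assign each row to the center that 'justifies' it."""
    k = len(mu)
    clusters = [[] for _ in range(k)]
    for v in A:
        for r in range(k):
            if all(abs((v - mu[r]) @ (mu[s] - mu[r]))
                   < abs((v - mu[s]) @ (mu[r] - mu[s]))
                   for s in range(k) if s != r):
                clusters[r].append(v)
                break
    return clusters

def cluster(A, k, seed=0):
    """Algorithm 1: split the rows, cross-compute centers, cross-project."""
    rng = np.random.default_rng(seed)
    m = A.shape[0]
    perm = rng.permutation(m)
    A1, A2 = A[perm[: m // 2]], A[perm[m // 2:]]
    theta, nu = centers(A1, k), centers(A2, k)
    P1 = project(A1, nu)     # rows of A1 judged against A2's centers
    P2 = project(A2, theta)  # and vice versa, preserving independence
    # Merge by matching each theta_r to its nearest nu_s (cf. Theorem 9).
    match = [int(np.argmin(((nu - t) ** 2).sum(axis=1))) for t in theta]
    return [P1[match[r]] + P2[r] for r in range(k)]
```

On the toy instance from the Section 2 sketch, cluster(A, k) should recover the planted partition; note that the returned clusters contain the row vectors themselves rather than their indices.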

4  Analysis

The following is well-known, and provable in many ways, from epsilon-net arguments to Talagrand's inequality:

Theorem 1  If σ² ≥ log⁶ n / n, then

    ‖A − E(A)‖² ≤ cσ²(m + n)

for some (small) constant c.

See [17] for a standard such result. The following was proved by [12]:

Lemma 2  The rank-k approximation matrix A^(k) satisfies

    ‖A^(k) − E(A)‖²_F ≤ 4ckσ²(m + n).

Proof  Omitted. □
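
As an aside, the rank-k approximation step and the Frobenius bound of Lemma 2 are easy to probe numerically. The sketch below (ours; it assumes A and the matrix of true centers E_A = mu[labels] were built as in the Section 2 sketch) reports the ratio that Lemma 2 asserts is bounded by a constant.

```python
import numpy as np

def rank_k_approx(A, k):
    """Best rank-k approximation A^(k), computed via the SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

def lemma2_ratio(A, E_A, k, sigma2):
    """||A^(k) - E(A)||_F^2 divided by k * sigma^2 * (m + n)."""
    m, n = A.shape
    Ak = rank_k_approx(A, k)
    return np.linalg.norm(Ak - E_A, 'fro') ** 2 / (k * sigma2 * (m + n))
```

Lemma 2 says this ratio is at most 4c; on the toy instance it comes out as a modest constant, while the unprojected Frobenius error ‖A − E(A)‖²_F is larger by orders of magnitude.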

Versions of the following lemma have appeared before, but we do need this somewhat subtle particular form. Its proof is deferred to the appendix.

Lemma 3  Given an instance of the clustering problem, consider any vector y such that, for some r,

    ‖y − µ_r‖² ≤ (1/9) ‖µ_r − µ_s‖²

for all s ≠ r. Let B_s = {x ∈ T_s : ‖A^(k)(x) − y‖² ≤ (1/4) ‖µ_r − µ_s‖²}, and let Σ_{x∈B_s} ‖A^(k)(x) − µ_s‖² = E_s. Then for all s ≠ r,

    |B_s| ≤ 2E_s / ‖µ_r − µ_s‖²,

and specifically,

    |B_s| ≤ 8ckσ²(m + n) / ‖µ_r − µ_s‖².

The important aspect of the previous Lemma is that the size of |B_s| goes down as ‖µ_r − µ_s‖² increases. This property will be used later. Now we prove that the clustering produced by the l2² clustering algorithm is approximately correct. The proof is provided in the Appendix.

Lemma 4  Consider a clustering P_1, P_2, ..., P_k produced by the algorithm Centers(A, k). We claim:

• Each P_r can be identified with a unique and different T_s such that

    |P_r ∩ T_s| ≥ (4/5) m_s    (3)

  Without loss of generality, we shall assume s = r.

• We define P̄_r = T_r ∩ P_r, Q'_r = P_r − P̄_r, Q'_rs = T_s ∩ Q'_r, and E_rs = Σ_{v∈Q'_rs} ‖A^(k)(v) − µ_r‖². Then for all r, s ∈ [k] such that r ≠ s,

    q'_rs = |Q'_rs| ≤ E_rs / ‖µ_r − µ_s‖² ≤ 8ckσ²(m + n) / ‖µ_r − µ_s‖²    (4)

We now focus on the centers µ*_r produced by Centers(A, k). First we would like to show that they are close to the real centers.

Lemma 5  Let µ*_1, ..., µ*_k be the centers returned by the procedure Centers(A, k). Then for all r ∈ [k],

    ‖µ*_r − µ_r‖² ≤ 81ckσ² (1/w_min)(1 + n/m) ≤ (1/20) ‖µ_r − µ_s‖²

for all s ≠ r.

Proof  We know that µ*_r = (1/p_r) Σ_{v∈P_r} v, where p_r = |P_r|, so

    p_r µ*_r = Σ_{v∈P_r} v = Σ_{v∈P̄_r} v + Σ_{s≠r} Σ_{v∈Q'_rs} v,

and hence

    p_r (µ*_r − µ_r) = Σ_{v∈P̄_r} (v − µ_r) + Σ_{s≠r} Σ_{v∈Q'_rs} (v − µ_r).

Then,

    ‖p_r (µ*_r − µ_r)‖ ≤ ‖ Σ_{v∈P̄_r} (v − µ_r) ‖ + ‖ Σ_{s≠r} Σ_{v∈Q'_rs} (v − µ_r) ‖    (5)

Let the samples in P̄_r be v^1, ..., v^{p̄_r}, where p̄_r = |P̄_r|. Define the matrix S with these samples as rows, and U the matrix of the same dimensions with all rows equal to µ_r:

    S = ( v^1 ; v^2 ; ... ; v^{p̄_r} ),   U = ( µ_r ; µ_r ; ... ; µ_r ).

Also let 1 = (1, ..., 1), with p̄_r entries. Now, by Theorem 1,

    ‖S − U‖ ≤ ‖A − E(A)‖ ≤ σ√(c(m + n))
    ⇒ ‖1(S − U)‖ ≤ ‖1‖ σ√(c(m + n)) ≤ σ√(c(m + n) p̄_r).

The previous few lines contain one of the main observations in our analysis. Though it might not be clear from the argument, the reasoning is closely related to the concept of quasi-randomness [11]. As Σ_{v∈P̄_r} (v − µ_r) = 1(S − U),

    ‖ Σ_{v∈P̄_r} (v − µ_r) ‖ ≤ σ√(c(m + n) p̄_r)    (6)

Now, for any s,

    ‖ Σ_{v∈Q'_rs} (v − µ_r) ‖ ≤ ‖ Σ_{v∈Q'_rs} (v − µ_s) ‖ + ‖ q'_rs µ_s − q'_rs µ_r ‖.

But we know by Lemma 3 that

    q'_rs ≤ 40 E_rs / ‖µ_r − µ_s‖².

Then,

    ‖q'_rs µ_s − q'_rs µ_r‖ = ( (q'_rs)² ‖µ_s − µ_r‖² )^{1/2} ≤ ( 40 q'_rs E_rs )^{1/2}    (7)

On the other hand, through an argument similar to the bound for ‖ Σ_{v∈P̄_r} (v − µ_r) ‖,

    ‖ Σ_{v∈Q'_rs} (v − µ_s) ‖ ≤ σ√(c q'_rs (m + n))    (8)

Combining equations (6)–(8),

    ‖p_r (µ*_r − µ_r)‖ ≤ σ√(c(m + n) p̄_r) + σ Σ_s √(c q'_rs (m + n)) + Σ_s √(40 q'_rs E_rs)
      ≤ σ√(ck(m + n) p_r) + √( 40 Σ_s q'_rs · Σ_s E_rs ),

using the Cauchy-Schwarz inequality a few times. But we know that Σ_s E_rs ≤ 8ckσ²(m + n) and Σ_s q'_rs ≤ (1/5) p_r (Eqn. (3) in Lemma 4). Hence,

    ‖p_r (µ*_r − µ_r)‖ ≤ σ√(ck(m + n) p_r) + 8√(ckσ² p_r (m + n)) ≤ 9σ√(ck(m + n) p_r)
    ⇒ ‖µ*_r − µ_r‖ ≤ 9σ√( ck (m + n)/p_r ) ≤ 9σ√( ck (1/w_min)(1 + n/m) ). □
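
Lemma 5 is also easy to eyeball numerically. The helper below (ours) assumes the centers routine from the sketch in Section 3 and the toy instance from Section 2; it matches each computed center to its nearest true center and returns the ratio that the lemma bounds by 1/20 under condition (1).

```python
import numpy as np

def center_error_ratio(A, mu, k):
    """max_r ||mu*_r - mu_r||^2 divided by min_{r != s} ||mu_r - mu_s||^2."""
    est = centers(A, k)   # from the sketch in Section 3
    # Match each estimated center to its nearest true center.
    err = max(min(np.linalg.norm(e - t) ** 2 for t in mu) for e in est)
    sep = min(np.linalg.norm(mu[r] - mu[s]) ** 2
              for r in range(k) for s in range(k) if r != s)
    return err / sep
```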

Finally, we would like to show that Project(A, µ*_1, ..., µ*_k) returns an accurate partitioning of the data. To this end, we claim that (v − µ*_r)·(µ*_r − µ*_t) behaves essentially like (v − µ_r)·(µ_r − µ_t).

Lemma 6  For each sample u, if u ∈ T_r, then for all t ≠ r,

    |(u − µ*_r)·(µ*_r − µ*_t)| ≤ (2/5) ‖µ_r − µ_t‖²

with high probability.

Proof  Assume µ*_r = µ_r + δ_r for all r. Then,

    (u − µ*_r)·(µ*_t − µ*_r) = (u − µ_r − δ_r)·(µ_t − µ_r − δ_r + δ_t)
      = (u − µ_r)·(µ_t − µ_r) − δ_r·(µ_t − µ_r) − δ_r·(δ_t − δ_r) + (u − µ_r)·(δ_t − δ_r).

Let us consider each term in the last sum separately. By assumption,

    |(u − µ_r)·(µ_r − µ_t)| ≤ (1/10) ‖µ_r − µ_t‖²    (9)

We have already shown (Lemma 5) that ‖δ_r‖ ≤ (1/√20) ‖µ_r − µ_t‖. Then

    |δ_r·(µ_t − µ_r)| ≤ ‖δ_r‖ ‖µ_r − µ_t‖ ≤ (1/√20) ‖µ_r − µ_t‖²    (10)
    |δ_r·(δ_t − δ_r)| ≤ (1/20) ‖µ_r − µ_t‖²    (11)

The remaining term is (u − µ_r)·(δ_r + δ_t). We prove in Lemma 8 (below) that

    |(u − µ_r)·(δ_r + δ_t)| ≤ (1/100) ‖µ_r − µ_t‖²    (12)

Combining equations (9)–(12), we get the proof. □

For the proof of Lemma 8, we will need Bernstein's inequality (see [13] for a reference):

Theorem 7  Let {X_i}_{i=1}^n be a collection of independent, almost surely bounded random variables; that is, there is a value M such that P{|X_i| ≤ M} = 1 for all i. Then, for any ε ≥ 0,

    P{ Σ_{i=1}^n (X_i − E[X_i]) ≥ ε } ≤ exp( − ε² / ( 2(θ² + Mε/3) ) ),

where θ² = Σ_i E[X_i²].

In the following, it is crucial to assume that the sample u is independent of δ_r and δ_t. We can assume this because we use centers from A_1 on samples from A_2, and vice versa, and these are independent.

Lemma 8  If u ∈ D_r is a sample independent of δ_r and δ_t, then for all t ≠ r,

    |(u − µ_r)·(δ_r + δ_t)| ≤ 15ckσ² ( (1/w_min)(1 + n/m) + log m ) ≤ (1/100) ‖µ_r − µ_t‖²

with high probability.

Proof  It suffices to prove a bound on (u − µ_r)·δ_r; the case for δ_t is similar. We have

    (u − µ_r)·δ_r = Σ_{i∈[n]} (u(i) − µ_r(i)) δ_r(i) = Σ_{i∈[n]} x(i),

where x(i) = (u(i) − µ_r(i)) δ_r(i). This is a sum of independent random variables. Note that

    ‖δ_r‖² ≤ 81ckσ² (1/w_min)(1 + n/m).

Now, x(i) has mean zero:

    E(x(i)) = E( (u(i) − µ_r(i)) δ_r(i) ) = 0.

Further, E(x(i)²) ≤ 2δ_r(i)²σ², so

    Σ_i E(x(i)²) ≤ 2σ² ‖δ_r‖² ≤ 162ckσ⁴ (1/w_min)(1 + n/m).

Also note that |x(i)| ≤ |δ_r(i)| ≤ 2σ²/w_min. This is simply because the number of 1's in a column of A can be at most 1.1mσ² (with high probability); hence |δ_r(i)| ≤ (1/p_r)·1.1mσ² ≤ 2σ²/w_min.

This is the second main observation in our analysis. Previous analyses fell through specifically because a bound on δ_r(i) was not available. We are now ready to apply Bernstein's inequality with ε = 15ckσ²((1/w_min)(1 + n/m) + log m), θ² ≤ 162ckσ⁴(1/w_min)(1 + n/m) and M ≤ 2σ²/w_min, which gives

    P{ | Σ_{i∈[n]} x(i) | ≥ 15ckσ² ( (1/w_min)(1 + n/m) + log m ) } ≤ exp( − ε² / ( 2(θ² + Mε/3) ) ) ≤ 1/m⁴. □
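
Theorem 7 itself can be sanity-checked by simulation. The snippet below (ours, with arbitrary parameters unrelated to the clustering setting) compares the empirical upper tail of a sum of bounded, zero-mean variables with the Bernstein bound.

```python
import numpy as np

rng = np.random.default_rng(1)

# Bounded, centered variables: X_i uniform on [-M, M], so |X_i| <= M surely.
n, M, eps, trials = 200, 1.0, 28.0, 50_000
theta2 = n * M ** 2 / 3                      # sum of E[X_i^2] for Uniform[-M, M]

sums = rng.uniform(-M, M, size=(trials, n)).sum(axis=1)
empirical = (sums >= eps).mean()
bernstein = np.exp(-eps ** 2 / (2 * (theta2 + M * eps / 3)))
print(f"empirical tail {empirical:.2e}  <=  Bernstein bound {bernstein:.2e}")
```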

The correctness of the algorithm follows from Lemma 6:

Theorem 9  Cluster(A, k) successfully clusters the matrix A.

Proof  It suffices to show that Project(A, µ*_1, ..., µ*_k) works; merging the clusters from the two calls of Project is easy, just by comparing the respective centers. Now let u ∈ T_r. For all t ≠ r,

    (u − µ*_r)·(µ*_t − µ*_r) + (u − µ*_t)·(µ*_r − µ*_t)
      = (µ*_r − µ*_t)·(u − µ*_t − u + µ*_r) = (µ*_r − µ*_t)·(µ*_r − µ*_t) = ‖µ*_r − µ*_t‖²,

just by manipulation. By Lemma 5,

    ‖µ*_r − µ*_t‖² ≥ 0.95 ‖µ_r − µ_t‖²
    ⇒ (u − µ*_r)·(µ*_t − µ*_r) + (u − µ*_t)·(µ*_r − µ*_t) ≥ 0.95 ‖µ_r − µ_t‖²
    ⇒ |(u − µ*_r)·(µ*_t − µ*_r)| + |(u − µ*_t)·(µ*_r − µ*_t)| ≥ 0.95 ‖µ_r − µ_t‖².

Since |(u − µ*_r)·(µ*_t − µ*_r)| ≤ 0.4 ‖µ_r − µ_t‖² by Lemma 6,

    |(u − µ*_t)·(µ*_r − µ*_t)| ≥ 0.55 ‖µ_r − µ_t‖² > |(u − µ*_r)·(µ*_t − µ*_r)|.

This proves our claim. □

5  Conclusion

To summarize, our analysis depended on two observations. First, our algorithm implies, almost directly, an l∞ norm bound on the approximate centers computed (Lemma 8). Taking this as a starting point, a quasi-randomness type argument is used to show the l2 closeness of the computed centers to the real centers (Lemma 5). The l2 and l∞ bounds then come together to complete the proof using Bernstein's inequality. As mentioned in the introduction, our analysis extends beyond the case of Bernoulli distributions. Bernstein's inequality works for subgaussian distributions [13], and similar bounds are available for vectors with limited independence as well (e.g. [15]). Given these bounds, all we need to complete the proof for these cases is a bound on the spectral norm ‖A − E(A)‖ for such distributions, which is also available (e.g. [14]). Details are deferred to the full version of the paper. A harder problem seems to be a strengthened version of the McSherry conjecture, which does not allow for cross-projections. This conjecture says: Centers(A, k) successfully partitions the data exactly. We believe this conjecture to be true.

References

[1] D. Achlioptas and F. McSherry. On spectral learning of mixtures of distributions. In Conference on Learning Theory (COLT), pages 458–469, 2005.
[2] N. Alon and N. Kahale. A spectral technique for coloring random 3-colorable graphs. SIAM J. Comput., 26(6):1733–1748, 1997.
[3] N. Alon, M. Krivelevich, and B. Sudakov. Finding a large hidden clique in a random graph. In Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 594–598, 1998.
[4] R. Boppana. Eigenvalues and graph bisection: an average case analysis. In IEEE Symposium on Foundations of Computer Science (FOCS), pages 280–285, 1987.


[5] A. Dasgupta, J. Hopcroft, R. Kannan, and P. Mitra. Spectral clustering with limited independence. In Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1036–1045, 2007.
[6] A. Dasgupta, J. Hopcroft, and F. McSherry. Spectral analysis of random graphs with skewed degree distributions. In IEEE Symposium on Foundations of Computer Science (FOCS), pages 602–610, 2004.
[7] P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay. Clustering in large graphs and matrices. In Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 291–299, 1999.
[8] K. Jain and V. Vazirani. Approximation algorithms for metric facility location and median problems using the primal-dual schema and Lagrangian relaxation. J. ACM, 48(2):274–296, 2001.
[9] R. Kannan, H. Salmasian, and S. Vempala. The spectral method for general mixture models. In Conference on Learning Theory (COLT), pages 444–457, 2005.
[10] T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Wu. A local search approximation algorithm for k-means clustering. Comput. Geom., 28(2-3):89–112, 2004.
[11] M. Krivelevich and B. Sudakov. More Sets, Graphs and Numbers, chapter Pseudo-random graphs, pages 199–262. Springer, 2006.
[12] F. McSherry. Spectral partitioning of random graphs. In IEEE Symposium on Foundations of Computer Science (FOCS), pages 529–537, 2001.
[13] V. Petrov. Sums of independent random variables. Springer, 1975.
[14] M. Rudelson. Random vectors in the isotropic position. J. of Functional Analysis, 168(1):60–72, 1999.
[15] J. Schmidt, A. Siegel, and A. Srinivasan. Chernoff-Hoeffding bounds for applications with limited independence. SIAM J. Discret. Math., 8(2):223–250, 1995.
[16] S. Vempala and G. Wang. A spectral algorithm for learning mixture models. Journal of Computer and System Sciences, 68(4):841–860, 2004.

[17] V. Vu. Spectral norm of random matrices. In ACM Symposium on Theory of Computing (STOC), pages 619–626, 2005.

Appendix

Proof of Lemma 4  First we claim that there is a solution to the l2² clustering problem whose cost is at most 8ckσ²(m + n). Set the centers to be f_r = µ_r for all r, and let the cost of this solution be C. Now,

    C ≤ Σ_r Σ_{A^(k)(i)∈T_r} ‖A^(k)(i) − µ_r‖² = ‖A^(k) − E(A)‖²_F ≤ 8ckσ²(m + n)

by Lemma 2.

Remark  The solution calls for f_r to be in the span of the vectors in A^(k), and µ_r might not be. But this simply strengthens the result.

Accordingly, the l2² clustering step in Centers(A, k) gives us a solution with cost no more than 8ckσ²(m + n). We claim that for each P_r the center f_r will be such that, for all s ≠ r,

    ‖µ_r − f_r‖² ≤ (1/9) ‖µ_r − µ_s‖²    (13)

If this is not true for some r, then the cost of the solution is at least

    Σ_{A^(k)(i)∈T_r} ‖A^(k)(i) − f_r‖² = Σ_{A^(k)(i)∈T_r} ‖(A^(k)(i) − µ_r) + (µ_r − f_r)‖²
      ≥ Σ_{A^(k)(i)∈T_r} ( (1/4) ‖f_r − µ_r‖² − 3 ‖A^(k)(i) − µ_r‖² )
      ≥ (m_r/36) · 1632ckσ² ( (1/w_min)(1 + n/m) + log m ) − 24ckσ²(m + n) ≥ 9ckσ²(m + n),

which is a contradiction. We used the bound ‖u + v‖² ≥ (1/4)‖u‖² − 3‖v‖² in the third line. Assuming (13), Lemma 3 implies (4). Equation (3) follows from essentially the same argument. □

Proof of Lemma 3  From the conditions of the Lemma,

    ‖y − µ_s‖ = ‖y − µ_r + µ_r − µ_s‖ ≥ ‖µ_r − µ_s‖ − ‖y − µ_r‖ ≥ 0.66 ‖µ_r − µ_s‖.

Now, for each x ∈ B_s,

    ‖A^(k)(x) − µ_s‖ ≥ ‖y − µ_s‖ − ‖A^(k)(x) − y‖ ≥ 0.16 ‖µ_r − µ_s‖.

By assumption, Σ_{x∈B_s} ‖A^(k)(x) − µ_s‖² = E_s, so

    0.0256 ‖µ_r − µ_s‖² |B_s| ≤ E_s
    ⇒ |B_s| ≤ 40 E_s / ‖µ_r − µ_s‖².

Now, as

    E_s = Σ_{x∈B_s} ‖A^(k)(x) − µ_s‖² ≤ ‖A^(k) − E(A)‖²_F ≤ 8ckσ²(m + n),

we get

    |B_s| ≤ 320ckσ²(m + n) / ‖µ_r − µ_s‖².

This proves the Lemma. □
