Parallel Spectral Clustering Algorithm for Large-Scale Community Data Mining

Gengxin Miao (1), Yangqiu Song (2), Dong Zhang (3) and Hongjie Bai (3)

(1) Department of ECE, UCSB
(2) Department of Automation, Tsinghua
(3) Google Research China

Apr. 22, 2008


Outline
* Brief Introduction to Spectral Clustering
* Parallel Spectral Clustering
* Experiments
* Future Works


Brief Introduction to Spectral Clustering


Spectral Clustering and Eigenvalue Decomposition

Normalized graph cut:

$$J_{NCut}(A, B) = \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(A, V)} + \frac{\mathrm{cut}(B, A)}{\mathrm{assoc}(B, V)}$$

Corresponding relaxed problem:

$$f_L^{*} = \arg\min_{f^T f_0 = 0} \frac{f^T (D - S) f}{f^T D f}$$

where $S$ is the similarity matrix defined on the graph, $D$ is the degree matrix satisfying $D_{ii} = \sum_{j=1}^{n} S_{ij}$, and $L = D - S$ is the graph Laplacian.
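As a small worked example of these quantities, the sketch below evaluates the normalized cut of two candidate partitions of a hand-made 4-node graph and builds the degree matrix and graph Laplacian. The similarity values and the helper name `ncut` are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def ncut(S, in_A):
    """Normalized cut of the partition (A, B) for a symmetric similarity matrix S."""
    in_A = np.asarray(in_A, dtype=bool)
    in_B = ~in_A
    cut = S[np.ix_(in_A, in_B)].sum()   # cut(A, B): total weight crossing the partition
    assoc_A = S[in_A, :].sum()          # assoc(A, V): weight from A to all nodes
    assoc_B = S[in_B, :].sum()          # assoc(B, V): weight from B to all nodes
    return cut / assoc_A + cut / assoc_B

# Hypothetical graph: nodes {0, 1} and {2, 3} are tightly linked, with one weak bridge.
S = np.array([[0.0, 1.0, 0.1, 0.0],
              [1.0, 0.0, 0.0, 0.0],
              [0.1, 0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])

D = np.diag(S.sum(axis=1))  # degree matrix, D_ii = sum_j S_ij
L = D - S                   # graph Laplacian

good = ncut(S, [True, True, False, False])   # cuts only the weak bridge
bad = ncut(S, [True, False, True, False])    # cuts both strong edges
print(good, bad)
```

The partition that cuts only the weak bridge scores far lower than the one that separates the tightly linked pairs, which is the behaviour the criterion is designed to reward.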


Spectral Clustering and Eigenvalue Decomposition (Cont’d)

The optimization problem can be solved approximately by generalized eigenvalue decomposition (EVD), which requires solving the eigenvalue problem of the normalized graph Laplacian $\bar{L} = I - D^{-1/2} S D^{-1/2}$.

* Two-class problem: analyze the eigenvector corresponding to the second smallest eigenvalue.
* Multi-class problem: use k-means to cluster the points in the first k dimensions of the eigenspace.
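The two-class case can be sketched in a few lines on a single machine. This is a minimal dense illustration using NumPy's `eigh`, not the PARPACK-based implementation discussed in these slides; the function name, the example graph, and the choice to threshold the eigenvector at zero are assumptions.

```python
import numpy as np

def two_way_spectral_split(S):
    """Split a connected graph in two using the normalized graph Laplacian.

    S must be a symmetric similarity matrix with strictly positive row sums.
    """
    d = S.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # L_bar = I - D^{-1/2} S D^{-1/2}
    L_bar = np.eye(len(S)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L_bar)  # eigenvalues in ascending order
    v = eigvecs[:, 1]                         # eigenvector of the second smallest eigenvalue
    f = d_inv_sqrt * v                        # back to the generalized problem (D - S) f = lambda D f
    return f > 0                              # assumed rule: threshold at zero

# Hypothetical graph: two triangles joined by one weak edge (2, 3).
S = np.zeros((6, 6))
for group in ([0, 1, 2], [3, 4, 5]):
    for i in group:
        for j in group:
            if i != j:
                S[i, j] = 1.0
S[2, 3] = S[3, 2] = 0.05

labels = two_way_spectral_split(S)
print(labels)
```

For the multi-class case, the same eigenvectors (the first k columns of `eigvecs`) would be stacked as an n × k matrix and its rows clustered with k-means.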


Large Scale Problem of EVD

Consider 1 million data points:
* A 100-nearest-neighbor graph requires 10^6 × 100 × 8 bytes = 800 MB of memory to store the similarity matrix.
* Clustering the data into 1000 clusters requires 10^6 × 1000 × 8 bytes = 8 GB of memory to store the eigenvectors.
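The arithmetic above can be reproduced directly. The sketch assumes the factor 8 stands for 8-byte (double-precision) values and, like the slide, ignores any index overhead of a sparse storage format; the helper name `dense_bytes` is hypothetical.

```python
def dense_bytes(n_rows, n_cols, bytes_per_value=8):
    """Bytes needed to hold n_rows x n_cols values of the given width."""
    return n_rows * n_cols * bytes_per_value

n = 10**6
similarity_bytes = dense_bytes(n, 100)     # 100 nearest neighbours per data point
eigenvector_bytes = dense_bytes(n, 1000)   # one eigenvector per cluster

print(similarity_bytes / 10**6, "MB")   # 800.0 MB
print(eigenvector_bytes / 10**9, "GB")  # 8.0 GB
```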


Parallel Spectral Clustering


* Parallel computing of the EVD problem
* Parallel k-means (see Inderjit S. Dhillon and Dharmendra S. Modha, 2000)
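To make the parallel k-means component concrete, here is a single-process sketch of one data-parallel iteration: each shard computes per-cluster sums and counts for its own points, and the totals are then combined, which is the role a sum all-reduce would play across machines. It is written in the spirit of data-parallel k-means and is not claimed to match the cited algorithm in every detail; the function names and the shard layout are assumptions.

```python
import numpy as np

def local_stats(X_local, centroids):
    """One machine's contribution: per-cluster coordinate sums and point counts."""
    k, dim = centroids.shape
    # squared distance from every local point to every centroid
    dist = ((X_local[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dist.argmin(axis=1)
    sums = np.zeros((k, dim))
    counts = np.zeros(k)
    np.add.at(sums, labels, X_local)   # accumulate coordinates per assigned cluster
    np.add.at(counts, labels, 1)       # count points per assigned cluster
    return sums, counts

def parallel_kmeans_step(shards, centroids):
    """One iteration: local statistics per shard, then a global sum over shards."""
    stats = [local_stats(X, centroids) for X in shards]
    total_sums = sum(s for s, _ in stats)      # stand-in for a sum all-reduce
    total_counts = sum(c for _, c in stats)
    new_centroids = centroids.copy()           # empty clusters keep their old centroid
    nonempty = total_counts > 0
    new_centroids[nonempty] = total_sums[nonempty] / total_counts[nonempty, None]
    return new_centroids

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
start = X[:3].copy()
serial = parallel_kmeans_step([X], start)
sharded = parallel_kmeans_step([X[0::3], X[1::3], X[2::3]], start)
```

Because only sums and counts are exchanged, the sharded update matches the single-machine update up to floating-point rounding, regardless of how the points are distributed.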


Parallel Computing of EVD Problem

EVD review
* Multi-core and shared-memory EVD (F. Luk, 1985; E. Jessup and D. Sorensen, 1994)
* Distributed-memory EVD (K. Maschhoff and D. Sorensen, 1996; T. Konda et al., 2006)

We choose PARPACK (K. Maschhoff and D. Sorensen, 1996) because
* ARPACK is a very commonly used tool for EVD
* It easily handles sparse matrices
* It is not limited by the memory of a single machine


Parallel Computing of EVD Problem (Cont’d)

PARPACK requires a user-provided matrix-vector multiplication routine.
* The similarity matrix is sharded in a round-robin fashion.
* Without loss of generality, we show the matrix as blockwise partitioned.
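A single-process simulation may help picture the sharding. The sketch assumes that rows are the unit being distributed (the slide does not say whether rows or columns are sharded) and that row i is assigned to machine i mod p; both function names are hypothetical.

```python
import numpy as np

def round_robin_rows(n_rows, n_machines):
    """Row indices owned by each machine: row i goes to machine i mod n_machines."""
    return [np.arange(m, n_rows, n_machines) for m in range(n_machines)]

def distributed_matvec(S, v, n_machines):
    """Simulate y = S v where each machine multiplies only the rows it owns."""
    y = np.empty(S.shape[0])
    for rows in round_robin_rows(S.shape[0], n_machines):
        y[rows] = S[rows, :] @ v   # local product computed on one machine
    return y

rng = np.random.default_rng(1)
S = rng.random((7, 7))
v = rng.random(7)
y = distributed_matvec(S, v, n_machines=3)
shards = round_robin_rows(7, 3)
```

Round-robin assignment spreads rows evenly even when row density varies with position in the matrix; relabelling the rows turns each machine's share into a contiguous block, which is why a blockwise picture loses no generality.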


Parallel Computing of EVD Problem (Cont’d)

* Distributed matrix-vector multiplication
* Obtain the whole eigenvector through the message passing interface (MPI)
* “AllReduce” and “Allgather” are two different mechanisms for collecting data from the local machines

[Figure: panels (a) and (b); images not recovered]
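The difference between the two collection mechanisms can be illustrated without MPI. In this sketch, which pairs each mechanism with a matrix layout purely as an assumption, row blocks yield disjoint slices of the result that are concatenated (the all-gather pattern), while column blocks yield full-length partial results that are summed (the all-reduce pattern). The function names are hypothetical and plain NumPy operations stand in for the MPI collectives.

```python
import numpy as np

def matvec_allgather(S, v, n_machines):
    """Row blocks: each machine produces a disjoint slice of y; slices are concatenated."""
    row_blocks = np.array_split(np.arange(S.shape[0]), n_machines)
    pieces = [S[rows, :] @ v for rows in row_blocks]   # local results, about n/p values each
    return np.concatenate(pieces)                      # what an all-gather assembles

def matvec_allreduce(S, v, n_machines):
    """Column blocks: each machine produces a full-length partial y; partials are summed."""
    col_blocks = np.array_split(np.arange(S.shape[1]), n_machines)
    partials = [S[:, cols] @ v[cols] for cols in col_blocks]  # local results, n values each
    return np.sum(partials, axis=0)                           # what a sum all-reduce computes

rng = np.random.default_rng(2)
S = rng.random((8, 8))
v = rng.random(8)
y_gather = matvec_allgather(S, v, n_machines=3)
y_reduce = matvec_allreduce(S, v, n_machines=3)
```

Both routes give the same vector. In this sketch the gather route has each machine contribute roughly n/p values, whereas the reduce route has each contribute a full-length vector that must also be summed.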


Experiments


Speedup Results

Results using MPI “AllReduce” to get the whole vectors (shown in the workshop paper)


Speedup Results (Cont’d)

Results using MPI “Allgather” to get the whole vectors (further investigation, work in progress)

Table: #class=100, #Arnoldi space=200, #data=600K.

Machines  EVD time (sec.)  EVD speedup  k-means time (sec.)  k-means speedup
1         6.822 × 10^3     1            5.041 × 10^2         1
2         3.904 × 10^3     1.75         2.782 × 10^2         1.81
5         1.789 × 10^3     3.81         1.166 × 10^2         4.32
10        1.033 × 10^3     6.60         7.082 × 10^1         7.12
20        6.103 × 10^2     11.12        4.099 × 10^1         12.30
50        5.118 × 10^2     13.33        2.672 × 10^1         18.87
100       6.580 × 10^2     10.37        2.349 × 10^1         21.46
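The speedup column is the single-machine time divided by the time on p machines. The short sketch below recomputes it from the EVD times reported for the 100-class experiment and adds parallel efficiency (speedup divided by machine count), a derived quantity that is not on the slide. Recomputed values may differ from the reported ones in the last digit because the times are rounded.

```python
machines = [1, 2, 5, 10, 20, 50, 100]
evd_time = [6.822e3, 3.904e3, 1.789e3, 1.033e3, 6.103e2, 5.118e2, 6.580e2]  # seconds, from the table

base = evd_time[0]
speedup = [base / t for t in evd_time]                    # relative to one machine
efficiency = [s / m for s, m in zip(speedup, machines)]   # fraction of ideal linear speedup

for m, s, e in zip(machines, speedup, efficiency):
    print(f"{m:>4} machines: speedup {s:6.2f}, efficiency {e:5.2f}")
```

The EVD time rises again between 50 and 100 machines, so the speedup peaks at 50 machines while efficiency falls steadily as machines are added.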


Speedup Results (Cont’d)

Results using MPI “Allgather” to get the whole vectors (further investigation, work in progress)

Table: #class=1000, #Arnoldi space=2000, #data=600K. No single-machine result is reported, so speedup is measured relative to the 2-machine run.

Machines  EVD time (sec.)  EVD speedup  k-means time (sec.)  k-means speedup
1         −                −            −                    −
2         8.465 × 10^4     1            3.609 × 10^4         1
5         4.046 × 10^4     2.09         1.488 × 10^4         2.43
10        1.667 × 10^4     5.08         7.354 × 10^3         4.91
20        8.626 × 10^3     9.81         3.941 × 10^3         9.16
50        5.199 × 10^3     16.28        1.907 × 10^3         18.93
100       4.697 × 10^3     18.02        1.087 × 10^3         33.20
200       4.897 × 10^3     17.29        9.031 × 10^2         39.96


Experiments on Orkut Data

* Orkut is an Internet social network service run by Google
* Since October 2006, Orkut has had more than 50 million users and 20 million communities
* We used Orkut’s user-by-community co-occurrence data
* Data pre-processing
  * Filter out non-English-language communities
  * Remove communities containing few users
  * Finally obtain 151,973 communities with more than 10 million users involved


Experiments on Orkut Data (Cont’d)

Table: Cluster examples

Sample Cluster 1: Cars
Community ID  Community title
22527         Honda CBR
287892        Mercedes-Benz
35054         Valentino Rossi
5557228       Pulsar Lovers
2562120       Top Speed Drivers
19680305      The Art of DriftIng!!!!!!
3348657       I Love Driving
726519        Luxury & Sports Cars
2806166       Hero Honda Karizma
1162256       Toyota Supra

Sample Cluster 2: Food
Community ID  Community title
622109        Seafood Lovers
20876960      Gol gappe
948798        I LOVE ICECREAM ...YUMMY!!!!!!
1614793       Bounty
1063561       Old Monk Rum
970273        Fast Food Lovers
14378632      Maggi Lovers
973612        Kerala Sadya -Mind Blowing
16537390      Baskin-Robbins Icecream Lovers
1047220       Oreo Freax!!

Sample Cluster 3: Education
Community ID  Community title
15284191      Bhatia Commerce Classes
7349400       Inderprastha Engineering Cllge
1255346       CCS University Meerut
13922619      Visions - SIES college fest
2847251       Rizvi College of Engg., Bandra
6386593       Seedling public school, jaipur
4154          Pennsylvania State University
15549415      N.M. College, Mumbai
1179183       Institute of Hotel Management
18963916      I Love Sleeping In Class

Sample Cluster 4: Pets, animals and wildlife
Community ID  Community title
18341         Tigers
245877        German shepherd
40739         Naughty dogs
11782689      We Love Street Dogs
29527         Animal welfare
370617        Lion
11577         Arabian horses
2875608       Wildlife Conservation
12522409      I Care For Animals
1527302       I hate cockroaches


Future Works


* More efficient implementation of the parallel spectral clustering algorithm
* Good evaluation criterion for Orkut data clustering results
* Other data to demonstrate the effectiveness of the parallel spectral clustering algorithm


Thanks very much!
