Posterior Probabilistic Clustering using NMF

Chris Ding, Dijun Luo
CSE Department, University of Texas at Arlington
[email protected], [email protected]

Tao Li, Wei Peng
School of Computer Science, Florida International University
[email protected], [email protected]

ABSTRACT

We introduce posterior probabilistic clustering (PPC), which provides a rigorous posterior probability interpretation for Nonnegative Matrix Factorization (NMF) and removes the uncertainty in cluster assignment. Furthermore, PPC is closely related to probabilistic latent semantic indexing (PLSI).

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Clustering; I.2 [Artificial Intelligence]: Learning

General Terms
Algorithms, Experimentation, Measurement, Performance, Theory

Keywords
Sparse, Posterior Probabilistic Clustering, NMF

1. INTRODUCTION

Non-negative Matrix Factorization (NMF) [4] has recently been applied to document clustering with considerable success [5, 1]. However, in standard NMF clustering the cluster assignment is rather ad hoc, and the matrix factors lack clear interpretations. In this work, we introduce posterior probabilistic clustering (PPC), which has three benefits: (1) it provides a rigorous posterior probability interpretation for both matrix factors F, G in the factorization of the input X \approx F G^T; (2) it removes the uncertainty in cluster assignment; (3) when we perform simultaneous word and document clustering, the new model has a very close relation to probabilistic latent semantic indexing (PLSI) [3]: in PLSI, F and G are class-conditional probabilities, while in PPC they are class posterior probabilities.

2. STANDARD NMF CLUSTERING

Suppose we have n documents and m words (terms). Let X = (X_{ij}) be the word-to-document matrix: X_{ij} = X(w_i, d_j) is the frequency of word w_i in document d_j. Standard NMF solves

    \min_{F \ge 0,\, G \ge 0} \|X - F G^T\|^2,    (1)

where X has size m x n, F has size m x K, and G has size n x K. Once a solution (F^*, G^*) is computed, the standard approach is to assign d_j to the cluster C_k where

    k = \arg\max_k (G^*_{j1}, \dots, G^*_{jK}),    (2)

i.e., the largest element of the j-th row of G^*. There is a fundamental problem with this approach: the solution to NMF is not unique. For an arbitrary positive diagonal matrix D = \mathrm{diag}(d_1, \dots, d_K), we have

    F^* G^{*T} = (F^* D^{-1})(G^* D)^T,

i.e., (F^* D^{-1}, G^* D) is also an optimal solution. Thus the cluster assignment becomes

    k = \arg\max_k (G^*_{j1} d_1, \dots, G^*_{jK} d_K).    (3)

A different choice of D leads to a different cluster assignment. An ad hoc remedy is to choose D such that the columns of F have unit length in the L2 norm.

3. POSTERIOR PROBABILITY

In this work, we present a principled way to resolve this problem, based on a posterior probability interpretation of G. Indeed, Eq. (3) suggests that, roughly speaking, (G^*_{j1} d_1, \dots, G^*_{jK} d_K) is the vector of posterior probabilities that d_j belongs to the K clusters. We would therefore like to choose D such that

    G^*_{j1} d_1 + \cdots + G^*_{jK} d_K = 1, \quad j = 1, \dots, n.

This requirement has no solution in general: there are n constraints but only K variables, and K is much smaller than n. Therefore, in standard NMF there is no way to enforce the posterior probability normalization.
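To make the ambiguity of Eq. (3) concrete, here is a minimal sketch (ours, not the authors' code) assuming numpy and scikit-learn's NMF, which minimizes the same Frobenius objective as Eq. (1):

```python
# A minimal sketch (not the paper's code) showing why argmax cluster
# assignment on the raw NMF factor G is ill-defined: rescaling by any
# positive diagonal D leaves the product F G^T unchanged but can flip
# the per-document argmax.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((20, 30))          # toy word-by-document matrix (m=20, n=30)

model = NMF(n_components=3, init="random", random_state=0, max_iter=500)
F = model.fit_transform(X)        # m x K
G = model.components_.T           # n x K, so X ~ F @ G.T

labels = G.argmax(axis=1)         # standard (ad hoc) assignment, Eq. (2)

d = np.array([0.2, 1.0, 5.0])     # arbitrary positive diagonal D
F2, G2 = F / d, G * d             # (F D^{-1}, G D): same product
assert np.allclose(F @ G.T, F2 @ G2.T)

labels2 = G2.argmax(axis=1)       # assignment under rescaling, Eq. (3)
print("assignments changed for", (labels != labels2).sum(), "documents")
```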


4. POSTERIOR PROBABILISTIC CLUSTERING

In our approach, we enforce the posterior probability normalization directly. Posterior probabilistic clustering solves

    \min_{F \ge 0,\, G \ge 0} \|X - F G^T\|^2, \quad \text{s.t. } \sum_{k=1}^K G_{jk} = 1.    (4)

Using Lagrangian multipliers to enforce the constraints, we derive the following updating rules for this problem:

    G_{ik} \leftarrow G_{ik} \frac{(X^T F)_{ik} + (G F^T F G^T)_{ii}}{(G F^T F)_{ik} + (X^T F G^T)_{ii}},    (5)

    F_{ik} \leftarrow F_{ik} \frac{(X G)_{ik}}{(F G^T G)_{ik}}.    (6)

The correctness and convergence of these rules can be proved rigorously. In the updating process, the constraints should be enforced periodically.
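As a concrete illustration of updates (5)-(6), the following numpy sketch (our illustration, not the authors' released code) applies the multiplicative rules and renormalizes the rows of G every few iterations, as the text suggests; eps guards against division by zero:

```python
# Minimal numpy sketch of the PPC updates (5)-(6).
import numpy as np

def ppc(X, K, n_iter=200, seed=0, eps=1e-10):
    rng = np.random.default_rng(seed)
    m, n = X.shape
    F = rng.random((m, K)) + 0.1
    G = rng.random((n, K)) + 0.1
    G /= G.sum(axis=1, keepdims=True)          # start feasible
    for it in range(n_iter):
        FtF = F.T @ F                          # K x K
        # Eq. (5): the diagonal terms come from the Lagrangian;
        # diag(A B^T)_ii is computed as row-wise sums of A * B.
        num = X.T @ F + ((G @ FtF) * G).sum(axis=1, keepdims=True)
        den = G @ FtF + ((X.T @ F) * G).sum(axis=1, keepdims=True)
        G *= num / (den + eps)
        # Eq. (6): plain multiplicative update for F
        F *= (X @ G) / (F @ (G.T @ G) + eps)
        if it % 10 == 0:                       # enforce constraint periodically
            G /= G.sum(axis=1, keepdims=True) + eps
    G /= G.sum(axis=1, keepdims=True) + eps
    return F, G
```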

5. SIMULTANEOUS WORD AND DOCUMENT CLUSTERING (SPPC)

We generalize PPC to simultaneous word and document clustering. We use F as the posterior probability for word clustering, with the posterior probability normalization \sum_{k=1}^K F_{ik} = 1. The simultaneous PPC (SPPC) problem becomes

    \min_{F \ge 0,\, S \ge 0,\, G \ge 0} \|X - F S G^T\|^2, \quad \text{s.t. } \sum_{k=1}^K F_{ik} = 1, \ \sum_{k=1}^K G_{jk} = 1.    (7)

We derive the updating algorithm as follows. Let \tilde{F} = F S and \tilde{G} = G S^T. The updating rules are

    G_{ik} \leftarrow G_{ik} \frac{(X^T \tilde{F})_{ik} + (G \tilde{F}^T \tilde{F} G^T)_{ii}}{(G \tilde{F}^T \tilde{F})_{ik} + (X^T \tilde{F} G^T)_{ii}},    (8)

    F_{ik} \leftarrow F_{ik} \frac{(X \tilde{G})_{ik} + (F \tilde{G}^T \tilde{G} F^T)_{ii}}{(F \tilde{G}^T \tilde{G})_{ik} + (X \tilde{G} F^T)_{ii}},    (9)

    S_{kk} \leftarrow S_{kk} \frac{(F^T X G)_{kk}}{(F^T F S G^T G)_{kk}}.    (10)

We initialize F and G from the K-means clustering results on words and on documents: if F_0 and G_0 are the corresponding cluster indicator matrices, we set F = F_0 + 0.2 and G = G_0 + 0.2.
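Below is a numpy sketch of one way to implement updates (8)-(10) together with this initialization. It is our illustration, not the authors' released code; it assumes scikit-learn's KMeans for the initial indicators and keeps S diagonal, since Eq. (10) only updates the diagonal entries:

```python
# Sketch of the SPPC updates (8)-(10) with K-means initialization.
import numpy as np
from sklearn.cluster import KMeans

def indicators(labels, K):
    """0/1 cluster indicator matrix from a label vector."""
    B = np.zeros((len(labels), K))
    B[np.arange(len(labels)), labels] = 1.0
    return B

def sppc(X, K, n_iter=200, eps=1e-10):
    # F0, G0: K-means indicators on words (rows) and documents (columns),
    # smoothed as F = F0 + 0.2, G = G0 + 0.2 as in the paper.
    F = indicators(KMeans(K, n_init=10).fit_predict(X), K) + 0.2
    G = indicators(KMeans(K, n_init=10).fit_predict(X.T), K) + 0.2
    S = np.eye(K)
    for it in range(n_iter):
        Ft, Gt = F @ S, G @ S.T                      # F~ = FS, G~ = GS^T
        # Eq. (8)
        num = X.T @ Ft + ((G @ (Ft.T @ Ft)) * G).sum(1, keepdims=True)
        den = G @ (Ft.T @ Ft) + ((X.T @ Ft) * G).sum(1, keepdims=True)
        G *= num / (den + eps)
        # Eq. (9)
        num = X @ Gt + ((F @ (Gt.T @ Gt)) * F).sum(1, keepdims=True)
        den = F @ (Gt.T @ Gt) + ((X @ Gt) * F).sum(1, keepdims=True)
        F *= num / (den + eps)
        # Eq. (10): update only the diagonal of S
        Sd = np.diag(S) * np.diag(F.T @ X @ G) \
             / (np.diag(F.T @ F @ S @ (G.T @ G)) + eps)
        S = np.diag(Sd)
        if it % 10 == 0:                             # enforce row normalizations
            F /= F.sum(1, keepdims=True) + eps
            G /= G.sum(1, keepdims=True) + eps
    return F, S, G
```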



5.1 Relation to PLSI

In PLSI, we view the word-document matrix X as the joint probability of words and documents. [We rescale the term frequencies by X_{ij} \leftarrow X_{ij}/T_w, where T_w = \sum_{ij} X_{ij}; with this, \sum_{ij} X_{ij} = 1.] The joint occurrence probability is p(w_i, d_j) = X_{ij}, and PLSI decomposes it into a product of class-conditional probabilities:

    X_{ij} \approx \sum_k P(\mathrm{word}_i|\mathrm{class}_k)\, P(\mathrm{class}_k)\, P(\mathrm{doc}_j|\mathrm{class}_k).

Let F_{ik} = P(word_i|class_k), S_{kk} = P(class_k), and G_{jk} = P(doc_j|class_k). The PLSI optimization problem is then

    \min_{F \ge 0,\, S,\, G \ge 0} \mathrm{Dist}(X, F S G^T), \quad \text{s.t. } \sum_{i=1}^m F_{ik} = 1, \ \sum_{j=1}^n G_{jk} = 1.    (11)

Therefore, our SPPC is quite similar to PLSI, except that SPPC uses a different normalization: \sum_{k=1}^K G_{jk} = 1 and \sum_{k=1}^K F_{ik} = 1. In other words, SPPC treats G_{jk} and F_{ik} as posterior probabilities, whereas PLSI treats them as class-conditional probabilities. Note that in PLSI the probabilities of a document belonging to the different classes need not sum to one: \sum_{k=1}^K G_{jk} = \sum_k P(doc_j|class_k) \ne 1 in general. Intuitively, for clustering we would like this total probability to add up to 1; this deficiency is removed in SPPC.
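The two conventions are interchangeable up to diagonal rescalings that leave the product F S G^T unchanged; the following numpy sketch (our illustration, not from the paper) makes this concrete:

```python
# SPPC-style factors have rows of F and G summing to 1 (posteriors);
# PLSI-style factors have columns summing to 1 (class-conditionals).
# Diagonal rescalings move between them without changing F S G^T.
import numpy as np

rng = np.random.default_rng(1)
F = rng.random((6, 2)); F /= F.sum(1, keepdims=True)   # rows sum to 1
G = rng.random((9, 2)); G /= G.sum(1, keepdims=True)
S = np.diag(rng.random(2))

df, dg = F.sum(0), G.sum(0)                # column masses
F_plsi, G_plsi = F / df, G / dg            # columns now sum to 1
S_plsi = np.diag(df) @ S @ np.diag(dg)     # absorb rescaling into S

assert np.allclose(F @ S @ G.T, F_plsi @ S_plsi @ G_plsi.T)
```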

5.2 An Illustrative Example

We give a simple example to illustrate the PPC and SPPC results. The data matrix X is given below. By inspection, the first 3 columns (documents) belong to one cluster and the last 4 columns to another; likewise, the first 3 rows (words) belong to one cluster and the last 2 rows to another. The resulting F and G recover this clustering correctly.

X =
  0.185  0.326  0.761  2.799  2.375  2.970  2.585
  0.508  0.380  0.884  2.134  2.374  2.342  2.524
  0.452  0.887  0.457  2.065  2.484  2.253  2.163
  1.486  1.843  1.858  0.566  0.103  0.417  0.269
  1.496  1.806  1.610  0.612  0.158  0.560  0.784

F^T_PPC =
  0.007  0.011  0.012  0.037  0.035
  0.068  0.059  0.056  0.000  0.005

G^T_PPC =
  0.997  0.999  0.923  0.237  0.165  0.164  0.205
  0.003  0.001  0.077  0.763  0.835  0.836  0.795

F^T_SPPC =
  0.039  0.170  0.208  0.981  0.909
  0.961  0.830  0.792  0.019  0.091

G^T_SPPC =
  0.967  0.969  0.887  0.154  0.078  0.076  0.120
  0.033  0.031  0.113  0.846  0.922  0.924  0.880

Each column of G^T (a document's posterior) sums to one in both PPC and SPPC, and in SPPC each column of F^T (a word's posterior) sums to one as well, as required by the normalizations.
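To reproduce this example, one can run the ppc() sketch from Section 4 on X; this is our illustration, and the exact recovered values depend on initialization:

```python
# Running the ppc() sketch from Section 4 on the example matrix.
import numpy as np

X = np.array([
    [0.185, 0.326, 0.761, 2.799, 2.375, 2.970, 2.585],
    [0.508, 0.380, 0.884, 2.134, 2.374, 2.342, 2.524],
    [0.452, 0.887, 0.457, 2.065, 2.484, 2.253, 2.163],
    [1.486, 1.843, 1.858, 0.566, 0.103, 0.417, 0.269],
    [1.496, 1.806, 1.610, 0.612, 0.158, 0.560, 0.784],
])
F, G = ppc(X, K=2)          # ppc() as sketched in Section 4
print(G.argmax(axis=1))     # expect documents {1,2,3} vs {4,...,7}
```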

6. EXPERIMENTS

We compare the clustering performance of each method on five real-life datasets: CSTR, WebKB4, Log, Reuters, and WebAce. More details on these datasets can be found in [2]. We use accuracy as the performance measure. The experimental results are shown in Table 1. Overall, SPPC performs slightly better than NMF and PLSI.

    Datasets   K-Means   NMF      PLSI     SPPC
    CSTR       0.4256    0.5713   0.587    0.5945
    WebKB4     0.3888    0.4418   0.503    0.4411
    Log        0.6876    0.7805   0.778    0.7915
    Reuters    0.4448    0.4947   0.4870   0.5648
    WebAce     0.4001    0.4761   0.4890   0.4953

Table 1: Clustering results. Shown is the accuracy of each method on each dataset.
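Clustering accuracy matches predicted clusters to true class labels under the best one-to-one correspondence. The paper does not spell out its exact procedure; a common implementation (our sketch, assuming scipy's Hungarian solver) is:

```python
# Our sketch of the usual clustering-accuracy measure: best one-to-one
# matching of cluster IDs to class labels via the Hungarian algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    K = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((K, K), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[t, p] += 1                        # confusion counts
    rows, cols = linear_sum_assignment(-cost)  # maximize matched counts
    return cost[rows, cols].sum() / len(y_true)
```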



7. REFERENCES

[1] C. Ding, X. He, and H. D. Simon. On the equivalence of nonnegative matrix factorization and spectral clustering. In Proc. SIAM Data Mining Conf., 2005.
[2] C. Ding, T. Li, W. Peng, and H. Park. Orthogonal nonnegative matrix t-factorizations for clustering. In KDD'06, pages 126-135, 2006.
[3] T. Hofmann. Probabilistic latent semantic analysis. In ACM SIGIR-99, pages 289-296, 1999.
[4] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, volume 13.
[5] W. Xu, X. Liu, and Y. Gong. Document clustering based on non-negative matrix factorization. In Proc. ACM SIGIR, pages 267-273, Toronto, Canada, 2003.
