A summary of

Incremental Learning of Nonparametric Bayesian Mixture Models
Conference on Computer Vision and Pattern Recognition (CVPR) 2008

Ryan Gomes (Caltech), Piero Perona (Caltech), Max Welling (UCI)

Motivation
• Unsupervised learning with very large datasets
• Requirements:
  • A model of evolving complexity
  • Limits on space and time

Overview: Estimate Model

[Figure: 2D scatter plots of data from Document 1 and Document 2 with the current mixture-model estimate.]

Overview: Estimate Model → Compression

[Figure: after estimation, the Document 1 and Document 2 data are compressed.]

Overview: Estimate Model → Compression → Get more data, estimate model

[Figure: new data arrive from Document 3 and Document 4; the model is re-estimated from the new points together with the compressed summaries of Documents 1 and 2.]

Overview: Estimate Model → Compression → Get more data, estimate model → Compression

[Figure: the compression step is applied again, so Documents 1-4 are all represented compactly.]

Overview: Model Building Phase (Estimate Model)

The model is estimated with the SMEM (split-and-merge EM) algorithm ("SMEM Algorithm for Mixture Models", N. Ueda, 1999):

1. Rank candidate splits and merges
2. Try the 10 best splits
3. Try the 10 best merges
4. Perform the best split or merge
5. Repeat steps 2-4 until the free energy converges

[Figure: Document 1 and Document 2 scatter plots with the fitted mixture components.]
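As a rough illustration of that loop (not the authors' code), here is a Python skeleton in which the ranking heuristics, the partial refits, and the free-energy evaluation are supplied as callables, since they depend on the specific mixture model:

```python
def smem(model, data, rank_splits, rank_merges, try_split, try_merge,
         free_energy, n_candidates=10, tol=1e-6):
    """Split-and-merge EM outer loop (after Ueda et al., 1999), treating the
    free energy as a lower bound to be increased. All helpers are hypothetical
    callables; try_split / try_merge return (candidate_model, its_free_energy)."""
    best = free_energy(model, data)
    improved = True
    while improved:                                               # 5. repeat until no move helps
        improved = False
        candidates = []
        for k in rank_splits(model, data)[:n_candidates]:         # 2. best-ranked splits
            candidates.append(try_split(model, data, k))
        for pair in rank_merges(model, data)[:n_candidates]:      # 3. best-ranked merges
            candidates.append(try_merge(model, data, pair))
        new_model, new_f = max(candidates, key=lambda c: c[1])    # 4. single best move
        if new_f > best + tol:
            model, best, improved = new_model, new_f, True
    return model
```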

Overview: Compression Phase

1. Hard-cluster the data
2. Find the best cluster to split
3. Split that cluster
4. Repeat steps 2-3 until the memory constraint is reached
5. Create clumps and delete the data points

[Figure: Document 1 and Document 2 data points replaced by clump summaries.]
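A rough, self-contained NumPy sketch of this phase under simplifying assumptions of my own: clusters are split greedily by size along their widest coordinate (the paper ranks splits by the change in free energy and splits along principal components, as detailed later), and the memory bound is expressed directly as a maximum number of clumps:

```python
import numpy as np

def compress(X, resp, max_clumps):
    """X: (n, d) data; resp: (n, K) responsibilities; max_clumps: memory budget
    expressed as a clump count. Returns (count, mean) summaries, one per clump."""
    labels = resp.argmax(axis=1)                          # 1. hard-cluster the data
    clusters = [X[labels == k] for k in np.unique(labels)]
    while len(clusters) < max_clumps:                     # 2-4. split until the budget is met
        clusters.sort(key=len)
        big = clusters.pop()                              # "best" cluster = largest (simplification)
        dim = big.var(axis=0).argmax()                    # widest coordinate
        cut = np.median(big[:, dim])
        left, right = big[big[:, dim] <= cut], big[big[:, dim] > cut]
        if len(left) == 0 or len(right) == 0:             # cannot split further
            clusters.append(big)
            break
        clusters += [left, right]
    return [(len(c), c.mean(axis=0)) for c in clusters]   # 5. clumps replace the raw points
```

In the actual algorithm each clump would also keep second-order statistics and per-document counts (see the Memory Cost slide later).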

Overview: Model Building Phase (Get more data, estimate model)

[Figure: Documents 3 and 4 arrive; the model is re-estimated from the new data together with the clumps retained from Documents 1 and 2.]

Overview: Compression Phase (repeated)

[Figure: after re-estimation, the data from Documents 3 and 4 are compressed into clumps as well.]

Technical details
• Dirichlet mixture model
  • Define variables
• Model Building
  • Inheriting clump constraints
  • Constrained free energy
• Compression Phase
  • Top-down clustering
  • Memory cost computation

Topic Model

Joint probability:

p(x, z, η, π, α) = ∏_ij p(x_ij | z_ij; η) π_j,z_ij · ∏_k p(η_k | β) G(α_k; a, b) · ∏_j D(π_j; α)

where p(x_ij | z_ij; η) is the observation model, π_j,z_ij is the mixture weight, p(η_k | β) is the prior on the mixture components, G(α_k; a, b) is the Gamma prior on the Dirichlet hyperparameter, and D(π_j; α) is the Dirichlet topic mixture prior.

Variables:
• x_ij : word i in document j
• z_ij : topic assignment variable for word i in document j
• η_k : parameter for topic k
• π_j : mixture of topics for document j (with Dirichlet prior)
• α : topic mixture prior parameter (with Gamma priors)
• β : topic prior hyperparameter
• a, b : Gamma prior hyperparameters

Overview (recap): Model Building Phase

[Figure: the full incremental loop (estimate model, compression, get more data and estimate model, compression) shown on the Document 1-4 scatter plots, with the Model Building Phase highlighted.]

Variational inference (general formulation)

Let X be the observed variables and W the hidden variables. The log marginal data likelihood L(X) is bounded from below by the free energy B(X):

L(X) ≥ B(X),   B(X) = Σ_W q(W) log [ p(W, X) / q(W) ]

where q(W) is the variational distribution (next page).

Mean-field variational approximation (generic truncated version):

q(ν*, η*, z) = ∏_{t=1..T} q_γt(ν_t) × ∏_{t=1..T} q_τt(η_t) × ∏_{n=1..N} q_φn(z_n)

• q_γt(ν_t): stick lengths; Beta distributions, in this paper q_γt(ν_t) = Beta(γ_t,1, γ_t,2)
• q_τt(η_t): mixture components; Normal-Inverse-Wishart distributions
• q_φn(z_n): responsibilities (topic assignments); multinomial distributions

Recall that the stick lengths are used to compute the mixing weights:

π_i(ν*) = ν_i ∏_{j=1..i-1} (1 - ν_j)
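A small runnable NumPy illustration of this stick-breaking construction (my own, not from the paper):

```python
import numpy as np

def mixing_weights(v):
    """pi_i = v_i * prod_{j < i} (1 - v_j) for stick lengths v_1..v_T."""
    v = np.asarray(v, dtype=float)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))  # stick left before step i
    return v * remaining

print(mixing_weights([0.5, 0.5, 0.5]))   # [0.5, 0.25, 0.125]; leftover mass stays in the tail
```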

Derivation of the lower bound (DP mixture model):

log p(x; α, λ) = log ∫∫ Σ_z p(ν* | α) p(η* | λ) ∏_{n=1..N} p(x_n | η_z_n) p(z_n | ν*) dν* dη*

where p(ν* | α) is the Beta prior on the mixture weights (stick lengths), p(η* | λ) the Normal-Inverse-Wishart prior on the mixture components, p(x_n | η_z_n) the Gaussian observation model, and p(z_n | ν*) the multinomial mixture (topic) weight. Applying Jensen's inequality with the variational distribution q(ν*, η*, z) gives

log p(x; α, λ) ≥ E_q(ν*,η*,z)[ log p(ν* | α) p(η* | λ) ∏_{n=1..N} p(x_n | η_z_n) p(z_n | ν*) ] + H[q(ν*, η*, z)]

where H[q] is the entropy of the variational distribution.

Free Energy (unconstrained, using all data points):

F = Σ_{t=1..T} E_qγt[ log ( q_γt(ν_t) / p(ν_t | α) ) ]
  + Σ_{t=1..T} E_qτt[ log ( q_τt(η_t) / p(η_t | λ) ) ]
  + Σ_{n=1..N} E_qφn[ log ( q_φn(z_n) / ( p(x_n | η_z_n) p(z_n | ν*) ) ) ]

• N: total number of data points
• T: number of topics/latent states (T for "truncation")

"Clumps"

Data points assigned to the same clump share a single responsibility: if x_ij and x_i'j' are in clump c, then q(z_ij) = q(z_i'j') = q(z_c).

[Figure: Document 1 and Document 2 scatter plots with points grouped into clumps.]

Key assumption: p(x_ij | η_z_ij) is in the exponential family, with conjugate prior p(η_k | β).
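Because the observation model is exponential-family, each clump can be summarized by its size and the averaged sufficient statistics, the ⟨F(x)⟩_s that appear in the update equations below. A minimal NumPy sketch for the Gaussian case (illustrative names of my own):

```python
import numpy as np

def clump_stats(points):
    """Summarize a clump by its size n_s and the averaged Gaussian sufficient
    statistics: the mean of x and the mean of the outer product x x^T."""
    X = np.asarray(points, dtype=float)
    n_s = X.shape[0]
    mean_x = X.mean(axis=0)                                   # <x>_s
    mean_xxT = (X[:, :, None] * X[:, None, :]).mean(axis=0)   # <x x^T>_s
    return n_s, mean_x, mean_xxT

n_s, m, S = clump_stats([[0.9, 1.1], [1.1, 0.9]])  # two nearby points, one clump summary
```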

Constrained Free Energy ("lower bound on the lower bound"):

F_C = - Σ_{k=1..K} KL( q_γk(ν_k) || p(ν_k | α) )
      - Σ_{k=1..K} KL( q_τk(η_k) || p(η_k | λ) )
      + (N/T) Σ_s n_s log Σ_{k=1..K} exp(S_sk)

• s: clump index
• n_s: number of data points represented by clump s
• N/T: data multiplier
• Change in notation: T → K (K is now the total number of mixture components)
• N and T have new meanings here: N is the number of data points expected in the future, T the number of data points seen so far

Update equations:

γ_k,1 = α_1 + (N/T) Σ_s n_s q(z_s = k)
γ_k,2 = α_2 + (N/T) Σ_s n_s Σ_{j=k+1..K} q(z_s = j)
τ_k,1 = λ_1 + (N/T) Σ_s n_s q(z_s = k) ⟨F(x)⟩_s
τ_k,2 = λ_2 + (N/T) Σ_s n_s q(z_s = k)

q(z_s = k) = exp(S_sk) / Σ_{j=1..K} exp(S_sj)

S_sk = E_q(V, φ_k) log { p(z_s = k | V) p(⟨F(x)⟩_s | φ_k) }

A generic procedure for computing the variational parameter updates is given in [Blei & Jordan 2006].
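To make the role of the clump counts n_s and the N/T multiplier concrete, here is a hedged NumPy sketch of the γ updates (the τ updates follow the same weighted-count pattern; the function and argument names are mine, not the paper's):

```python
import numpy as np

def update_gamma(resp, n_s, alpha, N, T):
    """Stick-breaking Beta parameters (gamma_k1, gamma_k2).

    resp  : (S, K) array of clump responsibilities q(z_s = k)
    n_s   : (S,) array, points represented by each clump
    alpha : (alpha_1, alpha_2) Beta prior parameters
    N, T  : expected future data size and current data size (multiplier N/T)
    """
    weighted = (N / T) * n_s[:, None] * resp       # weighted responsibilities, shape (S, K)
    counts = weighted.sum(axis=0)                  # expected mass assigned to component k
    tail = np.cumsum(counts[::-1])[::-1] - counts  # mass assigned to components j > k
    return alpha[0] + counts, alpha[1] + tail
```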

Overview (recap): Compression Phase

[Figure: the incremental loop again (estimate model, compression, get more data and estimate model, compression) on the Document 1-4 scatter plots, with the Compression Phase highlighted.]

Compression Phase: Top-down clustering

1. Hard-cluster the clumps (max responsibility)
2. For each cluster C_i:
   a. Split C_i along its principal component
   b. Update parameters locally
   c. Cache the change in free energy
3. Accept the split with the maximal change in free energy. If memory cost (MC) < memory bound (M), repeat from step 2; otherwise:
   a. set the clumps
   b. delete the data points
   c. return
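A minimal NumPy sketch of step 2a, splitting a cluster along its first principal component (illustrative only; the local parameter update and free-energy bookkeeping of steps 2b-2c are omitted):

```python
import numpy as np

def split_along_pc(points):
    """Split a cluster into two halves on either side of its mean
    along the leading principal component."""
    X = np.asarray(points, dtype=float)
    centered = X - X.mean(axis=0)
    _, vecs = np.linalg.eigh(np.cov(centered, rowvar=False))  # eigenvectors, ascending order
    side = centered @ vecs[:, -1] > 0                         # project onto the leading one
    return X[side], X[~side]

left, right = split_along_pc(np.random.randn(100, 2) * [3.0, 0.5])
```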

Memory Cost

Per-clump cost for d-dimensional Gaussian statistics: (d^2 - d)/2 values for the off-diagonal half of the covariance matrix, plus d diagonal elements, plus d for the mean, giving (d^2 + 3d)/2, in addition to the number of points the clump draws from each document. Singlets are individual data points that are kept uncompressed.
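A tiny helper (my own, not from the paper) that spells out this count for a d-dimensional clump:

```python
def clump_cost(d, n_documents=1):
    """Floats stored per clump: (d^2 - d)/2 off-diagonal covariance entries,
    d diagonal entries, d mean entries, plus one point count per document."""
    return (d * d - d) // 2 + d + d + n_documents

print(clump_cost(20))   # 20-dimensional features as in the Caltech 256 experiment: 231
```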

Experiments
• Caltech 256 image dataset
• Corel image database
• Caltech 101 "Faces Easy" dataset

Caltech 256: p(η* | λ) is a Normal-Inverse-Wishart prior on the mixture components (the conjugate prior of the multivariate Gaussian). Features: kernel PCA with a spatial pyramid match kernel; each image is reduced to a 20-dimensional vector.

Corel dataset: vector quantization of 7x7 patches; 30,000 data points.
• Kurihara's accelerated inference reaches the memory limit
• The incremental algorithm processes the data in 4 hours

Caltech 101 results

[Figure: comparison of the clump-based method against the baseline (labels "clumps", "baseline", "30%", "100%").]

References
• Gomes, Welling, Perona. Incremental Learning of Nonparametric Bayesian Mixture Models. CVPR 2008.
• Gomes, Welling, Perona. Memory Bounded Inference in Topic Models. ICML 2008.
• Blei, Jordan. Variational Inference for Dirichlet Process Mixtures. Bayesian Analysis, 2006.
• Kurihara, Welling, Vlassis. Accelerated Variational Dirichlet Process Mixtures. 2006.

[Backup slides: Gamma distribution, Beta distribution, Dirichlet distribution.]
