Dirichlet Process

Sara Wade
University of Cambridge

Charles University 8-19 April 2013, Prague


Categorical and multinomial distributions

Consider a discrete random variable $X$ taking one of $k$ possible outcomes. Among $n$ independent and identical trials, let $n_j = \sum_{i=1}^{n} 1(x_i = j)$. Note that $n = \sum_{j=1}^{k} n_j$.

The distribution of $X_i$ is given by the categorical distribution, parametrized by $p = (p_1, \ldots, p_k)$ such that $\sum_{j=1}^{k} p_j = 1$, where

$$p(x \mid p) = p_1^{1(x=1)} \cdots p_k^{1(x=k)}.$$

The probability of observing counts $(n_1, \ldots, n_k)$ is given by the multinomial distribution, where

$$p(n_1, \ldots, n_k \mid p) = \frac{n!}{n_1! \cdots n_k!} \prod_{j=1}^{k} p_j^{n_j}.$$

Ex. $(n_1, \ldots, n_k)$ is the frequency of words in a text.
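As a small illustration of the counts $n_j$ and the multinomial formula, the following Python sketch tallies outcomes and evaluates the probability of the resulting counts. The function names and the example data are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter

def category_counts(xs, k):
    """Return [n_1, ..., n_k], where n_j is the number of times outcome j occurs in xs."""
    tally = Counter(xs)
    return [tally.get(j, 0) for j in range(1, k + 1)]

def multinomial_pmf(counts, p):
    """Evaluate n! / (n_1! ... n_k!) * prod_j p_j^{n_j}."""
    coef = math.factorial(sum(counts))
    for n_j in counts:
        coef //= math.factorial(n_j)  # exact integer division at every step
    prob = float(coef)
    for n_j, p_j in zip(counts, p):
        prob *= p_j ** n_j
    return prob

xs = [1, 3, 3, 2, 3]               # hypothetical outcomes with k = 3
counts = category_counts(xs, 3)
print(counts, multinomial_pmf(counts, [0.2, 0.3, 0.5]))
```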

Dirichlet distribution

The Dirichlet distribution is defined on $S_k = \{(p_1, \ldots, p_k) : p_j \ge 0, \sum_{j=1}^{k} p_j = 1\}$ with density

$$p((p_1, \ldots, p_k) \mid (\alpha_1, \ldots, \alpha_k)) = \frac{\Gamma\left(\sum_{j=1}^{k} \alpha_j\right)}{\prod_{j=1}^{k} \Gamma(\alpha_j)} \prod_{j=1}^{k} p_j^{\alpha_j - 1}.$$

It is the conjugate prior to the multinomial likelihood.

Parameters: $(\alpha_1, \ldots, \alpha_k)$ such that $\alpha_j > 0$, are often reparametrized as

$$\alpha = \sum_{j=1}^{k} \alpha_j; \qquad p_0 = (p_{01}, \ldots, p_{0k}) = \left(\frac{\alpha_1}{\alpha}, \ldots, \frac{\alpha_k}{\alpha}\right).$$

Properties:

$E[p_j] = p_{0j}$ ← prior guess.

$V(p_j) = \frac{p_{0j}(1 - p_{0j})}{\alpha + 1}$ ← $\alpha$ controls the variability.
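The reparametrization and the two moment formulas can be checked numerically. The sketch below is illustrative only; `dirichlet_moments` is a hypothetical helper name, not part of any library.

```python
def dirichlet_moments(alphas):
    """Return (alpha, p0, variances) for a Dirichlet with parameters alphas."""
    alpha = sum(alphas)                               # precision
    p0 = [a / alpha for a in alphas]                  # prior guess E[p_j]
    var = [m * (1.0 - m) / (alpha + 1.0) for m in p0]  # V(p_j)
    return alpha, p0, var

print(dirichlet_moments([1.0, 2.0, 1.0]))
```

Scaling all $\alpha_j$ by a common factor leaves $p_0$ unchanged and shrinks every variance, which is the sense in which $\alpha$ controls the variability.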

Dirichlet densities from Wikipedia


Connections with other distributions

1. If $z_j \overset{ind}{\sim} \mathrm{Gam}(\alpha_j, 1)$, then

$$(p_1, \ldots, p_k) \overset{d}{=} \left(\frac{z_1}{\sum_{j=1}^{k} z_j}, \ldots, \frac{z_k}{\sum_{j=1}^{k} z_j}\right).$$

This property is used to simulate from a Dirichlet distribution.

2. If $v_j \overset{ind}{\sim} \mathrm{Beta}\left(\alpha_j, \sum_{j' > j} \alpha_{j'}\right)$ for $j = 1, \ldots, k - 1$ and $v_k$ is degenerate at 1, then

$$(p_1, \ldots, p_k) \overset{d}{=} \left(v_1, \; v_2 (1 - v_1), \; \ldots, \; v_k \prod_{j < k} (1 - v_j)\right).$$

Note that $v_j = \frac{p_j}{1 - \sum_{j' < j} p_{j'}}$.
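Both representations translate directly into samplers. The following is a minimal sketch using only Python's standard `random` module; the function names are hypothetical.

```python
import random

def dirichlet_via_gamma(alphas, rng):
    """Property 1: normalize independent Gam(alpha_j, 1) draws."""
    z = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(z)
    return [z_j / total for z_j in z]

def dirichlet_via_sticks(alphas, rng):
    """Property 2: v_j ~ Beta(alpha_j, sum of the later alphas), with v_k = 1."""
    p, remaining = [], 1.0
    for j in range(len(alphas) - 1):
        v = rng.betavariate(alphas[j], sum(alphas[j + 1:]))
        p.append(v * remaining)   # p_j = v_j * prod_{j' < j} (1 - v_j')
        remaining *= 1.0 - v
    p.append(remaining)           # v_k = 1 takes whatever mass is left
    return p

rng = random.Random(0)
print(dirichlet_via_gamma([2.0, 3.0, 5.0], rng))
print(dirichlet_via_sticks([2.0, 3.0, 5.0], rng))
```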

Symmetric Dirichlet distribution

The symmetric Dirichlet distribution is defined with $p_{0j} = \frac{1}{k}$ for $j = 1, \ldots, k$.

[Figure: three panels, for alpha = 0.1, alpha = 1, and alpha = 10, each showing probabilities between 0.0 and 1.0 over the categories 1 to 10.]

Densities p drawn at random from a symmetric Dirichlet distribution with various precision parameters.

Posterior and Predictive

Categorical model: $X_i \mid p \overset{iid}{\sim} \mathrm{Cat}(p)$.

Dirichlet prior: $p \sim \mathrm{Dir}(\alpha p_0)$.

→ Leads to a Dirichlet posterior $p \mid x \sim \mathrm{Dir}(\hat{\alpha} \hat{p})$, where

$$\hat{\alpha} = \alpha + n; \qquad \hat{p}_j = \hat{\alpha}^{-1} (\alpha p_{0j} + n_j).$$

→ Leads to a categorical predictive with $p(X_{n+1} = j \mid x) = \hat{p}_j$.
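The conjugate update amounts to adding the observed counts to the prior pseudo-counts. A minimal sketch, with a hypothetical function name and made-up numbers:

```python
def dirichlet_posterior(alpha, p0, counts):
    """Return (alpha_hat, p_hat) for a Dir(alpha * p0) prior and observed counts n_j."""
    alpha_hat = alpha + sum(counts)
    p_hat = [(alpha * p0_j + n_j) / alpha_hat for p0_j, n_j in zip(p0, counts)]
    return alpha_hat, p_hat

# Prior Dir(2 * (0.5, 0.5)); observed counts n = (3, 1).
alpha_hat, p_hat = dirichlet_posterior(2.0, [0.5, 0.5], [3, 1])
print(alpha_hat, p_hat)   # p_hat is also the predictive p(X_{n+1} = j | x)
```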

Pólya urn scheme

The Pólya urn scheme describes the distribution of a sequence of random variables $\{X_n\}_{n \in \mathbb{N}}$ taking values in $\{1, \ldots, k\}$.

Consider an urn with $\alpha p_{0j}$ balls of color $j$ for $j = 1, \ldots, k$. A ball is drawn from the urn and replaced along with another ball of the same color. The random variable $X_n$ is set to $j$ if the $n$th ball drawn is of color $j$.

Formally, the law of $\{X_n\}_{n \in \mathbb{N}}$ is given by

$$P(X_1 = j) = p_{0j}, \qquad P(X_{n+1} = j \mid x) = \frac{\alpha p_{0j} + n_j}{\alpha + n} \quad \text{for } n \ge 1,$$

where $x = (x_1, \ldots, x_n)$.
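The urn can be simulated by keeping a running count per color. This is an illustrative sketch with a hypothetical function name; note that $\alpha p_{0j}$ need not be an integer, so the "balls" are treated as weights.

```python
import random

def polya_urn(alpha, p0, n, rng):
    """Draw X_1, ..., X_n (labels 1..k) from the Polya urn with initial weights alpha * p0."""
    k = len(p0)
    counts = [0] * k
    draws = []
    for _ in range(n):
        # Current urn content: alpha * p0_j initial mass plus n_j added balls of color j.
        weights = [alpha * p0[j] + counts[j] for j in range(k)]
        j = rng.choices(range(k), weights=weights)[0]
        counts[j] += 1
        draws.append(j + 1)
    return draws

print(polya_urn(1.0, [0.5, 0.3, 0.2], 20, random.Random(0)))
```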

Exchangeability and De Finetti's Theorem

The sequence of random variables $\{X_n\}_{n \in \mathbb{N}}$ taking values in $\{1, \ldots, k\}$ is exchangeable if for any $n$ and permutation $\pi$ of $\{1, \ldots, n\}$

$$P(X_1 = j_1, \ldots, X_n = j_n) = P(X_{\pi(1)} = j_1, \ldots, X_{\pi(n)} = j_n),$$

for any $j_i \in \{1, \ldots, k\}$.

Theorem (De Finetti's Theorem) A sequence of random variables $\{X_n\}_{n \in \mathbb{N}}$ taking values in $\{1, \ldots, k\}$ is exchangeable if and only if there exists a unique probability measure $Q$ on $S_k$ such that for any $n$ and any $j_i \in \{1, \ldots, k\}$,

$$P(X_1 = j_1, \ldots, X_n = j_n) = \int_{S_k} \prod_{i=1}^{n} p_{j_i} \, dQ(p).$$

Pólya urn scheme and the Dirichlet distribution

If the distribution of $\{X_n\}_{n \in \mathbb{N}}$ is described by the Pólya urn scheme, then $\{X_n\}_{n \in \mathbb{N}}$ is exchangeable.

If $X_i \mid p$ have categorical distribution and $p \sim \mathrm{Dir}(\alpha p_0)$, then the marginal distribution of $\{X_n\}_{n \in \mathbb{N}}$ is described by the Pólya urn scheme.

The distribution of $\{X_n\}_{n \in \mathbb{N}}$ is described by the Pólya urn scheme if and only if $X_i \mid p \overset{iid}{\sim} \mathrm{Cat}(p)$ and $p \sim \mathrm{Dir}(\alpha p_0)$.
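Exchangeability of the urn sequence can be verified numerically for a small case by multiplying the one-step conditional probabilities $(\alpha p_{0j} + n_j)/(\alpha + n)$ and comparing across reorderings. The helper name and the numbers below are hypothetical.

```python
from itertools import permutations

def polya_joint_prob(seq, alpha, p0):
    """P(X_1 = seq[0], ..., X_n = seq[n-1]) under the Polya urn (labels 1..k)."""
    counts = [0] * len(p0)
    prob = 1.0
    for n, j in enumerate(seq):
        prob *= (alpha * p0[j - 1] + counts[j - 1]) / (alpha + n)
        counts[j - 1] += 1
    return prob

alpha, p0 = 2.0, [0.5, 0.3, 0.2]
probs = {perm: polya_joint_prob(perm, alpha, p0) for perm in permutations((1, 1, 2, 3))}
# Every reordering of the same outcomes has the same probability.
print(max(probs.values()) - min(probs.values()))
```

The joint probability depends on the sequence only through the counts $(n_1, \ldots, n_k)$, which is why the spread across permutations is zero up to rounding.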

Dirichlet Process

The Dirichlet process is an extension of the Dirichlet distribution on the space of probability measures on $\{1, \ldots, k\}$ to the space of probability measures on a complete and separable metric space $\mathcal{X}$.

Let $\mathcal{P}(\mathcal{X})$ denote the set of probability measures on $\mathcal{X}$, equipped with the Borel σ-algebra under weak convergence.

Definition. $P$ has a Dirichlet process prior with parameters $\alpha > 0$ and $P_0 \in \mathcal{P}(\mathcal{X})$, denoted $\mathrm{DP}(\alpha P_0)$, if for any finite measurable partition $(B_1, \ldots, B_m)$,

$$(P(B_1), \ldots, P(B_m)) \sim \mathrm{Dir}(\alpha P_0(B_1), \ldots, \alpha P_0(B_m)).$$

Parameters:

the base measure $P_0$ is the prior guess, $E[P(B)] = P_0(B)$;

the precision parameter $\alpha$ controls the variability, $V(P(B)) = \frac{P_0(B)(1 - P_0(B))}{\alpha + 1}$.

Existence of the DP

Marginal property of the Dirichlet distribution: let $p \sim \mathrm{Dir}(\alpha p_0)$, let $B_1, \ldots, B_m$ be a partition of $\{1, \ldots, k\}$, and let $p(B_i) = \sum_{j \in B_i} p_j$. Then

$$(p(B_1), \ldots, p(B_m)) \sim \mathrm{Dir}(\alpha p_0(B_1), \ldots, \alpha p_0(B_m)),$$

where $p_0(B_i) = \sum_{j \in B_i} p_{0j}$.

The marginal property of the Dirichlet distribution is a key property in showing existence of the Dirichlet process.
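In terms of parameters, the marginal property simply sums the Dirichlet parameters within each block. The sketch below computes the parameters of the aggregated vector; the function name and example partition are hypothetical.

```python
def aggregate_dirichlet(alpha, p0, partition):
    """Dirichlet parameters of (p(B_1), ..., p(B_m)) when p ~ Dir(alpha * p0).

    partition is a list of blocks, each a list of labels in {1, ..., k}.
    """
    labels = sorted(j for block in partition for j in block)
    if labels != list(range(1, len(p0) + 1)):
        raise ValueError("blocks must form a partition of {1, ..., k}")
    return [alpha * sum(p0[j - 1] for j in block) for block in partition]

# k = 4 categories merged into two blocks {1, 2} and {3, 4}.
print(aggregate_dirichlet(10.0, [0.1, 0.2, 0.3, 0.4], [[1, 2], [3, 4]]))
```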

Stick-breaking construction

Theorem (Sethuraman (1994)) $P \sim \mathrm{DP}(\alpha P_0)$ is characterized by the stick-breaking construction

$$P = \sum_{j=1}^{\infty} p_j \delta_{\theta_j},$$

where $\theta_j \overset{iid}{\sim} P_0$,

$$p_1 = v_1; \qquad p_j = v_j \prod_{j' < j} (1 - v_{j'}) \quad \text{for } j > 1,$$

and $v_j \overset{iid}{\sim} \mathrm{Beta}(1, \alpha)$ independent of $(\theta_j)$.

Notice: If $P \sim \mathrm{DP}(\alpha P_0)$, $P$ is discrete a.s.
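The construction suggests a direct way to simulate an approximate draw of $P$: break sticks until the leftover mass is negligible. The sketch below makes two assumptions that are not in the theorem: the infinite sum is truncated once the remaining mass falls below a tolerance `tol`, and $P_0$ is taken to be N(0, 1) in the usage lines. All names are hypothetical.

```python
import random

def stick_breaking_dp(alpha, base_sampler, rng, tol=1e-8, max_atoms=10_000):
    """Return (weights, atoms, leftover) for a truncated draw from DP(alpha * P0)."""
    weights, atoms, remaining = [], [], 1.0
    while remaining > tol and len(weights) < max_atoms:
        v = rng.betavariate(1.0, alpha)        # v_j ~ Beta(1, alpha)
        weights.append(v * remaining)          # p_j = v_j * prod_{j' < j} (1 - v_j')
        atoms.append(base_sampler(rng))        # theta_j ~ P0
        remaining *= 1.0 - v
    return weights, atoms, remaining

rng = random.Random(1)
weights, atoms, leftover = stick_breaking_dp(1.0, lambda r: r.gauss(0.0, 1.0), rng)
print(len(weights), leftover)
```

Small values of $\alpha$ put most of the mass on the first few atoms, while large values spread it over many atoms.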

DP prior samples

[Figure: three panels, for alpha = 0.1, alpha = 1, and alpha = 10, each plotting the weights p (0.0 to 1.0) against the atom locations theta (−3 to 3).]

Random draws of $P \sim \mathrm{DP}(\alpha \mathrm{N}(0, 1))$ with various precision parameters. Simulation is based on the stick-breaking construction.

Posterior and Predictive

Model: $X_i \mid P \overset{iid}{\sim} P$.

Dirichlet process prior: $P \sim \mathrm{DP}(\alpha P_0)$.

→ Leads to a Dirichlet process posterior $P \mid x \sim \mathrm{DP}(\hat{\alpha} \hat{P})$, where

$$\hat{\alpha} = \alpha + n; \qquad \hat{P} = \hat{\alpha}^{-1} \left(\alpha P_0 + \sum_{i=1}^{n} \delta_{x_i}\right).$$

→ Predictive distribution is $P(X_{n+1} \in B \mid x) = \hat{P}(B)$, for any Borel set $B \subseteq \mathcal{X}$.
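For a concrete case, the predictive probability of an interval can be computed in closed form. The sketch below assumes $P_0 = \mathrm{N}(0, 1)$ and $B = (lo, hi]$; both are illustrative choices, and the function names are hypothetical.

```python
import math

def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def dp_predictive_interval(alpha, data, lo, hi):
    """P(X_{n+1} in (lo, hi] | x) = (alpha * P0(B) + #{x_i in B}) / (alpha + n)."""
    prior_mass = normal_cdf(hi) - normal_cdf(lo)      # P0(B) under N(0, 1)
    hits = sum(1 for x in data if lo < x <= hi)       # sum_i delta_{x_i}(B)
    return (alpha * prior_mass + hits) / (alpha + len(data))

data = [0.5, 0.5, 2.0]   # hypothetical observations, with a tie
print(dp_predictive_interval(1.0, data, 0.0, 1.0))
```

The result is a weighted average of the prior guess $P_0(B)$ and the empirical frequency of $B$, with weights $\alpha/(\alpha+n)$ and $n/(\alpha+n)$.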

Blackwell and MacQueen urn scheme

The Blackwell and MacQueen urn scheme describes the distribution of a sequence of random variables $\{X_n\}_{n \in \mathbb{N}}$ taking values in $\mathcal{X}$.

Consider an urn with $\alpha$ black balls.

Step 1: a black ball is drawn from the urn, and once drawn, its true color is revealed as $\theta_1^*$ from $P_0$; it is replaced along with a black ball.

Step $n + 1$: a ball is drawn from the urn. If the ball is black, once drawn, its true color is revealed as $\theta_{k_n + 1}^*$ from $P_0$, and it is replaced along with a black ball. Otherwise, it is of color $\theta_j^*$ for some $j = 1, \ldots, k_n$, and it is replaced along with another ball of the same color. Here $k_n$ denotes the number of black balls drawn among the first $n$ draws.

We set $X_n = \theta_j^*$ if the $n$th ball drawn is of color $\theta_j^*$.

Formally, the law of $\{X_n\}_{n \in \mathbb{N}}$ is given by

$$P(X_1 \in B) = P_0(B), \qquad P(X_{n+1} \in B \mid x) = \frac{\alpha P_0(B) + \sum_{i=1}^{n} \delta_{x_i}(B)}{\alpha + n} \quad \text{for } n \ge 1,$$

for any Borel set $B \subseteq \mathcal{X}$.
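The urn can be simulated without ever representing $P$: each new value is either a fresh draw from $P_0$ or a copy of a uniformly chosen earlier value. This is a minimal sketch with hypothetical names, using N(0, 1) as an illustrative base measure.

```python
import random

def blackwell_macqueen(alpha, base_sampler, n, rng):
    """Draw X_1, ..., X_n from the Blackwell and MacQueen urn with base sampler for P0."""
    xs = []
    for i in range(n):
        # After i draws, a black ball is picked with probability alpha / (alpha + i).
        if rng.random() < alpha / (alpha + i):
            xs.append(base_sampler(rng))   # new color, drawn from P0
        else:
            xs.append(rng.choice(xs))      # copy an earlier draw; each has weight 1
    return xs

xs = blackwell_macqueen(1.0, lambda r: r.gauss(0.0, 1.0), 30, random.Random(2))
print(len(set(xs)), "distinct values among", len(xs))
```

Choosing uniformly among earlier draws reproduces the weights $n_j/(\alpha + n)$, since a value that has appeared $n_j$ times occupies $n_j$ positions in the list.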

Exchangeability and De Finetti's Theorem

The sequence of random variables $\{X_n\}_{n \in \mathbb{N}}$ is exchangeable if for any $n$ and permutation $\pi$ of $\{1, \ldots, n\}$

$$P(X_1 \in B_1, \ldots, X_n \in B_n) = P(X_{\pi(1)} \in B_1, \ldots, X_{\pi(n)} \in B_n),$$

for measurable sets $B_i \subseteq \mathcal{X}$.

Theorem (De Finetti's Theorem) A sequence of random variables $\{X_n\}_{n \in \mathbb{N}}$ is exchangeable if and only if there exists a unique probability measure $Q$ on $\mathcal{P}(\mathcal{X})$ such that for any $n$ and measurable sets $B_i \subseteq \mathcal{X}$,

$$P(X_1 \in B_1, \ldots, X_n \in B_n) = \int_{\mathcal{P}(\mathcal{X})} \prod_{i=1}^{n} P(B_i) \, dQ(P).$$

B+M urn scheme and the DP

Theorem (Blackwell and MacQueen (1973)) The distribution of $\{X_n\}_{n \in \mathbb{N}}$ is described by the Blackwell and MacQueen urn scheme if and only if $X_i \mid P \overset{iid}{\sim} P$ and $P \sim \mathrm{DP}(\alpha P_0)$.

Clustering

Since $P$ is discrete a.s., there is a positive probability of ties among the sample $(x_1, \ldots, x_n)$. Let $k_n$ denote the number of unique values; $(\theta_1^*, \ldots, \theta_{k_n}^*)$ denote the unique values; and $n_j$ denote the cluster sizes.

Assuming $P_0$ is non-atomic, from the B+M urn scheme, we have $x_1 = \theta_1^*$ and

$$x_{n+1} \mid x = \begin{cases} \theta_{k_n + 1}^* & \text{with prob. } \frac{\alpha}{\alpha + n}, \\ \theta_j^* & \text{with prob. } \frac{n_j}{\alpha + n} \text{ for } j = 1, \ldots, k_n, \end{cases}$$

where $\theta_j^* \overset{iid}{\sim} P_0$.

The sample $(x_1, \ldots, x_n)$ can be represented in terms of the unique values $(\theta_1^*, \ldots, \theta_{k_n}^*)$ and the random partition $(s_1, \ldots, s_n)$, where $s_i = j$ if $x_i = \theta_j^*$. The predictive distribution of $(s_1, \ldots, s_n)$ is described by the Chinese restaurant process.
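The partition labels can be generated on their own, with no reference to the values $\theta_j^*$. The following sketch of the Chinese restaurant process uses hypothetical names; clusters are labelled in order of appearance.

```python
import random

def chinese_restaurant_process(alpha, n, rng):
    """Return (labels, sizes): s_1..s_n with values in 1..k_n, and the cluster sizes n_j."""
    labels, sizes = [], []
    for _ in range(n):
        # Existing cluster j has weight n_j; a new cluster has weight alpha.
        weights = sizes + [alpha]
        j = rng.choices(range(len(weights)), weights=weights)[0]
        if j == len(sizes):
            sizes.append(1)       # open cluster k_n + 1
        else:
            sizes[j] += 1
        labels.append(j + 1)
    return labels, sizes

labels, sizes = chinese_restaurant_process(1.0, 20, random.Random(0))
print(labels, sizes)
```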

DP Mixture Models

Mixture models offer flexible density estimation:

$$p(x \mid P) = \int K(x \mid \theta) \, dP(\theta),$$

for some parametric density $K(x \mid \theta)$ (ex. $\mathrm{N}(x \mid \mu, \sigma^2)$).

In a Bayesian setting, we define a prior for $P$, ex. $P \sim \mathrm{DP}(\alpha P_0)$.

$$\Rightarrow \; p(x \mid P) = \sum_{j=1}^{\infty} p_j K(x \mid \theta_j).$$
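A random density from a DP mixture prior can be sketched by combining stick-breaking weights with a kernel. The code below makes simplifying assumptions that are not in the slides: the sum is truncated at `n_atoms` terms (the last stick takes the leftover mass), the kernel is normal with a fixed standard deviation `sigma`, and the base measure is $\mathrm{N}(0, c)$ on the location only. All names are hypothetical.

```python
import math
import random

def draw_dp_mixture(alpha, sigma, c, n_atoms, rng):
    """Return (weights, atoms, density) for a truncated DP location mixture of normals."""
    weights, remaining = [], 1.0
    for _ in range(n_atoms - 1):
        v = rng.betavariate(1.0, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    weights.append(remaining)                      # truncation: weights sum to 1
    atoms = [rng.gauss(0.0, math.sqrt(c)) for _ in range(n_atoms)]

    def density(x):
        # p(x | P) = sum_j p_j N(x | theta_j, sigma^2)
        norm = sigma * math.sqrt(2.0 * math.pi)
        return sum(w * math.exp(-0.5 * ((x - t) / sigma) ** 2) / norm
                   for w, t in zip(weights, atoms))

    return weights, atoms, density

weights, atoms, density = draw_dp_mixture(1.0, 0.5, 4.0, 50, random.Random(0))
print([round(density(x), 4) for x in (-2.0, 0.0, 2.0)])
```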

DPM prior samples

[Figure: six panels plotting p(x|P) (0.0 to 0.5) against x (−5 to 5), for alpha = 0.1, 1, 10 combined with c = 1 and c = 10.]

Random draws of DP location-scale mixture of normals with base measure $\mathrm{N}(\mu \mid 0, c\sigma^2) \, \mathrm{IG}(\sigma^2 \mid 1, 1)$ with various values of $\alpha$ and $c$.

Inference in DPMs

The DPM model can be hierarchically defined as:

$$X_i \mid \theta_i \overset{ind}{\sim} K(x \mid \theta_i), \qquad \theta_i \mid P \overset{iid}{\sim} P, \qquad P \sim \mathrm{DP}(\alpha P_0).$$

Marginal MCMC methods are based on the idea of marginalizing over $P$ and carrying out posterior inference on $(\theta_1, \ldots, \theta_n)$ using Gibbs sampling based on the urn scheme characterization of the DP.

Other methods include truncation, slice sampling, and retrospective sampling.

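Marginal Inference in DPMs

In marginal MCMC methods, the parameters $(\theta_1, \ldots, \theta_n)$ are represented as $s = (s_1, \ldots, s_n)$ and $\theta^* = (\theta_1^*, \ldots, \theta_{k_n}^*)$. The algorithm then proceeds as follows:

1. For $i = 1, \ldots, n$, sample $s_i \mid x, s^{-i}, \theta^*$, where

$$p(s_i = j \mid x, s^{-i}, \theta^*) = \begin{cases} \frac{1}{Z} \, \alpha \int K(x_i \mid \theta) \, dP_0(\theta) & \text{for } j = k_n + 1, \\ \frac{1}{Z} \, n_j K(x_i \mid \theta_j^*) & \text{for } j = 1, \ldots, k_n, \end{cases}$$

with $Z$ the normalizing constant, and $k_n$ and $n_j$ computed from $s^{-i}$, that is, with observation $i$ removed.

2. Sample $\theta^* \mid x, s$, where

$$p(\theta^* \mid x, s) = \prod_{j=1}^{k_n} p(\theta_j^* \mid x_j),$$

for $x_j = (x_i)_{i : s_i = j}$, and

$$p(\theta_j^* \mid x_j) \propto P_0(d\theta_j^*) \prod_{i : s_i = j} K(x_i \mid \theta_j^*).$$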
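The two steps can be sketched for a conjugate special case. The code below assumes a normal kernel $K(x \mid \theta) = \mathrm{N}(x \mid \theta, \sigma^2)$ with known `sigma2` and base measure $P_0 = \mathrm{N}(m_0, \tau^2)$, so that $\int K(x_i \mid \theta) \, dP_0(\theta) = \mathrm{N}(x_i \mid m_0, \sigma^2 + \tau^2)$ and the posterior of each $\theta_j^*$ is normal. When a new cluster is opened, its $\theta$ is drawn from $p(\theta \mid x_i)$, a common implementation choice that the slides do not spell out. All names and the data are hypothetical.

```python
import math
import random

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def posterior_theta(xs, sigma2, m0, tau2, rng):
    """Draw theta from p(theta | xs), proportional to N(theta | m0, tau2) prod_i N(x_i | theta, sigma2)."""
    var = 1.0 / (1.0 / tau2 + len(xs) / sigma2)
    mean = var * (m0 / tau2 + sum(xs) / sigma2)
    return rng.gauss(mean, math.sqrt(var))

def gibbs_sweep(x, s, theta, alpha, sigma2, m0, tau2, rng):
    """One sweep: update every s_i, then redraw every theta_j. theta maps label -> value."""
    for i in range(len(x)):
        old, s[i] = s[i], None            # remove observation i from its cluster
        if old not in s:
            del theta[old]                # drop the cluster if it is now empty
        labels = list(theta)
        # Existing clusters: n_j * K(x_i | theta_j), with n_j counted without i.
        weights = [s.count(j) * normal_pdf(x[i], theta[j], sigma2) for j in labels]
        # New cluster: alpha * integral of K(x_i | theta) dP0(theta).
        weights.append(alpha * normal_pdf(x[i], m0, sigma2 + tau2))
        pick = rng.choices(range(len(weights)), weights=weights)[0]
        if pick == len(labels):
            new = max(theta, default=-1) + 1
            theta[new] = posterior_theta([x[i]], sigma2, m0, tau2, rng)
            s[i] = new
        else:
            s[i] = labels[pick]
    for j in theta:                       # step 2: theta_j | x_j for each cluster
        members = [xi for xi, si in zip(x, s) if si == j]
        theta[j] = posterior_theta(members, sigma2, m0, tau2, rng)
    return s, theta

rng = random.Random(0)
x = [-2.1, -1.9, -2.0, 2.0, 2.2, 1.8]     # hypothetical data
s, theta = [0] * len(x), {0: 0.0}         # start with everything in one cluster
for _ in range(200):
    s, theta = gibbs_sweep(x, s, theta, 1.0, 0.25, 0.0, 4.0, rng)
print(s, theta)
```

Cluster labels here are arbitrary integers rather than $1, \ldots, k_n$; only the induced partition matters.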

Dependent Dirichlet Process

Dependent Dirichlet process priors define a distribution over a collection of random probability measures $\{P_x\}_{x \in \mathcal{X}}$ such that the $P_x$'s are dependent and marginally $P_x$ is a Dirichlet process.

If the input $x$ is discrete and categorical, examples include:

Hierarchical DP (Teh et al. 2006): $P_m \mid P \overset{iid}{\sim} \mathrm{DP}(\alpha P)$ for $m = 1, \ldots, M$; $P \sim \mathrm{DP}(\beta P_0)$.

Nested DP (Rodriguez and Dunson 2011): $P_m \mid Q \overset{iid}{\sim} Q$ for $m = 1, \ldots, M$; $Q \sim \mathrm{DP}(\alpha \mathrm{DP}(\beta P_0))$.

Dependent Dirichlet Process

MacEachern's (1999) general class of dependent Dirichlet processes is defined based on the stick-breaking representation:

$$P_x = \sum_{j=1}^{\infty} p_j(x) \delta_{\theta_j(x)},$$

where $\theta_j(x)$ are independent stochastic processes (ex. $\theta_j(x) \sim \mathrm{GP}(0, k(x, x'))$), and

$$p_1(x) = v_1(x); \qquad p_j(x) = v_j(x) \prod_{j' < j} (1 - v_{j'}(x)) \quad \text{for } j > 1,$$

for independent stochastic processes $v_j(x)$ such that marginally $v_j(x) \sim \mathrm{Beta}(1, \alpha(x))$.

References

Ghosh, J.K. and Ramamoorthi, R.V. (2003). Bayesian nonparametrics. Springer Series in Statistics.

Rasmussen, C.E. and Ghahramani, Z. (2013). Machine learning course. http://mlg.eng.cam.ac.uk/teaching/4f13/1213
