Dirichlet Process

Sara Wade, University of Cambridge
Charles University, 8-19 April 2013, Prague
Sara Wade
Dirichlet Process
1 / 26
Categorical and multinomial distributions

Consider a discrete random variable X taking one of k possible outcomes. Among n independent and identical trials, let n_j = \sum_{i=1}^n 1(x_i = j). Note that n = \sum_{j=1}^k n_j. The distribution of X_i is given by the categorical distribution, parametrized by p = (p_1, \dots, p_k) such that \sum_j p_j = 1, where
\[
p(x|p) = p_1^{1(x=1)} \cdots p_k^{1(x=k)}.
\]
The probability of observing counts (n_1, \dots, n_k) is given by the multinomial distribution, where
\[
p(n_1, \dots, n_k | p) = \frac{n!}{n_1! \cdots n_k!} \prod_{j=1}^k p_j^{n_j}.
\]
Ex. (n_1, \dots, n_k) is the frequency of words in a text.
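The counts-from-draws relationship above can be illustrated numerically. This is a minimal sketch assuming NumPy is available; the variable names are for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# A categorical distribution over k = 3 outcomes.
p = np.array([0.2, 0.3, 0.5])

# n independent and identical trials; counts[j] is n_j.
n = 1000
x = rng.choice(len(p), size=n, p=p)
counts = np.bincount(x, minlength=len(p))

# The counts (n_1, ..., n_k) are one multinomial draw: they sum to n,
# and counts / n estimates p as n grows.
assert counts.sum() == n
```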
Dirichlet distribution

The Dirichlet distribution is defined on the simplex S_k = \{(p_1, \dots, p_k) : p_j \geq 0, \sum_{j=1}^k p_j = 1\} with density
\[
p((p_1, \dots, p_k)|(\alpha_1, \dots, \alpha_k)) = \frac{\Gamma(\sum_{j=1}^k \alpha_j)}{\prod_{j=1}^k \Gamma(\alpha_j)} \prod_{j=1}^k p_j^{\alpha_j - 1}.
\]
It is the conjugate prior to the multinomial likelihood.

Parameters: (\alpha_1, \dots, \alpha_k) such that \alpha_j \geq 0, are often reparametrized as
\[
\alpha = \sum_{j=1}^k \alpha_j; \quad p_0 = (p_{0,1}, \dots, p_{0,k}) = \left( \frac{\alpha_1}{\alpha}, \dots, \frac{\alpha_k}{\alpha} \right).
\]
Properties:
E[p_j] = p_{0,j} ← prior guess.
V(p_j) = p_{0,j}(1 - p_{0,j})/(\alpha + 1) ← \alpha controls the variability.
Dirichlet densities from Wikipedia

[Figure omitted: Dirichlet density plots, from Wikipedia.]
Connections with other distributions

1. If z_j \overset{ind}{\sim} Gam(\alpha_j, 1), then
\[
(p_1, \dots, p_k) \overset{d}{=} \left( \frac{z_1}{\sum_{j=1}^k z_j}, \dots, \frac{z_k}{\sum_{j=1}^k z_j} \right).
\]
This property is used to simulate from a Dirichlet distribution.

2. If v_j \overset{ind}{\sim} Beta(\alpha_j, \sum_{j' > j} \alpha_{j'}) for j = 1, \dots, k-1 and v_k is degenerate at 1, then
\[
(p_1, \dots, p_k) \overset{d}{=} \left( v_1, v_2(1 - v_1), \dots, v_k \prod_{j' < k} (1 - v_{j'}) \right).
\]
Note that v_j = p_j / (1 - \sum_{j' < j} p_{j'}).
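Property 1 above is the standard way to sample from a Dirichlet distribution. A minimal sketch assuming NumPy; the function name is hypothetical:

```python
import numpy as np

def dirichlet_via_gamma(alpha, rng):
    """Draw from Dir(alpha_1, ..., alpha_k) by normalizing independent
    Gam(alpha_j, 1) random variables (property 1)."""
    z = rng.gamma(shape=np.asarray(alpha, dtype=float), scale=1.0)
    return z / z.sum()

p = dirichlet_via_gamma([2.0, 3.0, 5.0], np.random.default_rng(1))
# p lies on the simplex S_k: nonnegative entries summing to 1.
assert np.all(p >= 0) and np.isclose(p.sum(), 1.0)
```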
Symmetric Dirichlet distribution

The symmetric Dirichlet distribution is defined with p_{0,j} = 1/k for j = 1, \dots, k.

[Figure omitted: three panels (alpha = 0.1, alpha = 1, alpha = 10) showing draws p over categories 1-10.]

Densities p drawn at random from a symmetric Dirichlet distribution with various precision parameters.
Posterior and Predictive

Categorical model:
\[
X_i | p \overset{iid}{\sim} Cat(p).
\]
Dirichlet prior: p \sim Dir(\alpha p_0).

→ Leads to a Dirichlet posterior p | x \sim Dir(\hat{\alpha} \hat{p}), where
\[
\hat{\alpha} = \alpha + n; \quad \hat{p}_j = \hat{\alpha}^{-1}(\alpha p_{0,j} + n_j).
\]
→ Leads to a categorical predictive with p(X_{n+1} = j | x) = \hat{p}_j.
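The conjugate update above is a one-line computation. A sketch assuming NumPy; the function name is hypothetical:

```python
import numpy as np

def dirichlet_posterior(alpha, p0, counts):
    """Posterior of p given categorical counts n_j, under p ~ Dir(alpha * p0).
    Returns (alpha_hat, p_hat) with alpha_hat = alpha + n and
    p_hat_j = (alpha * p0_j + n_j) / alpha_hat."""
    counts = np.asarray(counts, dtype=float)
    alpha_hat = alpha + counts.sum()
    p_hat = (alpha * np.asarray(p0) + counts) / alpha_hat
    return alpha_hat, p_hat

# Uniform prior guess over 3 outcomes with precision alpha = 2.
alpha_hat, p_hat = dirichlet_posterior(2.0, [1/3, 1/3, 1/3], [5, 3, 2])
# p_hat is both the posterior mean and the predictive p(X_{n+1} = j | x).
assert np.isclose(p_hat.sum(), 1.0) and alpha_hat == 12.0
```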
Pólya urn scheme

The Pólya urn scheme describes the distribution of a sequence of random variables \{X_n\}_{n \in \mathbb{N}} taking values in \{1, \dots, k\}. Consider an urn with \alpha p_{0,j} balls of color j for j = 1, \dots, k. A ball is drawn from the urn and replaced along with another ball of the same color. The random variable X_n is set to j if the nth ball drawn is of color j. Formally, the law of \{X_n\}_{n \in \mathbb{N}} is given by
\[
P(X_1 = j) = p_{0,j}, \quad P(X_{n+1} = j | x) = \frac{\alpha p_{0,j} + n_j}{\alpha + n} \text{ for } n \geq 1.
\]
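The urn dynamics can be simulated directly from the predictive rule above. A sketch assuming NumPy; the function name is hypothetical:

```python
import numpy as np

def polya_urn(alpha, p0, n, rng):
    """Simulate X_1, ..., X_n from the Polya urn with initial weights
    alpha * p0_j: each draw is replaced plus one ball of the same color."""
    p0 = np.asarray(p0, dtype=float)
    counts = np.zeros(len(p0))
    draws = []
    for _ in range(n):
        # predictive: (alpha * p0_j + n_j) / (alpha + n)
        probs = (alpha * p0 + counts) / (alpha + counts.sum())
        j = rng.choice(len(p0), p=probs)
        counts[j] += 1
        draws.append(j)
    return np.array(draws)

x = polya_urn(alpha=1.0, p0=[0.5, 0.5], n=100, rng=np.random.default_rng(2))
```

Because draws reinforce their own color, long runs of one color are much more likely than under i.i.d. sampling; this self-reinforcement is what makes the sequence exchangeable rather than independent.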
Exchangeability and De Finetti's Theorem

The sequence of random variables \{X_n\}_{n \in \mathbb{N}} taking values in \{1, \dots, k\} is exchangeable if for any n and permutation \pi of \{1, \dots, n\},
\[
P(X_1 = j_1, \dots, X_n = j_n) = P(X_{\pi(1)} = j_1, \dots, X_{\pi(n)} = j_n),
\]
for any j_i \in \{1, \dots, k\}.

Theorem (De Finetti's Theorem). A sequence of random variables \{X_n\}_{n \in \mathbb{N}} taking values in \{1, \dots, k\} is exchangeable if and only if there exists a unique probability measure Q on S_k such that for any n and any j_i \in \{1, \dots, k\},
\[
P(X_1 = j_1, \dots, X_n = j_n) = \int_{S_k} \prod_{i=1}^n p_{j_i} \, dQ(p).
\]
Pólya urn scheme and the Dirichlet distribution

If the distribution of \{X_n\}_{n \in \mathbb{N}} is described by the Pólya urn scheme, then \{X_n\}_{n \in \mathbb{N}} is exchangeable.

If the X_i | p have a categorical distribution and p \sim Dir(\alpha p_0), then the marginal distribution of \{X_n\}_{n \in \mathbb{N}} is described by the Pólya urn scheme.

The distribution of \{X_n\}_{n \in \mathbb{N}} is described by the Pólya urn scheme if and only if X_i | p \overset{iid}{\sim} Cat(p) and p \sim Dir(\alpha p_0).
Dirichlet Process

The Dirichlet process is an extension of the Dirichlet distribution on the space of probability measures on \{1, \dots, k\} to the space of probability measures on a complete and separable metric space \mathcal{X}. Let \mathcal{P}(\mathcal{X}) denote the set of probability measures on \mathcal{X}, equipped with the Borel \sigma-algebra under weak convergence.

Definition. P has a Dirichlet process prior with parameters \alpha > 0 and P_0 \in \mathcal{P}(\mathcal{X}), denoted DP(\alpha P_0), if for any finite measurable partition (B_1, \dots, B_m),
\[
(P(B_1), \dots, P(B_m)) \sim Dir(\alpha P_0(B_1), \dots, \alpha P_0(B_m)).
\]
Parameters:
the base measure P_0 is the prior guess, E[P(B)] = P_0(B);
the precision parameter \alpha controls the variability, V(P(B)) = P_0(B)(1 - P_0(B))/(\alpha + 1).
Existence of the DP

Marginal property of the Dirichlet distribution: Let B_1, \dots, B_m be a partition of \{1, \dots, k\} and p(B_i) = \sum_{j \in B_i} p_j. Then
\[
(p(B_1), \dots, p(B_m)) \sim Dir(\alpha p_0(B_1), \dots, \alpha p_0(B_m)),
\]
where p_0(B_i) = \sum_{j \in B_i} p_{0,j}.

The marginal property of the Dirichlet distribution is a key property in showing existence of the Dirichlet process.
Stick-breaking construction

Theorem (Sethuraman (1994)). P \sim DP(\alpha P_0) is characterized by the stick-breaking construction
\[
P = \sum_{j=1}^{\infty} p_j \delta_{\theta_j},
\]
where \theta_j \overset{iid}{\sim} P_0,
\[
p_1 = v_1; \quad p_j = v_j \prod_{j' < j} (1 - v_{j'}) \text{ for } j > 1,
\]
and v_j \sim Beta(1, \alpha) independent of (\theta_j).

Notice: If P \sim DP(\alpha P_0), P is discrete a.s.
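An (approximate) draw from the DP can be generated by truncating the stick at a finite number of atoms and giving the last break the remaining mass. A sketch assuming NumPy; the function and argument names are hypothetical:

```python
import numpy as np

def stick_breaking_dp(alpha, base_sampler, n_atoms, rng):
    """Truncated stick-breaking draw from DP(alpha * P0): returns atoms
    theta_j ~iid P0 and weights p_j = v_j * prod_{j'<j} (1 - v_{j'})."""
    v = rng.beta(1.0, alpha, size=n_atoms)
    v[-1] = 1.0  # truncation: the last break takes all remaining stick
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])
    weights = v * remaining
    atoms = base_sampler(n_atoms, rng)
    return atoms, weights

# Base measure P0 = N(0, 1), as in the prior samples shown on the next slide.
atoms, weights = stick_breaking_dp(
    alpha=1.0,
    base_sampler=lambda m, rng: rng.normal(0.0, 1.0, size=m),
    n_atoms=200,
    rng=np.random.default_rng(3),
)
assert np.isclose(weights.sum(), 1.0)
```

Smaller alpha concentrates the weights on a few atoms; larger alpha spreads mass over many atoms, so draws look closer to P_0.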
DP prior samples

[Figure omitted: three panels (alpha = 0.1, alpha = 1, alpha = 10) of atom locations theta versus weights p.]

Random draws of P \sim DP(\alpha N(0, 1)) with various precision parameters. Simulation is based on the stick-breaking construction.
Posterior and Predictive

Model:
\[
X_i | P \overset{iid}{\sim} P.
\]
Dirichlet process prior: P \sim DP(\alpha P_0).

→ Leads to a Dirichlet process posterior P | x \sim DP(\hat{\alpha} \hat{P}), where
\[
\hat{\alpha} = \alpha + n; \quad \hat{P} = \hat{\alpha}^{-1} \left( \alpha P_0 + \sum_{i=1}^n \delta_{x_i} \right).
\]
→ Predictive distribution is P(X_{n+1} \in B | x) = \hat{P}(B), for any Borel set B \subseteq \mathcal{X}.
Blackwell and MacQueen urn scheme

The Blackwell and MacQueen urn scheme describes the distribution of a sequence of random variables \{X_n\}_{n \in \mathbb{N}} taking values in \mathcal{X}. Consider an urn with \alpha black balls.

Step 1: a black ball is drawn from the urn, and once drawn, its true color is revealed as \theta_1^* from P_0; it is replaced along with a black ball.

Step n + 1: a ball is drawn from the urn. If the ball is black, once drawn, its true color is revealed as \theta_{k_n + 1}^* from P_0, and it is replaced along with a black ball. Otherwise, it is of color \theta_j^* for some j = 1, \dots, k_n, and it is replaced along with another ball of the same color. Here k_n denotes the number of black balls drawn among the first n draws.

We set X_n = \theta_j^* if the nth ball drawn is of color \theta_j^*. Formally, the law of \{X_n\}_{n \in \mathbb{N}} is given by
\[
P(X_1 \in B) = P_0(B), \quad P(X_{n+1} \in B | x) = \frac{\alpha P_0(B) + \sum_{i=1}^n \delta_{x_i}(B)}{\alpha + n} \text{ for } n \geq 1,
\]
for any Borel set B \subseteq \mathcal{X}.
Exchangeability and De Finetti's Theorem

The sequence of random variables \{X_n\}_{n \in \mathbb{N}} is exchangeable if for any n and permutation \pi of \{1, \dots, n\},
\[
P(X_1 \in B_1, \dots, X_n \in B_n) = P(X_{\pi(1)} \in B_1, \dots, X_{\pi(n)} \in B_n),
\]
for measurable sets B_i \subseteq \mathcal{X}.

Theorem (De Finetti's Theorem). A sequence of random variables \{X_n\}_{n \in \mathbb{N}} is exchangeable if and only if there exists a unique probability measure Q on \mathcal{P}(\mathcal{X}) such that for any n and measurable sets B_i \subseteq \mathcal{X},
\[
P(X_1 \in B_1, \dots, X_n \in B_n) = \int_{\mathcal{P}(\mathcal{X})} \prod_{i=1}^n P(B_i) \, dQ(P).
\]
B+M urn scheme and the DP

Theorem (Blackwell and MacQueen (1973)). The distribution of \{X_n\}_{n \in \mathbb{N}} is described by the Blackwell and MacQueen urn scheme if and only if X_i | P \overset{iid}{\sim} P and P \sim DP(\alpha P_0).
Clustering

Since P is discrete a.s., there is a positive probability of ties among the sample (x_1, \dots, x_n). Let k_n denote the number of unique values; (\theta_1^*, \dots, \theta_{k_n}^*) denote the unique values; and n_j denote the cluster sizes. Assuming P_0 is non-atomic, from the B+M urn scheme, we have x_1 = \theta_1^* and
\[
x_{n+1} | x = \begin{cases} \theta_{k_n + 1}^* & \text{with prob. } \frac{\alpha}{\alpha + n}, \\ \theta_j^* & \text{with prob. } \frac{n_j}{\alpha + n} \text{ for } j = 1, \dots, k_n, \end{cases}
\]
where \theta_j^* \overset{iid}{\sim} P_0.

The sample (x_1, \dots, x_n) can be represented in terms of the unique values (\theta_1^*, \dots, \theta_{k_n}^*) and the random partition (s_1, \dots, s_n), where s_i = j if x_i = \theta_j^*. The predictive distribution of (s_1, \dots, s_n) is described by the Chinese restaurant process.
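The induced random partition can be simulated directly from the Chinese restaurant process metaphor: customer i + 1 joins table j with probability proportional to its occupancy n_j, or opens a new table with probability proportional to alpha. A sketch assuming NumPy; the function name is hypothetical:

```python
import numpy as np

def chinese_restaurant_process(alpha, n, rng):
    """Sample a random partition s_1, ..., s_n: customer i+1 joins table j
    with prob n_j / (alpha + i), or a new table with prob alpha / (alpha + i)."""
    seats = []   # seats[j] = number of customers at table j
    labels = []
    for i in range(n):
        probs = np.array(seats + [alpha]) / (alpha + i)
        j = rng.choice(len(probs), p=probs)
        if j == len(seats):
            seats.append(1)  # open a new table
        else:
            seats[j] += 1
        labels.append(j)
    return np.array(labels)

s = chinese_restaurant_process(alpha=1.0, n=100, rng=np.random.default_rng(4))
```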
DP Mixture Models

Mixture models offer flexible density estimation:
\[
p(x|P) = \int K(x|\theta) \, dP(\theta),
\]
for some parametric density K(x|\theta) (ex. N(x|\mu, \sigma^2)). In a Bayesian setting, we define a prior for P, ex. P \sim DP(\alpha P_0).
\[
\Rightarrow p(x|P) = \sum_{j=1}^{\infty} p_j K(x|\theta_j).
\]
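A random density from the DPM prior can be sketched via truncated stick-breaking. This is a minimal illustration assuming NumPy; the unit-variance normal kernel with N(0, 1) base measure is a simplifying assumption (not the location-scale mixture of the figures), and the function names are hypothetical:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def dpm_density_draw(alpha, n_atoms, rng):
    """Draw a random density p(x|P) = sum_j p_j N(x | mu_j, 1) with
    P ~ DP(alpha * N(0, 1)), using truncated stick-breaking."""
    v = rng.beta(1.0, alpha, size=n_atoms)
    v[-1] = 1.0  # truncation: last break takes the remaining stick
    weights = v * np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])
    mus = rng.normal(0.0, 1.0, size=n_atoms)
    # return the mixture density evaluated on a grid of points
    return lambda x: np.sum(
        weights[:, None] * normal_pdf(x[None, :], mus[:, None], 1.0), axis=0
    )

density = dpm_density_draw(alpha=1.0, n_atoms=100, rng=np.random.default_rng(5))
grid = np.linspace(-5, 5, 201)
vals = density(grid)
assert np.all(vals >= 0)
```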
DPM prior samples

[Figure omitted: six panels of random densities p(x|P) for alpha = 0.1, 1, 10 and c = 1, 10.]

Random draws of a DP location-scale mixture of normals with base measure N(\mu|0, c\sigma^2) IG(\sigma^2|1, 1) with various values of \alpha and c.
Inference in DPMs

The DPM model can be hierarchically defined as:
\[
X_i | \theta_i \overset{ind}{\sim} K(x|\theta_i), \quad \theta_i | P \overset{iid}{\sim} P, \quad P \sim DP(\alpha P_0).
\]
Marginal MCMC methods are based on the idea of marginalizing over P and carrying out posterior inference on (\theta_1, \dots, \theta_n) using Gibbs sampling based on the urn scheme characterization of the DP. Other methods include truncation, slice sampling, and retrospective sampling.
Marginal Inference in DPMs

In marginal MCMC methods, the parameters (\theta_1, \dots, \theta_n) are represented as s = (s_1, \dots, s_n) and \theta^* = (\theta_1^*, \dots, \theta_{k_n}^*). The algorithm then proceeds by:

For i = 1, \dots, n, sample s_i | x, s^{-i}, \theta^*, where
\[
p(s_i = j | x, s^{-i}, \theta^*) = \begin{cases} \frac{1}{Z} \alpha \int K(x_i|\theta) \, dP_0(\theta) & \text{for } j = k_n + 1, \\ \frac{1}{Z} n_j^{-i} K(x_i|\theta_j^*) & \text{for } j = 1, \dots, k_n. \end{cases}
\]
Sample \theta^* | x, s, where
\[
p(\theta^*|x, s) = \prod_{j=1}^{k_n} p(\theta_j^*|x_j),
\]
for x_j = (x_i)_{i: s_i = j}, and
\[
p(\theta_j^*|x_j) \propto P_0(d\theta_j^*) \prod_{i: s_i = j} K(x_i|\theta_j^*).
\]
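The two sampling steps above can be sketched for the conjugate special case K(x|\theta) = N(x|\theta, 1) with P_0 = N(0, 1), where the marginal \int K(x|\theta) dP_0(\theta) = N(x|0, 2) and the cluster posteriors are normal in closed form. This is an illustrative sketch under those assumptions (unit variances, hypothetical function names), not a general implementation:

```python
import numpy as np

def normal_pdf(y, mu, var):
    return np.exp(-0.5 * (y - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def marginal_gibbs_dpm(x, alpha, n_iter, rng):
    """Marginal Gibbs sampler for a DP mixture of N(x|theta, 1) kernels
    with conjugate base measure P0 = N(0, 1). Returns (labels s, theta*)."""
    n = len(x)
    s = list(range(n))                  # start from singleton clusters
    theta = [float(xi) for xi in x]     # one location per cluster
    for _ in range(n_iter):
        # Step 1: sample s_i | x, s^{-i}, theta*.
        for i in range(n):
            old = s[i]
            s[i] = -1
            if old not in s:            # cluster emptied: drop it and relabel
                theta.pop(old)
                s = [j - 1 if j > old else j for j in s]
            members = np.array([j for j in s if j >= 0], dtype=int)
            counts = np.bincount(members, minlength=len(theta))
            # existing clusters: n_j^{-i} K(x_i|theta_j*); new: alpha N(x_i|0, 2)
            w = counts * normal_pdf(x[i], np.array(theta), 1.0)
            w = np.append(w, alpha * normal_pdf(x[i], 0.0, 2.0))
            w /= w.sum()
            j = rng.choice(len(w), p=w)
            if j == len(theta):         # new cluster: p(theta|x_i) = N(x_i/2, 1/2)
                theta.append(rng.normal(x[i] / 2.0, np.sqrt(0.5)))
            s[i] = j
        # Step 2: sample theta_j* | x, s from N(sum(x_j)/(n_j+1), 1/(n_j+1)).
        for j in range(len(theta)):
            xj = x[np.array(s) == j]
            var = 1.0 / (len(xj) + 1.0)
            theta[j] = rng.normal(xj.sum() * var, np.sqrt(var))
    return np.array(s), np.array(theta)
```

For a non-conjugate base measure, the new-cluster integral has no closed form and auxiliary-variable variants (e.g. Neal's Algorithm 8) are used instead.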
Dependent Dirichlet Process

Dependent Dirichlet process priors define a distribution over a collection of random probability measures \{P_x\}_{x \in \mathcal{X}} such that the P_x's are dependent and marginally each P_x is a Dirichlet process. If the input x is discrete and categorical, examples include:

Hierarchical DP (Teh et al. 2006): P_m | P \overset{iid}{\sim} DP(\alpha P) for m = 1, \dots, M; P \sim DP(\beta P_0).

Nested DP (Rodriguez and Dunson 2011): P_m | Q \overset{iid}{\sim} Q for m = 1, \dots, M; Q \sim DP(\alpha DP(\beta P_0)).
Dependent Dirichlet Process

MacEachern's (1999) general class of dependent Dirichlet processes is defined based on the stick-breaking representation:
\[
P_x = \sum_{j=1}^{\infty} p_j(x) \delta_{\theta_j(x)},
\]
where \theta_j(x) are independent stochastic processes (ex. \theta_j(x) \sim GP(0, k(x, x'))), and
\[
p_1(x) = v_1(x); \quad p_j(x) = v_j(x) \prod_{j' < j} (1 - v_{j'}(x)) \text{ for } j > 1,
\]
for independent stochastic processes v_j(x) such that marginally v_j(x) \sim Beta(1, \alpha(x)).
References

Ghosh, J.K. and Ramamoorthi, R.V. (2003). Bayesian Nonparametrics. Springer Series in Statistics.

Rasmussen, C.E. and Ghahramani, Z. (2013). Machine learning course. http://mlg.eng.cam.ac.uk/teaching/4f13/1213