Dirichlet Process

Sara Wade, University of Cambridge
Charles University, 8-19 April 2013, Prague
Sara Wade
Dirichlet Process
1 / 26
Categorical and multinomial distributions

Consider a discrete random variable X taking one of k possible outcomes. Among n independent and identical trials, let n_j = \sum_{i=1}^n 1(x_i = j). Note that n = \sum_{j=1}^k n_j. The distribution of X_i is given by the categorical distribution, parametrized by p = (p_1, \dots, p_k) such that \sum_j p_j = 1, where
\[
p(x|p) = p_1^{1(x=1)} \cdots p_k^{1(x=k)}.
\]
The probability of observing counts (n_1, \dots, n_k) is given by the multinomial distribution, where
\[
p(n_1, \dots, n_k | p) = \frac{n!}{n_1! \cdots n_k!} \prod_{j=1}^k p_j^{n_j}.
\]
Ex. (n_1, \dots, n_k) is the frequency of words in a text.
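The counts-from-draws relationship above can be illustrated numerically. This is a minimal sketch assuming NumPy is available; the variable names are for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# A categorical distribution over k = 3 outcomes.
p = np.array([0.2, 0.3, 0.5])

# n independent and identical trials; counts[j] is n_j.
n = 1000
x = rng.choice(len(p), size=n, p=p)
counts = np.bincount(x, minlength=len(p))

# The counts (n_1, ..., n_k) are one multinomial draw: they sum to n,
# and counts / n estimates p as n grows.
assert counts.sum() == n
```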
Dirichlet distribution

The Dirichlet distribution is defined on the simplex S_k = \{(p_1, \dots, p_k) : p_j \geq 0, \sum_{j=1}^k p_j = 1\} with density
\[
p((p_1, \dots, p_k)|(\alpha_1, \dots, \alpha_k)) = \frac{\Gamma(\sum_{j=1}^k \alpha_j)}{\prod_{j=1}^k \Gamma(\alpha_j)} \prod_{j=1}^k p_j^{\alpha_j - 1}.
\]
It is the conjugate prior to the multinomial likelihood.

Parameters: (\alpha_1, \dots, \alpha_k) such that \alpha_j \geq 0, are often reparametrized as
\[
\alpha = \sum_{j=1}^k \alpha_j; \quad p_0 = (p_{0,1}, \dots, p_{0,k}) = \left( \frac{\alpha_1}{\alpha}, \dots, \frac{\alpha_k}{\alpha} \right).
\]
Properties:
E[p_j] = p_{0,j} ← prior guess.
V(p_j) = p_{0,j}(1 - p_{0,j})/(\alpha + 1) ← \alpha controls the variability.
Dirichlet densities from Wikipedia

[Figure omitted: Dirichlet density plots, from Wikipedia.]
Connections with other distributions

1. If z_j \overset{ind}{\sim} Gam(\alpha_j, 1), then
\[
(p_1, \dots, p_k) \overset{d}{=} \left( \frac{z_1}{\sum_{j=1}^k z_j}, \dots, \frac{z_k}{\sum_{j=1}^k z_j} \right).
\]
This property is used to simulate from a Dirichlet distribution.

2. If v_j \overset{ind}{\sim} Beta(\alpha_j, \sum_{j' > j} \alpha_{j'}) for j = 1, \dots, k-1 and v_k is degenerate at 1, then
\[
(p_1, \dots, p_k) \overset{d}{=} \left( v_1, v_2(1 - v_1), \dots, v_k \prod_{j' < k} (1 - v_{j'}) \right).
\]
Note that v_j = p_j / (1 - \sum_{j' < j} p_{j'}).
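Property 1 above is the standard way to sample from a Dirichlet distribution. A minimal sketch assuming NumPy; the function name is hypothetical:

```python
import numpy as np

def dirichlet_via_gamma(alpha, rng):
    """Draw from Dir(alpha_1, ..., alpha_k) by normalizing independent
    Gam(alpha_j, 1) random variables (property 1)."""
    z = rng.gamma(shape=np.asarray(alpha, dtype=float), scale=1.0)
    return z / z.sum()

p = dirichlet_via_gamma([2.0, 3.0, 5.0], np.random.default_rng(1))
# p lies on the simplex S_k: nonnegative entries summing to 1.
assert np.all(p >= 0) and np.isclose(p.sum(), 1.0)
```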
Symmetric Dirichlet distribution

The symmetric Dirichlet distribution is defined with p_{0,j} = 1/k for j = 1, \dots, k.

[Figure omitted: three panels (alpha = 0.1, alpha = 1, alpha = 10) showing draws p over categories 1-10.]

Densities p drawn at random from a symmetric Dirichlet distribution with various precision parameters.
Posterior and Predictive

Categorical model:
\[
X_i | p \overset{iid}{\sim} Cat(p).
\]
Dirichlet prior: p \sim Dir(\alpha p_0).

→ Leads to a Dirichlet posterior p | x \sim Dir(\hat{\alpha} \hat{p}), where
\[
\hat{\alpha} = \alpha + n; \quad \hat{p}_j = \hat{\alpha}^{-1}(\alpha p_{0,j} + n_j).
\]
→ Leads to a categorical predictive with p(X_{n+1} = j | x) = \hat{p}_j.
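The conjugate update above is a one-line computation. A sketch assuming NumPy; the function name is hypothetical:

```python
import numpy as np

def dirichlet_posterior(alpha, p0, counts):
    """Posterior of p given categorical counts n_j, under p ~ Dir(alpha * p0).
    Returns (alpha_hat, p_hat) with alpha_hat = alpha + n and
    p_hat_j = (alpha * p0_j + n_j) / alpha_hat."""
    counts = np.asarray(counts, dtype=float)
    alpha_hat = alpha + counts.sum()
    p_hat = (alpha * np.asarray(p0) + counts) / alpha_hat
    return alpha_hat, p_hat

# Uniform prior guess over 3 outcomes with precision alpha = 2.
alpha_hat, p_hat = dirichlet_posterior(2.0, [1/3, 1/3, 1/3], [5, 3, 2])
# p_hat is both the posterior mean and the predictive p(X_{n+1} = j | x).
assert np.isclose(p_hat.sum(), 1.0) and alpha_hat == 12.0
```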
Pólya urn scheme

The Pólya urn scheme describes the distribution of a sequence of random variables \{X_n\}_{n \in \mathbb{N}} taking values in \{1, \dots, k\}. Consider an urn with \alpha p_{0,j} balls of color j for j = 1, \dots, k. A ball is drawn from the urn and replaced along with another ball of the same color. The random variable X_n is set to j if the nth ball drawn is of color j. Formally, the law of \{X_n\}_{n \in \mathbb{N}} is given by
\[
P(X_1 = j) = p_{0,j}, \quad P(X_{n+1} = j | x) = \frac{\alpha p_{0,j} + n_j}{\alpha + n} \text{ for } n \geq 1.
\]
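The urn dynamics can be simulated directly from the predictive rule above. A sketch assuming NumPy; the function name is hypothetical:

```python
import numpy as np

def polya_urn(alpha, p0, n, rng):
    """Simulate X_1, ..., X_n from the Polya urn with initial weights
    alpha * p0_j: each draw is replaced plus one ball of the same color."""
    p0 = np.asarray(p0, dtype=float)
    counts = np.zeros(len(p0))
    draws = []
    for _ in range(n):
        # predictive: (alpha * p0_j + n_j) / (alpha + n)
        probs = (alpha * p0 + counts) / (alpha + counts.sum())
        j = rng.choice(len(p0), p=probs)
        counts[j] += 1
        draws.append(j)
    return np.array(draws)

x = polya_urn(alpha=1.0, p0=[0.5, 0.5], n=100, rng=np.random.default_rng(2))
```

Because draws reinforce their own color, long runs of one color are much more likely than under i.i.d. sampling; this self-reinforcement is what makes the sequence exchangeable rather than independent.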
Exchangeability and De Finetti's Theorem

The sequence of random variables \{X_n\}_{n \in \mathbb{N}} taking values in \{1, \dots, k\} is exchangeable if for any n and permutation \pi of \{1, \dots, n\},
\[
P(X_1 = j_1, \dots, X_n = j_n) = P(X_{\pi(1)} = j_1, \dots, X_{\pi(n)} = j_n),
\]
for any j_i \in \{1, \dots, k\}.

Theorem (De Finetti's Theorem). A sequence of random variables \{X_n\}_{n \in \mathbb{N}} taking values in \{1, \dots, k\} is exchangeable if and only if there exists a unique probability measure Q on S_k such that for any n and any j_i \in \{1, \dots, k\},
\[
P(X_1 = j_1, \dots, X_n = j_n) = \int_{S_k} \prod_{i=1}^n p_{j_i} \, dQ(p).
\]
Pólya urn scheme and the Dirichlet distribution

If the distribution of \{X_n\}_{n \in \mathbb{N}} is described by the Pólya urn scheme, then \{X_n\}_{n \in \mathbb{N}} is exchangeable.

If the X_i | p have a categorical distribution and p \sim Dir(\alpha p_0), then the marginal distribution of \{X_n\}_{n \in \mathbb{N}} is described by the Pólya urn scheme.

The distribution of \{X_n\}_{n \in \mathbb{N}} is described by the Pólya urn scheme if and only if X_i | p \overset{iid}{\sim} Cat(p) and p \sim Dir(\alpha p_0).
Dirichlet Process

The Dirichlet process is an extension of the Dirichlet distribution on the space of probability measures on \{1, \dots, k\} to the space of probability measures on a complete and separable metric space \mathcal{X}. Let \mathcal{P}(\mathcal{X}) denote the set of probability measures on \mathcal{X}, equipped with the Borel \sigma-algebra under weak convergence.

Definition. P has a Dirichlet process prior with parameters \alpha > 0 and P_0 \in \mathcal{P}(\mathcal{X}), denoted DP(\alpha P_0), if for any finite measurable partition (B_1, \dots, B_m),
\[
(P(B_1), \dots, P(B_m)) \sim Dir(\alpha P_0(B_1), \dots, \alpha P_0(B_m)).
\]
Parameters:
the base measure P_0 is the prior guess, E[P(B)] = P_0(B);
the precision parameter \alpha controls the variability, V(P(B)) = P_0(B)(1 - P_0(B))/(\alpha + 1).
Existence of the DP

Marginal property of the Dirichlet distribution: Let B_1, \dots, B_m be a partition of \{1, \dots, k\} and p(B_i) = \sum_{j \in B_i} p_j. Then
\[
(p(B_1), \dots, p(B_m)) \sim Dir(\alpha p_0(B_1), \dots, \alpha p_0(B_m)),
\]
where p_0(B_i) = \sum_{j \in B_i} p_{0,j}.

The marginal property of the Dirichlet distribution is a key property in showing existence of the Dirichlet process.
Stick-breaking construction

Theorem (Sethuraman (1994)). P \sim DP(\alpha P_0) is characterized by the stick-breaking construction
\[
P = \sum_{j=1}^{\infty} p_j \delta_{\theta_j},
\]
where \theta_j \overset{iid}{\sim} P_0,
\[
p_1 = v_1; \quad p_j = v_j \prod_{j' < j} (1 - v_{j'}) \text{ for } j > 1,
\]
and v_j \sim Beta(1, \alpha) independent of (\theta_j).

Notice: If P \sim DP(\alpha P_0), P is discrete a.s.
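An (approximate) draw from the DP can be generated by truncating the stick at a finite number of atoms and giving the last break the remaining mass. A sketch assuming NumPy; the function and argument names are hypothetical:

```python
import numpy as np

def stick_breaking_dp(alpha, base_sampler, n_atoms, rng):
    """Truncated stick-breaking draw from DP(alpha * P0): returns atoms
    theta_j ~iid P0 and weights p_j = v_j * prod_{j'<j} (1 - v_{j'})."""
    v = rng.beta(1.0, alpha, size=n_atoms)
    v[-1] = 1.0  # truncation: the last break takes all remaining stick
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])
    weights = v * remaining
    atoms = base_sampler(n_atoms, rng)
    return atoms, weights

# Base measure P0 = N(0, 1), as in the prior samples shown on the next slide.
atoms, weights = stick_breaking_dp(
    alpha=1.0,
    base_sampler=lambda m, rng: rng.normal(0.0, 1.0, size=m),
    n_atoms=200,
    rng=np.random.default_rng(3),
)
assert np.isclose(weights.sum(), 1.0)
```

Smaller alpha concentrates the weights on a few atoms; larger alpha spreads mass over many atoms, so draws look closer to P_0.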
DP prior samples

[Figure omitted: three panels (alpha = 0.1, alpha = 1, alpha = 10) of atom locations theta versus weights p.]

Random draws of P \sim DP(\alpha N(0, 1)) with various precision parameters. Simulation is based on the stick-breaking construction.
Posterior and Predictive

Model:
\[
X_i | P \overset{iid}{\sim} P.
\]
Dirichlet process prior: P \sim DP(\alpha P_0).

→ Leads to a Dirichlet process posterior P | x \sim DP(\hat{\alpha} \hat{P}), where
\[
\hat{\alpha} = \alpha + n; \quad \hat{P} = \hat{\alpha}^{-1} \left( \alpha P_0 + \sum_{i=1}^n \delta_{x_i} \right).
\]
→ Predictive distribution is P(X_{n+1} \in B | x) = \hat{P}(B), for any Borel set B \subseteq \mathcal{X}.
Blackwell and MacQueen urn scheme

The Blackwell and MacQueen urn scheme describes the distribution of a sequence of random variables \{X_n\}_{n \in \mathbb{N}} taking values in \mathcal{X}. Consider an urn with \alpha black balls.

Step 1: a black ball is drawn from the urn, and once drawn, its true color is revealed as \theta_1^* from P_0; it is replaced along with a black ball.

Step n + 1: a ball is drawn from the urn. If the ball is black, once drawn, its true color is revealed as \theta_{k_n + 1}^* from P_0, and it is replaced along with a black ball. Otherwise, it is of color \theta_j^* for some j = 1, \dots, k_n, and it is replaced along with another ball of the same color. Here k_n denotes the number of black balls drawn among the first n draws.

We set X_n = \theta_j^* if the nth ball drawn is of color \theta_j^*. Formally, the law of \{X_n\}_{n \in \mathbb{N}} is given by
\[
P(X_1 \in B) = P_0(B), \quad P(X_{n+1} \in B | x) = \frac{\alpha P_0(B) + \sum_{i=1}^n \delta_{x_i}(B)}{\alpha + n} \text{ for } n \geq 1,
\]
for any Borel set B \subseteq \mathcal{X}.
Exchangeability and De Finetti's Theorem

The sequence of random variables \{X_n\}_{n \in \mathbb{N}} is exchangeable if for any n and permutation \pi of \{1, \dots, n\},
\[
P(X_1 \in B_1, \dots, X_n \in B_n) = P(X_{\pi(1)} \in B_1, \dots, X_{\pi(n)} \in B_n),
\]
for measurable sets B_i \subseteq \mathcal{X}.

Theorem (De Finetti's Theorem). A sequence of random variables \{X_n\}_{n \in \mathbb{N}} is exchangeable if and only if there exists a unique probability measure Q on \mathcal{P}(\mathcal{X}) such that for any n and measurable sets B_i \subseteq \mathcal{X},
\[
P(X_1 \in B_1, \dots, X_n \in B_n) = \int_{\mathcal{P}(\mathcal{X})} \prod_{i=1}^n P(B_i) \, dQ(P).
\]
B+M urn scheme and the DP

Theorem (Blackwell and MacQueen (1973)). The distribution of \{X_n\}_{n \in \mathbb{N}} is described by the Blackwell and MacQueen urn scheme if and only if X_i | P \overset{iid}{\sim} P and P \sim DP(\alpha P_0).
Clustering

Since P is discrete a.s., there is a positive probability of ties among the sample (x_1, \dots, x_n). Let k_n denote the number of unique values; (\theta_1^*, \dots, \theta_{k_n}^*) denote the unique values; and n_j denote the cluster sizes. Assuming P_0 is non-atomic, from the B+M urn scheme, we have x_1 = \theta_1^* and
\[
x_{n+1} | x = \begin{cases} \theta_{k_n + 1}^* & \text{with prob. } \frac{\alpha}{\alpha + n}, \\ \theta_j^* & \text{with prob. } \frac{n_j}{\alpha + n} \text{ for } j = 1, \dots, k_n, \end{cases}
\]
where \theta_j^* \overset{iid}{\sim} P_0.

The sample (x_1, \dots, x_n) can be represented in terms of the unique values (\theta_1^*, \dots, \theta_{k_n}^*) and the random partition (s_1, \dots, s_n), where s_i = j if x_i = \theta_j^*. The predictive distribution of (s_1, \dots, s_n) is described by the Chinese restaurant process.
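The induced random partition can be simulated directly from the Chinese restaurant process metaphor: customer i + 1 joins table j with probability proportional to its occupancy n_j, or opens a new table with probability proportional to alpha. A sketch assuming NumPy; the function name is hypothetical:

```python
import numpy as np

def chinese_restaurant_process(alpha, n, rng):
    """Sample a random partition s_1, ..., s_n: customer i+1 joins table j
    with prob n_j / (alpha + i), or a new table with prob alpha / (alpha + i)."""
    seats = []   # seats[j] = number of customers at table j
    labels = []
    for i in range(n):
        probs = np.array(seats + [alpha]) / (alpha + i)
        j = rng.choice(len(probs), p=probs)
        if j == len(seats):
            seats.append(1)  # open a new table
        else:
            seats[j] += 1
        labels.append(j)
    return np.array(labels)

s = chinese_restaurant_process(alpha=1.0, n=100, rng=np.random.default_rng(4))
```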
DP Mixture Models

Mixture models offer flexible density estimation:
\[
p(x|P) = \int K(x|\theta) \, dP(\theta),
\]
for some parametric density K(x|\theta) (ex. N(x|\mu, \sigma^2)). In a Bayesian setting, we define a prior for P, ex. P \sim DP(\alpha P_0).
\[
\Rightarrow p(x|P) = \sum_{j=1}^{\infty} p_j K(x|\theta_j).
\]
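A random density from the DPM prior can be sketched via truncated stick-breaking. This is a minimal illustration assuming NumPy; the unit-variance normal kernel with N(0, 1) base measure is a simplifying assumption (not the location-scale mixture of the figures), and the function names are hypothetical:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def dpm_density_draw(alpha, n_atoms, rng):
    """Draw a random density p(x|P) = sum_j p_j N(x | mu_j, 1) with
    P ~ DP(alpha * N(0, 1)), using truncated stick-breaking."""
    v = rng.beta(1.0, alpha, size=n_atoms)
    v[-1] = 1.0  # truncation: last break takes the remaining stick
    weights = v * np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])
    mus = rng.normal(0.0, 1.0, size=n_atoms)
    # return the mixture density evaluated on a grid of points
    return lambda x: np.sum(
        weights[:, None] * normal_pdf(x[None, :], mus[:, None], 1.0), axis=0
    )

density = dpm_density_draw(alpha=1.0, n_atoms=100, rng=np.random.default_rng(5))
grid = np.linspace(-5, 5, 201)
vals = density(grid)
assert np.all(vals >= 0)
```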
DPM prior samples

[Figure omitted: six panels of random densities p(x|P) for alpha = 0.1, 1, 10 and c = 1, 10.]

Random draws of a DP location-scale mixture of normals with base measure N(\mu|0, c\sigma^2) IG(\sigma^2|1, 1) with various values of \alpha and c.
Inference in DPMs

The DPM model can be hierarchically defined as:
\[
X_i | \theta_i \overset{ind}{\sim} K(x|\theta_i), \quad \theta_i | P \overset{iid}{\sim} P, \quad P \sim DP(\alpha P_0).
\]
Marginal MCMC methods are based on the idea of marginalizing over P and carrying out posterior inference on (\theta_1, \dots, \theta_n) using Gibbs sampling based on the urn scheme characterization of the DP. Other methods include truncation, slice sampling, and retrospective sampling.
Marginal Inference in DPMs

In marginal MCMC methods, the parameters (\theta_1, \dots, \theta_n) are represented as s = (s_1, \dots, s_n) and \theta^* = (\theta_1^*, \dots, \theta_{k_n}^*). The algorithm then proceeds by:

For i = 1, \dots, n, sample s_i | x, s^{-i}, \theta^*, where
\[
p(s_i = j | x, s^{-i}, \theta^*) = \begin{cases} \frac{1}{Z} \alpha \int K(x_i|\theta) \, dP_0(\theta) & \text{for } j = k_n + 1, \\ \frac{1}{Z} n_j^{-i} K(x_i|\theta_j^*) & \text{for } j = 1, \dots, k_n. \end{cases}
\]
Sample \theta^* | x, s, where
\[
p(\theta^*|x, s) = \prod_{j=1}^{k_n} p(\theta_j^*|x_j),
\]
for x_j = (x_i)_{i: s_i = j}, and
\[
p(\theta_j^*|x_j) \propto P_0(d\theta_j^*) \prod_{i: s_i = j} K(x_i|\theta_j^*).
\]
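The two sampling steps above can be sketched for the conjugate special case K(x|\theta) = N(x|\theta, 1) with P_0 = N(0, 1), where the marginal \int K(x|\theta) dP_0(\theta) = N(x|0, 2) and the cluster posteriors are normal in closed form. This is an illustrative sketch under those assumptions (unit variances, hypothetical function names), not a general implementation:

```python
import numpy as np

def normal_pdf(y, mu, var):
    return np.exp(-0.5 * (y - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def marginal_gibbs_dpm(x, alpha, n_iter, rng):
    """Marginal Gibbs sampler for a DP mixture of N(x|theta, 1) kernels
    with conjugate base measure P0 = N(0, 1). Returns (labels s, theta*)."""
    n = len(x)
    s = list(range(n))                  # start from singleton clusters
    theta = [float(xi) for xi in x]     # one location per cluster
    for _ in range(n_iter):
        # Step 1: sample s_i | x, s^{-i}, theta*.
        for i in range(n):
            old = s[i]
            s[i] = -1
            if old not in s:            # cluster emptied: drop it and relabel
                theta.pop(old)
                s = [j - 1 if j > old else j for j in s]
            members = np.array([j for j in s if j >= 0], dtype=int)
            counts = np.bincount(members, minlength=len(theta))
            # existing clusters: n_j^{-i} K(x_i|theta_j*); new: alpha N(x_i|0, 2)
            w = counts * normal_pdf(x[i], np.array(theta), 1.0)
            w = np.append(w, alpha * normal_pdf(x[i], 0.0, 2.0))
            w /= w.sum()
            j = rng.choice(len(w), p=w)
            if j == len(theta):         # new cluster: p(theta|x_i) = N(x_i/2, 1/2)
                theta.append(rng.normal(x[i] / 2.0, np.sqrt(0.5)))
            s[i] = j
        # Step 2: sample theta_j* | x, s from N(sum(x_j)/(n_j+1), 1/(n_j+1)).
        for j in range(len(theta)):
            xj = x[np.array(s) == j]
            var = 1.0 / (len(xj) + 1.0)
            theta[j] = rng.normal(xj.sum() * var, np.sqrt(var))
    return np.array(s), np.array(theta)
```

For a non-conjugate base measure, the new-cluster integral has no closed form and auxiliary-variable variants (e.g. Neal's Algorithm 8) are used instead.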
Dependent Dirichlet Process

Dependent Dirichlet process priors define a distribution over a collection of random probability measures \{P_x\}_{x \in \mathcal{X}} such that the P_x's are dependent and marginally each P_x is a Dirichlet process. If the input x is discrete and categorical, examples include:

Hierarchical DP (Teh et al. 2006): P_m | P \overset{iid}{\sim} DP(\alpha P) for m = 1, \dots, M; P \sim DP(\beta P_0).

Nested DP (Rodriguez and Dunson 2011): P_m | Q \overset{iid}{\sim} Q for m = 1, \dots, M; Q \sim DP(\alpha DP(\beta P_0)).
Dependent Dirichlet Process

MacEachern's (1999) general class of dependent Dirichlet processes is defined based on the stick-breaking representation:
\[
P_x = \sum_{j=1}^{\infty} p_j(x) \delta_{\theta_j(x)},
\]
where \theta_j(x) are independent stochastic processes (ex. \theta_j(x) \sim GP(0, k(x, x'))), and
\[
p_1(x) = v_1(x); \quad p_j(x) = v_j(x) \prod_{j' < j} (1 - v_{j'}(x)) \text{ for } j > 1,
\]
for independent stochastic processes v_j(x) such that marginally v_j(x) \sim Beta(1, \alpha(x)).
References

Ghosh, J.K. and Ramamoorthi, R.V. (2003). Bayesian Nonparametrics. Springer Series in Statistics.

Rasmussen, C.E. and Ghahramani, Z. (2013). Machine learning course. http://mlg.eng.cam.ac.uk/teaching/4f13/1213