CLUSTERING FINITE DISCRETE MARKOV CHAINS
Greg Ridgeway, University of Washington, and Steven Altschuler, Microsoft Corp.
Greg Ridgeway, Box 354322, University of Washington, Seattle, WA 98195-4322
Keywords: clustering, Markov chain, mixture modeling

Abstract
In problem situations where observations consist of a sequence of events, Markov models often prove useful. However, when there is suspected heterogeneity among the Markov transition kernels generating the observed sequences, more refined methods become necessary. In this paper we describe a probabilistic method for clustering Markov processes with a pre-specified number of clusters. We derive a Gibbs sampler and a computationally efficient hybrid MCMC-constrained EM algorithm.

Introduction
Consider an s-state discrete Markov process (Ross [1993]) where the transition matrix for the process is unknown. Further assume that a dataset of N such processes, possibly of different lengths, exists in which each process came from one of m transition matrices and an associated initial state distribution. Therefore, we should be able to cluster together those processes that share the same underlying Markov transition structure. In this problem setting we know neither the elements of the m transition matrices and their associated initial state distributions, nor the proportion of processes in each cluster, nor the cluster membership of each process. Since the cluster membership is an unobservable or latent variable, closed form maximum likelihood estimators are unobtainable. Ridgeway [1997] describes an application to modeling user traversal of web sites. Although these classes are unobservable, site analysts might believe that certain classes of users, such as customers, developers, and investors, visit their site. In order to learn how users traverse their site, to improve site design, and for collaborative filtering, the analyst needs to consider the heterogeneity of the population when clustering users.
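As an illustration of this data-generating setup, the following sketch simulates N chains, each drawn from one of m cluster-specific transition matrices with its own initial state distribution. This is not from the paper; all names and sizes (s, m, N, simulate_chain, etc.) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
s, m, N = 3, 2, 100   # states, clusters, and number of processes (assumed sizes)

alpha = rng.dirichlet(np.ones(m))           # mixture proportions
p0 = rng.dirichlet(np.ones(s), size=m)      # p0[l] = cluster l's initial-state dist.
P = rng.dirichlet(np.ones(s), size=(m, s))  # P[l, i] = row i of cluster l's matrix

def simulate_chain(l, length):
    """Draw one chain of the given length from cluster l."""
    states = [rng.choice(s, p=p0[l])]
    for _ in range(length - 1):
        states.append(rng.choice(s, p=P[l, states[-1]]))
    return states

labels = rng.choice(m, size=N, p=alpha)     # latent cluster memberships
chains = [simulate_chain(l, int(rng.integers(5, 20))) for l in labels]
```

The clustering problem is then to recover `labels` (and the cluster parameters) from `chains` alone.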

Likelihood and posterior distribution
Let ${}_{(\ell)}P_{ij}$ be the $(i,j)$ element of the $\ell$th probability transition matrix, that is, the probability that a process in cluster $\ell$ would transition from state $i$ to state $j$. Also let ${}_{(\ell)}p_i$ be the $i$th element of the initial state distribution of processes from cluster $\ell$. For each of the $N$ Markov processes, indexed by a superscript $(k)$, we observe an initial state, $i_0^{(k)}$, and the number of times the process transitioned from state $i$ to state $j$, $n_{ij}^{(k)}$. Lastly, $\delta_\ell^{(k)}$ is the unobserved 0/1 indicator that process $k$ belongs to cluster $\ell$. Therefore, the likelihood function is

$$ f(n \mid p, P, \delta) = \prod_{k=1}^{N} \prod_{\ell=1}^{m} \left[ {}_{(\ell)}p_{i_0^{(k)}} \prod_{i=1}^{s} \prod_{j=1}^{s} {}_{(\ell)}P_{ij}^{\,n_{ij}^{(k)}} \right]^{\delta_\ell^{(k)}} $$
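For concreteness, one process's factor in this complete-data likelihood might be evaluated as in the following sketch (on the log scale, for numerical stability). The function and variable names are illustrative assumptions, not from the paper.

```python
import numpy as np

def transition_counts(chain, s):
    """n[i, j] = number of observed i -> j transitions in the chain."""
    n = np.zeros((s, s))
    for a, b in zip(chain[:-1], chain[1:]):
        n[a, b] += 1
    return n

def log_lik(chain, delta, p0, P):
    """Log of one process's factor in the complete-data likelihood above.

    delta : 0/1 indicator vector over the m clusters
    p0[l] : cluster l's initial-state distribution
    P[l]  : cluster l's s x s transition matrix (entries assumed positive)
    """
    s = P.shape[1]
    n = transition_counts(chain, s)
    total = 0.0
    for l, d in enumerate(delta):
        if d:  # only the assigned cluster contributes
            total += np.log(p0[l][chain[0]]) + np.sum(n * np.log(P[l]))
    return total
```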

In the absence of prior information, specifying uninformative priors is straightforward. The rows of every cluster's probability transition matrix and every cluster's initial state distribution all receive uninformative, s-dimensional Dirichlet priors. Let $\alpha$ be a vector of length $m$ of the mixture proportions so that $\delta^{(k)}$ will be distributed multinomial$(1, \alpha)$. Lastly, we assign an uninformative Dirichlet hyperprior to $\alpha$. Therefore, the posterior distribution of the unknown model parameters follows by Bayes' Theorem.

$$ f(p, P, \delta, \alpha \mid n) \propto \left[ \prod_{k=1}^{N} \prod_{\ell=1}^{m} \left( {}_{(\ell)}p_{i_0^{(k)}} \prod_{i=1}^{s} \prod_{j=1}^{s} {}_{(\ell)}P_{ij}^{\,n_{ij}^{(k)}} \right)^{\delta_\ell^{(k)}} \right] \cdot \prod_{k=1}^{N} \prod_{\ell=1}^{m} \alpha_\ell^{\,\delta_\ell^{(k)}} $$

Assuming that the first-order Markov assumption is correct, this distribution captures all of the information about the process clustering that is contained in the data. However, this distribution is rather complex, and all of the usual distribution summary values (mean, variance, etc.) are extremely difficult to extract. Appealing to a Markov chain Monte Carlo approach (Hastings [1970]; Gelman et al. [1995]) to sample from this distribution can avoid this problem at some computational cost. In this paper we use a Gibbs sampling algorithm that partitions the parameters into blocks for which sampling from the conditional distribution of any block given the remaining blocks is easy. Each row of every cluster's probability transition matrix, each cluster's initial state distribution, the mixture proportions, and each $\delta^{(k)}$ form the blocks. The Gibbs sampling algorithm draws updates for each block in turn, conditional on the current values of the other blocks, denoted by a superscript minus.

1. The initial state distribution of each cluster $\ell$:
$$ f({}_{(\ell)}p \mid {}_{(\ell)}p^{-}, n) \propto \prod_{k=1}^{N} \prod_{i=1}^{s} {}_{(\ell)}p_i^{\,\delta_\ell^{(k)} I(i_0^{(k)} = i)} \equiv \mathrm{Dirichlet}\!\left(1 + \sum_{k=1}^{N} \delta_\ell^{(k)} I(i_0^{(k)} = 1), \ldots, 1 + \sum_{k=1}^{N} \delta_\ell^{(k)} I(i_0^{(k)} = s)\right) $$

2. Row $i$ of each cluster's probability transition matrix:
$$ f({}_{(\ell)}P_{i\bullet} \mid {}_{(\ell)}P_{i\bullet}^{-}, n) \propto \prod_{k=1}^{N} \prod_{j=1}^{s} {}_{(\ell)}P_{ij}^{\,\delta_\ell^{(k)} n_{ij}^{(k)}} \equiv \mathrm{Dirichlet}\!\left(1 + \sum_{k=1}^{N} \delta_\ell^{(k)} n_{i1}^{(k)}, \ldots, 1 + \sum_{k=1}^{N} \delta_\ell^{(k)} n_{is}^{(k)}\right) $$

3. The mixture proportions:
$$ f(\alpha \mid \alpha^{-}, n) \propto \prod_{\ell=1}^{m} \alpha_\ell^{\,\sum_{k=1}^{N} \delta_\ell^{(k)}} \equiv \mathrm{Dirichlet}\!\left(1 + \sum_{k=1}^{N} \delta_1^{(k)}, \ldots, 1 + \sum_{k=1}^{N} \delta_m^{(k)}\right) $$

4. The cluster assignment of each process $k$:
$$ f(\delta^{(k)} \mid \delta^{(k)-}, n) \propto \prod_{\ell=1}^{m} \left( \alpha_\ell \, {}_{(\ell)}p_{i_0^{(k)}} \prod_{i=1}^{s} \prod_{j=1}^{s} {}_{(\ell)}P_{ij}^{\,n_{ij}^{(k)}} \right)^{\delta_\ell^{(k)}} \equiv \mathrm{Mult}\!\left(1, \frac{1}{z}\left( \alpha_1 \, {}_{(1)}p_{i_0^{(k)}} \prod_{i=1}^{s} \prod_{j=1}^{s} {}_{(1)}P_{ij}^{\,n_{ij}^{(k)}}, \ldots, \alpha_m \, {}_{(m)}p_{i_0^{(k)}} \prod_{i=1}^{s} \prod_{j=1}^{s} {}_{(m)}P_{ij}^{\,n_{ij}^{(k)}} \right)\right) $$
Here $z$ is the appropriate normalizing constant. These distributions have a rather intuitive interpretation as well. The row updates come from a distribution whose expected value is approximately the MLE for the row if the cluster assignments, $\delta$, were known. The vector $\alpha$ comes from a distribution whose expected value is approximately the vector of mixture proportions if, again, the cluster assignments were known. Lastly, the cluster assignments are drawn such that the probability of each cluster is proportional to the mixture probability times the likelihood of the observation coming from the associated transition matrix.

As with all MCMC implementations of parameter estimation for mixture models, this method can suffer from the "label switching" problem. The posterior density for a particular labeling of the clusters is equal to that for any other permutation of the labels. If the clusters are "far apart" then it is unlikely that label switching would occur. However, with weak data or clusters that are very close, label switching can be commonplace. In normal mixture models, constraints are often imposed to ensure identifiability. However, such constraints alter the posterior distribution and, in a problem such as this, might not even be possible. A more appropriate method would detect switches in the labels and make corrections; Stephens [1996] proposes such a method. Furthermore, MCMC algorithms tend to be slow to converge. Here we propose a hybrid MCMC-constrained EM algorithm that has shown substantial computational improvement.
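A single Gibbs sweep over the four block conditionals might be sketched as follows. This is an assumed implementation for illustration, not the authors' code; the data structures (`counts`, `init`, hard labels `z` in place of the 0/1 indicator vectors $\delta$) are our own choices.

```python
import numpy as np

def gibbs_sweep(counts, init, z, m, rng):
    """One sweep of the Gibbs sampler over the four block conditionals.

    counts[k] : s x s matrix of observed i -> j transition counts for process k
    init[k]   : observed initial state of process k
    z         : current hard cluster assignment of each process
    """
    N, s = len(counts), counts[0].shape[0]
    p0 = np.empty((m, s))
    P = np.empty((m, s, s))
    # 1.-2. Draw each cluster's initial-state distribution and each row of its
    #       transition matrix from the Dirichlet conditionals.
    for l in range(m):
        members = [k for k in range(N) if z[k] == l]
        ic = np.bincount(np.asarray([init[k] for k in members], dtype=int),
                         minlength=s)
        p0[l] = rng.dirichlet(1 + ic)
        n_l = sum((counts[k] for k in members), np.zeros((s, s)))
        for i in range(s):
            P[l, i] = rng.dirichlet(1 + n_l[i])
    # 3. Draw the mixture proportions.
    alpha = rng.dirichlet(1 + np.bincount(np.asarray(z, dtype=int), minlength=m))
    # 4. Redraw each cluster assignment with probability proportional to
    #    alpha_l times the likelihood under cluster l (computed on log scale).
    for k in range(N):
        logp = (np.log(alpha) + np.log(p0[:, init[k]])
                + np.einsum('ij,lij->l', counts[k], np.log(P)))
        w = np.exp(logp - logp.max())
        z[k] = rng.choice(m, p=w / w.sum())
    return p0, P, alpha, z
```

Iterating this sweep and retaining the draws after burn-in yields samples from the posterior above.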

A Constrained EM algorithm
The block conditional distributions shown in the previous section have a particularly nice feature. All of the probability parameters depend only on the cluster assignments and the observable data. The reassignment of processes to clusters then depends only on the probabilities. This observation leads to the following algorithm.
1. Randomly assign the processes to clusters.
2. Rather than sampling from a Dirichlet to update the probability estimates, estimate the probabilities using the expected value of the block conditional.
3. Reassign each process to the cluster that most likely generated it. The vector of probabilities for a process belonging to each cluster is exactly the multinomial probability parameter for the $\delta^{(k)}$ block conditional.
4. If none of the processes has been assigned to a different cluster, then stop. Otherwise, go to step 2.
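The four steps above might be sketched as follows; this is an illustrative implementation under assumed data structures, not the paper's code.

```python
import numpy as np

def constrained_em(counts, init, m, rng, max_iter=100):
    """Hard-assignment (constrained) EM for clustering Markov processes.

    counts[k] : s x s matrix of observed transition counts for process k
    init[k]   : observed initial state of process k
    """
    N, s = len(counts), counts[0].shape[0]
    z = rng.integers(m, size=N)                  # 1. random initial assignment
    for _ in range(max_iter):
        # 2. Posterior-mean estimates of the block conditionals given z
        #    (the Dirichlet means, i.e., Laplace-smoothed proportions).
        p0 = np.empty((m, s))
        P = np.empty((m, s, s))
        alpha = np.empty(m)
        for l in range(m):
            members = [k for k in range(N) if z[k] == l]
            ic = np.bincount(np.asarray([init[k] for k in members], dtype=int),
                             minlength=s)
            p0[l] = (1 + ic) / (1 + ic).sum()
            n_l = sum((counts[k] for k in members), np.zeros((s, s)))
            P[l] = (1 + n_l) / (1 + n_l).sum(axis=1, keepdims=True)
            alpha[l] = (1 + len(members)) / (m + N)
        # 3. Reassign each process to its most probable cluster (log scale).
        logp = np.array([np.log(alpha) + np.log(p0[:, init[k]])
                         + np.einsum('ij,lij->l', counts[k], np.log(P))
                         for k in range(N)])
        z_new = logp.argmax(axis=1)
        # 4. Stop once no assignment changes; otherwise repeat from step 2.
        if np.array_equal(z_new, z):
            break
        z = z_new
    return z, p0, P, alpha
```

The `max_iter` guard is a practical safeguard; the hard-assignment loop typically stabilizes in a handful of iterations.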

Hartigan's k-means algorithm (Hartigan and Wong [1979]; Forgy [1965]) is analogous to this hard cluster assignment formulation for normal mixture models. The constrained EM approach lacks accuracy and detail but has the advantage of speed. The Gibbs sampler, on the other hand, can be used to compute arbitrary functionals of the distribution but takes several orders of magnitude longer to iterate to reasonable accuracy. Naturally, a hybrid algorithm may be useful to borrow from the strengths and diminish the effect of the weaknesses of both algorithms. A hybrid algorithm iterates the constrained EM algorithm to convergence. The cluster assignments from the constrained EM algorithm provide initial assignments for the Gibbs sampler. Then, with little or no burn-in, the Gibbs algorithm runs until it obtains decent estimates for the posterior means and variances of the parameters.

Summary
Analysis of sequences of events in which homogeneity of transition probabilities is suspect might benefit from this method. This algorithm not only segments the sequences but also gives interpretable results from which the analyst can readily draw application-specific conclusions.

References
Forgy, E. [1965]. "Cluster analysis of multivariate data: efficiency vs. interpretability of classifications." Biometrics 21:768.
Gelman, A., Carlin, J.B., Stern, H.S., and Rubin, D.B. [1995]. Bayesian Data Analysis, Chapman & Hall.
Hartigan, J.A. and Wong, M.A. [1979]. "A k-means clustering algorithm." Applied Statistics 28:100-108.
Hastings, W.K. [1970]. "Monte Carlo sampling methods using Markov chains and their applications." Biometrika 57:97-109.
Ridgeway, G. [1997]. "Finite discrete Markov process clustering." Technical Report MSR-TR-97-24, Microsoft Research.
Ross, S.M. [1993]. Introduction to Probability Models, 5th Edition, Academic Press.
Stephens, M. [1996]. "Dealing with the multimodal distributions of mixture model parameters." http://www.stats.ox.ac.uk/~stephens/identify.ps.
