Non-Parametric Bayesian Sum-Product Networks

Sang-Woo Lee [email protected] School of Computer Sci. and Eng., Seoul National University, Seoul 151-744, Korea

Christopher J. Watkins [email protected] Department of Computer Science, Royal Holloway, University of London, Egham, Surrey TW20 0EX, UK

Byoung-Tak Zhang [email protected] School of Computer Sci. and Eng., Seoul National University, Seoul 151-744, Korea

Abstract

We define two non-parametric models for Sum-Product Networks (SPNs) (Poon & Domingos, 2011). The first is a tree structure of Dirichlet Processes; the second is a dag of hierarchical Dirichlet Processes. These generative models for data implicitly define prior distributions over SPNs of tree and of dag structure. They allow MCMC fitting of SPN models to data, and the learning of SPN structure from data.

1. Introduction

We describe two non-parametric Bayesian models for sum-product networks (SPNs). SPNs are tractable probabilistic graphical models and a new deep architecture (Poon & Domingos, 2011). SPNs have been reported to show remarkable performance in computer vision, including image classification (Gens & Domingos, 2012) and video learning (Amer & Todorovic, 2012), and to outperform other probabilistic graphical models on variable retrieval problems (Gens & Domingos, 2013; Rooshenas & Lowd, 2014).

SPNs have a specialized structure for fast inference. They are constrained to be rooted directed acyclic graphs, and their leaves represent univariate distributions such as the multinomial distribution for discrete variables and the Gaussian distribution for continuous variables. Internal nodes in the structure are either 'product-nodes' or 'sum-nodes'.

Definition 1: SPNs are defined inductively as follows (Poon & Domingos, 2011):

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

1. A tractable univariate distribution is an SPN.
2. A product of SPNs with disjoint scopes (i.e. variable sets) is an SPN.
3. A weighted sum of SPNs with the same scope is an SPN, provided that all weights are positive and sum to 1.
4. Nothing else is an SPN.

First, let us define the probability distribution corresponding to an SPN. Let there be D variables X_1, ..., X_D; the SPN specifies a joint probability distribution over these variables. We assume throughout that D ≥ 2.

Each node in an SPN defines a joint probability distribution over a subset of the variables; we refer to this subset of the variables as the scope of the sum or product node. We use s to denote a scope as a subset of indices; in other words a scope with k indices is s = {s_1, ..., s_k} ⊆ {1, ..., D}. The indexed set of the corresponding variables is written X_s = (X_{s_1}, ..., X_{s_k}). The number of elements in the scope of a node is termed the level of the node.

In this paper we consider only binary data. Each node of level 1 (a univariate node) defines a Bernoulli distribution for the (binary) variable that is its scope. Univariate nodes have no children. We require that each product node and each sum node have level at least 2.¹ We write P_s and S_s to denote a product or sum node respectively with scope s, where s ⊆ {1, ..., D}, and write children(P_s) and children(S_s) for the (indexed) sets of children of product and sum nodes.

¹ For non-binary variables – for example, for real variables – it could be meaningful to have sum-nodes of level one, which would represent mixtures of univariate distributions, but this is not useful for the binary data considered here.


A product-node defines the p.d. over its scope that is the product of the p.d.s of its children, which are sum-nodes or univariate nodes with disjoint scopes. Each product node P defines a partition of its scope; we require this partition to have at least two elements. For a product node P_s with scope s, let

partition(P_s) = (s_1, ..., s_k),    children(P_s) = (S_{s_1}, ..., S_{s_k}),

where the s_i are non-empty and disjoint, s_1 ∪ ... ∪ s_k = s, and the children (S_{s_1}, ..., S_{s_k}) are sum-nodes or univariate nodes. If |s_i|, which is level(S_{s_i}), is equal to 1, then S_{s_i} is a univariate distribution node; otherwise S_{s_i} is a sum-node. Each product node P has exactly |partition(P)| child nodes.

We denote the probability distribution defined by a product node P by P(·), and the p.d. of a sum-node or univariate node S by S(·). A product node P_s with children (S_{s_1}, ..., S_{s_k}) then defines a joint p.d. over X_s, which is

$$P_s(X_s) = \prod_{i=1}^{k} S_{s_i}(X_{s_i}).$$

Note that a product node always has D or fewer children, because the scopes of its children are required to be disjoint.

A sum-node is either a univariate node, or else it defines a mixture distribution over its children, which are product-nodes, all with the same scope. In our model, each sum-node S_s may have a countable infinity of children (P_s^1, P_s^2, P_s^3, ...). For each child there is a corresponding non-negative weight w_1, w_2, w_3, ..., such that $\sum_{i=1}^{\infty} w_i = 1$. The distribution defined by S_s is

$$S_s(X_s) = \sum_{i=1}^{\infty} w_i \, P_s^i(X_s).$$

The top node of the SPN is the sum-node S_{1,...,D}, which we also denote S_top.
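To make the preceding definitions concrete, here is a minimal, self-contained sketch (our own illustration, not taken from the paper) of how a small SPN over three binary variables evaluates a joint probability; the class names and the example network are hypothetical.

```python
# Minimal sketch of SPN evaluation over binary variables (illustrative only).

class Univariate:
    def __init__(self, index, p_one):
        self.scope = {index}
        self.p_one = p_one          # Bernoulli parameter for X_index = 1

    def prob(self, x):              # x is a dict {variable index: 0 or 1}
        return self.p_one if x[next(iter(self.scope))] == 1 else 1.0 - self.p_one

class ProductNode:
    def __init__(self, children):
        # Children must have pairwise disjoint scopes.
        self.children = children
        self.scope = set().union(*(c.scope for c in children))

    def prob(self, x):
        result = 1.0
        for c in self.children:
            result *= c.prob(x)
        return result

class SumNode:
    def __init__(self, weights, children):
        # Children must all share this node's scope; weights are positive and sum to 1.
        self.weights = weights
        self.children = children
        self.scope = children[0].scope

    def prob(self, x):
        return sum(w * c.prob(x) for w, c in zip(self.weights, self.children))

# A tiny SPN over X1, X2, X3: a mixture of two product distributions.
p1 = ProductNode([Univariate(1, 0.9), Univariate(2, 0.8), Univariate(3, 0.1)])
p2 = ProductNode([Univariate(1, 0.2), Univariate(2, 0.3), Univariate(3, 0.7)])
top = SumNode([0.6, 0.4], [p1, p2])
print(top.prob({1: 1, 2: 1, 3: 0}))   # joint probability of one configuration
```

In the non-parametric models below, the sum-node weights are not fixed in advance in this way, but arise from Dirichlet processes.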

2. A prior distribution for SPN trees

We describe two non-parametric generative models for multivariate data, which indirectly specify non-parametric prior distributions over SPNs. In the first model, the SPN is always a tree, in that no product node is ever a child of two sum-nodes, and no sum-node is ever a child of two product nodes. In the second model, the SPN is a dag, in that every product node is potentially the child of many sum-nodes, but no sum-node is ever the child of two product nodes. The first model is simpler; the second model is more general.

The prior distributions are defined recursively, node by node. In both the tree and the dag models, the prior distribution of each sum-node S_s is a Dirichlet process (Teh et al., 2004), with concentration parameter α_s and base distribution denoted by G_P(s). The base distribution G_P(s) is a probability distribution over the set of possible product nodes with scope s. The distribution G_P(s) is a basic input to the model which expresses prior beliefs about the form of product nodes. In principle, any distribution over product nodes could be used, but a simple and in some sense elegant choice is to specify G_P(s) as a probability distribution over the partitions of s. Only the partition of s is chosen at this level; the prior distribution for each child sum-node of the product node is defined recursively, in the same manner, at its own level. If the partition generated contains singletons, G_P(s) must also specify a p.d. over the possible univariate distributions; for binary data, a natural choice is a beta distribution parametrised as a vague prior, such as Beta(1/2, 1/2).

Each sum-node S_s has a countably infinite number of child product-nodes (note that a sum-node can have multiple child product nodes with the same partition). The probability distribution over these children is a Dirichlet Process with base distribution G_P(scope(S_s)) and concentration parameter α_s. In this model, sum-nodes with identical scopes are distinct and independent.

In short, the prior distribution over SPNs is specified as a tree of Dirichlet Processes over product-nodes; each product-node has a finite branching factor, and each sum-node has an infinite branching factor. The tree has maximal depth D, and there is a unique top sum-node S_top = S_{1,...,D}. The prior is parametrised by:

• For each s, a concentration parameter α_s > 0. These may, for example, plausibly be a function of |s|.

• For each s, a probability distribution G_P(s) over partitions of s. These distributions may express significant prior knowledge concerning the desired shape of the tree, and also which partitions of the particular variables indexed by s are plausible. For example, different variables may have different meanings, and there may be prior beliefs that some partitions of s are more probable than others.

• For each i ∈ {1, ..., D}, a prior distribution on the univariate distribution S_{i}(·). For binary data, this is plausibly a vague beta prior.
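To illustrate the role of the Dirichlet process at a single sum-node, the following sketch draws a (truncated) infinite weight vector over child product nodes by stick-breaking; each weight would then be paired with an independent product node drawn from G_P(s). The truncation tolerance and the function name are our own assumptions; the model itself needs no explicit truncation, because sampling uses the urn representation described in Section 2.1.

```python
import random

def stick_breaking_weights(alpha, tolerance=1e-6):
    """Sample mixture weights of one sum-node's (countably infinite) children
    from a Dirichlet process via stick-breaking, truncated once the leftover
    stick is negligible. alpha is the concentration parameter for this scope."""
    weights, remaining = [], 1.0
    while remaining > tolerance:
        piece = random.betavariate(1.0, alpha)   # fraction of the remaining stick
        weights.append(remaining * piece)
        remaining *= (1.0 - piece)
    return weights                               # sums to 1 up to the tolerance

# Each weight would be paired with an independent product-node draw from G_P(s).
w = stick_breaking_weights(alpha=1.0)
print(len(w), sum(w))
```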


2.1. Generating data with the tree model

Sampling and inference with the tree model are rather straightforward: we place a Blackwell-MacQueen urn process (Blackwell & MacQueen, 1973) at each sum-node, and sample from each such process as required, starting from the top sum-node S_top. Each sample is generated recursively, as a tree of data requests: the root request, for all D variables, is sent to S_top. The recursion is as follows, for a node of type:

univariate-node: a sample from the univariate distribution defined by the node is returned.

sum-node: a child product-node P is sampled from the Blackwell-MacQueen urn process at the sum-node, and the sample request is then sent to that product node. The sample received from the product node is returned.

product-node: let the node's partition be (s_1, ..., s_k). The sample request is partitioned into k requests with scopes s_1 to s_k, and these requests are sent to the corresponding child nodes. When the samples from the k child nodes are returned, they are recombined to form a sample for the scope requested, and returned to the parent node.

This recursive sampling process generates an exchangeable sequence of samples. To carry out this sampling procedure, each sum-node (and each distribution-node) must maintain state information sufficient for its urn process: this information consists of the indexed sequence of samples taken from that node's generative process. The univariate distribution nodes may simply be fixed probability distributions on one variable, or they may have an arbitrarily complex structure. To ensure exchangeability of the samples from the entire structure, it is sufficient that each univariate distribution node should provide an exchangeable sequence of samples, separately from all other nodes.

This generative model, with urn processes, can be used in several ways (a minimal sketch of the recursion follows the list below):

• to generate an exchangeable sequence of samples;

• to fit a distribution to given data, using Gibbs sampling, Metropolis-Hastings, or many other MCMC methods. Gibbs and Metropolis-Hastings sampling can be performed recursively throughout the tree, and in parallel on different branches;

• for any given state of the sampling system, an SPN is defined implicitly by the predictive sampling probabilities for the next data item at each node. This SPN can then be used for inference as described in (Poon & Domingos, 2011).
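As an illustration only, the following sketch renders this recursion in Python for one simple, hypothetical choice of G_P(s) (a uniformly random two-block partition) and Beta(1/2, 1/2) leaves; the class names, the fixed concentration parameter, and the data structures are our own assumptions rather than the paper's implementation.

```python
import random

ALPHA = 1.0   # concentration parameter (could instead depend on |scope|)

class SumNode:
    """Blackwell-MacQueen urn over child product nodes for one scope."""
    def __init__(self, scope):
        self.scope = set(scope)
        self.children = []    # child product nodes created so far
        self.counts = []      # number of sample requests routed to each child

    def sample(self):
        n = sum(self.counts)
        # Existing child k with probability counts[k]/(n+alpha);
        # a brand-new child product node with probability alpha/(n+alpha).
        r = random.uniform(0.0, n + ALPHA)
        for k, c in enumerate(self.counts):
            if r < c:
                self.counts[k] += 1
                return self.children[k].sample()
            r -= c
        child = ProductNode(self.scope)
        self.children.append(child)
        self.counts.append(1)
        return child.sample()

class ProductNode:
    """Partitions its scope once, then routes each sub-request to a child."""
    def __init__(self, scope):
        scope = list(scope)
        random.shuffle(scope)
        cut = random.randint(1, len(scope) - 1)        # random two-block partition
        self.blocks = [set(scope[:cut]), set(scope[cut:])]
        self.children = [Leaf(b) if len(b) == 1 else SumNode(b) for b in self.blocks]

    def sample(self):
        out = {}
        for c in self.children:
            out.update(c.sample())
        return out

class Leaf:
    """Beta(1/2, 1/2)-Bernoulli urn for a single binary variable."""
    def __init__(self, scope):
        self.index = next(iter(scope))
        self.ones, self.total = 0, 0

    def sample(self):
        p = (self.ones + 0.5) / (self.total + 1.0)     # posterior predictive
        x = 1 if random.random() < p else 0
        self.ones += x
        self.total += 1
        return {self.index: x}

top = SumNode({1, 2, 3, 4})
samples = [top.sample() for _ in range(5)]             # exchangeable samples
print(samples)
```

Repeated calls to top.sample() reuse popular product nodes with probability proportional to their counts, which is the Blackwell-MacQueen urn behaviour described above.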

Algorithm 1 Update_Instance_Full_MH
Input: instances {x_1, ..., x_N}
Initialize M, l
for lo = 1 to nLoop do
  for n = 1 to N do
    M' ← Update_Instance_One_MH(M, root, x_n, n)
    temp_l ← likelihood(M', x_n)
    if rand() < min(1, temp_l / l(n)) then
      l(n) ← temp_l
      M ← M'
    end if
  end for
end for

Algorithm 2 Update_Instance_Full_Gibbs
Input: SPN M, instances {x_1, ..., x_N}
Initialize M
for lo = 1 to nLoop do
  for n = 1 to N do
    M ← Update_Instance_One_Gibbs(M, root, x_n, n)
  end for
end for

2.2. Learning algorithm for the tree structure


Computation is greatly simplified if the predictive distribution for a new product node P_s, not yet associated with any data, is of a simple known form. For example, a newly generated product node P_s with no data might predict the uniform distribution over X_s (for a binary scope of size |s|, probability 2^{-|s|} for every configuration).


Figure 1. Log-likelihoods of the tree model under the MH ('Tree with MH') and Gibbs ('Tree with Gibbs') sampling schemes, plotted against the number of instances (hundreds).


Algorithm 3 Update_Instance_One_MH_Tree
Input: SPN M, sum-node index i, instances {x_1, ..., x_N}, instance index n
Output: SPN M'
M' ← M
{pidx_1, ..., pidx_K} ← indexes of child nodes of sum-node M'.S_i
if M.S_i.allocate(n) ≠ empty then
  M'.P[M.S_i.allocate(n)].w −= 1
end if
for k = 1 to K do
  p(k) = M'.P[pidx_k].w / (Σ_{k=1}^{K} M'.P[pidx_k].w + α)
end for
p(K+1) = α / (Σ_{k=1}^{K} M'.P[pidx_k].w + α)
select k ∼ p(k)
if k ≤ K then
  M'.S_i.allocate(n) = pidx_k
  M'.P[M'.S_i.allocate(n)].w += 1
  {sidx_1, ..., sidx_L} ← indexes of child nodes of product node M'.P_{pidx_k}
  for l = 1 to L do
    M' ← Update_Instance_One_MH_Tree(M', sidx_l, x_n, n)
  end for
else
  M' ← make_prodnode_in_sumnode(M', i, x_n, n)
end if

In this section, we explain two sampling methods for learning the SPN topology. Algorithm 1 describes a Metropolis-Hastings (MH) sampling method, whereas Algorithm 2 describes a Gibbs sampling method. Algorithm 3 gives the MH learning rule used for updating the tree structure with one instance at a time; α is the concentration parameter of the DP. The Gibbs learning rule for updating one instance at a time is somewhat more involved than the MH rule, because the likelihood of choosing each child must be calculated in every step. However, since computing likelihoods in an SPN is tractable, this step remains straightforward.

The MH and Gibbs sampling schemes were evaluated using a hand-written optical digits dataset consisting of 8x8-pixel images with digit class labels. Only the binarized image pixels were used in the experiments. Partitions were split randomly every time a new product node was made. In Figure 1, Algorithm 1 is used for the 'Tree with MH' condition, whereas Algorithm 2 is used for the 'Tree with Gibbs' condition. The x-axis represents the number of instances used by the model; each instance was used only once. We found that the Gibbs sampling scheme performs better than the MH sampling scheme.
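For concreteness, the sketch below isolates the two numerical steps that Algorithms 1 and 3 combine: the urn-style proposal over an existing or new child product node, and the Metropolis-Hastings acceptance test on the instance likelihood. The function names and example counts are hypothetical, and the acceptance test is written with log-likelihoods; this is a simplified stand-in for the full recursive tree update, not the authors' code.

```python
import math
import random

def propose_child(counts, alpha):
    """Urn-style proposal at one sum-node (cf. Algorithm 3): existing child k is
    proposed with probability counts[k]/(sum(counts)+alpha), and a brand-new
    child product node with probability alpha/(sum(counts)+alpha)."""
    r = random.uniform(0.0, sum(counts) + alpha)
    for k, c in enumerate(counts):
        if r < c:
            return k                  # reuse existing child product node
        r -= c
    return len(counts)                # signal: create a new child product node

def mh_accept(old_loglik, new_loglik):
    """Acceptance test of Algorithm 1: accept with probability min(1, L_new/L_old),
    here evaluated via log-likelihoods."""
    return random.random() < math.exp(min(0.0, new_loglik - old_loglik))

# Example: a sum-node whose three child product nodes currently hold 4, 1 and 0 instances.
counts = [4, 1, 0]
print("proposed child:", propose_child(counts, alpha=1.0))
print("accepted:", mh_accept(old_loglik=-24.0, new_loglik=-23.5))
```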

3. Model 2: A prior distribution for a class of dag-SPNs

Model 1 is a tree, and this may be undesirable in some applications: sampling requests follow the tree, so nodes deep in the tree typically handle only a small fraction of all sample requests. Models of this type have been used (Gens & Domingos, 2013), but deep dag-structured SPNs can express a more interesting class of distributions (Bengio, 2009). It is straightforward to alter the model to allow the sampling-request paths to form a dag. Each individual sampling request will take the form of a tree within the dag, but the totality of sampling requests will lie on the dag. In such a model, even nodes deep in the dag may handle a large fraction of sampling requests, so that 'deep learning' becomes possible.

We alter Model 1 as follows. For each of the 2^D − D − 1 possible scopes s of level at least 2, we set up a 'sum-node-group' S_s, which consists of a hierarchical Dirichlet Process (Teh et al., 2004). A hierarchical DP consists of a set of 'layer 1' DPs which share a common 'layer 0' base distribution, which is itself a DP. For S_s, the base distribution of the layer 0 DP is G_P(s), which is the same as the base distribution of a sum-node with scope s in the tree-SPN model above. For layer 1 of S_s, we set up a separate layer 1 DP for each possible parent node from which sampling requests may come. More specifically, for any product-node with a child sum-node of scope s, that child sum-node is placed as a DP in layer 1 of S_s. In other words, all sum-nodes with scope s share the same base distribution over product-nodes, which is itself a DP with base distribution G_P(s).

The effect of this is that all sum-nodes with scope s in the dag will tend to share, and route requests to, common child product-nodes. Hence sampling requests from different sum-nodes can be routed to the same child product-node. Samples from a hierarchical DP are exchangeable, hence samples from the entire model are exchangeable, as before. Remarkably, the only new parameters required are the concentration parameters of the layer 0 and layer 1 Dirichlet processes for each scope: once again, these may plausibly be functions only of the size of the scope.
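Computationally, the only change relative to the tree model is in the predictive probabilities at a sum-node: its own allocation counts are smoothed towards shared, scope-level weights β over the existing product nodes (this is the p(k) rule in Algorithm 4 below). A minimal sketch of that rule, with hypothetical names and β supplied externally, is:

```python
def hdp_child_probs(local_counts, beta, alpha):
    """Predictive probabilities at one sum-node of the dag model:
    p(k) ∝ local_counts[k] + alpha * beta[k] for existing product nodes with this scope,
    p(new) ∝ alpha * beta_new, where beta are the shared (layer-0) weights."""
    beta_existing, beta_new = beta[:-1], beta[-1]
    weights = [c + alpha * b for c, b in zip(local_counts, beta_existing)]
    weights.append(alpha * beta_new)
    total = sum(weights)
    return [w / total for w in weights]

# Example: two product nodes share scope s; this particular sum-node has routed 3 and 0
# instances to them, but the shared beta still gives the second one some mass.
probs = hdp_child_probs(local_counts=[3, 0], beta=[0.5, 0.4, 0.1], alpha=1.0)
print(probs)
```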


Algorithm 4 Update_Instance_One_MH_DAG
Input: SPN M, sum-node index i, instances {x_1, ..., x_N}, instance index n
Output: SPN M'
M' ← M
{pidx_1, ..., pidx_K} ← indexes of product nodes with the same scope as sum-node M'.S_i
if M.S_i.allocate(n) ≠ empty then
  M'.P[M.S_i.allocate(n)].w_i −= 1
end if
for k = 1 to K do
  p(k) = (M'.P[pidx_k].w_i + α × M'.S_i.β_k) / (Σ_{k=1}^{K} M'.P[pidx_k].w_i + α)
end for
p(K+1) = (α × M'.S_i.β_{K+1}) / (Σ_{k=1}^{K} M'.P[pidx_k].w_i + α)
select k ∼ p(k)
if k ≤ K then
  M'.S_i.allocate(n) = pidx_k
  M'.P[M'.S_i.allocate(n)].w_i += 1
  {sidx_1, ..., sidx_L} ← indexes of child nodes of product node M'.P_{pidx_k}
  for l = 1 to L do
    M' ← Update_Instance_One_MH_DAG(M', sidx_l, x_n, n)
  end for
else
  M' ← make_prodnode_in_sumnode(M', i, x_n, n)
  M'.S_i.β ← sample_beta(M'.S_i.allocate, α, γ)
end if

Algorithms 1 and 2 can also be used for learning both the tree and the dag structure. Algorithm 4 gives the MH learning rule for updating the dag structure with one instance at a time; α and γ are the concentration parameters of the hierarchical DP. Algorithm 4 uses the inference scheme explained in Section 5.3 of (Teh et al., 2004). Experiments are in progress. There are a number of MCMC sampling techniques available for individual HDPs (Teh et al., 2004; Neal, 2000), and many possible schemes for managing the MCMC sampling for both models recursively.
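The sample_beta step in Algorithm 4 follows the auxiliary-variable scheme of Teh et al. (2004, Section 5.3). As a rough, simplified stand-in for that scheme (not the authors' procedure), one can resample the shared weights from a Dirichlet distribution whose parameters are per-product-node table counts plus γ; the sketch below takes the table counts as given rather than sampling them, which is an approximation, and all names are ours.

```python
import random

def resample_beta_approx(table_counts, gamma):
    """Approximate resampling of the shared layer-0 weights:
    beta ~ Dirichlet(m_1, ..., m_K, gamma), where m_k stands in for the table
    count of product node k across all sum-nodes with this scope."""
    params = list(table_counts) + [gamma]
    draws = [random.gammavariate(a, 1.0) for a in params]   # Dirichlet via gammas
    total = sum(draws)
    return [d / total for d in draws]   # last entry is the mass for new product nodes

print(resample_beta_approx(table_counts=[5, 2, 1], gamma=1.0))
```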

4. Discussion

There are relatively few effective algorithms for learning the structures of graphical models: MCMC sampling using the dag-SPN (our Model 2) appears to be a particularly elegant possibility, which is as far as we know unexplored. The construction appears both modular and generic, and could be applied to the generation of many types of structured object, provided the generative decisions take the form of a tree, and node-scopes respect the same partial ordering for all objects.

Learning in product nodes is, however, problematic: there are many possible partitions, and it is hard to find a good partition by Gibbs sampling with randomly proposed partitions. This is slow because the merit of a good partition only becomes apparent when many data-requests have been transferred to the new product node: when a product node is first generated, even if its partition is optimal, it has no data assigned to it, and so at first it predicts a uniform distribution over its scope variables. This means that every new product node is at an initial disadvantage compared to existing competitor product nodes which have data currently assigned to them. An effective sampling method, therefore, should propose alterations to product-node partitions directly; however, such proposals are expensive, since if high-level partition elements are altered in this way, then lower-level partition elements need to be changed as well. We are investigating such algorithms.

In the above experiment, random partition priors were explored, but other partition priors could also be used. It would be possible to have a hybrid method, in which the 'prior' over partitions of variables is made sensitive to the number of allocated instances: when few instances are allocated to some product node, the partition of that node would be fully separated into univariate nodes. To address this initialisation problem, we are also examining other SPN structure-learning methods to find good initialisations. One possibility is the LearnSPN algorithm of (Gens & Domingos, 2013). When a tree structure is built with the DP, the whole algorithm becomes a non-parametric Bayesian bi-clustering method. If a dag-SPN is built with the HDP, the algorithm becomes more interesting, and this is an idea we will explore in the future. Additionally, (Lowd & Domingos, 2013) and (Peharz et al., 2013) study bottom-up learning of SPNs and arithmetic circuits; these ideas could be used to generate good candidate product nodes. It is also interesting to note that (Rooshenas & Lowd, 2014) uses a hybrid of top-down and bottom-up approaches.

In summary, we have defined prior distributions on SPNs that specify only mixture-distribution concentration parameters, priors on partitions of scopes, and vague priors on univariate distributions. However, finding effective sampling methods remains a challenging problem.


Acknowledgments

This research was conducted while C. Watkins was visiting the Biointelligence Lab., School of Computer Science and Engineering at Seoul National University, with support from the MSIP (Ministry of Science, ICT & Future Planning), Korea. S.-W. Lee and B.-T. Zhang were supported by the National Research Foundation (NRF) grant funded by the Korea government (MSIP) (NRF-2013M3B5A2035921).

References

Poon, H. and Domingos, P. Sum-product networks: A new deep architecture. Uncertainty in Artificial Intelligence 27 (UAI 2011), 2011.

Gens, R. and Domingos, P. Discriminative learning of sum-product networks. Advances in Neural Information Processing Systems 25 (NIPS 2012), 2012.

Amer, M. R. and Todorovic, S. Sum-product networks for modeling activities with stochastic structure. Computer Vision and Pattern Recognition (CVPR 2012), 2012.

Gens, R. and Domingos, P. Learning the structure of sum-product networks. International Conference on Machine Learning 30 (ICML 2013), 2013.

Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2004.

Blackwell, D. and MacQueen, J. B. Ferguson distributions via Polya urn schemes. Annals of Statistics, 1(2):353–355, 1973.

Bengio, Y. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.

Neal, R. M. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.

Rooshenas, A. and Lowd, D. Learning sum-product networks with direct and indirect variable interactions. International Conference on Machine Learning 31 (ICML 2014), 2014.

Lowd, D. and Domingos, P. Learning Markov networks with arithmetic circuits. International Conference on Artificial Intelligence and Statistics 16 (AISTATS 2013), 2013.

Peharz, R., Geiger, B. C., and Pernkopf, F. Greedy part-wise learning of sum-product networks. European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD 2013), 2013.
