Poon, H. and Domingos, P.
in: Proceedings of the 27th conference on Uncertainty in Artificial Intelligence (UAI 2011), 2011
URL, Google Scholar
Abstract
The key limiting factor in graphical model inference and learning is the complexity of the partition function. We thus ask the question: what are general conditions under which the partition function is tractable? The answer leads to a new kind of deep architecture, which we call sumproduct networks (SPNs). SPNs are directed acyclic graphs with variables as leaves, sums and products as internal nodes, and weighted edges. We show that if an SPN is complete and consistent it represents the partition function and all marginals of some graphical model, and give semantics to its nodes. Essentially all tractable graphical models can be cast as SPNs, but SPNs are also strictly more general. We then propose learning algorithms for SPNs, based on backpropagation and EM. Experiments show that inference and learning with SPNs can be both faster and more accurate than with standard deep networks. For example, SPNs perform image completion better than state-of-the-art deep networks for this task. SPNs also have intriguing potential connections to the architecture of the cortex.
Review
The authors present a new type of graphical model which is hierarchical (rooted directed acyclic graph) and has a sum-product structure, i.e., the levels in the hierarchy alternately implement a sum or product operation of their children. They call these models sum-product networks (SPNs). The authors define conditions under which SPNs represent joint probability distributions over the leaves in the graph efficiently where efficient means that all the marginals can be computed efficiently, i.e., inference in SPNs is easy. They argue that SPNs subsume all previously known tractable graphical models while being more general.
When inference is tractable in SPNs, so is learning. Learning here means to update weights in the SPN which can also be used to change the structure of an SPN by pruning connections with 0 weights after convergence of learning. They suggest to use either EM or gradient-based learning, but note that for large hierarchies (very deep networks) you’ll have a gradient diffusion problem, as in general in deep learning. To overcome this problem they use the maximum posterior estimator which effectively updates only a single edge of a node instead of all edges dependent on the (diffusing) gradient.
The authors introduce the properties of SPNs using only binary variables. Leaves of the SPNs then are indicators for values of these variables, i.e., there are 2*number of variables leaves. It is straight forward to extend this to general discrete variables where the potential number of leaves then rises to number of values * number of variables. For continuous variables sum nodes become integral nodes (so you need distributions which you can easily integrate) and it is not so clear to me what leaves in the tree then are. In general, I didn’t follow the technical details well and can hardly comment on potential problems. One question certainly is how you initialise your SPN structure before learning (it will matter whether you start with a product or sum level at the bottom of your hierarchy and where the leaves are positioned).
Anyway, this work introduces a promising new deep network architecture which combines a solid probabilistic interpretation with tractable exact computations. In particular, in comparison to previous models (deep belief networks and deep Boltzmann machines) this leads to a jump in performance in both computation time and inference results as shown in image completion experiments. I’m looking forward to seeing more about this.