Probabilistic reasoning by neurons.

Yang, T. and Shadlen, M. N.
Nature, 447:1075–1080, 2007
DOI, Google Scholar


Our brains allow us to reason about alternatives and to make choices that are likely to pay off. Often there is no one correct answer, but instead one that is favoured simply because it is more likely to lead to reward. A variety of probabilistic classification tasks probe the covert strategies that humans use to decide among alternatives based on evidence that bears only probabilistically on outcome. Here we show that rhesus monkeys can also achieve such reasoning. We have trained two monkeys to choose between a pair of coloured targets after viewing four shapes, shown sequentially, that governed the probability that one of the targets would furnish reward. Monkeys learned to combine probabilistic information from the shape combinations. Moreover, neurons in the parietal cortex reveal the addition and subtraction of probabilistic quantities that underlie decision-making on this task.


The authors argue that the brain reasons probabilistically, because they find that single neuron responses (firing rates) correlate with a measure of probabilistic evidence derived from the probabilistic task setup. It is certainly true that the monkeys could learn the task (a variant of the weather prediction task) and I also find the evidence presented in the paper generally compelling, but the authors note themselves that similar correlations with firing rate may result from other quantitative measures with similar properties as the one considered here. May, for example, firing rates correlate similarly with a measure of expected value of a shape combination as derived from a reinforcement learning model?

What did they do in detail? They trained monkeys on a task in which they had to predict which of two targets will be rewarded based on a set of four shapes presented on the screen. Each shape contributed a certain weight to the probability of rewarding a target as defined by the experimenters. The monkeys had to learn these weights. Then they also had to learn (implicitly) how the weights of shapes are combined to produce the probability of reward. After about 130,000 trials the monkeys were good enough to be tested. The trick in the experiment was that the four shapes were not presented simultaneously, but appeared one after the other. The question was whether neurons in lateral intraparietal (LIP) area of the monkeys’ brains would represent the updated probabilities of reward after addition of each new shape within a trial. That the neurons would do that was hypothesised, because results from previous experiments suggested (see Gold & Shalden, 2007 for review) that neurons in LIP represent accumulated evidence in a perceptual decision making paradigm.

Now Shadlen seems convinced that these neurons do not directly represent the relevant probabilities, but rather represent the log likelihood ratio (logLR) of one choice option over the other (see, e.g., Gold & Shadlen, 2001 and Shadlen et al., 2008). Hence, these ‘posterior’ probabilities play no role in the paper. Instead all results are obtained for the logLR. Funnily the task is defined solely in terms of the posterior probability of reward for a particular combination of four shapes and the logLR needs to be computed from the posterior probabilities (Yang & Shadlen don’t lay out this detail in the paper or the supplementary information). I’m more open about the representation of posterior probabilities directly and I wondered how the correlation with logLR would look like, if the firing rates would respresent posterior probabilities. This is easy to simulate in Matlab (see Yang2007.m). Such a simulation shows that, as a function of logLR, the firing rate (representing posterior probabilities) should follow a sigmoid function. Compare this prediction to Figures 2c and 3b for epoch 4. Such a sigmoidal relationship derives from the boundedness of the posterior probabilities which is obviously reflected in firing rates of neurons as they cannot drop or rise indefinitely. So there could be simple reasons for the boundedness of firing rates other than that they represent probabilities, but in any case it appears unlikely that they represent unbounded log likelihood ratios.

A healthy fear of the unknown: perspectives on the interpretation of parameter fits from computational models in neuroscience.

Nassar, M. R. and Gold, J. I.
PLoS Comput Biol, 9:e1003015, 2013
DOI, Google Scholar


Fitting models to behavior is commonly used to infer the latent computational factors responsible for generating behavior. However, the complexity of many behaviors can handicap the interpretation of such models. Here we provide perspectives on problems that can arise when interpreting parameter fits from models that provide incomplete descriptions of behavior. We illustrate these problems by fitting commonly used and neurophysiologically motivated reinforcement-learning models to simulated behavioral data sets from learning tasks. These model fits can pass a host of standard goodness-of-fit tests and other model-selection diagnostics even when the models do not provide a complete description of the behavioral data. We show that such incomplete models can be misleading by yielding biased estimates of the parameters explicitly included in the models. This problem is particularly pernicious when the neglected factors are unknown and therefore not easily identified by model comparisons and similar methods. An obvious conclusion is that a parsimonious description of behavioral data does not necessarily imply an accurate description of the underlying computations. Moreover, general goodness-of-fit measures are not a strong basis to support claims that a particular model can provide a generalized understanding of the computations that govern behavior. To help overcome these challenges, we advocate the design of tasks that provide direct reports of the computational variables of interest. Such direct reports complement model-fitting approaches by providing a more complete, albeit possibly more task-specific, representation of the factors that drive behavior. Computational models then provide a means to connect such task-specific results to a more general algorithmic understanding of the brain.


Nassar and Gold use tasks from their recent experiments (e.g. Nassar et al., 2012) to point to the difficulties of interpreting model fits of behavioural data. The background is that it has become more popular to explain experimental findings (often behaviour) using computational models. But how reliable are those computational interpretations and how to ensure that they are valid? I will briefly review what Nassar and Gold did and point out that researchers investigating reward learning using computational models should think about learning rate adaptation in their experiments, because, in the light of the present paper, their results may else not be interpretable. Further, I will argue that Nassar and Gold’s appeal to more interaction between modelling and task design is just how science should work in principle.


The considered tasks belong to the popular class of reward learning tasks in which a subject has to learn which choices are rewarded to maximise reward. These tasks may be modelled by a simple delta-rule mechanism which updates current (learnt) estimates of reward by an amount proportional to a prediction error where the exact amount of update is determined by a learning rate. This learning rate is one of the parameters that you want to fit to data. The second parameter Nassar and Gold consider is the ‘inverse temperature’ which tells how a subject trades off exploitation (choose to get reward) against exploration (choose randomly).

Nassar and Gold’s tasks are special, because at so-called change points during an experiment the underlying rewards may abruptly change (in addition to smaller variation of reward between single trials). The experimental subject then has to learn the new reward values. Importantly, Nassar and Gold have found that subjects use an adaptive learning rate, i.e., when subjects encounter small prediction errors they tend to reduce the learning rate while they tend to increase learning rate when experiencing large prediction errors. However, typical delta-rule learning models assume a fixed learning rate.

The issue

The issue discussed in the paper is that it will not be easily possible to detect a problem when fitting a fixed learning rate model to choices which were produced with an adaptive learning rate. As shown in the present paper, this issue results from a redundancy between learning rate adaptiveness (a hyperparameter, or hidden factor) and the inverse temperature with respect to subject choices, i.e., a change in learning rate adaptiveness can equivalently be explained by a change in inverse temperature (with fixed learning rate adaptiveness) when such a change is only measured by the choices a subject makes. Statistically, this means that, if you were to fit learning rate adaptiveness with inverse temperature to subject choices, then you should find that the two parameters are highly correlated given the data. Even better, if you were to look at the posterior distribution of the two parameters given subject choices, you should observe a large variance of them together with a strong covariance between them. As a statistician you would then report this variance and acknowledge that interpretation may be difficult. But learning rate adaptiveness is not typically fitted to choices. Instead only learning rate itself is fitted given a particular adaptiveness. Then, the relation between adaptiveness and inverse temperature is hidden from the analysis and investigators may be fooled into thinking that the combination of fitted learning rate and inverse temperature comprehensively explains the data. Well, it does explain the data, but there are potentially many other explanations of this kind which become apparent when the hidden factor learning rate adaptiveness is taken into account.

What does it mean?

The discussed issue exemplifies a general problem of cognitive psychology: that you try to investigate (computational) mechanisms, e.g., decision making, by looking at quite impoverished data, e.g., decisions, which only represent the final product of the mechanisms. So what you do is to guess a mechanism (a model) and see whether it fits the data. In the case of Nassar and Gold there was a prevailing guess which fit the data reasonably well. By investigating decision making in a particular, new situation (environment with change points) they found that they needed to extend that mechanism to account for the new data. However, the extended mechanism now has many explanations for the old impoverished data, because the extended mechanism is more flexible than the old mechanism. To me, this is all just part of the normal progress in science and nothing to be alarmed about in principle. Yet, Nassar and Gold are right to point out that in the light of the extended mechanism fits of the old mechanism to old data may be misleading. Interpreting the parameters of the old mechanism may then be similar to saying that you find that the earth is a disk, because from your window it looks like the ground goes to the horizon in a straight line and then stops.


Essentially, Nassar and Gold try to convince us that when looking at reward learning we should now also take learning rate adaptiveness into account, i.e., that we should interpret subject choices within their extended mechanism. Two questions remain: 1) Do we trust that their extended mechanism is worth pursuing? 2) If yes, what can we do with the old data?

The present paper does not provide evidence that their extended mechanism is a useful model for subject choices (1), because they here assumed that the extended mechanism is true and investigated how you would interpret the new data using the old mechanism. However, their original study and others point to the importance of learning rate adaptiveness [see their refs. 9-11,26-28].

If the extended mechanism is correct, then the present paper shows that the old data is pretty much useless (2) unless learning rate adaptiveness has been, perhaps accidentally, controlled for in previous studies. This is because the old data from previous experiments (probably) does not allow to estimate learning rate adaptiveness. Of course, if you can safely assume that the learning rate of subjects stayed roughly fixed in your experiment, for example, because prediction errors were very similar during the whole experiment, then the old mechanism with fixed learning rate should still apply and your data is interpretable in the light of the extended mechanism. Perhaps it would be useful to investigate how robust fitted parameters are to varying learning rate adaptiveness in a typical experiment producing old data (here we only see results for experiments designed to induce changes in learning rate through large jumps in mean reward values).

Overall the paper has a very general tone. It tries to discuss the difficulties of fitting computational models to behaviour in general. In my opinion, these things should be clear to anyone in science as they just reflect how science progresses: you make models which need to fit an observed phenomenon and you need to refine models when new observations are made. You progress by seeking new observations. There is nothing special about fitting computational models to behaviour with respect to this.

Causal role of dorsolateral prefrontal cortex in human perceptual decision making.

Philiastides, M. G., Auksztulewicz, R., Heekeren, H. R., and Blankenburg, F.
Curr Biol, 21:980–983, 2011
DOI, Google Scholar


The way that we interpret and interact with the world entails making decisions on the basis of available sensory evidence. Recent primate neurophysiology [1-6], human neuroimaging [7-13], and modeling experiments [14-19] have demonstrated that perceptual decisions are based on an integrative process in which sensory evidence accumulates over time until an internal decision bound is reached. Here we used repetitive transcranial magnetic stimulation (rTMS) to provide causal support for the role of the dorsolateral prefrontal cortex (DLPFC) in this integrative process. Specifically, we used a speeded perceptual categorization task designed to induce a time-dependent accumulation of sensory evidence through rapidly updating dynamic stimuli and found that disruption of the left DLPFC with low-frequency rTMS reduced accuracy and increased response times relative to a sham condition. Importantly, using the drift-diffusion model, we show that these behavioral effects correspond to a decrease in drift rate, a parameter describing the rate and thereby the efficiency of the sensory evidence integration in the decision process. These results provide causal evidence linking the DLPFC to the mechanism of evidence accumulation during perceptual decision making.


They apply repetitive TMS to the dorsolateralprefrontal cortex (DLPFC) assuming that this inhibits the decision making ability of subjects, because DLPFC has been shown to be involved in perceptual decision making. Indeed, they find a significant effect of TMS vs. SHAM on the responses of subjects (after TMS responses of subjects are less accurate and take longer). They also argue that the effect is particular to TMS, because it reduces over time, but I wonder why they did not compute the corresponding interaction (they just report that the effect of TMS vs. SHAM is significant earlier, but not significant later).

Furthermore, they hypothesised that TMS disrupted the accumulation process of noisy evidence over time by decreasing the rate of evidence increase. This is based on the previous finding that the DLPFC has higher BOLD activation for less noisy stimuli which suggests that, when DLPFC is disrupted, the evidence coming from less noisy stimuli cannot be optimally processed anymore.

They investigated the evidence accumulation hypothesis by fitting a drift-diffusion model (DDM) to response data. The DDM has more parameters than are necessary to explain the variations of response data for the different experimental conditions. Hence, they use the Bayesian information criterion (BIC) to select parameters which should be fitted for each experimental condition separately, i.e., to be able to say which parameters are affected by the experimental manipulations. The other parameters are still fitted but to all data across experimental conditions. The problem is that the BIC is a very crude approximation just taking the number of freely varying parameters into account. For example, an assumption underlying the BIC is that the Hessian of the likelihood evaluated at the fitted parameter values has full rank (Bishop, 2006, p. 217), but for highly correlated parameters this may not be the case. The used DMAT fitting toolbox actually approximates the Hessian matrix, checks whether a local minimum has been found (instead of a valley) and computes confidence intervals from the approximated Hessian, but the authors report no results for this apart from error bars on the plot for drift rate and nondecision time.

Anyway, the BIC analysis conveniently indicates that drift rate and nondecision time best explain the variations in response data across conditions. However, it has to be kept in mind that these results have been obtained by (presumably) assuming that the diffusion is fixed across conditions which is the standard when fitting a DDM [private correspondence with Rafal Bogacz, 09/2012], because drift rate, diffusion and threshold are redundant (a change in one of them can be reverted by a suitable change in the others). The interpretation of the BIC analysis probably should be that drift rate and nondecision time are the smallest set of parameters which still allow a good fit of the data given that diffusion is fixed.

You need to be careful when interpreting the fitted parameter values in the different conditions. In particular, fitting a DDM to data assumes that the evidence accumulation still works like a DDM, just with different parameters. However, it is not clear what TMS does to the affected processes in the brain. Hence, we can only say from the fitting results that TMS has an effect which is equivalent to a reduction of the drift rate (no clear effect on nondecision time) in a normally functioning DDM.

Similarly, the interpretation of the results for nondecision time is not straightforward. There, the main finding is that nondecision time decreases for high-evidence stimuli which the authors interpret as a reduced time of low-level sensory processing which provides input to evidence accumulation. However, it should be kept in mind that the total amount of time necessary to make a decision is also reduced for high-evidence stimuli. Also, part of the processes which are collected under ‘nondecision time’ may actually work in parallel to evidence accumulation, e.g., movement preparation. If you look at the percentage of RT that is explained by the nondecision time, then the picture is reversed: for high-evidence stimuli nondecision time explains about 82% of RTs while for low-evidence stimuli it explains only about 75% which is consistent with the basic idea that evidence accumulation takes longer for noisier stimuli. In general, these percentages are surprisingly high. Does the evidence accumulation really only account for about 25% of total RTs? But it’s good that we have a number to compare now.

So what do these findings mean for the DLPFC? We cannot draw any definite conclusions. The hypothesis that TMS over DLPFC affects drift rate is somewhat built into the analysis, because the authors use a DDM to fit the responses. Of course, other parameters could have been affected stronger such that the finding of the BIC analysis that drift rate explains the changes best can indeed be taken as evidence for the drift rate hypothesis. However, it is not possible to exclude other explanations which lie outside the parameter space of the DDM. What, for example, if the DLPFC has indeed a somewhat attentional effect on evidence accumulation in the sense that it not only accumulates evidence, but also modulates how big the individual peaces of evidence are by modulating lower-level sensory processing? Then, interrupting the DLPFC may still have a similar effect as observed here, but the interpretation of the role of the DLPFC would be slightly different. Actually, the authors argue against a role of the DLPFC (at least the part of DLPFC they found) in attentional processing, but I’m not entirely convinced. Their main argument is based on the assumption that a top-down attentional effect of the DLPFC on low-level sensory processing would increase the nondecision time, but this is not necessarily true. A) there is the previously mentioned issue of parallel processing and the general problems of fitting a standard model to a disturbed process which makes me doubt the reliability of the fitted nondecision times and B) I can easily conceive a system in which attentional modulation would not delay low-level sensory processing.

Inhibitory plasticity balances excitation and inhibition in sensory pathways and memory networks.

Vogels, T. P., Sprekeler, H., Zenke, F., Clopath, C., and Gerstner, W.
Science, 334:1569–1573, 2011
DOI, Google Scholar


Cortical neurons receive balanced excitatory and inhibitory synaptic currents. Such a balance could be established and maintained in an experience-dependent manner by synaptic plasticity at inhibitory synapses. We show that this mechanism provides an explanation for the sparse firing patterns observed in response to natural stimuli and fits well with a recently observed interaction of excitatory and inhibitory receptive field plasticity. The introduction of inhibitory plasticity in suitable recurrent networks provides a homeostatic mechanism that leads to asynchronous irregular network states. Further, it can accommodate synaptic memories with activity patterns that become indiscernible from the background state but can be reactivated by external stimuli. Our results suggest an essential role of inhibitory plasticity in the formation and maintenance of functional cortical circuitry.


The authors show that, if the same input to an output neuron arrives through an excitatory and a delayed inhibitory channel, synaptic plasticity (a symmetric STDP rule) at the inhibitory synapses leads to “detailed balance”, i.e., to cancellation of excitatory and inhibitory input currents. Then, the output neuron fires sparsely and irregularly (as observed for real neurons) only when an excitatory input was not predicted by the implicit model encoded by the synaptic weights of the inhibitory inputs. The adaptation of the inhibitory synapses also matches potential changes in the excitatory synapses, although here they only present simulations in which excitatory synapses changed abruptly and stayed constant afterwards. (What happens when excitatory and inhibitory synapses change concurrently?) Finally, the authors show that similar results apply to recurrently connected networks of neurons with dedicated inhibitory neurons (balanced networks). Arbitrary activity patterns can be encoded by the excitatory connections, activity in these patterns is then suppressed by the inhibitory neurons, while partial activation of the patterns through external input reactivates the whole patterns (cf. recall of memory) without suppressing potential reactivation of other patterns in the network.

These are interesting ideas, clearly presented and with very detailed supplementary information. The large number of inhibitory neurons in cortex makes the assumed pairing of excitatory and inhibitory input at least possible, but I don’t know how prevalent this really is. Another important assumption here is that the inhibitory input is a bit slower than the excitatory input. This makes intuitive sense, if you assume that the inhibitory input needs to be relayed through an additional inhibitory neuron, but I’ve seen the opposite assumption before, too.

Information Theory of Decisions and Actions.

Tishby, N. and Polani, D.
in: Perception-Action Cycle, Springer New York, pp. 601–636, 2011
URL, Google Scholar


The perception–action cycle is often defined as “the circular flow of information between an organism and its environment in the course of a sensory guided sequence of actions towards a goal” (Fuster, Neuron 30:319–333, 2001; International Journal of Psychophysiology 60(2):125–132, 2006). The question we address in this chapter is in what sense this “flow of information” can be described by Shannon’s measures of information introduced in his mathematical theory of communication. We provide an affirmative answer to this question using an intriguing analogy between Shannon’s classical model of communication and the perception–action cycle. In particular, decision and action sequences turn out to be directly analogous to codes in communication, and their complexity – the minimal number of (binary) decisions required for reaching a goal – directly bounded by information measures, as in communication. This analogy allows us to extend the standard reinforcement learning framework. The latter considers the future expected reward in the course of a behaviour sequence towards a goal (value-to-go). Here, we additionally incorporate a measure of information associated with this sequence: the cumulated information processing cost or bandwidth required to specify the future decision and action sequence (information-to-go). Using a graphical model, we derive a recursive Bellman optimality equation for information measures, in analogy to reinforcement learning; from this, we obtain new algorithms for calculating the optimal trade-off between the value-to-go and the required information-to-go, unifying the ideas behind the Bellman and the Blahut–Arimoto iterations. This trade-off between value-to-go and information-to-go provides a complete analogy with the compression–distortion trade-off in source coding. The present new formulation connects seemingly unrelated optimization problems. The algorithm is demonstrated on grid world examples.


Peter Dayan pointed me to this paper (which is actually a book chapter) when I told him that I find the continuous interaction between perception and action important and that Friston’s free energy framework is one of the few which covers this case. Now, this paper covers only discrete time (and states and actions), but certainly it addresses the issue that perception and action influence each other.

The main idea of the paper is to take the informational effort (they call it information-to-go) into account when finding a policy for a Markov decision process. A central finding is a recursive equation analogous to the (Bellman) equation for the Q-function in reinforcement learning which captures the expected (over all possible future state-action trajectories) informational effort of a certain state-action pair. Informational effort is defined as the KL-divergence between a factorising prior distribution over future states and actions (making them independent across time) and their true distribution. This means that the informational effort is the expected number of bits of information that you have to consider in addition to your prior when moving through the future. They then propose a free energy (also a recursive equation) which combines the informational effort with the Q-function of the underlying MDP and thus allows simultaneous optimisation of informational effort and reward where the two are traded off against each other.

Practically, this leads to “soft vs. sharp policies”: sharp policies which always choose the action with highest expected reward and soft policies which choose actions probabilistically with an associated penalty on reward compared to sharp policies. The softness of the resulting policy is controlled by the tradeoff parameter between informational effort and reward which can be interpreted as the informational capacity of the system under consideration. I understand it this way: the tradeoff parameter stands for the informational complexity/capacity of the distributions representing the internal model of the world in the agent and the optimal policy with a particular setting of tradeoff parameter is the optimal policy with respect to reward alone that a corresponding agent can achieve. This is easily seen when considering that informational effort depends on the prior for future state-action trajectories. For a given prior, tradeoff parameter and resulting policy you can find the corresponding more complex prior for which the same policy can be found for 0 informational effort. The prior here obviously corresponds to the internal model of the agent. Consequently, the authors present a general framework with which you can ask questions such as: “How much informational capacity does my agent need to solve a given task with a desired level of performance?” Or, in other words: “How complex does my agent need to be in order to solve the given task?” Or: “How well can my agent solve the given task?” Although this latter question is the standard question in RL. In particular, my intuition tells me that for every setting of the tradeoff parameter there probably is an equivalent POMDP formulation (which makes the corresponding difference between world and agent model explicit).

A particularly interesting discussion is that about “perfectly adapted environments” which seems to be directed towards Friston without mentioning him, though. The discussion results from the ability to optimise their free energy combined from informational effort and reward not only with respect to the policy, but also with respect to the (true) transition probabilities. The outcome of such an optimisation is an environment in which transition probabilities are directly related to rewards, or, in other words, an environment in which informational effort is equal to something like negative reward. In such an environment “minimizing the statistical surprise or maximizing the predictive information is equivalent to maximizing reward” which is what Friston argues (see also the associated discussion on Needless to say that they consider this as a very special case while in most other cases the environment contains information that is irrelevant in terms of reward. Nevertheless, they consider the possibility that the environments of living organisms are indeed perfectly or at least well adapted through millions of years of coevolution and they suggest to direct future research towards this issue. The question really is what is reward in this general sense? What is it that living organisms try to achieve? The more concrete reward is, for example, reward for a particular task, the less relevant most information in the environment will be. I’m tempted to say that the combined optimisation of informational effort and rewards, as presented here, will then lead to policies which particularly seak out relevant information, but I’m not sure whether this is a correct interpretation.

To sum up Tishby and Polani present a new theoretical framework which generalises reinforcement learning by incorporating ideas from information theory. They provide an interesting new perspective which is presented in a pleasingly accessible way. I do not think that they solved any particular problem in reinforcement learning, but they broadened the view by postulating that agents tradeoff informational effort (capacity?) and reward. Practically, computations derived from their framework may not be feasible in most cases, because original reinforcement learning is already hard and here a few expectations have been added. Or, maybe it’s not so bad, because you can do them together.

Sum-Product Networks: A New Deep Architecture.

Poon, H. and Domingos, P.
in: Proceedings of the 27th conference on Uncertainty in Artificial Intelligence (UAI 2011), 2011
URL, Google Scholar


The key limiting factor in graphical model inference and learning is the complexity of the partition function. We thus ask the question: what are general conditions under which the partition function is tractable? The answer leads to a new kind of deep architecture, which we call sum-product networks (SPNs). SPNs are directed acyclic graphs with variables as leaves, sums and products as internal nodes, and weighted edges. We show that if an SPN is complete and consistent it represents the partition function and all marginals of some graphical model, and give semantics to its nodes. Essentially all tractable graphical models can be cast as SPNs, but SPNs are also strictly more general. We then propose learning algorithms for SPNs, based on backpropagation and EM. Experiments show that inference and learning with SPNs can be both faster and more accurate than with standard deep networks. For example, SPNs perform image completion better than state-of-the-art deep networks for this task. SPNs also have intriguing potential connections to the architecture of the cortex.


The authors present a new type of graphical model which is hierarchical (a rooted directed acyclic graph) and has a sum-product structure, i.e., the levels in the hierarchy alternately implement a sum or product operation over their children. They call these models sum-product networks (SPNs). The authors define conditions under which SPNs represent joint probability distributions over the leaves in the graph efficiently, where efficient means that all marginals can be computed efficiently, i.e., inference in SPNs is easy. They argue that SPNs subsume all previously known tractable graphical models while being more general.

When inference is tractable in SPNs, so is learning. Learning here means updating the weights in the SPN, which can also be used to change the structure of an SPN by pruning connections with 0 weights after convergence of learning. They suggest using either EM or gradient-based learning, but note that for large hierarchies (very deep networks) you’ll have a gradient diffusion problem, as in deep learning generally. To overcome this problem they use the maximum posterior estimator, which effectively updates only a single edge of a node instead of all edges dependent on the (diffusing) gradient.

The authors introduce the properties of SPNs using only binary variables. Leaves of the SPNs then are indicators for values of these variables, i.e., there are 2 × (number of variables) leaves. It is straightforward to extend this to general discrete variables, where the potential number of leaves then rises to (number of values) × (number of variables). For continuous variables sum nodes become integral nodes (so you need distributions which you can easily integrate) and it is not so clear to me what the leaves in the tree then are. In general, I didn’t follow the technical details well and can hardly comment on potential problems. One question certainly is how you initialise your SPN structure before learning (it will matter whether you start with a product or sum level at the bottom of your hierarchy and where the leaves are positioned).
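To make the indicator-leaf idea concrete, here is a toy SPN over two binary variables (my own example, not from the paper): evaluating it with one indicator per variable set gives a joint probability, and setting both indicators of a variable to 1 sums that variable out, which is why all marginals come at the cost of a single upward pass.

```python
def spn(x1, nx1, x2, nx2, w1=0.6, w2=0.4):
    """Toy SPN over binary X1, X2. Leaves are the indicators (x1, nx1, x2, nx2);
    internal nodes are two products and one weighted sum (the root)."""
    p1 = x1 * x2      # product node over scope {X1, X2}
    p2 = nx1 * nx2    # product node over the same scope (completeness holds)
    return w1 * p1 + w2 * p2  # sum node = root

# joint query P(X1=1, X2=1): set the matching indicators to 1, the others to 0
print(spn(1, 0, 1, 0))   # 0.6
# marginal P(X1=1): set both indicators of X2 to 1 to sum X2 out
print(spn(1, 0, 1, 1))   # 0.6
# partition function: all indicators set to 1 (here 1.0, i.e., normalised)
print(spn(1, 1, 1, 1))   # 1.0
```

Both product nodes have the same scope and no contradictory indicators, so this toy network is complete and consistent in the paper's sense.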

Anyway, this work introduces a promising new deep network architecture which combines a solid probabilistic interpretation with tractable exact computations. In particular, in comparison to previous models (deep belief networks and deep Boltzmann machines) this leads to a jump in performance in both computation time and inference results as shown in image completion experiments. I’m looking forward to seeing more about this.

Bayesian estimation of dynamical systems: an application to fMRI.

Friston, K. J.
Neuroimage, 16:513–530, 2002
DOI, Google Scholar


This paper presents a method for estimating the conditional or posterior distribution of the parameters of deterministic dynamical systems. The procedure conforms to an EM implementation of a Gauss-Newton search for the maximum of the conditional or posterior density. The inclusion of priors in the estimation procedure ensures robust and rapid convergence and the resulting conditional densities enable Bayesian inference about the model parameters. The method is demonstrated using an input-state-output model of the hemodynamic coupling between experimentally designed causes or factors in fMRI studies and the ensuing BOLD response. This example represents a generalization of current fMRI analysis models that accommodates nonlinearities and in which the parameters have an explicit physical interpretation. Second, the approach extends classical inference, based on the likelihood of the data given a null hypothesis about the parameters, to more plausible inferences about the parameters of the model given the data. This inference provides for confidence intervals based on the conditional density.


I presented the algorithm which underlies various forms of dynamic causal modeling and which we use to estimate RNN parameters. At its core is an iterative computation of the posterior of the parameters of a dynamical model, based on a first-order Taylor series approximation of a meta-function mapping parameter values to observations, i.e., the dynamical system is hidden inside this function so that the probabilistic model does not have to care about it. This is possible because the dynamics is assumed to be deterministic and noise only contributes at the level of observations. It can be shown that the resulting update equations for the posterior mode are equivalent to a Gauss-Newton optimisation of the log-joint probability of observations and parameters (this is MAP estimation of the parameters). Consequently, the rate of convergence may be up to quadratic, but the procedure is not guaranteed to increase the likelihood at every step, or indeed to converge at all. It should work well close to an optimum (when observations are well fitted), or if the dynamics is close to linear with respect to the parameters. Because the dynamical system is integrated numerically to get observation predictions, and the Jacobian of the observations with respect to the parameters is also obtained numerically, this algorithm may be very slow.
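In outline, the iteration is a Gauss-Newton step on the log-joint with a Gaussian prior. The following is an illustrative sketch of that scheme under my own simplifications (known noise precision, no EM over hyperparameters), not Friston's actual implementation:

```python
import numpy as np

def gauss_newton_map(g, y, theta0, prior_mean, prior_prec, obs_prec,
                     n_iter=20, eps=1e-6):
    """MAP estimation for y = g(theta) + Gaussian noise, with Gaussian prior
    on theta. g hides the (deterministic) dynamical system, as in the paper."""
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(n_iter):
        g0 = g(theta)
        # numerical Jacobian of the observation map (finite differences)
        J = np.zeros((len(y), len(theta)))
        for j in range(len(theta)):
            d = np.zeros_like(theta)
            d[j] = eps
            J[:, j] = (g(theta + d) - g0) / eps
        # Gauss-Newton step on the log-joint (likelihood term + prior term)
        H = J.T @ obs_prec @ J + prior_prec          # approx. posterior precision
        grad = J.T @ obs_prec @ (y - g0) - prior_prec @ (theta - prior_mean)
        theta = theta + np.linalg.solve(H, grad)
    return theta, np.linalg.inv(H)    # posterior mode and covariance at the mode

# sanity check on a linear observation map, where the posterior is analytic
A = np.array([[1., 0.], [0., 1.], [1., 1.]])
mode, cov = gauss_newton_map(lambda th: A @ th, np.array([1., 2., 3.]),
                             np.zeros(2), np.zeros(2), np.eye(2), np.eye(3))
```

For a linear map the first step already lands on the analytic posterior mean, which illustrates the claim that convergence is fast when the dynamics is close to linear in the parameters.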

This algorithm is described in Friston2002 embedded in an application to fMRI. I did not present the specifics of this application and, in particular, ignored the influence of the inputs u defined there. The derivation of the parameter posterior described above is embedded in an EM algorithm for hyperparameters on the covariance of the observations. I will discuss this in a future session.

Temporal sparseness of the premotor drive is important for rapid learning in a neural network model of birdsong.

Fiete, I. R., Hahnloser, R. H. R., Fee, M. S., and Seung, H. S.
J Neurophysiol, 92:2274–2282, 2004
DOI, Google Scholar


Sparse neural codes have been widely observed in cortical sensory and motor areas. A striking example of sparse temporal coding is in the song-related premotor area high vocal center (HVC) of songbirds: The motor neurons innervating avian vocal muscles are driven by premotor nucleus robustus archistriatalis (RA), which is in turn driven by nucleus HVC. Recent experiments reveal that RA-projecting HVC neurons fire just one burst per song motif. However, the function of this remarkable temporal sparseness has remained unclear. Because birdsong is a clear example of a learned complex motor behavior, we explore in a neural network model with the help of numerical and analytical techniques the possible role of sparse premotor neural codes in song-related motor learning. In numerical simulations with nonlinear neurons, as HVC activity is made progressively less sparse, the minimum learning time increases significantly. Heuristically, this slowdown arises from increasing interference in the weight updates for different synapses. If activity in HVC is sparse, synaptic interference is reduced, and is minimized if each synapse from HVC to RA is used only once in the motif, which is the situation observed experimentally. Our numerical results are corroborated by a theoretical analysis of learning in linear networks, for which we derive a relationship between sparse activity, synaptic interference, and learning time. If songbirds acquire their songs under significant pressure to learn quickly, this study predicts that HVC activity, currently measured only in adults, should also be sparse during the sensorimotor phase in the juvenile bird. We discuss the relevance of these results, linking sparse codes and learning speed, to other multilayered sensory and motor systems.


They model the generation of bird song as a simple feed-forward network and show that a sparse temporal code of HVC neurons (feeding into RA neurons) speeds up learning with backpropagation. They argue that this speed up is the main explanation for why real HVC neurons exhibit a sparse temporal code.

HVC neurons are modelled as either on or off, i.e., bursting or non-bursting, while RA neurons have continuous activities. A linear combination of RA activities then determines the output of the network. They define a desired, low-pass filtered output that should be learnt, but while their Fig. 2 suggests that they model the sequential aspect of the data, the actual network has no such component and the temporal order of the data points is irrelevant for learning. Maybe fixing, i.e., not learning, the connections from RA to the output is biologically well motivated, but other choices for the network seem quite arbitrary, e.g., why do RA neurons project from the beginning to only one of the two outputs? They varied quite a few parameters, though, and found that their main result (learning is faster with sparse HVC firing) holds for all of them. Interesting to note: they had to initialise the HVC-RA weights and the RA thresholds such that initial RA activity is low and nonuniform in order to get the desired type of RA activity after learning.
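The interference argument can be illustrated with a toy delta-rule learner (my own sketch, not the authors' network): with a one-burst-per-motif (one-hot) HVC code, each output weight is touched by exactly one time step, so updates for different time steps cannot interfere.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 20                                   # time steps in one "motif"
target = rng.normal(size=T)              # desired output at each step

def final_error(H, lr=0.1, epochs=50):
    """Online delta-rule learning of weights w so that H @ w fits target.
    H is the (T, N) binary HVC activity matrix (row t = pattern at step t)."""
    w = np.zeros(H.shape[1])
    for _ in range(epochs):
        for t in range(T):
            err = target[t] - H[t] @ w
            w += lr * err * H[t]         # update only touches weights of active units
    return np.mean((H @ w - target) ** 2)

sparse = np.eye(T)                                   # each unit bursts exactly once
dense = (rng.random((T, T)) < 0.5).astype(float)     # units active half of the time

e_sparse = final_error(sparse)   # no two updates share a weight: fast convergence
e_dense = final_error(dense)     # overlapping patterns make updates interfere
```

In the sparse case every weight sees a clean geometric contraction towards its target, which is the intuition behind the paper's minimum-learning-time result; the dense case mixes errors from all time steps into shared weights.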

I didn’t like the paper that much, because they showed the benefit of sparse coding for the biologically implausible backpropagation learning. Would the result also hold up under a Hebbian learning paradigm? On the other hand, the whole idea that learning works better when each neuron is only responsible for one restricted part of the stimulus is so outrageously intuitive that you wonder why it needed to be shown in the first place (though Stefan noted that he doesn’t know of work comparing temporal with spatial sparseness). Finally, you cannot argue that this is the main reason why HVC neurons fire in a temporally sparse manner, because there might be other, unknown reasons and this might only be a side effect.

SORN: a self-organizing recurrent neural network.

Lazar, A., Pipa, G., and Triesch, J.
Front Comput Neurosci, 3:23, 2009
DOI, Google Scholar


Understanding the dynamics of recurrent neural networks is crucial for explaining how the brain processes information. In the neocortex, a range of different plasticity mechanisms are shaping recurrent networks into effective information processing circuits that learn appropriate representations for time-varying sensory stimuli. However, it has been difficult to mimic these abilities in artificial neural network models. Here we introduce SORN, a self-organizing recurrent network. It combines three distinct forms of local plasticity to learn spatio-temporal patterns in its input while maintaining its dynamics in a healthy regime suitable for learning. The SORN learns to encode information in the form of trajectories through its high-dimensional state space reminiscent of recent biological findings on cortical coding. All three forms of plasticity are shown to be essential for the network’s success.


The paper considers the question of whether adapting an RNN used as a reservoir gives better performance in a sequence prediction task than randomly initialised RNNs. The authors demonstrate an adaptation procedure based on spike-timing-dependent plasticity (STDP) controlled with intrinsic plasticity (IP) and synaptic normalisation (SN) as homeostatic mechanisms and show that the performance of the adapted RNNs is indeed superior to the performance of the random RNNs. They further show that IP and SN are necessary for good results, or rather that without either the RNN exhibits disadvantageous firing patterns (bursting, always on, always off).

This is one of the few studies which show successful learning in RNNs. However, they use a rather simple model: a binary network in discrete time. The connectivity of the network is more elaborate: there are excitatory units which are recurrently connected, as well as fewer inhibitory neurons which have no connections between themselves, but are fully and reciprocally connected with all excitatory units. Input is given to the excitatory units through input units which are separated into subsets, each of which emits a spike (1) when a specific symbol in the input sequence is currently present (input sequences consist of letters and numbers). The authors show that the RNN develops states (the activity of all units in the network as a vector) which are specific to individual input symbols, with the addition that the serial position of the input symbol in the sequence is also represented. This simplifies readout of the current symbol in the sequence from RNN activity and hence leads to improved performance at predicting the next symbol in the sequence using a standard reservoir computing readout function. However, the authors note that the RNN keeps on changing its response to input, i.e., their learning rule does not converge, which means that the readout function would have to be updated all the time as well. Consequently, they switch off learning in the test phase.
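The interplay of the three plasticity rules can be sketched roughly as follows. This is a deliberately simplified sketch with made-up constants; the paper's exact update rules, input drive and the inhibitory population are omitted:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50                                                 # excitatory units only
W = rng.random((N, N)) * (rng.random((N, N)) < 0.1)    # sparse recurrent weights
np.fill_diagonal(W, 0.0)
thresholds = rng.random(N) * 0.5
x = (rng.random(N) < 0.5).astype(float)                # binary unit states

eta_stdp, eta_ip, h_ip = 0.001, 0.001, 0.1             # illustrative constants

for step in range(200):
    x_prev = x
    x = (W @ x_prev - thresholds > 0).astype(float)    # binary threshold update
    # STDP: strengthen pre-before-post pairs, weaken post-before-pre pairs
    W += eta_stdp * (np.outer(x, x_prev) - np.outer(x_prev, x))
    W = np.clip(W, 0.0, None)                          # excitatory weights stay >= 0
    # SN: normalise each unit's incoming weights so they sum to (at most) one
    W /= np.maximum(W.sum(axis=1, keepdims=True), 1e-12)
    # IP: nudge thresholds so each unit fires near the target rate h_ip
    thresholds += eta_ip * (x - h_ip)
```

The sketch shows why the homeostatic mechanisms matter: SN keeps the total drive onto each unit bounded and IP pulls firing rates towards a target, which is what prevents the bursting / always-on / always-off regimes the authors observe when either mechanism is removed.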

The authors also show that it is beneficial for the recurrent connections between excitatory units to be sparse.

BM: An iterative algorithm to learn stable non-linear dynamical systems with Gaussian mixture models.

Khansari-Zadeh, S. M. and Billard, A.
in: Proc. IEEE Int Robotics and Automation (ICRA) Conf, pp. 2381–2388, 2010
DOI, Google Scholar


We model the dynamics of non-linear point-to-point robot motions as a time-independent system described by an autonomous dynamical system (DS). We propose an iterative algorithm to estimate the form of the DS through a mixture of Gaussian distributions. We prove that the resulting model is asymptotically stable at the target. We validate the accuracy of the model on a library of 2D human motions and to learn a control policy through human demonstrations for two multi-degrees of freedom robots. We show the real-time adaptation to perturbations of the learned model when controlling the two kinematically-driven robots.


The authors describe a system for learning nonlinear, multivariate dynamical systems based on Gaussian mixture regression (GMR). The difference from previous approaches using GMR (e.g. Gribovskaya2010) is that the GMR is obtained by pruning a Gaussian mixture model which has a Gaussian at each time point, such that accuracy and stability criteria are adhered to. Pruning here actually means that two neighbouring Gaussians are merged. Consequently, the main contribution of the paper is the derivation and proof of the corresponding stability criteria – something that I haven’t checked properly.
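As background, GMR itself amounts to computing the conditional mean of the velocity given the current state under a joint Gaussian mixture. A minimal 1-D sketch (my own illustration of standard GMR, not the authors' merging algorithm):

```python
import numpy as np

def gmr(x, priors, means, covs):
    """1-D Gaussian mixture regression: E[x_dot | x] under a joint GMM over
    (x, x_dot). means[k] = (mu_x, mu_xdot); covs[k] = 2x2 joint covariance."""
    num = den = 0.0
    for pk, mu, S in zip(priors, means, covs):
        s_xx, s_yx = S[0][0], S[1][0]
        # responsibility: prior times Gaussian density of x under component k
        w = pk * np.exp(-0.5 * (x - mu[0]) ** 2 / s_xx) / np.sqrt(2 * np.pi * s_xx)
        # conditional mean of x_dot given x within component k
        cond = mu[1] + s_yx / s_xx * (x - mu[0])
        num += w * cond
        den += w
    return num / den

# single-component check: reduces to plain linear regression x_dot = 0.5 * x
print(gmr(1.0, [1.0], [(0.0, 0.0)], [[[1.0, 0.5], [0.5, 1.0]]]))   # 0.5
```

With one Gaussian per time point before pruning, each component contributes such a local linear model, and merging neighbouring Gaussians coarsens this piecewise-linear vector field while the stability criteria constrain the result.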

They make a quantitative comparison between their binary merging approach, original EM learning of GMR, using LWPR to learn the dynamics, and using DMPs. However, they do not describe the precise procedures. I am particularly surprised about the very low accuracy of the DMPs compared to the other approaches. Unless they have done something special (such as introducing large temporal deviations as done for Fig. 2), I don’t see why the accuracy of the DMPs should be so low.

They argue that the main advantages of their approach are that the minimal number of Gaussians is determined automatically while the resulting dynamics is stable at all times, that the multivariate Gaussians can capture correlations between dimensions (in contrast to DMPs), and that the computations are less costly than when using Gaussian process regression. The disadvantages are that the number of parameters increases quadratically with the dimensionality (curse of dimensionality; not so crucial for their 2-, 4- or 6-D examples, but then?), but, in particular, that the pruning procedure is highly susceptible to local minima and its results depend on the order in which Gaussians are merged. In the extreme case, imagine that through the presence of noise none of the initial Gaussians can be merged without violating the accuracy constraint. Again, this might not be a problem for their very smooth data, but it will become problematic for noisier data. Similar problems lead to the dependency on the order of merges (which are selected randomly). To overcome the order dependency they suggest restarting the algorithm several times and then selecting the result with the smallest number of Gaussians. Note that this compromises their computational advantage over GPs. While computing a GP mapping is cubic in the number of data points, merging the Gaussians is quadratic in the number of time points; but if you consider that different merge orders need to be checked, you’ll notice that there are on the order of 2^(number of time points) possible merge sequences, meaning that the computational cost can increase exponentially in the worst case if really the best solution is to be found (if you optimise the hyperparameters in GPs you’re in a similar situation, though in a continuous space).