Sum-Product Networks: A New Deep Architecture.

Poon, H. and Domingos, P.
in: Proceedings of the 27th conference on Uncertainty in Artificial Intelligence (UAI 2011), 2011

Abstract

The key limiting factor in graphical model inference and learning is the complexity of the partition function. We thus ask the question: what are general conditions under which the partition function is tractable? The answer leads to a new kind of deep architecture, which we call sum-product networks (SPNs). SPNs are directed acyclic graphs with variables as leaves, sums and products as internal nodes, and weighted edges. We show that if an SPN is complete and consistent it represents the partition function and all marginals of some graphical model, and give semantics to its nodes. Essentially all tractable graphical models can be cast as SPNs, but SPNs are also strictly more general. We then propose learning algorithms for SPNs, based on backpropagation and EM. Experiments show that inference and learning with SPNs can be both faster and more accurate than with standard deep networks. For example, SPNs perform image completion better than state-of-the-art deep networks for this task. SPNs also have intriguing potential connections to the architecture of the cortex.

Review

The authors present a new type of graphical model which is hierarchical (a rooted directed acyclic graph) and has a sum-product structure, i.e., the levels in the hierarchy alternately compute a sum or a product of their children. They call these models sum-product networks (SPNs). The authors define conditions under which SPNs efficiently represent joint probability distributions over the leaves of the graph, where efficient means that all marginals can be computed efficiently, i.e., inference in SPNs is easy. They argue that SPNs subsume essentially all previously known tractable graphical models while being more general.
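
To make the sum-product structure concrete, here is a minimal sketch of a tiny SPN over two binary variables. The structure, the weights and all names (`evaluate`, `leaf_dist`, `query`) are made up for illustration and are not taken from the paper. The sketch shows why marginals are cheap: an unobserved variable is summed out simply by setting both of its indicators to 1 before a single bottom-up pass.

```python
# Nodes are nested tuples: ("ind", var, value), ("prod", [children]),
# or ("sum", [(weight, child), ...]). Everything here is a toy assumption.

def evaluate(node, indicators):
    """Evaluate the network bottom-up for given indicator values."""
    kind = node[0]
    if kind == "ind":
        _, var, value = node
        return indicators[(var, value)]
    if kind == "prod":
        result = 1.0
        for child in node[1]:
            result *= evaluate(child, indicators)
        return result
    if kind == "sum":
        return sum(w * evaluate(child, indicators) for w, child in node[1])
    raise ValueError(f"unknown node type: {kind}")


def leaf_dist(var, p_one):
    """Sum node over the two indicators of one binary variable."""
    return ("sum", [(p_one, ("ind", var, 1)), (1.0 - p_one, ("ind", var, 0))])


# Root: a weighted sum of two products; each product has children over
# disjoint variables, and both sum children cover the same variables.
spn = ("sum", [
    (0.7, ("prod", [leaf_dist("x1", 0.9), leaf_dist("x2", 0.2)])),
    (0.3, ("prod", [leaf_dist("x1", 0.1), leaf_dist("x2", 0.6)])),
])


def query(evidence):
    """Probability of the evidence; variables not in `evidence` are summed out."""
    indicators = {}
    for var in ("x1", "x2"):
        for value in (0, 1):
            observed = var in evidence
            indicators[(var, value)] = 1.0 if (not observed or evidence[var] == value) else 0.0
    return evaluate(spn, indicators)


partition = query({})                 # all indicators set to 1
p_x1 = query({"x1": 1})               # marginal P(x1 = 1)
p_joint = query({"x1": 1, "x2": 1})   # joint P(x1 = 1, x2 = 1)
```

Each query costs one pass over the edges of the graph, regardless of how many variables are summed out.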

When inference is tractable in SPNs, so is learning. Learning here means updating the weights of the SPN, which can also be used to change its structure by pruning connections with zero weight after learning has converged. They suggest using either EM or gradient-based learning, but note that for large hierarchies (very deep networks) you’ll have a gradient diffusion problem, as is common in deep learning. To overcome this problem they use the maximum posterior estimator, which effectively updates only a single edge of a node instead of all edges in proportion to the (diffusing) gradient.

The authors introduce the properties of SPNs using only binary variables. Leaves of the SPNs then are indicators for values of these variables, i.e., there are 2 * (number of variables) leaves. It is straightforward to extend this to general discrete variables, where the potential number of leaves then rises to (number of values) * (number of variables). For continuous variables sum nodes become integral nodes (so you need distributions which you can easily integrate) and it is not so clear to me what the leaves then are. In general, I didn’t follow the technical details well and can hardly comment on potential problems. One question certainly is how you initialise your SPN structure before learning (it will matter whether you start with a product or sum level at the bottom of your hierarchy and where the leaves are positioned).

Anyway, this work introduces a promising new deep network architecture which combines a solid probabilistic interpretation with tractable exact computations. In particular, in comparison to previous models (deep belief networks and deep Boltzmann machines) this leads to a jump in performance in both computation time and inference results as shown in image completion experiments. I’m looking forward to seeing more about this.

Flexible vowel recognition by the generation of dynamic coherence in oscillator neural networks: speaker-independent vowel recognition.

Liu, F., Yamaguchi, Y., and Shimizu, H.
Biol Cybern, 71:105–114, 1994

Abstract

We propose a new model for speaker-independent vowel recognition which uses the flexibility of the dynamic linking that results from the synchronization of oscillating neural units. The system consists of an input layer and three neural layers, which are referred to as the A-, B- and C-centers. The input signals are a time series of linear prediction (LPC) spectrum envelopes of auditory signals. At each time-window within the series, the A-center receives input signals and extracts local peaks of the spectrum envelope, i.e., formants, and encodes them into local groups of independent oscillations. Speaker-independent vowel characteristics are embedded as a connection matrix in the B-center according to statistical data of Japanese vowels. The associative interaction in the B-center and reciprocal interaction between the A- and B-centers selectively activate a vowel as a global synchronized pattern over two centers. The C-center evaluates the synchronized activities among the three formant regions to give the selective output of the category among the five Japanese vowels. Thus, a flexible ability of dynamical linking among features is achieved over the three centers. The capability in the present system was investigated for speaker-independent recognition of Japanese vowels. The system demonstrated a remarkable ability for the recognition of vowels very similar to that of human listeners, including misleading vowels. In addition, it showed stable recognition for unsteady input signals and robustness against background noise. The optimum condition of the frequency of oscillation is discussed in comparison with stimulus-dependent synchronizations observed in neurophysiological experiments of the cortex.

Review

The authors present an oscillating recurrent neural network model for the recognition of Japanese vowels. The model consists of 4 layers: 1) an input layer which gives pre-processed frequency information, 2) an oscillatory hidden layer with local inhibition, 3) another oscillatory hidden layer with long-range inhibition and 4) a readout layer implementing the classification of vowels using a winner-takes-all mechanism. Layers 1-3 each contain 32 units, where each unit is associated with one input frequency. The output layer contains one unit for each of the 5 vowels and the readout mechanism is based on multiplication of weighted sums of layer 3 activities such that the output is also oscillatory. The oscillatory units in layers 2 and 3 consist of an excitatory element coupled with an inhibitory element which oscillate, or become silent, depending on the input. The long-range connections in layer 3 are determined manually based on known correlations between formants (characteristic frequencies) of the individual vowels.

In experiments the authors show that the classification of their network is robust across different speakers (14 men, 5 women, 5 girls, 5 boys): only 6 out of 145 trials were misclassified. However, they do not report what exactly their criterion for classification performance was (remember that the output was oscillatory; also, in the examples shown, alternative vowels sometimes show bumps during the time course of a vowel). They also report robustness to imperfect stimuli (formants varying within a vowel) and noise (superposition of 12 different conversations), but only single examples are shown.

Without being able to tell what the state of the art in neural networks in 1994 was, I guess the main contribution of the paper is that it shows that vowel recognition may be robustly implemented using oscillatory networks. At least from today’s perspective the suggested network is a bad solution to the technical problem of vowel recognition, but even alternative algorithms at the time were probably better at that (there’s a hint in one of the paragraphs in the discussion). The paper is a good example of what was wrong with neural network research at the time: the models give the feeling that they are pretty arbitrary. Are the units in the network only defined and connected like they are because these were the parameters that worked? Most probably. At least here the connectivity is partly determined through some knowledge of how frequencies produced by vowels relate, but many other parameters appear to be chosen arbitrarily. Respect to the person who made it work. However, the results section is rather weak. They only tested one example of a spoken vowel per person and they don’t define classification performance clearly. I guess you could argue that it is a proof-of-concept of a possible biological implementation, but then again it is still unclear how this can be properly related to real networks in the brain.

Tuning properties of the auditory frequency-shift detectors.

Demany, L., Pressnitzer, D., and Semal, C.
J Acoust Soc Am, 126:1342–1348, 2009

Abstract

Demany and Ramos [(2005). J. Acoust. Soc. Am. 117, 833-841] found that it is possible to hear an upward or downward pitch change between two successive pure tones differing in frequency even when the first tone is informationally masked by other tones, preventing a conscious perception of its pitch. This provides evidence for the existence of automatic frequency-shift detectors (FSDs) in the auditory system. The present study was intended to estimate the magnitude of the frequency shifts optimally detected by the FSDs. Listeners were presented with sound sequences consisting of (1) a 300-ms or 100-ms random “chord” of synchronous pure tones, separated by constant intervals of either 650 cents or 1000 cents; (2) an interstimulus interval (ISI) varying from 100 to 900 ms; (3) a single pure tone at a variable frequency distance (Delta) from a randomly selected component of the chord. The task was to indicate if the final pure tone was higher or lower than the nearest component of the chord. Irrespective of the chord’s properties and of the ISI, performance was best when Delta was equal to about 120 cents (1/10 octave). Therefore, this interval seems to be the frequency shift optimally detected by the FSDs.

Review

If you present 5 tones simultaneously, people cannot tell whether a subsequently presented tone was one of the 5 tones, or lay in the middle between any 2 of the 5 tones. On the other hand, people can judge whether a subsequently presented tone lay above or below any one of the 5 tones. This paper investigates the dependency of this effect on how much the subsequent tone lay above or below one of the 5 (here actually 6) tones (frequency shift), on how much the 6 tones were separated (Iv) and on the interstimulus interval (ISI) between the first set of tones and the subsequent tone. The authors replicated the mentioned previous findings and presented data suggesting that there is an optimal frequency shift at which subjects performed best in the task. They argue that this is at roughly 120 cents.
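
For readers not used to thinking in cents, here is a small sketch of the conversion between cents and frequency ratios (an octave is 1200 cents). The function names and the 1000 Hz reference tone are hypothetical and only serve as an illustration.

```python
import math


def cents_to_ratio(cents):
    """Frequency ratio corresponding to an interval given in cents."""
    return 2.0 ** (cents / 1200.0)


def ratio_to_cents(ratio):
    """Interval in cents corresponding to a frequency ratio."""
    return 1200.0 * math.log2(ratio)


# Hypothetical reference tone; not a value from the paper.
base_hz = 1000.0
shifted_hz = base_hz * cents_to_ratio(120)   # tone 120 cents above the reference
```

A shift of 120 cents thus corresponds to raising the frequency by roughly 7%, a bit more than a semitone (100 cents).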

I have several remarks about the analysis. First of all, the number of subjects in the two experiments is very low (7 and 4, each including the first author). While in experiment 1 the curves of d-prime over subjects look relatively consistent, this is not the case for larger ISIs in experiment 2. The main flaw of the analysis is that their suggestion of an optimal frequency shift of 120 cents is based on curve fitting of an exponential function to 4, 5, or 6 data points, where they also add an artificial baseline data point at d-prime=0 for frequency shift=0. The data point as such makes sense, as the judgement of a subject whether the shift was up or down must be random when the true shift was actually 0. Still, it feels wrong to include an artificial data point in the analysis. In the end, especially for large ISIs, the thus estimated optimal frequency shifts vary so much between individual subjects that it seems pointless to conclude anything about the mean over (4) subjects.
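
As a reminder of what d-prime measures, here is a sketch using the standard signal detection definition for an up/down judgement. How exactly the paper computed d-prime is not covered in this review, so the definition and the names used here (`d_prime`, `p_up_given_up`, `p_up_given_down`) are assumptions.

```python
from statistics import NormalDist


def d_prime(p_up_given_up, p_up_given_down):
    """Sensitivity index: z(hit rate) - z(false alarm rate).

    A "hit" is answering "up" when the shift was up, a "false alarm" is
    answering "up" when the shift was down. Rates must lie strictly
    between 0 and 1.
    """
    z = NormalDist().inv_cdf
    return z(p_up_given_up) - z(p_up_given_down)


chance = d_prime(0.5, 0.5)      # random guessing, as expected for a shift of 0
good = d_prime(0.84, 0.16)      # clearly discriminable shift
```

Random guessing gives d-prime = 0, which is the rationale for the artificial baseline point at frequency shift = 0.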

Sam actually tried to replicate the original finding on which this paper is based. He commented that it was hard to replicate it in a large group of subjects and he found differences between musicians and non-musicians (which shouldn’t be true for something that belongs to really basic hearing abilities). He also noted that subjects were generally quite bad in this task and that he found it to be impossible to make the task easier, when one wants to maintain that the 6 initial tones cannot be perceived individually.

The authors of the paper seem to repeatedly use subjects who perform particularly well in these tasks in their experiments.

It was noted in the group meeting that this research could be linked better to, e.g., the mismatch negativity literature, which is also concerned with the detection of deviations. Sam pointed to the publication containing the original findings in response.

Category-specific versus category-general semantic impairment induced by transcranial magnetic stimulation.

Pobric, G., Jefferies, E., and Ralph, M. A. L.
Curr Biol, 20:964–968, 2010

Abstract

Semantic cognition permits us to bring meaning to our verbal and nonverbal experiences and to generate context- and time-appropriate behavior. It is core to language and nonverbal skilled behaviors and, when impaired after brain damage, it generates significant disability. A fundamental neuroscience question is, therefore, how does the brain code and generate semantic cognition? Historical and some contemporary theories emphasize that conceptualization stems from the joint action of modality-specific association cortices (the “distributed” theory) reflecting our accumulated verbal, motor, and sensory experiences. Parallel studies of semantic dementia, rTMS in normal participants, and neuroimaging indicate that the anterior temporal lobe (ATL) plays a crucial and necessary role in conceptualization by merging experience into an amodal semantic representation. Some contemporary computational models suggest that concepts reflect a hub-and-spoke combination of information–modality-specific association areas support sensory, verbal, and motor sources (the spokes) while anterior temporal lobes act as an amodal hub. We demonstrate novel and striking evidence in favor of this hypothesis by applying rTMS to normal participants: ATL stimulation generates a category-general impairment whereas IPL stimulation induces a category-specific deficit for man-made objects, reflecting the coding of praxis in this neural region.

Review

This is a short TMS experiment investigating the role of the left anterior temporal lobe (ATL) in semantic processing of stimuli. Semantics here is practically defined as the association to a high-level category defining an object. The task was simply to name the object shown on a picture. Involvement of ATL in this task is indicated by patients with semantic dementia who forget the meaning of categories/objects, i.e., they cannot associate a perceived object with its category/class (example: they see a sheep and don’t know what it is – do they still know what a sheep is, if you tell them that it is a sheep?).

The experiment is supposed to differentiate between 3 hypotheses: 1) object meaning results from a distributed representation of a stimulus between all modalities, 2) object meaning is only generated in ATL, other areas provide only sensory input and 3) part of the object meaning is generated already in single modal areas and ATL acts as an amodal integration hub. These hypotheses are only verbally described and indeed it seems difficult to differentiate between 2) and 3).

The experiment shows that 10 min of repetitive TMS can increase response times of subjects in the picture naming task, but not in a number reading task, if TMS was applied to the left ATL. In a post-hoc analysis the authors then divided the shown pictures into living-nonliving and low-high manipulable objects and again looked for interactions with TMS stimulation. They found that stimulation of left IPL, an area associated with manipulable objects, had an effect on nonliving and high-manipulable objects while having no effect on the others. Stimulation of ATL, however, had a (smaller) effect on all categories. Furthermore, stimulation in occipital lobe had no effect with respect to task or stimulus at all. The authors conclude that this is evidence for hypothesis 3) above.

A major concern with the study is that the main result has been obtained with a post-hoc analysis and the authors did not even specify more precisely which pictures they used in this analysis, e.g., we don’t know which objects were among them. Furthermore, the results do not really allow any conclusions about the connectivity of the different regions. Hypotheses 2) and 3) cannot be distinguished with the given results. Even hypothesis 1) could still be true, if one assumes that ATL is a region mainly for producing verbal output of a category – something necessary for the task, but not necessarily involved in associating with a category. However, Katharina mentioned that ATL was also implicated in experiments with other output modalities (e.g. drawing). So, what remains, if one believes the post-hoc analysis, is that TMS on ATL disrupts picture naming in general while TMS on IPL disrupts picture naming selectively for nonliving, high-manipulable objects. We cannot rule out any of the hypotheses above completely.

Internal models and the construction of time: generalizing from state estimation to trajectory estimation to address temporal features of perception, including temporal illusions.

Grush, R.
Journal of Neural Engineering, 2:S209, 2005

Abstract

The question of whether time is its own best representation is explored. Though there is theoretical debate between proponents of internal models and embedded cognition proponents (e.g. Brooks R 1991 Artificial Intelligence 47 139—59) concerning whether the world is its own best model, proponents of internal models are often content to let time be its own best representation. This happens via the time update of the model that simply allows the model’s state to evolve along with the state of the modeled domain. I argue that this is neither necessary nor advisable. I show that this is not necessary by describing how internal modeling approaches can be generalized to schemes that explicitly represent time by maintaining trajectory estimates rather than state estimates. Though there are a variety of ways this could be done, I illustrate the proposal with a scheme that combines filtering, smoothing and prediction to maintain an estimate of the modeled domain’s trajectory over time. I show that letting time be its own representation is not advisable by showing how trajectory estimation schemes can provide accounts of temporal illusions, such as apparent motion, that pose serious difficulties for any scheme that lets time be its own representation.

Review

The author argues based on temporal illusions that perceptual states correspond to smoothed trajectories where smoothing is meant as in the context of a Kalman smoother. In particular, temporal illusions such as the flash-lag effect and the cutaneous rabbit show that stimuli later in time can influence the perception of earlier stimuli. However, it seems that this is only the case for temporally very close stimuli (within 100ms). Thus, Grush suggests that stimuli are internally represented as trajectories including past and future states. However, the representation of the past states in the trajectory is also updated when new sensory evidence is collected (the observations, or rather the states, are smoothed). This idea has actually already been suggested by Rao, Eagleman and Sejnowski (2001) as stated by the author, but here he additionally postulates that also some of the future states are represented in the trajectory to account for apparent motion effects (where a motion is continued in the head when the stimulus disappears).
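
The difference between filtering and smoothing can be illustrated with a small sketch. It assumes a 1-D random-walk state observed with Gaussian noise, which is my own toy choice and not Grush's model; the function names and noise values are likewise hypothetical. The forward pass is a Kalman filter, the backward pass a Rauch-Tung-Striebel smoother that revises estimates of past states once later observations are in.

```python
def kalman_filter(obs, q, r, m0, p0):
    """Forward pass for x_t = x_{t-1} + noise(var q), y_t = x_t + noise(var r)."""
    means, variances, pred_means, pred_vars = [], [], [], []
    m, p = m0, p0
    for y in obs:
        m_pred, p_pred = m, p + q            # predict (identity transition)
        k = p_pred / (p_pred + r)            # Kalman gain
        m = m_pred + k * (y - m_pred)        # update with the new observation
        p = (1.0 - k) * p_pred
        pred_means.append(m_pred)
        pred_vars.append(p_pred)
        means.append(m)
        variances.append(p)
    return means, variances, pred_means, pred_vars


def rts_smoother(means, variances, pred_means, pred_vars):
    """Backward pass: revise past state estimates using later observations."""
    sm, sv = list(means), list(variances)
    for t in range(len(means) - 2, -1, -1):
        g = variances[t] / pred_vars[t + 1]  # smoother gain
        sm[t] = means[t] + g * (sm[t + 1] - pred_means[t + 1])
        sv[t] = variances[t] + g * g * (sv[t + 1] - pred_vars[t + 1])
    return sm, sv


# A stimulus that jumps from 0 to 1 halfway through the sequence.
observations = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
f_mean, f_var, p_mean, p_var = kalman_filter(observations, q=0.1, r=0.5, m0=0.0, p0=1.0)
s_mean, s_var = rts_smoother(f_mean, f_var, p_mean, p_var)
```

The filtered estimate just before the jump (`f_mean[2]`) only knows about the zeros, while the smoothed estimate `s_mean[2]` is pulled towards the later observations, which is the kind of retroactive influence the temporal illusions suggest.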

It’s an interesting account of temporal aspects of perception, but note that he develops things for the perceptual level, which does not necessarily let us draw conclusions for processing on the sensory level. Also, his discussion on whether Rao et al’s account of a fixed-lag smoother can be true is interesting, though he didn’t entirely convince me that fixed-lag perception is not what is happening in the brain. Wouldn’t instantaneous updating of the perceptual trajectory mean that at some point our perception changes? Yet during the illusions people report coherent motion. Ok, it could be that we just don’t “remember” our previous perception after it’s updated, but it still sounds counterintuitive. I don’t think that the apparent motion illusions are a good argument for representing future states, because other mechanisms could be responsible for that.

Bayesian estimation of dynamical systems: an application to fMRI.

Friston, K. J.
Neuroimage, 16:513–530, 2002

Abstract

This paper presents a method for estimating the conditional or posterior distribution of the parameters of deterministic dynamical systems. The procedure conforms to an EM implementation of a Gauss-Newton search for the maximum of the conditional or posterior density. The inclusion of priors in the estimation procedure ensures robust and rapid convergence and the resulting conditional densities enable Bayesian inference about the model parameters. The method is demonstrated using an input-state-output model of the hemodynamic coupling between experimentally designed causes or factors in fMRI studies and the ensuing BOLD response. This example represents a generalization of current fMRI analysis models that accommodates nonlinearities and in which the parameters have an explicit physical interpretation. Second, the approach extends classical inference, based on the likelihood of the data given a null hypothesis about the parameters, to more plausible inferences about the parameters of the model given the data. This inference provides for confidence intervals based on the conditional density.

Review

I presented the algorithm which underlies various forms of dynamic causal modeling and which we use to estimate RNN parameters. At the core of it is an iterative computation of the posterior of the parameters of a dynamical model based on a first-order Taylor series approximation of a meta-function mapping parameter values to observations, i.e., the dynamical system is hidden in this function such that the probabilistic model does not have to care about it. This is possible because the dynamics is assumed to be deterministic and noise only contributes at the level of observations. It can be shown that the resulting update equations for the posterior mode are equivalent to a Gauss-Newton optimisation of the log-joint probability of observations and parameters (this is MAP estimation of the parameters). Consequently, the rate of convergence of the posterior may be up to quadratic, but the algorithm is not guaranteed to increase the likelihood at every step or to converge at all. It should work well close to an optimum (when observations are well fitted), or if the dynamics is close to linear with respect to parameters. Because the dynamical system is integrated numerically to get observation predictions and the Jacobian of the observations with respect to parameters is also obtained numerically, this algorithm may be very slow.
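
The core update can be sketched for a toy problem. The following assumes a single scalar parameter, the decay rate theta of dx/dt = -theta * x, with a Gaussian prior and Gaussian observation noise; the system, the names and all numbers are my own illustration and not the hemodynamic model of the paper. With one parameter the matrix expressions of the Gauss-Newton step reduce to scalars.

```python
def simulate(theta, x0=1.0, dt=0.01, n_steps=200, every=20):
    """Meta-function: Euler-integrate dx/dt = -theta * x, return sampled states."""
    x, observations = x0, []
    for step in range(1, n_steps + 1):
        x += dt * (-theta * x)
        if step % every == 0:
            observations.append(x)
    return observations


def gauss_newton_map(y, prior_mean, prior_var, noise_var, n_iter=20, eps=1e-5):
    """Iterate the linearised posterior update, starting at the prior mean."""
    theta = prior_mean
    for _ in range(n_iter):
        pred = simulate(theta)
        # Numerical Jacobian of the predictions with respect to theta.
        shifted = simulate(theta + eps)
        jac = [(b - a) / eps for a, b in zip(pred, shifted)]
        # Posterior precision under the first-order Taylor approximation.
        precision = sum(j * j for j in jac) / noise_var + 1.0 / prior_var
        # Gradient of the log-joint: data term plus prior term.
        gradient = (sum(j * (yi - pi) for j, yi, pi in zip(jac, y, pred)) / noise_var
                    + (prior_mean - theta) / prior_var)
        theta += gradient / precision
    # The variance is taken from the last linearisation point.
    return theta, 1.0 / precision


y = simulate(1.5)   # noise-free toy data generated with theta = 1.5
theta_map, post_var = gauss_newton_map(y, prior_mean=1.0, prior_var=1.0, noise_var=1e-4)
```

Note that every iteration needs two numerical integrations of the system here, and one more per additional parameter in the general case, which is where the slowness comes from.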

This algorithm is described in Friston2002 embedded into an application to fMRI. I did not present the specifics of this application and, particularly, ignored the influence of the there defined inputs u. The derivation of the parameter posterior described above is embedded in an EM algorithm for hyperparameters on the covariance of observations. I will discuss this in a future session.

Spike-Based Population Coding and Working Memory.

Boerlin, M. and Denève, S.
PLoS Comput Biol, 7:e1001080, 2011

Abstract

Compelling behavioral evidence suggests that humans can make optimal decisions despite the uncertainty inherent in perceptual or motor tasks. A key question in neuroscience is how populations of spiking neurons can implement such probabilistic computations. In this article, we develop a comprehensive framework for optimal, spike-based sensory integration and working memory in a dynamic environment. We propose that probability distributions are inferred spike-per-spike in recurrently connected networks of integrate-and-fire neurons. As a result, these networks can combine sensory cues optimally, track the state of a time-varying stimulus and memorize accumulated evidence over periods much longer than the time constant of single neurons. Importantly, we propose that population responses and persistent working memory states represent entire probability distributions and not only single stimulus values. These memories are reflected by sustained, asynchronous patterns of activity which make relevant information available to downstream neurons within their short time window of integration. Model neurons act as predictive encoders, only firing spikes which account for new information that has not yet been signaled. Thus, spike times signal deterministically a prediction error, contrary to rate codes in which spike times are considered to be random samples of an underlying firing rate. As a consequence of this coding scheme, a multitude of spike patterns can reliably encode the same information. This results in weakly correlated, Poisson-like spike trains that are sensitive to initial conditions but robust to even high levels of external neural noise. This spike train variability reproduces the one observed in cortical sensory spike trains, but cannot be equated to noise. On the contrary, it is a consequence of optimal spike-based inference. In contrast, we show that rate-based models perform poorly when implemented with stochastically spiking neurons.

Author Summary

Most of our daily actions are subject to uncertainty. Behavioral studies have confirmed that humans handle this uncertainty in a statistically optimal manner. A key question then is what neural mechanisms underlie this optimality, i.e. how can neurons represent and compute with probability distributions. Previous approaches have proposed that probabilities are encoded in the firing rates of neural populations. However, such rate codes appear poorly suited to understand perception in a constantly changing environment. In particular, it is unclear how probabilistic computations could be implemented by biologically plausible spiking neurons. Here, we propose a network of spiking neurons that can optimally combine uncertain information from different sensory modalities and keep this information available for a long time. This implies that neural memories not only represent the most likely value of a stimulus but rather a whole probability distribution over it. Furthermore, our model suggests that each spike conveys new, essential information. Consequently, the observed variability of neural responses cannot simply be understood as noise but rather as a necessary consequence of optimal sensory integration. Our results therefore question strongly held beliefs about the nature of neural “signal” and “noise”.

Review

[note: I here often write posterior, but mean log-posterior as this is what the authors mostly compute with]

Boerlin and Deneve present a recurrent spiking neural network which integrates dynamically changing stimuli from different modalities, allows for simple readout of the complete posterior distribution, predicts state dynamics and, therefore, may act as a working memory when a stimulus is absent. Interestingly, spikes in the recurrent neural network (RNN) are generated deterministically, but from an outside perspective interspike intervals of individual neurons appear to follow a Poisson distribution as measured experimentally. How is all this achieved and what are the limitations?

The experimental setup is as follows: There is a ONE-dimensional, noisy, dynamic variable in the world (state from here on) which we want to track through time. However, observations are only made through noisy spike trains from different sensory modalities where the conditional probability of a spike given a particular state is modelled as a Poisson distribution (actually exponential family distr. but in the experiments they use a Poisson). The RNN receives these spikes as input and the question then is how we have to set up the dynamics of each neuron in the RNN such that a simple integrator can read out the posterior distribution of the state from RNN activities.

The main trick of the paper is to find an approximation of the true (log-)posterior L which in turn may be approximated using the readout posterior G under the assumption that the two are good approximations of each other. You recognise the circularity in this statement. This is resolved by using a spiking mechanism which ensures that the two are indeed close to each other which in turn ensures that the true posterior L is approximated. The rest is deriving formulae and substituting them in each other until you get a formula describing the (dynamics of the) membrane potential of a single neuron in the RNN which only depends on sensory and RNN spikes, the tuning curves or gains of the associated neurons, rate constants of the network (called leaks here) and (true) parameters of the state dynamics.

The approximations used for the (log-)posterior are a Taylor expansion of 2nd order, a subsequent Taylor expansion of 1st order and a discretisation of the posterior according to the preferred state of each RNN neuron. However, the most critical assumption for the derivation of the results is that the dynamics is 1st order Markovian and linear. In particular, they assume a state dynamics which has a constant drift and a Wiener process diffusion. In the last paragraph of the discussion they mention that it is straightforward to extend the model to state dependent drift, but I don’t follow how this could be done, because their derivation of L crucially depends on the observation that p(x_t|x_{t-dt}) = p(x_t – x_{t-dt}) which is only true for state-independent drift.
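
To see what the network is supposed to approximate, here is a sketch of the exact log-posterior on a grid for this kind of setup: a 1-D state with constant drift and Wiener diffusion, observed through Poisson-spiking sensory neurons. This is a plain grid filter, NOT the authors' spiking network, and the Gaussian tuning curves, grid, function names and all parameter values are assumptions of mine.

```python
import math
import random

GRID = [i * 0.1 for i in range(-50, 51)]     # discretised state values
CENTERS = [-4.0 + i for i in range(9)]       # preferred states of sensory neurons


def rate(center, x, gain=20.0, width=1.0, base=1.0):
    """Hypothetical Gaussian tuning curve (spikes per second)."""
    return base + gain * math.exp(-0.5 * ((x - center) / width) ** 2)


def normalise(log_p):
    m = max(log_p)
    z = m + math.log(sum(math.exp(v - m) for v in log_p))
    return [v - z for v in log_p]


def predict(log_p, drift, sigma, dt):
    """Propagate through x_t = x_{t-dt} + drift*dt + sigma*sqrt(dt)*noise."""
    var = sigma ** 2 * dt
    p = [math.exp(v) for v in log_p]
    new = []
    for x in GRID:
        total = sum(pj * math.exp(-0.5 * (x - xj - drift * dt) ** 2 / var)
                    for pj, xj in zip(p, GRID))
        new.append(math.log(total + 1e-300))  # guard against log(0)
    return normalise(new)


def update(log_p, spikes, dt):
    """Poisson likelihood: a spike of neuron i adds log f_i(x), silence costs f_i(x)*dt."""
    new = []
    for v, x in zip(log_p, GRID):
        for c, s in zip(CENTERS, spikes):
            f = rate(c, x)
            v += s * math.log(f) - f * dt
        new.append(v)
    return normalise(new)


def simulate(n_steps=50, drift=2.0, sigma=1.0, dt=0.01, seed=0):
    """Track a simulated state; returns the final true state and posterior mean."""
    rng = random.Random(seed)
    x = -1.0
    log_p = normalise([0.0] * len(GRID))      # flat prior
    for _ in range(n_steps):
        x += drift * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        # Small-dt approximation: at most one spike per neuron and time step.
        spikes = [1 if rng.random() < rate(c, x) * dt else 0 for c in CENTERS]
        log_p = update(predict(log_p, drift, sigma, dt), spikes, dt)
    mean = sum(math.exp(v) * g for v, g in zip(log_p, GRID))
    return x, mean
```

The prediction step relies on the transition density depending only on the difference x_t – x_{t-dt}, which makes it a shifted convolution. With state-dependent drift the shift would differ for every grid point, which is exactly the case I don’t see how to handle in their derivation.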

The resulting membrane potential has a form corresponding to a leaky integrate and fire neuron. The authors differentiate between 4 parts: a leakage current, feed-forward input from sensory neurons (containing a bias term which, I think, is wrong in Materials and Methods but which is also not used in the experiments), instantaneous recurrent input from the RNN and slow recurrent currents from the RNN which are responsible for keeping up a memory of the approximated posterior past the time constant of the neuron. The slow currents are defined by two separate differential equations and I wonder where these are implemented in the neuron, if it already has a membrane potential associated with it to which the slow currents contribute. Also interesting to note is that all terms except for the leakage current are modulated by the RNN spike gains (Gamma) defining which effect a spike of neuron i has on the readout of the approximate posterior at the preferred state of neuron j. This includes the feed-forward input and means that feed-forward connection weights are determined by a linear combination of posterior gains (Gamma) and gains defined by the conditional probability of sensory spikes given the state (H). This means that the feed-forward weights are tuned to also take the effect of an input spike on the readout into account?

Anyway, the resulting spiking mechanism makes neurons spike whenever they improve the readout of the posterior from the RNN. The authors interpret this as a prediction error signal: a spike indicates that the posterior represented by the RNN deviated from the true (approximated) posterior. I guess we can call this prediction, because the readout/posterior has dynamics. But note that it is hard to interpret individual spikes with respect to prediction errors of the input spike train (something not desired anyway?). Also, the authors show that this representation is highly redundant. There always exist alternative spike trains of the RNN which represent the same posterior. This results in the demonstrated robustness and apparent randomness of the coding scheme. However, it also makes it impossible to interpret what it means when a neuron is silent. Nevertheless, neurons still exhibit characteristic tuning curves on average.

Notice that they do not assume a distributional form of the posterior and indeed they show that the network can represent a bimodal posterior, too.

In summary, the work at hand impressively combines many important aspects of recognising dynamic stimuli in a spike-based framework. Probably the most surprising property of the suggested neural network is that it produces spikes deterministically in order to optimise a global criterion, albeit with a local spiking rule. However, the authors have to make important assumptions to arrive at these results. In particular, they need constant drift dynamics for their derivations. Moreover, the “local” spiking rule turns out to use some global information: the weights of input and recurrently connected neurons in the membrane potential dynamics of an RNN neuron are determined from the gains for the readout of every neuron in the network, i.e., each neuron needs to know what a spike of every other neuron contributes to the posterior. I wonder what a corresponding learning rule would look like. Additionally, they need to assume that the RNN is fully connected, i.e., that every neuron which contributes to the posterior sends messages (spikes) to all other neurons contributing to the posterior. The authors also do not explain how the suggested slow recurrent currents are represented in a spiking neuron. After all, these currents seem to have dynamics independent of the membrane potential of the neuron, yet they implement the dynamics of the posterior and are, therefore, absolutely central for predicting the development of the posterior over time. Finally, we have to keep in mind that the population of neurons coded for a discretisation of the posterior of a one-dimensional variable. With increasing dimensionality you’ll therefore need an exponentially increasing number of neurons to represent the posterior, and all of them will have to be connected.

Recurrent excitation in neocortical circuits.

Douglas, R. J., Koch, C., Mahowald, M., Martin, K. A., and Suarez, H. H.
Science, 269:981–985, 1995
DOI, Google Scholar

Abstract

The majority of synapses in the mammalian cortex originate from cortical neurons. Indeed, the largest input to cortical cells comes from neighboring excitatory cells. However, most models of cortical development and processing do not reflect the anatomy and physiology of feedback excitation and are restricted to serial feedforward excitation. This report describes how populations of neurons in cat visual cortex can use excitatory feedback, characterized as an effective “network conductance”, to amplify their feedforward input signals and demonstrates how neuronal discharge can be kept proportional to stimulus strength despite strong, recurrent connections that threaten to cause runaway excitation. These principles are incorporated into models of cortical direction and orientation selectivity that emphasize the basic design principles of cortical architectures.

Review

The paper suggests that the functional role of recurrent excitatory connections is to amplify (increase gain between inputs and outputs) and denoise inputs to a (sensory) cortical area. This would allow these input signals to be relatively small and would, therefore, help to save energy (they don’t make this argument explicitly).

The work is motivated by an estimate of the number of recurrent connections directly made between spiny stellate cells of layer IV in the cat visual cortex. The authors conclude that these connections alone can already “provide a significant source of recurrent excitation”.

First, they consider an electronic circuit analogy describing the feed-forward input and recurrent currents acting on a neuron in the network. They look at the influence of the recurrent conductance (which can be seen as the connectivity strength between all recurrently connected neurons) on the stability of the network and suggest that inhibitory neurons keep the network stable when the recurrent conductance is too high and would on its own lead to divergence of network activities. They also implemented a model recurrent network consisting of excitatory and inhibitory spiking neurons and showed that it can implement direction selectivity of V1 simple cells. Interestingly, direction selectivity is based on asymmetric timing of excitatory and inhibitory inputs from the LGN (“in the preferred direction excitation precedes inhibition”), which they support with two references.
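
The amplification and stability argument can be made concrete with a single linear rate unit. This is a minimal sketch under my own assumptions, not the authors' circuit model: the dynamics tau * dr/dt = -r + (w_exc - w_inh) * r + input, the linear inhibition term, and all parameter names are hypothetical. For w_exc - w_inh < 1 the steady state is input / (1 - w_exc + w_inh), so the output stays proportional to the input while being amplified.

```python
def steady_state_rate(feedforward, w_exc, w_inh=0.0,
                      tau=0.01, dt=0.001, steps=2000):
    """Euler-integrate tau * dr/dt = -r + (w_exc - w_inh) * r + feedforward.

    All parameters are illustrative assumptions, not values from the paper.
    """
    rate = 0.0
    for _ in range(steps):
        recurrent = (w_exc - w_inh) * rate
        rate += dt / tau * (-rate + recurrent + feedforward)
    return rate


# Recurrent excitation below 1 amplifies the input by 1 / (1 - 0.8) = 5.
print(steady_state_rate(1.0, w_exc=0.8))
# Excitation above 1 runs away on its own ...
print(steady_state_rate(1.0, w_exc=1.2))
# ... but is stable again once inhibition lowers the effective feedback.
print(steady_state_rate(1.0, w_exc=1.2, w_inh=0.4))
```

The third call has the same effective feedback as the first, which is the sense in which inhibition can keep a strongly recurrent network in the amplifying regime.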

I find it hard to believe that cortical recurrent networks apparently don’t do any computations on their own except for improving the incoming signal. It means that all computations are actually done in the feed-forward connections between areas; the excitation-inhibition asynchrony is an example of this. But then, if you assume a hierarchy of similar processing units, where does, e.g., the necessary excitation-inhibition asynchrony come from? Well, potentially there are readout neurons outside of the recurrently connected network which do exactly that. Then again, the whole processing in the brain would be feed-forward, and the only intrinsically dynamic units would just amplify the feed-forward signals. Reservoir computing could be seen as an extension of this in which the dynamics of the recurrent neurons are allowed to be more sophisticated, but become uninterpretable in turn. Still, the presented model is consistent, as far as I can tell, with the idea that the activity in response to a stimulus represents the posterior while activity at rest represents the prior over the variables represented by the network under consideration.

Note that the authors do not have any direct experimental evidence for their model in terms of simultaneous recordings of neurons in the same network. They only compare two summary statistics based on individual cells, for the second of which I don’t understand the experiment.

Recurrent neuronal circuits in the neocortex.

Douglas, R. J. and Martin, K. A. C.
Curr Biol, 17:R496–R500, 2007
DOI, Google Scholar

Abstract

In this Primer, we shall describe one interesting property of neocortical circuits – recurrent connectivity – and suggest what its computational significance might be.

Review

First, they use data on the distribution of synapses in cat visual cortex to argue that the predominant drive of activity in a cortical area comes from recurrent connections within this area. They then suggest that the reason for this is the ability to enhance and denoise incoming signals through suitable recurrent connections. They show corresponding functional behaviour in a model based on linear threshold neurons (LTNs). They do not use sigmoid activation functions because neurons apparently only rarely operate at their maximum firing rate, so a saturating nonlinearity is not necessary. To maintain stability they instead use a global inhibitory unit. I guess you could equivalently use a suitable sigmoid function. Finally, they suggest that top-down connections may bias the activity in the recurrent network such that one of a few alternative inputs may be selected based on, e.g., attention.
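
To illustrate how linear threshold units with global inhibition can enhance differences between inputs, here is a minimal sketch. It is not the model from the primer: the self-excitation `alpha`, the inhibition strength `beta`, and the shortcut of replacing the inhibitory unit by an instantaneous subtraction of the summed activity are all my assumptions. While all units stay active, the difference between two inputs is amplified by 1 / (1 - alpha), while their common part is scaled by only 1 / (1 - alpha + N * beta) for N units.

```python
def ltn_steady_state(inputs, alpha=0.5, beta=0.3, dt=0.1, steps=2000):
    """Relax a small network of linear threshold neurons to its steady state.

    Each unit follows dx/dt = -x + max(0, input + alpha * x - beta * sum(x)).
    alpha, beta, dt and steps are illustrative assumptions.
    """
    rates = [0.0] * len(inputs)
    for _ in range(steps):
        total = sum(rates)  # stands in for the global inhibitory unit
        drive = [max(0.0, inp + alpha * r - beta * total)
                 for inp, r in zip(inputs, rates)]
        rates = [r + dt * (d - r) for r, d in zip(rates, drive)]
    return rates


inputs = [1.0, 0.8]
rates = ltn_steady_state(inputs)
# The ratio between the stronger and the weaker response exceeds the input ratio.
print(rates, rates[0] / rates[1], inputs[0] / inputs[1])
```

The rectification at zero is what bounds the activity from below here. Stability comes from the inhibition, not from an upper saturation as in a sigmoid.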

So here the functional role of the recurrent neural network is merely to increase the signal-to-noise ratio. It’s a bit strange to me that actually no computation is done. Does that mean that all the computation from sensory signals to hidden states is done by the projections from lower-level to higher-level areas? This seems to be consistent with the reservoir computing idea, where the reservoir can also be seen as enhancing the representation of the input (by stretching its effects in time). The only difference is that the dynamics and function of reservoirs are more involved.

The ideas presented here are almost the same as already proposed by the first author in 1995 (see Douglas1995).

Spatiotemporal representations in the olfactory system.

Schaefer, A. T. and Margrie, T. W.
Trends Neurosci, 30:92–100, 2007
DOI, Google Scholar

Abstract

A complete understanding of the mechanisms underlying any kind of sensory, motor or cognitive task requires analysis from the systems to the cellular level. In olfaction, new behavioural evidence in rodents has provided temporal limits on neural processing times that correspond to less than 150ms–the timescale of a single sniff. Recent in vivo data from the olfactory bulb indicate that, within each sniff, odour representation is not only spatially organized, but also temporally structured by odour-specific patterns of onset latencies. Thus, we propose that the spatial representation of odour is not a static one, but rather evolves across a sniff, whereby for difficult discriminations of similar odours, it is necessary for the olfactory system to “wait” for later-activated components. Based on such evidence, we have devised a working model to assess further the relevance of such spatiotemporal processes in odour representation.

Review

They review evidence for temporal coding of odours in the olfactory bulb (and olfactory receptor neurons). The main finding is that, with increasing intensity of an odour, the corresponding neurons fire more action potentials in a given time. However, this is achieved by an earlier onset of firing, while inter-spike intervals stay roughly equal. The authors argue that this is a fast temporal code that can be used to discriminate odours. In particular, they suggest that this can explain why very different odours can be discriminated faster. The assumption is that these differ mainly in high-intensity, i.e., fast, subodours, while similar odours differ mainly in low-intensity, i.e., slow, subodours. But could it not be that similar odours differ only slightly in high-intensity subodours? My intuition says that the decision boundary is determined more by considerations of uncertainty than by a temporal code of high and low intensity.
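
The relation between onset latency and spike count is simple arithmetic, sketched below. The 150 ms window is taken from the sniff timescale mentioned in the abstract. The fixed 20 ms inter-spike interval and the two onset latencies are made-up numbers for illustration only.

```python
def spike_times(onset_ms, isi_ms=20.0, window_ms=150.0):
    """Spike times within one sniff for a neuron firing at a fixed interval.

    onset_ms, isi_ms and window_ms are illustrative assumptions.
    """
    times = []
    t = onset_ms
    while t < window_ms:
        times.append(t)
        t += isi_ms
    return times


strong = spike_times(onset_ms=30.0)  # high intensity: early onset
weak = spike_times(onset_ms=90.0)    # low intensity: late onset
print(len(strong), len(weak))  # 6 and 3 spikes with identical intervals
```

With identical inter-spike intervals, the spike count within the sniff carries the same information as the onset latency, but the latency is available much earlier.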

The authors ignore that there is an increased number of action potentials for high-intensity odours and rely in their arguments entirely on the temporal aspect of earlier firing. If only the temporal code were important, this would be a huge waste of energy by the brain. Stefan suggested that it might be related to subsequent checks and to accumulating evidence.