## Normative evidence accumulation in unpredictable environments.

Glaze, C. M., Kable, J. W., and Gold, J. I.
Elife, 4, 2015

### Abstract

In our dynamic world, decisions about noisy stimuli can require temporal accumulation of evidence to identify steady signals; differentiation to detect unpredictable changes in those signals; or both. Normative models can account for learning in these environments but have not yet been applied to faster decision processes. We present a novel, normative formulation of adaptive learning models that forms decisions by acting as a leaky accumulator with non-absorbing bounds. These dynamics, derived for both discrete and continuous cases, depend on the expected rate of change of the statistics of the evidence and balance signal identification and change detection. We found that, for two different tasks, human subjects learned these expectations, albeit imperfectly, then used them to make decisions in accordance with the normative model. The results represent a unified, empirically supported account of decision-making in unpredictable environments that provides new insights into the expectation-driven dynamics of the underlying neural signals.

### Review

The authors suggest a model of sequential information processing that is aware of possible switches in the underlying source of information. They further show that the model fits responses of people in two perceptual decision making tasks and consequently argue that behaviour, which was previously considered to be suboptimal, may follow the normative, i.e., optimal, mechanism of the model. This mechanism postulates that typical evidence accumulation mechanisms in perceptual decision making are altered by the expected switch rate of the stimulus. Specifically, evidence accumulation becomes more leaky and a non-absorbing bound becomes lower when the expected switch rate increases. The paper is generally well-written (although there are some convoluted bits in the results section) and convincing. I was a bit surprised, though, that only choices, but not their timing, are considered in the analysis with the model. In the following I’ll go through some more details of the model and discuss limitations of the presented models and their relation to other models in the field, but first I describe the experiments reported in the paper.

The paper reports two experiments. In the first (triangles task) people saw two triangles on the screen and had to judge whether a single dot was more likely to originate from the one triangle or the other. There was one dot and corresponding response per trial. In each trial the position of the dot was redrawn from a Gaussian distribution centred around one of the two triangles. There were also change point trials in which the triangle from which the dot was drawn switched (and then remained the same until the next change point). The authors analysed the proportion correct in relation to whether a trial was a change point. Trials were grouped into blocks which were defined by a constant rate of switches (hazard rate) in the true originating triangle. In the second experiment (dots-reversal task), a random dot stimulus repeatedly switched (reversed) direction within a trial. In each trial people had to tell in which direction the dots moved before they vanished. The authors analysed the proportion correct in relation to the time between the last switch and the end of stimulus presentation. There were no blocks. Each trial had one of two hazard rates and one of two difficulty levels. The two difficulty levels were determined for each subject individually such that the more difficult one led to correct identification of the motion direction of a 500 ms long stimulus in 65% of cases.

The authors present two normative models, one discrete and one continuous, which they apply across and within trial in the triangles and dots-reversal tasks, respectively. The discrete model is a simple hidden Markov model in which the hidden state can take one of two values and there is a common transition probability between these two values which they call hazard ‘rate’ (H). Observations were implicitly assumed Gaussian. They only enter during fitting as log-likelihood ratios in the form $$\beta*x_n$$ where beta is a scaling relating to the internal / sensory uncertainty associated with the generative model of observations and $$x_n$$ is the observed dot position (x-coordinate) in the triangles task. In methods, the authors derive the update equation for the log posterior odds ($$L_n$$) of the hidden state values given in Eqs. (1) and (2).
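To make the discrete update concrete, here is a minimal sketch. The form of `prior_log_odds` is not copied from the paper; it is what I get by propagating the posterior odds through a symmetric two-state transition with switch probability H, so treat it as my reconstruction of Eqs. (1) and (2). Function names and example values are my own.

```python
import math


def prior_log_odds(L_prev, H):
    """Log prior odds for the next step, given the previous log posterior odds.

    Assumes a two-state hidden Markov model in which the hidden state
    switches with probability H between consecutive observations.
    """
    odds = math.exp(L_prev)
    return math.log((1 - H) * odds + H) - math.log(H * odds + (1 - H))


def update_log_odds(L_prev, llr, H):
    """One step of the discrete model: propagated prior plus new evidence."""
    return prior_log_odds(L_prev, H) + llr


def accumulate(llrs, H, L0=0.0):
    """Run the update over a sequence of log-likelihood ratios."""
    L = L0
    trajectory = []
    for llr in llrs:
        L = update_log_odds(L, llr, H)
        trajectory.append(L)
    return trajectory


# Hypothetical example: constant evidence of 1.0 per observation, H = 0.1
trajectory = accumulate([1.0] * 50, H=0.1)
```

With H = 0.5 the prior term vanishes, so only the current observation counts; with H = 0 the update reduces to perfect accumulation of the log-likelihood ratios.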

The continuous model is based on a Markov jump process with two states which is the continuous equivalent of the hidden Markov model above. Using Itô calculus the authors again derive an update equation for the log posterior odds of the two states (Eq. 4), but during fitting they actually approximate Eq. (4) with the discrete Eq. (1), because it is supposedly the most efficient discrete-time approximation of Eq. (4) (no explanation for why this is the case is given). They just replace the log-likelihood ratio placeholder (LLR) with a coherence-dependent term applicable to the random dot motion stimulus. Notably, in contrast to standard drift-diffusion modelling of random dot motion tasks, the authors used coherence-dependent noise. I’d be interested in the reason for this choice.

There is an apparent fundamental difference between the discrete and continuous models which can be seen in Fig. 1 B vs C. In the discrete model, for H>0.5, the log posterior odds may actually switch sign from one observation to the next whereas this cannot happen in the continuous model. Conceptually, this means that the log posterior odds in the discrete model, when the LLR is 0, i.e., when there is no evidence in either direction, would oscillate between decreasing positive and increasing negative values until converging to 0. This oscillation can be seen in Fig. 2G, red line for |LLR|>0. In the continuous model such an oscillation cannot happen, because the infinitely many, tiny time steps allow the model to converge to 0 before switching the sign. Another way to see this is through the discrete hazard ‘rate’ H which is the probability of a sign reversal within one time step of size dt. When you want to decrease dt in the model, but want to maintain a given rate of sign reversals in, e.g., 1 second, H would also have to decrease. Consequently, when dt approaches 0, the probability of a sign reversal approaches 0, too, which means that H is a useless parameter in continuous time which, in turn, is the reason why it is replaced by a real rate parameter ($$\lambda$$) representing the expected number of reversals per second. In conclusion, the fundamental difference between discrete and continuous models is only an apparent one. They are very similar models, just expressed in different resolutions of time. In that sense it would have perhaps been better to present results in the paper consistently in terms of a real hazard rate ($$\lambda$$) which could be obtained in the triangles task by dividing H by the average duration of a trial in seconds. Notice that the discrete model represents all hazard rates $$\lambda>1/dt$$ as H=1, i.e., it cannot represent hazard rates which would lead to more than 1 expected sign reversal per $$dt$$. 
There may be more subtle differences between the models when the exact distributions of sign reversals are considered instead of only the expected rates.
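The difference in behaviour without evidence can be checked numerically. The sketch below assumes (my own derivation, not quoted from the paper) that the discrete prior follows from a symmetric two-state transition with probability H, and that the evidence-free continuous dynamics are dL/dt = -2λ sinh(L), which follows from dp/dt = λ(1-2p) for a symmetric jump process with rate λ. All parameter values are arbitrary.

```python
import math


def discrete_prior(L, H):
    """Discrete-time propagation of log odds with switch probability H."""
    odds = math.exp(L)
    return math.log((1 - H) * odds + H) - math.log(H * odds + (1 - H))


def continuous_decay(L0, rate, duration, dt=1e-4):
    """Euler integration of dL/dt = -2 * rate * sinh(L) without evidence."""
    L = L0
    for _ in range(int(round(duration / dt))):
        L += -2.0 * rate * math.sinh(L) * dt
    return L


# Discrete model, H > 0.5, no evidence: the sign alternates while |L| shrinks.
L = 2.0
discrete_path = [L]
for _ in range(6):
    L = discrete_prior(L, H=0.9)
    discrete_path.append(L)

# Continuous model, high rate, no evidence: L decays towards 0 from above.
continuous_path = [continuous_decay(2.0, rate=5.0, duration=t)
                   for t in (0.0, 0.1, 0.2, 0.4)]

print("discrete:  ", [round(v, 3) for v in discrete_path])
print("continuous:", [round(v, 3) for v in continuous_path])
```

However large the rate is made in `continuous_decay`, a sufficiently small `dt` keeps L on the same side of 0, which is the point made above about H losing its meaning as dt shrinks.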

Using first order approximations of the two models the authors identify two components in the dynamics of the log posterior odds L: a leak and a bias. [Side remark: there is a small sign mistake in the definition of leak k of the continuous model in the Methods section.] Both depend on hazard rate and the authors show that the leak dominates the dynamics for small L whereas the bias dominates for large L. I find this denomination a bit misleading, because both leak and bias effectively result in a leak of log-posterior odds L by reducing L in every time step (cf. Fig. 1B,C). The change from a multiplicative leak to one based on a bias just means that the effective amount of leak in L increases nonlinearly with L as the bias takes over.
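For the discrete model the two regimes can be written out explicitly. I assume here that the prior term, which I denote $$\psi$$, has the form obtained by propagating the odds through a symmetric transition with probability H (my own reconstruction, for H<0.5):

$$\psi(L, H) = \log\frac{(1-H)\,e^{L} + H}{H\,e^{L} + (1-H)}$$

$$\psi(L, H) \approx (1-2H)\,L \quad \text{for } |L| \ll 1$$

$$\psi(L, H) \to \pm\log\frac{1-H}{H} \quad \text{for } L \to \pm\infty$$

Near L=0 the prior is a scaled copy of the previous posterior, i.e., a multiplicative leak with factor 1-2H. For large |L| it saturates at a value that depends only on H, which is the bias.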

To test whether this special form of leak underlies decision making the authors compared the full model to two versions which only had a multiplicative leak, or one based on bias. In the former the leak stayed constant for increasing L, i.e., $$L' = \gamma L$$. In the latter there was perfect accumulation without leak up to the bias and then a bias-based leak which corresponds to a multiplicative leak where the leak rate increased with L such that $$L' = \gamma(L) L$$ with $$\gamma(L) = \mathrm{bias} / L$$. The authors report evidence that in both tasks both alternative models do not describe choice behaviour as well as the full, normative model. In Fig. 9 they provide a reason by estimating the effective leak rate in the data and the models as a function of the strength of sensory evidence (coherence in the dots-reversal task). They do this by fitting the model with multiplicative leak separately to trials with low and high coherence (fitting to choices in the data or predicted by the different fitted models). In both data and normative model the effective leak rates depended on coherence. This dependence arises, because high sensory evidence leads to large values of L and I have argued above that larger L has larger effective leak rate due to the bias. It is, therefore, not surprising that the alternative model with multiplicative leak shows no dependence of effective leak on coherence. But it is also not surprising that the alternative model with bias-based leak has a larger dependence of effective leak on coherence than the data, because this model jumps from no leak to very large leak when coherence jumps from low to high. The full, normative model lies in between, because it smoothly transitions between the two alternative models.
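The three variants can be compared through their effective leak factor L'/L. This is only a sketch: the normative prior is my reconstruction for a symmetric two-state model with switch probability H, and the parameters of the two reduced models are not fitted values but are simply matched to it (γ = 1-2H, bias = log((1-H)/H)).

```python
import math


def normative_prior(L, H):
    """Prior log odds of the full model (two-state HMM, switch probability H)."""
    odds = math.exp(L)
    return math.log((1 - H) * odds + H) - math.log(H * odds + (1 - H))


def leaky_prior(L, gamma):
    """Multiplicative leak only: L' = gamma * L."""
    return gamma * L


def bias_prior(L, bias):
    """Perfect accumulation inside [-bias, bias], clipped to the bias outside."""
    return max(-bias, min(bias, L))


H = 0.1
gamma = 1 - 2 * H                # slope of the normative prior at L = 0
bias = math.log((1 - H) / H)     # saturation level of the normative prior

# Effective leak factor L'/L for each model at increasing L
rows = []
for L in (0.25, 1.0, 2.0, 4.0, 8.0):
    rows.append((L,
                 normative_prior(L, H) / L,
                 leaky_prior(L, gamma) / L,
                 bias_prior(L, bias) / L))

for L, g_norm, g_leak, g_bias in rows:
    print(f"L={L:4.2f}  normative={g_norm:.3f}  "
          f"leak-only={g_leak:.3f}  bias-only={g_bias:.3f}")
```

The leak-only factor is constant, the bias-only factor stays at 1 and then drops as bias/L, and the normative factor falls smoothly with L, which is the pattern behind the coherence dependence discussed above.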

Why is there a leak in the first place? Other people have found no evidence for a leak in evidence accumulation (e.g., Brunton et al., 2013). The leak results from the possibility of a switch of the source of the observations, i.e., a switch of the underlying true stimulus. Without any information, i.e., without observations, the possibility of a switch means that you should become more uncertain about the stimulus as time passes. The larger the hazard rate, i.e., the larger the probability of a switch within some time window, the faster you should become uncertain about the current stimulus. For a log posterior odds of L=0 uncertainty is at its maximum (both stimuli have equal posterior probability). This is another reason why discrete hazard ‘rates’ H>0.5 which lead to sign reversals in L do not make much sense. The absence of evidence for one stimulus should not lead to evidence for the other stimulus. Anyway, as the hazard rate goes to 0 the leak will go to 0 such that in experiments where usually no switches in stimulus occur subjects should not exhibit a leak, which explains why we often find no evidence for leaks in typical perceptual decision making experiments. This does not mean that there is no leak, though. In particular, the authors report here that hazard rates estimated from behaviour of subjects (subjective) tended to be a bit higher than the ones used to generate the stimuli (objective) when the objective hazard rates were very low, and the other way around for high objective hazard rates. This indicates that people have some prior expectations towards intermediate hazard rates that biased their estimates of hazard rates in the experiment.
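The growth of uncertainty without observations has a simple closed form. Assuming a symmetric two-state jump process with rate $$\lambda$$ (my own short derivation, not taken from the paper), the posterior probability p of the currently favoured stimulus obeys

$$\frac{dp}{dt} = \lambda (1-p) - \lambda p = \lambda (1 - 2p)$$

$$p(t) = \frac{1}{2} + \left(p(0) - \frac{1}{2}\right) e^{-2\lambda t}$$

So p relaxes towards 1/2 (L=0) with time constant $$1/(2\lambda)$$, never crosses it, and does not move at all as $$\lambda$$ goes to 0.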

The discussed forms of leak implement a property of the model that the authors called a ‘non-absorbing bound’. I find this wording also a bit misleading, because ‘bound’ is usually used to indicate a threshold in drift diffusion models which, when reached, triggers a response. The bound here triggers nothing. Rather, it represents an asymptote of the average log posterior odds. Thus, it’s not an absolute bound, but it’s often passed due to variance in the momentary sensory evidence (LLR). I also cannot follow the authors when they write: “The stabilizing boundary is also in contrast to the asymptote in leaky accumulation, which increases linearly with the strength of evidence”. Based on the dynamics of L discussed above the ‘bound’ here should exhibit exactly the described behaviour of an asymptote in leaky accumulation. The strength of evidence is reflected in the magnitude of LLR which is added to the intrinsic dynamics of the log posterior odds L. The non-absorbing bound, therefore, should be given by bias + average of LLR for the current stimulus. The bound, thus, should rise linearly with the strength of evidence (LLR).

Fitting of the discrete and continuous models was done by maximising the likelihood of the models (in some fits with many parameters, priors over parameters were used to regularise the optimisation). The likelihood in the discrete models was Gaussian with mean equal to the log posterior odds ($$L_n$$) computed from the actual dot positions $$x_n$$. The variance of the Gaussian likelihood was fitted to the data as a free parameter. In the continuous model the likelihood was numerically approximated by simulating the discretised evolution of the probabilities that the log posterior odds take on particular values. This is very similar to the approach used by Brunton et al. (2013). The distribution of the log posterior odds $$L_n$$ was considered here, because the stream of sensory observations $$x(t)$$ was unknown and therefore had to enter as a random variable, while in the triangles task $$x(t)=x_n$$ was set to the known x-coordinates of the presented dots.

The authors argued that the fits of behaviour were good, but at least for the dots reversal task Fig. 8 suggests otherwise. For example, Fig. 8G shows that 6 out of 12 subjects (there were supposed to be 13, but I can only see 12 in the plots) made 100% errors in trials with the low hazard rate of 0.1Hz and low coherence where the last switch in stimulus was very recent (maximally 300ms before the end of stimulus presentation). The best fitting model, however, predicted error rates of at most 90% in these conditions. Furthermore, there is a significant difference in choice errors between the low and high hazard rate for large times after the last switch in stimulus (Fig. 8A, more errors for high hazard rate) which was not predicted by the fitted normative model. Despite these differences the fitted normative model seems to capture the overall patterns in the data.

#### Conclusion

The authors present an interesting normative model in discrete and continuous time that extends previous models of evidence accumulation to situations in which switches in the presented stimulus can be expected. In light of this model, a leak in evidence accumulation reflects a tendency to increase uncertainty about the stimulus due to a potentially upcoming switch in the stimulus. The model provides a mathematical relation between the precise type of leak and the expected switch (hazard) rate of the stimulus. In particular, and in contrast to previous models, the leak in the present model depends nonlinearly on the accumulated evidence. As the authors discuss, the presented normative model potentially unifies decision making processes observed in different situations characterised by different stabilities of the underlying stimuli. I had the impression that the authors were very thorough in their analysis. However, some deviations of model and data apparent in Fig. 8 suggest that either the model itself, or the fitting procedure may be improved such that the model better fits people’s behaviour in the dots-reversal task. It was anyway surprising to me that subjects only had to make a single response per trial in that task. This feels like a big waste of potential choice data when I consider that each trial was 5-10s long and contained several stimulus switches (reversals).

## A test of Bayesian observer models of processing in the Eriksen flanker task.

White, C. N., Brown, S., and Ratcliff, R.
J Exp Psychol Hum Percept Perform, 38:489–497, 2012

### Abstract

Two Bayesian observer models were recently proposed to account for data from the Eriksen flanker task, in which flanking items interfere with processing of a central target. One model assumes that interference stems from a perceptual bias to process nearby items as if they are compatible, and the other assumes that the interference is due to spatial uncertainty in the visual system (Yu, Dayan, & Cohen, 2009). Both models were shown to produce one aspect of the empirical data, the below-chance dip in accuracy for fast responses to incongruent trials. However, the models had not been fit to the full set of behavioral data from the flanker task, nor had they been contrasted with other models. The present study demonstrates that neither model can account for the behavioral data as well as a comparison spotlight-diffusion model. Both observer models missed key aspects of the data, challenging the validity of their underlying mechanisms. Analysis of a new hybrid model showed that the shortcomings of the observer models stem from their assumptions about visual processing, not the use of a Bayesian decision process.

### Review

This is a response to Yu et al. (2009) in which the authors show that Yu et al.'s main Bayesian models cannot account for the full data of an Eriksen flanker task. In particular, Yu et al.'s models predict a far too high overall error rate with the suggested parameter settings that reproduce the initial drop of accuracy below chance level for very fast responses. The argument put forward by White et al. is that the mechanisms used in Yu et al.'s models to overcome initial, flanker-induced biases are too slow, i.e., the probabilistic evidence accumulation implemented by the models is influenced by the flankers for too long. White et al.'s shrinking spotlight models do not have such a problem, mostly because the speed with which flankers lose influence is fitted to the data. The argument seems compelling, but I would like to understand better why it takes so long in the Bayesian model to overcome flanker influence and whether there are other ways of speeding this up than the one suggested by White et al.

## Dynamics of attentional selection under conflict: toward a rational Bayesian account.

Yu, A. J., Dayan, P., and Cohen, J. D.
J Exp Psychol Hum Percept Perform, 35:700–717, 2009

### Abstract

The brain exhibits remarkable facility in exerting attentional control in most circumstances, but it also suffers apparent limitations in others. The authors' goal is to construct a rational account for why attentional control appears suboptimal under conditions of conflict and what this implies about the underlying computational principles. The formal framework used is based on Bayesian probability theory, which provides a convenient language for delineating the rationale and dynamics of attentional selection. The authors illustrate these issues with the Eriksen flanker task, a classical paradigm that explores the effects of competing sensory inputs on response tendencies. The authors show how 2 distinctly formulated models, based on compatibility bias and spatial uncertainty principles, can account for the behavioral data. They also suggest novel experiments that may differentiate these models. In addition, they elaborate a simplified model that approximates optimal computation and may map more directly onto the underlying neural machinery. This approximate model uses conflict monitoring, putatively mediated by the anterior cingulate cortex, as a proxy for compatibility representation. The authors also consider how this conflict information might be disseminated and used to control processing.

### Review

The authors suggest two simple, Bayesian perceptual models based on evidence integration for the (deadlined) Eriksen task. Their focus is on attentional mechanisms that can explain why participants' accuracy is below chance for very fast responses. These mechanisms are based on a prior on compatibility (that flankers are compatible with the relevant centre stimulus) and spatial uncertainty (flankers influence processing of the centre stimulus on a low, sensory level). The core inference is the same and replicates the basic mechanism you would expect for any perceptual decision making task. They don't fit behaviour, but rather show average trajectories from model simulations with hand-tuned parameters. They further suggest a third model inspired by previous work on conflict monitoring and cognitive control which supposedly is more likely to be implemented in the brain, because instead of having to consider (and compute with) all possible stimuli in the environment, it uses a conflict monitoring mechanism to switch between types of stimuli that are considered.

## Universality in numerical computations with random data.

Deift, P. A., Menon, G., Olver, S., and Trogdon, T.
Proc Natl Acad Sci U S A, 111:14973–14978, 2014

### Abstract

The authors present evidence for universality in numerical computations with random data. Given a (possibly stochastic) numerical algorithm with random input data, the time (or number of iterations) to convergence (within a given tolerance) is a random variable, called the halting time. Two-component universality is observed for the fluctuations of the halting time-i.e., the histogram for the halting times, centered by the sample average and scaled by the sample variance, collapses to a universal curve, independent of the input data distribution, as the dimension increases. Thus, up to two components-the sample average and the sample variance-the statistics for the halting time are universally prescribed. The case studies include six standard numerical algorithms as well as a model of neural computation and decision-making. A link to relevant software is provided for readers who would like to do computations of their own.

### Review

The authors show that normalised halting / stopping times follow common distributions. Stopping times are assumed to be generated by an algorithm A from a random ensemble E where E does not represent the particular sample from which stopping times are generated, but the theoretical distribution of that sample. Normalisation is standard normalisation: subtract the mean and divide by the standard deviation of a sample of stopping times. The resulting distribution is the same across different ensembles E, but differs across algorithms A. That the distributions are the same is what the authors call (two-component) universality, without explaining why they call it that. There is also no reference to a concept of universality. Perhaps it’s something common in physics. Perhaps it’s explained in their first reference. Reference numbers are shifted by one, by the way.
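The normalisation step is small enough to spell out. This sketch follows the description above (sample mean and sample standard deviation); the halting times are made-up numbers, not data from the paper.

```python
import statistics


def normalise_halting_times(times):
    """Centre by the sample mean and scale by the sample standard deviation."""
    mean = statistics.mean(times)
    sd = statistics.stdev(times)
    return [(t - mean) / sd for t in times]


# Hypothetical iteration counts of one algorithm on inputs from one ensemble
halting_times = [12, 15, 11, 19, 14, 13, 17]
normalised = normalise_halting_times(halting_times)
```

Universality in the sense of the paper would then mean that histograms of `normalised` obtained from different ensembles E, but the same algorithm A, lie on top of each other.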

How is that interesting? I’m not sure. The authors give an example with a model of reaction times. This is a kind of Ising model where decisions are made once a sufficient number of binary states have switched to one of the states. States flip with a certain probability as determined by a given function of the current state of the whole Ising model. When different such functions were considered, corresponding to different ensembles E, normalised reaction times followed the same distribution again. However, the distribution of normalised reaction times differed for different total numbers of binary states in the Ising model. These results suggest that normalised reaction times should follow the same distribution over subjects, but only if subjects differ at most in the randomness on which their decisions are based. If subjects use slightly different algorithms for making decisions, you would expect differences in the distribution of normalised reaction times. I guess it would be cool to infer that subjects use the same (or a different) algorithm purely from their reaction time distributions, but what would be an appropriate test for this and what would be its power?

## Probabilistic population codes for Bayesian decision making.

Beck, J. M., Ma, W. J., Kiani, R., Hanks, T., Churchland, A. K., Roitman, J., Shadlen, M. N., Latham, P. E., and Pouget, A.
Neuron, 60:1142–1152, 2008

### Abstract

When making a decision, one must first accumulate evidence, often over time, and then select the appropriate action. Here, we present a neural model of decision making that can perform both evidence accumulation and action selection optimally. More specifically, we show that, given a Poisson-like distribution of spike counts, biological neural networks can accumulate evidence without loss of information through linear integration of neural activity and can select the most likely action through attractor dynamics. This holds for arbitrary correlations, any tuning curves, continuous and discrete variables, and sensory evidence whose reliability varies over time. Our model predicts that the neurons in the lateral intraparietal cortex involved in evidence accumulation encode, on every trial, a probability distribution which predicts the animal’s performance. We present experimental evidence consistent with this prediction and discuss other predictions applicable to more general settings.

### Review

In this article the authors apply probabilistic population coding as presented in Ma et al. (2006) to perceptual decision making. In particular, they suggest a hierarchical network with a MT and LIP layer in which the firing rates of MT neurons encode the current evidence for a stimulus while the firing rates of LIP neurons encode the evidence accumulated over time. Under the made assumptions it turns out that the accumulated evidence is independent of nuisance parameters of the stimuli (when they can be interpreted as contrasts) and that LIP neurons only need to sum (integrate) the activity of MT neurons in order to represent the correct posterior of the stimulus given the history of evidence. They also suggest a readout layer implementing a line attractor which reads out the maximum of the posterior under some conditions.

#### Details

Probabilistic population coding is based on the definition of the likelihood of stimulus features p(r|s,c) as an exponential family distribution of firing rates r. A crucial requirement for the central result of the paper (that LIP only needs to integrate the activity of MT) is that nuisance parameters c of the stimulus s do not occur in the exponential itself while the actual parameters of s only occur in the exponential. This restricts the exponential family distribution to the “Poisson-like family”, as they call it, which requires that the tuning curves of the neurons and their covariance are proportional to the nuisance parameters c (details for this need to be read up in Ma et al., 2006). The point is that this is the case when c corresponds to contrast, or gain, of the stimulus. For the considered random dot stimuli the coherence of the dots may indeed be interpreted as the contrast of the motion in the sense that I can imagine that the tuning curves of the MT neurons are multiplicatively related to the coherence of the dots.

The probabilistic model of the network activities is set up such that the firing of neurons in the network is an indirect, noisy observation of the underlying stimulus, but what we are really interested in is the posterior of the stimulus. So the question is how you can estimate this posterior from the network firing rates. The trick is that under the Poisson-like distribution the likelihood and posterior share the same exponential such that the posterior becomes proportional to this exponential, because the other parts of the likelihood do not depend on the stimulus s (they assume a flat prior of s such that you don’t need to consider it when computing the posterior). Thus, the probability of firing in the network is determined from the likelihood while the resulting firing rates simultaneously encode the posterior. Mind-boggling. The main contribution from the authors then is to show, assuming that firing rates of MT neurons are driven from the stimulus via the corresponding Poisson-like likelihood, that LIP neurons only need to integrate the spikes of MT neurons in order to correctly represent the posterior of the stimulus given all previous evidence (firing of MT neurons). Notice that they also assume that LIP neurons have the same tuning curves with respect to the stimulus as MT neurons and that each LIP neuron sums the activity of the MT neuron with which it shares a tuning curve. They note that a naive procedure like that, i.e., a single neuron integrating MT firing over time, would quickly saturate its activity. So they show, and that is really cool, that global inhibition in the LIP network does not affect the representation of the posterior, allowing them to prevent saturation of firing while maintaining the probabilistic interpretation.
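The central claim, that summing spikes is enough, can be illustrated with a toy example. The sketch below is not the network from the paper: it assumes independent Poisson neurons (a special case of the Poisson-like family), von Mises tuning curves, a gain c that changes at every time step, and a flat prior over a grid of candidate stimuli. All names and numbers are invented for illustration.

```python
import math
import random


def tuning(s, pref, kappa=2.0):
    """Unit-gain von Mises tuning curve for stimulus s and preferred direction pref."""
    return math.exp(kappa * (math.cos(s - pref) - 1.0))


def poisson(rng, lam):
    """Knuth's Poisson sampler; adequate for the small rates used here."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def normalise(log_p):
    """Turn unnormalised log probabilities into a probability vector."""
    m = max(log_p)
    w = [math.exp(v - m) for v in log_p]
    total = sum(w)
    return [v / total for v in w]


n = 16
grid = [2 * math.pi * i / n for i in range(n)]  # preferred directions = candidate stimuli
rng = random.Random(0)
true_s = grid[4]
gains = [0.5, 2.0, 1.0, 3.0, 0.2]               # nuisance gain c, one value per time step

# "MT" spike counts, one vector per time step
mt = [[poisson(rng, c * tuning(true_s, pref)) for pref in grid] for c in gains]

# (a) Step-by-step Bayesian update with the full Poisson likelihood and known gains
log_post = [0.0] * n
for c, r in zip(gains, mt):
    for j, s in enumerate(grid):
        for i, pref in enumerate(grid):
            rate = c * tuning(s, pref)
            log_post[j] += r[i] * math.log(rate) - rate
posterior_bayes = normalise(log_post)

# (b) "LIP" only sums the MT counts; decoding uses neither the gains nor the step count
lip = [sum(r[i] for r in mt) for i in range(n)]
posterior_lip = normalise(
    [sum(lip[i] * math.log(tuning(s, pref)) for i, pref in enumerate(grid))
     for s in grid])

max_diff = max(abs(a - b) for a, b in zip(posterior_bayes, posterior_lip))
```

The two posteriors agree up to floating point error because the summed tuning curves are the same for every candidate stimulus on this grid, so all gain-dependent terms drop out on normalisation. This is the toy version of the nuisance parameter not appearing in the exponential.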

So much for the theory. In practice, i.e., in the experiments, the authors do something entirely different, because “these results are important, but they are based on assumptions that are not necessarily exactly true in vivo. […] It is therefore essential that we test our theory in biologically realistic networks.” Now, this is a noble aim, but what exactly do we learn about this theory, if all results are obtained using methods which violate the assumptions of the theory? For example, neither the probability of firing in MT nor that in LIP is Poisson-like, LIP neurons do not just integrate MT activity, but are also recurrently connected, LIP neurons have local inhibition (they are leaky integrators, with inhibition between LIP neurons depending on tuning properties) instead of global inhibition, and LIP neurons have an otherwise completely unmotivated “urgency signal” whose contribution increases with time (this stems from experimental observations). Without any concrete links between the two models in theory (I guess the main ideas are similar, but the details are very different) it has to be shown that they are similar using experimental results. In any case, it is hard to differentiate between contributions from the probabilistic theory and the network implementation, i.e., how much of the fit between experimental findings in monkeys and the behaviour of the model is due to the chosen implementation and how much is due to the probabilistic interpretation?

#### Results

The overall aim of the experiments / simulations in the paper is to show that the proposed probabilistic interpretation is compatible with the experimental findings in monkey LIP. The hypothesis is that LIP neurons encode the posterior of the stimulus as suggested in the theory. This hypothesis is false from the start, because some assumptions of the theory apparently don’t apply to neurons (as acknowledged by the authors). So the new hypothesis is that LIP neurons approximately encode some posterior of the stimulus. The requirement for this posterior is that updates of the posterior should take the uncertainty of the evidence and the uncertainty of the previous estimate of the posterior into account, which the authors measure as a linear increase of the log odds of making a correct choice, log[ p(correct) / (1-p(correct)) ], with time together with the dependence of the slope of this linear increase on the coherence (contrast) of the stimulus. I did not follow up why the previous requirement is related to the log odds in this way, but it sounds ok. The question remains how to estimate the log odds from simulated and real neurons. For the simulated neurons the authors approximate the likelihood with a Poisson-like distribution whose kernel (parameters) was estimated from the simulated firing rates. They argue that it is a good approximation, because linear estimates of the Fisher information appear to be sufficient (I can’t comment on the validity of this argument). A similar approximation of the posterior cannot be done for real LIP neurons, because of a lack of multi-unit recordings which estimate the response of the whole LIP population. Instead, the authors approximate the log odds from measured firing rates of neurons tuned to motion in direction 0 and 180 degrees via a linear regression approach described in the supplemental data.

The authors show that the log-odds computed from the simulated network exhibit the desired properties, i.e., the log-odds increase linearly with time (although there’s a kink at 50ms which supposedly is due to the discretisation) and depend on the coherence of the motion such that the slope of the log-odds also increases when coherence is increased within a trial. The corresponding log-odds of real LIP neurons are far noisier and, thus, do not allow definite judgements about linearity. Also, we don’t know whether their slopes would actually change after a change in motion coherence during a trial, as this was never tested (it’s likely, though).

In order to test whether the proposed line attractor network is sufficient to read out the maximum of the posterior in all conditions (readout time and motion coherence), the authors compare a single (global) readout with local readouts adapted for a particular pair of readout time and motion coherence. However, the authors don’t actually use attractor networks in these experiments, but note that these are equivalent to local linear estimators and so use those instead. Rather than comparing the readouts from these estimators with the actual maximum of the posterior, they only compare the variance of the estimators (Fisher information), which they show to be roughly the same for the local and global estimators. From this they conclude that a single, global attractor network could read out the maximum of the (approximated) posterior. However, this is only true if there’s no additional bias of the global estimator, which we cannot see from these results.

Finally, the build-up rates of LIP neurons also seem to be qualitatively similar in the simulation and the data, although they are consistently lower in the model. The build-up rates for the model are estimated from the first 50ms within each trial. However, the log-odds ratio had this kink at 50ms after which its slope was larger. So, if this effect is also seen directly in the firing rates, the fit of the build-up rates to the data may be even better if the probability of firing after 50ms is used. In Fig. 2C no such kink can be seen in the firing rates, but this is only data for 2 neurons in the model.

Conclusion

Overall the paper is very interesting and stimulating. It is well written and full of sound theoretical results which originate from previous work of the authors. Unfortunately, biological nature does not completely fit the beautiful theory. Consequently, the authors run experiments with more plausible neural networks which only approximately implement the theory. So what conclusions can we draw from the presented results? As long as the firing of MT neurons reflects the likelihood of a stimulus (their MT network is set up in this way), probably a wide range of networks which accumulate this firing will show responses similar to real LIP neurons. It is not easy to say whether this is a consequence of the theory, which states that MT firing rates should be simply summed over time in order to get the right posterior, because of the violation of the assumptions of the theory in more realistic networks. It could also be that more complicated forms of accumulation are necessary such that LIP firing represents the correct posterior. Simple summing then just represents a simple approximation. Also, I don’t believe that the presented results can rule out the possibility of sampling based coding of probabilities (see Fiser et al., 2010) for decision making, as long as the sampling approach also implements some kind of accumulation procedure (think of particle filters – the implementation in a recurrent neural network would probably be quite similar).

Nevertheless, the main point of the paper is that the activity in LIP represents the full posterior and not only MAP estimates or log-odds. Consequently, the model very easily extends to the case of continuous directions of motion, which is in contrast to previous, e.g., attractor-based, neural models. I like this idea. However, I cannot determine from the experiments whether their network actually implements the correct posterior, because all their tests yield only indirect measures based on approximated analyses. Even so, it is pretty much impossible to verify that the firing of LIP neurons fits the simulated results as long as we cannot measure firing of a large part of the corresponding neural population in LIP.

## Information Theory of Decisions and Actions.

Tishby, N. and Polani, D.
in: Perception-Action Cycle, Springer New York, pp. 601–636, 2011

### Abstract

The perception–action cycle is often defined as “the circular flow of information between an organism and its environment in the course of a sensory guided sequence of actions towards a goal” (Fuster, Neuron 30:319–333, 2001; International Journal of Psychophysiology 60(2):125–132, 2006). The question we address in this chapter is in what sense this “flow of information” can be described by Shannon’s measures of information introduced in his mathematical theory of communication. We provide an affirmative answer to this question using an intriguing analogy between Shannon’s classical model of communication and the perception–action cycle. In particular, decision and action sequences turn out to be directly analogous to codes in communication, and their complexity – the minimal number of (binary) decisions required for reaching a goal – directly bounded by information measures, as in communication. This analogy allows us to extend the standard reinforcement learning framework. The latter considers the future expected reward in the course of a behaviour sequence towards a goal (value-to-go). Here, we additionally incorporate a measure of information associated with this sequence: the cumulated information processing cost or bandwidth required to specify the future decision and action sequence (information-to-go). Using a graphical model, we derive a recursive Bellman optimality equation for information measures, in analogy to reinforcement learning; from this, we obtain new algorithms for calculating the optimal trade-off between the value-to-go and the required information-to-go, unifying the ideas behind the Bellman and the Blahut–Arimoto iterations. This trade-off between value-to-go and information-to-go provides a complete analogy with the compression–distortion trade-off in source coding. The present new formulation connects seemingly unrelated optimization problems. The algorithm is demonstrated on grid world examples.

### Review

Peter Dayan pointed me to this paper (which is actually a book chapter) when I told him that I find the continuous interaction between perception and action important and that Friston’s free energy framework is one of the few which covers this case. Now, this paper covers only discrete time (and states and actions), but certainly it addresses the issue that perception and action influence each other.

The main idea of the paper is to take the informational effort (they call it information-to-go) into account when finding a policy for a Markov decision process. A central finding is a recursive equation analogous to the (Bellman) equation for the Q-function in reinforcement learning which captures the expected (over all possible future state-action trajectories) informational effort of a certain state-action pair. Informational effort is defined as the KL-divergence between a factorising prior distribution over future states and actions (making them independent across time) and their true distribution. This means that the informational effort is the expected number of bits of information that you have to consider in addition to your prior when moving through the future. They then propose a free energy (also a recursive equation) which combines the informational effort with the Q-function of the underlying MDP and thus allows simultaneous optimisation of informational effort and reward where the two are traded off against each other.

Practically, this leads to “soft vs. sharp policies”: sharp policies always choose the action with the highest expected reward, whereas soft policies choose actions probabilistically, with an associated penalty on reward compared to sharp policies. The softness of the resulting policy is controlled by the tradeoff parameter between informational effort and reward, which can be interpreted as the informational capacity of the system under consideration. I understand it this way: the tradeoff parameter stands for the informational complexity/capacity of the distributions representing the internal model of the world in the agent, and the optimal policy for a particular setting of the tradeoff parameter is the best policy with respect to reward alone that a corresponding agent can achieve. This is easily seen when considering that informational effort depends on the prior for future state-action trajectories. For a given prior, tradeoff parameter and resulting policy you can find the corresponding more complex prior for which the same policy can be found for 0 informational effort. The prior here obviously corresponds to the internal model of the agent. Consequently, the authors present a general framework with which you can ask questions such as: “How much informational capacity does my agent need to solve a given task with a desired level of performance?” Or, in other words: “How complex does my agent need to be in order to solve the given task?” Or: “How well can my agent solve the given task?”, although this latter question is the standard question in RL. In particular, my intuition tells me that for every setting of the tradeoff parameter there probably is an equivalent POMDP formulation (which makes the corresponding difference between world and agent model explicit).
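The soft/sharp distinction can be illustrated for a single state with a minimal sketch. It assumes a Boltzmann-type form, policy ∝ prior × exp(beta × Q), which is the shape such reward/information tradeoffs typically take; it does not reproduce the recursion of the chapter. The names `soft_policy`, `kl_bits` and `beta` as well as the Q-values are hypothetical choices made here.

```python
import math


def soft_policy(q_values, prior, beta):
    """Action distribution proportional to prior(a) * exp(beta * Q(a)).

    beta is the (assumed) tradeoff parameter: beta = 0 returns the prior,
    large beta approaches the sharp policy that picks the argmax of Q.
    """
    q_max = max(q_values)  # subtract the maximum for numerical stability
    weights = [p * math.exp(beta * (q - q_max)) for q, p in zip(q_values, prior)]
    total = sum(weights)
    return [w / total for w in weights]


def kl_bits(policy, prior):
    """KL divergence from prior to policy in bits (informational effort of one decision)."""
    return sum(p * math.log2(p / r) for p, r in zip(policy, prior) if p > 0)


q_values = [1.0, 0.5, 0.0]   # hypothetical Q-values of three actions
prior = [1 / 3, 1 / 3, 1 / 3]

for beta in (0.0, 1.0, 5.0, 50.0):
    policy = soft_policy(q_values, prior, beta)
    expected_q = sum(p * q for p, q in zip(policy, q_values))
    print(beta, [round(p, 3) for p in policy], round(expected_q, 3), round(kl_bits(policy, prior), 3))
```

With increasing `beta` the policy becomes sharper, the expected Q-value rises and the divergence from the prior grows, which is the tradeoff between reward and informational effort in miniature.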

A particularly interesting discussion is that about “perfectly adapted environments”, which seems to be directed towards Friston without mentioning him, though. The discussion results from the ability to optimise their free energy, combined from informational effort and reward, not only with respect to the policy, but also with respect to the (true) transition probabilities. The outcome of such an optimisation is an environment in which transition probabilities are directly related to rewards, or, in other words, an environment in which informational effort is equal to something like negative reward. In such an environment “minimizing the statistical surprise or maximizing the predictive information is equivalent to maximizing reward”, which is what Friston argues (see also the associated discussion on hunch.net). Needless to say, they consider this a very special case, while in most other cases the environment contains information that is irrelevant in terms of reward. Nevertheless, they consider the possibility that the environments of living organisms are indeed perfectly or at least well adapted through millions of years of coevolution and they suggest directing future research towards this issue. The question really is: what is reward in this general sense? What is it that living organisms try to achieve? The more concrete reward is, for example, reward for a particular task, the less relevant most information in the environment will be. I’m tempted to say that the combined optimisation of informational effort and rewards, as presented here, will then lead to policies which particularly seek out relevant information, but I’m not sure whether this is a correct interpretation.

To sum up, Tishby and Polani present a new theoretical framework which generalises reinforcement learning by incorporating ideas from information theory. They provide an interesting new perspective which is presented in a pleasingly accessible way. I do not think that they solved any particular problem in reinforcement learning, but they broadened the view by postulating that agents trade off informational effort (capacity?) and reward. Practically, computations derived from their framework may not be feasible in most cases, because original reinforcement learning is already hard and here a few expectations have been added. Or, maybe it’s not so bad, because you can do them together.

## Internal models and the construction of time: generalizing from state estimation to trajectory estimation to address temporal features of perception, including temporal illusions.

Grush, R.
Journal of Neural Engineering, 2:S209, 2005

### Abstract

The question of whether time is its own best representation is explored. Though there is theoretical debate between proponents of internal models and embedded cognition proponents (e.g. Brooks R 1991 Artificial Intelligence 47 139–59) concerning whether the world is its own best model, proponents of internal models are often content to let time be its own best representation. This happens via the time update of the model that simply allows the model’s state to evolve along with the state of the modeled domain. I argue that this is neither necessary nor advisable. I show that this is not necessary by describing how internal modeling approaches can be generalized to schemes that explicitly represent time by maintaining trajectory estimates rather than state estimates. Though there are a variety of ways this could be done, I illustrate the proposal with a scheme that combines filtering, smoothing and prediction to maintain an estimate of the modeled domain’s trajectory over time. I show that letting time be its own representation is not advisable by showing how trajectory estimation schemes can provide accounts of temporal illusions, such as apparent motion, that pose serious difficulties for any scheme that lets time be its own representation.

### Review

The author argues based on temporal illusions that perceptual states correspond to smoothed trajectories where smoothing is meant as in the context of a Kalman smoother. In particular, temporal illusions such as the flash-lag effect and the cutaneous rabbit show that stimuli later in time can influence the perception of earlier stimuli. However, it seems that this is only the case for temporally very close stimuli (within 100ms). Thus, Grush suggests that stimuli are internally represented as trajectories including past and future states. However, the representation of the past states in the trajectory is also updated when new sensory evidence is collected (the observations, or rather the states, are smoothed). This idea has actually already been suggested by Rao, Eagleman and Sejnowski (2001) as stated by the author, but here he additionally postulates that also some of the future states are represented in the trajectory to account for apparent motion effects (where a motion is continued in the head when the stimulus disappears).
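The difference between filtering and smoothing that the argument rests on can be shown with a minimal scalar sketch. It assumes a random-walk state observed with Gaussian noise and uses a standard Kalman filter followed by a Rauch-Tung-Striebel backward pass; the noise variances `q` and `r` and the observation sequence are arbitrary illustrative choices, not anything taken from Grush's paper.

```python
def kalman_filter(ys, q=0.1, r=1.0, m0=0.0, p0=1.0):
    """Scalar Kalman filter for x_t = x_{t-1} + noise(q), y_t = x_t + noise(r)."""
    means, variances = [], []
    m, p = m0, p0
    for y in ys:
        p_pred = p + q               # time update: random walk keeps the mean
        gain = p_pred / (p_pred + r)
        m = m + gain * (y - m)       # measurement update
        p = (1.0 - gain) * p_pred
        means.append(m)
        variances.append(p)
    return means, variances


def rts_smoother(means, variances, q=0.1):
    """Backward pass: revise earlier state estimates using later observations."""
    smoothed = list(means)
    for t in range(len(means) - 2, -1, -1):
        p_pred = variances[t] + q    # predicted variance for step t + 1
        g = variances[t] / p_pred
        smoothed[t] = means[t] + g * (smoothed[t + 1] - means[t])
    return smoothed


# A stimulus that jumps after the third observation.
ys = [0.0, 0.0, 0.0, 5.0, 5.0, 5.0]
filtered, variances = kalman_filter(ys)
smoothed = rts_smoother(filtered, variances)

for t, (f, s) in enumerate(zip(filtered, smoothed)):
    print(t, round(f, 3), round(s, 3))
```

The filtered estimate at the third time step only knows about the zeros seen so far, whereas the smoothed estimate for the same time step is pulled towards the later observations. This is the sense in which later stimuli can alter the estimate of earlier states; a fixed-lag smoother would apply the same backward revision, but only report the estimate for a fixed delay into the past.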

It’s an interesting account of temporal aspects of perception, but note that he develops things for the perceptual level, which does not necessarily let us draw conclusions for processing on the sensory level. Also, his discussion on whether Rao et al’s account of a fixed-lag smoother can be true is interesting, though he didn’t entirely convince me that fixed-lag perception is not what is happening in the brain. Wouldn’t instantaneous updating of the perceptual trajectory mean that at some point our perception changes? Yet during the illusions people report coherent motion. Ok, it could be that we just don’t “remember” our previous perception after it’s updated, but it still sounds counterintuitive. I don’t think that the apparent motion illusions are a good argument for representing future states, because other mechanisms could be responsible for that.

## SORN: a self-organizing recurrent neural network.

Lazar, A., Pipa, G., and Triesch, J.
Front Comput Neurosci, 3:23, 2009

### Abstract

Understanding the dynamics of recurrent neural networks is crucial for explaining how the brain processes information. In the neocortex, a range of different plasticity mechanisms are shaping recurrent networks into effective information processing circuits that learn appropriate representations for time-varying sensory stimuli. However, it has been difficult to mimic these abilities in artificial neural network models. Here we introduce SORN, a self-organizing recurrent network. It combines three distinct forms of local plasticity to learn spatio-temporal patterns in its input while maintaining its dynamics in a healthy regime suitable for learning. The SORN learns to encode information in the form of trajectories through its high-dimensional state space reminiscent of recent biological findings on cortical coding. All three forms of plasticity are shown to be essential for the network’s success.

### Review

The paper considers the question of whether adapting an RNN used as a reservoir gives better performance in a sequence prediction task than randomly initialised RNNs. The authors demonstrate an adaptation procedure based on spike-timing-dependent plasticity (STDP) controlled with intrinsic plasticity (IP) and synaptic normalisation (SN) as homeostatic mechanisms and show that the performance of the adapted RNNs is indeed superior to the performance of the random RNNs. They further show that IP and SN are necessary for good results, or rather that without either the RNN exhibits disadvantageous firing patterns (bursting, always on, always off).

This is one of the few studies which shows successful learning of RNNs. However, they use a rather simple model: a binary network in discrete time. The connectivity of the network is more elaborate: there are excitatory units which are recurrently connected, as well as fewer inhibitory neurons which have no connections between themselves, but are fully and reciprocally connected with all excitatory units. Input to the network is given to excitatory units through input units which are separated into subsets, each of which gives a spike (1) when a specific symbol in the input sequence is currently present (input sequences consist of letters and numbers). The authors show that the RNN develops states (activity of all units in the network as a vector) which are specific to individual input symbols, with the addition that the serial number of the input symbol in the sequence is also represented. This simplifies readout of the current symbol in the sequence from RNN activity and hence leads to improved performance of predicting the next symbol in the sequence using a standard reservoir computing readout function. However, the authors note that the RNN keeps on changing its response to input, i.e., their learning rule does not converge, which means that the readout function would have to be updated all the time as well. Consequently, they switch off learning in the test phase.
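To make the interplay of the three plasticity mechanisms more tangible, here is a heavily simplified sketch of one update cycle. The equations are not given in this review, so the concrete rules below are assumptions in the spirit of the description: binary threshold units, an STDP rule based on the activity of two consecutive time steps, normalisation of incoming weights, and a threshold adaptation towards a target rate. The inhibitory population is omitted, and all sizes, learning rates and the input drive are hypothetical.

```python
import random


def step(x, w, thresholds, drive):
    """Binary threshold update of the excitatory units (inhibition omitted here)."""
    n = len(x)
    return [1 if sum(w[i][j] * x[j] for j in range(n)) + drive[i] > thresholds[i] else 0
            for i in range(n)]


def stdp(w, x_prev, x_new, eta=0.001):
    """Strengthen j -> i if j fired one step before i, weaken for the reverse order."""
    n = len(w)
    for i in range(n):
        for j in range(n):
            if w[i][j] > 0:  # only existing synapses are plastic (sparse connectivity)
                dw = eta * (x_new[i] * x_prev[j] - x_prev[i] * x_new[j])
                w[i][j] = max(w[i][j] + dw, 0.0)


def normalise(w):
    """Synaptic normalisation: incoming weights of each unit are rescaled to sum to 1."""
    for row in w:
        total = sum(row)
        if total > 0:
            for j in range(len(row)):
                row[j] /= total


def intrinsic_plasticity(thresholds, x_new, eta=0.001, target_rate=0.1):
    """Raise the threshold of units that fired, lower it for silent units."""
    for i, xi in enumerate(x_new):
        thresholds[i] += eta * (xi - target_rate)


random.seed(0)
n = 8
w = [[random.random() if i != j and random.random() < 0.3 else 0.0 for j in range(n)]
     for i in range(n)]
normalise(w)
thresholds = [random.uniform(0.0, 0.5) for _ in range(n)]
x = [random.randint(0, 1) for _ in range(n)]

for t in range(100):
    drive = [1.0 if i == t % n else 0.0 for i in range(n)]  # toy cyclic input
    x_new = step(x, w, thresholds, drive)
    stdp(w, x, x_new)
    normalise(w)
    intrinsic_plasticity(thresholds, x_new)
    x = x_new
```

Without `normalise` nothing stops the STDP updates from growing individual weights without bound, and without `intrinsic_plasticity` units can get stuck always on or always off, which is at least consistent with the failure modes mentioned above.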

The authors show that it is beneficial that recurrent connections between excitatory units are sparse.

## Efficient Reductions for Imitation Learning.

Ross, S. and Bagnell, D.
in: JMLR W&CP 9: AISTATS 2010, pp. 661–668, 2010

### Abstract

Imitation Learning, while applied successfully on many large real-world problems, is typically addressed as a standard supervised learning problem, where it is assumed the training and testing data are i.i.d. This is not true in imitation learning as the learned policy influences the future test inputs (states) upon which it will be tested. We show that this leads to compounding errors and a regret bound that grows quadratically in the time horizon of the task. We propose two alternative algorithms for imitation learning where training occurs over several episodes of interaction. These two approaches share in common that the learner’s policy is slowly modified from executing the expert’s policy to the learned policy. We show that this leads to stronger performance guarantees and demonstrate the improved performance on two challenging problems: training a learner to play 1) a 3D racing game (Super Tux Kart) and 2) Mario Bros.; given input images from the games and corresponding actions taken by a human expert and near-optimal planner respectively.

### Review

The authors note that previous approaches to learning a policy from an example policy are limited in the sense that they only see successful examples generated from the desired policy. They will therefore exhibit a larger error than expected from supervised learning of independent samples, because an error can propagate through the series of decisions if the policy hasn’t learnt to recover to the desired policy after an error has occurred. They then show that a lower error can be expected when a Forward Algorithm is used for training, which learns a non-stationary policy successively for each time step. The idea is probably (I’m not too sure) that the data at the time step that is currently learnt contains the errors (that lead to different states) you would usually expect from the learnt policies, because for every time step new data is sampled based on the already learnt policies. They transfer this idea to learning of a stationary policy and propose SMILe (stochastic mixing iterative learning). In this algorithm the stationary policy is a linear combination of policies learnt in previous iterations, where the initial policy is the desired one. The influence of the desired policy decreases exponentially with the number of iterations. The weights of policies learnt in later iterations also decrease exponentially, but each weight stays fixed in subsequent iterations, i.e. the policies learnt first will eventually have the largest weights. This makes sense, because they will most probably be closest to the desired policy (seeing mostly samples produced from the desired policy).
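The weighting scheme described above can be made concrete with a small sketch. It assumes a mixing rate `alpha` (a name chosen here) and that, after n iterations, the desired policy keeps weight (1 - alpha)^n while the policy learnt in iteration j has weight alpha * (1 - alpha)^(j - 1). This is my reading of the description, not code from the paper.

```python
def smile_weights(n_iterations, alpha):
    """Mixture weights after n_iterations under the assumed exponential scheme.

    Returns (weight of the desired/expert policy, weights of the policies
    learnt in iterations 1..n_iterations).
    """
    expert_weight = (1.0 - alpha) ** n_iterations
    learned_weights = [alpha * (1.0 - alpha) ** (j - 1)
                       for j in range(1, n_iterations + 1)]
    return expert_weight, learned_weights


alpha = 0.2  # hypothetical mixing rate
for n in (1, 5, 20):
    expert_weight, learned_weights = smile_weights(n, alpha)
    total = expert_weight + sum(learned_weights)
    print(n, round(expert_weight, 4), [round(w, 4) for w in learned_weights[:3]], round(total, 6))
```

The weight of the policy learnt in iteration j does not depend on n, so it stays fixed once assigned, and the weights always sum to one because the new policy in each iteration takes its share from what the desired policy had left.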

The aim is to make the learnt policy more robust without using too many samples from the desired policy. I really wonder whether you could achieve exactly the same performance by simply additionally sampling the desired policy from randomly perturbed states and adding these as training points to learning of a single policy. Depending on how expensive your learning algorithm is this may be much faster in total (as you only have to learn once on a larger data set). Of course, you then may not have the theoretical guarantees provided in the paper. Another drawback of the approach presented in the paper is that it needs to be possible to sample from the desired policy interactively during the learning. I can’t imagine a scenario where this is practical (a human in the loop?).

I was interested in this, because in an extended abstract to a workshop (see attached files) the authors referred to this approach and also mentioned Langford2009 as a similar learning approach based on local updates. Also you can see the policy as a differential equation, i.e. the results of the paper may also apply to learning of dynamical systems without control inputs. The problems are certainly very similar.

They use a neural network to learn policies in the particular application they consider.