Representation of confidence associated with a decision by neurons in the parietal cortex.

Kiani, R. and Shadlen, M. N.
Science, 324:759–764, 2009
DOI, Google Scholar

Abstract

The degree of confidence in a decision provides a graded and probabilistic assessment of expected outcome. Although neural mechanisms of perceptual decisions have been studied extensively in primates, little is known about the mechanisms underlying choice certainty. We have shown that the same neurons that represent formation of a decision encode certainty about the decision. Rhesus monkeys made decisions about the direction of moving random dots, spanning a range of difficulties. They were rewarded for correct decisions. On some trials, after viewing the stimulus, the monkeys could opt out of the direction decision for a small but certain reward. Monkeys exercised this option in a manner that revealed their degree of certainty. Neurons in parietal cortex represented formation of the direction decision and the degree of certainty underlying the decision to opt out.

Review

The authors used a 2AFC task with an option to waive the decision in favour of a choice providing a low but certain reward (the sure option) to investigate the representation of confidence in LIP neurons. Behaviourally, the sure option had the expected effect: it was chosen increasingly often the harder the decisions were, i.e., the more likely a false response was. Trials in which the sure option was chosen may thus be interpreted as those in which the subject had little confidence in the upcoming decision. It is important to note that task difficulty here was manipulated by providing limited amounts of information for a limited amount of time, i.e., this was not a reaction time task.

The firing rates of the recorded LIP neurons indicate that selection of the sure option is associated with an intermediate level of activity compared to that preceding choices of the actual decision options. For individual trials the authors found that firing rates closer to the mean firing rate (in a short time period before the sure option became available) more frequently led to selection of the sure option than firing rates further away from the mean, but in absolute terms the activity in this time window predicted choice of the sure option only weakly (probability of 0.4). From these results the authors conclude that the LIP neurons which have previously been found to represent evidence accumulation also encode confidence in a decision. They suggest a simple drift-diffusion model with fixed diffusion parameter to explain the results. In addition to standard diffusion models, they define confidence in terms of the log-posterior odds, which they compute from the state of the drift-diffusion model. They define the posterior as p(S_i|v), the probability that decision option i is correct given that the drift-diffusion state (the decision variable) is v. They compute it from the corresponding likelihood p(v|S_i), but don’t state how they obtained that likelihood. In the model, the sure option is chosen when the log-posterior odds falls below a certain level. I don’t see why the detour via the log-posterior odds is necessary. You could directly define v as the posterior for decision option i and still be consistent with all the findings in the paper. Of course, v could then no longer be governed by a linear drift, but why should it be in the first place? The authors keenly promote the Bayesian brain, but stop just before the finishing line. Why?
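A minimal sketch of such a model as I understand it, assuming the likelihoods p(v|S_i) are the Gaussians implied by the diffusion itself (the paper does not say how they were obtained); the threshold theta and all parameter values are made up for illustration:

```python
import numpy as np

def diffuse(drift, sigma=1.0, dt=0.01, t_max=0.8, rng=None):
    """Evolve the decision variable v by linear drift plus diffusion noise."""
    rng = rng or np.random.default_rng()
    n = int(t_max / dt)
    steps = drift * dt + sigma * np.sqrt(dt) * rng.standard_normal(n)
    return steps.sum()  # state v at stimulus offset

def log_posterior_odds(v, drift, sigma=1.0, t=0.8):
    """log p(S_1|v) / p(S_2|v), assuming equal priors and that p(v|S_i)
    are the Gaussians N(+/- drift*t, sigma^2 * t) implied by the diffusion."""
    # log N(v; mu, s2) - log N(v; -mu, s2) simplifies to 2*mu*v/s2
    mu, s2 = drift * t, sigma**2 * t
    return 2.0 * mu * v / s2

def choose(v, drift, theta=1.0):
    """Opt out when the log-posterior odds for both options is below theta."""
    lpo = log_posterior_odds(v, drift)
    if abs(lpo) < theta:
        return "sure option"
    return "S_1" if lpo > 0 else "S_2"
```

States far from zero yield confident choices of S_1 or S_2, while states near zero produce low log-posterior odds and hence the sure option, matching the behavioural pattern described above.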

Robust averaging during perceptual judgment.

de Gardelle, V. and Summerfield, C.
Proc Natl Acad Sci U S A, 108:13341–13346, 2011
DOI, Google Scholar

Abstract

An optimal agent will base judgments on the strength and reliability of decision-relevant evidence. However, previous investigations of the computational mechanisms of perceptual judgments have focused on integration of the evidence mean (i.e., strength), and overlooked the contribution of evidence variance (i.e., reliability). Here, using a multielement averaging task, we show that human observers process heterogeneous decision-relevant evidence more slowly and less accurately, even when signal strength, signal-to-noise ratio, category uncertainty, and low-level perceptual variability are controlled for. Moreover, observers tend to exclude or downweight extreme samples of perceptual evidence, as a statistician might exclude an outlying data point. These phenomena are captured by a probabilistic optimal model in which observers integrate the log odds of each choice option. Robust averaging may have evolved to mitigate the influence of untrustworthy evidence in perceptual judgments.

Review

The authors investigate what influence the variance of evidence has on perceptual decisions. Somewhat counterintuitively, they implement varying evidence by simultaneously presenting elements with different feature values (e.g., colour) to subjects instead of presenting a single element whose feature value changes over time (which would be my naive approach). Perhaps they did this to be able to assume constant evidence over time, such that the standard drift-diffusion model applies. My intuition is that subjects nevertheless sample the stimulus display more sequentially by varying attention across individual elements.

The behavioural results show that subjects take both the mean presented evidence and the variance of evidence into account when making a decision: for larger mean evidence and smaller variance of evidence, subjects are faster and make fewer mistakes. The results are attention dependent: mean and variance in a task-irrelevant feature dimension had no effect on responses.

The behavioural results can be explained by a drift-diffusion model with a drift rate which takes the variance of the evidence into account. The authors present two such drift rates: 1) the SNR drift = mean / standard deviation (as computed from trial-specific feature values), and 2) the LPR drift = mean log posterior ratio (also computed from trial-specific feature values). The two cannot be differentiated based on the measured mean RTs and error rates in the different conditions. So the authors provide an additional analysis which estimates the influence of the different presented elements, that is, the influence of the feature values they present, on the given responses. This is done via generalised linear regression by fitting a model which predicts response probabilities from the presented feature values on individual trials. The fitted linear weights suggest that extreme (outlying) feature values have little influence on the final responses compared to the influence of (inlying) feature values close to the categorisation boundary. Only the LPR model (2) replicates this effect.
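The two drift-rate definitions can be sketched as follows. The sigmoidal posterior here is a made-up stand-in for the one the authors estimated from data (which, per Fig. S5, saturates around 0.2 and 0.8 for extreme feature values); the boundary at 0 and the slope are illustrative assumptions:

```python
import numpy as np

def snr_drift(features):
    """SNR drift: mean over standard deviation of the trial's feature values."""
    f = np.asarray(features, dtype=float)
    return f.mean() / f.std()

def sigmoid_posterior(x, slope=2.0, lo=0.2, hi=0.8):
    """Hypothetical p(category A | feature value): sigmoidal around a
    categorisation boundary at 0, saturating at lo/hi for extreme values."""
    return lo + (hi - lo) / (1.0 + np.exp(-slope * np.asarray(x, dtype=float)))

def lpr_drift(features, posterior=sigmoid_posterior):
    """LPR drift: mean log posterior ratio over the presented elements."""
    p = posterior(features)
    return np.mean(np.log(p / (1.0 - p)))
```

Because the posterior saturates, a pair of extreme values on opposite sides of the boundary contributes log-ratios of equal magnitude that cancel in the mean, whereas values near the boundary do not; this is what makes the two drift definitions behave differently in the regression analysis.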

Why do inlying feature values have greater influence on responses than outlying ones in the LPR model, but not in the other models? The LPR alone would not predict this, because more extreme posterior values yield more extreme LPR values, which then have a greater influence on the mean LPR value, i.e., the drift rate. Therefore, it is not entirely clear to me yet why they find a greater importance of inlying feature values in the generalised linear regression from feature values to responses. The best explanation I currently have is the influence of the estimated posterior values: Fig. S5 shows that the posterior values are constant for sufficiently outlying feature values and only change for inlying feature values, with the greatest change at the feature value defining the categorisation boundary. When mapped through the LPR, the posterior values lead to LPR values following the same sigmoidal form, setting low and high feature values to constants. These constant high and low values may cancel each other out when, on average, they are equally frequent. Then only the inlying feature values may make a lasting contribution to the LPR mean; especially those close to the categorisation boundary, because they tend to lead to larger variation in LPR values which may tip the LPR mean (drift rate) towards one of the two responses. This explanation means that the results depend on the estimated posterior values, in particular that these are set to values of about 0.2, or 0.8, respectively, for a large range of extreme feature values.

I am unsure what conclusions can be drawn from the results. Although the basic behavioural results are clear, it is not surprising that the responses of subjects depend on the variance of the presented evidence. You can define the feature values varying around the mean as noise; more variance then just means more noise, and it is a basic result that people become slower and more error-prone when presented with more noise. Perhaps surprisingly, it is shown here that this also holds when noisy features are presented simultaneously on the screen instead of sequentially over time.

The DDM analysis shows that the drift rate of subjects decreases with increasing variance of evidence. This makes sense and means that subjects become more cautious in their judgements when confronted with larger variance (more noise). But I find the LPR model rather strange. It’s like pressing a Bayesian model into a mechanistic corset. The posterior ratio is an ad hoc construct. Ok, it’s equivalent to the log-likelihood ratio, but why turn it into a posterior ratio then? The vagueness arises from how the task is defined: all information is presented at once, but you want to describe accumulation of evidence over time. Consequently, you have to define some approximate, ad hoc construct (the mean LPR) which you can use to define the temporal integration. That the model based on this construct replicates an aspect of the behavioural data may be an artefact of the particular approximation used (apparently it is important that the estimated posterior values are constant for extreme feature values). So it remains unclear to me whether an LPR-DDM is a good explanation for the processes involved in this case.

Actually, a large part of the paper (cf. title) concerns the finding that extreme feature values appear to have a smaller influence on subject responses than feature values close to the categorisation boundary. This is surprising to me. Although it makes intuitive sense in terms of ‘robust averaging’, I wouldn’t predict it for optimal probabilistic integration of evidence, at least not without further assumptions. Such assumptions are also implicit in the LPR-DDM, about which I’m a bit skeptical anyway. Thus, a good explanation is still needed, in my opinion. Finally, I wonder how reliable the generalised linear regression analysis which led to these results is. On the one hand, the authors report using two different generalised linear models and obtaining equivalent results. On the other hand, they estimate 9 parameters from only one binary response variable, and I wonder how the optimisation landscape looks in this case.

A healthy fear of the unknown: perspectives on the interpretation of parameter fits from computational models in neuroscience.

Nassar, M. R. and Gold, J. I.
PLoS Comput Biol, 9:e1003015, 2013
DOI, Google Scholar

Abstract

Fitting models to behavior is commonly used to infer the latent computational factors responsible for generating behavior. However, the complexity of many behaviors can handicap the interpretation of such models. Here we provide perspectives on problems that can arise when interpreting parameter fits from models that provide incomplete descriptions of behavior. We illustrate these problems by fitting commonly used and neurophysiologically motivated reinforcement-learning models to simulated behavioral data sets from learning tasks. These model fits can pass a host of standard goodness-of-fit tests and other model-selection diagnostics even when the models do not provide a complete description of the behavioral data. We show that such incomplete models can be misleading by yielding biased estimates of the parameters explicitly included in the models. This problem is particularly pernicious when the neglected factors are unknown and therefore not easily identified by model comparisons and similar methods. An obvious conclusion is that a parsimonious description of behavioral data does not necessarily imply an accurate description of the underlying computations. Moreover, general goodness-of-fit measures are not a strong basis to support claims that a particular model can provide a generalized understanding of the computations that govern behavior. To help overcome these challenges, we advocate the design of tasks that provide direct reports of the computational variables of interest. Such direct reports complement model-fitting approaches by providing a more complete, albeit possibly more task-specific, representation of the factors that drive behavior. Computational models then provide a means to connect such task-specific results to a more general algorithmic understanding of the brain.

Review

Nassar and Gold use tasks from their recent experiments (e.g. Nassar et al., 2012) to point to the difficulties of interpreting model fits of behavioural data. The background is that it has become more popular to explain experimental findings (often behaviour) using computational models. But how reliable are those computational interpretations, and how can we ensure that they are valid? I will briefly review what Nassar and Gold did and point out that researchers investigating reward learning with computational models should think about learning rate adaptation in their experiments because, in the light of the present paper, their results may otherwise not be interpretable. Further, I will argue that Nassar and Gold’s appeal for more interaction between modelling and task design is just how science should work in principle.

Background

The considered tasks belong to the popular class of reward learning tasks in which a subject has to learn which choices are rewarded in order to maximise reward. These tasks may be modelled by a simple delta-rule mechanism which updates current (learnt) estimates of reward by an amount proportional to the prediction error, where the exact amount of update is determined by a learning rate. This learning rate is one of the parameters that you want to fit to data. The second parameter Nassar and Gold consider is the ‘inverse temperature’, which quantifies how a subject trades off exploitation (choosing to get reward) against exploration (choosing randomly).
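A minimal sketch of this model class; the delta rule and softmax choice rule are the standard ones, while all parameter values are illustrative:

```python
import numpy as np

def delta_rule_update(value, reward, alpha):
    """Move the reward estimate toward the observed reward; the learning
    rate alpha scales the prediction error."""
    prediction_error = reward - value
    return value + alpha * prediction_error

def softmax_choice_probs(values, beta):
    """Choice probabilities over options; beta is the inverse temperature
    (beta -> 0: random exploration, large beta: greedy exploitation)."""
    z = beta * np.asarray(values, dtype=float)
    z -= z.max()  # numerical stability
    p = np.exp(z)
    return p / p.sum()
```

Fitting this model to choices means finding the alpha and beta under which the softmax probabilities best predict the observed choice sequence.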

Nassar and Gold’s tasks are special because at so-called change points during an experiment the underlying rewards may abruptly change (in addition to smaller variation of reward between single trials). The experimental subject then has to learn the new reward values. Importantly, Nassar and Gold have found that subjects use an adaptive learning rate: when subjects encounter small prediction errors they tend to reduce the learning rate, while they tend to increase it after large prediction errors. However, typical delta-rule learning models assume a fixed learning rate.
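One caricature of such adaptation (the actual scheme in Nassar et al. is derived from change-point probability; the threshold, step size, and bounds here are made up for illustration):

```python
def adapt_learning_rate(alpha, prediction_error, threshold=0.5, step=0.1,
                        lo=0.05, hi=1.0):
    """Increase the learning rate after large prediction errors (a change
    point is likely), decrease it after small ones (estimates seem good)."""
    if abs(prediction_error) > threshold:
        return min(hi, alpha + step)
    return max(lo, alpha - step)
```

Applied after every trial, this makes learning fast right after a change point and increasingly stable as estimates converge, which a fixed learning rate cannot do.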

The issue

The issue discussed in the paper is that it is not easy to detect a problem when fitting a fixed learning rate model to choices that were produced with an adaptive learning rate. As shown in the present paper, this issue results from a redundancy between learning rate adaptiveness (a hyperparameter, or hidden factor) and the inverse temperature with respect to subject choices: a change in learning rate adaptiveness can equivalently be explained by a change in inverse temperature (with fixed learning rate adaptiveness) when such a change is only measured by the choices a subject makes. Statistically, this means that if you were to fit learning rate adaptiveness together with inverse temperature to subject choices, you should find that the two parameters are highly correlated given the data. Even better, if you were to look at the posterior distribution of the two parameters given subject choices, you should observe a large variance of both together with a strong covariance between them. As a statistician you would then report this variance and acknowledge that interpretation may be difficult.

But learning rate adaptiveness is not typically fitted to choices. Instead, only the learning rate itself is fitted given a particular adaptiveness. The relation between adaptiveness and inverse temperature is then hidden from the analysis, and investigators may be fooled into thinking that the combination of fitted learning rate and inverse temperature comprehensively explains the data. Well, it does explain the data, but there are potentially many other explanations of this kind which become apparent when the hidden factor, learning rate adaptiveness, is taken into account.

What does it mean?

The discussed issue exemplifies a general problem of cognitive psychology: that you try to investigate (computational) mechanisms, e.g., decision making, by looking at quite impoverished data, e.g., decisions, which only represent the final product of the mechanisms. So what you do is to guess a mechanism (a model) and see whether it fits the data. In the case of Nassar and Gold there was a prevailing guess which fit the data reasonably well. By investigating decision making in a particular, new situation (environment with change points) they found that they needed to extend that mechanism to account for the new data. However, the extended mechanism now has many explanations for the old impoverished data, because the extended mechanism is more flexible than the old mechanism. To me, this is all just part of the normal progress in science and nothing to be alarmed about in principle. Yet, Nassar and Gold are right to point out that in the light of the extended mechanism fits of the old mechanism to old data may be misleading. Interpreting the parameters of the old mechanism may then be similar to saying that you find that the earth is a disk, because from your window it looks like the ground goes to the horizon in a straight line and then stops.

Conclusion

Essentially, Nassar and Gold try to convince us that when looking at reward learning we should now also take learning rate adaptiveness into account, i.e., that we should interpret subject choices within their extended mechanism. Two questions remain: 1) Do we trust that their extended mechanism is worth pursuing? 2) If yes, what can we do with the old data?

The present paper does not provide evidence that their extended mechanism is a useful model for subject choices (1), because they here assumed that the extended mechanism is true and investigated how you would interpret the new data using the old mechanism. However, their original study and others point to the importance of learning rate adaptiveness [see their refs. 9-11,26-28].

If the extended mechanism is correct, then the present paper shows that the old data is pretty much useless (2), unless learning rate adaptiveness has been, perhaps accidentally, controlled for in previous studies. This is because the old data from previous experiments (probably) does not allow estimating learning rate adaptiveness. Of course, if you can safely assume that the learning rate of subjects stayed roughly fixed in your experiment, for example because prediction errors were very similar during the whole experiment, then the old mechanism with fixed learning rate should still apply and your data is interpretable in the light of the extended mechanism. Perhaps it would be useful to investigate how robust fitted parameters are to varying learning rate adaptiveness in a typical experiment producing old data (here we only see results for experiments designed to induce changes in learning rate through large jumps in mean reward values).

Overall the paper has a very general tone. It tries to discuss the difficulties of fitting computational models to behaviour in general. In my opinion, these things should be clear to anyone in science as they just reflect how science progresses: you make models which need to fit an observed phenomenon and you need to refine models when new observations are made. You progress by seeking new observations. There is nothing special about fitting computational models to behaviour with respect to this.

A supramodal accumulation-to-bound signal that determines perceptual decisions in humans.

O’Connell, R. G., Dockree, P. M., and Kelly, S. P.
Nat Neurosci, 15:1729–1735, 2012
DOI, Google Scholar

Abstract

In theoretical accounts of perceptual decision-making, a decision variable integrates noisy sensory evidence and determines action through a boundary-crossing criterion. Signals bearing these very properties have been characterized in single neurons in monkeys, but have yet to be directly identified in humans. Using a gradual target detection task, we isolated a freely evolving decision variable signal in human subjects that exhibited every aspect of the dynamics observed in its single-neuron counterparts. This signal could be continuously tracked in parallel with fully dissociable sensory encoding and motor preparation signals, and could be systematically perturbed mid-flight during decision formation. Furthermore, we found that the signal was completely domain general: it exhibited the same decision-predictive dynamics regardless of sensory modality and stimulus features and tracked cumulative evidence even in the absence of overt action. These findings provide a uniquely clear view on the neural determinants of simple perceptual decisions in humans.

Review

The authors report EEG signals which may represent 1) instantaneous evidence and 2) accumulated evidence (the decision variable) during perceptual decision making. The result promises a big leap for experiments on perceptual decision making in humans, because it is the first time that we can directly observe the decision process as it accumulates evidence, with reasonable temporal resolution and without sticking needles into participants’ brains. Furthermore, one of the signals found appears to be independent of sensory and response modality, i.e., it appears to reflect the decision process alone – something that has not been clearly found in species other than humans. But let’s discuss the study in more detail.

The current belief about the perceptual decision making process is formalised in accumulation-to-bound models: when presented with a stimulus, the decision maker determines at each time point of the presentation the current amount of evidence for each possible alternative. This estimate of “instantaneous evidence” is noisy, either because of the noise within the stimulus itself or because of internal processing noise. Therefore, the decision maker does not immediately decide between alternatives, but accumulates evidence over time until the accumulated evidence for one of the alternatives reaches a threshold which is set internally by the decision maker and indicates a certain level of certainty, or response urgency. The alternative for which the threshold was crossed is the decision outcome, and the time the threshold was crossed is the decision time (potentially plus an additional delay). The authors argue that they have found signals in the human EEG which can be associated with the instantaneous and accumulated evidence variables of these kinds of models.
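The scheme described above can be sketched as follows for two alternatives; the evidence function, noise level, and bound are illustrative, not fitted to anything in the paper:

```python
import numpy as np

def accumulate_to_bound(evidence_fn, bound=1.0, noise=0.1, dt=0.001,
                        max_t=2.0, rng=None):
    """Integrate noisy instantaneous evidence until one of two symmetric
    bounds is crossed; returns the chosen alternative and the decision time.
    evidence_fn(t) is the momentary evidence for A minus that for B."""
    rng = rng or np.random.default_rng(0)
    x, t = 0.0, 0.0
    while t < max_t:
        x += evidence_fn(t) * dt + noise * np.sqrt(dt) * rng.standard_normal()
        t += dt
        if x >= bound:
            return "A", t
        if x <= -bound:
            return "B", t
    return None, t  # deadline reached without a decision
```

For example, `accumulate_to_bound(lambda t: 5.0, bound=0.5, noise=0.01)` models a trial with strong constant evidence for A and should almost always report "A" well before the deadline.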

The paradigm used in this study was different from the perceptual decision making paradigm popular in monkeys (random dot stimuli). Here the authors used stimuli which did not move, but rather gradually changed their intensity or contrast: In the experiments with visual stimuli, participants were continuously viewing a flickering disk which from time to time gradually changed its contrast with the background (the contrast gradually went back to base level after 1.6s). So the participants had to decide whether they observe a contrast different from baseline at the current time. Note that this setup is slightly different from usual trial-based perceptual decision making experiments where a formally new trial begins after a participant’s response. The disk also had a pattern, but it’s unclear why the pattern was necessary. On the other hand, using the other stimulus properties seems reasonable: The flickering induced something like continuous evoked potentials in the EEG ensuring that something stimulus-related could be measured at all times, but the gradual change of contrast “successfully eliminated sensory-evoked deflections from the ERP trace” such that the more subtle accumulated evidence signals were not masked by large deflections solely due to stimulus onsets. In the experiments with sounds, equivalent stimulus changes were implemented by either gradually changing the volume of a presented, envelope-modulated tone or its frequency.

The authors report four EEG signals related to perceptual decision making. They argue that the occipital steady-state visual-evoked potential (SSVEP) indicated the estimated instantaneous evidence when visual stimuli were used, because its trajectories directly reflected the changes in contrast. For auditory stimuli, the authors found a corresponding steady-state auditory-evoked potential (SSAEP) which was located at more central EEG electrodes and at 40Hz instead of 20Hz (SSVEP). Further, the authors argue that a left-hemisphere beta (LHB, 22-30Hz) and a centro-parietal potential (CPP, direct electrode measurements) could be interpreted as evidence accumulation signals, because the time of their peaks tightly predicted reaction times and their time courses were better predicted by the cumulative SSVEP than by the original SSVEP. LHB and CPP also (roughly) showed the expected dependency on whether the participant correctly identified the target or missed it (lower signals for misses). Furthermore, they reacted as expected when contrast varied in more complex ways than a simple linear decrease (a decrease followed by a short increase followed by another decrease). CPP differed from LHB by also showing the expected changes when the task did not require an overt response at target detection time, whereas LHB showed no relation to the presented evidence in this task, indicating that LHB may have something to do with motor preparation of the response while CPP is a more abstract decision signal. Additionally, the CPP showed the characteristic changes measured with visual stimuli also with auditory stimuli, and it depended on attentional focus: in one experimental condition the task of the participants was altered (‘detect a transient size change of a central fixation square’), but the original disk stimulus was still presented, including the gradual contrast changes. In this ‘non-attend’ condition the SSVEP decreased with contrast as before, but the CPP showed no response, reinforcing the idea that the CPP is an abstract decision signal. On a final note, the authors speculate that the CPP could be equal to the standard P300 signal when transient stimuli need to be detected instead of gradual stimulus changes. This connection, if true, would be a nice functional explanation of the P300.

Open Questions

Despite the generally intriguing results presented in the paper a few questions remain. These predominantly regard details.

1) Omission of data

In Figs. 2 and 3 the SSVEP is no longer shown, presumably because of space restrictions. Similarly, the LHB is not presented in Fig. 4. I can believe that the SSVEP behaved as expected in the different conditions of Figs. 2 and 3, such that not much information would have been added by providing the plots, but it would at least be interesting to know whether the accumulated SSVEP still predicted the LHB and CPP better than the original SSVEP in these conditions. Likewise, the authors do not report the equivalent analysis for the SSAEP in the auditory conditions. Regarding the omission of the LHB in Fig. 4, I’m not so certain about the behaviour of the LHB in the auditory conditions. It seems possible that the LHB behaves differently for different modalities. There is no mention of this in the text, though.

2) Is there a common threshold level?

The authors argue that the LHB and CPP reached a common threshold level just before response initiation (a prediction of accumulation-to-bound models, Fig. 1c), but the test used does not entirely convince me: they compared the variance just before response initiation with the variance of measurements across different time points (they randomly assigned the RT of one trial to another trial and computed the variance of measurements at the shuffled time points). For a strongly varying function of time, it is no surprise that measurements at a consistent time point vary less than measurements made across many different time points, as long as the measurement noise is small enough. Based on this argument, it is strange that they did not find a significant difference for the SSVEP, which also varies strongly across time (though this fits their interpretation), but this lack of difference could be explained by larger measurement noise associated with the SSVEP.
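To make the objection concrete, here is a sketch of the shuffling test as I read it, applied to made-up response-locked ramps that explicitly lack a common bound (each trial peaks at its own level); the signal shapes, RTs, and peak levels are all invented:

```python
import numpy as np

def variance_at_rt_vs_shuffled(signals, rt_idx, rng=None):
    """Variance of each trial's signal sampled at its own RT versus at an
    RT shuffled in from another trial (the authors' test, as I read it)."""
    rng = rng or np.random.default_rng(1)
    at_rt = np.array([s[i] for s, i in zip(signals, rt_idx)])
    at_shuf = np.array([s[i] for s, i in zip(signals, rng.permutation(rt_idx))])
    return at_rt.var(), at_shuf.var()

# Made-up ramps *without* a common bound: each trial ramps up to its own
# peak level (between 0.8 and 1.2) exactly at its own RT, then stays flat.
rng = np.random.default_rng(0)
n = 200
t = np.arange(200)
rt_idx = rng.integers(50, 180, n)
peaks = rng.uniform(0.8, 1.2, n)
signals = [np.minimum(t / r, 1.0) * p for r, p in zip(rt_idx, peaks)]

v_rt, v_shuf = variance_at_rt_vs_shuffled(signals, rt_idx)
# v_rt comes out clearly smaller than v_shuf even though no common
# threshold level exists, simply because the ramps are steep and
# time-locked to the response.
```

So a significant variance reduction at the true RTs is consistent with, but does not establish, a common bound.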

Furthermore, the authors themselves report a significant difference between the sizes of CPP peaks around decision time for varying contrast levels (Fig. 2c). In particular, the CPP peak for false alarms (no contrast change, but a participant response) was lower than the other peaks. If the CPP really is the decision variable predicted by the models, then these differences should not have occurred. So where do they come from? The authors provide arguments that I cannot follow without further explanation.

3) Timing of peaks

It appears that the mean reaction time slightly precedes the peaks of the mean signals. The effect is particularly clear in Fig. 3b (CPP), Fig. 4d (CPP) and Fig. 5a, but is also slightly visible in the averages centred at the time of response in Figs. 1c and 2c. Presuming a delay from internal decision time to actual response, the peak of the decision variable should precede the reaction time, especially when reaction time is measured from button presses (as here) rather than from saccade initiation (as in typical monkey experiments). So why does it appear to be the other way round here?

4) Variance of SSVEP baseline

The SSVEP in Fig. 4a is in a different range (1.0-1.3) than the SSVEP in Fig. 4d (1.7-2.5) even though the two plots should each contain a time course for the same experimental condition. Where does the difference come from?

5) Multiple alternatives

The CPP, as described by the authors, is a single, global signal of a decision variable. If the decision problem is composed of only two decision alternatives, a single decision variable is indeed sufficient for decision making, but if more alternatives are considered, several evidence accumulating variables are needed. What would the CPP then signal? One of the decision variables? The total amount of certainty of the upcoming decision?

Conclusion

I do like the results in the paper. If they hold up, the CPP may provide a high-temporal-resolution window into the decision processes of humans. As a result, it may allow us to investigate decision processes in more complex situations than those which animals can master – or maybe it is only a signal for the simple, perceptual decisions investigated here. Based on the above open questions I also suspect that the reported signals were noisier than the plots would make us believe, and that the correspondence of the CPP with theoretical decision variables should be examined further.

Perceptions as hypotheses: saccades as experiments.

Friston, K., Adams, R. A., Perrinet, L., and Breakspear, M.
Front Psychol, 3:151, 2012
DOI, Google Scholar

Abstract

If perception corresponds to hypothesis testing (Gregory, 1980); then visual searches might be construed as experiments that generate sensory data. In this work, we explore the idea that saccadic eye movements are optimal experiments, in which data are gathered to test hypotheses or beliefs about how those data are caused. This provides a plausible model of visual search that can be motivated from the basic principles of self-organized behavior: namely, the imperative to minimize the entropy of hidden states of the world and their sensory consequences. This imperative is met if agents sample hidden states of the world efficiently. This efficient sampling of salient information can be derived in a fairly straightforward way, using approximate Bayesian inference and variational free-energy minimization. Simulations of the resulting active inference scheme reproduce sequential eye movements that are reminiscent of empirically observed saccades and provide some counterintuitive insights into the way that sensory evidence is accumulated or assimilated into beliefs about the world.

Review

In this paper Friston et al. introduce the notion that an agent (such as the brain) minimises uncertainty about its state in the world by actively sampling those states which minimise the uncertainty of the agent’s posterior beliefs, when visited some time in the future. The presented ideas can also be seen as a reply to the commonly formulated dark-room critique of Friston’s free energy principle, which states that under the free energy principle an agent would try to find a dark, stimulus-free room in which sensory input can be perfectly predicted. Here, I review these ideas together with the technical background (see also a related post about Friston et al., 2011). Although I find the presented theoretical argument very interesting and sound (and compatible with other proposals for the origin of autonomous behaviour), I do not think that the presented simulations conclusively show that the extended free energy principle, as instantiated by the particular model chosen in the paper, leads to the desired exploratory behaviour.

Introduction: free energy principle and the dark room

Friston’s free energy principle has gained considerable momentum in the field of cognitive neuroscience as a unifying framework under which many cognitive phenomena may be understood. Its main axiom is that an agent tries to minimise the long-term uncertainty about its state in the world by executing actions which make prediction of changes in the agent’s world more precise, i.e., which minimise surprises. In other words, the agent tries to maintain a sort of homeostasis with its environment.

While homeostasis is a concept which most people happily associate with bodily functions, it is harder to reconcile with cognitive functions which produce behaviour. Typically, the counter-argument for the free energy principle is the dark-room-problem: changes in a dark room can be perfectly predicted (= no changes), so shouldn’t we all just try to lock ourselves into dark rooms instead of frequently exploring our environment for new things?

The shortcoming of the dark-room argument is that an agent cannot maintain homeostasis in a dark room, because, for example, its bodily functions will stop working properly after some time without water. There may be many more environmental factors which could disturb the agent’s dark-room pleasure. An experienced agent knows this and has developed a corresponding model of its world which tells it that the state of the world becomes increasingly uncertain as long as the agent only samples a small fraction of the world’s state space, as is the case when you sit in a dark room and don’t notice what happens outside of the room.

The present paper formalises this idea. It assumes that an agent only observes a small part of the world in its local surroundings, but also maintains a more comprehensive model of its world. To decrease uncertainty about the global state of the world, the agent then explores other parts of the state space which it believes to be informative according to its current estimate of the global world state. In the remainder I will present the technical argument in more detail, discuss the supporting experiments and conclude with my opinion about the presented approach.

Review of theoretical argument

In previous publications Friston postulated that agents try to minimise the entropy of the world states which they encounter in their life and that this minimisation is equivalent to minimising the entropy of their sensory observations (by essentially assuming that the state-observation mapping is linear). The sensory entropy can be estimated by the average of sensory surprise (negative model evidence) across (a very long) time. So the argument goes that an agent should minimise sensory surprise at all times. Because sensory surprise cannot usually be computed directly, Friston suggests a variational approximation in which the posterior distribution over world states (posterior beliefs) and model parameters is separated. Further, the posterior distributions are approximated with Gaussian distributions (Laplace approximation). Then, minimisation of surprise is approximated by minimisation of Friston’s free energy. This minimisation is done with respect to the posterior over world states and with respect to action. The former corresponds to perception and ensures that the agent maintains a good estimate of the state of the world and the latter implements how the agent manipulates its environment, i.e., produces behaviour. While the former is a particular instantiation of the Bayesian brain hypothesis, and hence not necessarily a new idea, the latter had not previously been proposed and subsequently spurred some controversy (cf. above).

At this point it is important to note that the action variables are defined on the level of primitive reflex arcs, i.e., they directly control muscles in response to unexpected basic sensations. Yet, the agent can produce arbitrary complex actions by suitably setting sensory expectations which can be done via priors in the model of the agent. In comparison with reinforcement learning, the priors of the agent about states of the world (the probability mass attributed by the prior to the states), therefore, replace values or costs. But how does the agent choose its priors? This is the main question addressed by the present paper, however, only in the context of a freely exploring (i.e., task-free) agent.

In this paper, Friston et al. postulate that an agent minimises the joint entropy of world states and sensory observations instead of only the entropy of world states. Because the joint entropy is the sum of sensory entropy and conditional entropy (world states conditioned on sensory observations), the agent needs to implement two minimisations. The minimisation of sensory entropy is exactly the same as before implementing perception and action. However, conditional entropy is minimised with respect to the priors of the agent’s model, implementing higher-level action selection.
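The split the argument relies on is just the chain rule for entropy, writing S for the hidden world states and O for the sensory observations:

```latex
H(S, O) = H(O) + H(S \mid O)
```

so minimising the joint entropy decomposes into the two separate minimisations described above.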

In Friston’s dynamic free energy framework (and other filters) priors correspond to predictive distributions, i.e., distributions over the world states some time in the future given their current estimate. Friston also assumes that the prior densities are Gaussian. Hence, priors are parameterised by their mean and covariance. To manipulate the probability mass attributed by the prior to the states he thus has to change the prior mean or covariance of the world states. In the present paper the authors use a fixed covariance (as far as I can tell) and implement changes in the prior by manipulating its mean. They do this indirectly by introducing new, independent control variables (“controls” from here on) which parameterise the dynamics of the world states without having dynamics of their own. The controls are treated like the other hidden variables in the agent model and their values are inferred from the sensory observations via free energy minimisation. However, I guess that the idea is to more or less fix the controls to their prior means, because the second entropy minimisation, i.e., minimisation of the conditional entropy, is with respect to these prior means. Note that the controls are pretty arbitrary and can only be interpreted once a particular model is considered (as is the case for the remaining variables mentioned so far).

As with the sensory entropy, the agent has no direct access to the conditional entropy. However, it can use the posterior over world states given by the variational approximation to approximate the conditional entropy, too. In particular, Friston et al. suggest approximating the conditional entropy using a predictive density which looks ahead in time from the current posterior and which they call the counterfactual density. The entropy of this counterfactual density tells the agent how much uncertainty about the global state of the world it can expect in the future based on its current estimate of the world state. The authors do not specify how far in the future the counterfactual density looks. They use the terminological trick of calling negative conditional entropy ‘saliency’ to make the correspondence between the suggested framework and experimental variables in their example more intuitive, i.e., minimisation of conditional entropy becomes maximisation of saliency. The actual implementation of this nonlinear optimisation is computationally demanding. In particular, it will be very hard to find global optima using gradient-based approaches. In this paper Friston et al. bypass this problem by discretising the space spanned by the controls (which are the variables with respect to which they optimise), computing the conditional entropy at each discrete location and simply selecting the location with minimal entropy, i.e., they do grid search.
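The grid search over control priors can be sketched in a few lines. The counterfactual-entropy stand-in below is a toy of my own making (a Gaussian predictive density whose uncertainty grows with the distance of the candidate control from the current state estimate), not the paper's actual generative model; only the selection logic reflects what the authors describe.

```python
import numpy as np

def counterfactual_entropy(posterior_mean, posterior_cov, control):
    """Entropy of a toy counterfactual (predictive) density over hidden
    states, were the agent to adopt `control` as the prior mean.
    Assumption: predictive uncertainty grows with the squared distance
    of the control from the current estimate."""
    d = len(posterior_mean)
    predicted_cov = posterior_cov + np.eye(d) * np.sum((control - posterior_mean) ** 2)
    # Gaussian entropy: 0.5 * log det(2*pi*e*Sigma)
    return 0.5 * (d * np.log(2 * np.pi * np.e) + np.linalg.slogdet(predicted_cov)[1])

def select_control(posterior_mean, posterior_cov, grid):
    """Grid search: evaluate the conditional entropy at each candidate
    control and pick the minimiser (= maximal 'saliency')."""
    entropies = [counterfactual_entropy(posterior_mean, posterior_cov, c) for c in grid]
    return grid[int(np.argmin(entropies))]
```

In this toy the minimal-entropy control is simply the one closest to the current estimate; in the paper the landscape comes from the full generative model, which is exactly why gradient-based optimisation would be hard and grid search is used instead.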

In summary, the present paper extends previous versions of Friston’s free energy principle by adding prior selection (or, say, high-level action) to perception and action. This is done by adding new control variables representing high-level actions and setting these variables via a new optimisation which minimises future uncertainty about the state of the world. The descriptions in the paper implicitly suggest that the three processes happen sequentially: first the agent perceives to get the best estimate of the current world state, then it acts to take the world state closer to its expectations, and then it reevaluates expectations and thus sets high-level actions (goals). However, Friston’s formulations are in continuous time, such that all these processes supposedly happen in parallel. For perception and action alone this already leads to unexpected interactions. (Do you rather perceive the true state of the world as it is, or change it such that it corresponds to your expectations?) Adding control variables certainly doesn’t reduce this problem if their values are also inferred (perceived); but if perception cannot change them, only action can reduce the part of the free energy they contribute, thereby disentangling perception and action again. Therefore, the new control variables may be a necessary extension, if used properly. To me it does not seem plausible that high-level actions are reevaluated continuously. Shouldn’t you wait until, e.g., a goal is reached? Such a mechanism is still missing in the present proposal. Instead the authors simply reevaluate high-level actions (minimise conditional entropy with respect to control variable priors) at fixed, ad-hoc intervals spanning sufficiently large amounts of time.

Review of presented experiments (saccade model)

To illustrate the theoretical points, Friston et al. present a model for saccadic eye movements. This model is very basic and is only supposed to show in principle that the new minimisation of conditional entropy can provide sensible high-level action. The model consists of two main parts: 1) the world, which defines how sensory input changes based on the true underlying state of the world and 2) the agent, which defines how the agent believes the world behaves. In this case, the state of the world is the position in a viewed image which is currently fixated by the eye of the agent. This position, hence, determines what input the visual sensors of the agent currently get (the field of view around the fixation position is restricted), but additionally there are proprioceptive sensors which give direct feedback about the position. Action changes the fixation position. The agent has a similar, but extended model of the world. In it, the fixation position depends on the hidden controls. Additionally, the model of the agent contains several images such that the agent has to infer what image it sees based on its sensory input.

In Friston’s framework, inference results heavily depend on the setting of prior uncertainties of the agent. Here, the agent is assumed to have certain proprioception, but uncertain vision such that it tends to update its beliefs of what it sees (which image) rather than trying to update its beliefs of where it looks. [I guess, this refers to the uncertainties of the hidden states and not the uncertainties of the actual sensory input which was probably chosen to be quite certain. The text does not differentiate between these and, unfortunately, the code was not yet available within the SPM toolbox at the time of writing (08.09.2012).]

As mentioned above, every 16 time steps the prior for the hidden controls of the agent is recomputed by minimising the conditional entropy of the hidden states given sensory input (minimising the uncertainty over future states given the sensory observations up to that time point). This is implemented by defining a grid of fixation positions and computing the entropy of the counterfactual density (uncertainty of future states) while setting the mean of the prior to one of the positions. In effect, this translates for the agent into: ‘Use your internal model of the world to simulate how your estimate of the world will change when you execute a particular high-level action. (What will be your beliefs about what image you see, when fixating a particular position?) Then choose the high-level action which reduces your uncertainty about the world most. (Which position gives you most information about what image you see?)’ Up to here, the theoretical ideas were self-contained and derived from first principles, but then Friston et al. introduce inhibition of return to make their results ‘more realistic’. In particular, they introduce an inhibition of return map which is a kind of fading memory of which positions were previously chosen as saccade targets and which is subtracted from the computed conditional entropy values. [The particular form of the inhibition of return computations, especially the initial subtraction of the minimal conditional entropy value, is not motivated by the authors.]
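My reading of the inhibition-of-return bookkeeping, as a toy sketch. The decay constant and the exact update rule are my assumptions, not taken from the paper (which, as noted above, does not fully motivate its computations); the point is only the mechanism of a fading memory of past targets subtracted from the saliency map.

```python
import numpy as np

def choose_fixation(saliency, ior_map, decay=0.5):
    """Pick the most salient position after discounting recently visited
    ones, then update the fading inhibition-of-return memory.
    `saliency` is negative conditional entropy on a grid of positions."""
    effective = saliency - ior_map
    target = int(np.argmax(effective))
    ior_map = decay * ior_map            # old inhibition fades away
    ior_map[target] += saliency[target]  # newly visited position is inhibited
    return target, ior_map
```

With a static saliency map this reproduces the behaviour I suspect inhibition of return was added for: the agent cycles through the most salient positions instead of fixating the single best one forever.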

For the presented experiments the authors use an agent model which contains three images as hypotheses of what the agent observes: a face and its 90° and 180° rotated versions. The first experiment is supposed to show that the agent can correctly infer which image it observes by making saccades to low conditional entropy (‘salient’) positions. The second experiment is supposed to show that, when an image is observed which is unknown to the agent, the agent cannot be certain of which of the three images it observes. The third experiment is supposed to show that the uncertainty of the agent increases when high entropy high-level actions are chosen instead of low entropy ones (when the agent chooses positions which contain very little information). I’ll discuss them in turn.

In the first experiment, the presented posterior beliefs of the agent about the identity of the observed image show that the agent indeed identifies the correct image and becomes more certain about it. Figure 5 of the paper also shows us the fixated positions and inhibition of return adapted conditional entropy maps. The presented ‘saccadic eye movements’ are misleading: the points only show the stabilised fixated positions and the lines only connect these without showing the large overshoots which occur according to the plot of ‘hidden (oculomotor) states’. Most critically, however, it appears that the agent already had identified the right image with relative certainty before any saccade was made (time until about 200ms). The results, therefore, do not clearly show that the saccade selection is beneficial for identifying the observed image, also because the presented example is only a single trial with a particular initial fixation point and with a noiseless observed image. Also, because the image was clearly identified very quickly, my guess is that the conditional entropy maps would be very similar after each saccade without inhibition of return, i.e., always the same fixation position would be chosen and no exploratory behaviour (saccades) would be seen, but this needs to be confirmed by running the experiment without inhibition of return. My overall impression of this experiment is that it presents a single, trivial example which does not allow me to draw general conclusions about the suggested theoretical framework.

The second experiment acts as a sanity check: the agent shouldn’t be able to identify one of its three images when it observes a fourth one. Whether the experiment shows that depends on the interpretation of the inferred hidden states. The way these states were defined, their values can be directly interpreted as the probability of observing one of the three images. If only these are considered, the agent appears to be very certain at times (it doesn’t help that the scale of the posterior belief plot in Figure 6 is 4 times larger than that of the same plot in Figure 5). However, the posterior uncertainty directly associated with the hidden states does appear to be considerably larger than in experiment 1, although, again, this is only a single example. Something rather strange: the sequence of fixation positions is almost exactly the same as in experiment 1 even though the observed image and the resulting posterior beliefs were completely different. Why?

Finally, experiment three is more like a thought experiment: what would happen, if an agent chooses high-level actions which maximise future uncertainty instead of minimising it. Well, the uncertainty of the agent’s posterior beliefs increases as shown in Figure 7, which is the expected behaviour. One thing that I wonder, though, and it applies to the presented results of all experiments: In Friston’s Bayesian filtering framework the uncertainty of the posterior hidden states is a direct function of their mean values. Hence, as long as the mean values do not change, the posterior uncertainty should stay constant, too. However, we see in Figure 7 that the posterior uncertainty increases even though the posterior means stay more or less constant. So there must be an additional (unexplained) mechanism at work, or we are not shown the distribution of posterior hidden states, but something slightly different. In both cases, it would be important to know what exactly resulted in the presented plots to be able to interpret the experiments in the correct way.

Conclusion

The paper presents an important theoretical extension to Friston’s free energy framework. This extension consists of adding a new layer of computations which can be interpreted as a mechanism for how an agent (autonomously) chooses its high-level actions. These high-level actions are defined in terms of desired future states, encoded by the probability mass which the prior state distribution assigns to these states. Conceptually, these ideas translate into choosing maximally informative actions given the agent’s model of the world and its current state estimate. As discussed by Friston et al., such approaches to action selection are not new (see also Tishby and Polani, 2011). So the authors’ contribution is to show that these ideas are compatible with Friston’s free energy framework. Hence, on the abstract, theoretical level this paper makes sense. It also provides a sound theoretical argument for why an agent would not seek sensory deprivation in a dark room, as feared by critics of the free energy principle. However, the presented framework relies heavily on the agent’s model of the world, and it leaves open how the agent has attained this model. Although the free energy principle also provides a way for the agent to learn parameters of its model, I still, for example, haven’t seen a convincing application in which the agent actually learnt the dynamics of an unknown process in the world. Probably Friston would here also refer to evolution as providing a good initialisation for the process dynamics, but I find that too cheap a way out.

From a technical point of view the paper leaves a few questions open, for example: How far does the counterfactual distribution look into the future? What does it mean for high-level actions to change how far the agent looks into its subjective future? How well does the presented approach scale? Is it important to choose the global minimum of the conditional entropy (this would be bad, as it’s probably extremely hard to find in a general setting)? When, or how often, does the agent minimise conditional entropy to set high-level actions? What happens with more than one control variable (several possible high-level actions)? How can you model discrete high-level actions in Friston’s continuous Gaussian framework? How do results depend on the setting of prior covariances / uncertainties? And many more.

Finally, I have to say that I find the presented experiments quite poor. Although providing the agent with a limited field of view, such that it has to explore different regions of a presented image, is a suitable setting for testing the proposed ideas, the trivial example and the ad-hoc introduction of inhibition of return make it impossible to judge whether the underlying principle is successfully at work or whether the simulations have been engineered to work in this particular case.

Probabilistic population codes for Bayesian decision making.

Beck, J. M., Ma, W. J., Kiani, R., Hanks, T., Churchland, A. K., Roitman, J., Shadlen, M. N., Latham, P. E., and Pouget, A.
Neuron, 60:1142–1152, 2008
DOI, Google Scholar

Abstract

When making a decision, one must first accumulate evidence, often over time, and then select the appropriate action. Here, we present a neural model of decision making that can perform both evidence accumulation and action selection optimally. More specifically, we show that, given a Poisson-like distribution of spike counts, biological neural networks can accumulate evidence without loss of information through linear integration of neural activity and can select the most likely action through attractor dynamics. This holds for arbitrary correlations, any tuning curves, continuous and discrete variables, and sensory evidence whose reliability varies over time. Our model predicts that the neurons in the lateral intraparietal cortex involved in evidence accumulation encode, on every trial, a probability distribution which predicts the animal’s performance. We present experimental evidence consistent with this prediction and discuss other predictions applicable to more general settings.

Review

In this article the authors apply probabilistic population coding as presented in Ma et al. (2006) to perceptual decision making. In particular, they suggest a hierarchical network with an MT and an LIP layer in which the firing rates of MT neurons encode the current evidence for a stimulus while the firing rates of LIP neurons encode the evidence accumulated over time. Under the stated assumptions it turns out that the accumulated evidence is independent of nuisance parameters of the stimuli (when these can be interpreted as contrasts) and that LIP neurons only need to sum (integrate) the activity of MT neurons in order to represent the correct posterior of the stimulus given the history of evidence. They also suggest a readout layer implementing a line attractor which, under some conditions, reads out the maximum of the posterior.

Details

Probabilistic population coding is based on defining the likelihood of stimulus features p(r|s,c) as an exponential family distribution of firing rates r. A crucial requirement for the central result of the paper (that LIP only needs to integrate the activity of MT) is that the nuisance parameters c of the stimulus s do not occur in the exponent, while the stimulus s itself occurs only in the exponent. This restricts the exponential family distribution to the “Poisson-like family”, as they call it, which requires that the tuning curves of the neurons and their covariance scale with the nuisance parameter c (details need to be read up in Ma et al., 2006). The point is that this is the case when c corresponds to the contrast, or gain, of the stimulus. For the considered random dot stimuli the coherence of the dots may indeed be interpreted as the contrast of the motion, in the sense that I can imagine the tuning curves of MT neurons being multiplicatively related to the coherence of the dots.
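To make the gain-invariance concrete, here is a toy version with independent Poisson neurons whose tuning curves tile the stimulus space and are multiplicatively scaled by a gain c. With a flat prior over s, the c-dependent terms in the log-likelihood are (almost) independent of s, so the inferred stimulus barely depends on the gain. Tuning-curve shapes and all numbers are my own choices, not the paper's.

```python
import numpy as np

def tuning(s, preferred, width=0.5):
    # Gaussian tuning curves with a small baseline; densely tiling the
    # space makes the summed tuning roughly independent of s
    return np.exp(-0.5 * ((s - preferred) / width) ** 2) + 0.05

def log_posterior(r, s_grid, preferred, c=1.0):
    """log p(s|r) up to a constant, for independent Poisson neurons with
    mean rates c * tuning(s, preferred) and a flat prior over s."""
    logp = np.array([
        np.sum(r * np.log(c * tuning(s, preferred)) - c * tuning(s, preferred))
        for s in s_grid
    ])
    return logp - logp.max()
```

Running this with spike counts generated at one gain and decoding at another gives essentially the same posterior peak, which is the sense in which the code is invariant to contrast.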

The probabilistic model of the network activities is set up such that the firing of neurons in the network is an indirect, noisy observation of the underlying stimulus, but what we are really interested in is the posterior of the stimulus. So the question is how you can estimate this posterior from the network firing rates. The trick is that under the Poisson-like distribution the likelihood and the posterior share the same exponent, such that the posterior becomes proportional to this exponential term, because the other parts of the likelihood do not depend on the stimulus s (they assume a flat prior over s, so it drops out of the posterior). Thus, the probability of firing in the network is determined by the likelihood while the resulting firing rates simultaneously encode the posterior. Mind-boggling. The main contribution of the authors then is to show, assuming that the firing rates of MT neurons are driven by the stimulus via the corresponding Poisson-like likelihood, that LIP neurons only need to integrate the spikes of MT neurons in order to correctly represent the posterior of the stimulus given all previous evidence (firing of MT neurons). Notice that they also assume that LIP neurons have the same tuning curves with respect to the stimulus as MT neurons and that each LIP neuron sums the activity of the MT neuron with which it shares a tuning curve. They note that a naive procedure like that, i.e. a single neuron integrating MT firing over time, would quickly saturate its activity. So they show, and that is really cool, that global inhibition in the LIP network does not affect the representation of the posterior, allowing them to prevent saturation of firing while maintaining the probabilistic interpretation.
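The global-inhibition point can be illustrated in a few lines: if the posterior is read out as p(s|r) proportional to exp(sum_i h_i(s) r_i), and the kernels are constructed so that sum_i h_i(s) is the same for every s, then subtracting a uniform amount of activity from all neurons shifts every log-posterior value by the same constant and leaves the normalised posterior exactly unchanged. The kernel construction below is a toy of my own making, not the paper's network.

```python
import numpy as np

def posterior(r, H):
    """Read out p(s|r) proportional to exp(H @ r); row k of H holds the
    kernel values h_i(s_k) for all neurons i."""
    logp = H @ r
    logp = logp - logp.max()   # numerical stability
    p = np.exp(logp)
    return p / p.sum()

# Toy kernels that 'tile': subtracting each row's mean makes every row
# sum to the same value (zero), i.e. sum_i h_i(s) is independent of s.
s_vals = np.linspace(-1, 1, 5)
centres = np.linspace(-1, 1, 8)
H = np.exp(-0.5 * ((s_vals[:, None] - centres[None, :]) / 0.4) ** 2)
H = H - H.mean(axis=1, keepdims=True)
```

With this H, `posterior(r, H)` and `posterior(r - c, H)` coincide for any uniform inhibition level c, which is how the network can keep firing rates bounded without corrupting the encoded posterior.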

So much for the theory. In practice, i.e. in the experiments, the authors do something entirely different, because “these results are important, but they are based on assumptions that are not necessarily exactly true in vivo. […] It is therefore essential that we test our theory in biologically realistic networks.” Now, this is a noble aim, but what exactly do we learn about the theory if all results are obtained using methods which violate its assumptions? For example, neither in MT nor in LIP is the probability of firing Poisson-like; LIP neurons do not just integrate MT activity but are also recurrently connected; LIP neurons have local inhibition (they are leaky integrators, with inhibition between LIP neurons depending on tuning properties) instead of global inhibition; and LIP neurons have an otherwise completely unmotivated “urgency signal” whose contribution increases with time (this stems from experimental observations). Without any concrete theoretical links between the two models (I guess the main ideas are similar, but the details are very different), their similarity has to be established from experimental results. In any case, it is hard to differentiate between contributions from the probabilistic theory and from the network implementation, i.e., how much of the fit between experimental findings in monkeys and the behaviour of the model is due to the chosen implementation and how much is due to the probabilistic interpretation?

Results

The overall aim of the experiments / simulations in the paper is to show that the proposed probabilistic interpretation is compatible with the experimental findings in monkey LIP. The hypothesis is that LIP neurons encode the posterior of the stimulus as suggested by the theory. This hypothesis is false from the start, because some assumptions of the theory apparently don’t apply to real neurons (as acknowledged by the authors). So the new hypothesis is that LIP neurons approximately encode some posterior of the stimulus. The requirement for this posterior is that its updates should take the uncertainty of the evidence and the uncertainty of the previous estimate of the posterior into account, which the authors measure as a linear increase of the log odds of making a correct choice, log[ p(correct) / (1-p(correct)) ], with time, together with the dependence of the slope of this linear increase on the coherence (contrast) of the stimulus. I did not follow up why the previous requirement is related to the log odds in this way, but it sounds ok. There remains the question of how to estimate the log odds from simulated and real neurons. For the simulated neurons the authors approximate the likelihood with a Poisson-like distribution whose kernel (parameters) was estimated from the simulated firing rates. They argue that it is a good approximation, because linear estimates of the Fisher information appear to be sufficient (I can’t comment on the validity of this argument). A similar approximation of the posterior cannot be done for real LIP neurons, because of a lack of multi-unit recordings which estimate the response of the whole LIP population. Instead, the authors approximate the log odds from measured firing rates of neurons tuned to motion in directions 0 and 180 degrees via a linear regression approach described in the supplemental data.
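As a sanity check on the linearity claim, a toy accumulator with Gaussian evidence samples shows the expected log odds of the correct direction growing linearly in time with a coherence-dependent slope: for samples x ~ N(mu, 1) versus N(-mu, 1), the log likelihood ratio after t samples is 2*mu*sum(x), whose mean is 2*mu^2*t. This is purely illustrative and not the paper's network model; 'coherence' here is just the drift mu.

```python
import numpy as np

def mean_log_odds(coherence, n_steps, n_trials=2000, seed=0):
    """Average log odds of the correct direction after accumulating
    t = 1..n_steps evidence samples drawn from N(coherence, 1).
    Under this toy model the log likelihood ratio after t samples is
    2 * coherence * sum(x), with mean 2 * coherence**2 * t."""
    rng = np.random.default_rng(seed)
    x = rng.normal(coherence, 1.0, size=(n_trials, n_steps))
    return (2 * coherence * np.cumsum(x, axis=1)).mean(axis=0)
```

Plotting the returned curves for a few coherence levels gives straight lines through the origin whose slopes scale with the squared coherence, which is qualitatively the pattern the authors look for in the simulated and measured log odds.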

The authors show that the log-odds computed from the simulated network exhibit the desired properties, i.e., the log-odds increase linearly with time (although there’s a kink at 50ms which supposedly is due to the discretisation) and depend on the coherence of the motion such that the slope of the log-odds increases when coherence is increased within a trial. The corresponding log-odds of real LIP neurons are far noisier and thus do not allow definite judgements about linearity. Also, we don’t know whether their slopes would actually change after a change in motion coherence during a trial, as this was never tested (it’s likely, though).

In order to test whether the proposed line attractor network is sufficient to read out the maximum of the posterior in all conditions (readout time and motion coherence), the authors compare a single (global) readout with local readouts adapted to a particular pair of readout time and motion coherence. However, the authors don’t actually use attractor networks in these experiments but note that these are equivalent to local linear estimators, and so use those instead. Instead of comparing the readouts from these estimators with the actual maximum of the posterior, they only compare the variance of the estimators (Fisher information), which they show to be roughly the same for the local and global estimators. From this they conclude that a single, global attractor network could read out the maximum of the (approximated) posterior. However, this is only true if there is no additional bias of the global estimator, which we cannot see from these results.

In an additional analysis the authors show that the model qualitatively replicates the behavioural variables (probability correct and reaction time). However, these are determined from the LIP activities in a surprisingly ad-hoc way: the decision time is determined as the time when any one of the simulated LIP neurons reaches a threshold defined on the probability of firing, and the decision is determined as the preferred direction of the neuron hitting the threshold (for 2- and 4-choice tasks the response is determined as the quadrant of the motion direction in which the preferred direction of the neuron falls). Why do the authors not use the attractor network to read out the response here? Also, the authors use a lower threshold for the 4-choice task than for the 2-choice task. This is strange, because one of the main findings of the Churchland et al. (2008) paper was that the decision in both 2- and 4-choice tasks appears to be determined by a common decision threshold, while the initial firing rates of LIP neurons were lower for 4-choice tasks. Here, they also initialise with lower firing rates in the 4-choice task, but additionally choose a lower threshold. They don’t motivate this. Maybe it was necessary to fit the data from Churchland et al. (2008). This discrepancy between data and model is even more striking as the authors of the two papers partially overlap. So, do they deem the corresponding findings of Churchland et al. (2008) not important enough to be modelled, is it impossible to model them within their framework, or did they simply forget?

Finally, the build-up rates of LIP neurons also seem to be qualitatively similar in the simulation and the data, although they are consistently lower in the model. The build-up rates for the model are estimated from the first 50 ms of each trial. However, the log-odds ratio has a kink at 50 ms, after which its slope is larger. So, if this effect is also visible directly in the firing rates, the fit of the build-up rates to the data might even improve if the probability of firing after 50 ms were used. In Fig. 2C no such kink can be seen in the firing rates, but that figure only shows data for 2 neurons of the model.
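The point about the fit window can be illustrated with a made-up firing-rate trace that has a kink at 50 ms: a line fitted before the kink gives a much lower build-up rate than one fitted after it (all numbers invented).

```python
import numpy as np

t = np.arange(0, 0.2, 0.001)                      # time in seconds
# piecewise-linear toy trace: shallow before 50 ms, steeper afterwards
rate = np.where(t < 0.05, 20 + 40 * t,
                22 + 120 * (t - 0.05))

early = t <= 0.05
late = (t > 0.05) & (t <= 0.15)
slope_early = np.polyfit(t[early], rate[early], 1)[0]   # build-up rate, 0-50 ms
slope_late = np.polyfit(t[late], rate[late], 1)[0]      # build-up rate, 50-150 ms
print("build-up rate from 0-50 ms:   %.0f sp/s per s" % slope_early)
print("build-up rate from 50-150 ms: %.0f sp/s per s" % slope_late)
```

If the model's rates behave like the log-odds ratio, estimating build-up from the first 50 ms would systematically understate them, consistent with the model's rates coming out lower than the data.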

Conclusion

Overall the paper is very interesting and stimulating. It is well written and full of sound theoretical results which originate from previous work of the authors. Unfortunately, biological nature does not completely fit the beautiful theory. Consequently, the authors run experiments with more plausible neural networks which only approximately implement the theory. So what conclusions can we draw from the presented results? As long as the firing of MT neurons reflects the likelihood of a stimulus (their MT network is set up this way), probably a wide range of networks which accumulate this firing will show responses similar to real LIP neurons. It is hard to say whether this follows from the theory, which states that MT firing rates should simply be summed over time in order to obtain the right posterior, because the assumptions of the theory are violated in more realistic networks. It could also be that more complicated forms of accumulation are necessary for LIP firing to represent the correct posterior, with simple summing merely being a simple approximation. Also, I don’t believe that the presented results can rule out sampling-based coding of probabilities (see Fiser et al., 2010) for decision making, as long as the sampling approach also implements some kind of accumulation procedure (think of particle filters; an implementation in a recurrent neural network would probably look quite similar).

Nevertheless, the main point of the paper is that the activity in LIP represents the full posterior and not only MAP estimates or log-odds. Consequently, the model extends very easily to the case of continuous directions of motion, in contrast to previous, e.g., attractor-based, neural models. I like this idea. However, I cannot determine from the experiments whether their network actually implements the correct posterior, because all their tests yield only indirect measures based on approximated analyses. Even so, it is pretty much impossible to verify that the firing of LIP neurons fits the simulated results as long as we cannot measure the firing of a large part of the corresponding neural population in LIP.

Why don’t we use Bayesian statistics to analyse experimental data?

This paper decoder post is a little different, as it doesn’t relate to a particular paper. Rather, it’s my answer to the question in the title, which was triggered by a colleague of mine. The colleague has a psychology background and had just learnt about Bayesian statistics when the following question crossed his mind:

Question

You do Bayesian stuff, right? Trying to learn about it now, can’t quite get my head around it yet, but it sounds like how I should be analysing data. In psychophysics we usually collect a lot of data from a small number of subjects, but then collapse all this data into a small number of points per subject for the purposes of stats. This loses quite a lot of fine detail: for instance, four steep psychometric functions with widely different means average together to create a shallow function, which is not a good representation of the data. Usually, the way psychoacousticians in particular get around this problem is not to bother with the stats. This, of course, is not optimal either! As far as I can tell the Bayesian approach to stats allows you to retain the variance (and thus the detail) from each stage of analysis, which sounds perfect for my old PhD data and for the data I’m collecting now. It also sounds like the thing to do for neuroimaging data: we collect a HUGE amount of data per subject in the scanner, but then create these extremely coarse averages, leading people to become very happy when they see something at the single-subject level. But of course all effects should REALLY be at the single-subject level; we assume they aren’t visible due to noise. So I’m wondering why everyone doesn’t employ this Bayesian approach, even in fMRI etc.

In short, my answer is twofold: 1) Bayesian statistics can be computationally very hard, and 2) more critically from a conceptual point of view, choosing a prior influences the results of your statistical inference, which makes experimenters uneasy.

The following is my full answer. It contains a basic introduction to Bayesian statistics targeted at people who have just discovered that it exists. I bet that a simple search for “Bayesian frequentist” brings up a lot more valuable information.

Answer

You’re right: the best way to analyse any data is to maintain the full distribution of your variables of interest throughout all analysis steps. You nicely described the reasons for this. The only problem is that this can be really hard, depending on your statistical model, i.e., your data. So you’ll need to make approximations. One way of doing this is to summarise the distribution by its mean and variance. The Gaussian distribution is so cool because these two quantities are actually sufficient to represent the whole distribution. For other probability distributions the mean and variance are not sufficient representations, so that summarising the distribution with them is an approximation. Therefore, you could say that the standard analysis methods you mention are valid approximations in the sense that they summarise the desired distribution with its mean. Then the question becomes: can you make better approximations for the model you consider? This is where the expertise of the statistician comes into play, because what you can do really depends on the particular situation with your data. Most of the time it is impossible to come up with the right distribution analytically, but many problems can actually be solved numerically on a computer these days.
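The psychometric example from the question is a concrete case of this information loss, and it is easy to demonstrate numerically (all parameters invented): four steep logistic functions with widely different means average to a much shallower function.

```python
import numpy as np

x = np.linspace(-10, 10, 201)          # stimulus axis, arbitrary units
means = [-4.0, -1.0, 1.0, 4.0]         # widely different subject means
k = 2.0                                # steepness of each individual function

def logistic(x, mu, k):
    return 1.0 / (1.0 + np.exp(-k * (x - mu)))

# group average of the four psychometric functions
avg = np.mean([logistic(x, mu, k) for mu in means], axis=0)

# the derivative of a logistic at its own mean is k/4
individual_slope = k / 4.0
mid = len(x) // 2                      # x[mid] == 0, centre of the means
avg_slope = (avg[mid + 1] - avg[mid - 1]) / (x[mid + 1] - x[mid - 1])
print("slope of each function at its own mean: %.2f" % individual_slope)
print("slope of the averaged function at 0:    %.2f" % avg_slope)
```

The averaged curve's slope is a fraction of the individual slopes, so the mean alone badly misrepresents the individual data; keeping the distribution over means would not.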

Now a little clarification of what I understand by the Bayesian approach. Here’s a hypothetical example: your variable of interest, x, is whether person A is a genius. You can’t really tell directly whether a person is a genius and you have to collect indirect evidence, y, from their behaviour (might be the questions they ask, the answers they give, or indeed a battery of psychological tests). So x can take values 0 (no genius) and 1 (genius). Your inference will be based on a statistical model of behaviour given genius or no genius (in words: if A is a genius then with probability p(y|x=1) he will exhibit behaviour y):

p(y|x=1) and p(y|x=0).

In a frequentist (classic) approach you will make a maximum likelihood estimate for x, which ends up being a simple procedure in which you sum up the log-probabilities of your evidence and compare which sum is larger:

sum over i log(p(y_i|x=1)) > sum over i log(p(y_i|x=0)) ???

If this statement is true, you’ll believe that A is a genius. Now, the problem is that, if you only have a few pieces of evidence, you can easily make false judgements with this procedure. Bayesians therefore take one additional source of information into account: the prior probability of someone being a genius, p(x=1), which is quite low. We can then get something called a maximum a posteriori estimate in which the evidence is weighted by the prior probability (note that the log-prior enters once, not once per observation), which leads to the following decision procedure:

sum over i log(p(y_i|x=1)) + log(p(x=1)) > sum over i log(p(y_i|x=0)) + log(p(x=0)) ???

Because p(x=1) is much smaller than p(x=0) this means that you now have to collect much more evidence where the probability of behaviour given that A is a genius, p(y_i|x=1), is larger than the probability of behaviour given that A is no genius, p(y_i|x=0), before you believe that A is a genius. In the full Bayesian approach you would actually not make a judgement, but estimate the posterior probability of A being a genius:

p(x=1|y) = p(y|x=1)p(x=1) / p(y).

This is the distribution which I said is hard to estimate above. The thing that makes it hard is p(y). In this case, where x can only take two values it is actually very easy to compute:

p(y) = p(y|x=1)p(x=1) + p(y|x=0)p(x=0)

but for each additional value x can take you’ll have to add a term to this equation and when x is a continuous variable this sum will become an integral and integration is hard.
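The whole genius example fits into a few lines of code (all probabilities are invented for illustration). Note how the prior flips the MAP decision relative to maximum likelihood, and how the full posterior quantifies the remaining uncertainty:

```python
import numpy as np

p_genius = 0.001                                   # prior p(x=1), invented
# p(y_i | x) for three observed behaviours under the two hypotheses
p_y_given_genius = np.array([0.9, 0.8, 0.95])
p_y_given_not = np.array([0.5, 0.4, 0.6])

# Maximum likelihood: compare summed log-likelihoods only.
ll1 = np.log(p_y_given_genius).sum()
ll0 = np.log(p_y_given_not).sum()
print("ML says genius:", ll1 > ll0)

# MAP: add the log-prior once to each side.
map1 = ll1 + np.log(p_genius)
map0 = ll0 + np.log(1 - p_genius)
print("MAP says genius:", map1 > map0)

# Full posterior: p(x=1|y) = p(y|x=1)p(x=1) / p(y), with
# p(y) = p(y|x=1)p(x=1) + p(y|x=0)p(x=0) (the easy binary case).
num1 = np.exp(ll1) * p_genius
num0 = np.exp(ll0) * (1 - p_genius)
posterior = num1 / (num1 + num0)
print("p(genius | y) = %.4f" % posterior)
```

With these numbers ML already believes A is a genius, MAP does not, and the posterior makes explicit just how far the low prior still dominates three pieces of favourable evidence.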

One more, but very important, thing: technical problems aside, the biggest criticism of the Bayesian approach is the use of the prior. In my example it kept us from making a premature judgement, but only because we had a suitable estimate of the prior probability of someone being a genius. The question is: where does the prior come from? Well, it’s prior information that enters your inference. If you don’t have prior information about your variable of interest, you’ll use an uninformative prior which assigns equal probability to each value of x. Then the maximum likelihood and maximum a posteriori estimators above become equal, but what does that mean for the posterior distribution p(x|y)? It changes its interpretation. The posterior becomes an entity representing a belief over the corresponding statement (A is a genius) given the information provided by the prior. If the prior measures the true frequency of the corresponding event in the real world, the posterior is a statement about the state of the world. But if the prior has no such interpretation, the posterior is just the mentioned belief under the assumed prior. These arguments are very subtle. Think about my example. The prior could be paraphrased as the prior probability that person A is a genius. This prior cannot represent a frequency in the world, because person A exists only once in the world. So whatever we choose as prior is merely a prior belief. While frequentists often argue that the posterior does not faithfully represent the world because of a potentially unsuitable prior, in my example the Bayesian approach allowed us to incorporate information into the inference that is inaccessible to the frequentist approach: we transferred the frequency of geniuses in the whole population to our a priori belief that person A is a genius.

Note that there really is no “correct” prior in my example and any prior corresponds to a particular prior assumption. Furthermore, the frequentist maximum likelihood estimator is equivalent to a maximum a posteriori estimator with a particular (uninformative) prior. Therefore, it has been argued that the Bayesian approach just makes explicit the prior assumptions that are also implicit in the more common (frequentist) statistical analyses. Unfortunately, it seems to be a bitter pill for experimenters to swallow to admit that the statistical analysis (and thus the outcome) of their experiment depends on prior assumptions (although they appear happy to do this in other contexts, for example, when making Gaussian assumptions for an ANOVA). Also, remember that the prior will ultimately be overwritten by sufficient evidence (even for a very low prior probability of A being a genius we’ll at some point believe that A is a genius, if A behaves accordingly). Given these considerations, the prior shouldn’t be a hindrance to a Bayesian analysis of experimental data, but the technical issues remain.

Action understanding and active inference.

Friston, K., Mattout, J., and Kilner, J.
Biol Cybern, 104:137–160, 2011
DOI, Google Scholar

Abstract

We have suggested that the mirror-neuron system might be usefully understood as implementing Bayes-optimal perception of actions emitted by oneself or others. To substantiate this claim, we present neuronal simulations that show the same representations can prescribe motor behavior and encode motor intentions during action-observation. These simulations are based on the free-energy formulation of active inference, which is formally related to predictive coding. In this scheme, (generalised) states of the world are represented as trajectories. When these states include motor trajectories they implicitly entail intentions (future motor states). Optimizing the representation of these intentions enables predictive coding in a prospective sense. Crucially, the same generative models used to make predictions can be deployed to predict the actions of self or others by simply changing the bias or precision (i.e. attention) afforded to proprioceptive signals. We illustrate these points using simulations of handwriting to illustrate neuronally plausible generation and recognition of itinerant (wandering) motor trajectories. We then use the same simulations to produce synthetic electrophysiological responses to violations of intentional expectations. Our results affirm that a Bayes-optimal approach provides a principled framework, which accommodates current thinking about the mirror-neuron system. Furthermore, it endorses the general formulation of action as active inference.

Review

In this paper the authors try to convince the reader that the function of the mirror neuron system may be to provide amodal expectations for how an agent’s body will change, or interact with the world. In other words, they propose that the mirror neuron system represents, more or less abstract, intentions of an agent. This interpretation results from identifying the mirror neuron system with hidden states in a dynamic model within Friston’s active inference framework. I will first comment on the active inference framework and the particular model used and will then discuss the biological interpretation.

Active inference framework:

Active inference has been described by Friston elsewhere (Friston et al. PLoS One, 2009; Friston et al. Biol Cyb, 2010). Note that all variables are continuous. The main idea is that an agent maximises the likelihood of its internal model of the world as experienced by its sensors by (1) updating the hidden states of this model and (2) producing actions on the world. Under the Gaussian assumptions made by Friston both ways to maximise the likelihood of the model are equivalent to minimising the precision-weighted prediction errors defined in the model. Potentially the models are hierarchical, but here only a single layer is used which consists of sensory states and hidden states. The prediction errors on sensory states are simply defined as the difference between sensory observations and sensory predictions from the model as you would intuitively do. The model also defines prediction errors on hidden states (*). Both types of prediction errors are used to infer hidden states (1) which explain sensory observations, but action is only produced (2) from sensory state prediction errors, because action is not part of the agent’s model and only affects sensory observations produced by the world.

Well, actually the agent needs a whole other model for action which implements the gradient of sensory observations with respect to action, i.e., which tells the agent how sensory observations change when it exerts action. However, Friston restricts sensory observations in this context to proprioceptive observations, i.e., muscle feedback, and argues that the corresponding gradient may be sufficiently simple to learn and represent so that we don’t have to worry about it (in the simulation he just provides the gradient to the agent). Therefore, action solely tries to implement proprioceptive predictions. On the other hand, proprioceptive predictions may be coupled to predictions in other modalities (e.g. vision) through the agent’s model which allows the agent to execute (seemingly) higher-level actions. For example, if an agent sees its hand move from a cup to a glass on a table in front of it, its generative model must also represent the corresponding proprioceptive signals. If then the agent predicts this movement of its hand in visual space, the generative model must automatically predict the corresponding proprioceptive signals, because they always accompanied the seen movement. Action then minimises the resulting precision-weighted proprioceptive prediction error and so implements the hand movement from cup to glass.

Notice that the agent minimises the *precision-weighted* prediction errors. Precision here means the inverse *prior* covariance, i.e., it is a measure for how certain the agent *expects* to be about its observations. By changing the precisions, qualitatively very different results can be obtained within the active inference framework. Indeed, here they implement the switch from action generation to action observation by heavily reducing the precision of the proprioceptive observations. This makes the agent ignore any proprioceptive prediction errors when both updating hidden states (1) and generating action (2). This leads to an interesting prediction: when you observe an action by somebody else, you shouldn’t notice when the corresponding body part is moved externally, or alternatively, when you observe somebody else’s movement, you shouldn’t be able to move the corresponding body part yourself (in a different way than the observed one). In this strict formulation the prediction appears very unlikely to hold, but in a softer formulation, namely that you should see interference effects in these situations, you may be able to find evidence for it.

This thought also points to the general problem of finding suitable precisions: how do you strike a balance between action (2) and perception (1)? Because they are both trying to reduce the same prediction errors, the agent has to trade off recognising the world as it is (1) against changing it so that it corresponds to his expectations (2). This dichotomy is not easily resolved. When asked about it, Friston usually points to empirical priors, i.e., that the agent has learnt to choose suitable precisions based on his past experience (not very helpful if you want to know how they are chosen). I guess it’s really a question of how strongly the agent expects (wants) a certain outcome. A useful practical consideration is also that action is constrained, e.g., an agent can’t move infinitely fast, which means that enough prediction error should be left over for perceiving changes in the world (1), in particular those that are not within reach of the agent’s actions on the expected time scale.
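The precision switch and the perception/action trade-off can be sketched in one dimension (this is deliberately not Friston’s generalised-coordinate scheme; all constants and the prior term are invented for illustration): both perception and action descend precision-weighted prediction errors, and lowering the proprioceptive precision turns the same equations from acting into observing.

```python
# mu is the agent's hidden state ("the limb is at the intended position"),
# limb is the true limb position in the world.
def simulate(pi_prop, pi_vis=1.0, prior=10.0, intend=1.0,
             external=None, steps=400, lr=0.05):
    mu, limb = intend, 0.0
    for _ in range(steps):
        if external is not None:
            limb = external            # the world moves the limb externally
        eps_p = limb - mu              # proprioceptive prediction error
        eps_v = limb - mu              # visual prediction error (same limb)
        limb -= lr * pi_prop * eps_p   # (2) action, driven by eps_p only
        # (1) perception: both weighted errors plus a prior on the intention
        mu += lr * (pi_prop * eps_p + pi_vis * eps_v + prior * (intend - mu))
    return mu, limb

# acting: high proprioceptive precision -> action fulfils the intention
mu_a, limb_a = simulate(pi_prop=8.0)
# observing: proprioceptive precision ~0, weak prior -> mu tracks the
# externally moved limb, while the agent itself stays passive
mu_o, limb_o = simulate(pi_prop=0.01, prior=0.5, external=-2.0)
print("acting:    limb -> %.2f (intended 1.0)" % limb_a)
print("observing: mu   -> %.2f (limb held at -2.0)" % mu_o)
```

The invented prior term is what arbitrates the trade-off here: without it, perception simply explains the error away and the limb and the intention meet somewhere in the middle.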

I do not discuss the most common reservation against Friston’s free-energy principle / active inference framework (that people seem to have an intrinsic curiosity towards new things as well), because it has been covered elsewhere (John Langford’s blog, Nature Neuroscience).

Handwriting model:

In this paper the particular model used is interpreted as a model of handwriting, although neither a hand nor actual writing is modelled. Rather, a two-joint system (arm) is used where the movement of the end-effector position (tip) is designed to be qualitatively similar to handwriting without actually producing common letters. The dynamic model of the agent consists of two parts: (a) a stable heteroclinic channel (SHC) which produces a periodic sequence of 6 continuously changing states, and (b) a linear attractor dynamics in the joint-angle space of the arm which is attracted to a rest position, but modulated by the distance of the tip to a desired point in Cartesian space determined by the SHC state. Thus, the agent expects the tip of its arm to move along a sequence of 6 desired points, where the dynamics of the arm movement is determined by the linear attractor. The agent observes the joint-angle positions and velocities (proprioceptive) and the Cartesian positions of the elbow joint and tip (visual). The dynamic model of the world (implementing the underlying physics, so to say) lacks the SHC dynamics and only defines the linear attractor in joint space, which is modulated by action and some (unspecified) external variables which can be used to perturb the system. Interestingly, the arm is more strongly attracted to its rest position in the world model than in the agent model. The reason for this is not clear to me, but it might not be important, because action could correct for it.
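The flavour of part (b) can be conveyed with a much-simplified sketch (all constants invented; the SHC is replaced by a plain cyclic switch, and the joint-angle kinematics and rest-position attraction are omitted): a slow 6-state sequence sets a desired tip position, and a linear attractor pulls the tip toward it.

```python
import numpy as np

# six desired tip positions, visited in a fixed cyclic order
targets = np.array([[0, 0], [1, 0], [1, 1], [0, 1], [-1, 1], [-1, 0]], float)

dt, k = 0.01, 3.0                 # time step and attractor gain (invented)
tip = np.zeros(2)
path = []
for step in range(1200):
    target = targets[(step // 200) % len(targets)]  # SHC-like slow sequence
    tip += dt * k * (target - tip)                  # linear attractor dynamics
    path.append(tip.copy())
path = np.array(path)

# the tip lags behind each newly selected target and only catches up later:
print("tip right after the switch to target 2:", np.round(path[200], 2))
print("tip just before the next switch:       ", np.round(path[399], 2))
```

Even this stripped-down version reproduces the lag between the "intended" target and the executed position that the authors later interpret as prospective coding; as argued below, such delays fall out of most hierarchical dynamic systems.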

Biological interpretation:

The system is setup such that the agent model contains additional hidden states compared to the world which may be interpreted as intentions of the agent, because they determine the order of the points that the tip moves to. In simulations the authors show that the described models within the active inference framework indeed lead to actions of the agent which implement a “writing” movement even though the world model did not know anything about “writing” at all. This effect has already been shown in the previously mentioned publications.

What is new here is that they show that the same model can be used to observe an action without generating action at the same time. As mentioned before, they achieve this simply by reducing the precision of the proprioceptive observations. They then replay the previously recorded actions of the agent in the world by providing them via the external variables. This produces an equivalent movement of the arm in the world without any action being exerted by the agent. Instead of generating its own movement, the agent then has the task of recognising a movement executed by somebody or something else. This works because the precision of the visual observations was kept high, so that the hidden SHC states can be inferred correctly (1). The authors mention a delay before the SHC states catch up with the equivalent trajectory under action. This should not be over-interpreted, because, contrary to what is stated in the text, the initial conditions of the two simulations were not the same (see figures and code). The important argument the authors try to make here is that the same set of variables (SHC states) is equally active during action and action observation and, therefore, provides a potential functional explanation for activity in the mirror neuron system.

Furthermore, the authors argue that the SHC states represent the intentions of the agent, or, equivalently, the intentions of the observed agent, by noting that the desired tip positions specified by the SHC states are only (approximately) reached at a later point in time in the world. This probably results from the inertia built into the joint-angle dynamics. There are probably dynamic models for which this effect disappears, but it sounds plausible that when one dynamic system d1 influences the parameters of another dynamic system d2 (as here), d2 first needs time to catch up with the new parameter setting. So these delays would be expected for most hierarchical dynamic systems.

Another line of argument of the authors is to relate prediction errors in the model to electrophysiological (EEG) findings. This is based on Friston’s previous suggestion that superficial pyramidal cells are likely candidates for implementing prediction error units. At the same time, activity of these cells is thought to dominate EEG signals. I cannot judge the validity of either hypothesis, although the former seems to have less experimental support than the latter. In any case, I find the corresponding arguments in this paper quite weak. The problem is that results from exactly one run, with one particular setting of parameters of one particular model, are used to make very general statements based on a merely qualitative fit of parts of the data to general experimental findings. In other words, I’m not confident that similar (desired) patterns would be seen in the prediction errors if other settings of the precisions, or of the parameters of the dynamical systems, were chosen.

Conclusion:

The authors suggest how the mirror neuron system can be understood within Friston’s active inference framework. These conceptual considerations make sense. In general, the active inference framework provides large explanatory power and many phenomena may be understood in its context. However, in my point of view, it is an entirely open question how the functional considerations of the active inference framework may be implemented in neurobiological substrate. The superficial arguments based on prediction errors generated by the model, which are presented in the paper, are not convincing. More evidence needs to be found which robustly links variables in an active inference model with neuroscientific measurements.

But also conceptually it is not clear whether the active inference solution correctly describes the computations of the brain. On the one hand, it potentially explains many important and otherwise disparate phenomena under a common principle (e.g. perception, action, learning, computing with noise, dynamics, internal models, prediction; this paper adds action understanding). On the other hand, we don’t know whether all brain functions actually follow a common principle and whether functionally equivalent solutions for subsets of phenomena may be better descriptions of the underlying computations.

An important issue for future studies which aim to discern these possibilities is that active inference is a general framework which needs to be instantiated with a particular model before its properties can be compared to experimental data. However, little is known about the hierarchical, dynamic, functional models themselves which must serve as generative models for active inference. As in this paper, it is then hard to discern the properties of the chosen model from the properties imposed by the active inference framework. Therefore, great care has to be taken in the interpretation of corresponding results, but it would be exciting to learn which properties of the active inference framework are crucial in brain function and which would need to be added, adapted, or dropped in a faithful description of (subsets of) brain function.

(*) Hidden state prediction errors result from Friston’s special treatment of dynamical systems by extending states by their temporal derivatives to obtain generalised states which represent a local trajectory of the states through time. The hidden state prediction errors, thus, can be seen, intuitively, as the difference between the velocity of the (previously inferred) hidden states as represented by the trajectory in generalised coordinates and the velocity predicted by the dynamic model.
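For readers unfamiliar with the notation, the standard form of these quantities (reconstructed from Friston’s other papers, so treat it as a sketch rather than a quotation from this one) is:

```latex
\tilde{x} = (x, x', x'', \ldots), \qquad
D\tilde{x} = (x', x'', x''', \ldots), \qquad
\varepsilon_x = D\tilde{x} - f(\tilde{x}, \tilde{v})
```

Here $D$ simply shifts the generalised state $\tilde{x}$ up by one temporal derivative, so the hidden state prediction error $\varepsilon_x$ compares the represented velocity (and higher derivatives) of the hidden states with what the dynamic model $f$ predicts, as described above.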

Tuning properties of the auditory frequency-shift detectors.

Demany, L., Pressnitzer, D., and Semal, C.
J Acoust Soc Am, 126:1342–1348, 2009
DOI, Google Scholar

Abstract

Demany and Ramos [(2005). J. Acoust. Soc. Am. 117, 833-841] found that it is possible to hear an upward or downward pitch change between two successive pure tones differing in frequency even when the first tone is informationally masked by other tones, preventing a conscious perception of its pitch. This provides evidence for the existence of automatic frequency-shift detectors (FSDs) in the auditory system. The present study was intended to estimate the magnitude of the frequency shifts optimally detected by the FSDs. Listeners were presented with sound sequences consisting of (1) a 300-ms or 100-ms random “chord” of synchronous pure tones, separated by constant intervals of either 650 cents or 1000 cents; (2) an interstimulus interval (ISI) varying from 100 to 900 ms; (3) a single pure tone at a variable frequency distance (Delta) from a randomly selected component of the chord. The task was to indicate if the final pure tone was higher or lower than the nearest component of the chord. Irrespective of the chord’s properties and of the ISI, performance was best when Delta was equal to about 120 cents (1/10 octave). Therefore, this interval seems to be the frequency shift optimally detected by the FSDs.

Review

If you present 5 tones simultaneously, people cannot tell whether a subsequently presented tone was one of the 5 tones, or lay in the middle between any 2 of them. On the other hand, people can judge whether a subsequently presented tone lay above or below any one of the 5 tones. This paper investigates how this effect depends on how far the subsequent tone lay above or below one of the 5 (here actually 6) tones (the frequency shift), on how widely the 6 tones were separated (Iv), and on the interstimulus interval (ISI) between the first set of tones and the subsequent tone. The authors replicated the mentioned previous findings and presented data suggesting that there is an optimal frequency shift at which subjects performed best in the task. They argue that this lies at roughly 120 cents.

I have several remarks about the analysis. First of all, the number of subjects in the two experiments is very low (7 and 4, each including the first author). While in experiment 1 the curves of d-prime over subjects look relatively consistent, this is not the case for larger ISIs in experiment 2. The main flaw of the analysis is that the suggested optimal frequency shift of 120 cents is based on fitting an exponential function to 4, 5, or 6 data points, where the authors also add an artificial baseline data point at d-prime = 0 for a frequency shift of 0. The data point as such makes sense, as a subject’s judgement of whether the shift was up or down must be random when the true shift was 0. Still, it feels wrong to include an artificial data point in the analysis. In the end, especially for large ISIs, the optimal frequency shifts estimated this way vary so much across individual subjects that it seems pointless to conclude anything from the mean over (4) subjects.
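The criticised procedure looks roughly like this (data values are invented, and since the paper’s exact fitting function is not reproduced here, a generic peaked form a·Δ·exp(−Δ/τ), whose maximum lies at Δ = τ, stands in for it):

```python
import numpy as np

delta = np.array([50.0, 100.0, 150.0, 250.0, 400.0])   # frequency shifts, cents
dprime = np.array([0.8, 1.1, 1.05, 0.8, 0.45])          # made-up d' values

# prepend the artificial (0, 0) anchor, as criticised above
d_fit = np.concatenate(([0.0], delta))
y_fit = np.concatenate(([0.0], dprime))

# grid search over tau; the scale a has a closed-form least-squares solution
best = None
for tau in np.arange(50.0, 400.0, 1.0):
    basis = d_fit * np.exp(-d_fit / tau)   # peaked curve with maximum at d = tau
    a = (basis @ y_fit) / (basis @ basis)
    sse = float(((a * basis - y_fit) ** 2).sum())
    if best is None or sse < best[0]:
        best = (sse, tau, a)

sse, tau, a = best
print("estimated optimal shift (peak of fitted curve): %.0f cents" % tau)
```

With so few points, the estimated peak is very sensitive both to the artificial anchor and to individual data points, which is exactly why the per-subject estimates scatter so widely at large ISIs.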

Sam actually tried to replicate the original finding on which this paper is based. He commented that it was hard to replicate in a large group of subjects and that he found differences between musicians and non-musicians (which shouldn’t be true for something that belongs to really basic hearing abilities). He also noted that subjects were generally quite bad at this task and that he found it impossible to make the task easier while maintaining that the 6 initial tones cannot be perceived individually.

The authors of the paper seem to repeatedly use subjects who perform particularly well in these tasks in their experiments.

It was noted in the group meeting that this research could be better linked to, e.g., the mismatch negativity literature, which is also concerned with the detection of deviations. In response, Sam pointed to the publication containing the original findings.

Internal models and the construction of time: generalizing from state estimation to trajectory estimation to address temporal features of perception, including temporal illusions.

Grush, R.
Journal of Neural Engineering, 2:S209, 2005
URL, Google Scholar

Abstract

The question of whether time is its own best representation is explored. Though there is theoretical debate between proponents of internal models and embedded cognition proponents (e.g. Brooks R 1991 Artificial Intelligence 47 139—59) concerning whether the world is its own best model, proponents of internal models are often content to let time be its own best representation. This happens via the time update of the model that simply allows the model’s state to evolve along with the state of the modeled domain. I argue that this is neither necessary nor advisable. I show that this is not necessary by describing how internal modeling approaches can be generalized to schemes that explicitly represent time by maintaining trajectory estimates rather than state estimates. Though there are a variety of ways this could be done, I illustrate the proposal with a scheme that combines filtering, smoothing and prediction to maintain an estimate of the modeled domain’s trajectory over time. I show that letting time be its own representation is not advisable by showing how trajectory estimation schemes can provide accounts of temporal illusions, such as apparent motion, that pose serious difficulties for any scheme that lets time be its own representation.

Review

The author argues, based on temporal illusions, that perceptual states correspond to smoothed trajectories, where smoothing is meant as in the context of a Kalman smoother. In particular, temporal illusions such as the flash-lag effect and the cutaneous rabbit show that stimuli later in time can influence the perception of earlier stimuli. However, it seems that this is only the case for temporally very close stimuli (within about 100 ms). Thus, Grush suggests that stimuli are internally represented as trajectories including past and future states. However, the representation of the past states in the trajectory is also updated when new sensory evidence is collected (the observations, or rather the states, are smoothed). This idea has actually already been suggested by Rao, Eagleman and Sejnowski (2001), as the author states, but here he additionally postulates that some of the future states are also represented in the trajectory, to account for apparent motion effects (where motion continues in the head after the stimulus disappears).
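The smoothing idea can be made concrete with a one-dimensional Kalman filter plus a Rauch-Tung-Striebel backward pass (the stimulus dynamics and all noise levels are invented): the estimate of a past state is revised once later observations arrive, which is the mechanism Grush invokes for the illusions.

```python
import numpy as np

rng = np.random.default_rng(2)

q, r = 0.05, 1.0                   # process and observation noise variances
# true stimulus: a drifting random walk; noisy observations of it
x_true = np.cumsum(rng.normal(0, np.sqrt(q), 60)) + 0.2 * np.arange(60)
y = x_true + rng.normal(0, np.sqrt(r), 60)

# forward Kalman filter (random-walk model)
m, p = 0.0, 10.0
ms, ps, preds = [], [], []
for obs in y:
    mp, pp = m, p + q              # predict
    preds.append((mp, pp))
    k = pp / (pp + r)              # Kalman gain
    m, p = mp + k * (obs - mp), (1 - k) * pp
    ms.append(m); ps.append(p)

# backward (RTS) smoothing pass: past states are re-estimated using
# observations that arrived later
sm = ms.copy()
for t in range(len(y) - 2, -1, -1):
    mp, pp = preds[t + 1]
    g = ps[t] / pp
    sm[t] = ms[t] + g * (sm[t + 1] - mp)

t0 = 30
print("filtered x[30]: %.2f, smoothed x[30]: %.2f, true: %.2f"
      % (ms[t0], sm[t0], x_true[t0]))
```

The smoothed estimate of each past state differs from the purely filtered one and is closer to the truth on average; a fixed-lag version, as in Rao et al.'s account, would simply restrict the backward pass to a short sliding window.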

It’s an interesting account of temporal aspects of perception, but note that he develops things at the perceptual level, which does not necessarily let us draw conclusions about processing at the sensory level. Also, his discussion of whether Rao et al.’s account of a fixed-lag smoother can be true is interesting, though he didn’t entirely convince me that fixed-lag perception is not what is happening in the brain. Wouldn’t instantaneous updating of the perceptual trajectory mean that at some point our perception changes, even though during the illusions people report coherent motion? It could be that we just don’t “remember” our previous perception after it is updated, but that still sounds counterintuitive. I also don’t think that the apparent motion illusions are a good argument for representing future states, because other mechanisms could be responsible for them.