Special Section on Replicability in Psychological Science: A Crisis of Confidence?

Perspectives on Psychological Science, 7, November 2012
web, Google Scholar

Abstract

Is there currently a crisis of confidence in psychological science reflecting an unprecedented level of doubt among practitioners about the reliability of research findings in the field? It would certainly appear that there is. These doubts emerged and grew as a series of unhappy events unfolded in 2011: the Diederik Stapel fraud case (see Stroebe, Postmes, & Spears, 2012, this issue), the publication in a major social psychology journal of an article purporting to show evidence of extrasensory perception (Bem, 2011) followed by widespread public mockery (see Galak, LeBoeuf, Nelson, & Simmons, in press; Wagenmakers, Wetzels, Borsboom, & van der Maas, 2011), reports by Wicherts and colleagues that psychologists are often unwilling or unable to share their published data for reanalysis (Wicherts, Bakker, & Molenaar, 2011; see also Wicherts, Borsboom, Kats, & Molenaar, 2006), and the publication of an important article in Psychological Science showing how easily researchers can, in the absence of any real effects, nonetheless obtain statistically significant differences through various questionable research practices (QRPs) such as exploring multiple dependent variables or covariates and only reporting these when they yield significant results (Simmons, Nelson, & Simonsohn, 2011).

Comment

I came across this special issue and thought it was worth sharing here. The abstract above is the first paragraph of the editorial introduction (doi) by Harold Pashler and Eric-Jan Wagenmakers. The special issue even contains a republication of a famous blog post (The 9 Circles of Scientific Hell) from Neuroskeptic. The only thing I wonder is why it has taken psychologists until now to become aware of these problems (I realise that some people probably knew this all along, but why was the field as a whole not concerned)? Maybe it has something to do with the increase in quantitative methods that came with the rise of cognitive neuroscience. In any case, it is not really surprising that it is hard to replicate studies involving the human mind/brain, because it is hard to control for the precise state of mind of a person. Even the person supervising the experiment, or the location of the experiment, may affect how a participant behaves (cf., e.g., the Milgram experiments in the 1960s). Interestingly, Richard Feynman already used these problems to deny psychology the label “science”, lumping it together with all the social sciences as “pseudoscience”. Here is an excerpt from his 1974 Caltech Commencement Address titled “Cargo Cult Science: Some Remarks on Science, Pseudoscience, and Learning How to Not Fool Yourself” (in, e.g., Richard P. Feynman, The Pleasure of Finding Things Out, Penguin Books, 1999), which describes the core problems the psychologists in the special issue are also concerned with:

Other kinds of errors are more characteristic of poor science. When I was at Cornell, I often talked to the people in the psychology department. One of the students told me she wanted to do an experiment that went something like this–it had been found by others that under certain circumstances, X, rats did something, A. She was curious as to whether, if she changed the circumstances to Y, they would still do A. So her proposal was to do the experiment under circumstances Y and see if they still did A.

I explained to her that it was necessary first to repeat in her laboratory the experiment of the other person–to do it under condition X to see if she could also get result A, and then change to Y and see if A changed. Then she would know that the real difference was the thing she thought she had under control.

She was very delighted with this new idea, and went to her professor. And his reply was, no, you cannot do that, because the experiment has already been done and you would be wasting time. This was in about 1947 or so, and it seems to have been the general policy then to not try to repeat psychological experiments, but only to change the conditions and see what happens.

Nowadays there’s a certain danger of the same thing happening, even in the famous (?) field of physics. I was shocked to hear of an experiment done at the big accelerator at the National Accelerator Laboratory, where a person used deuterium. In order to compare his heavy hydrogen results to what might happen with light hydrogen he had to use data from someone else’s experiment on light hydrogen, which was done on different apparatus. When asked why, he said it was because he couldn’t get time on the program (because there’s so little time and it’s such expensive apparatus) to do the experiment with light hydrogen on this apparatus because there wouldn’t be any new result. And so the men in charge of programs at NAL are so anxious for new results, in order to get more money to keep the thing going for public relations purposes, they are destroying–possibly–the value of the experiments themselves, which is the whole purpose of the thing. It is often hard for the experimenters there to complete their work as their scientific integrity demands.

All experiments in psychology are not of this type, however. For example, there have been many experiments running rats through all kinds of mazes, and so on–with little clear result. But in 1937 a man named Young did a very interesting one. He had a long corridor with doors all along one side where the rats came in, and doors along the other side where the food was. He wanted to see if he could train the rats to go in at the third door down from wherever he started them off. No. The rats went immediately to the door where the food had been the time before.

The question was, how did the rats know, because the corridor was so beautifully built and so uniform, that this was the same door as before? Obviously there was something about the door that was different from the other doors. So he painted the doors very carefully, arranging the textures on the faces of the doors exactly the same. Still the rats could tell. Then he thought maybe the rats were smelling the food, so he used chemicals to change the smell after each run. Still the rats could tell. Then he realized the rats might be able to tell by seeing the lights and the arrangement in the laboratory like any commonsense person. So he covered the corridor, and still the rats could tell.

He finally found that they could tell by the way the floor sounded when they ran over it. And he could only fix that by putting his corridor in sand. So he covered one after another of all possible clues and finally was able to fool the rats so that they had to learn to go in the third door. If he relaxed any of his conditions, the rats could tell.

Now, from a scientific standpoint, that is an A-number-one experiment. That is the experiment that makes rat-running experiments sensible, because it uncovers the clues that the rat is really using–not what you think it’s using. And that is the experiment that tells exactly what conditions you have to use in order to be careful and control everything in an experiment with rat-running.

I looked into the subsequent history of this research. The next experiment, and the one after that, never referred to Mr. Young. They never used any of his criteria of putting the corridor on sand, or being very careful. They just went right on running rats in the same old way, and paid no attention to the great discoveries of Mr. Young, and his papers are not referred to, because he didn’t discover anything about the rats. In fact, he discovered all the things you have to do to discover something about rats. But not paying attention to experiments like that is a characteristic of cargo cult science.

Causal role of dorsolateral prefrontal cortex in human perceptual decision making.

Philiastides, M. G., Auksztulewicz, R., Heekeren, H. R., and Blankenburg, F.
Curr Biol, 21:980–983, 2011
DOI, Google Scholar

Abstract

The way that we interpret and interact with the world entails making decisions on the basis of available sensory evidence. Recent primate neurophysiology [1-6], human neuroimaging [7-13], and modeling experiments [14-19] have demonstrated that perceptual decisions are based on an integrative process in which sensory evidence accumulates over time until an internal decision bound is reached. Here we used repetitive transcranial magnetic stimulation (rTMS) to provide causal support for the role of the dorsolateral prefrontal cortex (DLPFC) in this integrative process. Specifically, we used a speeded perceptual categorization task designed to induce a time-dependent accumulation of sensory evidence through rapidly updating dynamic stimuli and found that disruption of the left DLPFC with low-frequency rTMS reduced accuracy and increased response times relative to a sham condition. Importantly, using the drift-diffusion model, we show that these behavioral effects correspond to a decrease in drift rate, a parameter describing the rate and thereby the efficiency of the sensory evidence integration in the decision process. These results provide causal evidence linking the DLPFC to the mechanism of evidence accumulation during perceptual decision making.

Review

They apply repetitive TMS to the dorsolateral prefrontal cortex (DLPFC), assuming that this inhibits the decision-making ability of subjects, because the DLPFC has been shown to be involved in perceptual decision making. Indeed, they find a significant effect of TMS vs. sham on the responses of subjects (after TMS, responses are less accurate and take longer). They also argue that the effect is specific to TMS because it wears off over time, but I wonder why they did not compute the corresponding interaction (they just report that the effect of TMS vs. sham is significant early on, but not significant later).

Furthermore, they hypothesised that TMS disrupted the accumulation of noisy evidence over time by decreasing the rate at which evidence increases. This is based on the previous finding that the DLPFC shows higher BOLD activation for less noisy stimuli, which suggests that, when the DLPFC is disrupted, the evidence coming from less noisy stimuli can no longer be processed optimally.
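For reference, here is the standard DDM formulation I have in mind when discussing the parameters below (my notation, not copied from the paper):

```latex
dx = v\,dt + s\,dW, \qquad x(0) = z, \qquad 0 < z < a,
```

where the accumulated evidence x drifts at rate v, is perturbed by noise with diffusion coefficient s, and a response is triggered when x first hits one of the bounds 0 or a; the observed response time is this first-passage time plus the nondecision time T_er. The TMS hypothesis above amounts to a reduction of the drift rate v.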

They investigated the evidence accumulation hypothesis by fitting a drift-diffusion model (DDM) to the response data. The DDM has more parameters than are necessary to explain the variation in response data across the experimental conditions. Hence, they use the Bayesian information criterion (BIC) to select which parameters should be fitted separately for each experimental condition, i.e., to be able to say which parameters are affected by the experimental manipulations. The other parameters are still fitted, but to the pooled data across conditions. The problem is that the BIC is a very crude approximation which only takes the number of freely varying parameters into account. For example, an assumption underlying the BIC is that the Hessian of the likelihood evaluated at the fitted parameter values has full rank (Bishop, 2006, p. 217), but for highly correlated parameters this may not be the case. The DMAT toolbox used for the fitting actually approximates the Hessian matrix, checks whether a local minimum has been found (instead of a valley) and computes confidence intervals from the approximated Hessian, but the authors report no results for this apart from error bars on the plot for drift rate and nondecision time.
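To make the model-selection logic concrete, here is a minimal sketch of a BIC comparison of candidate DDM parameterisations (which parameters are allowed to vary across conditions). The numbers and parameter counts are made up for illustration; the paper used the DMAT toolbox in MATLAB, not this code.

```python
import numpy as np

def bic(neg_log_likelihood, n_free_params, n_observations):
    # BIC = 2 * NLL + k * ln(n); smaller values indicate a better trade-off
    # between fit quality and model complexity
    return 2.0 * neg_log_likelihood + n_free_params * np.log(n_observations)

# hypothetical fit results (negative log-likelihood, number of free parameters)
# for candidate parameterisations: which DDM parameters are allowed to differ
# between the experimental conditions
candidates = {
    "all parameters shared":         (5120.0, 7),
    "drift rate varies":             (5010.0, 10),
    "drift + nondecision time vary": (4995.0, 13),
}
n_trials = 4000  # total number of trials entering the fit

scores = {name: bic(nll, k, n_trials) for name, (nll, k) in candidates.items()}
best = min(scores, key=scores.get)
print(best, scores[best])
```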

Anyway, the BIC analysis conveniently indicates that drift rate and nondecision time best explain the variation in response data across conditions. However, it has to be kept in mind that these results were obtained by (presumably) fixing the diffusion coefficient across conditions, which is the standard when fitting a DDM [private correspondence with Rafal Bogacz, 09/2012], because drift rate, diffusion and threshold are redundant (a change in one of them can be compensated by suitable changes in the others). The interpretation of the BIC analysis should probably be that drift rate and nondecision time are the smallest set of parameters which still allow a good fit of the data, given that the diffusion is fixed.
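The redundancy can be written down directly (again my notation): rescaling the accumulation process by a constant c > 0 gives

```latex
x' = c\,x:\qquad dx' = (c\,v)\,dt + (c\,s)\,dW, \qquad \text{bounds } 0 \text{ and } c\,a,
```

and x' reaches ca exactly when x reaches a, so choice probabilities and response-time distributions are unchanged. Only the ratios v/s and a/s are identifiable, which is why the diffusion coefficient is conventionally fixed.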

You need to be careful when interpreting the fitted parameter values in the different conditions. In particular, fitting a DDM to data assumes that evidence accumulation still works like a DDM, just with different parameters. However, it is not clear what TMS does to the affected processes in the brain. Hence, all we can say from the fitting results is that TMS has an effect which is equivalent to a reduction of the drift rate (with no clear effect on nondecision time) in a normally functioning DDM.

Similarly, the interpretation of the results for nondecision time is not straightforward. There, the main finding is that nondecision time decreases for high-evidence stimuli, which the authors interpret as a reduced duration of the low-level sensory processing that provides input to evidence accumulation. However, it should be kept in mind that the total time needed to make a decision is also reduced for high-evidence stimuli. Also, part of the processes collected under ‘nondecision time’ may actually run in parallel to evidence accumulation, e.g., movement preparation. If you look at the percentage of RT explained by nondecision time, the picture is reversed: for high-evidence stimuli nondecision time explains about 82% of the RT while for low-evidence stimuli it explains only about 75%, which is consistent with the basic idea that evidence accumulation takes longer for noisier stimuli. In general, these percentages are surprisingly high. Does evidence accumulation really account for only about 25% of the total RT? But it is good that we now have a number to compare against.

So what do these findings mean for the DLPFC? We cannot draw any definite conclusions. The hypothesis that TMS over DLPFC affects the drift rate is somewhat built into the analysis, because the authors use a DDM to fit the responses. Of course, other parameters could have been affected more strongly, so the finding of the BIC analysis that drift rate explains the changes best can indeed be taken as evidence for the drift rate hypothesis. However, it is not possible to exclude other explanations which lie outside the parameter space of the DDM. What, for example, if the DLPFC has a somewhat attentional effect on evidence accumulation in the sense that it not only accumulates evidence, but also modulates how large the individual pieces of evidence are by modulating lower-level sensory processing? Then interrupting the DLPFC may still have an effect similar to the one observed here, but the interpretation of the role of the DLPFC would be slightly different. Actually, the authors argue against a role of the DLPFC (at least the part of the DLPFC they found) in attentional processing, but I’m not entirely convinced. Their main argument rests on the assumption that a top-down attentional effect of the DLPFC on low-level sensory processing would increase the nondecision time, but this is not necessarily true: A) there is the previously mentioned issue of parallel processing, and the general problem of fitting a standard model to a disturbed process, which makes me doubt the reliability of the fitted nondecision times, and B) I can easily conceive of a system in which attentional modulation would not delay low-level sensory processing.

Perceptions as hypotheses: saccades as experiments.

Friston, K., Adams, R. A., Perrinet, L., and Breakspear, M.
Front Psychol, 3:151, 2012
DOI, Google Scholar

Abstract

If perception corresponds to hypothesis testing (Gregory, 1980); then visual searches might be construed as experiments that generate sensory data. In this work, we explore the idea that saccadic eye movements are optimal experiments, in which data are gathered to test hypotheses or beliefs about how those data are caused. This provides a plausible model of visual search that can be motivated from the basic principles of self-organized behavior: namely, the imperative to minimize the entropy of hidden states of the world and their sensory consequences. This imperative is met if agents sample hidden states of the world efficiently. This efficient sampling of salient information can be derived in a fairly straightforward way, using approximate Bayesian inference and variational free-energy minimization. Simulations of the resulting active inference scheme reproduce sequential eye movements that are reminiscent of empirically observed saccades and provide some counterintuitive insights into the way that sensory evidence is accumulated or assimilated into beliefs about the world.

Review

In this paper Friston et al. introduce the notion that an agent (such as the brain) minimises uncertainty about its state in the world by actively sampling those states which, when visited some time in the future, minimise the uncertainty of the agent’s posterior beliefs. The presented ideas can also be seen as a reply to the commonly formulated dark-room critique of Friston’s free energy principle, which states that under the free energy principle an agent would try to find a dark, stimulus-free room in which sensory input can be perfectly predicted. Here, I review these ideas together with the technical background (see also a related post about Friston et al., 2011). Although I find the presented theoretical argument very interesting and sound (and compatible with other proposals for the origin of autonomous behaviour), I do not think that the presented simulations conclusively show that the extended free energy principle, as instantiated by the particular model chosen in the paper, leads to the desired exploratory behaviour.

Introduction: free energy principle and the dark room

Friston’s free energy principle has gained considerable momentum in the field of cognitive neuroscience as a unifying framework under which many cognitive phenomena may be understood. Its main axiom is that an agent tries to minimise the long-term uncertainty about its state in the world by executing actions which make prediction of changes in the agent’s world more precise, i.e., which minimise surprises. In other words, the agent tries to maintain a sort of homeostasis with its environment.

While homeostasis is a concept which most people happily associate with bodily functions, it is harder to reconcile with cognitive functions which produce behaviour. Typically, the counter-argument against the free energy principle is the dark-room problem: changes in a dark room can be perfectly predicted (= no changes), so shouldn’t we all just try to lock ourselves into dark rooms instead of constantly exploring our environment for new things?

The shortcoming of the dark-room problem is that an agent cannot maintain homeostasis in a dark room, because, for example, its bodily functions will stop working properly after some time without water. There may be many more environmental factors which can disturb the agent’s dark-room pleasure. An experienced agent knows this and has developed a corresponding model of its world which tells it that the state of the world becomes increasingly uncertain as long as the agent only samples a small fraction of the state space of the world, as is the case when you are in a dark room and don’t notice what happens outside of it.

The present paper formalises this idea. It assumes that an agent only observes a small part of the world in its local surroundings, but also maintains a more comprehensive model of its world. To decrease uncertainty about the global state of the world, the agent then explores other parts of the state space which it believes to be informative according to its current estimate of the global world state. In the remainder I will present the technical argument in more detail, discuss the supporting experiments and conclude with my opinion about the presented approach.

Review of theoretical argument

In previous publications Friston postulated that agents try to minimise the entropy of the world states which they encounter in their life and that this minimisation is equivalent to minimising the entropy of their sensory observations (by essentially assuming that the state-observation mapping is linear). The sensory entropy can be estimated by the average of sensory surprise (negative model evidence) across (a very long) time. So the argument goes that an agent should minimise sensory surprise at all times. Because sensory surprise cannot usually be computed directly, Friston suggests a variational approximation in which the posterior distribution over world states (posterior beliefs) and model parameters is separated. Further, the posterior distributions are approximated with Gaussian distributions (Laplace approximation). Then, minimisation of surprise is approximated by minimisation of Friston’s free energy. This minimisation is done with respect to the posterior over world states and with respect to action. The former corresponds to perception and ensures that the agent maintains a good estimate of the state of the world and the latter implements how the agent manipulates its environment, i.e., produces behaviour. While the former is a particular instantiation of the Bayesian brain hypothesis, and hence not necessarily a new idea, the latter had not previously been proposed and subsequently spurred some controversy (cf. above).
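In equations (my own condensed summary, not taken verbatim from the paper): the sensory entropy is the long-term average of surprise, and the free energy is an upper bound on surprise,

```latex
H(Y) \;=\; \lim_{T\to\infty}\frac{1}{T}\int_0^T -\ln p\bigl(y(t)\mid m\bigr)\,dt,
\qquad
F(y,q) \;=\; -\ln p(y\mid m) \;+\; \mathrm{KL}\bigl[\,q(x)\,\|\,p(x\mid y,m)\,\bigr] \;\ge\; -\ln p(y\mid m).
```

Minimising F with respect to the approximate posterior q (perception) tightens the bound, and minimising it with respect to action, which changes the sensory samples y, reduces surprise itself and thereby the long-term average H(Y).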

At this point it is important to note that the action variables are defined on the level of primitive reflex arcs, i.e., they directly control muscles in response to unexpected basic sensations. Yet, the agent can produce arbitrary complex actions by suitably setting sensory expectations which can be done via priors in the model of the agent. In comparison with reinforcement learning, the priors of the agent about states of the world (the probability mass attributed by the prior to the states), therefore, replace values or costs. But how does the agent choose its priors? This is the main question addressed by the present paper, however, only in the context of a freely exploring (i.e., task-free) agent.

In this paper, Friston et al. postulate that an agent minimises the joint entropy of world states and sensory observations instead of only the entropy of world states. Because the joint entropy is the sum of sensory entropy and conditional entropy (world states conditioned on sensory observations), the agent needs to implement two minimisations. The minimisation of sensory entropy is exactly the same as before implementing perception and action. However, conditional entropy is minimised with respect to the priors of the agent’s model, implementing higher-level action selection.
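The decomposition referred to here is just the chain rule for entropies, with S the hidden world states and Y the sensory observations:

```latex
H(S, Y) \;=\; H(Y) \;+\; H(S \mid Y).
```

The first term is handled by the existing perception-action scheme; the second is minimised with respect to the priors, i.e., by the new, higher-level actions.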

In Friston’s dynamic free energy framework (and other filters) priors correspond to predictive distributions, i.e., distributions over the world states some time in the future given their current estimates. Friston also assumes that the prior densities are Gaussian. Hence, priors are parameterised by their mean and covariance. To manipulate the probability mass attributed by the prior to the states he thus has to change the prior mean or covariance of the world states. In the present paper the authors use a fixed covariance (as far as I can tell) and implement changes in the prior by manipulating its mean. They do this indirectly by introducing new, independent control variables (“controls” from here on) which parameterise the dynamics of the world states without having dynamics of their own. The controls are treated like the other hidden variables in the agent’s model and their values are inferred from the sensory observations via free energy minimisation. However, I guess that the idea is to more or less fix the controls to their prior means, because the second entropy minimisation, i.e., minimisation of the conditional entropy, is with respect to these prior means. Note that the controls are pretty arbitrary and can only be interpreted once a particular model is considered (as is the case for the remaining variables mentioned so far).

As with the sensory entropy, the agent has no direct access to the conditional entropy. However, it can use the posterior over world states given by the variational approximation to approximate the conditional entropy, too. In particular, Friston et al. suggest approximating the conditional entropy using a predictive density which looks ahead in time from the current posterior and which they call the counterfactual density. The entropy of this counterfactual density tells the agent how much uncertainty about the global state of the world it can expect in the future, based on its current estimate of the world state. The authors do not specify how far into the future the counterfactual density looks. They use the terminological trick of calling negative conditional entropy ‘saliency’ to make the correspondence between the suggested framework and the experimental variables in their example more intuitive, i.e., minimisation of conditional entropy becomes maximisation of saliency. The actual implementation of this nonlinear optimisation is computationally demanding. In particular, it will be very hard to find global optima using gradient-based approaches. In this paper Friston et al. bypass this problem by discretising the space spanned by the controls (which are the variables with respect to which they optimise), computing the conditional entropy at each discrete location and simply selecting the location with minimal entropy, i.e., they do a grid search.

In summary, the present paper extends previous versions of Friston’s free energy principle by adding prior selection, or, say, high-level action, to perception and action. This is done by adding new control variables representing high-level actions and setting these variables using a new optimisation which minimises future uncertainty about the state of the world. The descriptions in the paper implicitly suggest that the three processes happen sequentially: first the agent perceives to get the best estimate of the current world state, then it produces action to take the world state closer to its expectations and then it reevaluates expectations and thus sets high-level actions (goals). However, Friston’s formulations are in continuous time such that all these processes supposedly happen in parallel. For perception and action alone this leads to unexpected interactions. (Do you rather perceive the true state of the world as it is, or change it such that it corresponds to your expectations?) Adding control variables certainly doesn’t reduce this problem, if their values are inferred (perceived), too, but if perception cannot change them, only action can reduce the part of free energy contributed by them, thereby disentangling perception and action again. Therefore, the new control variables may be a necessary extension, if used properly. To me, it does not seem plausible that high-level actions are reevaluated continuously. Shouldn’t you wait until, e.g., a goal is reached? Such a mechanism is still missing in the present proposal. Instead the authors simply reevaluate high-level actions (minimise conditional entropy with respect to control variable priors) at fixed, ad-hoc intervals spanning sufficiently large amounts of time.

Review of presented experiments (saccade model)

To illustrate the theoretical points, Friston et al. present a model for saccadic eye movements. This model is very basic and is only supposed to show in principle that the new minimisation of conditional entropy can provide sensible high-level action. The model consists of two main parts: 1) the world, which defines how sensory input changes based on the true underlying state of the world and 2) the agent, which defines how the agent believes the world behaves. In this case, the state of the world is the position in a viewed image which is currently fixated by the eye of the agent. This position, hence, determines what input the visual sensors of the agent currently get (the field of view around the fixation position is restricted), but additionally there are proprioceptive sensors which give direct feedback about the position. Action changes the fixation position. The agent has a similar, but extended model of the world. In it, the fixation position depends on the hidden controls. Additionally, the model of the agent contains several images such that the agent has to infer what image it sees based on its sensory input.

In Friston’s framework, inference results heavily depend on the setting of prior uncertainties of the agent. Here, the agent is assumed to have certain proprioception, but uncertain vision such that it tends to update its beliefs of what it sees (which image) rather than trying to update its beliefs of where it looks. [I guess, this refers to the uncertainties of the hidden states and not the uncertainties of the actual sensory input which was probably chosen to be quite certain. The text does not differentiate between these and, unfortunately, the code was not yet available within the SPM toolbox at the time of writing (08.09.2012).]

As mentioned above, every 16 time steps the prior over the agent’s hidden controls is recomputed by minimising the conditional entropy of the hidden states given the sensory input (minimising the uncertainty over future states given the sensory observations up to that time point). This is implemented by defining a grid of fixation positions and computing the entropy of the counterfactual density (uncertainty over future states) while setting the mean of the prior to each of the positions in turn. In effect, this translates for the agent into: ‘Use your internal model of the world to simulate how your estimate of the world will change when you execute a particular high-level action. (What will your beliefs be about what image you see, when fixating a particular position?) Then choose the high-level action which reduces your uncertainty about the world the most. (Which position gives you the most information about what image you see?)’ Up to this point the theoretical ideas were self-contained and derived from first principles, but then Friston et al. introduce inhibition of return to make their results ‘more realistic’. In particular, they introduce an inhibition-of-return map, a kind of fading memory of which positions were previously chosen as saccade targets, which is subtracted from the computed conditional entropy values. [The particular form of the inhibition-of-return computations, especially the initial subtraction of the minimal conditional entropy value, is not motivated by the authors.]
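A minimal sketch of how I read this selection step: evaluate the counterfactual entropy on a grid of candidate fixation positions, subtract a fading inhibition-of-return map, and pick the most salient position. The entropy values below are a random toy stand-in (in the paper they come from running the agent’s generative model forward), and the exact form of the inhibition-of-return update is a placeholder, not the authors’ implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

grid = np.linspace(-1.0, 1.0, 16)                          # candidate fixation positions
counterfactual_entropy = rng.uniform(0.5, 2.0, grid.size)  # toy stand-in values

ior_map = np.zeros(grid.size)  # inhibition-of-return memory
decay = 0.5                    # fading of that memory per saccade

def select_saccade_target(entropy, ior_map):
    # saliency = negative conditional entropy (shifted so its minimum is zero),
    # penalised by inhibition of return for recently visited positions
    saliency = -(entropy - entropy.min()) - ior_map
    idx = int(np.argmax(saliency))
    ior_map *= decay
    ior_map[idx] += 1.0
    return idx, ior_map

for saccade in range(4):
    idx, ior_map = select_saccade_target(counterfactual_entropy, ior_map)
    print(f"saccade {saccade}: fixate position {grid[idx]:+.2f}")
```

Note that with fixed entropy values the agent would keep choosing the same position if the inhibition-of-return penalty were removed, which is exactly the concern raised about experiment 1 below.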

For the presented experiments the authors use an agent model which contains three images as hypotheses of what the agent observes: a face and its 90° and 180° rotated versions. The first experiment is supposed to show that the agent can correctly infer which image it observes by making saccades to low conditional entropy (‘salient’) positions. The second experiment is supposed to show that, when an image is observed which is unknown to the agent, the agent cannot be certain of which of the three images it observes. The third experiment is supposed to show that the uncertainty of the agent increases when high entropy high-level actions are chosen instead of low entropy ones (when the agent chooses positions which contain very little information). I’ll discuss them in turn.

In the first experiment, the presented posterior beliefs of the agent about the identity of the observed image show that the agent indeed identifies the correct image and becomes more certain about it. Figure 5 of the paper also shows us the fixated positions and the inhibition-of-return-adjusted conditional entropy maps. The presented ‘saccadic eye movements’ are misleading: the points only show the stabilised fixation positions and the lines merely connect them, without showing the large overshoots which occur according to the plot of ‘hidden (oculomotor) states’. Most critically, however, it appears that the agent had already identified the right image with relative certainty before any saccade was made (the time until about 200 ms). The results, therefore, do not clearly show that the saccade selection is beneficial for identifying the observed image, also because the presented example is only a single trial with a particular initial fixation point and a noiseless observed image. Also, because the image was clearly identified very quickly, my guess is that the conditional entropy maps would be very similar after each saccade without inhibition of return, i.e., the same fixation position would always be chosen and no exploratory behaviour (saccades) would be seen, but this would need to be confirmed by running the experiment without inhibition of return. My overall impression of this experiment is that it presents a single, trivial example which does not allow me to draw general conclusions about the suggested theoretical framework.

The second experiment acts as a sanity check: the agent shouldn’t be able to identify one of its three images when it observes a fourth one. Whether the experiment shows this depends on the interpretation of the inferred hidden states. The way these states were defined, their values can be directly interpreted as the probability of observing one of the three images. If only these are considered, the agent appears to be very certain at times (it doesn’t help that the scale of the posterior belief plot in Figure 6 is four times larger than that of the same plot in Figure 5). However, the posterior uncertainty directly associated with the hidden states does appear to be considerably larger than in experiment 1, although, again, this is only a single example. Something rather strange: the sequence of fixation positions is almost exactly the same as in experiment 1 even though the observed image and the resulting posterior beliefs were completely different. Why?

Finally, experiment three is more like a thought experiment: what would happen if an agent chose high-level actions which maximise future uncertainty instead of minimising it? Well, the uncertainty of the agent’s posterior beliefs increases, as shown in Figure 7, which is the expected behaviour. One thing I wonder, though, and it applies to the presented results of all experiments: in Friston’s Bayesian filtering framework the uncertainty of the posterior hidden states is a direct function of their mean values. Hence, as long as the mean values do not change, the posterior uncertainty should stay constant, too. However, we see in Figure 7 that the posterior uncertainty increases even though the posterior means stay more or less constant. So either there is an additional (unexplained) mechanism at work, or we are not shown the distribution of posterior hidden states but something slightly different. In both cases, it would be important to know what exactly produced the presented plots in order to interpret the experiments correctly.

Conclusion

The paper presents an important theoretical extension to Friston’s free energy framework. This extension consists of adding a new layer of computations which can be interpreted as a mechanism for how an agent (autonomously) chooses its high-level actions. These high-level actions are defined in terms of desired future states, encoded by the probability mass which the prior state distribution assigns to these states. Conceptually, these ideas translate into choosing maximally informative actions given the agent’s model of the world and its current state estimate. As discussed by Friston et al., such approaches to action selection are not new (see also Tishby and Polani, 2011). So the authors’ contribution is to show that these ideas are compatible with Friston’s free energy framework. Hence, on an abstract, theoretical level this paper makes sense. It also provides a sound theoretical argument for why an agent would not seek sensory deprivation in a dark room, as feared by critics of the free energy principle. However, the presented framework relies heavily on the agent’s model of the world and it leaves open how the agent has attained this model. Although the free energy principle also provides a way for the agent to learn the parameters of its model, I still, for example, haven’t seen a convincing application in which the agent actually learnt the dynamics of an unknown process in the world. Friston would probably point to evolution as providing a good initialisation for the process dynamics, but I find that too cheap a way out.

From a technical point of view the paper leaves a few questions open, for example: How far does the counterfactual distribution look into the future? What does it mean for high-level actions to change how far the agent looks into its subjective future? How well does the presented approach scale? Is it important to choose the global minimum of the conditional entropy (this would be bad, as it is probably extremely hard to find in a general setting)? When, or how often, does the agent minimise conditional entropy to set high-level actions? What happens with more than one control variable (several possible high-level actions)? How can you model discrete high-level actions in Friston’s continuous Gaussian framework? How do the results depend on the setting of prior covariances / uncertainties? And many more.

Finally, I have to say that I find the presented experiments quite poor. Although providing the agent with a limited field of view, so that it has to explore different regions of a presented image, is a suitable setting to test the proposed ideas, the trivial example and the ad-hoc introduction of inhibition of return make it impossible to judge whether the underlying principle is successfully at work, or whether the simulations have been engineered to work in this particular case.

Larry Wasserman, post-publication peer review and the neglect of time and politics

With the start of Larry Wasserman’s blog (for those who don’t know him: he’s a renowned statistician and machine learner) I also had a look at his homepage. It turns out that he is a proponent of post-publication peer review. To be precise, he published a short essay in which he gives arguments for why the current peer review process should be abolished. He mainly argues that the quality of the output of the current system is bad and that it is unnecessarily exclusive. So he proposes free, unreviewed publication with subsequent quality control, which essentially corresponds to post-publication peer review.

I generally agree with his criticism and hope that his proposal becomes reality at some point. He notes in his conclusion:

When I criticize the peer review process I find that people are quick to agree with me. But when I suggest getting rid of it, I usually find that people rush to defend it.

I have done the first part, but now, instead of defending the current peer review process directly, I’ll try to illuminate the reasons why people find it hard to turn away from it.

In my view, the main function of peer review is filtering: the good into the pot, the bad into the crop, but instead of making binary decisions only, the established system of differently valued journals essentially implements a rating for papers. At least this is, I think, what people have implicitly in mind, even though it has long been argued that the impact of a journal has nothing to do with the value of an individual paper published in the journal. Probably most people know about the flaws of this evaluation process, but they accept the errors in return for at least a rough, implicit rating of quality.

Therefore, any system which replaces the current peer review process has to implement a rating for individual papers. In the ‘About’ section of this blog I discuss why scientists may be reluctant to publicly criticise (or even explicitly rate) a paper. Then again, the rating could just consist of counts of positive mentions (cf. Facebook likes). This goes in the direction of bibliometrics, an old field which tries to analyse the scientific literature quantitatively and which has become more important in the internet age. While a few people seem to work on it, I have so far not seen a convincing, i.e., comparable and robust, metric at the level of an individual paper. I’m confident, though, that we will get there at some point.

There is one dimension that is usually neglected in this discussion: time. The current peer review process is relatively quick. It may take a year in some cases before a paper gets published (usually it’s even faster), but then it immediately has the mentioned implicit rating based on the prestige of the journal. In post-publication peer review the value of a paper may only stabilise very slowly, depending on who promotes it initially. For example, citation counts may only become meaningful after 2 to 5 years. This poses a problem in the practical world of science, where the next, short-term job of a young researcher depends on their output and how it is valued.

The issue of time is particularly prominent in the evaluation of conference submissions. For example, at NIPS the evaluation process for submitted papers takes only about 3 months, after which final decisions are made. Can a post-publication peer review process converge to a stable evaluation within 3 months?

Finally, there is an additional function of peer review which I have not mentioned so far: confidential feedback. Many scientists don’t want to publish half-cooked research and try to make their publications (and the research therein) as good as they can before anyone else, especially the public, sees it. In the best case, a closed pre-publication peer review then acts as an additional sanity check which prevents potential mistakes from becoming public and, therefore, saves the authors from the embarrassment of publicly admitting to having made a mistake (just think about the disrepute currently associated with having to publish an erratum, let alone to retract a paper). Nobody likes to make mistakes and often enough we like it even less to have to admit to one.

In conclusion, I do agree with most criticism of the current peer review process, but I also believe that scientists won’t readily change to a new process unless it implements the functions I discussed here. In particular, such a new process needs to provide a timely, but also accurate and stable, evaluation of presented research. In my opinion, post-publication peer review (or indirectly bibliometrics) cannot currently provide these functions, but it may in the future. What remains are the social constraints of the scientists: the political reasons making individual scientists reluctant to openly criticize the work of others or to make and admit to mistakes. I have the impression that these constraints are deeply rooted in human nature and, hence, are difficult to overcome. If such a feat could be achieved, then only through concerted action of the whole scientific community which would need to adjust how research, contributions to discussions and mistakes are evaluated.

Inhibitory plasticity balances excitation and inhibition in sensory pathways and memory networks.

Vogels, T. P., Sprekeler, H., Zenke, F., Clopath, C., and Gerstner, W.
Science, 334:1569–1573, 2011
DOI, Google Scholar

Abstract

Cortical neurons receive balanced excitatory and inhibitory synaptic currents. Such a balance could be established and maintained in an experience-dependent manner by synaptic plasticity at inhibitory synapses. We show that this mechanism provides an explanation for the sparse firing patterns observed in response to natural stimuli and fits well with a recently observed interaction of excitatory and inhibitory receptive field plasticity. The introduction of inhibitory plasticity in suitable recurrent networks provides a homeostatic mechanism that leads to asynchronous irregular network states. Further, it can accommodate synaptic memories with activity patterns that become indiscernible from the background state but can be reactivated by external stimuli. Our results suggest an essential role of inhibitory plasticity in the formation and maintenance of functional cortical circuitry.

Review

The authors show that, if the same input to an output neuron arrives through an excitatory and a delayed inhibitory channel, synaptic plasticity (a symmetric STDP rule) at the inhibitory synapses leads to “detailed balance”, i.e., to cancellation of excitatory and inhibitory input currents. Then, the output neuron fires sparsely and irregularly (as observed for real neurons) only when an excitatory input was not predicted by the implicit model encoded by the synaptic weights of the inhibitory inputs. The adaptation of the inhibitory synapses also matches potential changes in the excitatory synapses, although here they only present simulations in which excitatory synapses changed abruptly and stayed constant afterwards. (What happens when excitatory and inhibitory synapses change concurrently?) Finally, the authors show that similar results apply to recurrently connected networks of neurons with dedicated inhibitory neurons (balanced networks). Arbitrary activity patterns can be encoded by the excitatory connections, activity in these patterns is then suppressed by the inhibitory neurons, while partial activation of the patterns through external input reactivates the whole patterns (cf. recall of memory) without suppressing potential reactivation of other patterns in the network.
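As I understand it, the inhibitory plasticity rule is a symmetric STDP rule with a target-rate term: near-coincident pre- and postsynaptic spikes potentiate the inhibitory synapse, while presynaptic spikes alone depress it. Here is a simplified, event-based sketch of such a rule (the constants and the exact update are illustrative, not the paper’s code):

```python
import numpy as np

# symmetric inhibitory STDP with a target-rate term: coincident pre/post spikes
# potentiate the inhibitory weight, lone presynaptic spikes depress it, driving
# the postsynaptic firing rate towards a low target rate rho0
dt = 1e-4               # simulation time step (s)
eta = 1e-3              # learning rate
tau = 0.020             # STDP trace time constant (s)
rho0 = 3.0              # target postsynaptic rate (Hz)
alpha = 2 * rho0 * tau  # depression bias applied on every presynaptic spike

def run_inhibitory_stdp(pre_steps, post_steps, n_steps, w=0.1):
    pre_steps, post_steps = set(pre_steps), set(post_steps)
    x_pre = x_post = 0.0                 # low-pass filtered spike traces
    for step in range(n_steps):
        x_pre -= dt / tau * x_pre
        x_post -= dt / tau * x_post
        if step in pre_steps:
            x_pre += 1.0
            w += eta * (x_post - alpha)  # depress unless post fired recently
        if step in post_steps:
            x_post += 1.0
            w += eta * x_pre             # potentiate for recent pre spikes
        w = max(w, 0.0)                  # inhibitory weight stays non-negative
    return w

# toy usage: correlated pre/post firing strengthens the inhibitory synapse
pre = np.arange(100, 10000, 500)         # presynaptic spike times (in steps)
post = pre + 20                          # postsynaptic spikes 2 ms later
print(run_inhibitory_stdp(pre, post, n_steps=10000))
```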

These are interesting ideas, clearly presented and with very detailed supplementary information. The large number of inhibitory neurons in cortex makes the assumed pairing of excitatory and inhibitory input at least possible, but I don’t know how prevalent this really is. Another important assumption here is that the inhibitory input is a bit slower than the excitatory input. This makes intuitive sense, if you assume that the inhibitory input needs to be relayed through an additional inhibitory neuron, but I’ve seen the opposite assumption before, too.

Representational switching by dynamical reorganization of attractor structure in a network model of the prefrontal cortex.

Katori, Y., Sakamoto, K., Saito, N., Tanji, J., Mushiake, H., and Aihara, K.
PLoS Comput Biol, 7:e1002266, 2011
DOI, Google Scholar

Abstract

The prefrontal cortex (PFC) plays a crucial role in flexible cognitive behavior by representing task relevant information with its working memory. The working memory with sustained neural activity is described as a neural dynamical system composed of multiple attractors, each attractor of which corresponds to an active state of a cell assembly, representing a fragment of information. Recent studies have revealed that the PFC not only represents multiple sets of information but also switches multiple representations and transforms a set of information to another set depending on a given task context. This representational switching between different sets of information is possibly generated endogenously by flexible network dynamics but details of underlying mechanisms are unclear. Here we propose a dynamically reorganizable attractor network model based on certain internal changes in synaptic connectivity, or short-term plasticity. We construct a network model based on a spiking neuron model with dynamical synapses, which can qualitatively reproduce experimentally demonstrated representational switching in the PFC when a monkey was performing a goal-oriented action-planning task. The model holds multiple sets of information that are required for action planning before and after representational switching by reconfiguration of functional cell assemblies. Furthermore, we analyzed population dynamics of this model with a mean field model and show that the changes in cell assemblies’ configuration correspond to those in attractor structure that can be viewed as a bifurcation process of the dynamical system. This dynamical reorganization of a neural network could be a key to uncovering the mechanism of flexible information processing in the PFC.

Review

Based on the firing properties of certain prefrontal cortex neurons, the authors suggest a network model in which short-term plasticity implements switches in what the neurons in the network represent. In particular, neurons in prefrontal cortex have been found which switch from representing goals to representing actions (first, their firing varies depending on which goal is shown; then it varies depending on which action is executed afterwards, while firing equally for all goals). The authors call these representational switches and assume that they are implemented via changes in the connection strengths of neurons in a recurrently connected neural network. The network is set up such that network activity always converges to one of several fixed-point attractors. A suitable change in connection strengths then leads to a change in the attractor landscape, which may be interpreted as a change in what the network represents. The main contribution of the authors is to suggest a particular pattern of short-term plasticity at synapses in the network such that the network exhibits the desired representational switching. Another important aspect of this model is its structure: the network consists of separate cell assemblies, different subsets of which are assumed to be active when either goals or actions are represented, and the goal and action subsets are partially overlapping. For example, in their model they have four cell assemblies (A,B,C,D); the subsets (A,B) and (C,D) are associated with goals while the subsets (A,D) and (B,C) are associated with actions. Initially the network is assumed to be in the goal state, in which the connection strengths A-B and C-D are large. The presentation of one of two goals then makes the network activity converge to strong activation of (A,B) or (C,D). Synaptic depression of the connections A-B (assuming that this is the active subset) with simultaneous facilitation of the connections A-D and B-C then leads to the desired change of connection strengths, which implements the representational switch and makes either subset (A,D) or subset (B,C) the active subset. It is not entirely clear to me why only one action subset becomes active. Maybe this is what the inhibitory units in the model are for (their function is not explained by the authors). In further analysis and experiments the authors confirm the attractor landscape of the model (and how it changes), show that the timing of the representational switch can be influenced by input to the network and show that the probability of changing from a particular goal to a particular action can be manipulated by changing the number of prior connections between the corresponding cell assemblies.
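The ‘dynamical synapses’ are, as far as I can tell, of the standard short-term depression/facilitation type; for orientation, the usual formulation (my notation, not necessarily the paper’s exact equations) reads

```latex
\frac{du}{dt} = \frac{U - u}{\tau_f} + U\,(1-u)\sum_k \delta(t - t_k),
\qquad
\frac{dx}{dt} = \frac{1 - x}{\tau_d} - u\,x\sum_k \delta(t - t_k),
\qquad
w_{\mathrm{eff}}(t) = w\,u(t)\,x(t),
```

where the t_k are presynaptic spike times, u is a facilitation variable and x the fraction of available synaptic resources (depression). Whether a synapse is net depressing or net facilitating depends on the time constants \tau_d and \tau_f, which is what the assumed pattern of depression within the currently active subset and facilitation towards the action subsets relies on.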

The authors show a nice qualitative correspondence between experimental findings and simulated network behaviour (although some qualitative differences remain, too, e.g., a general increase of firing also for the non-preferred goal and action in the experimental findings). In essence, the authors present a mechanism which could implement the (seemingly) autonomous switching of representations in prefrontal cortex neurons. Whether this mechanism is used by the brain is an entirely different question. I don’t know of evidence backing the chosen special wiring of neurons and distribution of short-term plasticity, but this might just reflect my lack of knowledge of the field. Additionally, I wouldn’t exclude the possibility of a hierarchical model. The authors argue against this by presuming that prefrontal cortex should already be the top of the hierarchy, but nothing prevents us from making hierarchical models of prefrontal cortex itself. This points to a mixing of levels of description in the paper: on the one hand, the main contributions of the paper are on the algorithmic level, describing the necessary wiring in a network of a few units and how it needs to change to reproduce the behaviour observed in experiments. On the other hand, the main model is on an implementational level, showing how these ideas could be implemented in a network of leaky integrate-and-fire (LIF) neurons. In my opinion, the LIF network doesn’t add anything interesting to the paper apart from the proof that the algorithmic ideas can be implemented by such a network. On the contrary, it somewhat masks the main points of the paper by introducing an abundance of additional parameters which had to be chosen by the authors, but for which we don’t know which settings are important. Finally, I wonder how the described network is reset in order to be ready for the next trial. The problem is the following: the authors initialise the network such that the goal subsets have high synaptic efficacy at the start of a trial. The short-term plasticity then reduces these synaptic efficacies while simultaneously increasing those of the action subsets. At the end of a trial they all end up in a similar range (see Fig. 3A, bottom). In order for the network to work as expected in the next trial, it somehow needs to be reset to the initial synaptic efficacies.

Probabilistic population codes for Bayesian decision making.

Beck, J. M., Ma, W. J., Kiani, R., Hanks, T., Churchland, A. K., Roitman, J., Shadlen, M. N., Latham, P. E., and Pouget, A.
Neuron, 60:1142–1152, 2008
DOI, Google Scholar

Abstract

When making a decision, one must first accumulate evidence, often over time, and then select the appropriate action. Here, we present a neural model of decision making that can perform both evidence accumulation and action selection optimally. More specifically, we show that, given a Poisson-like distribution of spike counts, biological neural networks can accumulate evidence without loss of information through linear integration of neural activity and can select the most likely action through attractor dynamics. This holds for arbitrary correlations, any tuning curves, continuous and discrete variables, and sensory evidence whose reliability varies over time. Our model predicts that the neurons in the lateral intraparietal cortex involved in evidence accumulation encode, on every trial, a probability distribution which predicts the animal’s performance. We present experimental evidence consistent with this prediction and discuss other predictions applicable to more general settings.

Review

In this article the authors apply probabilistic population coding as presented in Ma et al. (2006) to perceptual decision making. In particular, they suggest a hierarchical network with an MT and an LIP layer in which the firing rates of MT neurons encode the current evidence for a stimulus while the firing rates of LIP neurons encode the evidence accumulated over time. Under the assumptions made, it turns out that the accumulated evidence is independent of nuisance parameters of the stimuli (when these can be interpreted as contrasts) and that LIP neurons only need to sum (integrate) the activity of MT neurons in order to represent the correct posterior of the stimulus given the history of evidence. They also suggest a readout layer implementing a line attractor which, under some conditions, reads out the maximum of the posterior.

Details

Probabilistic population coding is based on defining the likelihood of stimulus features, p(r|s,c), as an exponential family distribution of firing rates r. A crucial requirement for the central result of the paper (that LIP only needs to integrate the activity of MT) is that nuisance parameters c of the stimulus s do not occur inside the exponential, while the actual parameters of s occur only inside the exponential. This restricts the exponential family distribution to the “Poisson-like family”, as they call it, which requires that the tuning curves of the neurons and their covariance are proportional to the nuisance parameters c (the details can be read up in Ma et al., 2006). The point is that this is the case when c corresponds to the contrast, or gain, of the stimulus. For the random dot stimuli considered here, the coherence of the dots may indeed be interpreted as the contrast of the motion, in the sense that I can imagine the tuning curves of the MT neurons being multiplicatively related to the coherence of the dots.
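In symbols (my notation, following Ma et al., 2006): the Poisson-like family and the resulting posterior are

```latex
p(\mathbf{r}\mid s, c) \;=\; \phi(\mathbf{r}, c)\,\exp\!\bigl(\mathbf{h}(s)^{\mathsf T}\mathbf{r}\bigr)
\quad\Longrightarrow\quad
p(s\mid \mathbf{r}) \;\propto\; \exp\!\bigl(\mathbf{h}(s)^{\mathsf T}\mathbf{r}\bigr) \quad\text{(flat prior over } s\text{)},
```

where the crucial point is that the contrast/gain c enters only through \phi and therefore drops out of the posterior, while the kernel \mathbf{h}(s) is independent of c.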

The probabilistic model of the network activities is set up such that the firing of neurons in the network is an indirect, noisy observation of the underlying stimulus, but what we are really interested in is the posterior over the stimulus. So the question is how you can estimate this posterior from the network firing rates. The trick is that under the Poisson-like distribution the likelihood and the posterior share the same exponential, such that the posterior becomes proportional to this exponential, because the other parts of the likelihood do not depend on the stimulus s (they assume a flat prior over s, so it does not need to be considered when computing the posterior). Thus, the probability of firing in the network is determined by the likelihood while the resulting firing rates simultaneously encode the posterior. Mind-boggling. The main contribution of the authors then is to show, assuming that the firing rates of MT neurons are driven by the stimulus via the corresponding Poisson-like likelihood, that LIP neurons only need to integrate the spikes of MT neurons in order to correctly represent the posterior of the stimulus given all previous evidence (the firing of MT neurons). Notice that they also assume that LIP neurons have the same tuning curves with respect to the stimulus as MT neurons and that each LIP neuron sums the activity of the MT neuron with which it shares a tuning curve. They note that such a naive procedure, i.e., a single neuron integrating MT firing over time, would quickly saturate its activity. So they show, and that is really cool, that global inhibition in the LIP network does not affect the representation of the posterior, allowing them to prevent saturation of firing while maintaining the probabilistic interpretation.
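Under this family the integration result can be stated in one line: if the MT population responses in successive time bins are conditionally independent samples from the Poisson-like likelihood, then

```latex
p(s\mid \mathbf{r}_1,\dots,\mathbf{r}_T) \;\propto\; \prod_{t=1}^{T}\exp\!\bigl(\mathbf{h}(s)^{\mathsf T}\mathbf{r}_t\bigr)
\;=\; \exp\!\Bigl(\mathbf{h}(s)^{\mathsf T}\sum_{t=1}^{T}\mathbf{r}_t\Bigr),
```

so an LIP population that simply sums MT activity over time carries the correct posterior. As far as I understand it, global inhibition that subtracts the same amount from every LIP neuron leaves this posterior intact because the summed kernel \sum_i h_i(s) is (approximately) independent of s for the translation-invariant tuning curves assumed here.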

So much for the theory. In practice, i.e. in the experiments, the authors do something entirely different, because “these results are important, but they are based on assumptions that are not necessarily exactly true in vivo. […] It is therefore essential that we test our theory in biologically realistic networks.” Now, this is a noble aim, but what exactly do we learn about this theory if all results are obtained using methods which violate its assumptions? For example, neither the probability of firing in MT nor in LIP is Poisson-like; LIP neurons do not just integrate MT activity but are also recurrently connected; LIP neurons have local inhibition (they are leaky integrators with inhibition between LIP neurons depending on tuning properties) instead of global inhibition; and LIP neurons receive an otherwise completely unmotivated “urgency signal” whose contribution increases with time (this stems from experimental observations). Without any concrete theoretical links between the two models (I guess the main ideas are similar, but the details are very different) it has to be shown through experimental results that they behave similarly. In any case, it is hard to differentiate between contributions from the probabilistic theory and from the network implementation: how much of the fit between the experimental findings in monkeys and the behaviour of the model is due to the chosen implementation and how much is due to the probabilistic interpretation?

Results

The overall aim of the experiments / simulations in the paper is to show that the proposed probabilistic interpretation is compatible with the experimental findings in monkey LIP. The hypothesis is that LIP neurons encode the posterior of the stimulus as suggested in the theory. This hypothesis is false from the start, because some assumptions of the theory apparently don’t apply to real neurons (as acknowledged by the authors). So the new hypothesis is that LIP neurons approximately encode some posterior of the stimulus. The requirement for this posterior is that its updates should take the uncertainty of the evidence and the uncertainty of the previous posterior estimate into account, which the authors measure as a linear increase with time of the log odds of making a correct choice, log[ p(correct) / (1-p(correct)) ], with the slope of this linear increase depending on the coherence (contrast) of the stimulus. I did not follow up why the requirement is related to the log odds in this way, but it sounds ok. That leaves the question of how to estimate the log odds from simulated and real neurons. For the simulated neurons the authors approximate the likelihood with a Poisson-like distribution whose kernel (parameters) was estimated from the simulated firing rates. They argue that it is a good approximation, because linear estimates of the Fisher information appear to be sufficient (I can’t comment on the validity of this argument). A similar approximation of the posterior cannot be made for real LIP neurons, because multi-unit recordings estimating the response of the whole LIP population are lacking. Instead, the authors approximate the log odds from measured firing rates of neurons tuned to motion in directions 0 and 180 degrees via a linear regression approach described in the supplemental data.
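To see why a roughly linear increase with a coherence-dependent slope is the expected signature, one can extend the toy Poisson-like setup from above to a two-alternative case (0 vs 180 degrees) and track the log posterior odds of the correct direction; again, all numbers are invented and this is not the paper's analysis:

import numpy as np

rng = np.random.default_rng(1)
prefs = np.linspace(-np.pi, np.pi, 100, endpoint=False)

def rates(s, gain):
    return gain * np.exp(2.0 * (np.cos(prefs - s) - 1.0))

dt, n_steps, true_s = 0.01, 300, 0.0                  # motion at 0 rad; the alternative is pi
dh = 2.0 * (np.cos(prefs) - np.cos(prefs - np.pi))    # kernel difference h(0) - h(pi); the gain cancels here

for gain in (5.0, 20.0):                              # stand-ins for low and high coherence
    R, log_odds = np.zeros_like(prefs), []
    for _ in range(n_steps):
        R += rng.poisson(rates(true_s, gain) * dt)
        log_odds.append(dh @ R)                       # log p(s=0|R) - log p(s=pi|R) under a flat prior
    slope = np.polyfit(np.arange(n_steps), log_odds, 1)[0]
    print(f"gain {gain:5.1f}: log odds grow roughly linearly, slope per step = {slope:.3f}")

The slope grows with the gain because higher coherence produces more spikes per unit time, while the kernel difference itself stays fixed.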

The authors show that the log-odds computed from the simulated network exhibit the desired properties, i.e., the log-odds increase linearly with time (although there’s a kink at 50ms which supposedly is due to the discretisation) and depend on the coherence of the motion such that the slope of the log-odds also increases when coherence is increased within a trial. The corresponding log-odds of real LIP neurons are far noisier and, thus, do not allow definite judgements about linearity. Also, we don’t know whether their slopes would actually change after a change in motion coherence during a trial, as this was never tested (it’s likely, though).

In order to test whether the proposed line attractor network is sufficient to read out the maximum of the posterior in all conditions (readout time and motion coherence) the authors compare a single (global) readout with local readouts adapted to a particular pair of readout time and motion coherence. However, the authors don’t actually use attractor networks in these experiments; they note that these are equivalent to local linear estimators and use those instead. Instead of comparing the readouts from these estimators with the actual maximum of the posterior, they only compare the variance of the estimators (Fisher information), which they show to be roughly the same for the local and global estimators. From this they conclude that a single, global attractor network could read out the maximum of the (approximated) posterior. However, this only holds if the global estimator has no additional bias, which we cannot see from these results.

In an additional analysis the authors show that the model qualitatively replicates the behavioural variables (probability correct and reaction time). However, these are determined from the LIP activities in a surprisingly ad-hoc way: the decision time is determined as the time when any one of the simulated LIP neurons reaches a threshold defined on the probability of firing, and the decision is determined as the preferred direction of the neuron hitting the threshold (for the 2- and 4-choice tasks the response is determined as the quadrant of motion directions into which the preferred direction of that neuron falls). Why do the authors not use the attractor network to read out the response here? Also, the authors use a lower threshold for the 4-choice task than for the 2-choice task. This is strange, because one of the main findings of the Churchland et al. (2008) paper was that the decision in both 2- and 4-choice tasks appears to be determined by a common decision threshold while the initial firing rates of LIP neurons were lower in 4-choice tasks. Here, they also initialise with lower firing rates in the 4-choice task, but additionally choose a lower threshold. They don’t motivate this. Maybe it was necessary to fit the data from Churchland et al. (2008). This discrepancy between data and model is even more striking as the authors of the two papers partially overlap. So, do they deem the corresponding findings of Churchland et al. (2008) not important enough to be modelled, is it impossible to model them within their framework, or did they simply forget?

Finally, the build-up rates of LIP neurons also seem to be qualitatively similar in the simulation and the data, although they are consistently lower in the model. The build-up rates for the model are estimated from the first 50ms of each trial. However, the log-odds ratio had this kink at 50ms after which its slope was larger. So, if this effect is also seen directly in the firing rates, the fit of the build-up rates to the data might even improve if the probability of firing after 50ms were used. In Fig. 2C no such kink can be seen in the firing rates, but this shows data for only 2 model neurons.

Conclusion

Overall the paper is very interesting and stimulating. It is well written and full of sound theoretical results which originate from previous work of the authors. Unfortunately, biological nature does not completely fit the beautiful theory. Consequently, the authors run experiments with more plausible neural networks which only approximately implement the theory. So what conclusions can we draw from the presented results? As long as the firing of MT neurons reflects the likelihood of a stimulus (their MT network is set up in this way), probably a wide range of networks which accumulate this firing will show responses similar to real LIP neurons. Because the assumptions of the theory are violated in the more realistic networks, it is not easy to say whether this is a consequence of the theory, which states that MT firing rates only need to be summed over time in order to obtain the right posterior. It could also be that more complicated forms of accumulation are necessary for LIP firing to represent the correct posterior; simple summing would then just be a convenient approximation. Also, I don’t believe that the presented results can rule out sampling-based coding of probabilities (see Fiser et al., 2010) for decision making, as long as the sampling approach also implements some kind of accumulation procedure (think of particle filters – an implementation in a recurrent neural network would probably look quite similar).

Nevertheless, the main point of the paper is that the activity in LIP represents the full posterior and not only MAP estimates or log odds. Consequently, the model very easily extends to the case of continuous directions of motion, in contrast to previous, e.g., attractor-based, neural models. I like this idea. However, I cannot determine from the experiments whether their network actually implements the correct posterior, because all their tests yield only indirect measures based on approximate analyses. Even so, it is pretty much impossible to verify that the firing of LIP neurons fits the simulated results as long as we cannot measure the firing of a large part of the corresponding neural population in LIP.

Why don’t we use Bayesian statistics to analyse experimental data?

This paper decoder post is a little different as it doesn’t relate to a particular paper. Rather, it’s my answer to the question in the title of this post, which was triggered by a colleague of mine. The colleague has a psychology background and has just come across Bayesian statistics, when the following question crossed his mind:

Question

You do Bayesian stuff, right? Trying to learn about it now, can’t quite get my head around it yet, but it sounds like how I should be analysing data. In psychophysics we usually collect a lot of data from a small number of subjects, but then collapse all this data into a small number of points per subject for the purposes of stats. This loses quite a lot of fine detail: for instance, four steep psychometric functions with widely different means average together to create a shallow function, which is not a good representation of the data. Usually, the way psychoacousticians in particular get around this problem is not to bother with the stats. This, of course, is not optimal either! As far as I can tell the Bayesian approach to stats allows you to retain the variance (and thus the detail) from each stage of analysis, which sounds perfect for my old PhD data and for the data I’m collecting now. It also sounds like the thing to do for neuroimaging data: we collect a HUGE amount of data per subject in the scanner, but then create these extremely coarse averages, leading people to become very happy when they see something at the single-subject level. But of course all effects should REALLY be at the single-subject level, we assume they aren’t visible due to noise. So I’m wondering why everyone doesn’t employ this Bayesian approach, even in fMRI etc..

In short, my answer is twofold: 1) Bayesian statistics can be computationally very hard and, conceptually more critical, 2) choosing a prior influences the results of your statistical inference, which makes experimenters uneasy.

The following is my full answer. It contains a basic introduction to Bayesian statistics targeted at people who have just realised that it exists. I bet that a simple search for “Bayesian frequentist” brings up a lot more valuable information.

Answer

You’re right: the best way to analyse any data is to maintain the full distribution of your variables of interest throughout all analysis steps. You nicely described the reasons for this. The only problem is that this can be really hard, depending on your statistical model, i.e., your data. So you’ll need to make approximations. One way of doing this is to summarise the distribution by its mean and variance. The Gaussian distribution is so cool, because these two quantities are actually sufficient to represent the whole distribution. For other probability distributions the mean and variance are not sufficient representations, so that summarising the distribution with them is an approximation. Therefore, you could say that the standard analysis methods you mention are valid approximations in the sense that they summarise the desired distribution with its mean. Then the question becomes: can you make better approximations for the model you consider? This is where the expertise of the statistician comes into play, because what you can do really depends on the particular situation with your data. Most of the time it is impossible to come up with the right distribution analytically, but many things can actually be solved numerically on a computer these days.

Now a little clarification of what I mean by the Bayesian approach. Here’s a hypothetical example: your variable of interest, x, is whether person A is a genius. You can’t really tell directly whether a person is a genius, so you have to collect indirect evidence, y, from their behaviour (this might be the questions they ask, the answers they give, or indeed a battery of psychological tests). So x can take values 0 (no genius) and 1 (genius). Your inference will be based on a statistical model of behaviour given genius or no genius (in words: if A is a genius then with probability p(y|x=1) he will exhibit behaviour y):

p(y|x=1) and p(y|x=0).

In a frequentist (classical) approach you will make a maximum likelihood estimate for x, which ends up being a simple procedure: you sum the log-probabilities of your evidence under each hypothesis and check which sum is larger:

sum over i log(p(y_i|x=1)) > sum over i log(p(y_i|x=0)) ???

If this statement is true, you’ll believe that A is a genius. Now, the problem is that, if you only have a few pieces of evidence, you can easily make false judgements with this procedure. Bayesians therefore take one additional source of information into account: the prior probability of someone being a genius, p(x=1), which is quite low. We then get something called a maximum a posteriori estimate, in which the summed evidence for each hypothesis is additionally weighted by that hypothesis’s prior probability, leading to the following decision procedure:

log(p(x=1)) + sum over i log(p(y_i|x=1)) > log(p(x=0)) + sum over i log(p(y_i|x=0)) ???

Because p(x=1) is much smaller than p(x=0) this means that you now have to collect much more evidence where the probability of behaviour given that A is a genius, p(y_i|x=1), is larger than the probability of behaviour given that A is no genius, p(y_i|x=0), before you believe that A is a genius. In the full Bayesian approach you would actually not make a judgement, but estimate the posterior probability of A being a genius:

p(x=1|y) = p(y|x=1)p(x=1) / p(y).

This is the distribution which I said is hard to estimate above. The thing that makes it hard is p(y). In this case, where x can only take two values it is actually very easy to compute:

p(y) = p(y|x=1)p(x=1) + p(y|x=0)p(x=0)

but for each additional value that x can take you’ll have to add a term to this equation, and when x is a continuous variable this sum becomes an integral, and integration is hard.
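To make the genius example concrete, here is a tiny numerical sketch; the likelihood values, the prior and the observed behaviours are all invented for illustration:

import numpy as np

# hypothetical model: behaviour y is binary, 0 = "brilliant answer", 1 = "ordinary answer"
p_y_genius   = np.array([0.7, 0.3])    # p(y|x=1)
p_y_nogenius = np.array([0.2, 0.8])    # p(y|x=0)
prior_genius = 0.001                   # p(x=1): geniuses are rare

y = np.array([0, 0, 0, 1, 0])          # five observed behaviours, mostly brilliant

ll1 = np.log(p_y_genius[y]).sum()      # sum over i of log p(y_i|x=1)
ll0 = np.log(p_y_nogenius[y]).sum()    # sum over i of log p(y_i|x=0)
print("maximum likelihood says genius:", ll1 > ll0)

# maximum a posteriori: add the log prior once (not once per observation)
map1 = ll1 + np.log(prior_genius)
map0 = ll0 + np.log(1.0 - prior_genius)
print("maximum a posteriori says genius:", map1 > map0)

# full posterior p(x=1|y); the normaliser p(y) sums over both values of x
posterior = np.exp(map1) / (np.exp(map1) + np.exp(map0))
print("p(genius | y) =", round(float(posterior), 3))

With only five observations the likelihood favours the genius hypothesis, but the low prior keeps the posterior small; add enough brilliant behaviour and the evidence eventually overrides the prior, which is the point made further below.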

One more, but very important, thing: the technical problems aside, the biggest criticism of the Bayesian approach is the use of the prior. In my example it kept us from making a premature judgement, but only because we had a suitable estimate of the prior probability of someone being a genius. The question is: where does the prior come from? Well, it’s prior information that enters your inference. If you don’t have prior information about your variable of interest, you’ll use an uninformative prior which assigns equal probability to each value of x. Then the maximum likelihood and maximum a posteriori estimators above become equal, but what does it mean for the posterior distribution p(x|y)? It changes its interpretation. The posterior becomes an entity representing a belief about the corresponding statement (A is a genius) given the information provided by the prior. If the prior measures the true frequency of the corresponding event in the real world, the posterior is a statement about the state of the world. But if the prior has no such interpretation, the posterior is just the mentioned belief under the assumed prior. These arguments are very subtle. Think about my example. The prior could be paraphrased as the prior probability that person A is a genius. This prior cannot represent a frequency in the world, because person A exists only once in the world. So whatever we choose as prior is merely a prior belief. While frequentists often argue that the posterior does not faithfully represent the world, because of a potentially unsuitable prior, in my example the Bayesian approach allowed us to incorporate information in the inference that is inaccessible to the frequentist approach. We did this by transferring the frequency of geniuses in the whole population to our a priori belief that person A is a genius.

Note that there really is no “correct” prior in my example and any prior will correspond to a particular prior assumption. Furthermore, the frequentist maximum likelihood estimator is equivalent to a maximum a posteriori estimator with a particular (uninformative) prior. Therefore, it has been argued that the Bayesian approach just makes explicit the prior assumptions which are implicit also in the more common (frequentist) statistical analyses. Unfortunately, it seems to be a bitter pill for experimenters to swallow to admit that the statistical analysis (and thus the outcome) of their experiment depends on prior assumptions (although they appear to be happy to do this in other contexts, for example, when making Gaussian assumptions for an ANOVA). Also, remember that the prior will ultimately be overridden by sufficient evidence (even for a very low prior probability of A being a genius we’ll at some point believe that A is a genius, if A behaves accordingly). Given these considerations, the prior shouldn’t be a hindrance to using a Bayesian analysis of experimental data, but the technical issues remain.

Action understanding and active inference.

Friston, K., Mattout, J., and Kilner, J.
Biol Cybern, 104:137–160, 2011
DOI, Google Scholar

Abstract

We have suggested that the mirror-neuron system might be usefully understood as implementing Bayes-optimal perception of actions emitted by oneself or others. To substantiate this claim, we present neuronal simulations that show the same representations can prescribe motor behavior and encode motor intentions during action-observation. These simulations are based on the free-energy formulation of active inference, which is formally related to predictive coding. In this scheme, (generalised) states of the world are represented as trajectories. When these states include motor trajectories they implicitly entail intentions (future motor states). Optimizing the representation of these intentions enables predictive coding in a prospective sense. Crucially, the same generative models used to make predictions can be deployed to predict the actions of self or others by simply changing the bias or precision (i.e. attention) afforded to proprioceptive signals. We illustrate these points using simulations of handwriting to illustrate neuronally plausible generation and recognition of itinerant (wandering) motor trajectories. We then use the same simulations to produce synthetic electrophysiological responses to violations of intentional expectations. Our results affirm that a Bayes-optimal approach provides a principled framework, which accommodates current thinking about the mirror-neuron system. Furthermore, it endorses the general formulation of action as active inference.

Review

In this paper the authors try to convince the reader that the function of the mirror neuron system may be to provide amodal expectations for how an agent’s body will change, or interact with the world. In other words, they propose that the mirror neuron system represents, more or less abstract, intentions of an agent. This interpretation results from identifying the mirror neuron system with hidden states in a dynamic model within Friston’s active inference framework. I will first comment on the active inference framework and the particular model used and will then discuss the biological interpretation.

Active inference framework:

Active inference has been described by Friston elsewhere (Friston et al. PLoS One, 2009; Friston et al. Biol Cyb, 2010). Note that all variables are continuous. The main idea is that an agent maximises the likelihood of its internal model of the world, as experienced by its sensors, by (1) updating the hidden states of this model and (2) producing actions on the world. Under the Gaussian assumptions made by Friston both ways to maximise the likelihood of the model are equivalent to minimising the precision-weighted prediction errors defined in the model. In general the models can be hierarchical, but here only a single layer is used, consisting of sensory states and hidden states. The prediction errors on sensory states are simply defined as the difference between sensory observations and the sensory predictions from the model, as you would intuitively define them. The model also defines prediction errors on hidden states (*). Both types of prediction errors are used to infer hidden states (1) which explain sensory observations, but action is only produced (2) from sensory state prediction errors, because action is not part of the agent’s model and only affects the sensory observations produced by the world.

Well, actually the agent needs a whole other model for action which implements the gradient of sensory observations with respect to action, i.e., which tells the agent how sensory observations change when it exerts action. However, Friston restricts sensory observations in this context to proprioceptive observations, i.e., muscle feedback, and argues that the corresponding gradient may be sufficiently simple to learn and represent so that we don’t have to worry about it (in the simulation he just provides the gradient to the agent). Therefore, action solely tries to implement proprioceptive predictions. On the other hand, proprioceptive predictions may be coupled to predictions in other modalities (e.g. vision) through the agent’s model, which allows the agent to execute (seemingly) higher-level actions. For example, if an agent sees its hand move from a cup to a glass on a table in front of it, its generative model must also represent the corresponding proprioceptive signals. If the agent then predicts this movement of its hand in visual space, the generative model automatically predicts the corresponding proprioceptive signals, because they always accompanied the seen movement. Action then minimises the resulting precision-weighted proprioceptive prediction error and so implements the hand movement from cup to glass.

Notice that the agent minimises the *precision-weighted* prediction errors. Precision here means the inverse *prior* covariance, i.e., it is a measure of how certain the agent *expects* to be about its observations. By changing the precisions, qualitatively very different results can be obtained within the active inference framework. Indeed, here they implement the switch from action generation to action observation by heavily reducing the precision of the proprioceptive observations. This makes the agent ignore any proprioceptive prediction errors both when updating hidden states (1) and when generating action (2). This leads to an interesting prediction: when you observe an action by somebody else, you shouldn’t notice when the corresponding body part is moved externally, or alternatively, when you observe somebody else’s movement, you shouldn’t be able to move the corresponding body part yourself (in a different way than the observed one). In this strict form the prediction appears very unlikely to hold, but in a softer form, namely that you should see interference effects in these situations, you may be able to find evidence for it.
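To make the roles of the precisions tangible, here is a deliberately minimal toy of my own (one hidden state, linear Gaussian observations, no generalised coordinates, and action treated as an instantaneous gradient step); it is not Friston's simulation, only an illustration of how precision-weighted prediction errors drive perception and action, and how lowering the proprioceptive precision switches the same machinery from generating a movement to merely observing one:

import numpy as np

rng = np.random.default_rng(0)
dt, steps, goal = 0.01, 5000, 1.0     # the agent's prior expectation: "my hand is at the goal"

def simulate(pi_prop, external_force=False):
    # pi_prop: precision of proprioception; high -> the agent generates the movement itself,
    # very low (plus an external force) -> the movement is caused externally and only observed
    pi_vis, pi_prior = 1.0, 0.5
    v, mu, total_action = 0.0, 0.0, 0.0          # true hand position, estimated position, summed |action|
    for _ in range(steps):
        y_prop = v + 0.01 * rng.standard_normal()    # proprioceptive observation
        y_vis  = v + 0.01 * rng.standard_normal()    # visual observation
        e_prop, e_vis, e_prior = y_prop - mu, y_vis - mu, mu - goal
        # (1) perception: gradient descent on the precision-weighted squared prediction errors
        mu += dt * (pi_prop * e_prop + pi_vis * e_vis - pi_prior * e_prior)
        # (2) action: driven only by the proprioceptive error (gradient of y_prop w.r.t. action assumed known)
        a = -pi_prop * e_prop
        total_action += abs(a) * dt
        # world: the hand moves under the agent's action and, optionally, an external force
        a_ext = 0.3 * (goal - v) if external_force else 0.0
        v += dt * (a + a_ext)
    return round(v, 2), round(mu, 2), round(total_action, 2)

print("action generation  (hand, estimate, total action):", simulate(pi_prop=1.0))
print("action observation (hand, estimate, total action):", simulate(pi_prop=1e-4, external_force=True))

In the first case the hand ends up at the goal because the agent acted; in the second it ends up there through the external force, while the agent hardly acts at all and still infers the trajectory from vision.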

This thought also points to the general problem of finding suitable precisions: how do you strike a balance between action (2) and perception (1)? Because both are trying to reduce the same prediction errors, the agent has to trade off recognising the world as it is (1) against changing it so that it corresponds to its expectations (2). This dichotomy is not easily resolved. When asked about it, Friston usually points to empirical priors, i.e., that the agent has learnt to choose suitable precisions based on its past experience (not very helpful, if you want to know how they are chosen). I guess it’s really a question of how strongly the agent expects (wants) a certain outcome. A useful practical consideration also is that action is constrained, e.g., an agent can’t move infinitely fast, which means that enough prediction error should be left over for perceiving changes in the world (1), in particular those that are not within reach of the agent’s actions on the expected time scale.

I do not discuss the most common reservation against Friston’s free-energy principle / active inference framework (that people seem to have an intrinsic curiosity towards new things as well), because it has been covered elsewhere (John Langford’s blog, Nature Neuroscience).

Handwriting model:

In this paper the particular model used is interpreted as a model for handwriting, although neither a hand nor actual writing is modelled. Rather, a two-joint system (arm) is used where the movement of the end-effector position (tip) is designed such that it is qualitatively similar to handwriting without actually producing common letters. The dynamic model of the agent consists of two parts: (a) a stable heteroclinic channel (SHC) which produces a periodic sequence of 6 continuously changing states, and (b) a linear attractor dynamics in the joint angle space of the arm which is attracted to a rest position, but modulated by the distance of the tip to a desired point in Cartesian space which is determined by the SHC state. Thus, the agent expects that the tip of its arm moves along a sequence of 6 desired points, where the dynamics of the arm movement is determined by the linear attractor. The agent observes the joint angle positions and velocities (proprioceptive) and the Cartesian positions of the elbow joint and the tip (visual). The dynamic model of the world (implementing, so to speak, the underlying physics) lacks the SHC dynamics and only defines the linear attractor in joint space, which is modulated by action and some (unspecified) external variables which can be used to perturb the system. Interestingly, the arm is more strongly attracted to its rest position in the world model than in the agent model. The reason for this is not clear to me, but it might not be important, because action could correct for this.
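For readers unfamiliar with SHCs, here is a stripped-down sketch of the generative idea; it uses a standard Lotka–Volterra-type winnerless-competition construction and a point attractor pulling the tip directly in Cartesian space, so it omits the two-joint arm, the paper's exact equations and all of the inference machinery (target points and parameters are made up):

import numpy as np

rng = np.random.default_rng(0)
n, dt, steps, alpha = 6, 0.01, 20000, 2.0

# Lotka-Volterra connection matrix for winnerless competition: each state is
# destabilised only towards its successor, giving a repeating sequence of 6 states (the SHC)
rho = np.full((n, n), 1.5)
np.fill_diagonal(rho, 1.0)
for i in range(n):
    rho[(i + 1) % n, i] = 0.5          # the successor of state i is allowed to grow

targets = rng.uniform(-1.0, 1.0, size=(n, 2))   # made-up "via points" for the pen tip

x = np.full(n, 0.1); x[0] = 1.0        # SHC state activations
p = np.zeros(2)                         # pen tip position
visited = []
for _ in range(steps):
    x += dt * x * (alpha - rho @ x) + dt * 0.01 * np.abs(rng.standard_normal(n))
    x = np.clip(x, 1e-6, None)
    k = int(np.argmax(x))               # currently dominant SHC state selects the target
    p += dt * 2.0 * (targets[k] - p)    # linear point attractor pulls the tip towards it
    if not visited or visited[-1] != k:
        visited.append(k)

print("sequence of dominant states:", visited[:15])

The printed sequence cycles through the six states, and the tip traces a repeating “scribble” between the corresponding via points, which is the qualitative behaviour the agent's generative model expects.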

Biological interpretation:

The system is set up such that the agent model contains additional hidden states compared to the world, which may be interpreted as intentions of the agent, because they determine the order of the points that the tip moves to. In simulations the authors show that the described models within the active inference framework indeed lead to actions of the agent which implement a “writing” movement, even though the world model did not know anything about “writing” at all. This effect has already been shown in the previously mentioned publications.

What is new here is that they show that the same model can be used to observe an action without generating action at the same time. As mentioned before, they simply reduce the precision of the proprioceptive observations to achieve this. They then replay the previously recorded actions of the agent in the world by providing them via the external variables. This produces an equivalent movement of the arm in the world without any action being exerted by the agent. Instead of generating its own movement, the agent then has the task of recognising a movement executed by somebody/something else. This works because the precision of the visual observations was kept high, such that the hidden SHC states can be inferred correctly (1). The authors mention a delay before the SHC states catch up with the equivalent trajectory under action. This should not be over-interpreted, because, contrary to what is stated in the text, the initial conditions for the two simulations were not the same (see figures and code). The important argument the authors try to make here is that the same set of variables (SHC states) is equally active during action as well as action observation and, therefore, provides a potential functional explanation for activity in the mirror neuron system.

Furthermore, the authors argue that SHC states represent the intentions of the agent, or, equivalently, the intentions of the agent being observed, by noting that the desired tip positions specified by the SHC states are only (approximately) reached at a later point in time in the world. This probably results from the inertia built into the joint angle dynamics. There are probably dynamic models for which this effect disappears, but it sounds plausible to me that when one dynamic system d1 influences the parameters of another dynamic system d2 (as here), d2 first needs to catch up with the new parameter setting. So these delays would be expected for most hierarchical dynamic systems.

Another line of argument of the authors is to relate prediction errors in the model to electrophysiological (EEG) findings. This is based on Friston’s previous suggestion that superficial pyramidal cells are likely candidates for implementing prediction error units. At the same time, the activity of these cells is thought to dominate EEG signals. I cannot judge the validity of either hypothesis, although the former seems to have less experimental support than the latter. In any case, I find the corresponding arguments in this paper quite weak. The problem is that results from exactly one run with one particular setting of parameters of one particular model are used to make very general statements based on a mere qualitative fit of parts of the data to general experimental findings. In other words, I’m not confident that similar (desired) patterns would be seen in the prediction errors if other settings of the precisions, or of the parameters of the dynamical systems, were chosen.

Conclusion:

The authors suggest how the mirror neuron system can be understood within Friston’s active inference framework. These conceptual considerations make sense. In general, the active inference framework provides large explanatory power and many phenomena may be understood in its context. However, in my view it is an entirely open question how the functional considerations of the active inference framework may be implemented in neurobiological substrate. The superficial arguments based on the prediction errors generated by the model, which are presented in the paper, are not convincing. More evidence needs to be found which robustly links variables in an active inference model with neuroscientific measurements.

But also conceptually it is not clear whether the active inference solution correctly describes the computations of the brain. On the one hand, it potentially explains many important and otherwise disparate phenomena under a common principle (e.g. perception, action, learning, computing with noise, dynamics, internal models, prediction; this paper adds action understanding). On the other hand, we don’t know whether all brain functions actually follow a common principle and whether functionally equivalent solutions for subsets of phenomena may be better descriptions of the underlying computations.

An important issue for future studies which aim to discern these possibilities is that active inference is a general framework which needs to be instantiated with a particular model before its properties can be compared to experimental data. However, little is known about the kind of hierarchical, dynamic, functional models themselves which must serve as generative models for active inference. As in this paper, it is then hard to discern the properties of the chosen model from the properties imposed by the active inference framework. Therefore, great care has to be taken in the interpretation of corresponding results, but it would be exciting to learn which properties of the active inference framework are crucial in brain function and which would need to be added, adapted, or dropped in a faithful description of (subsets of) brain function.

(*) Hidden state prediction errors result from Friston’s special treatment of dynamical systems by extending states by their temporal derivatives to obtain generalised states which represent a local trajectory of the states through time. The hidden state prediction errors, thus, can be seen, intuitively, as the difference between the velocity of the (previously inferred) hidden states as represented by the trajectory in generalised coordinates and the velocity predicted by the dynamic model.
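In symbols (my paraphrase, not necessarily Friston's exact notation): writing \tilde{x} for the generalised hidden states, \tilde{v} for the generalised causes, and D for the operator that shifts each component of \tilde{x} to its temporal derivative, the hidden state prediction error is roughly

\epsilon_x = D\tilde{x} - f(\tilde{x}, \tilde{v}),

i.e. the velocity implied by the represented trajectory minus the velocity predicted by the dynamic model f.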

Information Theory of Decisions and Actions.

Tishby, N. and Polani, D.
in: Perception-Action Cycle, Springer New York, pp. 601–636, 2011
URL, Google Scholar

Abstract

The perception–action cycle is often defined as “the circular flow of information between an organism and its environment in the course of a sensory guided sequence of actions towards a goal” (Fuster, Neuron 30:319–333, 2001; International Journal of Psychophysiology 60(2):125–132, 2006). The question we address in this chapter is in what sense this “flow of information” can be described by Shannon’s measures of information introduced in his mathematical theory of communication. We provide an affirmative answer to this question using an intriguing analogy between Shannon’s classical model of communication and the perception–action cycle. In particular, decision and action sequences turn out to be directly analogous to codes in communication, and their complexity – the minimal number of (binary) decisions required for reaching a goal – directly bounded by information measures, as in communication. This analogy allows us to extend the standard reinforcement learning framework. The latter considers the future expected reward in the course of a behaviour sequence towards a goal (value-to-go). Here, we additionally incorporate a measure of information associated with this sequence: the cumulated information processing cost or bandwidth required to specify the future decision and action sequence (information-to-go). Using a graphical model, we derive a recursive Bellman optimality equation for information measures, in analogy to reinforcement learning; from this, we obtain new algorithms for calculating the optimal trade-off between the value-to-go and the required information-to-go, unifying the ideas behind the Bellman and the Blahut–Arimoto iterations. This trade-off between value-to-go and information-to-go provides a complete analogy with the compression–distortion trade-off in source coding. The present new formulation connects seemingly unrelated optimization problems. The algorithm is demonstrated on grid world examples.

Review

Peter Dayan pointed me to this paper (which is actually a book chapter) when I told him that I find the continuous interaction between perception and action important and that Friston’s free energy framework is one of the few which covers this case. Now, this paper covers only discrete time (and states and actions), but certainly it addresses the issue that perception and action influence each other.

The main idea of the paper is to take the informational effort (they call it information-to-go) into account when finding a policy for a Markov decision process. A central finding is a recursive equation analogous to the (Bellman) equation for the Q-function in reinforcement learning which captures the expected (over all possible future state-action trajectories) informational effort of a certain state-action pair. Informational effort is defined as the KL-divergence between a factorising prior distribution over future states and actions (making them independent across time) and their true distribution. This means that the informational effort is the expected number of bits of information that you have to consider in addition to your prior when moving through the future. They then propose a free energy (also a recursive equation) which combines the informational effort with the Q-function of the underlying MDP and thus allows simultaneous optimisation of informational effort and reward where the two are traded off against each other.
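Rendering their verbal definition in formulas (my notation, and only a paraphrase of the chapter's equations): with \hat{p} and \hat{\pi} the factorising priors over future states and actions, the information-to-go of a state–action pair is something like

I_{\mathrm{to\text{-}go}}(s_t, a_t) = \mathbb{E}\!\left[ \log \frac{p(s_{t+1}, a_{t+1}, \ldots, s_T, a_T \mid s_t, a_t)}{\prod_{\tau > t} \hat{p}(s_\tau)\, \hat{\pi}(a_\tau)} \right],

with the expectation taken over the true distribution of future state–action trajectories, which is exactly a KL divergence between that distribution and the factorising prior.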

Practically, this leads to “soft vs. sharp policies”: sharp policies which always choose the action with the highest expected reward, and soft policies which choose actions probabilistically, with an associated penalty on reward compared to sharp policies. The softness of the resulting policy is controlled by the tradeoff parameter between informational effort and reward, which can be interpreted as the informational capacity of the system under consideration. I understand it this way: the tradeoff parameter stands for the informational complexity (capacity) of the distributions representing the agent’s internal model of the world, and the optimal policy for a particular setting of the tradeoff parameter is the best policy, with respect to reward alone, that a corresponding agent can achieve. This is easily seen when considering that the informational effort depends on the prior over future state-action trajectories. For a given prior, tradeoff parameter and resulting policy you can find the corresponding more complex prior for which the same policy is obtained at zero informational effort. The prior here obviously corresponds to the internal model of the agent. Consequently, the authors present a general framework with which you can ask questions such as: “How much informational capacity does my agent need to solve a given task with a desired level of performance?” Or, in other words: “How complex does my agent need to be in order to solve the given task?” Or: “How well can my agent solve the given task?”, although this last question is the standard question in RL. In particular, my intuition tells me that for every setting of the tradeoff parameter there probably is an equivalent POMDP formulation (which makes the corresponding difference between world and agent model explicit).
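The soft-vs-sharp distinction can be illustrated with a one-state toy example. If the policy takes a softmax form, pi(a) proportional to prior(a) * exp(beta * Q(a)), with beta the tradeoff parameter and prior the prior policy (this closed form is my simplification in the spirit of KL-control, not the chapter's full Bellman/Blahut–Arimoto iteration), then small beta recovers the prior (soft, informationally cheap) and large beta the greedy policy (sharp, informationally expensive):

import numpy as np

Q = np.array([1.0, 0.8, 0.1])           # made-up action values for a single state
prior = np.ones_like(Q) / Q.size        # uninformative prior policy

def soft_policy(beta):
    w = prior * np.exp(beta * Q)        # softmax of the values against the prior policy
    return w / w.sum()

for beta in (0.1, 1.0, 10.0, 100.0):
    pi = soft_policy(beta)
    reward = pi @ Q                                  # expected reward under the policy
    info = np.sum(pi * np.log(pi / prior))           # KL(pi || prior) in nats, the informational cost
    print(f"beta={beta:6.1f}  policy={np.round(pi, 3)}  reward={reward:.3f}  info={info:.3f}")

As beta grows, the expected reward and the informational cost rise together, which is exactly the tradeoff the chapter formalises.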

A particularly interesting discussion is that about “perfectly adapted environments”, which seems to be directed at Friston, though without mentioning him. The discussion results from the ability to optimise their free energy, which combines informational effort and reward, not only with respect to the policy, but also with respect to the (true) transition probabilities. The outcome of such an optimisation is an environment in which transition probabilities are directly related to rewards, or, in other words, an environment in which informational effort is equal to something like negative reward. In such an environment “minimizing the statistical surprise or maximizing the predictive information is equivalent to maximizing reward”, which is what Friston argues (see also the associated discussion on hunch.net). Needless to say, they consider this a very special case, while in most other cases the environment contains information that is irrelevant in terms of reward. Nevertheless, they consider the possibility that the environments of living organisms are indeed perfectly or at least well adapted through millions of years of coevolution, and they suggest directing future research towards this issue. The question really is: what is reward in this general sense? What is it that living organisms try to achieve? The more concrete the reward is, for example, reward for a particular task, the less relevant most information in the environment will be. I’m tempted to say that the combined optimisation of informational effort and reward, as presented here, will then lead to policies which particularly seek out relevant information, but I’m not sure whether this is a correct interpretation.

To sum up, Tishby and Polani present a new theoretical framework which generalises reinforcement learning by incorporating ideas from information theory. They provide an interesting new perspective which is presented in a pleasingly accessible way. I do not think that they solved any particular problem in reinforcement learning, but they broadened the view by postulating that agents trade off informational effort (capacity?) and reward. Practically, computations derived from their framework may not be feasible in most cases, because original reinforcement learning is already hard and here a few more expectations have been added. Or maybe it’s not so bad, because you can do them together.