Paper Decoder

The Influence of Spatiotemporal Structure of Noisy Stimuli in Decision Making.

Insabato, A., Dempere-Marco, L., Pannunzi, M., Deco, G., and Romo, R.
PLoS Comput Biol, 10:e1003492, 2014


Decision making is a process of utmost importance in our daily lives, the study of which has been receiving notable attention for decades. Nevertheless, the neural mechanisms underlying decision making are still not fully understood. Computational modeling has revealed itself as a valuable asset to address some of the fundamental questions. Biophysically plausible models, in particular, are useful in bridging the different levels of description that experimental studies provide, from the neural spiking activity recorded at the cellular level to the performance reported at the behavioral level. In this article, we have reviewed some of the recent progress made in the understanding of the neural mechanisms that underlie decision making. We have performed a critical evaluation of the available results and address, from a computational perspective, aspects of both experimentation and modeling that so far have eluded comprehension. To guide the discussion, we have selected a central theme which revolves around the following question: how does the spatiotemporal structure of sensory stimuli affect the perceptual decision-making process? This question is a timely one as several issues that still remain unresolved stem from this central theme. These include: (i) the role of spatiotemporal input fluctuations in perceptual decision making, (ii) how to extend the current results and models derived from two-alternative choice studies to scenarios with multiple competing evidences, and (iii) to establish whether different types of spatiotemporal input fluctuations affect decision-making outcomes in distinctive ways. And although we have restricted our discussion mostly to visual decisions, our main conclusions are arguably generalizable; hence, their possible extension to other sensory modalities is one of the points in our discussion.


They review previous findings about perceptual decision making from a computational perspective, mostly related to attractor models of decision making. The focus here, however, is how the noisy stimulus influences the decision. They mostly restrict themselves to experiments with random dot motion, because these provided the most relevant results for their discussion, which mainly included three points: 1) specifics of decision input in decisions with multiple alternatives, 2) the relation of the activity of sensory neurons to decisions (cf. CP – choice probability) and 3) in what way sensory neurons reflect fluctuations of the particular stimulus. See also the first paragraph of Final Remarks for a summary, but note that I have made slightly different points. Their 3rd point derives from mine by applying mine to the specifics of the random dot motion stimuli. In particular, they suggest investigating to what extent different definitions of spatial noise in the random dot stimulus affect decisions differently.

With 2) they discuss the interesting finding that the activity of sensory neurons alone can, to some extent, predict final decisions even when the evidence in the stimulus does not favour any decision alternative. So where does the variance in sensory neurons come from which eventually leads to a decision? Obviously, it could come from the stimulus itself. It has been found, however, that the ratio of variance to mean activity is the same when computed over trials with different stimuli compared to when computed over trials in which exactly the same stimulus with a particular realisation of noise was repeated. You would like to see a reduction of variance when the same stimulus is repeated, but it’s not there. I’m unsure, though, whether this is the correct interpretation of the variance-to-mean ratio. I would have to check the original papers by Britten (Britten, 1993 and Britten, 1996). The seemingly constant variance of sensory neuron activity suggests that the particular noise realisation of a random dot stimulus does not affect decisions. Rather, the intrinsic activity of sensory neurons drives decisions in the case of no clear evidence. The authors argue that this is not a complete description of the situation, because it has also been found that you can see an effect of the particular stimulus on the variance of sensory neuron activity when considering small time windows instead of the whole trial. Unfortunately, the argument is mostly based on results presented in an SfN meeting abstract from 2012. I wonder why there is no corresponding paper.
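The variance-to-mean ratio discussed above (often called the Fano factor) can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the analysis from the Britten papers: spike counts are assumed to be Poisson, and `base_rate` and `stim_sd` are made-up values for the mean spike count and for the rate modulation caused by the particular noise realisation.

```python
import numpy as np

def fano_factor(counts):
    """Variance-to-mean ratio of spike counts across trials."""
    counts = np.asarray(counts, dtype=float)
    return counts.var(ddof=1) / counts.mean()

rng = np.random.default_rng(0)
n_trials = 20000
base_rate = 20.0   # hypothetical mean spike count per trial
stim_sd = 2.0      # hypothetical rate modulation due to the noise realisation

# Same noise realisation on every trial: the underlying rate is fixed.
repeated = rng.poisson(base_rate, size=n_trials)

# New noise realisation on every trial: the underlying rate varies too.
rates = np.clip(base_rate + stim_sd * rng.standard_normal(n_trials), 0.0, None)
varied = rng.poisson(rates)

ff_repeated = fano_factor(repeated)
ff_varied = fano_factor(varied)
print(f"repeated stimulus: {ff_repeated:.3f}, varied stimuli: {ff_varied:.3f}")
```

In this toy setting the ratio for varied stimuli exceeds the one for the repeated stimulus by roughly `stim_sd**2 / base_rate`, so a small stimulus-driven modulation would be hard to detect against the intrinsic variability.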

Probabilistic reasoning by neurons.

Yang, T. and Shadlen, M. N.
Nature, 447:1075–1080, 2007


Our brains allow us to reason about alternatives and to make choices that are likely to pay off. Often there is no one correct answer, but instead one that is favoured simply because it is more likely to lead to reward. A variety of probabilistic classification tasks probe the covert strategies that humans use to decide among alternatives based on evidence that bears only probabilistically on outcome. Here we show that rhesus monkeys can also achieve such reasoning. We have trained two monkeys to choose between a pair of coloured targets after viewing four shapes, shown sequentially, that governed the probability that one of the targets would furnish reward. Monkeys learned to combine probabilistic information from the shape combinations. Moreover, neurons in the parietal cortex reveal the addition and subtraction of probabilistic quantities that underlie decision-making on this task.


The authors argue that the brain reasons probabilistically, because they find that single neuron responses (firing rates) correlate with a measure of probabilistic evidence derived from the probabilistic task setup. It is certainly true that the monkeys could learn the task (a variant of the weather prediction task) and I also find the evidence presented in the paper generally compelling, but the authors note themselves that similar correlations with firing rate may result from other quantitative measures with properties similar to those of the one considered here. Might firing rates, for example, correlate similarly with a measure of the expected value of a shape combination as derived from a reinforcement learning model?

What did they do in detail? They trained monkeys on a task in which they had to predict which of two targets would be rewarded based on a set of four shapes presented on the screen. Each shape contributed a certain weight to the probability of rewarding a target as defined by the experimenters. The monkeys had to learn these weights. Then they also had to learn (implicitly) how the weights of shapes are combined to produce the probability of reward. After about 130,000 trials the monkeys were good enough to be tested. The trick in the experiment was that the four shapes were not presented simultaneously, but appeared one after the other. The question was whether neurons in the lateral intraparietal (LIP) area of the monkeys’ brains would represent the updated probabilities of reward after addition of each new shape within a trial. That the neurons would do that was hypothesised, because results from previous experiments suggested (see Gold & Shadlen, 2007 for review) that neurons in LIP represent accumulated evidence in a perceptual decision making paradigm.

Now Shadlen seems convinced that these neurons do not directly represent the relevant probabilities, but rather represent the log likelihood ratio (logLR) of one choice option over the other (see, e.g., Gold & Shadlen, 2001 and Shadlen et al., 2008). Hence, these ‘posterior’ probabilities play no role in the paper. Instead, all results are obtained for the logLR. Funnily, the task is defined solely in terms of the posterior probability of reward for a particular combination of four shapes and the logLR needs to be computed from the posterior probabilities (Yang & Shadlen don’t lay out this detail in the paper or the supplementary information). I’m more open to a direct representation of posterior probabilities and I wondered what the correlation with logLR would look like if the firing rates represented posterior probabilities. This is easy to simulate in Matlab (see Yang2007.m). Such a simulation shows that, as a function of logLR, the firing rate (representing posterior probabilities) should follow a sigmoid function. Compare this prediction to Figures 2c and 3b for epoch 4. Such a sigmoidal relationship derives from the boundedness of the posterior probabilities, which is obviously reflected in the firing rates of neurons, as they cannot drop or rise indefinitely. So there could be simple reasons for the boundedness of firing rates other than that they represent probabilities, but in any case it appears unlikely that they represent unbounded log likelihood ratios.
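The sigmoidal prediction can be reproduced in a few lines. The following is not the Yang2007.m script mentioned above, just a minimal Python sketch of the same idea. It assumes equal prior probability for the two targets (so the posterior odds equal the likelihood ratio), a base-10 logarithm for the logLR, and a firing rate that is simply proportional to the posterior; `max_rate` is a made-up scaling.

```python
import numpy as np

def posterior_from_loglr(log_lr, base=10.0):
    """Posterior probability of one target given the log likelihood ratio.

    Assumes equal priors for both targets, so that the posterior odds
    equal the likelihood ratio: p / (1 - p) = base ** log_lr.
    """
    log_lr = np.asarray(log_lr, dtype=float)
    return 1.0 / (1.0 + base ** (-log_lr))

log_lr = np.linspace(-4.0, 4.0, 9)
max_rate = 50.0  # hypothetical peak firing rate in spikes/s
rate = max_rate * posterior_from_loglr(log_lr)

# The rate saturates for large |logLR| and is steepest around logLR = 0.
for value, r in zip(log_lr, rate):
    print(f"logLR = {value:+.1f}  ->  rate = {r:6.2f}")
```

Under these assumptions the firing rate is a logistic function of the logLR, which is the sigmoid shape referred to above.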

The Cost of Accumulating Evidence in Perceptual Decision Making.

Drugowitsch, J., Moreno-Bote, R., Churchland, A. K., Shadlen, M. N., and Pouget, A.
The Journal of Neuroscience, 32:3612–3628, 2012


Decision making often involves the accumulation of information over time, but acquiring information typically comes at a cost. Little is known about the cost incurred by animals and humans for acquiring additional information from sensory variables due, for instance, to attentional efforts. Through a novel integration of diffusion models and dynamic programming, we were able to estimate the cost of making additional observations per unit of time from two monkeys and six humans in a reaction time (RT) random-dot motion discrimination task. Surprisingly, we find that the cost is neither zero nor constant over time, but for the animals and humans features a brief period in which it is constant but increases thereafter. In addition, we show that our theory accurately matches the observed reaction time distributions for each stimulus condition, the time-dependent choice accuracy both conditional on stimulus strength and independent of it, and choice accuracy and mean reaction times as a function of stimulus strength. The theory also correctly predicts that urgency signals in the brain should be independent of the difficulty, or stimulus strength, at each trial.


The authors show equivalence between a probabilistic and a diffusion model of perceptual decision making and consequently explain experimentally observed behaviour in the random dot motion task in terms of varying bounds in the diffusion model which correspond to varying costs in the probabilistic model. Here, I discuss their model in detail and outline its limits. My main worry with the presented model is that it may be too powerful to have real explanatory power. Impatient readers may want to skip to the conclusion below.

Perceptual model

The presented model is tailored to the two-alternative forced-choice random dot motion task. The fundamental assumption of the model is that at each point in discrete time or, equivalently, for each successive time period in continuous time, the perceptual process of the decision maker produces an independent sample of evidence whose mean, mu*dt, reflects the strength (coherence) and direction (only through the sign of the evidence) of random dot motion while its variance, sigma^2, reflects the passage of time (sigma^2 = dt, the time period between observations). This definition of the input to the decision model as independent samples of motion strength in either one of two (unspecified) directions restricts the model to two decision alternatives. Consequently, the presented model applies neither to more than two alternatives nor to dependent samples.
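The generative assumption above is easy to write down as code. This is a minimal sketch, assuming arbitrary values for the time step `dt` and the mean evidence `mu`; the function name `momentary_evidence` is my own and does not come from the paper.

```python
import numpy as np

def momentary_evidence(mu, dt, n_steps, rng):
    """Independent Gaussian samples with mean mu*dt and variance dt."""
    return mu * dt + np.sqrt(dt) * rng.standard_normal(n_steps)

rng = np.random.default_rng(1)
dt = 0.001   # time between observations in seconds (hypothetical)
mu = 2.0     # mean evidence per second; its sign codes motion direction

dx = momentary_evidence(mu, dt, n_steps=200000, rng=rng)
x = np.cumsum(dx)  # accumulated evidence x(t), the diffusion state

# Empirical drift and diffusion per unit time should be close to mu and 1.
print(dx.mean() / dt, dx.var() / dt)
```

The cumulative sum `x` is the state of the diffusion process that the rest of the model operates on.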

The model of noisy, momentary evidence corresponds to a Wiener process with drift, which is exactly what standard (drift) diffusion models of perceptual decision making are, where drift is equal to mu and diffusion is equal to sigma^2. You could wonder why sigma^2 is exactly equal to dt and not larger, or smaller, but this is controlled by setting the mean evidence mu to an appropriate level by allowing it to scale: mu = k*c, where k is an arbitrary scaling constant which is fit to data and c is the random dot coherence in the current trial. Therefore, by controlling k you essentially control the signal-to-noise ratio in the model of the experiment and you would get equivalent results if you changed sigma^2 while fixing mu = c. The difference between the two cases is purely conceptual: In the former case you assume that the neuronal population in MT signals, on average, a scaled motion strength where the scaling may be different for different subjects, but signal variance is the same over subjects, while in the latter case you assume that the MT signal, on average, corresponds to motion strength directly, but MT signal variance varies across subjects. Personally, I prefer the latter.
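The equivalence of the two parameterisations can be checked numerically. The sketch below assumes a simple diffusion with constant symmetric bounds; the values of `k`, `c` and `bound` are made up. Dividing the diffusion state by k turns drift k*c with unit noise into drift c with noise standard deviation 1/k, and the bound has to be divided by k as well, which is harmless because the bound is a free parameter anyway.

```python
import numpy as np

def first_passage(drift, sigma, bound, noise, dt):
    """Return (choice, time) of the first crossing of +/- bound.

    Returns (0, None) if the bound is not reached within the trial.
    """
    x = np.cumsum(drift * dt + sigma * np.sqrt(dt) * noise)
    hit = np.flatnonzero(np.abs(x) >= bound)
    if hit.size == 0:
        return 0, None
    i = hit[0]
    return int(np.sign(x[i])), (i + 1) * dt

rng = np.random.default_rng(2)
dt, n_steps, n_trials = 0.005, 2000, 200
c = 0.128     # coherence of the trial
k = 8.0       # hypothetical scaling constant
bound = 1.0   # hypothetical bound height

same = 0
for _ in range(n_trials):
    noise = rng.standard_normal(n_steps)  # shared noise for both versions
    scaled_mean = first_passage(k * c, 1.0, bound, noise, dt)
    scaled_noise = first_passage(c, 1.0 / k, bound / k, noise, dt)
    same += scaled_mean == scaled_noise

print(f"{same} of {n_trials} trials give identical choice and decision time")
```

Because both versions see the same noise, every trial yields the same choice and decision time, so the two parameterisations cannot be distinguished from behaviour alone.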

The decision circuit in the authors’ model takes the samples of momentary evidence as described above and computes a posterior belief over the two considered alternatives (motion directions). This posterior belief depends on the posterior probability distribution over mean motion strengths mu, which is computed from the samples of momentary evidence taking a prior distribution over motion strengths into account. An important assumption in the computation of the posterior is that the decision maker (or decision circuit) has a perfect model of how the samples of momentary evidence are generated (a Gaussian with mean mu*dt and variance dt). If, for example, the decision maker assumed a slightly different variance, that would also explain differences in mean accuracy and decision times. The assumption of the perfect model, however, allows the authors to assert that the experimentally observed fraction of correct choices at a time t is equal to the internal belief of the decision maker (subject) that the chosen alternative is the correct one. This is important, because only with an estimate of this internal belief can the authors later infer the time-varying waiting costs for the subject (see below).

Anyway, under the given model the authors show that for a Gaussian prior you obtain a Gaussian posterior over motion strength mu (Eq. 4) and for a discrete prior you obtain a corresponding discrete posterior (Eq. 7). Importantly, the parameters of the posteriors can be formulated as functions of the current state x(t) of the sample-generating diffusion process and elapsed time t. Consequently, the posterior belief over decision alternatives can also be formulated as a one-to-one, i.e., invertible, function of the diffusion state (for a given time t). Through this connection, the authors have shown that, under an appropriate transformation, decisions based on the posterior belief are equivalent to decisions based on the (accumulated) diffusion state x(t) set in relation to elapsed time t.
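For the Gaussian case the mapping between diffusion state and belief can be sketched explicitly. The code below is derived from the generative model stated above (samples with mean mu*dt and variance dt) and is not copied from the paper's equations. It additionally assumes a zero-mean Gaussian prior over mu with variance `prior_var`; all function names are my own.

```python
import math
from statistics import NormalDist

def posterior_over_mu(x, t, prior_var):
    """Mean and variance of the Gaussian posterior over mu.

    Assumes a zero-mean Gaussian prior with variance prior_var and
    evidence samples accumulated to the state x after time t.
    """
    precision = 1.0 / prior_var + t
    return x / precision, 1.0 / precision

def belief_positive(x, t, prior_var):
    """Posterior probability that mu > 0 (belief in one direction)."""
    mean, var = posterior_over_mu(x, t, prior_var)
    return NormalDist().cdf(mean / math.sqrt(var))

def state_for_belief(g, t, prior_var):
    """Inverse mapping: the diffusion state x that produces belief g at time t."""
    return NormalDist().inv_cdf(g) * math.sqrt(1.0 / prior_var + t)

# Round trip: a belief of 0.8 at t = 0.5 s maps to a state and back.
x_needed = state_for_belief(0.8, t=0.5, prior_var=1.0)
print(x_needed, belief_positive(x_needed, t=0.5, prior_var=1.0))
```

In this sketch the belief is a monotonic function of x for fixed t, which is what makes a bound on the belief interchangeable with a (time-dependent) bound on the diffusion state.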

In summary, the probabilistic perceptual decision model of the authors simply estimates the motion strength from the samples and then decides whether the estimate is positive or negative. Furthermore, this procedure is equivalent to accumulating the samples and deciding whether the accumulated state is very positive or very negative (as determined by hitting a bound). The described diffusion model has been used before to fit accuracies and mean reaction times of subjects, but apparently it was never quite good at fitting the full reaction time distribution (note that it lacks the extensions of the drift diffusion models suggested by Ratcliff, see, e.g., [1]). So here the authors extend the diffusion model by adding time-varying bounds which can be interpreted in the probabilistic model as a time-varying cost of waiting for more samples.

Time-varying bounds and costs

Intuitively, introducing a time-varying bound in a diffusion model introduces great flexibility in shaping the response accuracy and timing at any given time point. However, I currently do not have a good idea of just how flexible the model becomes. For example, if in discrete time changing the bound at each time step could independently modify the accuracy and reaction time distribution at this time step, the bound alone could explain the data. I don’t believe that this extreme case is true, but I would like to know how close you would come. In any case, it appears to be sensible to restrict how much the bound can vary to prevent overfitting of the data, or indeed to prevent making the other model parameters obsolete. In the present paper, the authors control the shape of the bound by using a function made of cosine basis functions. Although this restricts the bound to be a smooth function of time, it still allows considerable flexibility. The authors use two more approaches to control the flexibility of the bound. One is to constrain the bound to be the same for all coherences, meaning that it cannot be used to explain differences between coherences (experimental conditions). The other is to use Bayesian methods for fitting the data. On the one hand, this controls the bound by choosing particular priors. They do this by only considering parameter values in a restricted range, but I do not know how wide or narrow this range is in practice. On the other hand, the Bayesian approach leads to posterior distributions over parameters, which means that subsequent analyses can take the uncertainty over parameters into account (see, e.g., the indicated uncertainty over the inferred bound in Fig. 5A). Although I retain some doubts about whether the bound was too flexible, I believe that this is not a big issue here.
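To get a feeling for what a time-varying bound does, one can simulate the diffusion with two different bounds on identical noise. This is a rough sketch, not the authors' fitting procedure: the drift, the time grid and in particular the piecewise-linear `collapsing_bound` are made-up stand-ins for the cosine-basis bound used in the paper.

```python
import numpy as np

def simulate_ddm(mu, bound_fn, dt=0.005, t_max=3.0, n_trials=2000, seed=0):
    """Simulate a unit-variance diffusion with bounds at +/- bound_fn(t).

    Returns choices (+1, -1, or 0 if no bound was reached) and
    decision times (nan if no bound was reached).
    """
    rng = np.random.default_rng(seed)
    n_steps = int(round(t_max / dt))
    t = dt * np.arange(1, n_steps + 1)
    steps = mu * dt + np.sqrt(dt) * rng.standard_normal((n_trials, n_steps))
    x = np.cumsum(steps, axis=1)
    crossed = np.abs(x) >= bound_fn(t)     # bound is broadcast over trials
    any_hit = crossed.any(axis=1)
    first = crossed.argmax(axis=1)         # index of the first crossing
    rows = np.arange(n_trials)
    choices = np.where(any_hit, np.sign(x[rows, first]), 0.0)
    times = np.where(any_hit, t[first], np.nan)
    return choices, times

def constant_bound(t):
    return np.full_like(t, 1.0)

def collapsing_bound(t):
    # Hypothetical shape: constant for 0.5 s, then linear collapse to 0 at 3 s.
    return np.clip(1.0 - (t - 0.5) / 2.5, 0.0, 1.0)

mu = 1.0  # hypothetical drift
for name, fn in [("constant", constant_bound), ("collapsing", collapsing_bound)]:
    choices, times = simulate_ddm(mu, fn)
    decided = choices != 0
    accuracy = (choices[decided] > 0).mean()
    print(f"{name:10s} accuracy = {accuracy:.3f}, mean RT = {np.nanmean(times):.3f} s")
```

Because both runs use the same seed, the collapsing bound can only shorten each individual decision; how it reshapes the tail of the reaction time distribution and the accuracy of late decisions depends entirely on the chosen bound shape, which is the flexibility discussed above.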

It is, however, a different question whether the time-varying bound is a good explanation for the observed behaviour in contrast, e.g., to the extensions of the diffusion model introduced by Ratcliff (mostly trial-by-trial parameter variability). There, one might refer to the second, decision-related part of the presented model which considers the rewards and costs associated with decisions. In the Bayesian decision model presented in the paper the subject decides at each time step whether to select alternative 1, or alternative 2, or wait for more evidence in the next time step. This mechanism was already mentioned in [2]. Choosing an alternative will either lead to a reward (correct answer) or punishment (error), but waiting is also associated with a cost which may change throughout the trial. Deciding for the optimal course of action which maximises reward per unit time then is an average-reward reinforcement learning problem which the authors solve using dynamic programming. For a particular setting of reward, punishment and waiting costs this can be translated into an equivalent time-varying bound. More importantly, the procedure can be reversed such that the time-varying cost can be inferred from a bound that had been fitted to data. Apart from the bound, however, the estimate of the cost also depends on the reward/punishment setting and on an estimate of choice accuracy at each time step. Note that the latter differs considerably from the overall accuracy which is usually used to fit diffusion models and requires more data, especially when the error rate is low.

The Bayesian decision model, therefore, allows the time-varying bound to be translated into a time-varying cost, which then provides an explanation of the particular shape of the reaction time distribution (and accuracy) in terms of the intrinsic motivation (negative cost) of the subject to wait for more evidence. Notice that this intrinsic motivation is really just a value describing how much somebody (dis-)likes to wait and it can no longer be interpreted in terms of trying to be better in the task, because all these components have been taken care of by other parts of the decision model. So what does it mean when a subject likes to wait for new evidence just for the sake of it (cf. dip in cost at beginning of trial in human data in Fig. 8)? I don’t know.

Collapsing bounds as found from behavioural data in this paper have been associated with an urgency signal in neural data which drives firing rates of all decision neurons towards a bound at the end of a trial irrespective of the input / evidence. This has been interpreted as a response of the subjects to the approaching deadline (end of trial) that they do not want to miss. The explanation in terms of a waiting cost which rises towards the end of a trial suggests that subjects just have a built-in desire to make (potentially arbitrary) choices before a deadline. To me, this is rather unintuitive. If you’re not punished for making a wrong choice (blue lines in Figs. 7 and 8, but note that there was a small time-punishment in the human experiment) shouldn’t it be always beneficial to make a choice before the deadline, because you trade uncertain reward against certain no reward? This would already be able to explain the urgency signal without consideration of a waiting cost. So why do we see one anyway? It may just all depend on the particular setting of reward and punishment for correct choices and errors, respectively. The authors present different inferred waiting costs with varying amounts of punishment and argue that the results are qualitatively equal, but the three different values of punishment they present hardly exhaust the range of values that could be assumed. Also, they did not vary the amount of reward given for correct choices, but it is likely that only the difference between reward and punishment determines the behaviour of the model such that it doesn’t matter whether you change reward or punishment to explore model predictions.


The main contribution of the paper is to show that accuracy and reaction time distribution can be explained by a time-varying bound in a simple diffusion model in which the drift scales linearly with stimulus intensity (coherence in random dot motion). I tried to point out that this result may not be surprising depending on how much flexibility a time-varying bound adds to the model. Additionally, the authors present a connection between diffusion and Bayesian models of perceptual decision making which allows them to reinterpret the time-varying bounds in terms of the subjective cost of waiting for more evidence to arrive. The authors argue that this cost increases towards the end of a trial, but for two reasons I’m not entirely convinced: 1) Conceptually, it is worth considering the origin of a possible waiting cost. It could correspond to the energetic cost of keeping the inference machinery running and the attention on the task, but there is no reason why this should increase towards a deadline. 2) I’m not convinced by the presented results that the inferred increase of cost towards a deadline is qualitatively independent of the reward/punishment setting. A greater range of punishments should have been tested. Note that you cannot infer the rewards for decisions and the time-varying waiting cost at the same time from the behavioural data. So this issue cannot be settled without some new experiments which measure rewards or costs more directly. Finally, I miss an overview of fitted parameter values in the paper. For example, I would be interested in the inferred lapse trial probabilities p1. The authors go to great lengths to estimate the posterior distributions over diffusion model parameters and I wonder why they don’t share the results with us (at least mean and variance for a start).

In conclusion, the authors follow a trend to explain behaviour in terms of Bayesian ideal observer models extended by flexible cost functions and apply this idea to perceptual decision making via a detour through a diffusion model. Although I appreciate the sound work presented in the paper, I’m worried that the time-varying bound/cost is too flexible and acts as a kind of ‘get out of jail free’ card which blocks the view to other, potentially additional mechanisms underlying the observed behaviour.


[1] Bogacz, R.; Brown, E.; Moehlis, J.; Holmes, P. & Cohen, J. D. The physics of optimal decision making: a formal analysis of models of performance in two-alternative forced-choice tasks. Psychol Rev, 2006, 113, 700-765

[2] Dayan, P. & Daw, N. D. Decision theory, reinforcement learning, and the brain. Cogn Affect Behav Neurosci, 2008, 8, 429-453

Probabilistic vs. non-probabilistic approaches to the neurobiology of perceptual decision-making.

Drugowitsch, J. and Pouget, A.
Curr Opin Neurobiol, 22:963–969, 2012


Optimal binary perceptual decision making requires accumulation of evidence in the form of a probability distribution that specifies the probability of the choices being correct given the evidence so far. Reward rates can then be maximized by stopping the accumulation when the confidence about either option reaches a threshold. Behavioral and neuronal evidence suggests that humans and animals follow such a probabilitistic decision strategy, although its neural implementation has yet to be fully characterized. Here we show that that diffusion decision models and attractor network models provide an approximation to the optimal strategy only under certain circumstances. In particular, neither model type is sufficiently flexible to encode the reliability of both the momentary and the accumulated evidence, which is a pre-requisite to accumulate evidence of time-varying reliability. Probabilistic population codes, by contrast, can encode these quantities and, as a consequence, have the potential to implement the optimal strategy accurately.


It’s essentially an advertisement for probabilistic population codes (PPCs) for modelling perceptual decisions. In particular, they contrast PPCs to diffusion models and attractor models without going into details. The main argument against attractor models is that they don’t encode a decision confidence in the attractor state. The main argument against diffusion models is that they are not fit to represent varying evidence reliability, but it’s not fully clear to me what they mean by that. The closest I get is that “[…] the drift is a representation of the reliability of the momentary evidence” and they argue that for varying drift rate the diffusion model becomes suboptimal. Of course, if the diffusion model assumes a constant drift rate, it is suboptimal when the drift rate changes, but I’m not sure whether this is the point they are making. The authors mention one potential weak point of PPCs: They predict that the decision bound is defined on a linear combination of integrated momentary evidence, but the firing of neurons in area LIP indicates that the bound is on the estimated correctness of single decisions, i.e., there is a bound for each decision alternative, as in a race model. I interpret this as evidence for a decision model where the bound is defined on the posterior probability of the decision alternatives.

The paper is a bit sloppily written (frequent, easily avoidable language errors).

Representation of confidence associated with a decision by neurons in the parietal cortex.

Kiani, R. and Shadlen, M. N.
Science, 324:759–764, 2009


The degree of confidence in a decision provides a graded and probabilistic assessment of expected outcome. Although neural mechanisms of perceptual decisions have been studied extensively in primates, little is known about the mechanisms underlying choice certainty. We have shown that the same neurons that represent formation of a decision encode certainty about the decision. Rhesus monkeys made decisions about the direction of moving random dots, spanning a range of difficulties. They were rewarded for correct decisions. On some trials, after viewing the stimulus, the monkeys could opt out of the direction decision for a small but certain reward. Monkeys exercised this option in a manner that revealed their degree of certainty. Neurons in parietal cortex represented formation of the direction decision and the degree of certainty underlying the decision to opt out.


The authors used a 2AFC task with an option to waive the decision in favour of a choice which provides a low, but certain reward (the sure option) to investigate the representation of confidence in LIP neurons. Behaviourally, the sure option had the expected effect: it was chosen more often the harder the decisions were, i.e., the more likely a false response was. Trials in which the sure option was chosen, thus, may be interpreted as those in which the subject had little confidence in the upcoming decision. It is important to note that task difficulty here was manipulated by providing limited amounts of information for a limited amount of time, i.e., this was not a reaction time task.

The firing rates of the recorded LIP neurons indicate that selection of the sure option is associated with an intermediate level of activity compared to that of subsequent choices of the actual decision options. For individual trials the authors found that firing rates closer to the mean firing rate (in a short time period before the sure option became available) more frequently led to selection of the sure option than firing rates further away from the mean, but in absolute terms the activity in this time window could predict choice of the sure option only weakly (probability of 0.4). From these results the authors conclude that the LIP neurons which have previously been found to represent evidence accumulation also encode confidence in a decision. They suggest a simple drift-diffusion model with fixed diffusion parameter to explain the results. In addition to the standard diffusion model, they define confidence in terms of the log-posterior odds which they compute from the state of the drift-diffusion model. They define the posterior as p(S_i|v), the probability that decision option i is correct given that the drift-diffusion state (the decision variable) is v. They compute it from the corresponding likelihood p(v|S_i), but don’t state how they obtained that likelihood. Anyway, the sure option is chosen in the model when the log-posterior odds are below a certain level. I don’t see why the detour via the log-posterior odds is necessary. You could directly define v as the posterior for decision option i and still be consistent with all the findings in the paper. Of course, then v could not be governed by a linear drift anymore, but why should it in the first place? The authors keenly promote the Bayesian brain, but stop just before the finishing line. Why?
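Since the paper does not state how the likelihood p(v|S_i) was obtained, the following is only a guess at the simplest possible version, not the authors' computation. It assumes a single, known motion strength, so that the state v at time t is Gaussian with mean +mu*t under S_1 and -mu*t under S_2 and variance t; the threshold `theta` for choosing the sure option is hypothetical.

```python
import math

def log_gauss(v, mean, var):
    """Log density of a Gaussian with the given mean and variance at v."""
    return -0.5 * math.log(2.0 * math.pi * var) - (v - mean) ** 2 / (2.0 * var)

def log_posterior_odds(v, t, mu, prior_s1=0.5):
    """log[p(S_1|v) / p(S_2|v)] for drifts +mu (S_1) and -mu (S_2)."""
    log_lik_1 = log_gauss(v, +mu * t, t)
    log_lik_2 = log_gauss(v, -mu * t, t)
    return log_lik_1 - log_lik_2 + math.log(prior_s1 / (1.0 - prior_s1))

def choose(v, t, mu, theta):
    """Pick the sure option when confidence (|log odds|) is below theta."""
    odds = log_posterior_odds(v, t, mu)
    if abs(odds) < theta:
        return "sure"
    return "S_1" if odds > 0 else "S_2"

for v in (-0.5, 0.1, 0.5):
    print(v, round(log_posterior_odds(v, 0.5, 2.0), 3), choose(v, 0.5, 2.0, theta=1.0))
```

In this simplified setting the log-posterior odds reduce to 2*mu*v, so a threshold on the odds is nothing more than a threshold on |v|. With several motion strengths mixed across trials the odds would also depend on elapsed time, which this sketch ignores.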

Robust averaging during perceptual judgment.

de Gardelle, V. and Summerfield, C.
Proc Natl Acad Sci U S A, 108:13341–13346, 2011


An optimal agent will base judgments on the strength and reliability of decision-relevant evidence. However, previous investigations of the computational mechanisms of perceptual judgments have focused on integration of the evidence mean (i.e., strength), and overlooked the contribution of evidence variance (i.e., reliability). Here, using a multielement averaging task, we show that human observers process heterogeneous decision-relevant evidence more slowly and less accurately, even when signal strength, signal-to-noise ratio, category uncertainty, and low-level perceptual variability are controlled for. Moreover, observers tend to exclude or downweight extreme samples of perceptual evidence, as a statistician might exclude an outlying data point. These phenomena are captured by a probabilistic optimal model in which observers integrate the log odds of each choice option. Robust averaging may have evolved to mitigate the influence of untrustworthy evidence in perceptual judgments.


The authors investigate what influence the variance of evidence has on perceptual decisions. A bit counterintuitively, they implement varying evidence by simultaneously presenting elements with different feature values (e.g. color) to subjects instead of presenting only one element which changes its feature value over time (would be my naive approach). Perhaps they did this to be able to assume constant evidence over time such that the standard drift diffusion model applies. My intuition is that subjects anyway implement a more sequential sampling of the stimulus display by varying attention to individual elements.

The behavioural results show that subjects take both mean presented evidence as well as the variance of evidence into account when making a decision: For larger mean evidence and smaller variance of evidence subjects are faster and make fewer mistakes. The results are attention dependent: mean and variance in a task-irrelevant feature dimension had no effect on responses.

The behavioural results can be explained by a drift diffusion model with a drift rate which takes the variance of the evidence into account. The authors present two such drift rates. 1) SNR drift = mean / standard deviation (as computed from trial-specific feature values). 2) LPR drift = mean log posterior ratio (also computed from trial-specific feature values). The two cannot be differentiated based on the measured mean RTs and error rates in the different conditions. So the authors provide an additional analysis which estimates the influence of the different presented elements, that is, the influence of the different feature values presented by them, on the given responses. This is done via a generalised linear regression by fitting a model which predicts response probabilities from presented feature values for individual trials. The fitted linear weights suggest that extreme (outlying) feature values have little influence on the final responses compared to the influence that (inlying) feature values close to the categorisation boundary have. Only the LPR model (2) replicates this effect.
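To make the two drift-rate definitions concrete, here is a minimal sketch. It is not the authors' code: I assume feature values centred on a categorisation boundary at 0, and the function `posterior` is a hypothetical stand-in (a logistic curve with an arbitrary slope, squeezed into the range 0.2 to 0.8) for whatever posterior the authors actually estimated.

```python
import math
from statistics import mean, stdev


def snr_drift(features):
    """SNR drift: mean of the trial's feature values divided by their standard deviation."""
    return mean(features) / stdev(features)


def posterior(x, slope=4.0, floor=0.2):
    """Hypothetical p(category A | feature value x); the shape and parameters are assumptions."""
    return floor + (1 - 2 * floor) / (1 + math.exp(-slope * x))


def lpr_drift(features):
    """LPR drift: mean log posterior ratio over the trial's feature values."""
    log_ratios = [math.log(posterior(x) / (1 - posterior(x))) for x in features]
    return mean(log_ratios)


# Two trials with the same mean evidence (0.2) but different variance.
low_variance = [0.15, 0.2, 0.25, 0.2]
high_variance = [-0.4, 0.8, -0.2, 0.6]

for name, trial in [("low variance", low_variance), ("high variance", high_variance)]:
    print(name, "SNR drift:", snr_drift(trial), "LPR drift:", lpr_drift(trial))
```

Under these assumptions both drift rates come out smaller for the high-variance trial, which is in line with the observation that mean RTs and error rates alone cannot tell the two definitions apart.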

Why do inlying feature values have greater influence on responses than outlying ones in the LPR model, but not in the other models? The LPR model alone would not predict this, because for more extreme posterior values you get more extreme LPR values which then have a greater influence on the mean LPR value, i.e., the drift rate. Therefore, it is not entirely clear to me yet why they find a greater importance of inlying feature values in the generalised linear regression from feature values to responses. The best explanation I currently have is the influence of the estimated posterior values: Fig. S5 shows that the posterior values are constant for sufficiently outlying feature values and only change for inlying feature values, where the greatest change is at the feature value defining the categorisation boundary. When mapped through the LPR the posterior values lead to LPR values following the same sigmoidal form setting low and high feature values to constants. These constant high and low values may cancel each other out when, on average, they are equally many. Then, only the inlying feature values may have a lasting contribution on the LPR mean; especially those close to the categorisation boundary, because they tend to lead to larger variation in LPR values which may tip the LPR mean (drift rate) towards one of the two responses. This explanation means that the results depend on the estimated posterior values. In particular, that these are set to values of about 0.2, or 0.8, respectively, for a large range of extreme feature values.
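The saturation argument can be checked numerically. The sketch below again uses a hypothetical posterior: the plateaus at about 0.2 and 0.8 follow my reading of Fig. S5, while the logistic shape, the slope and the boundary at 0 are assumptions for illustration only.

```python
import math


def posterior(x, slope=4.0, floor=0.2):
    """Hypothetical posterior that saturates at `floor` and `1 - floor` for extreme feature values."""
    return floor + (1 - 2 * floor) / (1 + math.exp(-slope * x))


def log_odds(x):
    """Log posterior ratio for a single feature value."""
    p = posterior(x)
    return math.log(p / (1 - p))


def sensitivity(x, eps=1e-4):
    """Numerical slope of the log odds at x: how much a small change in x moves the LPR."""
    return (log_odds(x + eps) - log_odds(x - eps)) / (2 * eps)


for x in [0.0, 0.5, 1.5]:
    print(f"feature {x}: log odds {log_odds(x):.3f}, sensitivity {sensitivity(x):.3f}")

# Equally extreme values on opposite sides of the boundary cancel in the mean LPR.
print("cancellation:", log_odds(2.0) + log_odds(-2.0))
```

With a saturating posterior the log odds are bounded by log(0.8 / 0.2), so an outlying feature value cannot pull the mean LPR further than a moderately large one, and the sensitivity is highest at the categorisation boundary.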

I am unsure what conclusions can be drawn from the results. Although the basic behavioural results are clear, it is not surprising that the responses of subjects depend on the variance of the presented evidence. You can define the feature values varying around the mean as noise. More variance then just means more noise and it is a basic result that people become slower and more error prone when presented with more noise. Perhaps surprisingly, it is here shown that this also works when noisy features are presented simultaneously on the screen instead of sequentially over time.

The DDM analysis shows that the drift rate of subjects decreases with increasing variance of evidence. This makes sense and means that subjects become more cautious in their judgements when confronted with larger variance (more noise). But I find the LPR model rather strange. It’s like pressing a Bayesian model into a mechanistic corset. The posterior ratio is an ad-hoc construct. Ok, it’s equivalent to the log-likelihood ratio, but why turn it into a posterior ratio then? The vagueness arises already because of how the task is defined: all information is presented at once, but you want to describe accumulation of evidence over time. Consequently, you have to define some approximate, ad-hoc construct (mean LPR) which you can use to define the temporal integration. That the model based on that construct replicates an aspect of the behavioural data may be an artefact of the particular approximation used (apparently it is important that the estimated posterior values are constant for extreme feature values). So, it remains unclear to me whether an LPR-DDM is a good explanation for the involved processes in this case.

Actually, a large part of the paper (cf. title) concerns the finding that extreme feature values appear to have smaller influence on subject responses than feature values close to the categorisation boundary. This is surprising to me. Although it makes intuitive sense in terms of ‘robust averaging’, I wouldn’t predict it for optimal probabilistic integration of evidence, at least not without making further assumptions. Such assumptions are also implicit in the LPR-DDM and I’m a bit skeptical about it anyway. Thus, a good explanation is still needed, in my opinion. Finally, I wonder how reliable the generalised linear regression analysis, which led to these results, is. On the one hand, the authors report using two different generalised linear models and obtaining equivalent results. On the other hand, they estimate 9 parameters from only one binary response variable and I wonder how the optimisation landscape looks in this case.

Why I blog

Today, for a change, an even more informal post (originally written in German). It is a reply to the blog parade of the SOOC. They want to know: “Why do you actually blog?” The question is addressed to “professors, lecturers, university staff…”. I will count myself among them for now.

I blog because at some point I decided that the thoughts I have about my own work and that of others should not gather dust in a virtual drawer on my computer. What does that mean and how did it come about?

During my PhD I started to systematically take notes on articles I had read. This became necessary because I quickly forget what I have read (a natural process, as I have been assured). Some of these notes grew into fairly detailed reviews which did cost me some time. Time that helped me to understand the articles better, but gained me nothing else.

Then I came across various people thinking about how the evaluation of scientific articles can be improved (see about this blog). One proposal that particularly appeals to me is “post-publication peer review”, i.e., the evaluation of articles after they have been published instead of before, as is currently usual.

That ultimately gave me the decisive reason to publish my reviews. I had thought about it before: if I was putting in so much time anyway, then at least a greater benefit should come out of it. The idea of post-publication peer review then strengthened my conviction that at least a few people besides me will find my notes useful.

Since then my behaviour has changed somewhat. I no longer have as much time to write detailed notes, so the post rate has dropped. Still, every now and then I get the urge to analyse an article in more depth, or an important comment on it occurs to me. When I then write a review, I am aware that it will be published on the blog and I partly adapt my text (e.g. I try to write more concisely and to structure more clearly). There are now also posts which have nothing to do with a particular article, but concern aspects of my (research) interests in the broadest sense.

And, how is it going? I cannot claim great fame. However, there are always people who stray onto my blog, mainly thanks to Google. Self-promotion in the social networks has a strong, but quickly fading effect on page views. Overall I am quite content: the effort is manageable, I enjoy what I do and now and then someone drops by to look at what I have produced. (Whether they then like it is another, unanswered question.)

A healthy fear of the unknown: perspectives on the interpretation of parameter fits from computational models in neuroscience.

Nassar, M. R. and Gold, J. I.
PLoS Comput Biol, 9:e1003015, 2013
DOI, Google Scholar


Fitting models to behavior is commonly used to infer the latent computational factors responsible for generating behavior. However, the complexity of many behaviors can handicap the interpretation of such models. Here we provide perspectives on problems that can arise when interpreting parameter fits from models that provide incomplete descriptions of behavior. We illustrate these problems by fitting commonly used and neurophysiologically motivated reinforcement-learning models to simulated behavioral data sets from learning tasks. These model fits can pass a host of standard goodness-of-fit tests and other model-selection diagnostics even when the models do not provide a complete description of the behavioral data. We show that such incomplete models can be misleading by yielding biased estimates of the parameters explicitly included in the models. This problem is particularly pernicious when the neglected factors are unknown and therefore not easily identified by model comparisons and similar methods. An obvious conclusion is that a parsimonious description of behavioral data does not necessarily imply an accurate description of the underlying computations. Moreover, general goodness-of-fit measures are not a strong basis to support claims that a particular model can provide a generalized understanding of the computations that govern behavior. To help overcome these challenges, we advocate the design of tasks that provide direct reports of the computational variables of interest. Such direct reports complement model-fitting approaches by providing a more complete, albeit possibly more task-specific, representation of the factors that drive behavior. Computational models then provide a means to connect such task-specific results to a more general algorithmic understanding of the brain.


Nassar and Gold use tasks from their recent experiments (e.g. Nassar et al., 2012) to point to the difficulties of interpreting model fits of behavioural data. The background is that it has become more popular to explain experimental findings (often behaviour) using computational models. But how reliable are those computational interpretations and how to ensure that they are valid? I will briefly review what Nassar and Gold did and point out that researchers investigating reward learning using computational models should think about learning rate adaptation in their experiments, because, in the light of the present paper, their results may otherwise not be interpretable. Further, I will argue that Nassar and Gold’s appeal to more interaction between modelling and task design is just how science should work in principle.


The considered tasks belong to the popular class of reward learning tasks in which a subject has to learn which choices are rewarded to maximise reward. These tasks may be modelled by a simple delta-rule mechanism which updates current (learnt) estimates of reward by an amount proportional to a prediction error where the exact amount of update is determined by a learning rate. This learning rate is one of the parameters that you want to fit to data. The second parameter Nassar and Gold consider is the ‘inverse temperature’ which tells how a subject trades off exploitation (choose to get reward) against exploration (choose randomly).
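The two fitted parameters can be illustrated with a minimal sketch of a delta-rule learner. The function names are hypothetical, and the softmax choice rule is a common way of implementing an inverse temperature that I assume here for illustration; it is not necessarily the exact form used by Nassar and Gold.

```python
import math


def delta_update(value, reward, learning_rate):
    """Move the current reward estimate towards the observed reward.

    The size of the step is the prediction error scaled by the learning rate.
    """
    prediction_error = reward - value
    return value + learning_rate * prediction_error


def softmax_choice_probs(values, inverse_temperature):
    """Choice probabilities from learnt values (assumed softmax rule).

    A large inverse temperature favours exploitation of the best option,
    an inverse temperature of 0 gives purely random exploration.
    """
    best = max(values)  # subtract the maximum for numerical stability
    weights = [math.exp(inverse_temperature * (v - best)) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


# One learning step followed by a choice between two options.
values = [0.0, 0.0]
values[0] = delta_update(values[0], reward=1.0, learning_rate=0.5)
print(values)
print(softmax_choice_probs(values, inverse_temperature=5.0))
```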

Nassar and Gold’s tasks are special, because at so-called change points during an experiment the underlying rewards may abruptly change (in addition to smaller variation of reward between single trials). The experimental subject then has to learn the new reward values. Importantly, Nassar and Gold have found that subjects use an adaptive learning rate, i.e., when subjects encounter small prediction errors they tend to reduce the learning rate while they tend to increase learning rate when experiencing large prediction errors. However, typical delta-rule learning models assume a fixed learning rate.

The issue

The issue discussed in the paper is that it will not be easily possible to detect a problem when fitting a fixed learning rate model to choices which were produced with an adaptive learning rate. As shown in the present paper, this issue results from a redundancy between learning rate adaptiveness (a hyperparameter, or hidden factor) and the inverse temperature with respect to subject choices, i.e., a change in learning rate adaptiveness can equivalently be explained by a change in inverse temperature (with fixed learning rate adaptiveness) when such a change is only measured by the choices a subject makes. Statistically, this means that, if you were to fit learning rate adaptiveness with inverse temperature to subject choices, then you should find that the two parameters are highly correlated given the data. Even better, if you were to look at the posterior distribution of the two parameters given subject choices, you should observe a large variance of them together with a strong covariance between them. As a statistician you would then report this variance and acknowledge that interpretation may be difficult. But learning rate adaptiveness is not typically fitted to choices. Instead only learning rate itself is fitted given a particular adaptiveness. Then, the relation between adaptiveness and inverse temperature is hidden from the analysis and investigators may be fooled into thinking that the combination of fitted learning rate and inverse temperature comprehensively explains the data. Well, it does explain the data, but there are potentially many other explanations of this kind which become apparent when the hidden factor learning rate adaptiveness is taken into account.

What does it mean?

The discussed issue exemplifies a general problem of cognitive psychology: that you try to investigate (computational) mechanisms, e.g., decision making, by looking at quite impoverished data, e.g., decisions, which only represent the final product of the mechanisms. So what you do is to guess a mechanism (a model) and see whether it fits the data. In the case of Nassar and Gold there was a prevailing guess which fit the data reasonably well. By investigating decision making in a particular, new situation (environment with change points) they found that they needed to extend that mechanism to account for the new data. However, the extended mechanism now has many explanations for the old impoverished data, because the extended mechanism is more flexible than the old mechanism. To me, this is all just part of the normal progress in science and nothing to be alarmed about in principle. Yet, Nassar and Gold are right to point out that in the light of the extended mechanism fits of the old mechanism to old data may be misleading. Interpreting the parameters of the old mechanism may then be similar to saying that you find that the earth is a disk, because from your window it looks like the ground goes to the horizon in a straight line and then stops.


Essentially, Nassar and Gold try to convince us that when looking at reward learning we should now also take learning rate adaptiveness into account, i.e., that we should interpret subject choices within their extended mechanism. Two questions remain: 1) Do we trust that their extended mechanism is worth pursuing? 2) If yes, what can we do with the old data?

The present paper does not provide evidence that their extended mechanism is a useful model for subject choices (1), because they here assumed that the extended mechanism is true and investigated how you would interpret the new data using the old mechanism. However, their original study and others point to the importance of learning rate adaptiveness [see their refs. 9-11,26-28].

If the extended mechanism is correct, then the present paper shows that the old data is pretty much useless (2) unless learning rate adaptiveness has been, perhaps accidentally, controlled for in previous studies. This is because the old data from previous experiments (probably) does not allow learning rate adaptiveness to be estimated. Of course, if you can safely assume that the learning rate of subjects stayed roughly fixed in your experiment, for example, because prediction errors were very similar during the whole experiment, then the old mechanism with fixed learning rate should still apply and your data is interpretable in the light of the extended mechanism. Perhaps it would be useful to investigate how robust fitted parameters are to varying learning rate adaptiveness in a typical experiment producing old data (here we only see results for experiments designed to induce changes in learning rate through large jumps in mean reward values).

Overall the paper has a very general tone. It tries to discuss the difficulties of fitting computational models to behaviour in general. In my opinion, these things should be clear to anyone in science as they just reflect how science progresses: you make models which need to fit an observed phenomenon and you need to refine models when new observations are made. You progress by seeking new observations. There is nothing special about fitting computational models to behaviour with respect to this.

A supramodal accumulation-to-bound signal that determines perceptual decisions in humans.

O’Connell, R. G., Dockree, P. M., and Kelly, S. P.
Nat Neurosci, 15:1729–1735, 2012
DOI, Google Scholar


In theoretical accounts of perceptual decision-making, a decision variable integrates noisy sensory evidence and determines action through a boundary-crossing criterion. Signals bearing these very properties have been characterized in single neurons in monkeys, but have yet to be directly identified in humans. Using a gradual target detection task, we isolated a freely evolving decision variable signal in human subjects that exhibited every aspect of the dynamics observed in its single-neuron counterparts. This signal could be continuously tracked in parallel with fully dissociable sensory encoding and motor preparation signals, and could be systematically perturbed mid-flight during decision formation. Furthermore, we found that the signal was completely domain general: it exhibited the same decision-predictive dynamics regardless of sensory modality and stimulus features and tracked cumulative evidence even in the absence of overt action. These findings provide a uniquely clear view on the neural determinants of simple perceptual decisions in humans.


The authors report EEG signals which may represent 1) instantaneous evidence and 2) accumulated evidence (decision variable) during perceptual decision making. The result promises a big leap for experiments in perceptual decision making with humans, because it is the first time that we can directly observe the decision process as it accumulates evidence with reasonable temporal resolution without sticking needles in participants’ brains. Furthermore, one of the found signals appears to be sensory and response modality independent, i.e., it appears to reflect the decision process alone – something that has not been clearly found in species other than humans, but let’s discuss the study in more detail.

The current belief about the perceptual decision making process is formalised in accumulation to bound models: When presented with a stimulus, the decision maker determines at each time point of the presentation the current amount of evidence for all possible alternatives. This estimate of “instantaneous evidence” is noisy, because of either the noise within the stimulus itself, or because of internal processing noise. Therefore, the decision maker does not immediately make a decision between alternatives, but accumulates evidence over time until the accumulated evidence for one of the alternatives reaches a threshold which is internally set by the decision maker itself and indicates a certain level of certainty, or response urgency. The alternative, for which the threshold was crossed, is the decision outcome and the time the threshold was crossed is the decision time (potentially including an additional delay). The authors argue that they have found signals in the EEG of humans which can be associated with the instantaneous and accumulated evidence variables of these kinds of models.
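A minimal simulation of such an accumulation-to-bound process for two alternatives might look as follows. This is a generic sketch with hypothetical parameter names (`drift`, `noise_sd`, `threshold`, `non_decision_time`), not the model of the paper: instantaneous evidence is assumed to be Gaussian and a single accumulator with symmetric bounds stands in for the two alternatives.

```python
import random


def accumulate_to_bound(drift, noise_sd, threshold, dt=0.001,
                        max_time=5.0, non_decision_time=0.0, rng=None):
    """Simulate one decision; returns (choice, reaction_time) or (None, None) on timeout."""
    if rng is None:
        rng = random.Random()
    accumulated_evidence = 0.0
    t = 0.0
    while t < max_time:
        # Noisy instantaneous evidence for this time step.
        instantaneous_evidence = drift * dt + noise_sd * dt ** 0.5 * rng.gauss(0.0, 1.0)
        accumulated_evidence += instantaneous_evidence
        t += dt
        # The decision is made as soon as one of the two bounds is crossed.
        if accumulated_evidence >= threshold:
            return "A", t + non_decision_time
        if accumulated_evidence <= -threshold:
            return "B", t + non_decision_time
    return None, None


rng = random.Random(0)
print(accumulate_to_bound(drift=1.0, noise_sd=1.0, threshold=1.0, rng=rng))
```

The `non_decision_time` argument corresponds to the additional delay between threshold crossing and the measured response mentioned above.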

The paradigm used in this study was different from the perceptual decision making paradigm popular in monkeys (random dot stimuli). Here the authors used stimuli which did not move, but rather gradually changed their intensity or contrast: In the experiments with visual stimuli, participants were continuously viewing a flickering disk which from time to time gradually changed its contrast with the background (the contrast gradually went back to base level after 1.6s). So the participants had to decide whether they observe a contrast different from baseline at the current time. Note that this setup is slightly different from usual trial-based perceptual decision making experiments where a formally new trial begins after a participant’s response. The disk also had a pattern, but it’s unclear why the pattern was necessary. On the other hand, using the other stimulus properties seems reasonable: The flickering induced something like continuous evoked potentials in the EEG ensuring that something stimulus-related could be measured at all times, but the gradual change of contrast “successfully eliminated sensory-evoked deflections from the ERP trace” such that the more subtle accumulated evidence signals were not masked by large deflections solely due to stimulus onsets. In the experiments with sounds, equivalent stimulus changes were implemented by either gradually changing the volume of a presented, envelope-modulated tone or its frequency.

The authors report 4 EEG signals related to perceptual decision making. They argue that the occipital steady-state visual-evoked potential (SSVEP) indicated the estimated instantaneous evidence when visual stimuli were used, because its trajectories directly reflected the changes in contrast. For auditory stimuli, the authors found a corresponding steady-state auditory-evoked potential (SSAEP) which was located at more central EEG electrodes and at 40Hz instead of 20Hz (SSVEP). Further, the authors argue that a left-hemisphere beta (LHB, 22-30Hz) and a centro-parietal potential (CPP, direct electrode measurements) could be interpreted as evidence accumulation signals, because the time of their peaks tightly predicted reaction times and their time courses were better predicted by the cumulative SSVEP instead of the original SSVEP. LHB and CPP also (roughly) showed the expected dependency on whether the participant correctly identified the target, or missed it (lower signals for misses). Furthermore, they reacted expectedly, when contrast varied in more complex ways than just a linear decrease (decrease followed by short increase followed by decrease). CPP was different from LHB by also showing the expected changes when the task did not require an overt response at target detection time whereas LHB showed no relation to the present evidence in this task indicating that it may have something to do with motor preparation of the response while CPP is a more abstract decision signal. Additionally, the CPP showed the characteristic changes measured with visual stimuli also with auditory stimuli and it depended on attentional focus: In one experimental condition the task of the participants was altered (‘detect a transient size change of a central fixation square’), but the original disk stimulus was still presented including the gradual contrast changes. In this ‘non-attend’ condition the SSVEP decreased with contrast as before, but the CPP showed no response reinforcing the idea that the CPP is an abstract decision signal. On a final note, the authors speculate that the CPP could be equal to the standard P300 signal, when transient stimuli need to be detected instead of gradual stimulus changes. This connection, if true, would be a nice functional explanation of the P300.

Open Questions

Despite the generally intriguing results presented in the paper a few questions remain. These predominantly regard details.

1) omission of data

In Figs. 2 and 3 the SSVEP is not shown anymore, presumably because of space restrictions. Similarly, the LHB is not presented in Fig. 4. I can believe that the SSVEP behaved expectedly in the different conditions of Figs. 2 and 3 such that not much information would have been added by providing the plots, but it would at least be interesting to know whether the accumulated SSVEP still predicted the LHB and CPP better than the original SSVEP in these conditions. Likewise, the authors do not report the equivalent analysis for the SSAEP in the auditory conditions. Regarding the omission of the LHB in Fig. 4, I’m not so certain about the behaviour of the LHB in the auditory conditions. It seems possible that the LHB shows different behaviour with different modalities. There is no mention of this in the text, though.

2) Is there a common threshold level?

The authors argue that the LHB and CPP reached a common threshold level just before response initiation (a prediction of accumulation to bound models, Fig. 1c), but the used test does not entirely convince me: They compared the variance just before response initiation with the variance of measurements across different time points (they randomly assigned the RT of one trial to another trial and computed variance of measurements at the shuffled time points). For a strongly varying function of time, it is no surprise that the measurements at a consistent time point vary less than the measurements made across many different time points as long as the measurement noise is small enough. Based on this argument, it is strange that they did not find a significant difference for the SSVEP which also varies strongly across time (this fits into their interpretation, though), but this lack of difference could be explained by larger measurement noise associated with the SSVEP.
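My worry about the shuffle test can be illustrated with a toy simulation. Everything in it is an assumption made for illustration: the signal is a noise-free triangular ramp that peaks at the response time, RTs are drawn uniformly, and, importantly, the peak level varies from trial to trial so that there is no common threshold by construction.

```python
import random
from statistics import pvariance


def signal(t, rt, peak):
    """Toy signal: rises linearly to `peak` at time `rt`, then falls with the same slope."""
    return peak * max(0.0, 1.0 - abs(t - rt) / rt)


def shuffle_test(n_trials=500, peak_sd=0.2, seed=0):
    """Return (variance at true RTs, variance at RTs shuffled across trials)."""
    rng = random.Random(seed)
    rts = [rng.uniform(0.4, 1.6) for _ in range(n_trials)]
    # Trial-specific peak levels, i.e. no common threshold.
    peaks = [rng.gauss(1.0, peak_sd) for _ in range(n_trials)]

    aligned = [signal(rt, rt, peak) for rt, peak in zip(rts, peaks)]

    shuffled_rts = rts[:]
    rng.shuffle(shuffled_rts)
    shuffled = [signal(t, rt, peak) for t, rt, peak in zip(shuffled_rts, rts, peaks)]

    return pvariance(aligned), pvariance(shuffled)


aligned_var, shuffled_var = shuffle_test()
print("variance at true RTs:    ", aligned_var)
print("variance at shuffled RTs:", shuffled_var)
```

In this toy setting the variance at the true response times is smaller than at the shuffled times although the peak level differs across trials, which is why a lower variance alone does not convince me that a common threshold exists.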

Furthermore, the authors report themselves that they found a significant difference between the size of CPP peaks around decision time for varying contrast levels (Fig. 2c). Especially, the CPP peak for false alarms (no contrast change, but participant response) was lower than the other peaks. If the CPP really is the decision variable predicted by the models, then these differences should not have occurred. So where do they come from? The authors provide arguments that I cannot follow without further explanations.

3) timing of peaks

It appears that the mean reaction time precedes the peaks of the mean signals slightly. The effect is particularly clear in Fig. 3b (CPP), Fig. 4d (CPP) and Fig. 5a, but is also slightly visible in the averages centred at the time of response in Figs. 1c and 2c. Presuming a delay from internal decision time to actual response, the time of the peak of the decision variable should precede the reaction time, especially when reaction time is measured from button presses (here) compared to saccade initiation (typical monkey experiments). So why does it here appear to be the other way round?

4) variance of SSVEP baseline

The SSVEP in Fig. 4a is in a different range (1.0-1.3) than the SSVEP in Fig. 4d (1.7-2.5) even though the two plots should each contain a time course for the same experimental condition. Where does the difference come from?

5) multiple alternatives

The CPP, as described by the authors, is a single, global signal of a decision variable. If the decision problem is composed of only two decision alternatives, a single decision variable is indeed sufficient for decision making, but if more alternatives are considered, several evidence accumulating variables are needed. What would the CPP then signal? One of the decision variables? The total amount of certainty of the upcoming decision?


I do like the results in the paper. If they hold up, the CPP may provide a high temporal resolution window into the decision processes of humans. As a result, it may allow us to investigate decision processes for more complex situations than those which animals can master, but maybe it’s only a signal for the simple, perceptual decisions investigated here. Based on the above open questions I also guess that the reported signals were noisier than the plots make us believe and the correspondence of the CPP with theoretical decision variables should be further examined.

People are statisticians, not logicians

I was just reminded in a talk that people (including me) often fail to apply the modus tollens, i.e., they fail to infer that an antecedent is false given that the corresponding consequent is false. Here is an example:

If there is a circle, there is also a triangle. There is no triangle. Can you say anything about whether there is a circle?

According to the rules of propositional logic (modus tollens) you can infer the absence of the circle from the absence of the triangle. A few people, perhaps those who are not extensively trained in logic, tend to miss that. This effect has actually been known for a long time (see Wason selection task for a different variant).
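That modus tollens is valid in propositional logic can be verified by brute force over all truth assignments. The following small check uses the circle and triangle example; the function names are my own.

```python
from itertools import product


def implies(antecedent, consequent):
    """Material implication: false only if the antecedent is true and the consequent false."""
    return (not antecedent) or consequent


def modus_tollens_is_valid():
    """Whenever 'circle -> triangle' and 'no triangle' both hold, there must be no circle."""
    for circle, triangle in product([True, False], repeat=2):
        premises_hold = implies(circle, triangle) and not triangle
        if premises_hold and circle:
            return False  # counterexample found
    return True


print(modus_tollens_is_valid())
```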

The example above made me think about what the effect means for how people reason about their environment. These experiments show that some people readily associate two things, but are wary of drawing conclusions from the absence of one of them. Those people, therefore, follow good statistical practice: absence of evidence is not evidence of absence. The point becomes clear with the help of Wikipedia’s example:

A baker never failed to put finished pies on her windowsill, so if there is no pie on the windowsill, then no finished pies exist?

We have learnt repeatedly in our lives that we cannot draw such a conclusion with certainty, because there may always be events which interfere with the process of putting pies in windows. For example, the baker may have had to leave the bakery due to an emergency after having finished the pies, but before putting them in the windowsill. It, hence, seems, in my opinion, that we are unconsciously aware that most associations we make are correlational and not causal. So we only apply the modus tollens, when we are sufficiently certain that the association we learnt is causal:

This glass is so fragile, if someone drops it, it will break. Later you see that the glass is not broken. Can you say something about whether someone dropped it?

Application of modus tollens is simple here (at least it is intuitive for me), because of our extensive experience with glasses and our acquired understanding that there is a causal relationship between dropping and a broken glass. I will argue now that the difference between application of modus tollens in the glass example and non-application of modus tollens in the bakery example is due to an acquired preference of people to accept a relation as causal, when the effect immediately follows its cause.

In my view, one of the biggest achievements of mankind is finding reliable relations between an increasing number of events or entities in the world. We use these relations to predict outcomes which are the basis for achieving our aims by executing appropriate actions. We usually call finding these relationships ‘learning’ and we typically learn by observing events which happen before us.

The most reliable relations we can learn about are causal relations. It is actually not easy to define causality formally, but intuitively one could say: an event A causes an event B if and only if B always follows A. The only difference from the if-then relationship (implication) of propositional logic is then the additional temporal aspect. However, it is tremendously hard to figure out whether a relation in the real world is truly causal, or whether it could be broken by an interfering event C, in which case B does not always follow A. It is practically impossible to verify that B really follows A in all (natural) conditions, so we are left with some uncertainty about whether A causes B. In particular, we know from experience that when a long time passes between events A and B, their relation tends to be brittle in the sense that many events could interfere with it (bakery example). On the other hand, if B immediately follows A, there is very little time for other events to interfere and we can be more certain about the relation between A and B (glass example). I believe, like many other neuroscientists [1-4], that we (our brains) unconsciously represent the uncertainties over learnt relations, at least approximately.
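To make the role of this uncertainty concrete, here is a minimal sketch of how the conclusion of modus tollens weakens when the relation "A leads to B" is only probabilistic. It is an illustration under stated assumptions, not a model taken from the cited literature: the function name, the prior of 0.5 for A, and the assumption that B never occurs without A are all hypothetical choices made for this example.

```python
def posterior_a_given_not_b(p_b_given_a, p_a=0.5, p_b_given_not_a=0.0):
    """Return P(A | not B) by Bayes' rule.

    p_b_given_a     : how reliably B follows A (1.0 = a certain, 'logical' relation)
    p_a             : prior probability of A (assumed value, purely illustrative)
    p_b_given_not_a : probability of B without A (assumed to be 0 by default)
    """
    # Joint probabilities of the two ways in which "not B" can come about.
    not_b_and_a = (1.0 - p_b_given_a) * p_a
    not_b_and_not_a = (1.0 - p_b_given_not_a) * (1.0 - p_a)
    evidence = not_b_and_a + not_b_and_not_a
    if evidence == 0.0:
        raise ValueError("'not B' has zero probability under these assumptions")
    return not_b_and_a / evidence


if __name__ == "__main__":
    # Sweep from a brittle relation (bakery-like) to a certain one (glass-like).
    for reliability in (0.5, 0.7, 0.9, 0.99, 1.0):
        p = posterior_a_given_not_b(reliability)
        print(f"P(B|A) = {reliability:.2f}  ->  P(A | not B) = {p:.3f}")
```

With a fully reliable relation (`p_b_given_a = 1.0`) the posterior probability of A given "not B" is exactly zero, which is classical modus tollens. As soon as the relation can be broken by interfering events, observing "not B" still leaves some probability that A happened, which fits the reluctance to draw a firm conclusion in the bakery example.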

Therefore, when we are reluctant to apply modus tollens, it is because we do not associate sufficient certainty with the suggested relation. This means that we are often quite sceptical (we believe that the relation is uncertain) when we are told:

If there is a circle, there is also a triangle.

In the sceptical interpretation of that sentence we apparently think of a single example rather than of a material relation. Could we, then, make it stronger by making it more explicit that this relation is supposed to be universally true? Judge for yourself:

All circles occur together with triangles. There is no triangle. Can you say anything about whether there is a circle?

I argued that when people do not apply modus tollens, it is because they unconsciously understand the presented relation as a statistical relation which is uncertain. From my point of view, this makes a lot of sense, because the brain appears to routinely represent and process uncertain concepts [1-4]. I would also argue that people who do not apply modus tollens do not have a deficit in logical inference, because most of them will apply modus tollens when confronted with an appropriately phrased explanation and question. They just do not translate the language used into a representation of a relation that holds with certainty. Regular students of maths and logic, on the other hand, have learnt to interpret such language as expressing certainty.

My argument here is based on an intuitive understanding of the example sentences, driven by my own intuitions. Perhaps you can divide people into ‘statistically minded’ and ‘logically minded’ with respect to the effect in question, but I’m tempted to believe that the statistical mindset is the more common, natural one, because the brain has to cope with uncertainty on several levels anyway, and that the logical mindset is acquired on top of the statistical one. You may well have different intuitions about the above examples, and I really wonder whether the rate at which people apply modus tollens to these examples matches my own intuitions. Somebody should do an experiment …

PS: It might be that I have just reformulated the ideas of Oaksford & Chater (see ch. 5 of [3] for a formalisation, experiments and discussion), but for now I have already spent too much time here to check this thoroughly.

[1] Knill, D. C. & Richards, W. (ed.) Perception as Bayesian Inference. Cambridge University Press, 1996, Google Books
[2] Doya, K.; Ishii, S.; Pouget, A. & Rao, R. P. N. (ed.) Bayesian Brain. MIT Press, 2006, Google Books
[3] Chater, N. & Oaksford, M. (ed.) The Probabilistic Mind: Prospects for Bayesian Cognitive Science. Oxford University Press, 2008, Google Books
[4] Trommershäuser, J.; Körding, K. & Landy, M. S. (ed.) Sensory Cue Integration. Oxford University Press, 2011, Google Books