What it took to get an ERC starting grant in 2016: Your own research group

The starting grants of the European research council

[…] are designed to encourage young talented research leaders to gain independence in Europe and to build their own careers. The scheme targets promising researchers who have the proven potential of becoming independent research leaders.

(from the official mission statement) Following this statement, I apparently misinterpreted the goals of the starting grant: I believed that it should enable young researchers to become independent, but seeing the recent grantees it rather enables already independent researchers to become research leaders. I came to this conclusion by looking up the job positions of the recent 2016 grantees where I found that only about one quarter of grantees in the German sub-sample are post-docs while the vast majority of grantees already lead independent research groups, or are professors. In the following I detail my findings and distil a bit of career advice.

Recipients of ERC starting grants in Germany in 2016

The ERC provides a list of all recipients of starting grants in 2016. In total 325 proposals were accepted (11% success rate). In terms of individual countries, most grantees were based in Germany (61) followed by the UK (59) and France (46). I decided to look only at the German sub-sample, because that interested me most and I know the scientific landscape in Germany well. The German grantees have diverse scientific backgrounds, but the distribution leans towards the physical sciences and engineering (see the ERC starting grants statistics).

One statistic that the ERC does not provide is the career stage of the grantees, i.e., their job positions. To fill this void I went through the list of German grantees and looked them up on the internet. For all but 2 of the 61 I found more or less up-to-date career information. For 2 further grantees I wasn’t completely sure about their current position, but included them in the analysis anyway. I classified all positions into three categories: post-docs, (independent research) group leaders and professors.

Types of job positions

Post-docs are researchers who work in the group of someone else, that is, they cannot, or rather have not, independently hired an employee. I classify them as experienced, but dependent researchers.

Independent research group leaders have their own research group, but are not yet tenured. In Germany there is a famous funding programme for this kind of position: the Emmy Noether programme. Another position on the same level is the so-called Juniorprofessorship, which may correspond to an assistant professorship in Britain and the US, but is much less common in Germany. Additionally there exist group leaders in pure research institutes such as the Max Planck institutes which are not associated with universities.

Finally, there is the tenured professorship as the holy grail of the science career. I put all professors whose title was not ‘Juniorprofessor’ into this category.

Distribution of job positions among German grantees

I found 16 post-docs, 29 group leaders and 14 professors among the 59 recipients of starting grants for whom I had career information. Therefore, about one quarter of grantees will become independent researchers through the starting grant, while about three quarters already led independent research groups at the time they applied, and about one quarter already held tenured positions.
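The fractions follow directly from the counts. A minimal sketch of the arithmetic, using only the three counts reported above (the variable names are mine):

```python
# Counts reported in the post for the German sub-sample of 2016 grantees.
counts = {"post-doc": 16, "group leader": 29, "professor": 14}

total = sum(counts.values())  # grantees with usable career information
shares = {position: n / total for position, n in counts.items()}

# Grantees who already led their own group: group leaders and professors.
already_independent = shares["group leader"] + shares["professor"]

for position, share in shares.items():
    print(f"{position:>12}: {counts[position]:2d} ({share:.0%})")
print(f"already independent: {already_independent:.0%}")
```

Note that the "three quarters" include the professors, so the three statements in the text do not add up to more than 100%.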

I have published my list of grantees together with my analysis on Github in case somebody wants to look that up. Also, feel free to add information or to extend the analysis!

So what does the starting grant enable?

The ERC starting grant is very attractive, because it endows the recipient with up to 1.5 million Euro over a period of up to 5 years while proposals are evaluated only based on scientific merit and risky, interdisciplinary research is encouraged – scientist’s heaven. Everyone who has obtained a doctorate within the previous 7 years can apply, and 2935 people did so in 2015 to get the grant in 2016.

These conditions are equally attractive to post-docs and already independent researchers. The difference is that post-docs hope to become independent with it while group leaders hope to use it as leverage for professorship applications. While I see how the starting grant can help group leaders in “becoming independent research leaders” in the sense that they are not truly independent until they are tenured, I wonder how tenured professors fit in this scheme? What part of their scientific career are they expected to start that they haven’t started yet?

The low number of post-docs among the recipients of starting grants should be no surprise in a competition in which scientific merit and proven potential are the sole criteria and researchers at all career levels can enter. After all, group leaders and professors are in their positions, because they have been selected based on similar criteria before. The starting grant, therefore, does not generally enable researchers to make first independent steps, but to continue on a previously successful path of independent research, perhaps with a slight change of direction.

Career advice

The advice is clear: Don’t wait for the ERC starting grant to become an independent researcher. The name is misleading. It is unfortunate that the starting grant, to my knowledge, is the only grant that enables you to found an independent research group more than 5 years after your PhD. This means that you should try to become independent within the first 5 years after your PhD.

In Germany, possible paths to an independent research group are the Emmy Noether programme and the Freigeist fellowships of the Volkswagen Stiftung where anyone (who does ‘excellent’ research) can apply with a proposal. Alternatively, you can look out for Juniorprofessorships at universities and group leader positions in dedicated research institutes, but my impression is that they are sometimes not openly advertised and given to internal candidates.

For all those who are past the 5 years after their PhD there is not much left but to apply for professorships or for an ERC starting grant. At least there are some post-docs who got a starting grant. I don’t know how many post-docs applied to get a starting grant, but I assume that the fraction of post-docs among applicants was higher than the fraction of post-docs among grantees. Given this assumption, the success rate for post-docs should be below the overall success rate of 11%.
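To make the assumption explicit: if post-docs make up a share p of the grantees and a share q of the applicants, their success rate is the overall rate times p / q. The sketch below plugs in the numbers from this post. The applicant shares are purely hypothetical, because the real value is unknown, and using the German grantee share together with the overall success rate assumes that Germany is representative.

```python
def postdoc_success_rate(overall_rate, grantee_share, applicant_share):
    """Success rate of post-docs implied by their shares among grantees and applicants.

    overall_rate * grantee_share is the number of post-doc grantees per applicant;
    applicant_share is the number of post-doc applicants per applicant.
    """
    return overall_rate * grantee_share / applicant_share


overall_rate = 325 / 2935   # accepted proposals / submitted proposals
grantee_share = 16 / 59     # post-docs among the German grantees I could classify

# Hypothetical shares of post-docs among applicants (assumption, not data).
for applicant_share in (0.3, 0.5, 0.8):
    rate = postdoc_success_rate(overall_rate, grantee_share, applicant_share)
    print(f"assumed applicant share {applicant_share:.0%}: success rate {rate:.1%}")
```

Whenever the assumed applicant share exceeds the grantee share, the resulting rate is below the overall rate, which is all the argument above needs.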

In any case, do what you like as long as you can and then find something else to do that you like! 😉

[Update 1]

I originally estimated the success rate for post-docs by dividing the overall success rate of 11% by 4. This is only valid when 4-times more post-docs applied for a grant than all other researchers.

[Update 2]

Kristin Lohwasser got in touch with me and clarified her position, so I updated the post, which now counts one more post-doc. She recommends to post-docs who want to apply for a starting grant in the future:

to point out, where you have designed projects for other students and have been their primary supervisor (even if not officially).

Normative evidence accumulation in unpredictable environments.

Glaze, C. M., Kable, J. W., and Gold, J. I.
eLife, 4, 2015

Abstract

In our dynamic world, decisions about noisy stimuli can require temporal accumulation of evidence to identify steady signals; differentiation to detect unpredictable changes in those signals; or both. Normative models can account for learning in these environments but have not yet been applied to faster decision processes. We present a novel, normative formulation of adaptive learning models that forms decisions by acting as a leaky accumulator with non-absorbing bounds. These dynamics, derived for both discrete and continuous cases, depend on the expected rate of change of the statistics of the evidence and balance signal identification and change detection. We found that, for two different tasks, human subjects learned these expectations, albeit imperfectly, then used them to make decisions in accordance with the normative model. The results represent a unified, empirically supported account of decision-making in unpredictable environments that provides new insights into the expectation-driven dynamics of the underlying neural signals.

Review

The authors suggest a model of sequential information processing that is aware of possible switches in the underlying source of information. They further show that the model fits responses of people in two perceptual decision making tasks and consequently argue that behaviour, which was previously considered to be suboptimal, may follow the normative, i.e., optimal, mechanism of the model. This mechanism postulates that typical evidence accumulation mechanisms in perceptual decision making are altered by the expected switch rate of the stimulus. Specifically, evidence accumulation becomes more leaky and a non-absorbing bound becomes lower when the expected switch rate increases. The paper is generally well-written (although there are some convoluted bits in the results section) and convincing. I was a bit surprised, though, that only choices, but not their timing, are considered in the analysis with the model. In the following I’ll go through some more details of the model and discuss limitations of the presented models and their relation to other models in the field, but first I describe the experiments reported in the paper.

The paper reports two experiments. In the first (triangles task) people saw two triangles on the screen and had to judge whether a single dot was more likely to originate from the one triangle or the other. There was one dot and corresponding response per trial. In each trial the position of the dot was redrawn from a Gaussian distribution centred around one of the two triangles. There were also change point trials in which the triangle from which the dot was drawn switched (and then remained the same until the next change point). The authors analysed the proportion correct in relation to whether a trial was a change point. Trials were grouped into blocks which were defined by a constant rate of switches (hazard rate) in the true originating triangle. In the second experiment (dots-reversal task), a random dot stimulus repeatedly switched (reversed) direction within a trial. In each trial people had to tell in which direction the dots moved before they vanished. The authors analysed the proportion correct in relation to the time between the last switch and the end of stimulus presentation. There were no blocks. Each trial had one of two hazard rates and one of two difficulty levels. The two difficulty levels were determined for each subject individually such that the more difficult one led to correct identification of motion direction of a 500 ms long stimulus in 65% of cases.

The authors present two normative models, one discrete and one continuous, which they apply across and within trial in the triangles and dots-reversal tasks, respectively. The discrete model is a simple hidden Markov model in which the hidden state can take one of two values and there is a common transition probability between these two values which they call hazard ‘rate’ (H). Observations were implicitly assumed Gaussian. They only enter during fitting as log-likelihood ratios in the form \(\beta x_n\) where \(\beta\) is a scaling factor relating to the internal / sensory uncertainty associated with the generative model of observations and \(x_n\) is the observed dot position (x-coordinate) in the triangles task. In methods, the authors derive the update equation for the log posterior odds (\(L_n\)) of the hidden state values given in Eqs. (1) and (2).
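For readers who want to play with the discrete model, here is a minimal sketch of the recursion as I would derive it from a two-state hidden Markov model with switch probability H. The function names `psi` and `update`, the values of `beta` and `H`, and the dot positions are my own illustrative choices, and the notation may differ from the paper's Eqs. (1) and (2).

```python
import math


def psi(L, H):
    """Log prior odds for the next observation, given current log posterior odds L.

    With probability H the hidden state switches before the next observation,
    so the posterior odds exp(L) are mixed with their inverse.
    """
    return math.log(((1 - H) * math.exp(L) + H) / (H * math.exp(L) + (1 - H)))


def update(L_prev, llr, H):
    """One step: propagate through a possible switch, then add the new evidence."""
    return psi(L_prev, H) + llr


# Hypothetical values for illustration only.
beta, H = 2.0, 0.1
dot_positions = [0.3, 0.5, -0.1, 0.4]  # made-up x-coordinates x_n

L = 0.0
for x in dot_positions:
    L = update(L, beta * x, H)  # LLR enters as beta * x_n
    print(f"x_n = {x:+.1f} -> L_n = {L:+.3f}")
```

Two sanity checks on this form: for H = 0.5 the previous evidence is discarded entirely (`psi` returns 0), and as H approaches 0 `psi` approaches the identity, which gives back perfect accumulation.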

The continuous model is based on a Markov jump process with two states which is the continuous equivalent of the hidden Markov model above. Using Itô calculus the authors again derive an update equation for the log posterior odds of the two states (Eq. 4), but during fitting they actually approximate Eq. (4) with the discrete Eq. (1), because it is supposedly the most efficient discrete-time approximation of Eq. (4) (no explanation for why this is the case was given). They just replace the log-likelihood ratio placeholder (LLR) with a coherence-dependent term applicable to the random dot motion stimulus. Notably, in contrast to standard drift-diffusion modelling of random dot motion tasks, the authors used coherence-dependent noise. I’d be interested in the reason for this choice.

There is an apparent fundamental difference between the discrete and continuous models which can be seen in Fig. 1 B vs C. In the discrete model, for H>0.5, the log posterior odds may actually switch sign from one observation to the next whereas this cannot happen in the continuous model. Conceptually, this means that the log posterior odds in the discrete model, when the LLR is 0, i.e., when there is no evidence in either direction, would oscillate between decreasing positive and increasing negative values until converging to 0. This oscillation can be seen in Fig. 2G, red line for |LLR|>0. In the continuous model such an oscillation cannot happen, because the infinitely many, tiny time steps allow the model to converge to 0 before switching the sign. Another way to see this is through the discrete hazard ‘rate’ H which is the probability of a sign reversal within one time step of size dt. When you want to decrease dt in the model, but want to maintain a given rate of sign reversals in, e.g., 1 second, H would also have to decrease. Consequently, when dt approaches 0, the probability of a sign reversal approaches 0, too, which means that H is a useless parameter in continuous time which, in turn, is the reason why it is replaced by a real rate parameter (\(\lambda\)) representing the expected number of reversals per second. In conclusion, the fundamental difference between discrete and continuous models is only an apparent one. They are very similar models, just expressed in different resolutions of time. In that sense it would have perhaps been better to present results in the paper consistently in terms of a real hazard rate (\(\lambda\)) which could be obtained in the triangles task by dividing H by the average duration of a trial in seconds. Notice that the discrete model represents all hazard rates \(\lambda>1/dt\) as H=1, i.e., it cannot represent hazard rates which would lead to more than 1 expected sign reversal per \(dt\). 
There may be more subtle differences between the models when the exact distributions of sign reversals are considered instead of only the expected rates.
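Both points, the sign flips for H>0.5 and the disappearance of the difference at fine time resolution, can be checked numerically. The sketch below is my own illustration, not the authors' code: it assumes the two-state hidden Markov model with switch probability H per step, sets all LLRs to 0, and uses made-up values for the rate `lam`, the starting value and the duration. The closed-form expression in the last lines is simply the dt → 0 limit of this recursion with H = lam * dt.

```python
import math


def psi(L, H):
    """Log prior odds after allowing for a switch with probability H (two-state HMM)."""
    return math.log(((1 - H) * math.exp(L) + H) / (H * math.exp(L) + (1 - H)))


def decay(L0, H, n_steps):
    """Trajectory of the log posterior odds when no evidence arrives (LLR = 0)."""
    trajectory = [L0]
    for _ in range(n_steps):
        trajectory.append(psi(trajectory[-1], H))
    return trajectory


print([round(L, 3) for L in decay(2.0, H=0.9, n_steps=4)])  # sign alternates
print([round(L, 3) for L in decay(2.0, H=0.1, n_steps=4)])  # sign is preserved

# Keep the rate lam (expected reversals per second) fixed and refine dt.
lam, duration, L0 = 2.0, 0.5, 2.0
for dt in (0.1, 0.01, 0.001):
    L_end = decay(L0, H=lam * dt, n_steps=round(duration / dt))[-1]
    print(f"dt = {dt}: L after {duration} s = {L_end:.4f}")

# Limit of the recursion for dt -> 0.
L_limit = 2 * math.atanh(math.tanh(L0 / 2) * math.exp(-2 * lam * duration))
print(f"dt -> 0: L after {duration} s = {L_limit:.4f}")
```

The reason behind both observations is that each step scales tanh(L/2) by the factor (1 - 2H), which is negative for H>0.5 and tends to 1 from below as dt shrinks.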

Using first order approximations of the two models the authors identify two components in the dynamics of the log posterior odds L: a leak and a bias. [Side remark: there is a small sign mistake in the definition of leak k of the continuous model in the Methods section.] Both depend on hazard rate and the authors show that the leak dominates the dynamics for small L whereas the bias dominates for large L. I find this denomination a bit misleading, because both leak and bias effectively result in a leak of log-posterior odds L by reducing L in every time step (cf. Fig. 1B,C). The change from a multiplicative leak to one based on a bias just means that the effective amount of leak in L increases nonlinearly with L as the bias takes over.
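For the discrete model the two regimes can be written down explicitly. The following is my own sketch, assuming the two-state hidden Markov model with switch probability \(H<0.5\); the symbol \(\psi\) for the map from the current log posterior odds to the next log prior odds is my notation and may differ from the paper's.

\[
\psi(L, H) = \log\frac{(1-H)\,e^{L} + H}{H\,e^{L} + (1-H)}
\]

\[
\psi(L, H) \approx (1-2H)\,L \quad \text{for } |L| \ll 1,
\qquad
\psi(L, H) \to \pm\log\frac{1-H}{H} \quad \text{for } L \to \pm\infty
\]

For small \(L\) a fixed fraction \(2H\) of \(L\) is lost in each step, which is a multiplicative leak. For large \(L\) the result is capped at \(\log\frac{1-H}{H}\), which I take to be what the authors call the bias, so the amount removed, \(L - \log\frac{1-H}{H}\), keeps growing with \(L\).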

To test whether this special form of leak underlies decision making the authors compared the full model to two versions which only had a multiplicative leak, or one based on bias. In the former the leak stayed constant for increasing L, i.e., \(L' = \gamma L\). In the latter there was perfect accumulation without leak up to the bias and then a bias-based leak which corresponds to a multiplicative leak where the leak rate increased with L such that \(L' = \gamma(L) L\) with \(\gamma(L) = \mathrm{bias} / L\). The authors report evidence that in both tasks both alternative models do not describe choice behaviour as well as the full, normative model. In Fig. 9 they provide a reason by estimating the effective leak rate in the data and the models in dependence on the strength of sensory evidence (coherence in the dots-reversal task). They do this by fitting the model with multiplicative leak separately to trials with low and high coherence (fitting to choices in the data or predicted by the different fitted models). In both data and normative model the effective leak rates depended on coherence. This dependence arises, because high sensory evidence leads to large values of L and I have argued above that larger L has larger effective leak rate due to the bias. It is, therefore, not surprising that the alternative model with multiplicative leak shows no dependence of effective leak on coherence. But it is also not surprising that the alternative model with bias-based leak has a larger dependence of effective leak on coherence than the data, because this model jumps from no leak to very large leak when coherence jumps from low to high. The full, normative model lies in between, because it smoothly transitions between the two alternative models.
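The relation between the three variants can be illustrated by computing, for each, the fraction of L that is lost in a single step. This is only a sketch under my own assumptions: the normative update is that of a two-state hidden Markov model with switch probability H, the value H = 0.1 is made up, and the parameters of the two reduced variants are chosen to match the normative update at small and large L, respectively, rather than fitted as in the paper.

```python
import math


def normative(L, H):
    """Prior update of the two-state HMM with switch probability H."""
    return math.log(((1 - H) * math.exp(L) + H) / (H * math.exp(L) + (1 - H)))


def leak_only(L, gamma):
    """Multiplicative leak: a constant fraction of L survives each step."""
    return gamma * L


def bias_only(L, bias):
    """Perfect accumulation up to +/- bias, then clipping at the bias."""
    return max(-bias, min(bias, L))


def effective_leak(updated_L, L):
    """Fraction of L that was lost in one step."""
    return 1 - updated_L / L


H = 0.1                        # hypothetical switch probability
gamma = 1 - 2 * H              # matches the normative update for small L
bias = math.log((1 - H) / H)   # matches the normative update for large L

print("    L  normative  leak-only  bias-only")
for L in (0.1, 1.0, 3.0, 6.0):
    row = (
        effective_leak(normative(L, H), L),
        effective_leak(leak_only(L, gamma), L),
        effective_leak(bias_only(L, bias), L),
    )
    print(f"{L:5.1f}  " + "  ".join(f"{value:9.3f}" for value in row))
```

In this sketch the effective leak of the normative update starts at the constant value of the leak-only variant for small L and approaches the bias-only variant for large L, while the bias-only variant jumps from no leak at all to a large one.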

Why is there a leak in the first place? Other people have found no evidence for a leak in evidence accumulation (e.g., Brunton et al., 2013). The leak results from the possibility of a switch of the source of the observations, i.e., a switch of the underlying true stimulus. Without any information, i.e., without observations, the possibility of a switch means that you should become more uncertain about the stimulus as time passes. The larger the hazard rate, i.e., the larger the probability of a switch within some time window, the faster you should become uncertain about the current stimulus. For a log posterior odds of L=0 uncertainty is at its maximum (both stimuli have equal posterior probability). This is another reason why discrete hazard ‘rates’ H>0.5 which lead to sign reversals in L do not make much sense. The absence of evidence for one stimulus should not lead to evidence for the other stimulus. Anyway, as the hazard rate goes to 0 the leak will go to 0 such that in experiments where usually no switches in stimulus occur subjects should not exhibit a leak, which explains why we often find no evidence for leaks in typical perceptual decision making experiments. This does not mean that there is no leak, though. In particular, the authors report here that hazard rates estimated from behaviour of subjects (subjective) tended to be a bit higher than the ones used to generate the stimuli (objective) when the objective hazard rates were very low, and the other way around for high objective hazard rates. This indicates that people have some prior expectations towards intermediate hazard rates that biased their estimates of hazard rates in the experiment.

The discussed forms of leak implement a property of the model that the authors called a ‘non-absorbing bound’. I find this wording also a bit misleading, because ‘bound’ was usually used to indicate a threshold in drift diffusion models which, when reached, would trigger a response. The bound here triggers nothing. Rather, it represents an asymptote of the average log posterior odds. Thus, it’s not an absolute bound, but it’s often passed due to variance in the momentary sensory evidence (LLR). I can also not follow the authors when they write: “The stabilizing boundary is also in contrast to the asymptote in leaky accumulation, which increases linearly with the strength of evidence”. Based on the dynamics of L discussed above the ‘bound’ here should exhibit exactly the described behaviour of an asymptote in leaky accumulation. The strength of evidence is reflected in the magnitude of LLR which is added to the intrinsic dynamics of the log posterior odds L. The non-absorbing bound, therefore, should be given by bias + average of LLR for the current stimulus. The bound, thus, should rise linearly with the strength of evidence (LLR).
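My reasoning about the asymptote can be checked with a small simulation. The sketch assumes the discrete two-state model with switch probability H and feeds it the same LLR in every step, i.e., the average evidence without any noise; H = 0.1 and the LLR values are made up for illustration.

```python
import math


def psi(L, H):
    """Log prior odds after allowing for a switch with probability H (two-state HMM)."""
    return math.log(((1 - H) * math.exp(L) + H) / (H * math.exp(L) + (1 - H)))


def asymptote(llr, H, n_steps=500):
    """Value at which L settles when every observation carries the same LLR."""
    L = 0.0
    for _ in range(n_steps):
        L = psi(L, H) + llr
    return L


H = 0.1                        # hypothetical switch probability
bias = math.log((1 - H) / H)   # limit of psi for large L

for llr in (0.5, 2.0, 8.0):
    print(f"LLR = {llr}: asymptote = {asymptote(llr, H):.3f}, bias + LLR = {bias + llr:.3f}")
```

Under these assumptions the asymptote of L approaches bias + LLR for strong evidence and stays somewhat below it for weak evidence, so it rises with the strength of evidence and does so approximately linearly once the bias dominates.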

Fitting of the discrete and continuous models was done by maximising the likelihood of the models (in some fits with many parameters, priors over parameters were used to regularise the optimisation). The likelihood in the discrete models was Gaussian with mean equal to the log posterior odds (\(L_n\)) computed from the actual dot positions \(x_n\). The variance of the Gaussian likelihood was fitted to the data as a free parameter. In the continuous model the likelihood was numerically approximated by simulating the discretised evolution of the probabilities that the log posterior odds take on particular values. This is very similar to the approach used by Brunton et al. (2013). The distribution of the log posterior odds \(L_n\) was considered here, because the stream of sensory observations \(x(t)\) was unknown and therefore had to enter as a random variable while in the triangles task \(x(t)=x_n\) was set to the known x-coordinates of the presented dots.

The authors argued that the fits of behaviour were good, but at least for the dots reversal task Fig. 8 suggests otherwise. For example, Fig. 8G shows that 6 out of 12 subjects (there were supposed to be 13, but I can only see 12 in the plots) made 100% errors in trials with the low hazard rate of 0.1Hz and low coherence where the last switch in stimulus was very recent (maximally 300ms before the end of stimulus presentation). The best fitting model, however, predicted error rates of at most 90% in these conditions. Furthermore, there is a significant difference in choice errors between the low and high hazard rate for large times after the last switch in stimulus (Fig. 8A, more errors for high hazard rate) which was not predicted by the fitted normative model. Despite these differences the fitted normative model seems to capture the overall patterns in the data.

Conclusion

The authors present an interesting normative model in discrete and continuous time that extends previous models of evidence accumulation to situations in which switches in the presented stimulus can be expected. In light of this model, a leak in evidence accumulation reflects a tendency to increase uncertainty about the stimulus due to a potentially upcoming switch in the stimulus. The model provides a mathematical relation between the precise type of leak and the expected switch (hazard) rate of the stimulus. In particular, and in contrast to previous models, the leak in the present model depends nonlinearly on the accumulated evidence. As the authors discuss, the presented normative model potentially unifies decision making processes observed in different situations characterised by different stabilities of the underlying stimuli. I had the impression that the authors were very thorough in their analysis. However, some deviations of model and data apparent in Fig. 8 suggest that either the model itself, or the fitting procedure may be improved such that the model better fits people’s behaviour in the dots-reversal task. It was anyway surprising to me that subjects only had to make a single response per trial in that task. This feels like a big waste of potential choice data when I consider that each trial was 5-10s long and contained several stimulus switches (reversals).

A test of Bayesian observer models of processing in the Eriksen flanker task.

White, C. N., Brown, S., and Ratcliff, R.
J Exp Psychol Hum Percept Perform, 38:489–497, 2012

Abstract

Two Bayesian observer models were recently proposed to account for data from the Eriksen flanker task, in which flanking items interfere with processing of a central target. One model assumes that interference stems from a perceptual bias to process nearby items as if they are compatible, and the other assumes that the interference is due to spatial uncertainty in the visual system (Yu, Dayan, & Cohen, 2009). Both models were shown to produce one aspect of the empirical data, the below-chance dip in accuracy for fast responses to incongruent trials. However, the models had not been fit to the full set of behavioral data from the flanker task, nor had they been contrasted with other models. The present study demonstrates that neither model can account for the behavioral data as well as a comparison spotlight-diffusion model. Both observer models missed key aspects of the data, challenging the validity of their underlying mechanisms. Analysis of a new hybrid model showed that the shortcomings of the observer models stem from their assumptions about visual processing, not the use of a Bayesian decision process.

Review

This is a response to Yu et al. (2009) in which the authors show that Yu et al.'s main Bayesian models cannot account for the full data of an Eriksen flanker task. In particular, Yu et al.'s models predict a far too high overall error rate with the suggested parameter settings that reproduce the initial drop of accuracy below chance level for very fast responses. The argument put forward by White et al. is that the mechanisms used in Yu et al.'s models to overcome initial, flanker-induced biases are too slow, i.e., the probabilistic evidence accumulation implemented by the models is influenced by the flankers for too long. White et al.'s shrinking spotlight models do not have such a problem, mostly because the speed with which flankers lose influence is fitted to the data. The argument seems compelling, but I would like to understand better why it takes so long in the Bayesian model to overcome flanker influence and whether there are other ways of speeding this up than the one suggested by White et al.

Dynamics of attentional selection under conflict: toward a rational Bayesian account.

Yu, A. J., Dayan, P., and Cohen, J. D.
J Exp Psychol Hum Percept Perform, 35:700–717, 2009

Abstract

The brain exhibits remarkable facility in exerting attentional control in most circumstances, but it also suffers apparent limitations in others. The authors' goal is to construct a rational account for why attentional control appears suboptimal under conditions of conflict and what this implies about the underlying computational principles. The formal framework used is based on Bayesian probability theory, which provides a convenient language for delineating the rationale and dynamics of attentional selection. The authors illustrate these issues with the Eriksen flanker task, a classical paradigm that explores the effects of competing sensory inputs on response tendencies. The authors show how 2 distinctly formulated models, based on compatibility bias and spatial uncertainty principles, can account for the behavioral data. They also suggest novel experiments that may differentiate these models. In addition, they elaborate a simplified model that approximates optimal computation and may map more directly onto the underlying neural machinery. This approximate model uses conflict monitoring, putatively mediated by the anterior cingulate cortex, as a proxy for compatibility representation. The authors also consider how this conflict information might be disseminated and used to control processing.

Review

The authors suggest two simple, Bayesian perceptual models based on evidence integration for the (deadlined) Eriksen task. Their focus is on attentional mechanisms that can explain why participants' responses are below chance for very fast responses. These mechanisms are based on a prior on compatibility (that flankers are compatible with the relevant centre stimulus) and spatial uncertainty (flankers influence processing of centre stimulus on a low, sensory level). The core inference is the same and replicates the basic mechanism you would expect for any perceptual decision making task. They don't fit behaviour, but rather show average trajectories from model simulations with hand-tuned parameters. They further suggest a third model inspired by previous work on conflict monitoring and cognitive control which supposedly is more likely to be implemented in the brain, because instead of having to consider (and compute with) all possible stimuli in the environment, it uses a conflict monitoring mechanism to switch between types of stimuli that are considered.

Neural correlates of perceptual decision making before, during, and after decision commitment in monkey frontal eye field.

Ding, L. and Gold, J. I.
Cereb Cortex, 22:1052–1067, 2012

Abstract

Perceptual decision making requires a complex set of computations to implement, evaluate, and adjust the conversion of sensory input into a categorical judgment. Little is known about how the specific underlying computations are distributed across and within different brain regions. Using a reaction-time (RT) motion direction-discrimination task, we show that a unique combination of decision-related signals is represented in monkey frontal eye field (FEF). Some responses were modulated by choice, motion strength, and RT, consistent with a temporal accumulation of sensory evidence. These responses converged to a threshold level prior to behavioral responses, reflecting decision commitment. Other responses continued to be modulated by motion strength even after decision commitment, possibly providing a memory trace to help evaluate and adjust the decision process with respect to rewarding outcomes. Both response types were encoded by FEF neurons with both narrow- and broad-spike waveforms, presumably corresponding to inhibitory interneurons and excitatory pyramidal neurons, respectively, and with diverse visual, visuomotor, and motor properties, albeit with different frequencies. Thus, neurons throughout FEF appear to make multiple contributions to decision making that only partially overlap with contributions from other brain regions. These results help to constrain how networks of brain regions interact to generate perceptual decisions.

Review

This paper puts into perspective the commonly made statement that LIP neurons are responsible for perceptual decision making in monkeys who perform a reaction time motion discrimination task. In particular, the authors report on neurons in frontal eye field (FEF) that also show typical accumulation-to-bound responses. Furthermore, at least as many neurons in FEF exhibited activity that was correlated with motion coherence and choice during and after the saccade indicating a choice and extinguishing the stimulus, i.e., the activity of these neurons appeared to accumulate evidence, but seemed to ignore the supposed bound and maintained a representation of the stimulus after it had gone. In the discussion the authors also point to other studies which found activity that can be interpreted in terms of evidence accumulation. Corresponding neurons have been found in LIP, FEF, superior colliculus (SC) and caudate nucleus of which neurons in LIP and SC may be mostly governed by a bound. From the reported and reviewed results it becomes clear that, although accumulation-to-bound may be an important component of perceptual decision making, it is not sufficient to explain the wide variety of decision-related neuronal activity in the brain. In particular, it is unclear how neurons from the mentioned brain regions interact and what their different roles in perceptual decision making are.

On the Jeffreys-Lindley paradox.

Robert, C.
Philosophy of Science, 81:216–232, 2014

Abstract

This paper discusses the dual interpretation of the Jeffreys–Lindley paradox associated with Bayesian posterior probabilities and Bayes factors, both as a differentiation between frequentist and Bayesian statistics and as a pointer to the difficulty of using improper priors while testing. We stress the considerable impact of this paradox on the foundations of both classical and Bayesian statistics. While assessing existing resolutions of the paradox, we focus on a critical viewpoint of the paradox discussed by Spanos (2013) in Philosophy of Science.

Review

Robert discusses whether the Jeffreys-Lindley paradox can be used to discredit the frequentist or Bayesian approach to statistical testing. He concludes that it cannot, because it just shows the different interpretations inherent in the two approaches. Interesting insights into Bayesian hypothesis testing follow.

Whereas Murphy (2012) directly defines the Jeffreys-Lindley paradox in terms of too wide and improper priors, Robert here defines it first from the historical, statistical testing perspective, in which it is more puzzling. In particular, the Jeffreys-Lindley paradox is that, for a given value of a test statistic and an increasing number of data points, the p-value of a point-null hypothesis stays the same, e.g. rejecting the null, while the supposedly corresponding Bayes factor goes to infinity, i.e., indicates overwhelming evidence for the null. Robert points out that the assumption that the test statistic stays constant for an increasing number of data points is only realistic when the point-null is actually true. If the alternative is true, the test statistic diverges to infinity. Then both approaches give consistent answers, i.e., they both reject the null. This analysis suggests a problem with p-value testing of point-null hypotheses which is well known and, as Robert states, has been addressed by the (frequentist) Neyman-Pearson approach to statistical testing in which the p-value threshold is adapted to the number of data points. Therefore, Robert concludes that the Jeffreys-Lindley paradox “is not a statistical paradox”. It points, however, to a problem in the Bayesian approach: The Bayes factor directly depends on the width of the prior distribution.

Robert mentions that the original formulation of the Bayesian part of the paradox is equivalent to another formulation in which the width of the prior for the alternative hypothesis is changed instead of the number of data points. Note that the width of the prior defines the range of parameter values that is considered to be realistic under the alternative hypothesis. As the width of the prior grows, the Bayes factor increases, eventually accepting the point-null. This corresponds to Murphy's (2012) definition of the paradox, which states that decisions based on the Bayes factor arbitrarily depend on the chosen width of the prior. Consequently, the width of the prior should not be chosen arbitrarily! Robert points out that this behaviour of the Bayes factor is consistent with the Bayesian framework: As you become more uncertain about the alternative, the null becomes more likely. And he writes: “Depending on one’s perspective about the position of Bayesian statistics within statistical theories of inference, one might see this as a strength or as a weakness since Bayes factors and posterior probabilities do require a realistic model under the alternative when p-values and Bayesian predictives do not. A logical reason for this requirement is that Bayesian inference need proceed with the alternative model when the null is rejected.”
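To make the two formulations concrete for myself, here is a minimal sketch under assumptions that are mine and not taken from Robert's paper: n observations from a normal distribution with known standard deviation sigma, a point-null theta = 0 and a normal prior N(0, tau^2) on theta under the alternative. In this textbook setting the Bayes factor in favour of the null depends on n and tau only through n * tau^2 / sigma^2, so increasing the number of data points and widening the prior have the same effect. The function names are made up for illustration.

```python
import math

def two_sided_p_value(z):
    """Two-sided p-value of a z statistic under the point-null."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def bayes_factor_null(z, n, tau, sigma=1.0):
    """Bayes factor BF_01 (null over alternative).

    Assumed model: x_i ~ N(theta, sigma^2), H0: theta = 0,
    H1: theta ~ N(0, tau^2), z = sqrt(n) * mean(x) / sigma.
    """
    ratio = n * tau**2 / sigma**2
    return math.sqrt(1.0 + ratio) * math.exp(-0.5 * z**2 * ratio / (1.0 + ratio))

z = 1.96  # test statistic held fixed, so the p-value never changes
print("p-value:", round(two_sided_p_value(z), 3))

# Original formulation: more data, same test statistic.
for n in (1, 100, 10_000, 1_000_000):
    print("n =", n, "BF_01 =", round(bayes_factor_null(z, n, tau=1.0), 2))

# Equivalent formulation: same data, wider prior under the alternative.
for tau in (1.0, 10.0, 100.0, 1000.0):
    print("tau =", tau, "BF_01 =", round(bayes_factor_null(z, 100, tau), 2))
```

In both loops the Bayes factor grows without bound while the p-value stays at the same value, which is the paradox in its two guises.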

So it’s crucial for Bayesian hypothesis testing that priors realistically reflect the subjective uncertainty about the parameters in the model(s), but what if you really don’t know anything about sensible parameter ranges and it is all the same to you whether you choose a wide prior, or one that is, e.g., two times wider? Robert briefly points to “score and predictive procedures” (I think he means Bayesian predictives as advocated by Gelman, e.g., in Gelman et al., 2013), but acknowledges that this is still a contested topic.

What does that mean for Bayesian model comparisons? They are delicate, although, in my opinion, not more delicate than p-value testing. Priors have to be justified and, when in doubt, it has to be shown that the main conclusions do not critically depend on the priors. Robert also reminds us that model comparison fundamentally restricts inference to the considered models, i.e., model comparison does not say anything about the suitability of the considered models. If the considered models are all bad models of the data, model comparison likely does not provide useful information, because it only states which of the models is the “best”. And these considerations don’t even take into account that marginal likelihoods are hard to compute. Model comparison is appealing conceptually, because it allows one to make definitive statements about the data, but given these considerations it should, at least, be accompanied by a predictive analysis.

Universality in numerical computations with random data.

Deift, P. A., Menon, G., Olver, S., and Trogdon, T.
Proc Natl Acad Sci U S A, 111:14973–14978, 2014
DOI, Google Scholar

Abstract

The authors present evidence for universality in numerical computations with random data. Given a (possibly stochastic) numerical algorithm with random input data, the time (or number of iterations) to convergence (within a given tolerance) is a random variable, called the halting time. Two-component universality is observed for the fluctuations of the halting time-i.e., the histogram for the halting times, centered by the sample average and scaled by the sample variance, collapses to a universal curve, independent of the input data distribution, as the dimension increases. Thus, up to two components-the sample average and the sample variance-the statistics for the halting time are universally prescribed. The case studies include six standard numerical algorithms as well as a model of neural computation and decision-making. A link to relevant software is provided for readers who would like to do computations of their own.

Review

The authors show that normalised halting / stopping times follow common distributions. Stopping times are assumed to be generated by an algorithm A from a random ensemble E, where E does not represent the particular sample from which stopping times are generated, but the theoretical distribution of that sample. Normalisation is standard normalisation: subtract the mean and divide by the standard deviation of a sample of stopping times. The resulting distribution is the same across different ensembles E, but differs across algorithms A. The authors call the fact that the distributions are the same (two-component) universality, without explaining why they call it that. There is also no reference to a concept of universality. Perhaps it’s something common in physics. Perhaps it’s explained in their first reference. Reference numbers are shifted by one, by the way.
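The normalisation step is easy to sketch. The toy "algorithm" below is my own stand-in and not one of the six algorithms studied in the paper: it simply adds random positive steps until a threshold is crossed and reports the number of steps as the halting time. The two step distributions play the role of two ensembles E with the same mean but different variance. All names are hypothetical.

```python
import random
import statistics

def halting_time(draw_step, threshold):
    """Number of random steps needed until their running sum reaches the threshold."""
    total = 0.0
    steps = 0
    while total < threshold:
        total += draw_step()
        steps += 1
    return steps

def normalise(sample):
    """Standard normalisation: subtract the sample mean, divide by the sample std."""
    mean = statistics.mean(sample)
    std = statistics.pstdev(sample)
    return [(x - mean) / std for x in sample]

rng = random.Random(0)
# Two toy ensembles: same mean step size, different spread.
ensembles = {
    "uniform(0, 2)": lambda: rng.uniform(0.0, 2.0),
    "exponential(mean 1)": lambda: rng.expovariate(1.0),
}

for name, draw_step in ensembles.items():
    times = [halting_time(draw_step, 200.0) for _ in range(2000)]
    z = sorted(normalise(times))
    quartiles = [z[len(z) // 4], z[len(z) // 2], z[3 * len(z) // 4]]
    print(name, [round(q, 2) for q in quartiles])
```

If universality holds in this toy setting, the printed quartiles of the normalised halting times should be close to each other for both ensembles, even though the raw halting times have clearly different spreads.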

How is that interesting? I’m not sure. The authors give an example with a model of reaction times. This is a kind of Ising model where decisions are made once a sufficient number of binary states have switched to one of the states. States flip with a certain probability as determined by a given function of the current state of the whole Ising model. When different such functions were considered, corresponding to different ensembles E, normalised reaction times followed the same distribution again. However, the distribution of normalised reaction times differed for different total numbers of binary states in the Ising model. These results suggest that normalised reaction times should follow the same distribution over subjects, but only if subjects differ maximally by the randomness on which their decisions are based. If subjects use slightly different algorithms for making decisions, you would expect differences in the distribution of normalised reaction times. I guess it would be cool to infer that subjects use the same (or a different) algorithm purely from their reaction time distributions, but what would be an appropriate test for this and what would be its power?

Effects of cortical microstimulation on confidence in a perceptual decision.

Fetsch, C. R., Kiani, R., Newsome, W. T., and Shadlen, M. N.
Neuron, 83:797–804, 2014
DOI, Google Scholar

Abstract

Decisions are often associated with a degree of certainty, or confidence-an estimate of the probability that the chosen option will be correct. Recent neurophysiological results suggest that the central processing of evidence leading to a perceptual decision also establishes a level of confidence. Here we provide a causal test of this hypothesis by electrically stimulating areas of the visual cortex involved in motion perception. Monkeys discriminated the direction of motion in a noisy display and were sometimes allowed to opt out of the direction choice if their confidence was low. Microstimulation did not reduce overall confidence in the decision but instead altered confidence in a manner that mimicked a change in visual motion, plus a small increase in sensory noise. The results suggest that the same sensory neural signals support choice, reaction time, and confidence in a decision and that artificial manipulation of these signals preserves the quantitative relationship between accumulated evidence and confidence.

Review

The paper provides verification of beliefs asserted in Kiani2009: Confidence is directly linked to accumulated evidence as represented in monkey area LIP during a random dot motion discrimination task. The authors use exactly the same task, but now stimulate patches of MT/MST neurons instead of recording single LIP neurons and resort to analysing behavioural data only. They find that small microstimulation of functionally well-defined neurons, that signal a particular motion direction, affects decisions in the same way as manipulating the motion information in the stimulus directly. This was expected, because it has been shown before that stimulating MT neurons influences decisions in that way. New here is that the effect of stimulation on confidence judgements was evaluated at the same time. The rather humdrum result: confidence judgements are also affected in the same way. The authors argue that this didn’t have to be, because confidence judgements are thought to be a metacognitive process that may be influenced by other high-level cognitive functions such as related to motivation. Then again, isn’t decision making thought to be a high-level cognitive function that is clearly influenced by motivation?

Anyway, there was one small effect particular to stimulation that did not occur in the control experiment where the stimulus itself was manipulated: There was a slight decrease in the overall proportion of sure-bet choices (presumably indicating low confidence) with stimulation suggesting that monkeys were more confident when stimulated. The authors explain this with larger noise (diffusion) in a simple drift-diffusion model. Counterintuitively, the larger accumulation noise increases the probability of moving away from the initial value and out of the low-confidence region. The mechanism makes sense, but I would rather explain it within an equivalent Bayesian model in which MT neurons represent noisy observations that are transformed into noisy pieces of evidence which are accumulated in LIP. Stimulation increases the noise on the observations which in turn increases accumulation noise in the equivalent drift-diffusion model (see Bitzer et al., 2014).
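The counterintuitive effect of noise can be illustrated with a deliberately simplified sketch. The assumptions here are mine and much cruder than the authors' model: evidence is accumulated for a fixed duration without a bound, the decision variable at the end is normally distributed, and the sure bet is chosen whenever the decision variable ends up inside a fixed low-confidence band around its starting value of 0. Parameter values are arbitrary.

```python
import math

def normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sure_bet_probability(drift, noise, duration=1.0, band=1.0):
    """P(|DV| < band) for DV ~ N(drift * duration, noise^2 * duration).

    Assumption: the sure bet is taken whenever the final decision
    variable lies inside the low-confidence band (-band, band).
    """
    mean = drift * duration
    std = noise * math.sqrt(duration)
    return normal_cdf((band - mean) / std) - normal_cdf((-band - mean) / std)

# Weak (here: zero) motion signal, baseline noise versus increased noise.
print(round(sure_bet_probability(drift=0.0, noise=1.0), 3))
print(round(sure_bet_probability(drift=0.0, noise=1.5), 3))
```

With weak or zero drift, more diffusion pushes more probability mass out of the band, so the sure bet is chosen less often. In this toy version the effect can reverse when the drift is strong, because the decision variable then mostly ends far outside the band and extra noise brings some trials back in.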

In drift-diffusion models drift, diffusion and threshold are mutually redundant in that one of them needs to be fixed when fitting the model to choices and reaction times. The authors here let all of them vary simultaneously which indicates that the parameters can be discriminated based on confidence judgements even when no reaction time is taken into account. This should be followed up. It is also interesting to think about how the postulated tight link between the ‘decision variable’ and the experienced confidence can be consolidated in a reaction time task where supposedly all decisions are made at the same threshold value. Notice that the confidence of a decision in their framework depends on the state of the diffusion (most likely one of the two boundaries) and the time of the decision: Assuming fixed noise, smaller decision times should translate into larger confidence, because you assume that this is due to a larger drift. Therefore, you should see variability of confidence judgements in a reaction time task that is strongly correlated with reaction times.

Better than chance: a small consolation after losing the prediction game

World Cup over and once again nothing won in the prediction game? Welcome to the club! Me neither. And I thought I had the football knowledge needed to predict well. But chance ruled here, didn't it? In any case, I wanted to know whether my football knowledge was at least enough to beat someone who simply picked a winner at random (the random tipper). Here is my analysis. It tells you how many prediction-game points you would need in order to say with reasonable certainty that you were better than chance. Along the way it explains how scientists evaluate the outcome of experiments.

This post is based on an IPython notebook, which you can also view directly (here).

Jogi’s boys clinched the World Cup title on Sunday and crowned an exciting World Cup with the event we had all hoped for. But honestly: who predicted that beforehand? Not me, anyway. And unfortunately my predictions for the individual matches were also wrong too often. So once again I ended up empty-handed at the end of our prediction game. Yet this time I had watched many matches (too many, in my girlfriend's opinion), read analyses and studied odds to work out the necessary edge. My predictions did not reward me. Am I too stupid for predicting? Would I perhaps have done better if I had simply predicted at random, without thinking? To answer exactly this question, I devised a statistical test that is based on a tip simulator. The necessary program code is in the following file, which I now load:

In [1]:

run TippSimulator.py

A few explanations of the procedure follow. Then, further down, the answer.

Tip simulator

To decide whether I am better than chance, I first have to say what exactly chance does. The crucial point for my analysis is that my tip simulator picks the winner of a match purely at random. It therefore ignores any information about potential favourites and is thus a random tipper. But since we always predict scores in a prediction game, the tip simulator must of course also be able to generate sensible scores. So I looked at how matches ended in the last World Cups and collected the scores by hand (World Cups 2006 and 2010). I can read them in like this:

In [2]:

worldcups = readWCHistory()

I then feed the scores of the last two World Cups into my tip simulator:

In [3]:

sim = TippSimulator(worldcups)

The variable sim now contains a TippSimulator that can predict match results while taking typical results of the last World Cups into account. As an example, here are 7 randomly predicted matches:

In [4]:

sim.generateScores(sim.stage1dist, 7)

Out[4]:

array([[ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 3.,  1.],
       [ 1.,  0.],
       [ 1.,  2.],
       [ 0.,  1.]])

Prediction game

To compute the points for the prediction game, I of course also have to define its rules. I collected them in a class WC2014Tippspiel, adopting the rules of our own game. These were the standard rules of kicktipp.de, that is, we got 4 points for the exact result, 3 points for the correct goal difference and 2 points if only the correct winner was predicted, or if the match ended in a draw, but with a different score than the predicted draw. After the group stage we predicted the final result including a possible penalty shoot-out.

Of course, the match results of this World Cup are also needed to compute the points for the predictions. All of this is handled by this call:

In [5]:

tip = WC2014Tippspiel()
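The scoring rules described above can be sketched as a small function. This is not the code of WC2014Tippspiel, which is not shown in this post, but a hypothetical stand-alone version of the same rules: 4 points for the exact result, 3 for the correct goal difference, 2 for the correct winner or for a draw with a different score, and 0 otherwise.

```python
def tipp_points(tip, result):
    """Points for one match; tip and result are (goals_a, goals_b) tuples."""
    if tuple(tip) == tuple(result):
        return 4  # exact result
    tip_diff = tip[0] - tip[1]
    result_diff = result[0] - result[1]
    if tip_diff == 0 or result_diff == 0:
        # A draw is involved: 2 points only if both tip and result are draws.
        return 2 if tip_diff == result_diff else 0
    if tip_diff == result_diff:
        return 3  # correct goal difference
    if (tip_diff > 0) == (result_diff > 0):
        return 2  # correct winner only
    return 0

print(tipp_points((2, 1), (2, 1)))  # exact result
print(tipp_points((2, 1), (3, 2)))  # correct goal difference
print(tipp_points((2, 0), (1, 0)))  # correct winner
print(tipp_points((1, 1), (2, 2)))  # draw with a different score
print(tipp_points((0, 1), (1, 0)))  # wrong winner
```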

How many points chance gets

I can now pass the tip simulator to the prediction game and obtain a point distribution that describes how probable each point total is when the random strategy of the tip simulator is used for predicting:

In [6]:

Pdist = tip.compPointDist(sim)

Of course, I want to compare this with my own points. To do so, I load my predictions into the program and compute the points that the prediction game awards for them. At the same time I load the data for the winner of our round, who had chosen the pseudonym “Noib” as a name.

In [7]:

names, tscores = tip.readWC2014tipps()
tpoints = tip.compWCPoints(tscores)
tpointssum = tpoints.cumsum(0)

Now I can bring everything together in one figure:

In [8]:

plotTippPointsWithDist(Pdist, tpointssum, names)

The grey lines show the course of our prediction game for me and Noib. The more matches of the World Cup were decided, the more points we had collected, of course. The curves therefore rise towards the right. At the beginning I kept up well with Noib, but from about match 20 of the World Cup I hit a points drought and Noib pulled away.

But what does the random tipper do? I have marked its point distribution in colour. Black indicates a high probability for the respective point total, and via red and gold to white the probability falls to 0.0. Probabilities that exceed 0.05 are also marked in black. You can see that the point total of the random tipper (of course) rises as well. After more matches, more and more point totals become probable, that is, for higher match numbers more point totals are marked in colour.

A first indication of whether I was better than chance is whether my point total lies outside the coloured region. For the first few matches it looks as if I had been able to pull away from chance, but after said match 20 I got no points at all for several matches in a row. Chance then caught up again. Until about the end of the group stage (match 48) I rode the wave of chance, only to profit then from the easily predictable results of the round of 16 and the quarter-finals.

Was I really better than chance?

In principle the random tipper can reach any point total. The point totals marked in colour above are only those that the random tipper reaches most probably. How, then, can I be reasonably sure that I was better than the random tipper?

Scientists have exactly the same problem when they have to decide whether an observed phenomenon arose only by chance, or was actually caused by their experimental manipulation. For example, a doctor could ask whether an administered drug actually worked. To find out, the doctor compares a control group that only received a placebo with another group that took the drug. The control group here is the random tipper, I am the drug group and the reduction of symptoms is the point total. The doctor then has to decide whether the reduction of symptoms in the drug group was larger than that in the control group. To do so, the doctor looks at how probable the reduction of symptoms in the drug group is, given the variability of the reduction of symptoms in the control group. Only if the reduction of symptoms lies far outside the range that frequently occurs in the control group can the doctor be sure that the drug worked and that the reduction of symptoms in the drug group is not wrongly attributed to the drug (after all, the symptoms could also have receded by chance). I now apply exactly this principle to myself and the random tipper.

The figure above already gave me a good impression that, at least towards the end of the prediction game, I lay outside the range of points that the random tipper would reach more frequently. I now want to pin this down with a concrete number. If the random tipper had taken part many times, this number would express in what percentage of the repeated runs the random tipper would have got more points than I did. In science this is called the p-value. There, however, it is rather interpreted as an expected error probability: if an experiment in which in truth only chance is at work were hypothetically repeated many times, the p-value would give the percentage of repetitions in which the scientist would have drawn the wrong conclusion from the experiment. In the example above: how often the doctor would attribute the reduction of symptoms to the drug, although actually only chance was responsible for it.
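The p-value described above can be sketched in a few lines. This is not the code behind compPointDist or plotTippPointsWithCDF, which is not shown here. The random tipper in this sketch is a made-up stand-in whose per-match point probabilities are invented for illustration, so the printed numbers are not the ones of my analysis.

```python
import random

def p_value(random_totals, my_points):
    """Fraction of simulated random-tipper runs with more points than my_points."""
    return sum(total > my_points for total in random_totals) / len(random_totals)

def points_needed(random_totals, alpha):
    """Smallest point total whose p-value falls below alpha (alpha > 0)."""
    for points in range(min(random_totals), max(random_totals) + 1):
        if p_value(random_totals, points) < alpha:
            return points

rng = random.Random(1)
# Hypothetical per-match outcome of a random tipper: points and made-up probabilities.
per_match_points = [0, 2, 3, 4]
weights = [0.62, 0.25, 0.08, 0.05]
# Each simulated run sums the points over the 64 matches of a World Cup.
totals = [sum(rng.choices(per_match_points, weights, k=64)) for _ in range(5000)]

print("p-value for 83 points:", p_value(totals, 83))
print("points needed for p < 0.01:", points_needed(totals, 0.01))
```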

In psychology a p-value of 5% (0.05) is often already enough for a result of an experiment to be accepted as ‘significant’. Frequently, however, p-values have to be below 1%. In the following figure I therefore show the ranges of points for which I can say with a p-value of 0.01 that I was better (black), or worse (gold) than chance. The red region marks point totals that can be expected from the random tipper (the figure above shows this region in detail).

In [9]:

plotTippPointsWithCDF(Pdist, tpointssum, names, pval=0.01)

points to be above pval = 0.01 : 87.0

Oh, I was not significantly better than chance after all! In the end I had 83 points, but to be able to say with confidence that I was better than chance, I would have needed 87 points. Oh well, Germany are world champions anyway (gold!) and there is a trend in the right direction! 🙂

Decision-related activity in sensory neurons reflects more than a neuron's causal effect.

Nienborg, H. and Cumming, B. G.
Nature, 459:89–92, 2009
DOI, Google Scholar

Abstract

During perceptual decisions, the activity of sensory neurons correlates with a subject’s percept, even when the physical stimulus is identical. The origin of this correlation is unknown. Current theory proposes a causal effect of noise in sensory neurons on perceptual decisions, but the correlation could result from different brain states associated with the perceptual choice (a top-down explanation). These two schemes have very different implications for the role of sensory neurons in forming decisions. Here we use white-noise analysis to measure tuning functions of V2 neurons associated with choice and simultaneously measure how the variation in the stimulus affects the subjects’ (two macaques) perceptual decisions. In causal models, stronger effects of the stimulus upon decisions, mediated by sensory neurons, are associated with stronger choice-related activity. However, we find that over the time course of the trial these measures change in different directions-at odds with causal models. An analysis of the effect of reward size also supports this conclusion. Finally, we find that choice is associated with changes in neuronal gain that are incompatible with causal models. All three results are readily explained if choice is associated with changes in neuronal gain caused by top-down phenomena that closely resemble attention. We conclude that top-down processes contribute to choice-related activity. Thus, even forming simple sensory decisions involves complex interactions between cognitive processes and sensory neurons.

Review

They investigated the source of the choice probability of early sensory neurons. Choice probability quantifies the difference in firing rate distributions separated by the behavioural response of the subject. The less overlap between the firing rate distributions for one response and its alternative (in two-choice tasks), the greater the choice probability. Importantly, they restricted their analysis to trials in which the stimulus was effectively random. In random dot motion experiments this corresponds to 0% coherent motion, but here they used a disparity discrimination task and looked at disparity selective neurons in macaque area V2. The mean contribution from the stimulus, therefore, should have been 0. Yet, they found that choice probability was above 0.5 indicating that the firing of the neurons still could predict the final response, but why? They consider two possibilities: 1) the particular noise in firing rates of sensory neurons causes, at least partially, the final choice. 2) The firing rate of sensory neurons reflects choice-related effects induced by top-down influences from more decision-related areas.
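For reference, choice probability is usually computed as the area under the ROC curve that compares the two choice-conditioned firing rate distributions. The sketch below is my own minimal version of that standard measure, not the authors' analysis code, and the firing rates are made-up numbers.

```python
def choice_probability(rates_preferred_choice, rates_other_choice):
    """Area under the ROC curve for two samples of firing rates.

    Equals the probability that a randomly drawn rate from trials ending in
    the neuron's preferred choice exceeds one from trials ending in the
    other choice; ties count half.
    """
    wins = 0.0
    for x in rates_preferred_choice:
        for y in rates_other_choice:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(rates_preferred_choice) * len(rates_other_choice))

# Made-up spike rates (spikes/s) for illustration only.
preferred = [22, 25, 19, 30, 27]
other = [20, 18, 24, 21, 17]
print(choice_probability(preferred, other))
```

A value of 0.5 means that the two distributions overlap completely, and a value of 1 means that firing rates perfectly separate the two choices.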

Note that the choice probability they use is somewhat corrected for influences from the stimulus by considering the firing rate of a neuron in response to a particular disparity, but without taking choices into account. This correction reduced choice probabilities a bit. Nevertheless, they remained significantly above 0.5. This result indicates that the firing rate distributions of the recorded neurons were only little affected by which disparities were shown in individual frames when these distributions are defined depending on the final choice. I don’t find this surprising, because there was no consistent stimulus to detect from the random disparities and the behavioural choices were effectively random.

Yet, the particular disparities presented in individual trials had an influence on the final choice. They used psychophysical reverse correlation to determine this. The analysis suggests that the very first frames had a very small effect which is followed by a steep rise in influence of frames at the beginning of a trial (until about 200ms) and then a steady decline. This result can mean different things depending on whether you believe that evidence accumulation stops once you have reached a threshold, or whether evidence accumulation continues until you are required to make a response. Shadlen is probably a proponent of the first proposition. Then, the decreasing influence of the stimulus on the choice just reflects the smaller number of trials in which the threshold hasn’t been reached, yet. Based on the second proposition, the result means that the weight of individual pieces of evidence during accumulation reduces as you come closer to the response. Currently, I can’t think of decisive evidence for either proposition, but it has been shown in perturbation experiments that stimulus perturbations close to a decision, late in a trial had smaller effects on final choices than perturbations early in a trial (Huk and Shadlen, 2005).

Back to the source of above chance-level choice probabilities. The authors argue, given the decreasing influence of the stimulus on the final choice and assuming that the influence of the stimulus on sensory neurons stays constant, that choice probabilities should also decrease towards the end of a trial. However, choice probabilities stay roughly constant after an initial rise. Consequently, they infer that the firing of the neurons must be influenced by other sources, apart from the stimulus, which are correlated with the choice. They consider two of these sources: i) Lateral, sensory neurons which could reflect the final decision better. ii) Higher, decision-related areas which, for example, project a kind of bias onto the sensory neurons. The authors strongly prefer ii), also because they found that the firing of sensory neurons appears to be gain modulated when contrasting firing rates between final choices. In particular, firing rates showed a larger gain (steeper disparity tuning curve of the neuron) when trials were considered which ended with the behavioural choice corresponding to the preferred disparity of the neuron. In other words, the output of a neuron was selectively increased if that neuron preferred the disparity which was finally chosen. Equivalently, the output of a neuron was selectively decreased if that neuron preferred a different disparity than the one which was finally chosen. This gain difference explains at least part of the difference in firing rate distributions which the choice probability measures.

They also show an interesting effect of reward size on the correlation between stimulus and final choice: The stimulus had a larger influence on choice for larger reward. Again, if the choice probabilities were mainly driven by stimulus-related, bottom-up effects and the stimulus had a larger influence on final choice in high reward trials, then choice probabilities should have been higher in high reward trials. The opposite was the case: choice probabilities were lower in high reward trials. The authors explain this using the previous bias hypothesis: The measured choice probabilities reflect something like an attentional gain or bias induced by higher-level decision-related areas. As the stimulus becomes more important, the bias loses influence. Hence, the choice probabilities decrease.

In summary, the authors present convincing evidence that even sensory neurons in early visual cortex (V2) receive top-down, decision-related influences. The choice probabilities reported here were quite similar to those reported in a previous paper (Nienborg and Cumming, 2006), even though here only trials with completely random stimuli were considered. I would have guessed that choice probabilities would be considerably higher for trials with an actually presented stimulus. Why is there only a moderate difference? Perhaps there actually isn’t one. My observation is only based on a brief look at the figures in the two papers.