Why don’t we use Bayesian statistics to analyse experimental data?

This paper decoder post is a little different as it doesn’t relate to a particular paper. Rather it’s my answer to the question in the title of this post which was triggered by a colleague of mine. The colleague has a psychology background and just got to know about Bayesian statistics when the following question crossed his mind:

Question

You do Bayesian stuff, right? Trying to learn about it now, can’t quite get my head around it yet, but it sounds like how I should be analysing data. In psychophysics we usually collect a lot of data from a small number of subjects, but then collapse all this data into a small number of points per subject for the purposes of stats. This loses quite a lot of fine detail: for instance, four steep psychometric functions with widely different means average together to create a shallow function, which is not a good representation of the data. Usually, the way psychoacousticians in particular get around this problem is not to bother with the stats. This, of course, is not optimal either! As far as I can tell the Bayesian approach to stats allows you to retain the variance (and thus the detail) from each stage of analysis, which sounds perfect for my old PhD data and for the data I’m collecting now. It also sounds like the thing to do for neuroimaging data: we collect a HUGE amount of data per subject in the scanner, but then create these extremely coarse averages, leading people to become very happy when they see something at the single-subject level. But of course all effects should REALLY be at the single-subject level, we assume they aren’t visible due to noise. So I’m wondering why everyone doesn’t employ this Bayesian approach, even in fMRI etc.

In short, my answer is twofold: 1) Bayesian statistics can be computationally very hard and, 2) more critically on the conceptual side, choosing a prior influences the results of your statistical inference, which makes experimenters uneasy.

The following is my full answer. It contains a basic introduction to Bayesian statistics aimed at people who have just realised that it exists. I bet that a simple search for “Bayesian frequentist” brings up a lot more valuable information.

Answer

You’re right: the best way to analyse any data is to maintain the full distribution of your variables of interest throughout all analysis steps. You nicely described the reasons for this. The only problem is that this can be really hard, depending on your statistical model, i.e., on your data. So you’ll need to make approximations. One way of doing this is to summarise the distribution by its mean and variance. The Gaussian distribution is so cool because these two parameters are actually sufficient to represent the whole distribution. For other probability distributions the mean and variance are not sufficient, so you make an approximation when you summarise the distribution with them. Therefore, you could say that the standard analysis methods you mention are valid approximations in the sense that they summarise the desired distribution with its mean. Then the question becomes: can you make better approximations for the model you consider? This is where the expertise of the statistician comes into play, because what you can do really depends on the particular situation with your data. Most of the time it’s impossible to derive the right distribution analytically, but many things can be solved numerically on a computer these days.
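The loss of detail from averaging, as described in the question, can be reproduced in a few lines of Python. This is only a sketch under assumed values: the logistic shape, the four thresholds and the slope are hypothetical and not taken from any real data set.

```python
import math

def psychometric(x, threshold, slope):
    """Logistic psychometric function: probability of a 'yes' response at stimulus level x."""
    return 1.0 / (1.0 + math.exp(-slope * (x - threshold)))

def max_steepness(curve, xs):
    """Largest finite-difference slope of a curve sampled at the points xs."""
    return max((curve(b) - curve(a)) / (b - a) for a, b in zip(xs, xs[1:]))

# Hypothetical subjects: equally steep functions with widely different thresholds.
thresholds = [-3.0, -1.0, 1.0, 3.0]
slope = 4.0

def averaged(x):
    """Pointwise average of the four single-subject functions."""
    return sum(psychometric(x, t, slope) for t in thresholds) / len(thresholds)

xs = [i / 100.0 for i in range(-600, 601)]
single = max_steepness(lambda x: psychometric(x, thresholds[0], slope), xs)
pooled = max_steepness(averaged, xs)

print(f"steepest single-subject slope: {single:.2f}")
print(f"steepest slope of the average: {pooled:.2f}")
```

With these assumed values the averaged curve is roughly four times shallower than any of the single-subject curves, although every subject has the same steep function.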

Now a little clarification of what I understand by the Bayesian approach. Here’s a hypothetical example: your variable of interest, x, is whether person A is a genius. You can’t really tell directly whether a person is a genius, so you have to collect indirect evidence, y, from their behaviour (this might be the questions they ask, the answers they give, or indeed a battery of psychological tests). So x can take the values 0 (no genius) and 1 (genius). Your inference will be based on a statistical model of behaviour given genius or no genius (in words: if A is a genius, then with probability p(y|x=1) he will exhibit behaviour y):

p(y|x=1) and p(y|x=0).

In a frequentist (classic) approach you make a maximum likelihood estimate of x, which boils down to a simple procedure: you sum up the log-probabilities of your evidence under each hypothesis and compare which sum is larger:

sum over i log(p(y_i|x=1)) > sum over i log(p(y_i|x=0)) ???

If this statement is true, you’ll believe that A is a genius. Now, the problem is that, if you only have a few pieces of evidence, you can easily make false judgements with this procedure. Bayesians therefore take one additional source of information into account: the prior probability of someone being a genius, p(x=1), which is quite low. We can then get something called a maximum a posteriori estimate, in which you weight the evidence by the prior probability. This leads to the following decision procedure:

log(p(x=1)) + sum over i log(p(y_i|x=1)) > log(p(x=0)) + sum over i log(p(y_i|x=0)) ???
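The two decision procedures can be compared in a small Python sketch. Everything numeric here is an assumption made for illustration: each piece of evidence y_i is taken to be whether A answers a hard question correctly, with made-up probabilities of 0.8 for a genius and 0.4 otherwise, and a made-up prior of 0.01. The sketch adds the log prior once to the summed log-likelihoods, because the prior concerns x and not each individual observation.

```python
import math

# Hypothetical behaviour model: y_i = 1 if A answers a hard question correctly.
P_CORRECT = {1: 0.8, 0: 0.4}   # p(y_i = 1 | x), assumed values
PRIOR = {1: 0.01, 0: 0.99}     # p(x), assumed values

def log_likelihood(ys, x):
    """Sum over i of log p(y_i | x) for independent pieces of evidence."""
    p = P_CORRECT[x]
    return sum(math.log(p if y == 1 else 1.0 - p) for y in ys)

def ml_says_genius(ys):
    """Maximum likelihood decision: compare the summed log-likelihoods."""
    return log_likelihood(ys, 1) > log_likelihood(ys, 0)

def map_says_genius(ys):
    """Maximum a posteriori decision: add the log prior once on each side."""
    return (math.log(PRIOR[1]) + log_likelihood(ys, 1)
            > math.log(PRIOR[0]) + log_likelihood(ys, 0))

for n in (3, 7):
    ys = [1] * n  # n correct answers in a row
    print(n, ml_says_genius(ys), map_says_genius(ys))
```

Under these assumptions three correct answers already convince the maximum likelihood procedure, while the maximum a posteriori procedure needs seven.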

Because p(x=1) is much smaller than p(x=0) this means that you now have to collect much more evidence where the probability of behaviour given that A is a genius, p(y_i|x=1), is larger than the probability of behaviour given that A is no genius, p(y_i|x=0), before you believe that A is a genius. In the full Bayesian approach you would actually not make a judgement, but estimate the posterior probability of A being a genius:

p(x=1|y) = p(y|x=1)p(x=1) / p(y).

This is the distribution which I said above is hard to estimate. The thing that makes it hard is p(y). In this case, where x can only take two values, it is actually very easy to compute:

p(y) = p(y|x=1)p(x=1) + p(y|x=0)p(x=0)

but for each additional value x can take you’ll have to add a term to this equation, and when x is a continuous variable this sum becomes an integral, and integration is hard.
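As a rough illustration of both cases, here is a Python sketch. The first function computes the posterior for the two-valued x using the two-term sum for p(y). The second function is a hypothetical continuous variant that is not part of the genius example: x is taken to be an unknown probability of a correct answer between 0 and 1 with a flat prior, and the integral for p(y) is approximated by a sum over a grid. The model and all numbers are assumptions.

```python
import math

def posterior_genius(ys, p_correct, prior):
    """p(x=1 | y) for binary x; p(y) is the two-term sum over x."""
    def likelihood(x):
        p = p_correct[x]
        return math.prod(p if y == 1 else 1.0 - p for y in ys)
    joint = {x: likelihood(x) * prior[x] for x in (0, 1)}
    evidence = joint[0] + joint[1]  # p(y)
    return joint[1] / evidence

def grid_posterior_mean(ys, n_grid=1001):
    """Posterior mean of a continuous x in [0, 1] under a flat prior.

    The integral over x is replaced by a sum over grid points; the grid
    spacing cancels between numerator and denominator.
    """
    grid = [i / (n_grid - 1) for i in range(n_grid)]
    k, n = sum(ys), len(ys)
    unnormalised = [t ** k * (1.0 - t) ** (n - k) for t in grid]
    evidence = sum(unnormalised)  # proportional to p(y)
    return sum(t * u for t, u in zip(grid, unnormalised)) / evidence

# Assumed values: p(correct | genius) = 0.8, p(correct | no genius) = 0.4, p(genius) = 0.01
p_three = posterior_genius([1, 1, 1], {1: 0.8, 0: 0.4}, {1: 0.01, 0: 0.99})
print(f"p(genius | 3 correct answers) = {p_three:.3f}")
print(f"grid posterior mean after 3 of 4 correct = {grid_posterior_mean([1, 1, 1, 0]):.3f}")
```

A one-dimensional grid like this is cheap, but the number of grid points grows exponentially with the number of continuous variables, which is one way to see why the integration becomes hard.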

One more, but very important thing: technical problems aside, the biggest criticism of the Bayesian approach is its use of the prior. In my example it kept us from making a premature judgement, but only because we had a suitable estimate of the prior probability of someone being a genius. The question is: where does the prior come from? Well, it’s prior information that enters your inference. If you don’t have prior information about your variable of interest, you’ll use an uninformative prior which assigns equal probability to each value of x. Then the maximum likelihood and maximum a posteriori estimators above become equal, but what does this mean for the posterior distribution p(x|y)? It changes its interpretation. The posterior becomes an entity representing a belief in the corresponding statement (A is a genius) given the information provided by the prior. If the prior measures the true frequency of the corresponding event in the real world, the posterior is a statement about the state of the world. But if the prior has no such interpretation, the posterior is just the mentioned belief under the assumed prior. These arguments are very subtle. Think about my example. The prior could be paraphrased as the prior probability that person A is a genius. This prior cannot represent a frequency in the world, because person A exists only once in the world. So whatever we choose as prior is merely a prior belief. While frequentists often argue that the posterior does not faithfully represent the world because of a potentially unsuitable prior, in my example the Bayesian approach allowed us to incorporate information into the inference that is inaccessible to the frequentist approach. We did this by transferring the frequency of geniuses in the whole population to our a priori belief that person A is a genius.

Note that there really is no “correct” prior in my example and any prior will correspond to a particular prior assumption. Furthermore, the frequentist maximum likelihood estimator is equivalent to a maximum a posteriori estimator with a particular (uninformative) prior. Therefore, it has been argued that the Bayesian approach just makes explicit the prior assumptions which are also implicit in the more common (frequentist) statistical analyses. Unfortunately, it seems to be a bitter pill for experimenters to swallow to admit that the statistical analysis (and thus the outcome) of their experiment depends on prior assumptions (although they appear to be happy to do this in other contexts, for example, when making Gaussian assumptions in an ANOVA). Also, remember that the prior will ultimately be overridden by sufficient evidence (even for a very low prior probability of A being a genius we’ll at some point believe that A is a genius, if A behaves accordingly). Given these considerations, the prior shouldn’t be a hindrance to using a Bayesian analysis of experimental data, but the technical issues remain.
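How the evidence eventually overrides the prior can be sketched numerically. The following Python snippet assumes a made-up model in which every observation is a correct answer to a hard question, with probability 0.8 for a genius and 0.4 otherwise, and tracks the posterior for three hypothetical priors.

```python
def posterior_after(n_correct, prior_genius, p1=0.8, p0=0.4):
    """p(x=1 | n_correct correct answers) under the assumed model.

    p1 and p0 are the assumed probabilities of a correct answer
    for a genius and a non-genius, respectively.
    """
    joint1 = prior_genius * p1 ** n_correct
    joint0 = (1.0 - prior_genius) * p0 ** n_correct
    return joint1 / (joint1 + joint0)

for prior in (0.5, 0.01, 0.0001):
    trace = [round(posterior_after(n, prior), 3) for n in (0, 5, 10, 20, 30)]
    print(prior, trace)
```

With no evidence the posterior equals the prior, and with enough consistent evidence all three posteriors approach 1. The lower the prior, the more evidence it takes to get there.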

Action generation and action perception in imitation: an instance of the ideomotor principle.

Wohlschläger, A., Gattis, M., and Bekkering, H.
Philos Trans R Soc Lond B Biol Sci, 358:501–515, 2003

Abstract

We review a series of behavioural experiments on imitation in children and adults that test the predictions of a new theory of imitation. Most of the recent theories of imitation assume a direct visual-to-motor mapping between perceived and imitated movements. Based on our findings of systematic errors in imitation, the new theory of goal-directed imitation (GOADI) instead assumes that imitation is guided by cognitively specified goals. According to GOADI, the imitator does not imitate the observed movement as a whole, but rather decomposes it into its separate aspects. These aspects are hierarchically ordered, and the highest aspect becomes the imitator’s main goal. Other aspects become sub-goals. In accordance with the ideomotor principle, the main goal activates the motor programme that is most strongly associated with the achievement of that goal. When executed, this motor programme sometimes matches, and sometimes does not, the model’s movement. However, the main goal extracted from the model movement is almost always imitated correctly.

Review

The authors report on a series of experiments which led them to propose a theory of imitation, GOADI (goal-directed imitation), which gives the goal of a demonstrated movement a central role. In particular, they looked at the errors made by children when imitating the hand movements of a model. These movements were: touching your ear with your hand, touching spots on a table, and pointing at or picking up an object. These experiments made it possible to dissociate the goal of the corresponding movement from the way it is executed. In the ear-touching movements, for example, the model touches her right ear with her left hand (a contralateral movement), but the child might imitate by touching its left ear with its left hand (an ipsilateral movement). Throughout the paper they assume that people naturally imitate in a mirror fashion, i.e. you would touch your left ear when the model sitting opposite you touches her right ear. This is backed up by the data in the sense that this is what people do the vast majority of the time.

Their theory is motivated by frequent CI errors of the children, in which a contralateral movement is imitated with an ipsilateral movement, but the target of the movement is chosen correctly, i.e. the correct ear is touched. The authors conclude that the children determine the goal correctly, but don’t have enough working memory / attention to process all aspects of the demonstrated movements and simply execute the movement that achieves that goal and is most natural to them. In adults these kinds of errors are greatly reduced, which is perhaps a result of greater attentional abilities, but when the imitation task is made slightly more complicated, similar errors can be observed.

Another important part of the theory suggests that demonstrated movements are decomposed into separate aspects and that these are ordered in a hierarchy (a goal and subgoals) such that aspects higher in the hierarchy are imitated with greater care. They report on experiments in which such a hierarchy seems to be observed for the aspects object identity, object treatment, use of effector and movement (in this order). While there is a certain difference between object-specific aspects and movement-specific aspects, I’m not so certain about the strict hierarchy.

Anyway, the experiments are pretty convincing and strongly support a goal-directed theory of imitation, in contrast to theories which propose a direct mapping from sensory input to motor output.