This paper decoder post is a little different, as it doesn’t relate to a particular paper. Rather, it’s my answer to the question in the title of this post, which was prompted by a colleague of mine. The colleague has a psychology background and had just learned about Bayesian statistics when the following question crossed his mind:
do Bayesian stuff, right? Trying to learn about it now, can’t quite get my head
around it yet, but it sounds like how I should be analysing data. In
psychophysics we usually collect a lot of data from a small number of subjects,
but then collapse all this data into a small number of points per subject for
the purposes of stats. This loses quite a lot of fine detail: for instance,
four steep psychometric functions with widely different means average together
to create a shallow function, which is not a good representation of the data.
Usually, the way psychoacousticians in particular get around this problem is
not to bother with the stats. This, of course, is not optimal either! As far as
I can tell the Bayesian approach to stats allows you to retain the variance
(and thus the detail) from each stage of analysis, which sounds perfect for my
old phd data and for the data i’m collecting now. it also sounds like the thing
to do for neuroimaging data: we collect a HUGE amount of data per subject in
the scanner, but then create these extremely coarse averages, leading people to
become very happy when they see something at the single-subject level. But of
course all effects should REALLY be at the single-subject level, we assume they
aren’t visible due to noise. So I’m wondering why everyone doesn’t employ this
Bayesian approach, even in fMRI etc..
In short, my answer is twofold: 1) Bayesian statistics can be computationally very hard and, more critically on the conceptual side, 2) the choice of prior influences the results of your statistical inference, which makes experimenters uneasy.
The following is my full answer. It contains a basic introduction to Bayesian statistics targeted to people who just realised that this exists. I bet that a simple search for “Bayesian frequentist” brings up a lot more valuable information.
You’re right: the best way to analyse any data is to maintain the full distribution of your variables of interest throughout all analysis steps. You nicely described the reasons for this. The only problem is that this can be really hard, depending on your statistical model and thus on your data. So you’ll need to make approximations. One way of doing this is to summarise the distribution by its mean and variance. The Gaussian distribution is so cool because these two parameters are actually sufficient to represent the whole distribution. For other probability distributions the mean and variance are not sufficient representations, so when you summarise the distribution with them you make an approximation. Therefore, you could say that the standard analysis methods you mention are valid approximations in the sense that they summarise the desired distribution with its mean. Then the question becomes: can you make better approximations for the model you consider? This is where the expertise of the statistician comes into play, because what you can do really depends on the particular situation with your data. Most of the time it’s impossible to derive the right distribution analytically, but many things can be solved numerically on a computer these days.
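To make the averaging problem from the question concrete, here is a minimal sketch. It assumes logistic psychometric functions and four hypothetical subjects with the same steep slope but widely different means; the function names and all numbers are made up for illustration and do not come from any real data set.

```python
import math


def psychometric(x, mean, slope):
    """Logistic psychometric function: probability of a 'yes' response at stimulus level x."""
    return 1.0 / (1.0 + math.exp(-slope * (x - mean)))


# Hypothetical subjects: identical steep slope, widely different means (assumed values).
subject_means = [-3.0, -1.0, 1.0, 3.0]
subject_slope = 4.0


def group_average(x):
    """Point-wise average of the four individual psychometric functions."""
    values = [psychometric(x, m, subject_slope) for m in subject_means]
    return sum(values) / len(values)


def max_slope(f, lo=-6.0, hi=6.0, steps=1200):
    """Steepest finite-difference slope of f on the interval [lo, hi]."""
    h = (hi - lo) / steps
    return max((f(lo + (i + 1) * h) - f(lo + i * h)) / h for i in range(steps))


single = max_slope(lambda x: psychometric(x, subject_means[0], subject_slope))
averaged = max_slope(group_average)

print(f"steepest slope, single subject: {single:.2f}")
print(f"steepest slope, group average:  {averaged:.2f}")
```

The averaged curve is far shallower than any individual curve, which is exactly the loss of detail described in the question: the average hides that every subject has a sharp threshold.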
Now a little clarification of what I understand by the Bayesian approach. Here’s a hypothetical example: your variable of interest, x, is whether person A is a genius. You can’t really tell directly whether a person is a genius, so you have to collect indirect evidence, y, from their behaviour (this might be the questions they ask, the answers they give, or indeed a battery of psychological tests). So x can take the values 0 (no genius) and 1 (genius). Your inference will be based on a statistical model of behaviour given genius or no genius (in words: if A is a genius then with probability p(y|x=1) he will exhibit behaviour y):
p(y|x=1) and p(y|x=0).
In a frequentist (classic) approach you will make a maximum likelihood estimate for x. This boils down to a simple procedure in which you sum up the log-probabilities of your pieces of evidence y_i under each hypothesis and compare which sum is larger:
sum over i log(p(y_i|x=1)) > sum over i log(p(y_i|x=0)) ???
If this statement is true, you’ll believe that A is a genius. Now, the problem is that, if you only have a few pieces of evidence, you can easily make false judgements with this procedure. Bayesians therefore take one additional source of information into account: the prior probability of someone being a genius, p(x=1), which is quite low. We can then get something called a maximum a posteriori estimate in which you weight evidence by the prior probability which leads to the following decision procedure:
log(p(x=1)) + sum over i log(p(y_i|x=1)) > log(p(x=0)) + sum over i log(p(y_i|x=0)) ???
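The two decision rules above can be sketched in a few lines. This is only an illustration under stated assumptions: each piece of evidence y_i is binary (1 for a "clever" answer, 0 otherwise), the pieces are conditionally independent given x, and the likelihoods and the prior are made-up numbers. The log-prior is added once to the summed log-likelihoods.

```python
import math

# Assumed (hypothetical) model: probability of a clever answer under each hypothesis.
P_CLEVER = {1: 0.8, 0: 0.5}   # p(y_i=1 | x=1) and p(y_i=1 | x=0)
PRIOR = {1: 0.01, 0: 0.99}    # p(x=1) and p(x=0); geniuses are assumed to be rare


def log_likelihood(observations, x):
    """Sum over i of log p(y_i | x), assuming conditionally independent observations."""
    p = P_CLEVER[x]
    return sum(math.log(p if y == 1 else 1.0 - p) for y in observations)


def ml_decision(observations):
    """Maximum likelihood: pick the x with the larger summed log-likelihood."""
    return 1 if log_likelihood(observations, 1) > log_likelihood(observations, 0) else 0


def map_decision(observations):
    """Maximum a posteriori: add the log-prior once to each summed log-likelihood."""
    score = {x: math.log(PRIOR[x]) + log_likelihood(observations, x) for x in (0, 1)}
    return 1 if score[1] > score[0] else 0


few = [1, 1, 1]     # three clever answers
many = [1] * 12     # twelve clever answers

print("ML  with few observations: ", ml_decision(few))
print("MAP with few observations: ", map_decision(few))
print("MAP with many observations:", map_decision(many))
```

With these assumed numbers, three clever answers are enough for the maximum likelihood rule to declare A a genius, whereas the maximum a posteriori rule stays sceptical until considerably more evidence has accumulated.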
Because p(x=1) is much smaller than p(x=0) this means that you now have to collect much more evidence where the probability of behaviour given that A is a genius, p(y_i|x=1), is larger than the probability of behaviour given that A is no genius, p(y_i|x=0), before you believe that A is a genius. In the full Bayesian approach you would actually not make a judgement, but estimate the posterior probability of A being a genius:
p(x=1|y) = p(y|x=1)p(x=1) / p(y).
This is the distribution which I said above is hard to estimate. The thing that makes it hard is p(y). In this case, where x can only take two values, it is actually very easy to compute:
p(y) = p(y|x=1)p(x=1) + p(y|x=0)p(x=0)
but for each additional value x can take you’ll have to add a term to this equation, and when x is a continuous variable this sum becomes an integral, and integration is hard.
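Here is a rough sketch of both cases. The binary posterior follows the two formulas above directly. For the continuous case I assume, purely for illustration, that x is A’s probability of giving a clever answer, that the prior over x is uniform on [0, 1], and that p(y) is approximated by a simple Riemann sum on a grid. The function names and all numbers are hypothetical.

```python
def posterior_binary(lik1, lik0, prior1):
    """p(x=1 | y) for binary x, given p(y|x=1), p(y|x=0) and p(x=1)."""
    evidence = lik1 * prior1 + lik0 * (1.0 - prior1)   # p(y): one term per value of x
    return lik1 * prior1 / evidence


def posterior_grid(likelihood, prior, grid):
    """Approximate posterior density p(x | y) on an evenly spaced grid of x values."""
    dx = grid[1] - grid[0]
    unnormalised = [likelihood(x) * prior(x) for x in grid]
    evidence = sum(unnormalised) * dx   # Riemann sum standing in for the integral p(y)
    return [u / evidence for u in unnormalised]


# Binary case: three clever answers, assumed p(clever|genius)=0.8, p(clever|no genius)=0.5.
p_genius = posterior_binary(lik1=0.8 ** 3, lik0=0.5 ** 3, prior1=0.01)
print(f"p(x=1 | y) = {p_genius:.3f}")

# Continuous case: x in [0, 1], uniform prior, likelihood of three clever answers is x**3.
n_points = 1000
grid = [(i + 0.5) / n_points for i in range(n_points)]
density = posterior_grid(lambda x: x ** 3, lambda x: 1.0, grid)
dx = grid[1] - grid[0]
posterior_mean = sum(x * d for x, d in zip(grid, density)) * dx
print(f"posterior mean of x = {posterior_mean:.3f}")
```

A one-dimensional grid like this is cheap, but the number of grid points grows exponentially with the number of variables, which is one way to see why the integral becomes the computational bottleneck in realistic models.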
One more, but very important, thing: technical problems aside, the biggest criticism of the Bayesian approach is the use of the prior. In my example it kept us from making a premature judgement, but only because we had a suitable estimate of the prior probability of someone being a genius. The question is: where does the prior come from? Well, it’s prior information that enters your inference. If you don’t have prior information about your variable of interest, you’ll use an uninformative prior which assigns equal probability to each value of x. Then the maximum likelihood and maximum a posteriori estimators above become equal, but what does it mean for the posterior distribution p(x|y)? It changes its interpretation. The posterior becomes an entity representing a belief in the corresponding statement (A is a genius) given the information provided by the prior. If the prior measures the true frequency of the corresponding event in the real world, the posterior is a statement about the state of the world. But if the prior has no such interpretation, the posterior is just the mentioned belief under the assumed prior. These arguments are very subtle. Think about my example. The prior could be paraphrased as the prior probability that person A is a genius. This prior cannot represent a frequency in the world, because person A exists only once in the world. So whatever we choose as prior is merely a prior belief. While frequentists often argue that the posterior does not faithfully represent the world because of a potentially unsuitable prior, in my example the Bayesian approach allowed us to incorporate information in the inference that is inaccessible to the frequentist approach. We did this by transferring the frequency of being a genius in the whole population to our a priori belief that person A is a genius.
Note that there really is no “correct” prior in my example and any prior will correspond to a particular prior assumption. Furthermore, the frequentist maximum likelihood estimator is equivalent to a maximum a posteriori estimator with a particular (uninformative) prior. Therefore, it has been argued that the Bayesian approach just makes explicit the prior assumptions which are also implicit in the more common (frequentist) statistical analyses. Unfortunately, it seems to be a bitter pill to swallow for experimenters to admit that the statistical analysis (and thus the outcome) of their experiment depends on prior assumptions (although they appear to be happy to do this in other contexts, for example, when making Gaussian assumptions when doing an ANOVA). Also, remember that the prior will ultimately be overwritten by sufficient evidence (even for a very low prior probability of A being a genius we’ll at some point believe that A is a genius, if A behaves accordingly). Given these considerations, the prior shouldn’t be a hindrance to using a Bayesian analysis of experimental data, but the technical issues remain.
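The point that evidence eventually overwrites the prior can be illustrated with a small sketch. It assumes that A keeps giving clever answers, that each clever answer has probability 0.8 under "genius" and 0.5 under "no genius", and that observations are conditionally independent; the function names and all numbers are hypothetical.

```python
def update(belief, lik1, lik0):
    """One Bayesian update of p(x=1) after a single observation."""
    evidence = lik1 * belief + lik0 * (1.0 - belief)
    return lik1 * belief / evidence


def observations_needed(prior1, lik1=0.8, lik0=0.5, threshold=0.5, max_obs=10_000):
    """Number of consecutive clever answers until p(x=1 | y) exceeds the threshold."""
    belief = prior1
    for n in range(1, max_obs + 1):
        belief = update(belief, lik1, lik0)
        if belief > threshold:
            return n
    return None   # threshold not reached within max_obs observations


for prior1 in (0.5, 0.01, 1e-6):
    n = observations_needed(prior1)
    print(f"prior p(x=1) = {prior1:g}: {n} clever answers needed")
```

Under these assumptions a more sceptical prior only delays the conclusion: the number of observations needed grows roughly with the logarithm of the prior odds, so even an extremely low prior is overcome after a moderate amount of consistent evidence.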