This paper discusses the dual interpretation of the Jeffreys–Lindley paradox associated with Bayesian posterior probabilities and Bayes factors, both as a differentiation between frequentist and Bayesian statistics and as a pointer to the difficulty of using improper priors while testing. We stress the considerable impact of this paradox on the foundations of both classical and Bayesian statistics. While assessing existing resolutions of the paradox, we focus on a critical viewpoint of the paradox discussed by Spanos (2013) in Philosophy of Science.
Robert discusses whether the Jeffreys-Lindley paradox can be used to discredit the frequentist or Bayesian approach to statistical testing. He concludes that it cannot, because it just shows the different interpretations inherent in the two approaches. Interesting insights into Bayesian hypothesis testing follow.
Whereas Murphy (2012) directly defines the Jeffreys-Lindley paradox in terms of excessively wide and improper priors, Robert here defines it first from the historical, statistical-testing perspective, in which it is more puzzling. In particular, the Jeffreys-Lindley paradox is that, for a fixed value of the test statistic and an increasing number of data points, the p-value of a point-null hypothesis stays the same, e.g., keeps rejecting the null, while the supposedly corresponding Bayes factor goes to infinity, i.e., indicates overwhelming evidence for the null. Robert points out that the assumption that the test statistic stays constant as the number of data points increases is only realistic when the point-null is actually true. If the alternative is true, the test statistic diverges to infinity. Then both approaches give consistent answers, i.e., they both reject the null. This analysis suggests a problem with p-value testing of point-null hypotheses which is well known and, as Robert states, has been addressed by the (frequentist) Neyman-Pearson approach to statistical testing, in which the p-value threshold is adapted to the number of data points. Robert therefore concludes that the Jeffreys-Lindley paradox “is not a statistical paradox”. It points, however, to a problem in the Bayesian approach: the Bayes factor directly depends on the width of the prior distribution.
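To make this limiting behaviour concrete, here is a minimal numerical sketch for the textbook normal-mean setting, testing H0: theta = 0 against H1: theta ~ N(0, tau^2) with n observations from N(theta, sigma^2). This particular parameterisation is my own illustration, not necessarily the example used in Robert's paper. The p-value depends only on the z-statistic and so stays fixed, while the Bayes factor B01 in favour of the null grows without bound:

```python
from math import sqrt, exp

def bayes_factor_01(z, n, tau=1.0, sigma=1.0):
    """Bayes factor in favour of H0: theta = 0 against H1: theta ~ N(0, tau^2),
    based on n observations from N(theta, sigma^2), summarised by the
    z-statistic z = sqrt(n) * xbar / sigma (closed form for this model)."""
    r = n * tau**2 / sigma**2  # prior variance relative to sampling variance of xbar
    return sqrt(1 + r) * exp(-0.5 * z**2 * r / (1 + r))

# Fix the test statistic at z = 1.96 (two-sided p-value of about 0.05,
# i.e. the null is rejected at the 5% level for every n) and let n grow:
for n in (10, 1_000, 100_000, 10_000_000):
    print(n, bayes_factor_01(1.96, n))
# B01 grows roughly like sqrt(n): the evidence FOR the null becomes overwhelming.
```

For large n the exponential term settles at exp(-z^2/2), so B01 is dominated by the sqrt(1 + n tau^2/sigma^2) factor, which is exactly the divergence the paradox describes.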
Robert mentions that the original formulation of the Bayesian part of the paradox is equivalent to another formulation in which the width of the prior for the alternative hypothesis is increased instead of the number of data points. Note that the width of the prior defines the range of parameter values that is considered realistic under the alternative hypothesis. As the width of the prior increases, the Bayes factor increases as well, eventually favouring the point-null. This corresponds to Murphy’s (2012) definition of the paradox, which states that decisions based on the Bayes factor arbitrarily depend on the chosen width of the prior. Consequently, the width of the prior should not be chosen arbitrarily! Robert points out that this behaviour of the Bayes factor is consistent with the Bayesian framework: as you become more uncertain about the alternative, the null becomes more likely. And he writes: “Depending on one’s perspective about the position of Bayesian statistics within statistical theories of inference, one might see this as a strength or as a weakness since Bayes factors and posterior probabilities do require a realistic model under the alternative when p-values and Bayesian predictives do not. A logical reason for this requirement is that Bayesian inference need proceed with the alternative model when the null is rejected.”
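The prior-width formulation can be sketched with the same kind of toy calculation. In the hypothetical normal-mean setup, testing H0: theta = 0 against H1: theta ~ N(0, tau^2) with n observations from N(theta, sigma^2) (again my own illustration, not the specific example in Robert's paper), holding the data fixed and widening the prior drives the Bayes factor in favour of the null to infinity:

```python
from math import sqrt, exp

def bayes_factor_01(z, n, tau, sigma=1.0):
    """Bayes factor in favour of H0: theta = 0 against H1: theta ~ N(0, tau^2),
    with z = sqrt(n) * xbar / sigma (closed form for the normal-mean model)."""
    r = n * tau**2 / sigma**2
    return sqrt(1 + r) * exp(-0.5 * z**2 * r / (1 + r))

# Fix n = 100 and z = 2.5 (two-sided p-value of about 0.012) and
# widen the prior on the alternative:
for tau in (0.5, 2.0, 10.0, 100.0):
    print(tau, bayes_factor_01(2.5, 100, tau))
# For large tau, B01 grows proportionally to tau: a sufficiently vague
# prior ends up favouring the point-null, whatever the data say.
```

This is why the width of the prior cannot be treated as an arbitrary tuning knob: it encodes which parameter values the alternative actually deems plausible.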
So it’s crucial for Bayesian hypothesis testing that priors realistically reflect the subjective uncertainty about the parameters in the model(s). But what if you really don’t know anything about sensible parameter ranges, and it makes no difference to you whether you choose a wide prior or one that is, e.g., two times wider? Robert briefly points to “score and predictive procedures” (I think he means Bayesian predictives as advocated by Gelman, e.g., in Gelman et al., 2013), but acknowledges that this remains a contested topic.
What does that mean for Bayesian model comparisons? They are delicate, although, in my opinion, not more delicate than p-value testing. Priors have to be justified and, when in doubt, it has to be shown that the main conclusions do not critically depend on them. Robert also reminds us that model comparison fundamentally restricts inference to the considered models, i.e., model comparison does not say anything about the suitability of the considered models. If the considered models are all bad models of the data, model comparison likely does not provide useful information, because it only states which of the models is the “best”. And these considerations do not yet touch on the fact that marginal likelihoods are hard to compute. Model comparison is conceptually appealing, because it allows one to make definitive statements about the data, but given these considerations it should, at least, be accompanied by a predictive analysis.