Ruberg et al. recently published an article with the aim of starting a virtuous cycle in the use of Bayesian approaches in drug development (Nat. Rev. Drug Discov. 22, 235–250; 2023)1. Here, we would like to extend important arguments of their article and add recommendations for using Bayesian statistics2.
The authors emphasize the fundamental distinctions between frequentist and Bayesian approaches1. However, we would also like to highlight the similarities between the two perspectives. First, Bayesian and (likelihood-based) frequentist approaches converge towards identical results as more data are collected, because the likelihood based on the observed data increasingly dominates the prior. Second, both approaches give similar, if not identical, results when a non-informative prior distribution is used. In this situation, a frequentist one-sided p-value can indeed be interpreted as the probability that the null hypothesis is true, because the p-value equals the corresponding posterior probability3. Awareness of these similarities may help to bridge the two statistical schools of thought and to resolve regulatory concerns around the use of Bayesian approaches, which may be less familiar than frequentist approaches to some researchers.
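This equivalence is easy to verify numerically. The following minimal sketch (our illustration, with hypothetical trial numbers) assumes a normal likelihood with known variance and a flat prior on the mean treatment effect; under these assumptions, the one-sided p-value for the null hypothesis of no benefit coincides with the posterior probability that the effect is at most zero.

```python
# Minimal sketch (hypothetical data): with a normal likelihood and a flat
# (non-informative) prior on the mean effect, the one-sided p-value equals
# the posterior probability that the true effect lies on the null side.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, sigma, true_effect = 50, 1.0, 0.2          # assumed trial settings
x = rng.normal(true_effect, sigma, size=n)
xbar, se = x.mean(), sigma / np.sqrt(n)

# Frequentist one-sided p-value for H0: effect <= 0
p_value = stats.norm.sf(xbar / se)

# Bayesian posterior under a flat prior: theta | data ~ N(xbar, se^2)
post_prob_null = stats.norm.cdf(0, loc=xbar, scale=se)  # P(theta <= 0 | data)

print(f"one-sided p-value:        {p_value:.6f}")
print(f"posterior P(effect <= 0): {post_prob_null:.6f}")  # identical
```

Both quantities agree to numerical precision, which illustrates why results under a non-informative prior are often numerically indistinguishable from their frequentist counterparts.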
The authors propose Bayesian methods for the use of existing external information, such as data from previous clinical trials, while rightly pointing out that the use of external information may not be appropriate in all contexts owing to the potential for bias1. Generally speaking, we consider Bayesian methods suitable in scenarios in which omitting relevant available evidence, and thereby leaving a suffering patient population without treatment options, would be counter to common sense and unethical. Such scenarios include, for example, studies on medical devices, early-phase trials, trials involving patients with rare diseases and extrapolation of data from adult to paediatric populations.
We would like to illustrate this point by expanding on the excellent example provided by Ruberg et al.1. The THAPCA-OH trial4 compared therapeutic hypothermia with therapeutic normothermia (that is, actively maintained normal body temperature) for the treatment of children with cardiac arrest. The primary frequentist analysis did not reach the standard statistical significance threshold of 5% (p = 0.14), and therapeutic hypothermia was therefore rejected as ineffective. However, in a post-hoc Bayesian analysis, Harhay et al.5 showed that therapeutic hypothermia had a 94% probability of any benefit over therapeutic normothermia and concluded that it was efficacious. One could argue that a more liberal frequentist significance threshold, such as 15%, could have been chosen for the original analysis. However, regulators may not accept such a risk of a false positive conclusion, and the actual evidence for the efficacy of the treatment may still be undervalued: if the p-value turns out to be well below the liberal threshold (such as p = 0.06), a strict frequentist interpretation would merely declare the result significant at the 15% level, without conveying the strength of the evidence.
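As a back-of-the-envelope check (our sketch, not the model actually used by Harhay et al.5), a normal approximation with a flat prior maps the reported two-sided p = 0.14 to a posterior probability of any benefit of about 93%, close to the 94% obtained from the full Bayesian reanalysis:

```python
# Hedged sketch: under a normal approximation with a flat prior, a
# two-sided p-value maps to an approximate posterior probability of
# any benefit of 1 - p/2. This is our illustration only, not the
# outcome model used in the Bayesian reanalysis of THAPCA-OH.
from scipy import stats

p_two_sided = 0.14                       # reported primary result
z = stats.norm.ppf(1 - p_two_sided / 2)  # z-score of the observed effect
prob_benefit = stats.norm.cdf(z)         # P(effect > 0 | data) = 1 - p/2
print(f"approximate posterior probability of benefit: {prob_benefit:.2f}")  # ~0.93
```

The agreement is approximate because the actual reanalysis used the trial's full outcome model rather than this normal approximation.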
The authors present the example of a Bayesian hierarchical model that assumes an overall treatment effect across all subgroups in order to identify differential treatment effects between subgroups (see Box 4 in the original article1). First, the question here is whether one or more subgroups show a treatment effect that differs from that of the other subgroups. If so, the assumption of an overall treatment effect may not be plausible, and shrinkage of the subgroup-specific treatment effects towards the overall mean may make it harder to detect outlying subgroups, as the sketch below illustrates. Second, the overall mean and its credible interval are less relevant here, since they do not represent the expected variability of the treatment effect in a future study with the same sample size as the subgroup of interest: the overall mean is informed by data from all subgroups. Therefore, using standard Bayesian hierarchical models to identify an outlying subgroup might be misleading.
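To make the shrinkage effect concrete, here is a minimal sketch with hypothetical subgroup estimates. It uses an empirical-Bayes (method-of-moments) approximation to the normal-normal hierarchical model rather than a fully Bayesian fit, which is enough to show how an outlying subgroup is pulled towards the overall mean:

```python
# Hedged sketch (hypothetical numbers): empirical-Bayes shrinkage in a
# normal-normal hierarchical model pulls an outlying subgroup estimate
# towards the overall mean, which can mask a genuine differential effect.
import numpy as np

effects = np.array([0.30, 0.28, 0.33, 0.31, -0.10])  # last subgroup is outlying
se = np.full(5, 0.10)                                # within-subgroup standard errors

# Method-of-moments estimate of between-subgroup variance (DerSimonian-Laird style)
w = 1 / se**2
mu_hat = np.sum(w * effects) / np.sum(w)
q = np.sum(w * (effects - mu_hat)**2)
tau2 = max(0.0, (q - (len(effects) - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))

# Plug-in posterior means shrink each subgroup towards the overall mean
shrink = tau2 / (tau2 + se**2)
posterior_means = shrink * effects + (1 - shrink) * mu_hat
print("observed estimates:", effects)
print("shrunken estimates:", np.round(posterior_means, 3))  # outlier pulled to ~0.00
```

With these numbers, the outlying estimate of −0.10 is shrunk to roughly 0.00, making it appear unremarkable relative to the overall mean of about 0.22.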
For these reasons, Bayesian predictive cross-validation models6 have been recommended instead. First, these models do not assume an overall treatment effect but predict the treatment effect in a subgroup of interest based on the heterogeneity of results across all other subgroups. Second, these models calculate the prediction interval7 for the subgroup of interest, which accounts for its sample size and the corresponding sampling variability. If the observed, independently estimated treatment effect in the subgroup of interest then lies outside its prediction interval, one can draw the unbiased conclusion that this subgroup shows a differential treatment effect relative to all other subgroups (see the sketch below).
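Continuing the same hypothetical numbers, a leave-one-out predictive check in this spirit fits the model on the other subgroups only and asks whether the held-out estimate is compatible with its prediction interval (again an empirical-Bayes sketch, not a fully Bayesian implementation):

```python
# Hedged sketch: leave-one-out predictive check for the outlying subgroup,
# fitted on the remaining subgroups only (cross-validation in the spirit
# of Bayesian predictive cross-validation models).
import numpy as np
from scipy import stats

def loo_prediction(effects, se, i):
    """Predictive mean and sd for subgroup i, fitted on all other subgroups."""
    mask = np.arange(len(effects)) != i
    y, s = effects[mask], se[mask]
    w = 1 / s**2
    mu = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - mu)**2)
    tau2 = max(0.0, (q - (len(y) - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
    # Predictive sd combines between-subgroup heterogeneity, uncertainty
    # in the overall mean, and the held-out subgroup's sampling variability.
    return mu, np.sqrt(tau2 + 1 / np.sum(w) + se[i]**2)

effects = np.array([0.30, 0.28, 0.33, 0.31, -0.10])  # same hypothetical data
se = np.full(5, 0.10)
mu, sd = loo_prediction(effects, se, i=4)
low, high = stats.norm.interval(0.95, loc=mu, scale=sd)
print(f"95% prediction interval for the held-out subgroup: ({low:.2f}, {high:.2f})")
print(f"observed effect {effects[4]:.2f} outside interval: {not (low <= effects[4] <= high)}")
```

Here the prediction interval is roughly (0.09, 0.52), so the observed −0.10 is clearly flagged as a differential effect, whereas the shrinkage estimate in the previous sketch made the same subgroup look unremarkable.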
Dias et al.6 have illustrated the differences between a Bayesian hierarchical model and a Bayesian predictive cross-validation model for the identification of differential treatment effects in a meta-analysis of 16 studies (that is, subgroups) comparing intravenous magnesium with placebo in patients with acute myocardial infarction. In the Bayesian hierarchical model, when the results of the ISIS-4 ‘mega-trial’ were compared with the overall mean and its credible interval (based on all 16 studies), the ISIS-4 study appeared to be an outlier. When a Bayesian predictive cross-validation model was used instead to calculate the predictive mean and prediction interval for the ISIS-4 study (excluding ISIS-4 from the prediction model), the observed mean of the ISIS-4 study was consistent with its predictive mean and corresponding prediction interval, showing no major differential treatment effect.
We hope that our comments have deepened readers’ understanding of Bayesian statistics, and that the advantages of Bayesian statistical thinking discussed here and by Ruberg et al.1 will become more widely appreciated by drug developers and regulators in the future.