Limitation of site-stratified Cox regression analysis in survival data: a cautionary tale of the PANAMO phase III randomized, controlled study in critically ill COVID-19 patients

Within Cox regression analysis, adjustments can be made for confounders that are known or expected to affect survival [13]. In the PANAMO phase III study, all Cox regression analyses were adjusted for age, as age has been demonstrated to impact survival in COVID-19. Stratifying by site within Cox regression could be justified if site-specific heterogeneity were assumed to confound the mortality outcome. However, in a site-stratified Cox regression, the partial likelihood is built from risk sets calculated separately for each site, so that each site can have its own baseline hazard. A site with no events (e.g., no deaths) adds no terms to the partial likelihood, and a site with only one enrolled patient, regardless of survival status, has a risk set with no variability; in both cases the site contributes nothing to the estimate of the treatment effect. Consequently, data from all such sites are effectively excluded from the analysis, creating data attrition that was not anticipated in the FDA's request.
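The mechanism can be illustrated with a minimal sketch of a stratified partial log-likelihood for a single binary covariate (treatment arm). The function names, the site labels, and the toy patient records are hypothetical and are not PANAMO data; age adjustment is omitted and ties are handled in the Breslow form for simplicity.

```python
import math

def stratum_loglik(beta, rows):
    """Partial log-likelihood contribution of one stratum (site).

    rows: list of (time, event, x), where event is 1 for death and 0 for
    censoring, and x is 1 for treatment and 0 for placebo.
    """
    total = 0.0
    for t_i, event_i, x_i in rows:
        if not event_i:
            continue  # censored patients add no term of their own
        # Risk set: patients in the SAME site still under observation at t_i.
        denom = sum(math.exp(beta * x_j) for t_j, _, x_j in rows if t_j >= t_i)
        total += beta * x_i - math.log(denom)
    return total

def stratified_loglik(beta, strata):
    """Sum of the per-site contributions."""
    return sum(stratum_loglik(beta, rows) for rows in strata.values())

# Hypothetical toy data, for illustration only.
strata = {
    "site_A": [(5, 1, 0), (8, 0, 1), (12, 1, 1), (20, 0, 0)],  # mixed outcomes
    "site_B": [(28, 0, 0), (28, 0, 1), (28, 0, 1)],            # no deaths
    "site_C": [(9, 1, 0)],                                     # one patient, died
}

for beta in (-1.0, 0.0, 0.5):
    print(beta,
          stratum_loglik(beta, strata["site_B"]),
          stratum_loglik(beta, strata["site_C"]))
```

For every value of beta, the site without deaths and the single-patient site contribute zero, so they carry no information about the treatment effect even though their patients were enrolled and followed.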

In the PANAMO phase III study, this was the case for 61 patients (16.6% of the total enrollment): 55 patients from sites with no events (i.e., deaths) plus 6 patients who died at single-patient sites. By chance, all 6 of these deaths occurred in the placebo group and none in the vilobelimab group; thus, excluding them from the data analysis caused underestimation of the treatment effect. Because the outcome data of all patients from these sites were in effect excluded from the analysis, the resulting p-value was compromised. This hidden bias, arising from a reduced effective sample size and an unbalanced exclusion across treatment arms, tipped the p-value above the significance level.
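A simple screening step of this kind could be run before fitting any stratified model. The sketch below flags sites with no deaths or with a single patient and tallies what would be lost by arm; the function name, the record layout, and the toy records are assumptions for illustration and do not reproduce the PANAMO counts.

```python
from collections import Counter, defaultdict

def noninformative_sites(records):
    """Return {site: [(arm, died), ...]} for sites that a site-stratified
    Cox model would effectively ignore: no deaths, or only one patient."""
    by_site = defaultdict(list)
    for site, arm, died in records:
        by_site[site].append((arm, died))
    dropped = {}
    for site, rows in by_site.items():
        no_deaths = not any(died for _, died in rows)
        if no_deaths or len(rows) == 1:
            dropped[site] = rows
    return dropped

# Hypothetical toy records: (site, arm, died).
records = [
    ("S1", "placebo", 1), ("S1", "treatment", 0),
    ("S1", "treatment", 1), ("S1", "placebo", 0),
    ("S2", "placebo", 0), ("S2", "treatment", 0), ("S2", "treatment", 0),
    ("S3", "placebo", 1),    # single-patient site, patient died
    ("S4", "treatment", 0),  # single-patient site, patient survived
]

dropped = noninformative_sites(records)
n_dropped = sum(len(rows) for rows in dropped.values())
lost_deaths = Counter(arm for rows in dropped.values() for arm, died in rows if died)

print(sorted(dropped), n_dropped, dict(lost_deaths))
```

Reporting the number of patients and the deaths per arm in the flagged sites makes any imbalance visible before the stratified p-value is interpreted.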

Empirically, this can be verified by removing these 61 patients and fitting the site-stratified Cox regression to the remaining data, which generated identical output (p-value, hazard ratio, and confidence intervals). When the data set was analyzed with the method originally proposed in the protocol, Cox regression without site stratification, the analysis reported a positive finding with a hazard ratio (HR) of 0.67 and a p-value of 0.026 (Fig. 1); the FDA adopted this as the more reliable method in its published review.
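The same check can be sketched on toy data: fit a site-stratified model with and without the non-informative sites and compare the estimates. The code below is a hedged illustration using a hand-written Newton-Raphson fit for one binary covariate (Breslow form, no age adjustment); all names and data are hypothetical, and a real analysis would use an established survival package.

```python
import math

def score_and_information(beta, strata):
    """Score and observed information of the stratified partial likelihood."""
    score, info = 0.0, 0.0
    for rows in strata.values():
        for t_i, event_i, x_i in rows:
            if not event_i:
                continue
            # Covariates of patients at risk in the same site at time t_i.
            risk = [x_j for t_j, _, x_j in rows if t_j >= t_i]
            s0 = sum(math.exp(beta * x) for x in risk)
            s1 = sum(x * math.exp(beta * x) for x in risk)
            s2 = sum(x * x * math.exp(beta * x) for x in risk)
            mean = s1 / s0
            score += x_i - mean
            info += s2 / s0 - mean * mean
    return score, info

def fit_log_hazard_ratio(strata, iterations=25):
    """Newton-Raphson estimate of the log hazard ratio, starting from 0."""
    beta = 0.0
    for _ in range(iterations):
        score, info = score_and_information(beta, strata)
        if info <= 0.0:
            raise ValueError("no informative risk sets in the data")
        beta += score / info
    return beta

# Hypothetical toy data: (time, event, x) per site.
full = {
    "site_A": [(5, 1, 0), (8, 0, 1), (12, 1, 1), (20, 0, 0)],
    "site_D": [(3, 1, 0), (7, 1, 1), (10, 0, 1), (15, 1, 0)],
    "site_B": [(28, 0, 0), (28, 0, 1), (28, 0, 1)],  # no deaths
    "site_C": [(9, 1, 0)],                           # single patient, died
}
reduced = {site: full[site] for site in ("site_A", "site_D")}

beta_full = fit_log_hazard_ratio(full)
beta_reduced = fit_log_hazard_ratio(reduced)
print(math.exp(beta_full), math.exp(beta_reduced))
```

Because the sites without deaths and the single-patient site add zero to both the score and the information, the two fits follow the same iterations and return the same hazard ratio.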

Fig. 1

Cox regression analyses performed on the phase III PANAMO study population. p-values, hazard ratios (HR), and confidence intervals for various age-adjusted and stratified Cox regression analyses (Model) of the PANAMO phase III primary outcome, 28-day all-cause mortality

To preserve the original motivation of site-stratified analysis, namely accounting for geographic diversity and population heterogeneity (e.g., confounding by race and health disparities), while addressing the technical challenge caused by local risk sets, a country-level or region-level stratification may be more appropriate. One might also argue that the healthcare system (country) has a greater impact on mortality, as it crucially shapes intensive care treatment modalities (i.e., which drugs are approved and paid for within the healthcare system) as well as unit staffing with qualified personnel and other factors. When the country-stratified or region-stratified Cox model was fitted, as well as the multilevel frailty Cox model with random effects to account for region-specific heterogeneity, the resulting p-values for the treatment effect all suggested positive findings, with the estimated hazard ratios falling in similar ranges (Fig. 1). The same pattern was seen in the pre-specified sensitivity analysis using logistic regression and in a post-hoc simple group comparison via the log-rank test. When these same analyses were applied to the key secondary endpoint, 60-day all-cause mortality, comparable patterns of HRs, confidence intervals, and p-values were observed.
