The use and abuse of survival analysis and Kaplan-Meier curves in surgical trials

The verdict of a clinical trial is based on the primary hypothesis of the trial. The most important component of the trial hypothesis is the primary endpoint or outcome measure. Careful selection of the most appropriate outcome measure is of crucial importance at the time of trial design. In many studies, paradigmatically in oncology, researchers are not solely interested in the occurrence of an irreversible event such as death. They are also interested in the length of survival, for some therapies may have clinical value in delaying the occurrence of an inevitable event. The best way to detect meaningful prognostic factors or therapeutic effects in such circumstances is survival analysis [1], [2].

Survival analysis has become popular in many fields other than oncology, using endpoints other than death (often combined in a composite primary outcome). Many landmark neurovascular trials have used survival analysis as their primary analysis: the EC-IC bypass study [3], the North American Symptomatic Carotid Endarterectomy Trial (NASCET) [4], and the Randomized Trial of Unruptured Brain Arteriovenous Malformations (ARUBA) [5], to mention a few. In neurovascular studies, ‘survival’ then becomes ‘survival free of stroke or death’ or ‘survival until first hemorrhage’. The use of survival analysis with these sorts of outcomes raises special issues that we will discuss.

Survival analyses are popular because of their sensitivity, or power, in detecting prognostic or therapeutic effects: they extract information from all patients, whether or not they have reached the endpoint of interest, and regardless of the variable length of their follow-up. This differs from other analyses, whose power is limited because they rely on the (often small) number of patients who reach the selected endpoint at a fixed follow-up time. In survival analysis, patients with a short follow-up time, those who reach the end of the study period, and those who withdraw from the study without the event occurring still contribute their ‘survival time’ to the data; they are ‘censored’ at the time of their last observation. The use of Cox regression and Kaplan-Meier (KM) survival curves increases the sensitivity of the study to reveal the signal of any prognostic or therapeutic effect on an irreversible or terminal event such as death. But there are problems when reversible and potentially minor events, such as peri-operative stroke or intracranial hemorrhage of any severity, are analyzed using the KM method, for the analysis includes only the time period between recruitment and the first event. The evolution of patients after this first event is ignored. This type of analysis is often incompatible with the rationale of surgical trials, which aim to prevent long-term life-threatening events by accepting an upfront, hopefully infrequent and transient, immediate risk.

The KM survival curve, Cox regression and the log-rank tests that are used to compare groups are based on the same assumptions: censoring should be unrelated to prognosis, the survival probabilities should be the same for subjects recruited early or late in the study, and the events should happen spontaneously at the times reported (in other words, the events are not scheduled).

Kaplan-Meier curves provide an intuitive and visually appealing way to present survival research results, but there are a number of pitfalls associated with the visual interpretation of KM curves. One such pitfall is to interpret the flattening of the right side of the curve as meaning that risk decreases with time, an interpretation that is usually an artefact caused by the small number of patients remaining in the study near the end of the observation period. This is one reason why it is important to report the number of patients at risk above the x or time axis [6], [7].

Visual inspection remains a simple way to detect serious problems when comparing the survival of 2 groups of patients. The crossing of 2 survival curves reveals that a fundamental assumption has not been satisfied, a deviation that can invalidate statistical analyses [8]. This problem has not been sufficiently recognized when neurovascular trials use survival analysis to compare surgical interventions with conservative or medical management [3], [4], [5]. A recent example is the CMOSS bypass trial, which we will examine in detail [9]. In fact, the assumption that the hazard ratio remains constant is, by design, not satisfied in most studies that compare surgery with conservative management, unless surgery by itself cannot cause the primary event of interest, or unless the survival analysis is initiated only after the postoperative period (say after 30 days).

We will conclude with alternative ways to design, analyse and interpret the results of trials that compare preventive surgery with medical management.

The Carotid and Middle Cerebral Artery Occlusion Surgery Study (CMOSS) was a multicenter, randomized, open-label, outcome-assessor-blinded clinical trial conducted in 13 Chinese centers between 2013 and 2020. Eligible patients were 18 to 65 years old, with unilateral ICA or MCA occlusion on digital subtraction angiography (DSA) and a modified Rankin Scale (mRS) score of 0 to 2 points; they had presented with a TIA or a recent non-disabling ischemic stroke within the past 12 months, with hemodynamic compromise in the MCA territory as defined by CT perfusion. Qualifying strokes had to have occurred more than 3 weeks prior, and any neurologic deficits had to have been established for more than 1 month. Exclusion criteria were: greater than 50% stenosis of any other large intracranial vessel, infarction involving more than 50% of the MCA territory, or other neurovascular diseases causing focal ischemia. The study protocol required surgeons to have performed a minimum of 15 consecutive EC-IC bypasses in the preceding year, with an anastomosis patency rate of greater than 95% and a perioperative stroke or death risk of less than 10%. The trial protocol stipulated that surgery had to be performed within 7 days of randomization.

CMOSS centers enrolled 330 patients, of whom 324 eligible patients were randomly allocated 1:1 to EC-IC bypass surgery plus medical therapy (n = 161) or medical therapy alone (n = 163). Follow-up neurology clinic appointments were scheduled at 30 days and at 6, 12, and 24 months.

The primary outcome was a composite of 30-day stroke or death, or ipsilateral ischemic stroke from 30 days to 2 years. The 9 secondary outcomes included surgical complications, the National Institutes of Health Stroke Scale (NIHSS) score and mRS score within 2 years, and the 2-year rates of any stroke, disabling stroke, fatal stroke, death, any stroke or death, TIAs, and anastomosis patency. All outcomes were evaluated by an independent outcome committee blinded to the assigned treatment.

The sample size was calculated based on a hypothesized primary outcome rate of 28% in the medical group and 14% in the surgical group (a 50% reduction). It was estimated that 330 patients (165 per group) were required to detect a 14% absolute difference in primary outcome rates, assuming 80% power with a two-tailed alpha-level of 0.05, and assuming a 20% attrition rate at 2 years follow-up.
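For illustration, the published sample size can be reproduced, approximately, with the standard two-proportion formula. The following sketch is our own reconstruction; the CMOSS statisticians may have used different software or continuity corrections:

```python
from math import ceil, sqrt

# Hypothesized 2-year primary outcome rates (medical vs surgical)
p1, p2 = 0.28, 0.14
z_alpha, z_beta = 1.959964, 0.841621   # two-tailed alpha = 0.05, power = 0.80

p_bar = (p1 + p2) / 2
n = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
      + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / (p1 - p2) ** 2
n_per_group = ceil(ceil(n) / 0.8)      # inflate for a 20% attrition rate
print(ceil(n), n_per_group, 2 * n_per_group)   # 132 evaluable, 165 per group, 330 total
```

The result, about 132 evaluable patients per group, inflated to 165 to allow for 20% attrition, matches the 330 patients reported by the trial.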

Analysis was by intention-to-treat (ITT) and per-protocol (PP). Primary and secondary outcomes were computed as time-to-event analyses using Cox proportional-hazard models. Nine patients in the surgical group and 6 in the medical group were censored (lost to follow-up). Patients were followed until the outcome event or for 2 years.

Baseline demographic and clinical characteristics were similar in both groups. The primary outcome was observed in 13 of 151 patients in the surgical group (8.6%) and 19 of 155 patients in the medical group (12.3%), with a hazard ratio of 0.71 (95% CI: 0.33–1.54, p = 0.39). Specifically, stroke or death within 30 days occurred in 10 of 161 patients in the surgical group (6.2%) and 3 of 163 patients in the medical group (1.8%), while ischemic strokes in the affected artery beyond 30 days through 2 years were seen in 3 of 151 surgical group patients (2.0%) and 16 of 155 medical group patients (10.3%).

In the analysis section of the article, the authors state: ‘We estimated hazard ratios (HRs) with Cox proportional hazards models for time-to-event analyses of the primary outcome and secondary outcomes. However, we also performed post hoc relative risk (RR) analyses for both primary and secondary outcomes as the underlying assumption of proportional hazards was not met with supremum test (P < .001).’
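As an aside, the crude (unadjusted) relative risk implied by the published counts can be reconstructed in a few lines; this is our own computation, and the post hoc RR analyses reported in the paper may differ slightly:

```python
from math import exp, log, sqrt

# Published primary outcome counts (see above)
a, n1 = 13, 151    # surgical group: events, patients analyzed
b, n2 = 19, 155    # medical group: events, patients analyzed

rr = (a / n1) / (b / n2)
se = sqrt(1/a - 1/n1 + 1/b - 1/n2)     # standard error of log(RR), Katz method
lo, hi = exp(log(rr) - 1.96 * se), exp(log(rr) + 1.96 * se)
print(f"RR = {rr:.2f} (95% CI {lo:.2f} to {hi:.2f})")   # RR = 0.70 (0.36 to 1.37)
```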

Nonetheless, the authors concluded: ‘Among patients with symptomatic ICA or MCA occlusion and hemodynamic insufficiency, the addition of bypass surgery to medical therapy did not significantly change the risk of the composite outcome of stroke or death within 30 days or ipsilateral ischemic stroke beyond 30 days through 2 years’.

We disagreed with the analysis and interpretation of the CMOSS trial in a letter to the editor: ‘Given the nature of the trial designed to test whether an immediate surgical risk to prevent future strokes would benefit patients with intracranial arterial occlusion, the use of a Cox proportional hazards model for time-to-event analyses was certain, a priori, to violate the proportional hazard assumption, as authors themselves realized after the fact’ [10].

In our view, the meaning of the trial would have been better conveyed by: ‘EC-IC bypass carries a 30-day risk of stroke or death of 6.2%, but subsequently confers superior protection against ischemic stroke than medical management alone’ (as shown by the survival curves in e-Figures 9 and 10 of their supplemental material [9]) [10]. Because the supplemental curves that showed the benefits of surgery after the perioperative period were (properly) exploratory analyses, we fear that the signal that surgery may have long-term benefit may have been lost. We now try to explain why alternative ways to analyse and interpret results are needed for trials that compare preventive surgery with medical management.

Fig. 1 demonstrates how follow-up data from multiple patients are transformed to create a KM survival curve illustrating the probability of survival [7].

The KM method was described in 1958 to address statistical problems related to time: When the outcome of a study is the time to a certain event, these times are unlikely to follow a normal distribution. In addition, we cannot expect the endpoint of interest to occur in all patients, and we cannot wait until all patients have died. Some patients may have left the study early; they are lost to follow-up. Thus, the only information we have about some patients is that they were still alive at the last follow-up. These are called ‘censored’ observations, and they contribute to the information summarized in the statistics.

We must also distinguish various notions of time to understand survival analysis. Although patients are recruited at various calendar times, in a KM curve all patients are ‘reset’ at the end of the study, as if the clock started at time 0 for all of them. The patients are then ordered, from the shortest to the longest, according to their ‘serial time’, defined as the length of time between recruitment and the patient’s exit from the study. ‘Exit’ occurs either when the patient has the event of interest (illustrated by a step down in the curve) or when the patient is ‘censored’ (lost to follow-up, dead of another cause, or still event-free when the study ends; illustrated by a vertical tick on the horizontal line of the curve). The trial clock and trial time intervals differ from standard time, for the scale differs: the trial clock ticks only when an event occurs. The number of patients still at risk of having the event is displayed at regular intervals below the x-axis, which follows the days, months or years of the calendar.

Finally, each subject is characterized by three variables: 1) their ‘serial time’, 2) their status at the end of serial time (the event occurred or the patient is censored), and 3) the study group they are in. In other words, the patient is out of the study after the event occurs or after censoring, no matter what happens afterwards.
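To make these mechanics concrete, here is a minimal sketch of the product-limit computation from the variables just described (illustrative only, with made-up data; real analyses should use a validated package such as R’s survival or Python’s lifelines):

```python
def kaplan_meier(times, events):
    """Product-limit estimate of S(t).

    times  : serial time for each patient (e.g., months)
    events : 1 if the event occurred at that time, 0 if censored
    Returns a list of (event time, number at risk, events, S(t)).
    """
    s, table = 1.0, []
    for t in sorted({x for x, e in zip(times, events) if e == 1}):
        n_at_risk = sum(1 for x in times if x >= t)   # the clock ticks only at event times
        d = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        s *= 1 - d / n_at_risk                        # one step down in the curve
        table.append((t, n_at_risk, d, round(s, 3)))
    return table

# Made-up serial times for 8 patients; 0 = censored (lost, or event-free at study end)
times  = [2, 5, 5, 9, 12, 15, 20, 24]
events = [1, 1, 0, 1,  0,  1,  0,  0]
for row in kaplan_meier(times, events):
    print(row)   # (2, 8, 1, 0.875) (5, 7, 1, 0.75) (9, 5, 1, 0.6) (15, 3, 1, 0.4)
```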

Now we can discuss why these analyses make sense in an oncology context but can be seriously misleading in the context of preventive surgical trials. In oncology studies where death is the primary outcome measure, the problem is immaterial, for nothing can happen to dead patients thereafter. (There is a problem with ‘related death’ as opposed to ‘all-cause death’, a difficult-to-define judgment whose reliability is rarely tested in practice, but this problem is beyond the scope of the present article.) The main point is that death is a terminal endpoint. In oncology trials where death is the primary outcome, survival analyses make all the sense in the world if our aim is to improve survival.

This is not the case for most outcome measures of neurovascular trials that compare preventive surgery and conservative management, where analytic methods that discard any information about the outcome of patients after the selected primary ‘event’ has occurred are inappropriate.

Keeping CMOSS in mind, let us imagine a trial that compares surgery and medical management in the prevention of stroke or death. Let us also imagine that medical management is ineffective, with strokes (often major) occurring at various times, while surgery is associated with a high risk of a minor and transient peri-operative stroke followed by full recovery, and, furthermore, that surgery is close to 100% effective in preventing future strokes. The results of such a trial are illustrated in Fig. 2A.

While imaginary, this example is logically akin to what happened in CMOSS: A surgical treatment which puts the patient at risk immediately is used to prevent future events, but the hypothesis is that overall patients will benefit in the long term. To do this, we need an outcome measure that reports the status of the patient at a time that captures the consequences of what happened during follow-up in both groups being compared. Fig. 2A shows that our hypothesis has been confirmed: At the end of the trial surgical patients (last 4 lines) had a better outcome (shown here in terms of final mRS) than non-surgical patients (first 4 lines).

But Fig. 2B illustrates how survival analysis would treat the exact same trial, using time to any stroke or death as the primary endpoint. Because whatever happens after the endpoint is ignored, medical patients have better outcomes (longer stroke-free survival) than surgical patients, and the verdict of the trial is reversed.
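A toy computation, patterned on Fig. 2A and B with entirely invented numbers, makes the reversal explicit: the medical group ‘wins’ on time to first event, while the surgical group wins on final functional status:

```python
# Each patient: (group, months to first stroke or censoring, event occurred?, final mRS)
patients = [
    # medical: untreated strokes occur throughout follow-up and leave deficits
    ("medical", 6, 1, 4), ("medical", 10, 1, 5), ("medical", 14, 1, 4), ("medical", 20, 1, 3),
    # surgical: minor perioperative stroke in week 1, full recovery, no later strokes
    ("surgical", 0.25, 1, 1), ("surgical", 0.25, 1, 0), ("surgical", 24, 0, 0), ("surgical", 24, 0, 1),
]

for group in ("medical", "surgical"):
    rows = [p for p in patients if p[0] == group]
    event_times = [t for _, t, e, _ in rows if e == 1]
    poor = sum(1 for *_, mrs in rows if mrs > 2)
    print(f"{group}: first events at {event_times} months; "
          f"{poor}/{len(rows)} patients with mRS > 2 at 2 years")

# medical : events late (6 to 20 months) -> 'better' stroke-free survival, 4/4 poor outcomes
# surgical: events at about 1 week       -> 'worse' stroke-free survival,  0/4 poor outcomes
```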

A fundamental reason why survival analysis is not well adapted to surgical trials is that, by definition, the event that defines survival time has to occur spontaneously, independently of any human intervention. For example, analyzing ‘survival free of aneurysmal recurrence’ after coiling is inappropriate, for recurrences can only be found when an angiogram is performed. A similar problem can occur if the definition of the event requires a scheduled follow-up visit with the treating physician. In CMOSS, the clustering of endpoints around the scheduled follow-up times (6 months, or 1 year in their Fig. 2) [9] suggests that follow-up strokes were not independent events, but were in some way related to the timing of the follow-up visit (perhaps because ‘strokes’ were confirmed by imaging?). Trials that question whether the upfront risk of preventive surgery is worth taking would inappropriately compare dependent, scheduled perioperative events in the surgical group with independent, spontaneous events occurring at any time in the medical group.

It is important, when comparing 2 survival curves, to perform statistical analyses of the difference between the two groups and not to rely on visual inspection only, but these statistical analyses are only valid in the presence of fundamental assumptions that are not always clearly tested or reported [11]. To compare the survival of 2 groups, the log-rank test and Cox proportional hazards models are commonly used. These comparative statistics sum results recorded at time ‘slices’ defined by the occurrence of the event of interest in observed patients; in other words, for these statistical analyses, time is regimented by a clock that ticks when the event of interest occurs. The statistics use the summation of the probability of survival for each interval defined in this fashion. Tests that verify whether observed differences between 2 groups can be explained by chance alone (in other words, that they belong to the same distribution) assume a fixed or constant hazard ratio: that the risk of an event in one group relative to the other does not change with time. The crossing of survival curves, as seen in the CMOSS trial, is a clear demonstration that this fundamental assumption has not been met. As Bland and Altman noted 20 years ago: ‘The log rank test is most likely to detect a difference between groups when the risk of an event is consistently greater for one group than another. It is unlikely to detect a difference when survival curves cross, as can happen when comparing a medical with a surgical intervention’ [8].
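The mechanics can be sketched in a few lines: at each distinct event time, the events observed in one group are compared with the number expected if both groups shared the same hazard, and the standardized sum of those differences gives the statistic. A minimal, illustrative version is shown below (not a substitute for validated software); note how early differences in one direction and late differences in the other cancel in the running sum, which is precisely why crossing curves defeat the test:

```python
from math import erfc, sqrt

def logrank(times1, events1, times2, events2):
    """Two-sample log-rank test; returns (chi-square statistic, two-sided p)."""
    pooled = sorted({t for t, e in zip(times1 + times2, events1 + events2) if e == 1})
    o_minus_e, var = 0.0, 0.0
    for t in pooled:                        # the clock ticks at each event time
        n1 = sum(1 for x in times1 if x >= t)
        n2 = sum(1 for x in times2 if x >= t)
        d1 = sum(1 for x, e in zip(times1, events1) if x == t and e == 1)
        d2 = sum(1 for x, e in zip(times2, events2) if x == t and e == 1)
        n, d = n1 + n2, d1 + d2
        o_minus_e += d1 - d * n1 / n        # observed minus expected events in group 1
        if n > 1:
            var += d * (n1 / n) * (n2 / n) * (n - d) / (n - 1)
    chi2 = o_minus_e ** 2 / var
    p = erfc(abs(o_minus_e) / sqrt(2 * var))   # two-sided, normal approximation
    return chi2, p
```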

Two previous trials failed to show the benefits of bypass surgery in the prevention of stroke [12], [13]. It is worth noting that in both studies, surgery was associated with higher operative morbidity (15% and 12%) compared to CMOSS (6%).

The Carotid Occlusion Surgery Study (COSS) was prematurely interrupted for futility and was thus smaller than other bypass trials (195 patients) [14]. Survival analysis was used for the comparison of the primary outcome, a composite of i) all stroke and death from randomization through 30 days after surgery or randomization and ii) ipsilateral ischemic stroke within 2 years of randomization; time-to-event curves are provided in their Figure 3 [14]. COSS was plagued by the same problem of crossing curves as the CMOSS study. After the 30-day postoperative or post-randomization period, there were 6 versus 20 events, suggesting that surgery in COSS was mechanistically effective in the prevention of stroke, just as in the CMOSS study.

The much larger (1377 patients) EC-IC bypass trial published in 1985 did use survival analysis of the primary outcome to conclude that surgery provided no benefit [3]. While the KM curve reporting the primary analysis did not visually reveal the problem we are concerned with, the results section transparently stated that the fundamental assumption of a constant hazard ratio was violated: ‘The Mantel-Haenszel chi-square analysis generated a point estimate of the average effect of surgery over the entire trial of a 14% increase in the relative risk of fatal or non-fatal stroke.’ Their Figure 1 [3] shows that this risk was much more than 14% early in the trial and much less toward the end of the trial. Multiple secondary analyses were performed, and many showed the crossing of KM curves that invalidates study statistics. Interestingly, in their effort to answer anticipated questions from the surgical community, the EC-IC bypass study group provided perhaps more convincing evidence supporting the claim that surgery provided no benefit: the fact that the final functional outcome in all patients (shown in their Table 2) and the percentage of total follow-up time spent at each functional status level (their Table 3) were identical for the surgical and medical groups supports the conclusion that surgery was not beneficial [3]. This addresses the concern we raised with the hypothetical trial shown in our Fig. 2A and B, where the benefit of surgery, hidden by the survival analysis, could be shown using a functional outcome measure at a fixed follow-up time. It also suggests a future direction: exploring ways to better analyze preventive surgical trials.

Survival analysis has been used in questionable fashion in many other neurovascular trials.

The ARUBA trial showed better outcomes for medical as compared with interventional management of unruptured brain AVMs, but the primary outcome was studied using the Kaplan-Meier method and measured using Cox proportional hazards regression models. Although the authors reported ‘no departure from the proportionality assumption’ when it was tested post hoc, such testing may not have had the power to identify what was obvious a priori [15]. Analyses published in The Lancet were performed while 73 of 114 patients randomized to intervention were still within the treatment period (‘at the time of analysis, 53 patients randomized to interventional therapy had ongoing treatment plans, whereas 20 had not yet initiated therapy’) [5]. Thus, the majority of events in the intervention group were peri-operative; they occurred at the time of ‘scheduled treatments’ (most commonly embolizations). The fundamental assumption of a survival analysis (the independence of events) and the fundamental assumption of a comparison between patients (a constant hazard ratio between the 2 groups) were clearly not satisfied. The research question of ARUBA was logically similar to our hypothetical trial illustrated in our Fig. 2A and B.

The NASCET study is another large surgical trial (2226 patients) that used KM survival curves and compared surgical and medical patients by the Mantel–Haenszel chi-square test. Yet the survival curves clearly cross very early in the trial in violation of the proportional hazard assumption. The follow-up period was up to 8 years (mean 5 years) and the 5-year failure rate was relatively high (22%) in the control group. In this case the overall benefits of surgery could still be shown because the higher initial risk of surgery was swamped by the sheer number of events during follow-up.

The problem is not limited to neurovascular trials; it has also been noted in orthopedic surgery [16], and it can occur in oncology trials, especially when an effective treatment is initially very risky [17]. The problems with survival analysis that we have reviewed are not new [11]. There are methods to test whether the proportional hazard assumption has been violated, but those tests have limited power (as with ARUBA). They remain infrequently used or poorly reported [18]. Methods to address the problem of comparing treatments with crossing survival curves have recently been described, but they are beyond the scope of this article [19], [20], [21].
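For completeness, this is how such a proportional-hazards check might be run in practice with the Python lifelines package; the data frame and column names below are our own toy construction, deliberately mimicking early surgical events and late medical events:

```python
import pandas as pd
from lifelines import CoxPHFitter
from lifelines.statistics import proportional_hazard_test

df = pd.DataFrame({
    "months":  [0.5, 0.7, 1.0, 24, 24, 24, 6, 10, 14, 18, 21, 24],
    "event":   [1,   1,   1,   0,  0,  0,  1,  1,  1,  1,  1,  0],
    "surgery": [1,   1,   1,   1,  1,  1,  0,  0,  0,  0,  0,  0],
})

cph = CoxPHFitter().fit(df, duration_col="months", event_col="event")
result = proportional_hazard_test(cph, df, time_transform="rank")
result.print_summary()   # a small p-value flags a violated proportionality assumption
```

With only 12 toy observations the test has little power, which is exactly the limitation noted above: a non-significant test does not certify that the assumption holds.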

Some deviation from the fundamental assumptions may do little harm when the surgical risk is low and follow-up events are numerous enough to show the effect of surgery, as with NASCET. But this was certainly not the case in the smaller COSS and CMOSS studies, in which total events were few (40 and 32, respectively) and follow-up relatively short (2 years). A simple solution would be to compare the number of patients who have reached a given functional status (say mRS > 2) at a fixed time point, leaving sufficient time for preventive surgery to show its potential benefits (say 5 or 10 years). This choice was made in some ongoing neurovascular trials on unruptured aneurysms or brain AVMs [22], [23]. The approach is admittedly less powerful than survival analysis, and it does not take into account time, or the difference between an immediate poor outcome (say with surgery) and one happening in a delayed fashion (say aneurysm rupture). Consequently, survival analysis may seem to provide the fairest assessment possible, even in the presence of violations of testing assumptions. However, patients can recover from surgical complications and then be protected from rupture for decades (as in our hypothetical trial). The idea of using tests for which fundamental assumptions are not satisfied to make final conclusions on surgical trials of such importance is hard to swallow.
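Analytically, the fixed-time-point comparison reduces to a 2 × 2 table. A sketch with invented 5-year counts (any exact or chi-square test would do):

```python
from scipy.stats import fisher_exact

# Hypothetical counts at a fixed 5-year visit: [mRS > 2, mRS <= 2]
table = [[12, 138],    # surgical group
         [24, 126]]    # medical group
odds_ratio, p_value = fisher_exact(table)
print(f"OR = {odds_ratio:.2f}, p = {p_value:.3f}")
```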

Because of the nature of trials that test the value of surgery in the long-term prevention of neurological events, the fundamental problem remains that we may end up comparing an immediate risk at a single scheduled time point in one group with a rate of spontaneous events over time in the other group. Survival analysis and log-rank tests were not designed for such comparisons.

One solution may be to start the survival analysis after the 30-day peri-operative period. This method avoids the crossing of KM curves, is consistent with the proportional hazards assumption, and allows the estimation of the efficacy of treatment by comparing event-free survival. Such post-operative survival curves, provided in the supplemental material of the CMOSS article, would suggest that bypass was effective in preventing strokes. This approach has also been used by the authors of NASCET and of ACST in secondary analyses [24], [25], [26]. This is an honest resolution so long as the initial morbidity of the surgical procedure is transparently reported. When trials are small, the surgical morbidity may be difficult to evaluate with precision. But the separation between the perioperative period and the follow-up period is not unusual. After all, many trials separately examine a primary safety outcome (given here by the perioperative or 30-day data) and a primary efficacy outcome (provided by the survival analysis of the follow-up period). There remains the problem of balancing immediate risks against long-term benefits. In this regard, examination of secondary endpoints (such as the functional status of the patients in the EC-IC bypass trial) may help determine the final judgment.
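Operationally, such a ‘landmark’ analysis simply restricts the risk set to patients who are still event-free at 30 days and restarts the clock there. A sketch, reusing the kaplan_meier helper defined earlier (our construction, not the CMOSS code):

```python
LANDMARK = 1.0   # months; the 30-day perioperative period

def landmark_km(times, events, landmark=LANDMARK):
    """KM estimate restricted to patients still event-free at the landmark time."""
    kept = [(t - landmark, e) for t, e in zip(times, events) if t > landmark]
    if not kept:
        return []
    shifted_times, kept_events = zip(*kept)
    return kaplan_meier(list(shifted_times), list(kept_events))
```

Note that the condition `t > landmark` removes both perioperative events and early censorings from the analysis, which is why the perioperative morbidity must be reported separately, as discussed above.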
