Multicenter external validation of prediction models for clinical outcomes after spinal fusion for lumbar degenerative disease

To address the significant variability in postoperative outcome after lumbar fusion surgery, which results from a wide range of patient characteristics [5], CPMs have been developed to assist in the decision-making process [23]. Khor et al.'s models demonstrated good calibration and discriminative performance in their internal validation cohort [6], with comparable values in a small single-center external validation cohort [9]. Here, we performed a rigorous, multicenter external validation of Khor et al.'s models (coined the SCOAP-CERTAIN tool) for predicting achievement of the MCIC in three different clinical outcomes at 12 months after lumbar fusion for degenerative disease. With data from the FUSE-ML study, we assessed the generalization of these CPMs and found that, while the models generalize moderately well in terms of discrimination (binary prediction), their calibration (continuous risk assessment) lacks robustness, although the cohorts appear comparable.

It is notoriously difficult to predict treatment response for patients undergoing lumbar spinal fusion for degenerative disease. While some pathologies such as isthmic spondylolisthesis represent a relatively clear indication for fusion, for others, such as low-grade degenerative spondylolisthesis with stenosis, the benefit of adding fusion is less clear [24, 25]. The most extreme example is certainly chronic low back pain with concomitant discopathy [26]. While some individual patients with this pathology do benefit from fusion, an unselected population does not: Randomized studies consistently indicate that, on the whole, fusion surgery does not yield significantly superior outcomes compared to conservative treatment for chronic low back pain [27]. Although surgery may not exhibit a clear advantage over conservative approaches in unselected patients with chronic low back pain, specific subsets of patients can genuinely experience benefits [28]. The critical factor for success in degenerative spine surgery lies in meticulous patient selection.

In the past, different methods were established to help select the best treatment option for the individual patient. From discography to pantaloon casting or radiological modifiers such as Modic-type endplate changes, many potential predictors of surgical success were evaluated, but often with very limited predictive ability [26, 28]. Initially, mostly radiological or physician-rated outcomes were assessed, but over time, patient-reported outcome measures (PROMs) such as the ODI [29] were implemented and validated, aiming to quantify and weigh symptoms and thereby justify the risks and benefits of a potential surgery [30]. This opened up the possibility of truly personalized medicine: The aim of medical decision-making is now to consider every aspect of a patient's physical and mental characteristics in order to choose the treatment that best fulfills a wide range of demands, such as symptom relief for the patient, healing or preventing progression of disease, and containing healthcare costs by avoiding unnecessary diagnostics, treatments, and complications [31, 32]. Another delicate aspect complicating medical decision-making is the wide range of symptoms that can be present in patients with degenerative lumbar spine disease, e.g. facet-mediated pain, discogenic pain or myofascial pain [33], among others. Ideally, we could pinpoint specific symptoms or patient characteristics for which lumbar fusion is known to provide relief. With more information regarding the patient, e.g. comorbidities, to weigh the general risks of surgery against the expected benefit, risk–benefit counseling in the clinic could be improved [34].

Thus, the aim of CPMs in the surgical field is to identify which patients will benefit from a certain intervention and which will not. Khor et al. [6] have published an internally validated CPM tool (SCOAP-CERTAIN) that aims to assist surgical decision-making by providing predictive analytics on which patients scheduled for lumbar spinal fusion for degenerative disease are most likely to show significant 12-month improvement in functional outcome and pain severity. Rigorous multicenter/multicultural external validation is a crucial step before clinical implementation of CPMs [7, 8, 35]. To assess generalization of a CPM, calibration and discrimination need to be quantified [36]. Discrimination refers to a model's capacity to correctly categorize patients in a binary way, namely into those achieving the MCIC and those without a clinically relevant improvement. Calibration, on the other hand, refers to the model's capability to generate predicted probabilities (between 0 and 1) that closely align with the true posterior (observed frequency). The SCOAP-CERTAIN tool had previously been evaluated in a small single-center external validation study of Dutch patients, demonstrating adequate discrimination but only fair calibration [9]. In a previous study of the FUSE-ML study group, a second, simpler CPM for the same outcomes was developed, with the goal of achieving similar predictive power with a smaller number of input variables [10]. This goal was broadly achieved, and within that study, a small external validation (in three centers with a total of 298 patients) of the SCOAP-CERTAIN tool was carried out to compare the performance of both CPMs, again showing relatively robust discrimination but only fair calibration of both models [10].
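To illustrate how these two properties are typically quantified at external validation, the following minimal Python sketch computes the AUC for discrimination and the calibration slope and intercept via logistic recalibration of the predicted log-odds. The data and variable names are purely illustrative assumptions, not the FUSE-ML analysis code.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

# Illustrative data: observed binary outcome (1 = MCIC reached) and
# the model's predicted probability for each patient
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1, 0, 0])
y_prob = np.array([0.8, 0.3, 0.6, 0.7, 0.4, 0.9, 0.2, 0.5, 0.6, 0.1])

# Discrimination: area under the ROC curve
auc = roc_auc_score(y_true, y_prob)

# Calibration slope: logistic regression of the outcome on the
# predicted log-odds; a perfectly calibrated model yields slope 1
log_odds = np.log(y_prob / (1 - y_prob))
fit = sm.Logit(y_true, sm.add_constant(log_odds)).fit(disp=0)
slope = fit.params[1]

# Calibration intercept (calibration-in-the-large): refit with the slope
# fixed at 1 by entering the log-odds as an offset; perfect calibration
# yields an intercept of 0
fit_itl = sm.GLM(y_true, np.ones(len(y_true)),
                 family=sm.families.Binomial(), offset=log_odds).fit()
intercept = fit_itl.params[0]

print(f"AUC={auc:.2f}, slope={slope:.2f}, intercept={intercept:.2f}")
```

A slope below 1 indicates predictions that are too extreme (a typical sign of overfitting at external validation), while a nonzero intercept indicates systematic over- or underestimation of risk.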

Although CPMs in degenerative spine surgery could in theory be highly beneficial if integrated into the clinical context, rigorous external validation is necessary first to ensure that models are not "let loose too early" [8, 34]. It is especially necessary to test models not only in one or two small cohorts, but in a wide range of different patient populations from multiple countries and continents. If performance then proves robust, it can be safely assumed that the CPM will achieve the expected predictive performance in real-world patients, and the model can be safely rolled out. In the present study, we performed such an extensive external validation. With AUCs between 0.70 and 0.72 for ODI, NRS-BP and NRS-LP, we were able to show good discrimination metrics, comparable to those reported in Khor et al.'s initial internal validation study (0.66–0.79) [6]. Yet calibration, evaluated through diverse metrics, again demonstrated only moderate performance, as in the previous small external validation studies. At internal validation, Khor et al. had documented calibration intercepts ranging from −0.02 to 0.16, along with slopes spanning 0.80–1.05, whereas we observed a wider range of 0.04 to 1.01 for intercepts and less well-calibrated values of 0.72–0.87 for slopes, even though the outcome distribution was similar to that of the development cohort (calibration intercepts are known to be highly dependent on differences in outcome distribution) [37].

In summary, there was substantial heterogeneity in the observed calibration slopes, along with a higher ECI, a measure of overall calibration defined as the average squared difference between the predicted probabilities and their grouped estimated observed probabilities [18], and clearly worse goodness-of-fit as tested by the method of Hosmer and Lemeshow (HL) [21]. The HL method divides the sample into groups (usually 10) according to predicted probability, compares observed to expected event counts per group using a chi-square distribution, and a p-value > 0.2 is usually regarded as an indication of fair calibration/goodness-of-fit [18, 21]. Of course, as is the goal of external validation, our cohort represents a much more heterogeneous population than the development cohort, now including European and Asian individuals, which explains some of the lack of generalization in terms of calibration. In the realm of CPMs, calibration arguably carries a more significant role than discrimination alone [37]. This is because clinicians and patients are typically more concerned with the predicted probability of a specific endpoint than with a binary classification; individual patients, after all, are not binary, but carry a spectrum of expected risks and benefits [7]. Hence, insufficient calibration poses a significant obstacle to the clinical and external applicability of prediction models.

Another potential explanation for the poor generalization in terms of calibration lies in differing definitions of input variables: Although our data collection adhered strictly to the definitions provided by Khor et al. [6], institutional protocols and inter-rater assessment still vary. This is one of the general limitations of CPMs based on tabulated medical data: Because data must first undergo multiple stages of summarization and simplification by human healthcare providers, the overall predictive power can quickly reach "ceiling effects" due to input heterogeneity. This is another reason why external validation is so crucial: to test whether CPMs work just as well when applied in a real-world environment (effectiveness vs. efficacy). In the future, direct inclusion of source data (such as MRI) without human coding, or automated data collection through natural language processing, might somewhat alleviate this bottleneck [38].
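To make the two calibration summaries discussed above concrete, the following minimal Python sketch implements the HL test and a grouped ECI following the verbal definitions given in the text. The decile grouping and function names are illustrative assumptions; published ECI implementations often use a smoothed calibration curve instead of fixed risk groups.

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y_true, y_prob, n_groups=10):
    """Hosmer-Lemeshow goodness-of-fit test over groups of predicted risk."""
    order = np.argsort(y_prob)
    chi_sq = 0.0
    for g in np.array_split(order, n_groups):
        n, observed, expected = len(g), y_true[g].sum(), y_prob[g].sum()
        chi_sq += (observed - expected) ** 2 / expected                    # events
        chi_sq += ((n - observed) - (n - expected)) ** 2 / (n - expected)  # non-events
    # df = n_groups - 2 is the classical choice at model development;
    # some authors use df = n_groups at external validation
    return chi_sq, chi2.sf(chi_sq, df=n_groups - 2)

def grouped_eci(y_true, y_prob, n_groups=10):
    """ECI as described in the text: average squared difference between each
    predicted probability and the observed event rate of its risk group."""
    order = np.argsort(y_prob)
    sq_diff = np.empty(len(y_prob))
    for g in np.array_split(order, n_groups):
        sq_diff[g] = (y_prob[g] - y_true[g].mean()) ** 2
    return sq_diff.mean()
```

Per the convention cited above, an HL p-value above 0.2 would indicate fair calibration, while smaller ECI values indicate predicted probabilities closer to the grouped observed frequencies.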

Still, even if not perfectly calibrated in a rigorous external validation study, the models published by Khor et al. [6] are admirable and show good generalization overall, especially in terms of discrimination performance; no signs of overfitting can be observed here. Overfitting manifests as a relevant difference between training and testing performance in terms of discrimination [35], and it is common for out-of-sample performance to be comparable to or slightly worse than training performance for a well-fitted model. The discrimination performance observed in our external validation study fits this norm well. It can be concluded that the SCOAP-CERTAIN model can safely be applied in clinical practice, although it must be kept in mind that predicted probabilities (calibration) should only be used as rough estimates, and that binary predictions, while generalizing well (discrimination), are still no more accurate than an AUC of around 0.70 implies.

In the end, in the realm of degenerative spine surgery, well-validated CPMs such as the SCOAP-CERTAIN [6] or FUSE-ML [10] models should be used cautiously, as rough estimates offering an objective "second opinion" in the risk–benefit counseling of patients, but never as absolute red or green lights for surgical indications. We suggest that future models should also be capable of predicting longer-term prognosis, which could be achieved by incorporating more extended follow-up data and would reduce the influence of short-term variability. This would lead to a more comprehensive understanding of patient trajectories, which is essential for effective clinical decision-making and improved calibration.

Additionally, it is crucial that future studies, as already noted for the preliminary external validations, report key metrics such as sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) of the models assessed. Reporting these metrics enables better differentiation and validation of the predicted values, thereby enhancing the reliability and applicability of clinical prediction models in practice.
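For reference, these four metrics follow directly from the 2×2 confusion matrix at a chosen classification threshold, where TP, FP, TN and FN denote true/false positives and negatives:

```latex
\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad
\mathrm{Specificity} = \frac{TN}{TN + FP}, \qquad
\mathrm{PPV} = \frac{TP}{TP + FP}, \qquad
\mathrm{NPV} = \frac{TN}{TN + FN}
```

Unlike sensitivity and specificity, PPV and NPV depend on the outcome prevalence of the cohort, which is one more reason reporting all four metrics, together with the threshold used, facilitates comparison across validation cohorts.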

Limitations

Regarding the primary surgical indication, our cohort mostly comprised lumbar spinal stenosis, spondylolisthesis and discogenic low back pain, whereas in Khor et al.'s cohort radiculopathy was the leading diagnosis, followed by stenosis and spondylolisthesis [6]. Of course, surgical indication and especially the chosen technique may vary between centers, which is exactly why multicenter external validation is important. Compared to the development cohort, we also included lateral techniques, which broadens the range of included patients. We used a mixed cohort (FUSE-ML) of partially prospectively and partially retrospectively collected data. It is known that these two collection strategies differ relevantly in the data they yield – especially regarding complications, which fortunately are not a topic here – as well as in missingness, and this could therefore affect the final analysis [39]. On the other hand, the fact that the models still generalized relatively well on these heterogeneous data is exactly the point of external validation and further underlines the robustness of the Khor et al. [6] models. Due to the lack of long-term (> 2 years follow-up) data, even with good calibration and discrimination performance, we are only able to predict short- and mid-term outcomes. More long-term data evaluation regarding CPMs is necessary. The validated models also do not predict surgical risks such as perioperative complications or long-term adjacent segment degeneration, information which would be particularly useful in risk–benefit discussions. The fact that the FUSE-ML and SCOAP-CERTAIN models are also unable to provide a prognosis of natural history or conservative treatment in these degenerative conditions means that they only provide half of the answer when making decisions on surgical versus conservative treatment strategies.
