Machine learning survival models trained on clinical data to identify high risk patients with hormone responsive HER2 negative breast cancer

Enrolled patients and features

Our dataset is composed by clinical and cytohistological outcomes of 145 patients our extracted from our database of approximately 900 patients registered for a first BC diagnosis in the period 1997–2019 and referred to Istituto Tumori “Giovanni Paolo II” in Bari (Italy). The inclusion criteria for collecting such database were: absence of primary chemotherapy for BC, ab initio non-metastatic patient. Then, according to the genomic test eligibility criteria defined by the decree of the Ministry of Health of May 202123, i.e. early stage tumor, patients not at high or low risk of recurrence with hormone responsive and HER2 negative BC, 145 patients wase extracted. In line with the aim of our work, we considered the only patients who did not undergo that chemotherapy. In fact, for patients who have undergone chemotherapy, the absence of a second event could be due to a patient-specific positive prognostic profile and not necessarily to an effect of the therapeutic treatment. In other words, there may be patients who would not have relapsed even if they had not undergone additional chemotherapy.

In this work, we specifically focus our attention on breast cancer-related invasive disease events (IDEs), which include local recurrence, the appearance of distant visceral and soft tissue metastases, contralateral invasive breast cancer or a second primary tumor24.

Collected features were the age at diagnosis, tumor size (diameter: T1a, T1b, T1c, T2, T3, T4), histological subtype (ductal, lobular, other), type of surgery (quadrantectomy/mastectomy), estrogen receptor expression (ER, %), progesterone receptor expression (PgR, %), cellular marker for proliferation (Ki67, %), histological grade (grading, Elston–Ellis scale: G1, G2, G3), human epidermal growth factor receptor-2 score (HER2/neu: 0+, 1+, 2+), the number of metastatic and eradicated lymph nodes, lymph nodes dissection (no/yes), sentinel lymph node (no, negative, positive), lymph nodes stage (N: 0, 1, 2, 3), in situ component (absent, G1, G2, G3, present but not typed), lymphovascular invasion (absent, focal, extensive, present but not typed), multiplicity (no/yes) and previous tumors (no/yes). Moreover, 97.41% of the collected patients did not undergo radiotherapy, we therefore did not consider it significant to include this information in the analyses.

The data set is described in Table 1. The set of predictive features is composed by \(N=18\) prognostic factors, typically considered by clinicians during the first tumor diagnosis and related surgery. The missing data recovery has been implemented by means of the Python package musingly (v. 0.2.0).

Table 1 Observed patients’ statistics according to considered features.

A separate dataset, composed by 27 patients endowed with EndoPredict® (EP) scores and undergoing surgery during 2021 in our institute, is exploited for further evaluations of our survival estimation. EP is a gene expression test for patients with ER-positive and HER2-negative early-stage BC, both node-negative and node-positive (N0, N1, micrometastasis). It is a second-generation test that combines a molecular score of 12 genes with tumor size and lymph node status5. This genomic test has entered the clinical practice of the Istituto Tumori 'Giovanni Paolo II' in Bari since 2021 and it is used by the Breast Care team when the clinical case is highly doubtful.

Time dependent classification

Random survival forest feature importance is resumed in Fig. 1. Their calculation is nested in the 20 rounds of fivefold cross-validations to avoid any influence imposed by a single evaluation with a fixed training set. To take into account the included statistical variation, each feature weight is described by its average and standard deviation. We select those features characterized by a weight greater than 0.01, thus yielding: Ki67, PgR, age, ER, eradicated lymph nodes and diameter.

Figure 1figure 1

Random survival forest features importance averaged over 20 rounds of fivefold cross-validation. Error bars correspond to standard deviation and the red dashed line represents the imposed threshold for the mean feature weight.

The ability of machine learning survival algorithms to model data high dimensionality with respect to the classical CPH regression18 is shown by comparing the time behavior of the considered metrics (see Appendix B) when all features are included (N = 18) or just the six selected ones. In Fig. 2, the upper panels are related with the first case, while the lower panels to the latter one. At 5 and 10 years, time period usually considered for follow-up in clinical practice, the performances of the CPH model are on average lower than those of machine learning models by at least 10 percentage points, when all the features are considered. However, when just a subset of features is considered, consisting in the most important ones, the average difference in performance is halved, signaling a better condition for the CPH model, while the performance of machine learning survival models tends to remain unchanged with respect to the number of features involved. In order to define a parsimonious model, i.e. to use just the right number of predictors needed to explain the model well, the following analyses will be carried out on the results obtained by considering the selected subset of features.

Figure 2figure 2

Description of the time behavior of the performance metrics, area under the ROC curve (AUC) and concordance index (c-index). Top panels are referred to the exploitation of the whole features set, while the bottom ones exploit just selected features. Solid lines join the mean points evaluated every 10 months, while shaded regions correspond to standard deviations.

The sensitivity and specificity balanced performance (see Fig. 6 in Appendix C) corresponding to the 5 years time frame are equal to 0.62–0.65 for the three machine learning survival models, while the much lower one of CPH shows approximately 0.55 for the same balanced metrics pair. If we consider 10 years after the first BC, the CPH model still shows the lowest performance, while the machine learning survival models RSF and GB are characterized by a balanced aforementioned metric pair equal to 0.63–0.65, while CGB emerges as the best performing classifier with 0.67 for the balanced sensitivity and specificity pair.

Once the 10 years time frame is kept fixed, we establish a threshold for each model according to the median of the ones selected by the Youden index optimization (see Fig. 6 in Appendix B). The average score of each patient over the 20 rounds is then adopted to assign each case to a high or low risk category. These strata are further characterized by means of the Kaplan–Meier curves shown in Fig. 3. We reported in the descriptive table of survival estimation models in the supplementary materials (see Fig. 6). The p values confirm that CGB implements the best discrimination between high or low risk patients, while CPH is much less efficient than the machine learning survival models.

Figure 3figure 3

Representation of Kaplan–Meier survival probability for the patients stratification in high and low risk implemented by each model corresponding to 10 years after the first BC diagnosis.

Correlation with EndoPredict® scores

The similarity measure of our risk estimation with the one predicted by EP is evaluated over an independent test consisting of 27 patients. The last step of our preliminary study consists in the calculation of the Pearson correlation coefficients between the hazards (see Appendix A) estimated by our best performing CGB survival curves at 10 years after the first BC diagnosis and the risk scores provided to our institution by the EP software exploiting genetic data.

A scores statistics for the separate dataset is obtained by testing the sample set of 27 patients on 20 rounds fivefold cross-validation for the model trained on 145 patients, such that a variation in the training is obtained to gain a wider statistic. The hazard value corresponding to 10 years after the first breast cancer diagnosis is then deduced, whose overall correlation statistics is shown in Fig. 4. The violin plot in the left panel is characterized by a sufficient stability around the median value, imposing the slope of the trend line in the bubble plot of the right panel.

Figure 4figure 4

Representation of Pearson correlation statistics. In the left panel the distribution of correlations between the hazards yielded by the best performing CGB survival curves at 10 years and EP scores; In the right panel a bubble plot of scores and hazards in a rescaled version with respect to their maximum values, where each bubble ray measures the standard deviation over the 20 rounds, while the dashed red line slope is equal to the median value in the left panel.

The performance of EP declared by the authors in terms of c-index is equal to 0.753 for the prediction of distant recurrences within 10 years5. As emerged from our results, CGB shows the highest correlations, equal in average to 0.42, with respect to the remaining survival models, instead resulting in average uncorrelated. We underline that such correlations are time independent by definition for CGB, GB and CPH, because the corresponding hazard functions assume time independent parameters, such that features and time are independent variables.

留言 (0)

沒有登入
gif