CT-based multimodal deep learning for non-invasive overall survival prediction in advanced hepatocellular carcinoma patients treated with immunotherapy

Patient characteristics

Characteristics of the 207 patients (mean age, 61 years ± 12 [SD]; 180 male) are summarized in Table 1. Among all patients, the median interval between baseline and follow-up CT was 55 days. The median survival time was 475 days, and 138 patients (66.7%) had died. There was no significant difference in survival among the training, validation, and test datasets (Fig. S2; train vs validation, HR, 1.051, 95% confidence interval [CI]: 0.662–1.669, p = 0.833; train vs test, HR, 1.032, 95% CI: 0.686–1.552, p = 0.880; validation vs test, HR, 1.039, 95% CI: 0.604–1.790, p = 0.889). Regarding histological type, 29 of the 207 patients (14%) had highly differentiated tumors, 154 (74%) moderately differentiated, 23 (11%) poorly differentiated, and 1 (< 1%) undifferentiated. At baseline, 35 patients (17%) had NASH/NAFLD and 36 (17%) had PVTT. As additional treatments, 109 (53%) had undergone surgery, 10 (5%) received EBRT, 121 (58%) received TAE/TACE, and 58 (28%) received RFA/WMA. Multivariable Cox regression yielded a risk score based on seven variables, \(x_1, \ldots, x_7\), using the formula: \(\text{Risk score} = 0.3747 \times x_1 + 0.1593 \times x_2 - 0.1801 \times x_3 + 0.6732 \times x_4 - 0.8235 \times x_5 + 0.6482 \times x_6 - 0.4497 \times x_7\) (Fig. S3, C-index = 0.630, p = 0.018).
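
The risk score above is a Cox linear predictor, i.e., a weighted sum of the covariates. The following is a minimal sketch using the seven coefficients reported in the formula. The covariate names and their encoding are not given here, so the inputs are purely positional and the example patient values are hypothetical.

```python
import math

# Coefficients as reported in the formula, in the order they appear.
COEFFICIENTS = [0.3747, 0.1593, -0.1801, 0.6732, -0.8235, 0.6482, -0.4497]


def clinical_risk_score(covariates):
    """Return the Cox linear predictor (weighted sum) for one patient."""
    if len(covariates) != len(COEFFICIENTS):
        raise ValueError("expected exactly seven covariate values")
    return sum(beta * x for beta, x in zip(COEFFICIENTS, covariates))


def relative_hazard(covariates_a, covariates_b):
    """Hazard of patient A relative to patient B under a Cox model."""
    return math.exp(clinical_risk_score(covariates_a) - clinical_risk_score(covariates_b))


# Hypothetical binary-encoded patients (values are illustrative only).
patient_a = [1, 0, 1, 1, 0, 0, 1]
patient_b = [0, 0, 0, 0, 0, 0, 0]
print(clinical_risk_score(patient_a), relative_hazard(patient_a, patient_b))
```

Under the proportional hazards assumption, a difference in the score translates into a hazard ratio of exp(difference), which is why only the relative ordering and spacing of the scores matter.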

Table 1 Patient characteristics

Comparisons of different models on the survival prediction

Prediction performance was compared among CLN, Rad-S, RadCLN-S, Rad-D, and RadCLN-D on both the validation and independent test sets (Tables 2 and 3 and Fig. S4). Clinical variables alone (CLN) showed poor predictive performance, with a C-index of 0.537 (95% CI: 0.406–0.668) on the validation set and 0.622 (95% CI: 0.500–0.744) on the test set. Using the baseline radiological image (Rad-S) achieved a C-index of 0.692 (95% CI: 0.569–0.815) on the validation set and 0.608 (95% CI: 0.504–0.712) on the test set. Incorporating the first follow-up image (Rad-D) improved performance, reaching 0.748 (95% CI: 0.664–0.832) and 0.681 (95% CI: 0.573–0.789) on the validation and test sets, respectively. Multi-modal inputs (RadCLN-S and RadCLN-D) outperformed their uni-modal counterparts (CLN, Rad-S, and Rad-D). RadCLN-S reached a C-index of 0.697 (95% CI: 0.574–0.820) on the validation set and 0.638 (95% CI: 0.536–0.740) on the test set. RadCLN-D attained 0.752 (95% CI: 0.660–0.844) on the validation set and 0.695 (95% CI: 0.581–0.809) on the test set. Time-dependent ROCs showed a similar pattern in survival prediction performance (Figs. S4b and S4c).
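
For readers unfamiliar with the metric, the C-index reported above measures how often a model ranks a patient who died earlier as higher risk. The sketch below implements Harrell's concordance index in plain Python. It is an illustrative implementation under that standard definition, not the code used in the study, and the example data are hypothetical.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index.

    times  : observed follow-up times
    events : 1 if death was observed, 0 if censored
    risks  : predicted risk scores (higher means worse prognosis)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical example: four patients, the third one censored.
times = [100, 200, 300, 400]
events = [1, 1, 0, 1]
print(concordance_index(times, events, [0.9, 0.7, 0.2, 0.4]))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the reported range of 0.537 to 0.752 in context.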

Table 2 Concordance index of the prognostic prediction models

Table 3 Time-dependent AUCs of the prognostic prediction models

To further assess prognostic ability, the models’ capability for risk stratification was evaluated. RadCLN-S, Rad-D, and RadCLN-D effectively stratified patients into high-risk and low-risk groups (Fig. 3, right; Figs. S5 and S6), demonstrating that survival risk can be identified using clinical information with baseline CT scans, or using baseline and first follow-up scans alone. Among all models, RadCLN-D exhibited the highest predictive performance, as it included the most comprehensive information.

Fig. 3

Performance of RadCLN-D. Left, distributions of risk scores based on the multi-modal predictions; heatmaps illustrate the levels of the two modalities (radiological and clinical). Middle, time-dependent ROC curves at 1 year and 2 years. Right, Kaplan–Meier survival estimates for OS, stratified into low-risk and high-risk groups according to the median risk score in the training set. AUC, area under the curve; HR, hazard ratio

RadCLN-D accurately predicts the OS

The performance of RadCLN-D is illustrated in further detail. RadCLN-D combines the radiological score, output by the CRNN structure, and the clinical score using the formula \(\text{Score} = 9.8834 \times \text{Score}_{\text{radiological}} + 0.5300 \times \text{Score}_{\text{clinical}}\); both modalities contributed significantly to the OS prediction (Fig. 4b; radiological, p < 0.001; clinical, p = 0.0056; Wald test). For 1-year OS prediction, the AUC was 0.966 in the training set, 0.777 in the validation set, and 0.704 in the test set. For 2-year OS prediction, the AUC was 0.983 in the training set, 0.839 in the validation set, and 0.652 in the test set (Fig. 3, middle). Patients with lower multi-modal scores tended to be censored or to have relatively longer survival times, while most patients with higher scores died early (Fig. 3, left). The median score in the training data was used as the cutoff for stratifying patients into high-risk and low-risk groups, i.e., ‘score > 0.66’ signifies high risk and ‘score ≤ 0.66’ signifies low risk. To examine generalizability, the risk score calculation and cutoff used in the validation and test sets were the same as those of the training set. The multi-modal score displayed reliable predictive accuracy. It achieved significant risk stratification in the training (HR, 24.173, 95% CI: 12.181–47.971, p < 0.001), validation (HR, 3.330, 95% CI: 1.369–8.102, p = 0.008), and test sets (HR, 2.024, 95% CI: 1.009–4.064, p = 0.047).
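
The score combination and cutoff rule described above can be sketched as follows. The two weights and the 0.66 cutoff are the reported values; the function names, the assumption that the cutoff applies directly to this combined score, and all example inputs are illustrative only.

```python
import statistics

W_RADIOLOGICAL = 9.8834  # reported weight of the radiological score
W_CLINICAL = 0.5300      # reported weight of the clinical score
CUTOFF = 0.66            # reported median score in the training set


def multimodal_score(radiological_score, clinical_score):
    """Combine the two modality scores into one multi-modal score."""
    return W_RADIOLOGICAL * radiological_score + W_CLINICAL * clinical_score


def risk_group(score, cutoff=CUTOFF):
    """Apply the training-set cutoff: strictly above is high-risk."""
    return "high-risk" if score > cutoff else "low-risk"


# Deriving a cutoff from (hypothetical) training scores; the same value
# is then reused unchanged on the validation and test sets.
hypothetical_training_scores = [0.20, 0.66, 1.10]
cutoff = statistics.median(hypothetical_training_scores)

score = multimodal_score(0.05, 0.40)  # hypothetical patient
print(score, risk_group(score, cutoff))
```

Fixing the cutoff on the training data and reusing it elsewhere keeps the validation and test stratifications free of information from those sets.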

Fig. 4

Establishment and validation of the nomogram of RadCLN-D. a For the radiological score and the clinical score, locate the corresponding value on the scale provided on the nomogram, then add the points to obtain the total points. A vertical line drawn from the total points value gives the predicted 1-year and 2-year survival probabilities. b Importance of the radiological score and the clinical score. c Calibration of the nomogram in terms of the agreement between predicted and observed 1-year survival outcomes

A nomogram was developed based on the RadCLN-D prediction model to estimate the OS of individual patients (Fig. 4a). It allows clinicians to estimate the 1-year and 2-year survival probabilities in a clear and concise manner. Calibration plots indicated good agreement between the nomogram and an ideal model across the training, validation, and test datasets (Fig. 4c).

Moreover, to assess the robustness of the model, the patients in the validation and test sets were grouped according to the manufacturer of the CT scanner used. Among these 78 patients, the main manufacturers were Siemens (37 patients) and General Electric (33 patients). Because of the small sample sizes for other manufacturers, such as Philips, TOSHIBA, and Hitachi Medical, those patients were not included in the analysis. Prognostic performance was compared between the two major manufacturers, with a C-index of 0.728 for the Siemens group and 0.707 for the General Electric group.

Interpretation of the deep learning model

To demonstrate the explainability of the deep learning model, four patients from the test set, two predicted as high-risk and two predicted as low-risk by Rad-D, are presented to interpret the constructed CRNN architecture. Heatmaps highlight the regions of the image that contribute most to the network’s decision-making process. The prognostic model focused particularly on the tumor regions (Fig. 5a, b, liver scans, with the hottest region on the tumor), which is consistent with common medical knowledge that regions with high malignancy correlate strongly with prognosis. In the two low-risk patients, non-liver malignancies resulted in the model highlighting the whole liver (Fig. 5c, d, liver scans). Similar heatmap patterns can be observed in lung scans, i.e., the model focused more on suspicious lesion areas.

Fig. 5

Interpretation of the deep learning model. Grad-CAM computes the gradients of the target score (i.e., the risk score) with respect to the feature maps in the last convolutional layer of the network. These gradients are average-pooled to obtain the importance weights of the feature maps, and the feature maps are then linearly combined using these weights. The resulting map is normalized to [0, 1] (with blue close to zero and red close to one). The four demonstrated cases were selected from the independent test set. a, b Two patients with model-predicted high risk. c, d Two patients with model-predicted low risk
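
The Grad-CAM computation described in the caption can be sketched in plain Python as below. The sketch assumes the feature maps and their gradients have already been extracted from the network as nested lists. The ReLU step comes from the standard Grad-CAM formulation and is not mentioned in the caption, and the toy inputs are hypothetical.

```python
def grad_cam(feature_maps, gradients):
    """Compute a normalized Grad-CAM map.

    feature_maps : K maps from the last convolutional layer, each H x W
    gradients    : gradients of the risk score w.r.t. those maps, same shape
    """
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    # Importance weight of each map: global average of its gradients.
    weights = [sum(sum(row) for row in grad) / (height * width) for grad in gradients]
    # Weighted linear combination of the feature maps, followed by ReLU
    # (standard Grad-CAM step; an assumption relative to the caption).
    cam = [
        [max(0.0, sum(w * fmap[i][j] for w, fmap in zip(weights, feature_maps)))
         for j in range(width)]
        for i in range(height)
    ]
    # Min-max normalization to [0, 1] for display as a heatmap.
    lo = min(min(row) for row in cam)
    hi = max(max(row) for row in cam)
    if hi == lo:
        return [[0.0] * width for _ in range(height)]
    return [[(v - lo) / (hi - lo) for v in row] for row in cam]


# Toy example with two 2 x 2 feature maps.
maps = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 5.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(maps, grads)
print(heatmap)
```

In practice the low-resolution map is upsampled to the input size and overlaid on the CT slice; that step is omitted here.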

Incremental value of RadCLN-D to traditional size-based method

RECIST outcomes assessed by an independent review committee were adopted, and patients with progression were assigned to the high-risk group. The risk stratification performance of RadCLN-D was compared with that of the conventional RECIST criteria (Fig. 6). RECIST outcomes showed acceptable risk prediction performance, as the response status significantly separated the high-risk group from the low-risk group (HR, 1.992, 95% CI: 1.119–3.545, p = 0.019). In comparison, RadCLN-D exhibited stronger stratification capability (HR, 2.450, 95% CI: 1.424–4.214, p = 0.001), suggesting an improvement of the deep learning-based method over the conventional size-based method.
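
Stratifications such as the one above are visualized with Kaplan–Meier curves for each risk group. The following is a minimal sketch of the Kaplan–Meier estimator in plain Python; it illustrates the standard product-limit formula and is not the study's code. The follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct death time.

    times  : observed follow-up times
    events : 1 if death was observed, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve


# Hypothetical follow-up (days) for one risk group.
group_times = [5, 8, 8, 12, 15]
group_events = [1, 1, 0, 1, 0]
km_curve = kaplan_meier(group_times, group_events)
print(km_curve)
```

Running the estimator separately on the high-risk and low-risk groups yields the two curves that a hazard ratio and log-rank test then compare.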

Fig. 6

Risk stratification by the deep learning model and the conventional RECIST criteria. Results were obtained on the combined validation and test sets. a Risk stratification by the deep learning model RadCLN-D; the high-risk group is defined as a score > 0.66, and the low-risk group as a score ≤ 0.66. b Risk stratification by the RECIST criteria; the high-risk group is defined as disease progression at the first follow-up, and the low-risk group as no progression observed at the first follow-up. HR, hazard ratio; RECIST, response evaluation criteria in solid tumors
