Localized fine-tuning and clinical evaluation of deep-learning based auto-segmentation (DLAS) model for clinical target volume (CTV) and organs-at-risk (OAR) in rectal cancer radiotherapy

Fig. 1 Conceptual design and implementation workflow of this study in model fine-tuning and performance evaluation (external validation and generalizability evaluation)

The conceptual design and overall workflow are shown in Fig. 1. The work comprises two procedures: model fine-tuning and performance evaluation. Performance evaluation in turn has two parts. External validation assesses model performance on patients who were scanned on the same CT simulator as the training patients but were not used in model training. Generalizability evaluation assesses model performance on patients who were scanned on a different CT simulator and were likewise not involved in model training.

Data collection

Patient cohort

This retrospective study was approved by the institutional review board (IRB) of Peking University Cancer Hospital. A total of 120 patients were included, all diagnosed with Stage II/III mid-low rectal cancer (i.e., gross tumors located within 10 cm of the anal verge) and treated with chemoradiation at the institutional radiotherapy department. Of the enrolled cohort, 71 were female and 49 were male; ages ranged from 33 to 86 years, with a median of 65.

The enrolled patients were grouped into three datasets, as shown in Table 1: a training dataset, an external validation dataset (denoted ExVal) and a generalizability evaluation dataset (denoted GenEva). The training dataset comprised 60 patients treated between March 2020 and October 2022. ExVal comprised 30 patients treated between November 2022 and May 2023. At the end of 2022, a Philips RT-specific CT scanner was commissioned into clinical service at our institution, and 30 patients scanned on this CT simulator between February 2023 and May 2023 were collected as GenEva to evaluate model generalizability.

Table 1 Description of patient data grouping

Image acquisition

In this study, patients were immobilized in a supine position using a pelvic thermoplastic mask. The training dataset and ExVal were scanned on a Siemens Sensation Open CT simulator, while the GenEva dataset was scanned on a Philips Big-Bore CT simulator. Detailed specifications of the scan parameters are listed. The CT images were imported into the Eclipse Treatment Planning System (Varian Medical Systems Inc., USA) for physicians to delineate target and OAR structures. The contours and plans were reviewed by an internal panel before being approved for clinical treatment.

In this retrospective study, we retrieved the planning CT images together with the CTV and OAR contours from the treatment planning system in anonymized form under the IRB approval. The CTV and OAR contours approved for treatment were used as the ground truth (GT) reference in model training and performance evaluation. It is important to emphasize that all contours were real-world clinical data, and none were edited or refined specifically for this study.

DL model and localized fine-tuning

DL kernel network

The DL model for rectal cancer neoadjuvant radiotherapy used here was adopted from the work of Wu et al. [19, 20] and has been commercialized as RT-Mind-AI (MedMind Technology Co. Ltd., Beijing, China). The backbone network, referred to as DpnUNet, integrates dual-path-network (DPN) modules into the UNet structure. The overall architecture of DpnUNet is depicted in Fig. 2.

Fig. 2 Schematic of the kernel DpnUNet network architecture

Localized model fine-tuning

The model was pretrained using data from 122 patients at a single institution [19]. We further trained the model on the enrolled training data (60 patients) to adapt it to the institutional contouring protocol. The contours of interest were the CTV, bladder, femoral heads and small intestine. A class-weighted cross-entropy loss was used to balance the overall accuracy across the CTV and OARs. Localized model fine-tuning was performed on a single GPU workstation (Nvidia GeForce RTX 2080Ti) using 5-fold cross-validation (48 training vs. 12 validation patients per fold). The optimizer was Adam, and the batch size was 4. The initial learning rate was 0.0001 and decayed exponentially by a factor of 0.9 per epoch. Training ran for 60 epochs, and the model with the lowest cross-validation loss was selected as the final output.
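The fine-tuning settings described above can be illustrated with a minimal, framework-free sketch. It is not the vendor's training code: the function names, the random seed, the 0-based epoch convention and the example class weights are all assumptions made for illustration, and the source does not state how patients were assigned to folds or what class weights were used.

```python
import math
import random


def exponential_lr(initial_lr, decay, epoch):
    """Learning rate at a 0-based epoch when the rate is multiplied by `decay` each epoch."""
    return initial_lr * decay ** epoch


def five_fold_splits(patient_ids, n_folds=5, seed=0):
    """Yield (train_ids, val_ids) pairs so that every patient is validated exactly once.

    Shuffling with a fixed seed is an assumption; the paper does not describe the assignment.
    """
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        val_ids = folds[k]
        train_ids = [p for j, fold in enumerate(folds) if j != k for p in fold]
        yield train_ids, val_ids


def weighted_cross_entropy(probs, label, class_weights):
    """Class-weighted cross-entropy for one voxel.

    `probs` are predicted class probabilities and `label` is the true class index.
    The weights are hypothetical placeholders, not values from the study.
    """
    return -class_weights[label] * math.log(probs[label])


# Settings quoted in the text: initial rate 0.0001, decay 0.9 per epoch, 60 epochs, 60 patients.
schedule = [exponential_lr(1e-4, 0.9, epoch) for epoch in range(60)]
splits = list(five_fold_splits(range(60)))
example_loss = weighted_cross_entropy([0.25, 0.75], label=1, class_weights=[1.0, 2.0])
```

With 60 patients and five folds, each split holds 48 training and 12 validation patients, matching the 48 vs. 12 division quoted above.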

Performance evaluation

External validation and generalizability evaluation

This study used two datasets (ExVal and GenEva), with 30 cases each, to evaluate model performance in two respects. The data in ExVal were acquired on the same CT simulator as the training data and were therefore used for external validation. The data in GenEva were acquired on a different CT simulator and were used to evaluate model generalization under a change of imaging equipment.

Quantitative metrics

Two sets of deep-learning-predicted contours were generated for all 60 testing cases, using a vendor-provided pretrained model (VPM) and a localized fine-tuned model (LFT), respectively. We used several established and widely used metrics to quantify segmentation performance, namely the Dice Similarity Coefficient (DSC), the 95th percentile of the Hausdorff Distance (95HD), sensitivity, and specificity, with the clinically approved CTV and OAR contours as GT.

DSC, the most widely used measure in medical image segmentation, quantifies the overlap between two contours and is defined as:

$$\text{DSC}(D,G)=\frac{2\left|D\cap G\right|}{\left|D\right|+\left|G\right|}$$

(1)

where D and G represent the DLAS-predicted and GT contours respectively, and |D∩G| represents the intersected volume between D and G.
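As a worked illustration of Eq. (1), the sketch below computes the DSC for two small binary masks. The flat 0/1 list representation, the function name and the convention of returning 1.0 for two empty masks are assumptions for this example; the study itself computed the metric in 3D Slicer.

```python
def dice_similarity(pred_mask, gt_mask):
    """DSC between two binary masks given as equal-length sequences of 0/1 values."""
    if len(pred_mask) != len(gt_mask):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, g in zip(pred_mask, gt_mask) if p and g)
    size_sum = sum(1 for p in pred_mask if p) + sum(1 for g in gt_mask if g)
    if size_sum == 0:
        # Assumed convention: two empty masks are treated as perfect agreement.
        return 1.0
    return 2.0 * intersection / size_sum


# Toy example: each mask has 3 foreground voxels, 2 of which overlap.
pred = [0, 1, 1, 1, 0, 0]
gt = [0, 0, 1, 1, 1, 0]
dsc = dice_similarity(pred, gt)  # 2 * 2 / (3 + 3)
```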

The 95HD metric is a routinely used spatial distance-based metric that measures the distance between the DLAS-predicted and GT contours, and is defined as

$$95\text{HD}(D,G)=\text{percentile}\left(h(D,G)\cup h(G,D),\ 95\text{th}\right)$$

(2)

$$h(D,G)=\max_{d_i}\left(\min_{g_j}\left\|d_i-g_j\right\|\right),\quad d_i\in D,\ g_j\in G$$

(3)

where ||·|| stands for the Euclidean norm of the difference between a point d of D and a point g of G.
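A minimal sketch of a 95HD computation is given below. It follows one common reading of Eqs. (2) and (3): the nearest-neighbour distances are collected in both directions, pooled, and the 95th percentile of the pooled set is taken in place of the maximum. The point-list representation of the contour surfaces, the function names and the linear-interpolation percentile are assumptions for illustration; the 3D Slicer implementation used in the study may differ in these details.

```python
import math


def directed_distances(a_points, b_points):
    """For every point in a_points, the Euclidean distance to its nearest point in b_points."""
    return [min(math.dist(a, b) for b in b_points) for a in a_points]


def percentile(values, q):
    """q-th percentile using linear interpolation between order statistics (assumed convention)."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    position = (len(ordered) - 1) * q / 100.0
    lower = math.floor(position)
    upper = math.ceil(position)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def hd95(pred_points, gt_points):
    """95th percentile of the pooled nearest-neighbour distances in both directions."""
    pooled = directed_distances(pred_points, gt_points) + directed_distances(gt_points, pred_points)
    return percentile(pooled, 95)


# Toy surfaces on a line (coordinates in mm, hypothetical values).
pred_surface = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
gt_surface = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
distance_95 = hd95(pred_surface, gt_surface)
```

Using the 95th percentile instead of the maximum makes the metric less sensitive to a handful of outlying surface points, which is the usual motivation for reporting 95HD.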

Sensitivity and specificity are popular metrics for the evaluation of medical image segmentation performance [14, 15], which are defined as

$$\text{Sensitivity}=\frac{\text{TP}}{\text{TP}+\text{FN}}$$

(4)

$$\text{Specificity}=\frac{\text{TN}}{\text{TN}+\text{FP}}$$

(5)

where TP, FP, TN and FN denote the numbers of true positive, false positive, true negative and false negative pixels, respectively, for the DLAS-predicted CTV and OAR contours, i.e., the numbers of pixels classified correctly or incorrectly with respect to the GT [21].
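Eqs. (4) and (5) can be illustrated with a short sketch that counts the four outcomes on a pair of binary masks. The flat 0/1 list representation and the function names are assumptions for this example and do not reflect the 3D Slicer implementation.

```python
def confusion_counts(pred_mask, gt_mask):
    """Return (TP, FP, TN, FN) for two binary masks of equal length."""
    if len(pred_mask) != len(gt_mask):
        raise ValueError("masks must have the same number of pixels")
    tp = fp = tn = fn = 0
    for p, g in zip(pred_mask, gt_mask):
        if p and g:
            tp += 1
        elif p and not g:
            fp += 1
        elif g:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """Fraction of GT foreground pixels that the prediction recovers."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of GT background pixels that the prediction leaves as background."""
    return tn / (tn + fp)


# Toy example: 3 GT foreground pixels and 5 GT background pixels.
pred = [1, 1, 1, 0, 0, 0, 0, 0]
gt = [1, 1, 0, 1, 0, 0, 0, 0]
tp, fp, tn, fn = confusion_counts(pred, gt)
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
```

Because the background usually far outnumbers the contoured structure in a CT volume, specificity tends to sit close to 1, so it is best read alongside sensitivity and DSC.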

The CTV volume was also measured. The DSC, 95HD, sensitivity, specificity, and CTV-volume values of each testing case were calculated in the 3D Slicer software (version 5.4.0) [16].

Statistical analysis

The mean and standard deviation (SD) values were calculated for each metric. Within each testing dataset, the Wilcoxon paired signed-rank test was used to compare the performance between VPM and LFT. The statistical analysis was performed in OriginPro (version 2021a, OriginLab, USA), and the significance level was set at 0.05.
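The statistical comparison described above can be sketched in plain Python. The study ran its analysis in OriginPro, so the code below is only an illustration under stated assumptions: it uses the normal approximation for the two-sided p-value, drops zero differences, assigns average ranks to ties, and applies no tie or continuity correction. The DSC values are made-up placeholders, not results from the study.

```python
import math
import statistics


def wilcoxon_signed_rank(x, y):
    """Two-sided paired Wilcoxon signed-rank test using the normal approximation.

    Returns (statistic, p_value), where the statistic is the smaller of the
    positive and negative rank sums.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied absolute differences and give each the average rank.
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    statistic = min(w_plus, w_minus)
    mean_w = n * (n + 1) / 4.0
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (statistic - mean_w) / sd_w
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return statistic, p_value


# Hypothetical per-patient DSC values for the two models (placeholders only).
dsc_vpm = [0.78, 0.81, 0.75, 0.80, 0.79, 0.77, 0.82, 0.76]
dsc_lft = [0.85, 0.86, 0.83, 0.88, 0.84, 0.87, 0.89, 0.82]

mean_vpm, sd_vpm = statistics.mean(dsc_vpm), statistics.stdev(dsc_vpm)
mean_lft, sd_lft = statistics.mean(dsc_lft), statistics.stdev(dsc_lft)
statistic, p_value = wilcoxon_signed_rank(dsc_lft, dsc_vpm)
significant = p_value < 0.05
```

A paired test is appropriate here because both models are evaluated on the same patients, so each patient contributes one difference between the LFT and VPM values.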
