The study received approval from the institutional review board. Due to the retrospective design, the requirement for informed consent was waived. The study adhered to the checklist for evaluation of radiomics research guidelines [19] to ensure comprehensive and transparent reporting.
Patients
This multi-cohort study enrolled patients from three institutions, comprising two retrospective cohorts and a retrospective analysis of a cohort from a prospective clinical trial. The inclusion criteria were as follows: (a) pathologically confirmed ESCC, (b) receipt of nCRT and curative resection, and (c) availability of pre-treatment contrast-enhanced CT and MR data, including DWI and T2WI. The exclusion criteria were (a) insufficient image quality due to obvious artifacts and (b) incomplete clinical and pathological data. Patients for the retrospective cohorts were recruited between September 2014 and September 2023 from institution 1 and between December 2017 and August 2021 from institution 2. The clinical trial cohort comprised patients enrolled in the KEYSTONE-002 trial at institution 3 until November 2023. The KEYSTONE-002 trial is an ongoing phase III randomized controlled trial registered at ClinicalTrials.gov (NCT04807673), with the main inclusion criteria being (a) pathologically confirmed ESCC, (b) R0-resectable thoracic esophageal cancer staged cT1-3N1-2M0 or cT2-3N0M0, (c) age 18–75 years, and (d) Eastern Cooperative Oncology Group Performance Status (ECOG-PS) 0–1. All patients underwent curative resection, which involved transthoracic esophagectomy with two-field or three-field lymphadenectomy. Pre-treatment imaging, including contrast-enhanced CT and MR, was performed within two weeks prior to the initiation of nCRT. Patients from institution 1 were allocated to the training cohort, and those from institutions 2 and 3 to the testing cohort.
Clinical and pathological data collection
Clinical data including age, sex, ECOG-PS, tumor location, tumor length, TNM stage (AJCC 8th edition), chemotherapy regimen, radiotherapy technique, and radiotherapy dose were collected. All patients underwent endoscopy, which in most cases was used to determine tumor location and length by measuring the distance from the incisors; tumor location was categorized according to the AJCC 8th edition. In rare instances where the endoscope could not pass, CT was used to assess tumor location and length. Clinical staging was assessed using contrast-enhanced CT and MR. Endoscopic ultrasound was utilized except in rare cases where the probe could not pass. 18F-FDG PET-CT was performed in 35 patients (23.2%) who had suspected metastatic disease not clearly identified on CT or MR imaging. Clinical staging was determined from initial imaging reports prepared by experienced radiologists and endoscopists at each institution. The staging procedures were consistent across all institutions, including the cohort at institution 3. Pathologic tumor regression grade was determined postoperatively using the method described by Mandard et al. [20], which categorizes the therapeutic response into five grades. Based on the surgical pathologic examination report, Mandard grade 1 with negative lymph node metastasis was categorized as pCR, and all other findings as non-pCR.
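The endpoint definition above can be written as a small labeling rule. The following is a minimal sketch, not the authors' code; the function name `classify_response` and its arguments are hypothetical, and it assumes the pathology report provides the Mandard grade and the number of metastatic lymph nodes.

```python
def classify_response(mandard_grade: int, positive_lymph_nodes: int) -> str:
    """Return "pCR" or "non-pCR" following the rule stated in the text.

    mandard_grade: Mandard tumor regression grade (1 to 5).
    positive_lymph_nodes: number of lymph nodes with metastasis (assumed input).
    """
    if mandard_grade not in (1, 2, 3, 4, 5):
        raise ValueError("Mandard tumor regression grade must be between 1 and 5")
    if positive_lymph_nodes < 0:
        raise ValueError("positive_lymph_nodes cannot be negative")
    # pCR requires both grade 1 in the primary tumor and no nodal metastasis.
    if mandard_grade == 1 and positive_lymph_nodes == 0:
        return "pCR"
    return "non-pCR"
```

Under this rule, a grade 1 primary tumor with any positive lymph node is still labeled non-pCR.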
CT and MR technique
All patients underwent pre-nCRT contrast-enhanced CT and MR scans. Portal venous phase CT images were collected, with scanning parameter details provided in eTable 1. MR scanning parameter details are shown in eTable 2.
Tumor segmentation
Two radiologists, each with 5 years of experience in thoracic imaging interpretation, manually segmented tumor regions on multiple contiguous axial slices of contrast-enhanced CT, T2WI, and DWI images to cover the entire tumor volume. The segmentation results were reviewed by two senior experts with 15 years and 25 years of experience, respectively. For interobserver reproducibility analysis, an additional radiologist with 6 years of experience independently segmented tumors in 20 randomly selected patients from the training cohort. The manual segmentation process is illustrated in eFig. 1.
Radiomic analysis
Feature extraction
N4 bias field correction was applied to MR sequences to address image inhomogeneity [21]. Images were resampled isotropically to a voxel dimension of 1 × 1 × 1 to standardize voxel spacing. To mitigate noise and discretize intensities, the Hounsfield units of CT images were adjusted to the standard abdominal window, with the window center set at 50 and the window width at 350. Z-score normalization was applied before extracting traditional features. A total of 1652 features were extracted from both original and filtered images (eTable 3). For deep learning, the Inception-V3 network [22], pre-trained on ImageNet data, was employed. The largest tumor images from each patient were cropped for feature extraction, with grayscale values normalized to the range [−1, 1] using min–max transformation. The images were then resized to 299 × 299 using nearest-neighbor interpolation. The network’s last fully connected layer was removed, and the output of the average pooling layer over the feature maps was used to extract 2048 deep-learning features.
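The intensity steps described above (CT windowing, z-score normalization, and min–max scaling to [−1, 1]) can be sketched in a few lines. This is an illustrative, dependency-free sketch operating on a flat list of voxel values, not the authors' pipeline; the function names are hypothetical, and the use of the population standard deviation and per-image statistics are assumptions the text does not specify.

```python
from statistics import mean, pstdev

def apply_ct_window(hu_values, center=50.0, width=350.0):
    """Clip Hounsfield units to [center - width/2, center + width/2]."""
    low, high = center - width / 2.0, center + width / 2.0
    return [min(max(v, low), high) for v in hu_values]

def z_score(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant image: nothing to scale
    return [(v - mu) / sigma for v in values]

def min_max_to_signed_unit(values):
    """Rescale values linearly so the minimum maps to -1 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant image: map everything to 0
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]
```

With a center of 50 and a width of 350, the window spans −125 to 225 HU, so air (about −1000 HU) and dense bone are clipped to the window limits.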
Feature selection
Feature selection was conducted entirely within the training cohort. Only features with an intraclass correlation coefficient (ICC) greater than 0.8 were retained. The training cohort was divided into an internal training set and an internal validation set in a 4:1 ratio, a procedure repeated across 100 iterations. In each iteration, the internal training set was analyzed using the Mann–Whitney U-test and least absolute shrinkage and selection operator (LASSO) with 5-fold cross-validation to generate a feature set for model construction. Ten algorithms were used to build classifiers: logistic regression (LR), support vector machine (SVM), K-nearest neighbors (KNN), decision tree, random forest, extra trees, XGBoost, multi-layer perceptron (MLP), Naive Bayes, and light gradient boosting machine (LightGBM). The performance of these classifiers was tested on the internal validation set. The best-performing classifier and its feature set from each iteration were recorded. Features were ranked based on their frequency of selection. The top two features for each imaging modality (CT and MRI) and feature extraction method (traditional radiomics and deep learning) were selected to build single-modality models using the ten algorithms. Then, for each imaging modality, a combined model based on both traditional and deep-learning radiomics features was built. To enhance the model’s generalizability and reduce overfitting, another round of feature selection and model construction was performed, again over 100 iterations, based on the twelve features selected in the preceding procedure. This process aimed to select the top four features for building integrated models with the ten algorithms.
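The idea of ranking features by how often they survive selection across repeated 4:1 splits can be illustrated with a simplified sketch. It is not the authors' implementation: it keeps only the Mann–Whitney U filter (normal approximation, no tie correction) and omits the LASSO and classifier steps, the splits are not stratified, and the names `mann_whitney_p` and `rank_features_by_frequency` and the 0.05 filter level are assumptions.

```python
import math
import random
from collections import Counter

def mann_whitney_p(a, b):
    """Two-sided Mann-Whitney U p-value (normal approximation, no tie correction)."""
    n1, n2 = len(a), len(b)
    # U counts pairs where the a-value exceeds the b-value; ties count one half.
    u = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mu) / sigma
    return math.erfc(abs(z) / math.sqrt(2.0))

def rank_features_by_frequency(X, y, n_iter=100, alpha=0.05, seed=0):
    """X maps feature name -> list of values; y is a list of 0/1 labels.

    Returns (feature, count) pairs sorted by how often the feature passed the filter.
    """
    rng = random.Random(seed)
    n = len(y)
    counts = Counter()
    for _ in range(n_iter):
        idx = list(range(n))
        rng.shuffle(idx)
        train = idx[: (4 * n) // 5]  # 4:1 split; the held-out fifth is unused here
        pos = [i for i in train if y[i] == 1]
        neg = [i for i in train if y[i] == 0]
        if not pos or not neg:
            continue  # skip degenerate splits that contain a single class
        for name, column in X.items():
            p = mann_whitney_p([column[i] for i in pos], [column[i] for i in neg])
            if p < alpha:
                counts[name] += 1
    return counts.most_common()
```

Taking the first two entries of the returned ranking would correspond to keeping the two most frequently selected features for one modality and extraction method.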
Model construction
Five-fold cross-validation was performed with the selected feature sets for each modality using the ten algorithms (LR, SVM, KNN, decision tree, random forest, extra trees, XGBoost, MLP, Naive Bayes, and LightGBM). Algorithm performance was evaluated by the mean AUC, which served as the principal metric for algorithm selection. For each modality, the algorithm with the highest mean AUC was then selected to develop the dedicated machine learning model using the training cohort.
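Selecting an algorithm by mean cross-validated AUC can be sketched as follows. This is a generic illustration rather than the study's code: the folds are stratified by class (an assumption, as the text does not say), `fit_mean_difference` and `fit_constant` are hypothetical toy stand-ins for the ten real algorithms, and each `fit` callable is assumed to return a function that scores one sample.

```python
import random

def auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties = 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def stratified_folds(y, k, rng):
    """Deal the indices of each class round-robin into k folds."""
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        members = [i for i, label in enumerate(y) if label == cls]
        rng.shuffle(members)
        for j, i in enumerate(members):
            folds[j % k].append(i)
    return folds

def mean_cv_auc(fit, X, y, k=5, seed=0):
    """Mean AUC over k folds; requires at least k samples of each class."""
    aucs = []
    for fold in stratified_folds(y, k, random.Random(seed)):
        held_out = set(fold)
        train = [i for i in range(len(y)) if i not in held_out]
        score = fit([X[i] for i in train], [y[i] for i in train])
        aucs.append(auc([score(X[i]) for i in fold], [y[i] for i in fold]))
    return sum(aucs) / k

def fit_mean_difference(X_train, y_train):
    """Toy linear scorer: project onto the difference of the class means."""
    def class_mean(cls):
        rows = [x for x, l in zip(X_train, y_train) if l == cls]
        return [sum(col) / len(rows) for col in zip(*rows)]
    w = [a - b for a, b in zip(class_mean(1), class_mean(0))]
    return lambda x: sum(wj * xj for wj, xj in zip(w, x))

def fit_constant(X_train, y_train):
    """Toy baseline that assigns the same score to every sample."""
    return lambda x: 0.0

def select_algorithm(candidates, X, y):
    """Return the name of the candidate with the highest mean CV AUC, and all results."""
    results = {name: mean_cv_auc(fit, X, y) for name, fit in candidates.items()}
    return max(results, key=results.get), results
```

In practice the `candidates` dictionary would hold wrappers around real classifiers; the selection logic stays the same.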
Assessment of model performance
Receiver operating characteristic curve analysis and AUC were used to evaluate the models, with 95% confidence intervals (CIs) generated through bootstrap resampling performed 1000 times. The models built with the ten algorithms were tested independently in the testing cohort, and the best was kept to represent the model’s performance. To determine the most effective threshold for the radiomics score, the Youden index was maximized in the training cohort. These optimal cutoff values were subsequently applied to the testing cohort, and the sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were calculated. AUC values of different models were compared using the DeLong method. Calibration curves and decision curve analysis (DCA) were employed to further assess the models’ performance and clinical utility. The workflow of the radiomics analysis is depicted in Fig. 1.
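The evaluation steps described in this subsection (bootstrap CI for the AUC, a Youden-index cutoff learned on training data, and threshold-based metrics on test data) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: it uses a percentile bootstrap, treats scores at or above the cutoff as positive, and all function names are hypothetical.

```python
import random

def auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties = 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(scores, labels, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap CI; resamples containing a single class are redrawn."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        resampled = [labels[i] for i in idx]
        if 0 < sum(resampled) < n:
            stats.append(auc([scores[i] for i in idx], resampled))
    stats.sort()
    lower = stats[int((1 - level) / 2 * n_boot)]
    upper = stats[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper

def confusion(scores, labels, threshold):
    """Counts (tp, fp, fn, tn), calling a case positive when score >= threshold."""
    tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
    fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
    fn = sum(1 for s, l in zip(scores, labels) if s < threshold and l == 1)
    tn = sum(1 for s, l in zip(scores, labels) if s < threshold and l == 0)
    return tp, fp, fn, tn

def youden_threshold(scores, labels):
    """Cutoff maximizing sensitivity + specificity - 1 (lowest cutoff wins ties)."""
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp, fp, fn, tn = confusion(scores, labels, t)
        j = tp / (tp + fn) + tn / (tn + fp) - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t

def diagnostic_metrics(scores, labels, threshold):
    """Sensitivity, specificity, PPV and NPV at a fixed cutoff."""
    tp, fp, fn, tn = confusion(scores, labels, threshold)
    ratio = lambda a, b: a / b if b else float("nan")
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }
```

The cutoff would be computed once with `youden_threshold` on training-cohort scores and then passed unchanged to `diagnostic_metrics` for the testing cohort, so the test data never influence the threshold.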
Fig. 1 Workflow of the radiomics analysis
Statistical analysis
Statistical analysis was performed using R version 4.1.2 and Python 3.9. Categorical variables were compared using Fisher’s exact test, while continuous variables were compared using the Mann–Whitney U-test. A two-sided p-value of less than 0.05 was considered statistically significant.
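For readers unfamiliar with Fisher's exact test, the two-sided p-value for a 2 × 2 table can be computed directly from the hypergeometric distribution. The sketch below is illustrative only and is not the software routine used in the study; the function name is hypothetical, and it follows the common convention of summing the probabilities of all tables no more likely than the observed one.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more likely than the observed table;
    # the small relative tolerance guards against floating-point ties.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-7)
    )
```

For example, a table with rows (3, 1) and (1, 3) gives p = 34/70, well above the 0.05 significance level stated above.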