Multimodal-based machine learning strategy for accurate and non-invasive prediction of intramedullary glioma grade and mutation status of molecular markers: a retrospective study

Aim

The present study aimed to develop a non-invasive preoperative method for predicting IMG grade and mutational status of molecular markers. Towards this goal, we retrospectively analyzed the preoperative MRI scans of patients with IMGs whose ATRX and P53 mutation statuses were tested by IHC. In the experimental design, multiple feature representations and mainstream machine learning models were compared to explore the superior predictive effect. After exploring the superior feature representation of novel machine learning models, including WHO-Mind, ATRX-Mind, and P53-Mind, methods were developed to accurately classify the glioma grade and predict the ATRX and P53 mutation statuses of IMGs. Additionally, rigorous external validation was performed to verify the generalizability of the proposed method.

Patient demographic characteristics and data acquisition

This retrospective study was approved by the ethical committees of Beijing Tsinghua Changgung Hospital (hospital 1) and The First Affiliated Hospital of Zhengzhou University (hospital 2) and was conducted in accordance with the principles of the Declaration of Helsinki. The requirement for obtaining informed consent was waived.

We collected the records of patients from two hospitals (hospital 1: January 2015 to April 2021 and hospital 2: February 2017 to October 2021). The patient inclusion criteria were preoperative MRI examination, pathological diagnosis of IMGs according to the WHO CNS5, and IHC-confirmed ATRX and P53 mutation status. Patients who had undergone biopsy, chemotherapy, and radiotherapy and those with IMG metastases from brain glioma were excluded. Following the initial patient screening, MRI data were evaluated, and the exclusion criteria were as follows: absence of sagittal (SAG) and transverse (TRA) T2-weighted imaging (T2WI) scans and artifacts in the MRI image.

Data on clinical baseline characteristics, including age, sex, smoking, alcohol consumption, medical history with accompanying diseases, time of onset, preoperative McCormick score, and radiological features of tumor-related lesions, were collected. Moreover, all radiological features, including tumor number, axis ratio (tumor/spinal cord), bleeding, cysts, spinal cord cavity, edema, atrophy, and malformation, were independently assessed by three neuroradiologists. All of these features were not obtainable by radiomic feature extraction. Two experienced pathologists performed the pathological evaluation of all gliomas using the WHO CNS5 for glioma grading and classification [13]. We classified glioma grades as low grade (WHO I–II, benign) and high grade (WHO III–IV, malignant), and the mutation status of ATRX and P53 was determined by IHC.

Patients recruited from hospital 1 were included in the primary cohort (n = 332) and were randomly assigned to the training set and internal validation set after fivefold cross-validation according to the ATRX/P53 mutation status and tumor grade. Patients from hospital 2 were included in the independent external validation cohort (n = 129). A detailed flow diagram of the patient selection process is illustrated in Fig. 1.

Fig. 1figure 1

Flow diagram showing the patient selection protocol and inclusion and exclusion criteria. ATRX, alpha thalassemia/mental retardation syndrome X-linked; IHC, immunohistochemistry; P53, tumor protein p53; SAG, sagittal; TRA, transverse; WI, weighted imaging

MRI acquisition and region of interest segmentation

MR data were obtained employing a 3.0-T MR scanner (Discovery MR750; GE Healthcare, Milwaukee, WI) in hospital 1 and either Siemens (Verio, Prisma, or Skyra; Siemens Healthcare, Erlangen, Germany) or Philips (Ingenia, Philips Medical System, Best, The Netherlands) MR systems in hospital 2. An overview of the parameters used to acquire MRI is given in Additional file 1, which shows that scans are obtained by scanners from different vendors employing various acquisition settings. Identifying the borders of low-grade gliomas is difficult on T1-weighted and contrast-enhanced sequences; thus, the T2-weighted (T2W) sequence is well accepted for identifying abnormal hyperintense signals representing gliomas, and tumor regions of interest (ROIs) were delineated on SAG and TRA T2WI separately [14].

Lesions were delineated on both SAG and TRA T2WI by two experienced neuroradiologists using the ITK-SNAP software (www.itksnap.org). Abnormal hyperintense signals on T2W MR images were regarded as tumor regions, and the signals of cerebrospinal fluid, spinal cord cavity, and edema were avoided. If the ROI was controversial, a senior neuroradiologist performed the final determination. T1-weighted images with and without gadolinium enhancement, short T1 inversion recovery, and fluid attenuation inversion recovery images were collected for most patients to aid in identifying tumors. Only T2WI was used in subsequent studies. Finally, to reduce the heterogeneity between different scanners, we resampled the T2W images and the corresponding ROI mask into a uniform voxel size of 0.5 × 0.5 × 5 mm across all patients for radiomic extraction.

Lesion segmentation using deep learning

After preprocessing, deep learning models were proposed to segment lesions, which greatly improved future work efficiency. This work selected a deep learning model method based on the transformer architecture (Swin transformer) because of its superiority in multiple domains [15, 16]. The Swin transformer adopts a hierarchical design containing a total of four stages: each stage decreases the resolution of the input feature map and expands the perceptual field layer by layer, similar to a convolutional neural network. In addition, it is designed with patch embedding, which cuts MRI scans into patches and embeds them into the model. Although this architecture has performed well in several tasks [17, 18], its use for segmenting IMGs has not been reported.

Both SAG and TRA images of the primary cohort were employed for the training, internal, and external validation cohorts for testing. The input channel was two dimensional, that is, all MRI sequences were converted into slices and then batch inputted, and the final output results were reconstructed in 3D. Transfer learning was adopted for training, and the pre-trained model was swin_transformer_base_patch4_window7_224_imagenet_1k. The backbone of the model was SwinTransformer_base_patch4_window7_224.

Feature extraction and selection

Radiomic features were extracted using the PyRadiomics module written in Python 3.7 in accordance with a previous study [19]. Both the SAG and TRA images extracted the following six feature classes: first-order statistics, shape, gray-level run-length matrix, gray-level size zone matrix, neighboring gray-tone difference matrix, and the gray-level dependence matrix. These feature classes used six built-in filters (wavelet, square, square root, logarithm, exponential, and gradient filters). All features were named by connecting the MRI type, the image type from which the feature was extracted, the feature class, and the feature name separated by an underscore. For example, TRA_original_glrlm_RunEntropy is a feature extracted from the TRA T2WI sequence, original image, and glrlm feature class, and the feature name is RunEntropy.

For every patient, 1960 radiomic features were extracted from SAG and TRA T2WI data, and all radiomic features were z transformed for data standardization. To avoid interobserver variations during manual segmentation, we calculated the intraclass correlation coefficient (ICC) for each feature, and only those with high stability (ICC > 0.8) were included in the analysis [20]. Then, the selected stable features were tested using the independent samples t-test or the Mann–Whitney U test to select potential important features. Features that did not meet the criteria for either of the aforementioned tests were excluded. This study adopted the least absolute shrinkage and selection operator (LASSO) on the training cohort to screen significant features with non-zero coefficients that can differentiate ATRX and P53 mutation status or glioma grade separately. For the three outcomes of ATRX, P53, and tumor grade, we used LASSO to select features in the TRA, SAG, and TRA + SAG groups, respectively. The aforementioned calculation methods are available in PyRadiomics 2.2.0 documentation [21].

Prediction model construction

In this study, three types of deep neural networks (DNNs) were constructed based on the data characteristics of the prediction tasks: WHO-Mind, ATRX-Mind, and P53-Mind. WHO-Mind has one input layer, four hidden layers, and one output layer. Both ATRX-Mind and P53-Mind have one input layer, three hidden layers, and one output layer. For each DNN, the ReLU and Adam were selected as the activation function and the solver for weight optimization, respectively. In addition, the initial learning rates and batch sizes were set to 0.01 and 64 in all models, respectively. The model architecture and the key hyperparameters are summarized in Additional file 2. However, the unbalanced nature of the data included in this study posed a challenge for model training, as data category imbalance may lead to severe overfitting. Therefore, after dividing the dataset, the Synthetic Minority Oversampling Technique (SMOTE) was adopted for the divided training set to achieve data augmentation. Its basic idea is to generate new synthetic samples by interpolating between minority class samples, thereby balancing the class distribution, and thus improving the performance of the classifier.

Specifically, this algorithm analyzes the minority class samples and artificially synthesizes new samples based on the minority class samples to add to the dataset, thereby solving the problem of model overfitting. The algorithm first randomly selects a sample from the minority class (sample A) and identifies its k nearest neighbors within the minority class. Subsequently, a neighbor (sample B) is randomly chosen from these k neighbors. The new synthetic sample is created on the line segment between sample A and sample B. This process is repeated as needed until a sufficient number of synthetic minority class samples are generated to achieve the desired class balance. It is worth noting that SMOTE solely creates new samples within the feature space without generating new class labels. Synthetic samples inherit the class label of their parent samples (i.e., sample A and sample B). To implement SMOTE in Python 3.7, we utilized the imbalanced-learn library, applying the SMOTE algorithm to the training data in order to balance the class distribution.

Additionally, considering that some mainstream models that have been developed obtained good performance in related fields [22,23,24], extreme Gradient Boosting (XGBoost), Gradient Boosting Decision Tree (LightGBM), and random forest (RF) were also trained and tested for comparison. The parameters of the above models were optimized during the training process using the GridSearchCV tool in Scikit-learn 1.1.1. The modeling was performed in the Python 3.7 programming environment, and the core library involved was Scikit-learn 1.1.1.

Experimental setup for the prediction models

The overall process of this study is illustrated in Fig. 2. Three tasks (WHO tumor grade, ATRX mutation status prediction, and P53 mutation status prediction) were included, and each was used to build four machine learning models. Moreover, six different feature representations were compared to explore the most superior model input: SAG radiomics, SAG radiomics with clinical baseline, TRA radiomics, TRA radiomics with clinical baseline, SAG + TRA radiomics, and SAG + TRA radiomics with clinical baseline features. In the primary cohort, fivefold cross-validation was employed to identify the best models. Data were randomly divided into five equal parts: one was selected for internal validation, and the rest were trained. This process was repeated five times. The best-performing models were tested using an external validation cohort to verify the generalization ability. This work was completed in Windows 10 OS and involved computing devices with a CPU AMD Ryzen 7 5800H (16 GB RAM) and GPU GeForce RTX ™ 3090 (24 GB RAM).

Fig. 2figure 2

Illustration of the study process. Stage I includes raw image acquisition (a), manual ROI segmentation (b), and auto ROI segmentation (c). Stage II includes feature extraction and selection. From both SAG and TRA images, radiomic features, including first-order statistical, shape, texture, and wavelet features, are extracted (d). All extracted features are screened out by ICC to select stable features (e). Informative features are then selected using LASSO (f). Stage III includes model construction and validation. Selected clinical and radiomic features are entered into the deep neural networks to predict the different tasks (g), and model performance is further tested in the external validation cohort (h). ATRX, alpha thalassemia/mental retardation syndrome X-linked; ICC, intraclass correlation coefficient; LASSO, least absolute shrinkage and selection operator; P53, tumor protein p53; ROI, region of interest; SAG, sagittal; TRA, transverse

Statistical analysis

When representing continuous variables, differences were assessed using Student’s t-test or the Mann–Whitney U test, as appropriate, and data are represented as median with interquartile range. Specifically, because the fivefold cross-validation was employed in the primary cohort, we also calculated the means and 95% confidence intervals (CIs) for the evaluation metrics of the models. We adopted the chi-square test to evaluate the differences in categorical variables, and the results are presented as the number of events and relative frequency (%). The performance of all models was evaluated according to accuracy (Acc), sensitivity (Sens), specificity (Spec), F1 scores, and receiver operating characteristic (ROC) curves. A dice similarity coefficient (DSC) was adopted to evaluate the performance of the networks for IMG segmentation. All statistical analyses were performed employing R software (version 3.6.3, R Foundation for Statistical Computing, Vienna, Austria). Statistical significance was defined as p ≤ 0.05.

留言 (0)

沒有登入
gif