Automatic segmentation model and machine learning model grounded in ultrasound radiomics for distinguishing between low malignant risk and intermediate-high malignant risk of adnexal masses

Ethical approval

The research was authorized by the ethics committees of South China Hospital of Shenzhen University (no.: HNLS20230112101-A). Considering the retrospective nature of the research, a waiver for patient informed consent was provided. The workflow for model construction and visualization is illustrated in Fig. 1.

Fig. 1 Workflow of ultrasound-based radiomics analysis

Participants and data collection

From June 2021 to January 2024, we retrospectively collected transvaginal or transrectal ultrasound images from participants diagnosed with adnexal masses at the South China Hospital of Shenzhen University (training set) and the Qingdao Municipal Hospital (testing set). In adherence to O-RADS, two radiologists (H.T. and W.C.), each with two decades of gynecological ultrasound experience, categorized these images into five types (O-RADS 1–5), and their concordant classification served as the diagnostic criterion. In cases of disagreement, a gynecological ultrasound specialist was consulted to reach a consensus. Given that O-RADS 4 has been established as the optimal threshold for malignancy [16, 18], the masses were divided into two subsets: O-RADS 1–3 lesions were considered the low malignant risk set, whereas O-RADS 4–5 lesions were considered the intermediate-high malignant risk set.

The participants’ clinical features included age, the maximum diameter of the mass, symptoms, menopausal status, and the presence or absence of ascites. The symptoms encompassed dysmenorrhea, chronic pelvic pain, abdominal discomfort, dyspareunia, and a sensation of abdominal distension.

The eligibility criteria were: (1) participants with adnexal lesions who had undergone ultrasound examination at the two designated hospitals mentioned above; (2) participants over the age of 18; and (3) for participants with multiple adnexal masses, only the mass exhibiting the most intricate morphology was included. The exclusion criteria were: (1) pelvic masses that could not be confirmed as originating from the adnexa; (2) low-quality images; and (3) incomplete clinical data.

Image acquisition

The ultrasonic examination was performed utilizing diverse devices, including Mindray DC-80, Samsung HERA XW10, GE Voluson E10, and GE Logiq E9. All ultrasound scans were administered by licensed radiologists. If multiple images were obtained of a single mass, the image depicting the largest diameter was chosen for inclusion. Image quality control was carried out by two radiologists (T.W. and Y.L.).

Construction of deep learning image segmentation model

A subset of images was randomly selected to train the segmentation model. Two investigators (L.L. and H.T.) used Labelme software (version 5.3.1, USA) to manually delineate the target lesions in the selected images, and their segmentation results served as the reference standard. To avoid subjective errors caused by individual differences, the intraclass correlation coefficient (ICC) was used to assess inter-observer and intra-observer concordance in lesion delineation. An ICC of 0.75 or greater was deemed to reflect acceptable concordance.

Seven deep learning segmentation models were trained: FCN ResNet50, FCN ResNet101, DeepLabV3 ResNet50, DeepLabV3 ResNet101, DeepLabV3 MobileNetV3-Large, LR-ASPP MobileNetV3-Large, and U-Net. We applied the Dice similarity coefficient (DSC), which quantifies the overlap between the investigator’s and the model’s segmentation results, to evaluate segmentation accuracy. The best-performing model was then used to automatically segment the remaining images. In cases where the segmentation was inaccurate, manual fine-tuning was performed.
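As a minimal sketch of the overlap metric used above, the DSC between a model-predicted mask and a manual delineation can be computed from two binary arrays as follows (toy masks for illustration, not study data):

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, truth: np.ndarray) -> float:
    """Dice similarity coefficient between two binary masks.

    DSC = 2 * |A ∩ B| / (|A| + |B|); 1.0 means perfect overlap.
    """
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    denom = pred.sum() + truth.sum()
    return 2.0 * intersection / denom if denom > 0 else 1.0

# Toy 8x8 masks standing in for a model mask and a manual delineation.
model_mask = np.zeros((8, 8), dtype=np.uint8)
manual_mask = np.zeros((8, 8), dtype=np.uint8)
model_mask[2:6, 2:6] = 1   # 16-pixel square
manual_mask[3:7, 3:7] = 1  # 16-pixel square, 9-pixel overlap
dsc = dice_coefficient(model_mask, manual_mask)  # 2*9 / (16+16) = 0.5625
```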

Features extraction and features selection

We used a feature analysis program designed for radiomics analysis within PyRadiomics to extract radiomic features. The extracted features fell into three primary categories: geometry, intensity, and texture. Geometry features depict the shape characteristics of the lesion. Intensity features describe the first-order statistical distribution of voxel intensity within the lesion. Texture features describe the pattern, or second- and higher-order spatial distribution, of the intensity, including the gray-level co-occurrence matrix (GLCM), gray-level dependence matrix, gray-level size zone matrix, gray-level run-length matrix, and neighboring gray-tone difference matrix.
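To illustrate what the intensity (first-order) category contains, the sketch below re-implements a few such features in plain NumPy over a masked region. This is a simplified stand-in, not the PyRadiomics extractor actually used in the study:

```python
import numpy as np

def first_order_features(image: np.ndarray, mask: np.ndarray) -> dict:
    """A few first-order intensity features over the masked lesion,
    analogous to (but far simpler than) a radiomics firstorder class."""
    voxels = image[mask.astype(bool)].astype(float)
    mean, std = voxels.mean(), voxels.std()
    return {
        "Mean": mean,
        "Variance": voxels.var(),
        # Skewness of the intensity histogram (0 for symmetric distributions).
        "Skewness": ((voxels - mean) ** 3).mean() / (std ** 3 + 1e-12),
        # Energy: sum of squared intensities inside the lesion.
        "Energy": float((voxels ** 2).sum()),
    }

# Toy 4x4 "image" with the whole region treated as the lesion mask.
image = np.arange(16, dtype=float).reshape(4, 4)
mask = np.ones((4, 4), dtype=np.uint8)
feats = first_order_features(image, mask)
```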

To identify the features most strongly correlated with the categorization, we employed the t-test and the Mann–Whitney U-test to screen features, retaining those with a p value less than 0.05, as they indicated significant differences between the two sets. To eliminate superfluous features, the correlation among features was assessed with Spearman’s rank correlation coefficient, and only one feature was kept from any pair with a correlation coefficient exceeding 0.9. Additionally, we adopted a greedy recursive deletion approach, which iteratively removes the feature deemed most redundant in the current set.
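The two screening steps (univariate testing, then Spearman redundancy pruning) can be sketched with SciPy as below. The thresholds (0.05, 0.9) follow the text; the Shapiro–Wilk normality check used to choose between the two tests is an assumption for illustration:

```python
import numpy as np
from scipy import stats

def screen_features(X, y, p_thresh=0.05, rho_thresh=0.9):
    """Univariate screening followed by Spearman redundancy pruning."""
    kept = []
    for j in range(X.shape[1]):
        a, b = X[y == 0, j], X[y == 1, j]
        # t-test for (roughly) normal features, Mann-Whitney U otherwise.
        normal = (stats.shapiro(a).pvalue > 0.05
                  and stats.shapiro(b).pvalue > 0.05)
        p = (stats.ttest_ind(a, b).pvalue if normal
             else stats.mannwhitneyu(a, b).pvalue)
        if p < p_thresh:
            kept.append(j)
    # Keep only one feature of any pair whose |Spearman rho| exceeds 0.9.
    selected = []
    for j in kept:
        if all(abs(stats.spearmanr(X[:, j], X[:, k])[0]) <= rho_thresh
               for k in selected):
            selected.append(j)
    return selected

# Toy data: feature 0 separates the classes; feature 1 is a redundant copy.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 30)
f0 = np.concatenate([rng.normal(0, 1, 30), rng.normal(5, 1, 30)])
X = np.column_stack([f0, 2.0 * f0])
```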

We utilized least absolute shrinkage and selection operator (LASSO) regression to minimize the feature set. The Rad scores were generated using LASSO logistic regression (LR), selecting only the features with non-zero coefficients. LASSO shrinks all regression coefficients towards 0, depending on the regularization weight λ, and sets the coefficients of uncorrelated features exactly to 0. The optimal λ was ascertained through 10-fold cross-validation using the minimum criteria, i.e., the λ yielding the least cross-validation error. The most robust features with non-zero coefficients were used for regression model fitting and integrated into a radiomic signature. A Rad score was generated from a linear combination of the selected features, each weighted by its corresponding model coefficient.
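The LASSO-LR step can be sketched with scikit-learn as follows: an L1-penalized logistic regression whose penalty strength is chosen by 10-fold cross-validation, with the Rad score formed as the linear combination of the surviving non-zero coefficients. Synthetic data only; the solver settings are assumptions, not the study's exact configuration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

# Synthetic feature matrix: only feature 0 carries signal.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))
y = (X[:, 0] > 0).astype(int)

# L1 (LASSO) penalty; the grid over C = 1/lambda is searched by 10-fold CV.
model = LogisticRegressionCV(
    Cs=10, cv=10, penalty="l1", solver="liblinear", max_iter=1000
).fit(X, y)

# Rad score: weighted sum of the selected (non-zero) features plus intercept.
w, b = model.coef_.ravel(), model.intercept_[0]
rad_score = X @ w + b
```

By construction the Rad score equals the model's decision function, so features whose coefficients were shrunk to zero contribute nothing.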

Model construction and evaluation

We utilized Python to establish the radiomic and clinical models. The selected features were fed into several machine learning algorithms, including LR, support vector machine (SVM), k-nearest neighbor (KNN), Random Forest, XGBoost, LightGBM, and multi-layer perceptron (MLP). We subsequently conducted 5-fold cross-validation to ascertain the best hyperparameters for model fitting. A radiomic nomogram was developed, incorporating both radiomic and clinical features for analysis.
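The tuning step can be sketched for one of the listed learners (SVM) as a 5-fold cross-validated grid search; the grid itself is illustrative, not the study's actual search space:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Stand-in data with the rough shape of a radiomic feature matrix.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

# 5-fold CV over a small hyperparameter grid, scored by AUC.
grid = GridSearchCV(
    SVC(probability=True),
    param_grid={"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
    cv=5,
    scoring="roc_auc",
).fit(X, y)

best_model = grid.best_estimator_  # refit on the full set with best params
```

The same pattern applies to each of the other algorithms, with an algorithm-specific parameter grid.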

The receiver operating characteristic (ROC) curve was used to visually assess diagnostic performance. Furthermore, several diagnostic indices were calculated, including the area under the ROC curve (AUC), specificity, sensitivity, accuracy, positive predictive value (PPV), negative predictive value (NPV), and precision. We performed the DeLong test in MedCalc software to compare the AUCs of the various models. To evaluate the concordance between the model’s predictions and the actual classifications, a calibration curve was generated and assessed with the Hosmer–Lemeshow test. Decision curve analysis (DCA) was performed to assess the clinical utility of the model.
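For reference, the confusion-matrix indices listed above, and the net-benefit quantity plotted by a DCA curve, reduce to the following (a minimal sketch over toy labels; the actual analysis used the full test set):

```python
import numpy as np

def diagnostic_indices(y_true, y_pred):
    """Confusion-matrix indices for binary labels (1 = positive class)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "accuracy": (tp + tn) / len(y_true),
    }

def net_benefit(y_true, y_prob, threshold):
    """Net benefit at one DCA threshold t: TP/n - FP/n * t / (1 - t)."""
    y_true = np.asarray(y_true)
    pred = np.asarray(y_prob) >= threshold
    n = len(y_true)
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    return tp / n - fp / n * threshold / (1 - threshold)

# Toy example: 3 true positives, 4 true negatives in the reference labels.
y_true = [1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0]
idx = diagnostic_indices(y_true, y_pred)
nb = net_benefit(y_true, [0.9, 0.8, 0.2, 0.6, 0.1, 0.3, 0.2], 0.5)
```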

Assessment by radiologists

Two experienced radiologists (H.T. and W.C.) independently categorized the images based on O-RADS. A third experienced radiologist (L.L.), with more than ten years of gynecological ultrasound experience, and two less-experienced radiologists (T.W. and Y.L.), each with less than five years of gynecological ultrasound experience, were also asked to categorize the images. With the assistance of the model, the two less-experienced radiologists then reclassified these images. In instances of uncertainty, the less-experienced radiologists could refer to the nomogram results and the SHAP force plots for the images to assist their diagnosis and adjust their classifications accordingly.

Model interpretability and visualization

The SHAP (SHapley Additive exPlanations) method was used to visualize the significance of the features and their influence on the model, interpreting the model’s internal decision-making process by assigning importance values (SHAP values) to the features.
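For intuition, SHAP values have a closed form for a linear model with independent features: the value of feature j for a sample x is w_j · (x_j − E[x_j]), and the values sum to the model's output minus its mean output (the "local accuracy" property). A pure-NumPy illustration of this identity, not the shap library used to produce the actual force plots:

```python
import numpy as np

def linear_shap_values(w, X_background, x):
    """Exact SHAP values of f(x) = w·x + b under feature independence:
    phi_j = w_j * (x_j - mean_j over the background data)."""
    return w * (x - X_background.mean(axis=0))

# Toy linear "model" and background data.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
w, b = np.array([1.5, -2.0, 0.5]), 0.3
x = X[0]

phi = linear_shap_values(w, X, x)
# Local accuracy: the SHAP values sum to f(x) - E[f(X)].
f_x = w @ x + b
f_mean = (X @ w + b).mean()
```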

Statistical analysis

IBM SPSS statistical software was employed to compare the participants’ clinical features between the sets. Continuous variables were summarized as mean ± standard deviation and evaluated with the t-test (normal distribution) or Mann–Whitney U-test (non-normal distribution). Categorical variables were presented as percentages and analyzed with the Chi-square test. A two-sided p value of less than 0.05 was considered statistically significant. The 95% confidence interval (CI) for the AUC was determined. Python was utilized to conduct Z-score normalization, the Spearman rank correlation test, and the LASSO analysis.
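The Z-score normalization applied to the feature matrix amounts to centering each column to mean 0 and scaling it to unit standard deviation, as in this small sketch:

```python
import numpy as np

def z_score_normalize(X: np.ndarray) -> np.ndarray:
    """Column-wise Z-score normalization: (x - mean) / std per feature."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd = np.where(sd == 0, 1.0, sd)  # guard against constant features
    return (X - mu) / sd

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
Z = z_score_normalize(X)
```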
