Machine learning models in evaluating the malignancy risk of ovarian tumors: a comparative study

Patient characteristics

In this study, a total of 1,632 patients with adnexal tumors detected by ultrasound examination at the Department of Obstetrics and Gynecology, Ruijin Hospital affiliated to Shanghai Jiao Tong University School of Medicine between January 2019 and May 2021 were included. After applying exclusion criteria, 1,555 patients were analyzed, including 1,196 (76.9%) patients with benign tumors and 359 (23.1%) patients with malignant tumors. The flowchart of enrollment is shown in Fig. 2. Pathological results of the patients are summarized in Table 1, whereas demographic and clinical characteristics are presented in Table 2.

The dataset was divided in an 8:1:1 ratio into a training set (956 benign and 285 malignant cases; 7,493 images; 80%), a validation set (119 benign and 35 malignant cases; 799 images; 10%), and a test set (121 benign and 39 malignant cases; 818 images; 10%). Demographic and clinical characteristics were consistent across the training, validation, and test sets, as detailed in Table 3. There were no significant differences in age, CA125 levels, or other key clinical features, indicating that the test set was representative of the patient population and reducing potential bias.
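The 8:1:1 division can be illustrated with a minimal sketch. It assumes a stratified, patient-level split (all images of one patient stay in the same set); the text does not state how the split was implemented, so the function `split_patients`, the seed, and the made-up patient ids are illustrative only, and the resulting counts depend on rounding and are not meant to reproduce the reported ones.

```python
import random
from collections import defaultdict

def split_patients(labels, ratios=(0.8, 0.1, 0.1), seed=0):
    """Stratified patient-level split (hypothetical helper, not the authors' code).

    labels: dict mapping patient_id -> class label (e.g. "benign"/"malignant").
    Returns a dict with "train", "val" and "test" lists of patient ids.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for pid, label in labels.items():
        by_class[label].append(pid)

    splits = {"train": [], "val": [], "test": []}
    for pids in by_class.values():
        pids = sorted(pids)  # deterministic order before shuffling
        rng.shuffle(pids)
        n_train = round(ratios[0] * len(pids))
        n_val = round(ratios[1] * len(pids))
        splits["train"] += pids[:n_train]
        splits["val"] += pids[n_train:n_train + n_val]
        splits["test"] += pids[n_train + n_val:]  # remainder goes to the test set
    return splits

# Toy cohort with the same class counts as the study (ids are made up).
labels = {f"B{i}": "benign" for i in range(1196)}
labels.update({f"M{i}": "malignant" for i in range(359)})
splits = split_patients(labels)
print({name: len(ids) for name, ids in splits.items()})
```

Splitting by patient rather than by image avoids images of the same tumor appearing in both the training and test sets.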

Significant differences were observed between benign and malignant tumors with respect to clinical and ultrasound characteristics. Patients with malignant tumors were older than those with benign tumors, with a median age at diagnosis of 54.0 and 41.0 years, respectively (p < 0.001). Serum tumor marker levels were significantly higher in patients with malignant tumors than in those with benign tumors, as reflected by the median CA125 values (122.2 vs. 17.6, p < 0.001). Ultrasound features also differed significantly between benign and malignant adnexal tumors. Malignant tumors had larger diameters for both the mass and its solid component (74 vs. 55 mm, p < 0.001; 50 vs. 24 mm, p < 0.001) and more abundant blood flow (p < 0.001). Tumor type also differed between the two groups: malignant tumors occurred more frequently in masses with a solid component, while benign tumors were more likely to be simple cysts. Additionally, malignant tumors were more frequently associated with pelvic fluid, ascites, or pelvic nodules (p < 0.001).

Fig. 2 Flowchart of enrollment in study cohort

Table 1 Histopathological findings in 1555 women with adnexal mass

Table 2 Demographic and Clinical Characteristics of patients with benign and malignant ovarian tumors (n = 1555)

Table 3 Demographic and Clinical Characteristics of patients in training set, validation set and test set (n = 1555)

Diagnostic performance of adnexal mass prediction models

Table 4 compares the efficacy of ResNet50, DenseNet, Vision Transformer, Swin Transformer, and SA in identifying benign and malignant ovarian tumors (Figure 3). The evaluation metrics include AUC, sensitivity, specificity, NPV, PPV, Youden index, cutoff value, +LR, -LR, and DOR. Figure 3 compares the receiver operating characteristic (ROC) curves of the models, with the false positive rate (FPR) on the x-axis and the true positive rate (TPR) on the y-axis; the AUC is the area under each curve.
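To make the relationship between the plotted curve and the AUC concrete, the sketch below builds the (FPR, TPR) points from predicted scores and integrates them with the trapezoidal rule. The function names and the toy labels and scores are hypothetical; this is a plain-Python illustration, not the evaluation code used in the study.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) points, one per distinct score threshold.

    y_true: 1 for malignant, 0 for benign. scores: higher means more suspicious.
    A case is called positive when its score is >= the threshold.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing called positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Toy example: four malignant and four benign cases with made-up scores.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.3, 0.9]
points = roc_points(y_true, scores)
toy_auc = auc(points)
print(round(toy_auc, 4))
```

In the toy data, 15 of the 16 malignant/benign pairs are ranked correctly, which is what the computed area reflects.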

Table 4 Comparison of the efficacy of ResNet, DenseNet, Vision Transformer, Swin Transformer and SA in identifying benign and malignant ovarian tumors

Among these models, ResNet50, DenseNet, Swin Transformer, and SA achieved high AUC values of 0.91, 0.91, 0.92, and 0.97, respectively. Vision Transformer had a slightly lower AUC of 0.87. Swin Transformer and SA achieved the highest sensitivity, at 87.2% for both models. Specificity was highest for SA at 98.4%, followed by Swin Transformer at 94.3%. Vision Transformer had the lowest specificity at 81.2%.

When considering the NPV, all models performed similarly well, with values above 99.6%. However, there were notable differences in PPV. SA had the highest PPV at 52.0%, while Vision Transformer had the lowest at 8.4%. The Youden index, a measure of overall diagnostic performance, was highest for SA at 0.86. Cutoff values were determined for each model, with values ranging from > 0.17 to > 3. Additionally, +LR values ranged from 4.49 to 53.18, while -LR values ranged from 0.13 to 0.25. The DOR was highest for SA at 409.08.
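The Youden index, likelihood ratios, and DOR all follow from sensitivity and specificity, which the short sketch below illustrates using the standard definitions. The function name `diagnostic_summary` is hypothetical, and the inputs are the rounded percentages reported for SA.

```python
def diagnostic_summary(sensitivity, specificity):
    """Derive Youden index, likelihood ratios and DOR from sensitivity/specificity.

    Inputs are proportions in (0, 1), not percentages.
    """
    youden = sensitivity + specificity - 1
    lr_pos = sensitivity / (1 - specificity)   # +LR
    lr_neg = (1 - sensitivity) / specificity   # -LR
    dor = lr_pos / lr_neg                      # diagnostic odds ratio
    return {"youden": youden, "lr_pos": lr_pos, "lr_neg": lr_neg, "dor": dor}

# SA at the reported operating point (sensitivity 87.2%, specificity 98.4%).
sa = diagnostic_summary(0.872, 0.984)
print({name: round(value, 2) for name, value in sa.items()})
```

Because the inputs are rounded, the derived ratios only approximate the tabulated values (for example the reported maximum +LR of 53.18 and DOR of 409.08); small changes in specificity near 100% move +LR and DOR considerably.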

Table 5 further compares the efficacy of models in identifying benign and malignant ovarian tumors, with and without the use of CA125, a biomarker for ovarian cancer (Figure 4). The evaluation metrics used are similar to those in Table 4. The results showed that the addition of CA125 did not significantly improve the performance of the models in terms of AUC and sensitivity. However, there were slight improvements in PPV and DOR when CA125 was incorporated. Overall, the performance of the models remained consistent regardless of the presence of CA125.

Fig. 3 Comparison of the efficacy of ResNet, DenseNet, Vision Transformer, Swin Transformer and SA in identifying benign and malignant ovarian tumors

Table 5 Comparison of the efficacy of ResNet, DenseNet, Vision Transformer and Swin Transformer in identifying benign and malignant ovarian tumors with or without CA125

Fig. 4 Comparison of the AUC of ResNet + CA125, DenseNet + CA125, Vision Transformer + CA125, Swin Transformer + CA125 and SA in identifying benign and malignant ovarian tumors

Channel attention visualization analysis

As illustrated in Fig. 5, gradient-weighted class activation maps (Grad-CAM) are generated using the gradients of the classification score with respect to the final convolutional feature map. In a Grad-CAM image, the activated (red) areas contribute strongly to the final prediction, whereas the blue areas contribute little. These maps were compared with the justifications provided by clinicians. In cases where the diagnosis was correct, the models and the clinicians focused on the same regions of interest. Nonetheless, there were instances where both the clinicians and the DCNNs made incorrect diagnoses. We also compared the areas of interest identified by advanced sonographers with those identified by the machine learning models.
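The Grad-CAM computation described here can be sketched in a few lines: each channel of the final convolutional feature map is weighted by the spatial average of its gradient, the weighted channels are summed, and negative values are clipped. The sketch uses plain Python lists and made-up 2 x 2 feature maps for readability; in practice the activations and gradients would come from a deep learning framework's forward and backward passes, and nothing here reflects the authors' actual implementation.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heatmap from one layer's feature maps and their gradients.

    activations, gradients: lists of K channels, each an H x W list of lists.
    gradients[k][i][j] is d(class score) / d(activations[k][i][j]).
    Returns an H x W heatmap scaled to [0, 1].
    """
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weight = spatial average of that channel's gradient.
    weights = [sum(map(sum, g)) / (height * width) for g in gradients]
    # Weighted sum over channels, then ReLU to keep positive evidence only.
    cam = [[max(0.0, sum(w * a[i][j] for w, a in zip(weights, activations)))
            for j in range(width)] for i in range(height)]
    peak = max(map(max, cam))
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam

# Two hypothetical 2x2 feature maps: channel 0 supports the class, channel 1 opposes it.
activations = [[[1.0, 0.0], [0.0, 2.0]], [[0.0, 3.0], [0.0, 0.0]]]
gradients = [[[0.5, 0.5], [0.5, 0.5]], [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(activations, gradients)
print(heatmap)
```

The heatmap is then upsampled to the image size and overlaid with a color map, so that high values appear red and low values blue.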

We further analyzed six misdiagnosed cases, shown in Fig. 6. Case A was benign, but all four machine learning models predicted it as malignant. The postoperative pathology revealed an endometriotic cyst with old hemorrhage and coffee-colored material, without nodules or papillary growth. The machine learning algorithms may have misinterpreted the old blood clot as a papillary or solid component, erroneously treating it as a malignant feature. Case B was also benign, but the DenseNet, Swin, and Vision Transformer models predicted it as malignant. The postoperative pathology confirmed an endometriotic cyst. However, it differed from the typical ground-glass appearance on ultrasound, showing uniform hyperechoic content within the cyst. In the class activation maps, we observed that the models that misclassified it focused excessively on the hyperechoic area, potentially leading to misclassification.

Similarly, in Case C, which resembled Case A (an endometriotic cyst with old hemorrhage), blood clots resembling papillary projections led to misdiagnosis by the two Transformer models. Case D involved pathological changes due to torsion of an adnexal cyst. All models except DenseNet incorrectly classified it as malignant. This may be attributed to the large size of the tumor, which may have prevented the models from capturing benign features accurately. Additionally, the extensive hemorrhagic necrosis resulting from a 1080° torsion might have caused the models to focus excessively on certain benign features and erroneously consider them malignant. Cases E and F were both mature cystic teratomas with neural glial components, a unique subtype of teratoma. Benign teratomas often exhibit characteristic ultrasonographic features, such as mixed echogenicity/white ball and stripes/shadowing [27]. However, these two cases presented with similar solid components and/or thick septations. The models may have mistaken these for malignant characteristics, potentially resulting in misdiagnosis.

Fig. 5 Visualization of channel attention module

Fig. 6 CAM analysis of 6 cases (A-F). The grayscale ultrasound images are shown on the top left, while the Doppler ultrasound images are shown below. On the right side, clockwise from top left, are DenseNet, ResNet, Swin, and Vision Transformer
