The impact of multicentric datasets for the automated tumor delineation in primary prostate cancer using convolutional neural networks on 18F-PSMA-1007 PET

In the present study, we evaluated the ability of CNNs to generalize to novel datasets in the context of intraprostatic GTV delineation on 18F-PSMA-1007 PET. Our findings indicate that training the model on datasets from multiple centers significantly improves performance compared to training on data from a single center, given an equivalent amount of training data. This improvement is expected, as exposure to multicentric data during training lets the model encounter a broader variety of data representations, thereby enhancing its capacity to generalize. Additionally, mixed training performed slightly better than the Leave-One-Center-Out approach, although this difference did not reach the required significance level. Overall, the results suggest that integrating data from all centers into the training process, despite sometimes yielding only a small benefit, can contribute to a more robust model by providing a more diverse training dataset.
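The two training schemes compared above can be illustrated with a minimal sketch. It is not the study's actual pipeline: the center labels, case identifiers, and the helper names `leave_one_center_out` and `mixed_split` are hypothetical, and the number of test cases taken per center is an assumption made purely for illustration.

```python
def leave_one_center_out(cases_by_center):
    """Yield (held_out_center, train_cases, test_cases) for each center.

    The held-out center contributes no training cases, so the model is
    tested on a center it has never seen.
    """
    for held_out in cases_by_center:
        train = [
            case
            for center, cases in cases_by_center.items()
            if center != held_out
            for case in cases
        ]
        test = list(cases_by_center[held_out])
        yield held_out, train, test


def mixed_split(cases_by_center, n_test_per_center):
    """Return (train_cases, test_cases) with every center in both sets.

    Assumption: the last n_test_per_center cases of each center are
    reserved for testing; the rest are used for training.
    """
    train, test = [], []
    for cases in cases_by_center.values():
        cases = list(cases)
        split = len(cases) - n_test_per_center
        train.extend(cases[:split])
        test.extend(cases[split:])
    return train, test


# Hypothetical cohorts, for illustration only.
cohorts = {"A": ["a1", "a2", "a3"], "B": ["b1", "b2"], "C": ["c1", "c2"]}
folds = list(leave_one_center_out(cohorts))
mixed_train, mixed_test = mixed_split(cohorts, n_test_per_center=1)
```

In the leave-one-center-out folds the test center is entirely unseen during training, whereas in the mixed split every center is represented in the training set, which is the distinction the comparison above turns on.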

Training deep learning models that perform well on novel, unseen data remains a fundamental challenge in machine learning [23]. Factors such as AI bias [21], shortcut learning [21], distribution shifts [22] and heterogeneous acquisition and data annotation [19] further complicate this issue. Even when the same method is applied to similar tasks, results can vary greatly depending on the dataset used.

This phenomenon is evident across various studies on automated tumor segmentation in PSMA-PET. For intraprostatic GTV segmentation, Kostyszyn et al. (2021) [12] demonstrated that a CNN achieved median DSC values of 0.81 to 0.84 across internal and external independent validation cohorts on 68Ga-PSMA-11 and 18F-PSMA-1007 PET scans. Ghezzo et al. (2023) [13] conducted an independent external validation of the method of Kostyszyn et al. and observed lower median DSC values ranging from 0.72 to 0.77, with mean DSC values between 0.69 and 0.71, for 68Ga-PSMA-11 PET.

Holzschuh et al. (2023) [11] reported median DSC values ranging from 0.70 to 0.82 for 18F-PSMA-1007, 18F-DCFPyL, and 68Ga-PSMA-11. Notably, the external validation in that study was conducted independently by another institute.

Leung et al. reported a mean DSC of 0.7 for 18F-DCFPyL PET [23], though it is unclear whether only intraprostatic lesions or whole-body lesions were considered.

Regarding whole-body PSMA PET, Kendrick et al. (2022) [24] reported a median DSC of 0.5 in a single-center study. Huang et al. (2023) [25] reported mean DSC values ranging from 0.59 to 0.63 on 68Ga-PSMA-11 PET. Jafari et al. [26] presented results for whole-body 68Ga-PSMA-11 PET, showing voxel-level mean DSC values of 0.65 to 0.7 for different independent centers.
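Because the studies above report a mix of median and mean DSC values, it may help to recall how the metric is computed and why the two summaries can differ. The following is a minimal sketch using the standard definition DSC = 2|A ∩ B| / (|A| + |B|) on binary masks; the function name `dice`, the flat-list mask representation, the convention for two empty masks, and the per-patient scores are all illustrative assumptions, not values or code from any of the cited studies.

```python
from statistics import mean, median


def dice(pred, ref):
    """Dice similarity coefficient of two binary masks (flat 0/1 sequences)."""
    if len(pred) != len(ref):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, r in zip(pred, ref) if p and r)
    total = sum(1 for p in pred if p) + sum(1 for r in ref if r)
    if total == 0:
        # Assumption: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * intersection / total


# One of the two predicted voxels overlaps the single reference voxel.
example = dice([1, 1, 0, 0], [1, 0, 0, 0])

# Hypothetical per-patient scores: one poor case lowers the mean
# much more than the median.
scores = [0.9, 0.8, 0.2]
mean_dsc = mean(scores)
median_dsc = median(scores)
```

With a skewed distribution of per-patient scores such as this one, the mean falls well below the median, so median and mean DSC values from different studies are not directly interchangeable.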

Our results are consistent with the ranges previously reported for automated tumor segmentation in PSMA-PET imaging. Our analysis also reveals that the performance of CNNs in delineating GTV is influenced by the dataset employed for training and testing. This variance underscores how difficult it is for machine learning models to adapt to new, unseen data, and highlights the importance of well-annotated, diverse and representative training datasets for improving model generalization, which is particularly relevant in medical imaging.

However, despite the potential for slight performance decrements, our study also demonstrates that models trained with mixed data from multiple centers can perform well in certain cases even when the quantity of data is small (n = 19), yielding a median DSC of 0.74. This underscores the data-dependent nature of AI experiments, which can lead to over- or underestimation of the final performance of a trained segmentation model. The nnU-Net employed in our study aims to keep hyperparameter selection invariant to the data through its end-to-end design; this data dependence becomes particularly significant when comparing models that have undergone individual hyperparameter optimization.

Regarding limitations, our study's conclusions are confined to the delineation of intraprostatic GTV using the 18F-PSMA-1007 PET tracer within the nnU-Net framework. Although, to the best of our knowledge, this represents one of the most extensive cohorts to date for 18F-PSMA-1007 PET imaging, the cohorts from individual centers are relatively small, so future research should incorporate more data and verify these results in larger cohorts. Moreover, the inherent challenges associated with image segmentation metrics must be acknowledged [18, 27]. Tumor size is also a potential source of bias that could affect segmentation results. In our study, the cohorts comprised patients at different tumor stages, which were not homogeneous across centers. For instance, no stage I patients were included in the Freiburg and Munich cohorts, whereas such patients were present in other cohorts. This resulted in potential variability in tumor sizes across the groups.

Additionally, alternative deep learning architectures should be explored in subsequent studies, given that our analysis was limited to the nnU-Net architecture. While our cohorts cover a diverse clinical spectrum, results may also vary across different patient populations.

Overall, our research highlights the importance of multicentric training datasets in enhancing the generalization capabilities of CNNs, underscoring the relationship between dataset diversity and the performance of machine learning models.
