Stepwise incremental pretraining for integrating discriminative, restorative, and adversarial learning

Self-supervised learning (SSL) (Jing and Tian, 2020) pretrains generic source models (Zhou et al., 2021b) without using expert annotation, allowing the pretrained generic source models to be quickly fine-tuned into high-performance application-specific target models to minimize annotation cost (Tajbakhsh et al., 2021). Existing SSL methods typically employ one or a combination of the following three learning ingredients (Haghighi et al., 2022): (1) discriminative learning, which pretrains an encoder by distinguishing images associated with (computer-generated) pseudo labels; (2) restorative learning, which pretrains an encoder–decoder by reconstructing original images from their distorted versions; and (3) adversarial learning, which pretrains an additional adversary encoder to enhance restorative learning. Haghighi et al. (2021) demonstrated that combining discriminative and restorative learning enhances network performance on both classification and segmentation tasks, and Tao et al. (2020) demonstrated that restorative learning is further enhanced by adversarial learning. Inspired by both works (Haghighi et al., 2021; Tao et al., 2020), we believe that combining all three components – discriminative, restorative, and adversarial learning – yields the best performance. Haghighi et al. (2022, 2024) articulated a vision and insights for integrating the three learning ingredients in a single framework for collaborative learning, yielding three learned components: a discriminative encoder, a restorative decoder, and an adversary encoder (Fig. 1). However, such integration inevitably increases model complexity and pretraining difficulty, raising two questions: (a) how to optimally pretrain such complex generic models, and (b) how to effectively utilize the pretrained components for target tasks?
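
To make the three ingredients concrete, the following is a minimal PyTorch-style sketch of how a discriminative encoder, a restorative decoder, and an adversary encoder might share one backbone and each contribute one loss term. All module names (`UnitedModel`, `united_losses`), loss choices, and tensor shapes are illustrative assumptions for 3D volumes, not the exact implementation described in this paper.

```python
# Minimal sketch of the three SSL ingredients (assumed names and losses, not the paper's exact design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnitedModel(nn.Module):
    """Discriminative encoder + restorative decoder + adversary encoder in one framework."""
    def __init__(self, encoder, decoder, adversary, feat_channels, num_pseudo_classes):
        super().__init__()
        self.encoder = encoder        # discriminative encoder (shared 3D backbone)
        self.decoder = decoder        # restorative decoder (maps features back to a volume)
        self.adversary = adversary    # adversary encoder (scores volumes as real vs. restored)
        self.cls_head = nn.Linear(feat_channels, num_pseudo_classes)

    def forward(self, distorted):
        features = self.encoder(distorted)                       # (B, C, D, H, W) feature map
        pooled = F.adaptive_avg_pool3d(features, 1).flatten(1)   # (B, C)
        logits = self.cls_head(pooled)                           # pseudo-label prediction
        restored = self.decoder(features)                        # reconstructed volume
        return logits, restored

def united_losses(model, distorted, original, pseudo_labels):
    logits, restored = model(distorted)
    # (1) discriminative: distinguish computer-generated pseudo labels
    loss_d = F.cross_entropy(logits, pseudo_labels)
    # (2) restorative: reconstruct the original volume from its distorted version
    loss_r = F.mse_loss(restored, original)
    # (3) adversarial: the encoder-decoder is rewarded when the adversary scores its output as real
    fake_score = model.adversary(restored)                       # (B, 1) logits (assumed shape)
    loss_a = F.binary_cross_entropy_with_logits(fake_score, torch.ones_like(fake_score))
    return loss_d, loss_r, loss_a
```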

To address these two questions, we have redesigned nine prominent SSL methods for 3D imaging, including Rotation (Gidaris et al., 2018), Jigsaw (Noroozi and Favaro, 2016), Rubik’s Cube (Zhuang et al., 2019), Deep Clustering (Caron et al., 2018), TransVW (Haghighi et al., 2021), MoCo (Momentum Contrast) (He et al., 2020), BYOL (Bootstrap Your Own Latent) (Grill et al., 2020), PCRL (Preservational Contrastive Representation Learning) (Zhou et al., 2021a), and Swin UNETR (Swin UNEt TRansformers) (Tang et al., 2022). Among these methods, Rotation, Jigsaw, and Rubik’s Cube are classic discriminative methods; Deep Clustering is a classic clustering-based method; TransVW and PCRL integrate both discriminative and restorative learning; MoCo and BYOL are contrastive methods; and Swin UNETR is a transformer-based model that incorporates contrastive, restorative, and discriminative learning. With these methods, we aim to encompass all components and models of SSL, emphasizing the generality of our approach. We formulated each method in a single framework called “United” (Fig. 2), as it unites discriminative, restorative, and adversarial learning. Pretraining United models, with all three components together, directly from scratch is unstable (Table 2); therefore, we have investigated various training strategies and discovered a stable solution: stepwise incremental pretraining. Such pretraining proceeds as follows: first, a discriminative encoder is trained via discriminative learning (Step D); then, the pretrained discriminative encoder is attached to a restorative decoder (i.e., forming an encoder–decoder) for further combined discriminative and restorative learning (Step D(D+R)); finally, the pretrained encoder–decoder is associated with an adversary encoder for the full discriminative, restorative, and adversarial training (Step D(D+R)(D+R+A)). This stepwise pretraining strategy provides the most reliable performance across most target tasks evaluated in this work, encompassing both classification and segmentation (see Tables 2–5 and 7).
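
Building on the loss sketch above, the following illustrates how the stepwise incremental pretraining schedule (Step D, then Step D(D+R), then Step D(D+R)(D+R+A)) could be expressed in code. Optimizer choices, epoch counts, the adversarial loss weight, and the data-loader format are placeholder assumptions; the adversary receives its own optimizer and update, as in standard adversarial training.

```python
# Assumed sketch of the stepwise incremental pretraining schedule; hyperparameters are placeholders.
import torch
import torch.nn.functional as F

def pretrain_stepwise(model, loader, epochs=(100, 100, 100), lr=1e-3):
    # The encoder, decoder, and classification head share one optimizer;
    # the adversary gets its own, following common adversarial-training practice.
    gen_params = (list(model.encoder.parameters()) + list(model.decoder.parameters())
                  + list(model.cls_head.parameters()))
    opt_g = torch.optim.Adam(gen_params, lr=lr)
    opt_a = torch.optim.Adam(model.adversary.parameters(), lr=lr)

    def run(num_epochs, use_restorative, use_adversarial):
        for _ in range(num_epochs):
            for distorted, original, pseudo_labels in loader:
                loss_d, loss_r, loss_a = united_losses(model, distorted, original, pseudo_labels)
                loss = loss_d                      # discriminative learning stays on in every step
                if use_restorative:
                    loss = loss + loss_r           # incrementally add restorative learning
                if use_adversarial:
                    loss = loss + 0.1 * loss_a     # incrementally add adversarial learning (weight assumed)
                opt_g.zero_grad()
                loss.backward()
                opt_g.step()

                if use_adversarial:
                    # Update the adversary to tell original volumes from restored ones.
                    with torch.no_grad():
                        _, restored = model(distorted)
                    real = model.adversary(original)
                    fake = model.adversary(restored)
                    loss_adv = (F.binary_cross_entropy_with_logits(real, torch.ones_like(real))
                                + F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake)))
                    opt_a.zero_grad()
                    loss_adv.backward()
                    opt_a.step()

    run(epochs[0], use_restorative=False, use_adversarial=False)   # Step D
    run(epochs[1], use_restorative=True,  use_adversarial=False)   # Step D(D+R)
    run(epochs[2], use_restorative=True,  use_adversarial=True)    # Step D(D+R)(D+R+A)
```

Note that, in this sketch, nothing is frozen between steps: earlier losses remain active while new ones are added incrementally, which reflects the "continual" discriminative learning in Steps D(D+R) and D(D+R)(D+R+A).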

Through our extensive experiments, we have observed that (1) discriminative learning alone (i.e., Step D) significantly enhances discriminative encoders on target classification tasks (e.g., +4%, +6%, and +1% AUC (Area Under the ROC Curve) improvement for lung nodule, pulmonary embolism, and pulmonary embolism with vessel-oriented image representation false positive reduction, as shown in Table 3) relative to training from scratch; (2) in comparison with (sole) discriminative learning, incremental restorative pretraining combined with continual discriminative learning (i.e., Step D(D+R)) further enhances discriminative encoders for target classification tasks (e.g., +2%, +4%, and +2% AUC improvement for lung nodule, pulmonary embolism, and pulmonary embolism with vessel-oriented image representation false positive reduction, as shown in Table 3) and boosts encoder–decoder models for target segmentation tasks (e.g., +3%, +7%, and +5% IoU (Intersection over Union) improvement for lung nodule, liver, and brain tumor segmentation, as shown in Table 5); and (3) compared with Step D(D+R), the final stepwise incremental pretraining (i.e., Step D(D+R)(D+R+A)) generates sharper and more realistic medical images (e.g., the Fréchet Inception Distance (FID) decreases from 427.6 to 251.3, as shown in Table 6) and further strengthens each component for representation learning, leading to considerable performance gains (see Fig. 4) and annotation cost reduction (e.g., 28%, 43%, and 26% faster for lung nodule false positive reduction, lung nodule tumor segmentation, and pulmonary embolism false positive reduction, as shown in Fig. 5) for six target tasks across diseases, organs, datasets, and modalities.

We should note that Haghighi et al. (2022) also recently combined discriminative, restorative, and adversarial learning; our findings complement theirs, and, more importantly, our method differs significantly from theirs because they were more concerned with contrastive learning (e.g., MoCo-v2 (Chen et al., 2020), Barlow Twins (Zbontar et al., 2021), and SimSiam (Chen and He, 2021)) and focused on 2D medical image analysis. By contrast, we focus on 3D medical imaging by redesigning nine popular SSL methods beyond contrastive learning. As they acknowledged (Haghighi et al., 2022), their results on TransVW (Haghighi et al., 2021) augmented with an adversary encoder were based on the experiments presented in this paper. Furthermore, this paper focuses on stepwise incremental pretraining to stabilize United model training, revealing new insights into the synergistic effects and contributions among the three learning ingredients.

In summary, we make the following three main contributions:

1. A stepwise incremental pretraining strategy that stabilizes United models’ pretraining and releases the synergistic effects of the three SSL ingredients;

2. A collection of pretrained United models that integrate discriminative, restorative, and adversarial learning in a single framework for 3D medical imaging, encompassing both classification and segmentation tasks; and

3. A set of extensive experiments that demonstrate how various pretraining strategies benefit each SSL method for target tasks across diseases, organs, datasets, and modalities.
