Self-supervised learning for medical image analysis: Discriminative, restorative, or adversarial?

Self-supervised learning (SSL) aims to learn generalizable representations without relying on expert annotations. The mainstream self-supervised representation learning approaches can be categorized into three groups: (1) discriminative learning, which utilizes encoders to distinguish instances from different (pseudo) classes; (2) restorative learning, which employs encoder–decoder models to reconstruct original images from their distorted versions; and (3) adversarial learning, which leverages adversary models to enhance restorative learning. Despite the significant contributions of discriminative, restorative, and adversarial learning to the performance of SSL individually (Chen et al., 2020b, He et al., 2022, Zbontar et al., 2021, Grill et al., 2020) or in pairs (Haghighi et al., 2021, Tao et al., 2020, Zhou et al., 2021a, Hosseinzadeh Taher et al., 2022), the simultaneous exploitation of all three learning ingredients in an SSL scheme remains unexplored (see Fig. 1). This raises the question: can discriminative, restorative, and adversarial learning be seamlessly integrated into a single framework to foster collaborative learning for deep semantic representation, yielding more powerful SSL models that excel across a broad range of applications?

Achieving outstanding performance for medical vision tasks requires a higher level of feature granularity than for computer vision tasks. This requirement stems from the marked disparities between photographic and medical images, which significantly affect the effectiveness of discriminative, restorative, and adversarial learning techniques in each domain. Photographic images, particularly those found in ImageNet, typically consist of large foreground objects with distinct discriminative parts that are set against varying backgrounds (e.g., images of dogs and elephants in Fig. 2). Image recognition tasks in photographic images are therefore primarily based on high-level features captured from discriminative regions. As a result, for computer vision tasks, discriminative learning is preferred, as evidenced by the state-of-the-art performance of discriminative SSL methods, notably instance discrimination learning (Chen et al., 2020b, Chen and He, 2021, Zbontar et al., 2021, Grill et al., 2020, He et al., 2020, Chuang et al., 2022), which surpasses standard supervised ImageNet models on some computer vision benchmarks. Conversely, medical images acquired using specific imaging protocols exhibit consistent anatomical structures (e.g., chest anatomy in Fig. 2), with clinically relevant information dispersed over the entire image (Haghighi et al., 2021). In particular, high-level structural information (i.e., anatomical structures and their relative spatial orientations) is essential for the identification of both normal anatomy and various disorders. Medical tasks also demand heightened attention to fine-grained details within images, as disease identification, organ delineation, and lesion isolation depend on subtle, local variations in texture (Hosseinzadeh Taher et al., 2021).
Thus, medical image recognition tasks rely heavily on the integration of fine-grained discriminative features extracted from the entire image. As a result, for medical vision tasks, restorative learning is preferred, as evidenced by the superior performance of restorative SSL approaches (Zhou et al., 2021b, Chen et al., 2019, Haghighi et al., 2021, Tao et al., 2020, Zhou et al., 2021a, Haghighi et al., 2020, Hosseinzadeh Taher et al., 2022) compared with their discriminative counterparts (Azizi et al., 2021, Zhou et al., 2020, Kaku et al., 2021) on various medical vision benchmarks.

Accordingly, our systematic analysis has revealed that: (1) discriminative learning excels at capturing high-level (global) discriminative features; (2) restorative learning is optimal for preserving fine-grained details embedded in local image regions; and (3) adversarial learning strengthens restoration by preserving more fine-grained details. More importantly, we have acquired a new and intriguing insight into the trio of discriminative, restorative, and adversarial learning for extracting the features required by medical recognition tasks—both high-level anatomical representations and fine-grained discriminative cues embedded in the local parts of medical images.

Based on the insights above, we have designed a novel self-supervised learning framework, named DiRA, that unites discriminative learning, restorative learning, and adversarial learning in a unified manner to glean complementary visual information from unlabeled medical images. Our extensive experiments demonstrate that (1) DiRA encourages collaborative learning among the three learning components, resulting in more generalizable representations across distribution shifts, organs, diseases, and modalities (see Fig. 4); (2) DiRA outperforms fully supervised baseline models and increases robustness in small data regimes, thereby reducing annotation costs in medical imaging (see Table 2, Table 3); (3) DiRA learns fine-grained representations, facilitating more accurate lesion localization with only image-level annotations (see Fig. 5, Fig. 6); (4) DiRA provides highly reusable low/mid-level features, resulting in greater transferability to different medical tasks (see Table 4); and (5) DiRA enhances restorative-based approaches, showing that DiRA is a general framework for unified representation learning (see Fig. 7, Table 5, Table 6, and Fig. 8).
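To make the combination of the three learning ingredients concrete, the overall training objective of such a framework can be viewed as a weighted sum of one loss per ingredient: a discriminative (instance-discrimination) term, a restorative (reconstruction) term, and an adversarial term. The NumPy sketch below illustrates this idea on toy tensors; the specific loss forms (an InfoNCE-style contrastive loss, pixel-wise MSE, and a non-saturating GAN loss) and the weights are illustrative assumptions, not DiRA's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

def discriminative_loss(z1, z2, temperature=0.5):
    # Instance-discrimination (InfoNCE-style) loss: embeddings of two
    # augmented views of the same image (matching rows of z1 and z2)
    # should be more similar than embeddings of different images.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = (z1 @ z2.T) / temperature
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                  # positives on the diagonal

def restorative_loss(restored, original):
    # Pixel-wise MSE between the decoder's reconstruction and the
    # original (undistorted) image.
    return np.mean((restored - original) ** 2)

def adversarial_loss(d_fake):
    # Non-saturating generator-side GAN loss: the adversary's score
    # d_fake on restored images should approach 1 ("looks real").
    return -np.mean(np.log(d_fake + 1e-8))

# Toy batch: embeddings of two views, images, and adversary scores.
z1 = rng.normal(size=(4, 8))
z2 = rng.normal(size=(4, 8))
original = rng.normal(size=(4, 16, 16))
restored = original + 0.1 * rng.normal(size=original.shape)
d_fake = rng.uniform(0.4, 0.9, size=4)

# Hypothetical weighted sum of the three ingredients; the weights are
# placeholders for illustration only.
total = (1.0 * discriminative_loss(z1, z2)
         + 10.0 * restorative_loss(restored, original)
         + 0.1 * adversarial_loss(d_fake))
```

In practice the three terms are computed from shared encoder features, so minimizing the combined objective pushes a single backbone to retain both the global discriminative structure and the local fine-grained detail discussed above.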

In summary, we make the following contributions:

The insights that we have gained into the synergy of discriminative, restorative, and adversarial learning in a ternary setup, realizing a new paradigm of collaborative learning for SSL.

The first self-supervised learning framework that seamlessly unites discriminative, restorative, and adversarial learning in a unified manner.

A thorough and insightful set of experiments that demonstrate not only DiRA’s generalizability but also its potential to take a fundamental step towards developing universal representations for medical imaging.
