Cancers, Vol. 14, Pages 5778: Contrastive Multiple Instance Learning: An Unsupervised Framework for Learning Slide-Level Representations of Whole Slide Histopathology Images without Labels

1. Introduction

Histopathological analyses play a central role in the characterization of biological tissues. Increasingly, whole-slide imaging (WSI) of tissues, in tandem with inexpensive storage and fast networks for data transfer, has made it possible to curate large databases of digitized tissue sections [1]. Furthermore, advances in deep learning have enabled scientists to develop automated histopathological analysis methods for whole-slide images, ranging from primitive applications such as nuclei detection [2] and mitosis detection [3] to more advanced applications such as tumor grading [4].

Despite successful application to various diagnostic and prognostic problems [1,5], methods in computational pathology perpetually rely on painstakingly annotated nuclei, cells, and tissue structures [6,7,8,9,10]. This reliance is driven primarily by the prevalence of annotation-heavy supervised methods in more general computer vision applications [11,12,13]. Unlike general computer vision applications and other medical imaging modalities [14], reliance on annotations severely limits research in computational pathology, as annotations must be performed by expert pathologists [15]. Furthermore, annotations are labor-intensive and often subject to significant inter- and intra-reader variability [1]. Finally, medical image datasets are vanishingly small compared to general-purpose computer vision datasets [14,16]. It is no wonder that recent high-profile publications in computational pathology have moved away from fully supervised methods toward semi- and weakly supervised methods [17,18,19,20].

In continuance with recent trends toward less supervision, our goal is to develop an unsupervised method to learn meaningful, compact representations of WSIs. Such methods exist in other medical imaging modalities [14], but to our knowledge, no such method currently exists for computational pathology. However, several unsupervised (specifically, self-supervised) methods exist to learn patch-wise representations within computational pathology. These works define pretext tasks from which patch-wise feature representations are learned. Such pretext tasks include contrastive predictive coding [21], contrastive learning on adjacent image patches [22], contrastive learning using SimCLR [23,24,25], and SimSiam [26] with an additional stop-gradient for adjacent patches [27]. Many methods utilize generic features derived from ImageNet as their patch-wise feature representations [17,18,28]. Neural image compression [29] compares several self-supervised pretext training tasks to create feature-rich WSI representations but is impeded by dimensionality [30], owing to the sheer number of parameters associated with each compressed slide. Finally, several autoencoder-based methods have been applied to learn compact patch-wise representations without labels in a variety of application areas, including nuclei detection [31,32], cell detection and classification [33], drug efficacy prediction [34], tumor subtype classification [35], and lymph node metastasis detection [36].

By and large, these studies have three main findings. Firstly, self-supervised pretext tasks for learning patch-wise features tend to outperform ImageNet features. This is probably because ImageNet features are generic with respect to everyday objects, whereas features derived from self-supervision are based purely on WSI patches. Furthermore, the transformation space can be tweaked to suit pathology (e.g., scale invariance).
Secondly, there is a saturation point beyond which adding more patches to the pretext task offers no downstream performance gains. This result is not reflected in general-purpose self-supervised learning, where more images result in higher downstream performance. Stacke et al. [23] propose that this may be due to redundancy in WSI patches (i.e., anatomies tend to be repeated). Thirdly, fewer labels are required compared to training from scratch or transfer learning from ImageNet weights. This is also reflected in general-purpose applications.

Owing to these promising empirical findings and theoretical perspectives for patch-wise classification of histopathology images, a few studies have combined self-supervised patch-level representations with weakly supervised multiple instance learning (MIL) [37,38] for WSI analysis. Lu et al. [21] utilized contrastive predictive coding on WSI patches as a pre-training step for subsequent MIL-based classification of breast cancer WSIs. Another study by Li et al. [24] utilized SimCLR [39] on WSI patches as a pre-training step for subsequent MIL-based classification. Fashi et al. [40] utilized contrastive learning with site-of-origin labels as pseudo-labels for pre-training, then applied attention pooling on the resulting embeddings to classify WSIs. All of these studies demonstrated, at the slide level, benefits similar to those previously observed at the patch level.

Despite promising research showing the benefits of self-supervision and MIL, one aspect of WSI analysis that has yet to be addressed in the literature is whether self-supervision can be applied at the slide level via multiple instance learning. Thus, we propose a novel fusion of self-supervision and MIL, which we call SS-MIL, as a method for learning unsupervised representations of WSIs. Our method first trains a patch-wise encoder using SimCLR [39]. Each patch is then embedded to yield a collection of instances for each WSI. Each collection is divided into multiple subsets, each representing the WSI, which MIL then fuses to yield multiple slide-level representations. Two such representations of the same slide form a positive pair for contrastive learning; representations of different slides form negative pairs. The MIL model is then trained using a contrastive loss, so the resulting slide-level representations are created without any supervision. We then apply supervision to these unsupervised representations in both a classification and a regression task. For classification, we subtype NSCLC into lung adenocarcinoma (LUAD, n = 541) and squamous cell carcinoma (LUSC, n = 512) using the publicly available Cancer Genome Atlas (TCGA)-NSCLC dataset. For regression, we score the degree of proliferation (i.e., mitotic activity) in breast cancer using the publicly available TUPAC16 dataset. We demonstrate through ablation experiments that the unsupervised slide-level feature space can easily be fine-tuned using a fraction of labeled slides, indicating that the unsupervised feature space is meaningful. Not only is SS-MIL a novel, label-free approach to computational pathology, but it also creates an opportunity for vast amounts of unlabeled or irrelevantly labeled WSIs to benefit the development of models in computational pathology.
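To make the slide-level contrastive step concrete, the following is a minimal PyTorch sketch of one training step, assuming tile embeddings have already been computed by the SimCLR encoder. The attention-pooling module, the NT-Xent implementation, and all hyperparameters (embedding dimension, bag fraction, temperature) are our own illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionMIL(nn.Module):
    """Attention pooling: fuses a bag of tile embeddings into one slide vector."""
    def __init__(self, dim=512, attn_dim=128):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(dim, attn_dim), nn.Tanh(),
                                  nn.Linear(attn_dim, 1))
        self.proj = nn.Linear(dim, dim)

    def forward(self, bag):                       # bag: (n_tiles, dim)
        a = torch.softmax(self.attn(bag), dim=0)  # one attention weight per tile
        return self.proj((a * bag).sum(dim=0))    # slide-level representation

def nt_xent(z, temperature=0.5):
    """NT-Xent loss; z holds 2N bag embeddings, rows i and i+N are positives."""
    z = F.normalize(z, dim=1)
    sim = z @ z.T / temperature
    sim.fill_diagonal_(float('-inf'))             # never match a view with itself
    n = z.shape[0] // 2
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

def training_step(mil, slide_embeddings, bag_fraction=0.25):
    """slide_embeddings: list of (n_tiles_i, dim) tensors, one per WSI."""
    views = [[], []]
    for tiles in slide_embeddings:
        k = max(1, int(bag_fraction * tiles.shape[0]))
        for v in range(2):                        # two random bags = positive pair
            idx = torch.randperm(tiles.shape[0])[:k]
            views[v].append(mil(tiles[idx]))
    z = torch.stack(views[0] + views[1])          # (2N, dim)
    return nt_xent(z)                             # bags from other slides act as negatives
```

In this sketch, two randomly subsampled bags of the same slide are pulled together while bags from different slides in the batch are pushed apart, which is the sense in which the slide-level representation becomes sampling-invariant without any labels.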

Our main contributions in this paper are: (1) a novel fusion of self-supervision and multiple instance learning at the slide level; (2) an unsupervised method (SS-MIL) to learn representations of WSIs; (3) empirical evidence suggesting that ImageNet features may outperform self-supervised features at the slide level, contrary to previous studies; and (4) empirical evidence via ablation studies demonstrating that the resulting unsupervised feature space is rich.

5. Discussion

5.1. SS-MIL Can Leverage Completely Unannotated Datasets

The lack of need for labels, tissue-level or slide-level, opens up several opportunities for research that are simply not possible with existing methods. SS-MIL allows datasets to be combined even when they carry different kinds of slide-level labels or when no slide-level labels exist for a particular data source (i.e., missing data). For example, the TUPAC16 dataset could be combined with the TCGA-BRCA (breast invasive carcinoma) dataset. Our proposed model could be trained on TCGA-BRCA and then either frozen or fine-tuned on TUPAC16, or vice versa. Furthermore, SS-MIL could be trained on both TCGA-BRCA and TUPAC16 and then frozen or fine-tuned on a smaller, external dataset. This benefit that SS-MIL derives from combining different datasets to learn slide-level representations is impossible with existing methods. Nor are publicly available datasets the only ones that could be combined in this way: hospitals and clinical trials hold large swaths of easily accessible WSIs whose slide-level labels are prohibitively difficult to obtain, irrelevant to the problem of interest, or non-existent [90,91].

5.2. SS-MIL Still Underperforms Compared to Supervised Methods

Overall, SS-MIL's performance is not on par with that of supervised methods in either task. However, this is to be expected. All supervised comparisons (attention-based MIL, CLAM, and Attention2majority) can learn slide-level feature spaces that more easily discriminate NSCLC subtypes or regress proliferation scores. Furthermore, each of these methods, whether utilizing SimCLR-derived features or ImageNet features, partly consists of a shallow feature extractor, allowing the network to learn a tile-level feature space that ultimately contributes to a more discriminable slide-level feature space. By comparison, SS-MIL does not benefit from these advantages. Instead, it relies entirely on transformation invariance as a training target to learn meaningful tile-level features. Likewise, SS-MIL relies on the power of contrastive learning and attention to learn a sampling-invariant representation of each slide rather than a label-dependent feature space.

This is evidenced by three observations. Firstly, fine-tuning SS-MIL clearly outperforms training a new linear layer on the frozen unsupervised slide-level representations. With fine-tuning, only the starting point of the network differs, and a slide-level feature space dependent on the slide-level labels is more easily achieved. Secondly, training any of the comparison methods using SimCLR-derived features and slide-level labels outperforms fine-tuning. In other words, each supervised method can learn a favorable tile-level feature space in which slide-level representations are more easily distinguished. Thirdly, during model training, we observed that all supervised methods always yielded training set losses approaching zero. This did not result in model overfitting, as the early stopping criterion was based on validation loss. However, this behavior stood in strong contrast to SS-MIL: no matter how long SS-MIL was trained, its training loss always converged to a level well above zero and nearly matched the validation and testing losses. This observation supports the notion that the supervised comparisons learn a slide-level feature space that benefits greatly from labels. We also believe it indicates that SS-MIL is self-regularizing to a degree. In summary, supervised methods enjoy several advantages that SS-MIL does not.

Despite the advantages of supervision, the results of the ablation study offer some respite. They suggest that SS-MIL still learns a robust feature space. This is evidenced by the observation that the decrease in performance when utilizing fewer slides for SS-MIL is about the same as for the comparable configurations in the ablation study (i.e., fine-tuning and supervision). Further evidence comes from the apparent maintenance of performance despite the decrease in the number of training slides.

5.3. Generic Features Outperform Histopathology Features

The results also suggest that ImageNet features may be advantageous over histopathology-specific features. Several studies have sought to utilize histopathology-specific tile features for WSI analysis, whether via self-supervision [21,22,24,25,27], weak supervision using slide-level labels [20,65], supervision on unrelated histopathology tasks [29,92], end-to-end training [86,87], or graph-based methods [93]. The argument made by such studies is that histopathology WSI analysis should utilize histopathology-specific tile features. Without context, this makes sense. However, this point of view is entirely upended by the vast body of work across medical image analysis that utilizes transfer learning with small image datasets [94]. Furthermore, and by contrast, several studies use generic ImageNet features [17,18,19] for WSI analysis with great success. These studies and the results presented here call into question the necessity of histopathology-specific features for WSI analysis. Be that as it may, we do not suggest rejecting the prospect of histopathology-specific features altogether, as studies have shown that self-supervision as a means of reducing label load is far superior to transfer learning from ImageNet features [60,61].

5.4. Neighbors as Augmentations Does Not Benefit Downstream MIL

While our results support previous studies' utilization of ImageNet features, they do not support previous findings that considering neighboring tiles as augmentations for self-supervision yields better representations than standard augmentations. In SimTriplet [27], the authors extended SimSiam [26] with an additional stop-gradient arm that operated solely on encoded neighboring patches. The loss was then computed both between the original SimSiam augmentation and the anchor tile and between the neighbor tile and the anchor tile. The idea is that tiles should not only be invariant to their transformations but should also be nearly identical to their neighbors. Their method outperformed SimSiam on a melanoma histopathology tile classification task. We independently recognized this same exploitable aspect of histopathology. However, in all our experiments, SSLn features (with neighbors considered as augmentations) always underperformed compared to SSL features (without neighbors). This contradicts the results presented in SimTriplet [27]. We believe the gap between these two approaches could be narrowed, and perhaps flipped, if our SimCLR encoder had ample time to learn SSLn representations. This is supported by our observation that the contrastive loss of the SSLn SimCLR encoder was higher at stopping time than that of the SSL SimCLR encoder (3% higher on average). We did not train either encoder longer because previous studies showed that doing so does not affect downstream patch-wise classification accuracy [23,60]. However, those studies neither considered neighbors as augmentations nor performed slide-level classification or regression. Perhaps under these conditions, training the encoder longer may improve slide-level performance.

5.5. Why CLAM Outperforms Attention2majority

Contrary to expectations, CLAM outperformed Attention2majority in the NSCLC subtyping task. In a previous study [20], Attention2majority outperformed CLAM in a multi-class kidney grading task and on Camelyon16. We believe this is due to the difference between the tasks.
In NSCLC subtyping and TUPAC proliferation scoring, the signal AB-MIL attends to is well dispersed throughout the tissue. In contrast, in Camelyon16 the signal may be a very small lesion, even smaller than a single tile, and in the kidney grading task the signal is mixed. Attention2majority outperforms CLAM on these latter datasets because of its intelligent sampling scheme, which verifiably biases bags to include tiles that correspond to the overall slide-level label. CLAM utilizes all tiles, so the signal becomes harder to detect. Given that the signal is more dispersed in the NSCLC and TUPAC datasets, this effect is less pronounced, so the advantages afforded by Attention2majority no longer apply.

5.6. Issues with Reproducibility

Contrary to our expectations, we could not reproduce the results reported by CLAM on the NSCLC subtyping task. In the original paper, CLAM achieved an AUC of 0.991 ± 0.004. Granted, we did not have the same experimental setup regarding slide divisions among folds. However, other studies have also been unable to reproduce the results reported by CLAM for NSCLC subtyping: TransMIL reported 0.9377 [25], neural image compression reported 0.9477 [95], and node-aligned graphs reported 0.9320 [96]. Similarly, contrary to our expectations, AB-MIL performed as well as CLAM with ImageNet features and performed better when utilizing SSL or SSLn features. This is corroborated by one other study [96]. These observations highlight the importance of reproducibility in deep learning methodologies.

5.7. Areas for Improvement

SS-MIL has many areas for improvement. First and foremost, given the results presented in the ablation study, a larger, label-free WSI dataset would likely be beneficial. Clearly, increasing the number of slides at the level of contrastive MIL improves downstream performance (Table 2). Similar benefits from more slides may also be derived for the initial tile-level SimCLR pretraining step, as supported by one study [60]. Unlike previous methods, however, SS-MIL would not require slide-level labels from such a larger dataset. Notwithstanding, a larger dataset would not necessarily be beneficial if it were not of the same tissue type or disease category [23].

The proposed method may also benefit from longer training times for the SimCLR encoder (specifically when neighbors are used as augmentations). We may also consider a more apt augmentation space. The original SimCLR augmentation space allows two random crops from the same image to serve as a positive pair. We can modify this random cropping augmentation in two ways. First, our experiments are performed at 20×. When a 20× tile is randomly cropped, the resulting crop must be upsampled. However, since we have access to the WSI, we may instead read the crop at 40× and then downsample it; in this manner, the tile crop augmentations contain more detail. Secondly, we may consider a random center crop as an augmentation rather than a random crop. In other words, a tile is randomly cropped while keeping the center position fixed and then resampled directly from the slide at 40×. This is motivated by the observation that two random crops from the same tile may not overlap; with center cropping, we can be sure that they do. Ultimately, the augmentation space greatly affects downstream performance and may even be dataset-dependent [23,27]. The modifications that could be made with respect to the augmentation space are many [97].
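As an illustration of the proposed center-crop augmentation, the sketch below reads a random center crop directly from the slide at 40× and downsamples it to the 20× working resolution. It assumes openslide-python, that level 0 of the slide is 40× (which must be checked per scanner), and a hypothetical slide path; the tile size and crop range are likewise our own assumptions.

```python
import random
import openslide

def center_crop_from_40x(slide, x20, y20, tile=256, min_frac=0.5):
    """Return a (tile x tile) 20x view of a random center crop of the tile at
    (x20, y20) in 20x coordinates, read at 40x and downsampled so that the
    augmentation preserves detail instead of interpolating it."""
    frac = random.uniform(min_frac, 1.0)          # crop edge as fraction of tile
    crop20 = int(tile * frac)                     # crop edge in 20x pixels
    cx, cy = x20 + tile // 2, y20 + tile // 2     # center position stays fixed
    # convert 20x coordinates/sizes to level-0 (assumed 40x) coordinates/sizes
    x0, y0 = (cx - crop20 // 2) * 2, (cy - crop20 // 2) * 2
    region = slide.read_region((x0, y0), 0, (crop20 * 2, crop20 * 2))
    return region.convert("RGB").resize((tile, tile))  # downsample to 20x scale

slide = openslide.OpenSlide("example.svs")        # hypothetical slide path
augmented = center_crop_from_40x(slide, x20=1024, y20=2048)
```

Because the center is fixed, any two such crops of the same tile are guaranteed to overlap, which is the property motivating this augmentation.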
We could also modify our positive/negative pair generation during contrastive MIL model training. Currently, different bag views are generated by randomly subsampling 25% of the WSI tile embeddings. However, for tasks in which the region of interest is very small, such as metastasis detection, this sampling method will miss regions of interest and thus generate positive pairs in which one bag contains the region of interest and the other does not. Alternatively, we could generate multiple bags for the same slide by shifting the starting position of the tile grid, as sketched below. This way, each bag contains all tiles from the WSI but remains distinct from the other bags of the same WSI. Additional augmentation policies could also be applied to WSI bags, including random perturbation and random zeroing [98].
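The grid-shift idea can be sketched as follows; the function names and the NumPy-based tiling helper are our own illustrative choices under the assumption that tiles are laid out on a regular, non-overlapping grid.

```python
import numpy as np

def grid_tile_origins(width, height, tile=256, shift=(0, 0)):
    """Top-left corners of a tile grid overlaid on the WSI, offset by `shift`.
    Each distinct shift yields a bag that covers the whole slide but tiles it
    differently, so two shifted bags of one WSI form a natural positive pair."""
    sx, sy = shift
    xs = np.arange(sx, width - tile + 1, tile)
    ys = np.arange(sy, height - tile + 1, tile)
    return [(x, y) for x in xs for y in ys]

def bags_for_slide(width, height, tile=256, n_bags=2, seed=0):
    rng = np.random.default_rng(seed)
    shifts = rng.integers(0, tile, size=(n_bags, 2))  # random grid offsets
    return [grid_tile_origins(width, height, tile, tuple(s)) for s in shifts]

# Two full-coverage, mutually distinct bags for a hypothetical slide plane:
bag_a, bag_b = bags_for_slide(width=40_000, height=30_000)
```

Unlike 25% subsampling, every bag here contains tiles spanning the entire slide, so a small region of interest cannot be absent from one member of a positive pair.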
5.8. Implications of SS-MIL

SS-MIL enables researchers to benefit from datasets for which no label information is available. Its innovation lies (1) in exploiting the subtlety of histopathology that neighboring tissue structures are likely to carry the same (or very similar) clinical information and thus should be represented similarly as embeddings, and (2) in the unsupervised nature of the proposed model. As the learned WSI embeddings have general applicability to many machine learning tasks (classification and regression), many other applications can benefit from these results.

Furthermore, we have yet to examine the attention weights from SS-MIL. We hypothesized that the features learned in the bag-level encodings correspond to the slide features with the maximum variance across the dataset, and that the instances with the highest attention would correspond to these histopathological features. It may prove difficult to support such a hypothesis with this dataset. However, with a hand-crafted dataset containing a diversity of tissue structures (such as one WSI per organ), we may indeed be able to demonstrate that SS-MIL learns the imaging features with the highest variance across a given dataset.

6. Conclusions

In conclusion, we have presented a method to learn compact representations of WSIs without supervision. Our method trains a patch-wise encoder using SimCLR. Each patch is embedded to yield a collection of instances for each WSI. Each collection is divided into several subsets, each representing the WSI, which MIL then fuses to yield multiple slide-level representations. Two such representations of the same slide form a positive pair for contrastive learning, representations of different slides form negative pairs, and the MIL model is trained using a contrastive loss. The resulting unsupervised representations can be utilized to classify and regress WSIs against weakly labeled outcomes. We applied our method to both NSCLC subtyping and TUPAC proliferation scoring, achieving an AUC of 0.8641 ± 0.0115 and an R2 of 0.5740 ± 0.0970, respectively. Though our method does not achieve the same level of performance as supervised attention-based MIL, CLAM, or Attention2majority, we have shown through ablation experiments that the slide-level feature space it learns is robust and that its performance is likely limited solely by the number of available slides. In future experiments, we plan to apply our method to larger datasets to observe whether the apparent benefits of increasing the number of slides (as evidenced in the ablation studies) continue to yield performance gains. We expect that the performance of our method may indeed exceed that of supervised methods when limited labels are available. Secondly, we plan to modify our tile-level augmentation space to more accurately reflect histopathology-specific transformation invariance (i.e., center crops). We will also perform separate experiments to find an optimal transformation policy. Similarly, we plan to modify our slide-level augmentation space (via shifting the overlaid grid or random zeroing) to represent each slide fully rather than randomly subsampling as in the current study. Thirdly, we will apply the resulting method to Camelyon16, a breast cancer sentinel lymph node metastasis dataset, in conjunction with novel, in-house MIL models. From a technical standpoint, our proposed method is a novel approach to computational pathology in which meaningful features can be learned from WSIs without any annotations. From a practical standpoint, the proposed method can benefit computational pathology by theoretically enabling researchers to benefit from vast amounts of unlabeled or irrelevantly labeled WSIs.
