Our main contributions in this paper are: (1) a novel fusion of self-supervision at the slide level; (2) an unsupervised method (SS-MIL) for learning representations of WSIs; (3) empirical evidence suggesting that ImageNet features may outperform self-supervised features at the slide level, contrary to previous studies; and (4) empirical evidence, via ablation studies, demonstrating that the resulting unsupervised feature space is rich.
5. Discussion

5.1. SS-MIL Can Leverage Completely Unannotated Datasets

The lack of a need for labels, whether tissue-level or slide-level, opens up several opportunities for research that are simply not possible with existing methods. SS-MIL allows datasets to be combined even though they may have different kinds of slide-level labels, or when no slide-level labels exist for a particular data source (i.e., missing data). For example, the TUPAC16 dataset could be combined with the TCGA-BRCA (breast invasive carcinoma) dataset. Our proposed model could be trained on TCGA-BRCA and then either frozen or fine-tuned on TUPAC16, or vice versa. Furthermore, SS-MIL could be trained on both TCGA-BRCA and TUPAC16 and then frozen or fine-tuned on a smaller, external dataset. The benefit SS-MIL derives from combining different datasets to learn slide-level representations is impossible with existing methods. Nor are publicly available datasets the only ones that could be combined in this way. Hospitals and clinical trials hold large swaths of WSIs that are easily accessible, but whose slide-level labels are prohibitively difficult to obtain, irrelevant to the problem of interest, or non-existent [90,91].

5.2. SS-MIL Still Underperforms Compared to Supervised Methods

Overall, SS-MIL's performance is not on par with that of supervised methods in either task. However, this is to be expected. All supervised comparison methods (attention-based MIL, CLAM, and Attention2majority) can learn slide-level feature spaces that more easily discriminate NSCLC subtypes or regress proliferation scores. Furthermore, each of these methods, whether utilizing SimCLR-derived features or ImageNet features, includes a shallow trainable feature extractor, allowing the network to learn a tile-level feature space that ultimately contributes to a more discriminable slide-level feature space. By comparison, SS-MIL does not benefit from these advantages.
Instead, it relies entirely on the objective of transformation invariance to learn meaningful tile-level features. Likewise, SS-MIL relies on the power of contrastive learning and attention to learn a sampling-invariant representation of each slide rather than a label-dependent feature space.
This is evidenced by three observations. First, fine-tuning SS-MIL clearly outperforms training a new linear layer on the frozen unsupervised slide-level representations. With fine-tuning, only the starting point of the network differs, and a slide-level feature space dependent on the slide-level labels is more easily reached. Second, training any of the comparison methods with SimCLR-derived features and slide-level labels outperforms fine-tuning. In other words, each supervised method can learn a favorable tile-level feature space in which slide-level representations are more easily distinguished. Third, during model training, we observed that all supervised methods always yielded training-set losses approaching zero. This did not result in overfitting, as the early stopping criterion was based on validation loss, but the behavior stood in strong contrast to SS-MIL: no matter how long SS-MIL was trained, its training loss converged to a level well above zero and nearly matched the validation and testing losses. This observation supports the notion that the supervised comparisons learn a slide-level feature space that benefits greatly from labels, and we believe it also indicates that SS-MIL is self-regularizing to a degree. In summary, supervised methods enjoy several advantages that SS-MIL does not.
Despite the advantages of supervision, the results of the ablation study offer some respite, suggesting that SS-MIL still learns a robust feature space. This is evidenced by the observation that the decrease in performance when utilizing fewer slides for SS-MIL is about the same as for the comparable settings in the ablation study (i.e., fine-tuning and supervision). Further evidence comes from the relative stability of performance as the number of training slides decreases.
5.3. Generic Features Outperform Histopathology Features

The results also suggest that ImageNet features may be advantageous over histopathology-specific features. Several studies have sought to utilize histopathology-specific tile features for WSI analysis, whether via self-supervision [21,22,24,25,27], weak supervision using slide-level labels [20,65], supervision on unrelated histopathology tasks [29,92], end-to-end training [86,87], or graph-based methods [93]. The argument made by such studies is that histopathology WSI analysis should utilize histopathology-specific tile features. Without context, this makes sense. However, this point of view is upended by the vast body of work across medical image analysis that utilizes transfer learning with small image datasets [94]. Furthermore, and by contrast, several studies use generic ImageNet features [17,18,19] for WSI analysis with great success. These studies and the results presented here call into question the necessity of histopathology-specific features for WSI analysis. Be that as it may, we do not suggest rejecting histopathology-specific features altogether, as studies have shown that self-supervision as a means of reducing label load is far superior to transfer learning from ImageNet features [60,61].

5.4. Neighbors as Augmentations Does Not Benefit Downstream MIL

Contrary to supporting previous studies' utilization of ImageNet features, our results do not support previous findings that considering neighboring tiles as augmentations for self-supervision yields better representations than standard augmentations. In SimTriplet [27], the authors extended SimSiam [17] with an additional stop-gradient arm that operated solely on encoded neighboring patches. The loss was then computed both between the original SimSiam augmentation and the anchor tile and between the neighbor tile and the anchor tile.
The idea here is that tiles should not only be invariant to their transformations but also be nearly identical to their neighbors. Their method outperformed SimSiam on a melanoma histopathology tile classification task. We independently recognized this same aspect of histopathology as exploitable by self-supervision. However, in all our experiments, SSLn features (with neighbors considered as augmentations) always underperformed compared to SSL features (without neighbors). This contradicts the results presented in SimTriplet [27]. We believe the gap between the two approaches could be narrowed, and possibly reversed, if our SimCLR encoder had ample time to learn SSLn representations. This is supported by our observation that the contrastive loss for the SSLn SimCLR encoder was higher at stopping time than for the SSL SimCLR encoder (3% higher on average). We did not let either encoder train longer because previous studies showed that doing so does not affect downstream patch-wise classification accuracy [23,60]. However, these studies neither considered neighbors as augmentations nor performed slide-level classification or regression. Under these conditions, training the encoder longer may improve slide-level performance.

5.5. Why CLAM Outperforms Attention2majority

Contrary to expectations, CLAM outperformed Attention2majority in the NSCLC subtyping task. In a previous study [20], Attention2majority outperformed CLAM in a multi-class kidney grading task and on Camelyon16. We believe this is due to the difference between the tasks. In NSCLC subtyping and TUPAC proliferation scoring, the signal AB-MIL attends to is well-dispersed throughout the tissue. In contrast, in Camelyon16 the signal may be a very small lesion, even smaller than a single tile, and in the kidney grading task the signal is mixed.
Attention2majority outperforms CLAM on these latter datasets because of its intelligent sampling scheme, which verifiably biases bags to include tiles that correspond to the overall slide-level label. CLAM utilizes all tiles, so the signal becomes harder to detect. Because the signal is more dispersed in the NSCLC and TUPAC datasets, this effect is less pronounced, and the advantages afforded by Attention2majority no longer apply.

5.6. Issues with Reproducibility

Contrary to our expectations, we could not reproduce the results reported by CLAM on the NSCLC subtyping task. In the original paper, CLAM achieved an AUC of 0.991 ± 0.004. Granted, we did not have the same experimental setup regarding slide divisions among folds. However, other studies have also been unable to reproduce the results reported by CLAM for NSCLC subtyping: TransMIL reported 0.9377 [25], neural image compression reported 0.9477 [95], and node-aligned graphs reported 0.9320 [96]. Similarly, contrary to our expectations, AB-MIL performed as well as CLAM with ImageNet features and performed better when utilizing SSL or SSLn features. This is corroborated by one other study [96]. These observations highlight the importance of reproducibility in deep learning methodologies.

5.7. Areas for Improvement

SS-MIL has many areas for improvement. First and foremost, given the results presented in the ablation study, a larger, label-free WSI dataset would likely be beneficial. Clearly, increasing the number of slides at the level of contrastive MIL improves downstream performance (Table 2). Similar benefits from more slides may also be derived for the initial tile-level SimCLR pretraining step, as supported by one study [60]. Unlike previous methods, however, SS-MIL would not require slide-level labels from such a larger dataset. Notwithstanding, such datasets would not necessarily be beneficial if they were not of the same tissue type or disease category [23].
The proposed method may also benefit from longer training times for the SimCLR encoder (specifically when neighbors are used as augmentations). We may also consider a more apt augmentation space. The original SimCLR augmentation space allows two random crops from the same image to serve as a positive pair. We can modify this random-cropping augmentation in two ways. First, our experiments are performed at 20×. When randomly cropping a 20× tile, the resulting crop is upsampled. However, since we have access to the WSI, we may instead extract the crop at 40× and then downsample it; in this manner, the tile-crop augmentations contain more detail. Second, we may consider a random center crop rather than a random crop: a tile is randomly cropped while keeping the center position fixed and then resampled directly from the slide at 40×. This is motivated by the observation that two random crops from the same tile may not even be adjacent, whereas two center crops are guaranteed to overlap. Ultimately, the augmentation space greatly affects downstream performance and may even be dataset-dependent [23,27]. The modifications that could be made with respect to the augmentation space are many [97]. We could also modify our positive/negative pair generation during contrastive MIL model training. Currently, different bag views are generated by randomly subsampling 25% of the WSI tile embeddings. However, for tasks in which the region of interest is very small, such as metastasis detection, this sampling method will miss regions of interest and thus generate positive pairs in which one bag contains the region of interest and the other does not. Alternatively, we could generate multiple bags for the same slide by shifting the starting position of the tile grid. This way, each bag contains all tiles from the WSI but remains distinct from the other bags from the same WSI.
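As an illustrative sketch of these two bag-construction strategies, the current random subsampling and the proposed shifted-grid alternative, the following minimal numpy code may help; the function names, tile size, and shift offsets are our own illustrative choices, not the exact implementation used in this work.

```python
import numpy as np

def subsampled_bag_views(embeddings, frac=0.25, n_views=2, rng=None):
    """Current strategy: each bag view is a random subset of a slide's tile
    embeddings.  Two views of the same slide form a positive pair for
    contrastive MIL training; views of different slides form negative pairs.

    embeddings: (n_tiles, dim) array of tile embeddings for one WSI.
    """
    rng = rng or np.random.default_rng()
    n = embeddings.shape[0]
    k = max(1, int(round(frac * n)))
    return [embeddings[rng.choice(n, size=k, replace=False)]
            for _ in range(n_views)]

def shifted_grid_bags(slide_w, slide_h, tile=256, shifts=((0, 0), (128, 128))):
    """Proposed alternative: each bag uses a tile grid with a different
    starting offset, so every bag covers the whole slide yet contains
    different tiles.  Returns one list of (x, y) tile origins per shift.
    """
    bags = []
    for dx, dy in shifts:
        origins = [(x, y)
                   for y in range(dy, slide_h - tile + 1, tile)
                   for x in range(dx, slide_w - tile + 1, tile)]
        bags.append(origins)
    return bags
```

Note the design difference: with shifted grids, a small lesion is guaranteed to appear in every bag view (possibly split across tile boundaries), whereas random 25% subsampling may omit it from one view of a positive pair.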
Additional augmentation policies could also be applied to WSI bags, including random perturbation and random zeroing [98].

5.8. Implications of SS-MIL

SS-MIL enables researchers to benefit from datasets for which no label information is available. Its innovation lies (1) in the subtlety of histopathology that neighboring tissue structures are likely to carry largely the same (or very similar) clinical information and thus should be represented similarly as embeddings, and (2) in the unsupervised nature of the proposed model. As the learned WSI embeddings have general applicability to many machine learning tasks (classification and regression), many other applications can benefit from these results.
Furthermore, we have yet to examine the attention weights from SS-MIL. We hypothesized that the features learned in bag-level encodings correspond to the slide features with the maximum variance across the dataset, and that the instances with the highest attention would correspond to these histopathological features. It may prove difficult to support such a hypothesis with this dataset. However, with a hand-crafted dataset containing a diversity of tissue structures (such as one WSI per organ), we may indeed be able to demonstrate that SS-MIL learns the imaging features with the highest variance across a given dataset.
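As a sketch of how such attention weights could be inspected, the following minimal numpy example implements AB-MIL-style gated attention pooling and ranks instances by attention; the layer shapes and function names are illustrative assumptions, not the exact SS-MIL architecture.

```python
import numpy as np

def attention_pool(instances, w, v):
    """AB-MIL-style attention pooling: score each tile embedding, softmax
    over the bag, and return the attention-weighted mean as the slide
    embedding, along with the per-instance attention weights.

    instances: (n, d) tile embeddings; w: (d, h) projection; v: (h,) scorer.
    """
    scores = np.tanh(instances @ w) @ v       # (n,) raw attention scores
    a = np.exp(scores - scores.max())
    a /= a.sum()                              # softmax attention weights
    return a @ instances, a

def top_attended(a, k=5):
    """Indices of the k most-attended instances; these are the tiles one
    would visualize when probing the maximum-variance hypothesis."""
    return np.argsort(a)[::-1][:k]
```

Inspecting `top_attended` tiles across a deliberately diverse dataset (e.g., one WSI per organ) is one way the hypothesis above could be tested.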
6. Conclusions

We have presented a method to learn compact representations of WSIs without supervision. Our method trains a patch-wise encoder using SimCLR. Each patch is embedded to yield a collection of instances for each WSI. Each collection is divided into several subsets, each representing the WSI, which MIL then fuses to yield multiple slide-level representations. Two such representations of the same slide form a positive pair for contrastive learning; representations of different slides form negative pairs. The MIL model is then trained with a contrastive loss. The resulting unsupervised representations can be used to classify and regress WSIs against weakly labeled outcomes. We applied our method to both NSCLC subtyping and TUPAC proliferation scoring, achieving an AUC of 0.8641 ± 0.0115 and an R2 of 0.5740 ± 0.0970, respectively. Though our method does not reach the performance of supervised attention-based MIL, CLAM, or Attention2majority, we have shown through ablation experiments that the slide-level feature space it learns is robust and that its performance is likely limited chiefly by the number of available slides. In future experiments, we plan to apply our method to larger datasets to observe whether the apparent benefits from increasing the number of slides (as evidenced in the ablation studies) continue to yield performance gains. We expect that the performance of our method may indeed exceed that of supervised methods when limited labels are available. Secondly, we plan to modify our tile-level augmentation space to more accurately reflect histopathology-specific transformation invariance (i.e., center crops), and we will perform separate experiments to find an optimal transformation policy. Similarly, we plan to modify our slide-level augmentation space (via shifting the overlaid grid or random zeroing) so that each bag represents the slide fully rather than by random subsampling as in the current study.
Thirdly, we will apply the resulting method to Camelyon16, a breast cancer sentinel lymph node metastasis dataset, in conjunction with novel, in-house MIL models. From a technical standpoint, our proposed method is a novel approach to computational pathology in which meaningful features can be learned from WSIs without any annotations. From a practical standpoint, the proposed method can benefit computational pathology by theoretically enabling researchers to leverage vast amounts of unlabeled or irrelevantly labeled WSIs.
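The contrastive training step summarized above can be illustrated with a minimal NT-Xent-style loss over paired slide embeddings; the temperature value and function name are illustrative assumptions, and embedding rows are assumed L2-normalized.

```python
import numpy as np

def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent-style contrastive loss for a batch of slide-embedding pairs:
    two views of the same slide are pulled together, while views of
    different slides are pushed apart.

    z1, z2: (batch, d) slide embeddings, L2-normalized rows; row i of z1
    and row i of z2 are two bag views of the same slide (a positive pair).
    """
    z = np.vstack([z1, z2])                  # (2B, d) all views
    sim = z @ z.T / temperature              # scaled cosine similarities
    np.fill_diagonal(sim, -np.inf)           # exclude self-similarity
    B = z1.shape[0]
    # for each row, the index of its positive partner
    pos = np.concatenate([np.arange(B, 2 * B), np.arange(B)])
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return float(np.mean(logsumexp - sim[np.arange(2 * B), pos]))
```

As a sanity check, the loss is lower when the two views of each slide are identical (maximal cosine similarity for every positive pair) than when they are unrelated, which is the gradient signal that drives the bags toward sampling-invariant slide representations.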