Histopathology language-image representation learning for fine-grained digital pathology cross-modal retrieval

Automatic analysis and processing of histopathology whole slide images (WSIs) have become a popular research topic in digital pathology (Koohbanani et al., 2021, Li et al., 2022b, Cao et al., 2023, Lee et al., 2022), especially for the applications of WSI classification (Zheng et al., 2022, Chanchal et al., 2023, Hameed et al., 2022) and segmentation (Falk et al., 2019, Lutnick et al., 2022, Saednia et al., 2022).

Content-based histopathological image retrieval (CBHIR) (Zheng et al., 2018, Hu et al., 2020) is an emerging computer-aided diagnosis application that enables pathologists to input a region of interest and search a well-established database for regions that are similar in semantics and patterns (Hegde et al., 2019, Chen et al., 2022). Compared to predicting classification labels and segmentation maps, CBHIR is a more flexible and comprehensive way to aid pathologists in making diagnoses. However, current region-level CBHIR methods often require extensive pixel-wise annotations (Gu and Yang, 2019, Shi et al., 2018, Zheng et al., 2022), which are expensive and time-consuming to obtain in practice. WSI-level CBHIR methods can be trained with slide-level labels (Wang et al., 2023, Kalra et al., 2020a), but they struggle to provide fine-grained results for the heterogeneous regions within a WSI.

In clinical practice, pathologists write diagnostic reports based on WSIs, and these reports contain rich semantic information. They could be used to train a CBHIR system without additional manual annotation, which has the potential to alleviate data scarcity. Motivated by this, we aim to investigate a representation learning model between histopathology images and diagnosis reports for CBHIR, following the paradigm of contrastive language-image pre-training (CLIP) (Radford et al., 2021, Jia et al., 2021, Zhou et al., 2022), which has shown remarkable success in training models for image-text cross-modal applications. However, WSIs pose unique challenges for building CLIP-style models: they are far larger and more complex than natural scene images. Moreover, diagnostic reports tend to use abstract and generalized vocabulary rather than detailed visual descriptions, and different pathologists exhibit varying descriptive habits when writing reports. These issues make it difficult to apply current cross-modal learning methods directly to histopathology WSIs and diagnostic reports.
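As background, a CLIP-style model aligns paired image and text embeddings with a symmetric contrastive (InfoNCE) objective. The following PyTorch snippet is a minimal sketch of that general paradigm only; it is not the specific loss formulation of the proposed framework.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) tensors from the two encoders.
    Matched pairs share the same batch index.
    """
    # L2-normalize so the dot product equals cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix scaled by the temperature
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image-to-text and text-to-image)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```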

In this paper, we propose a novel fine-grained image-language representation learning framework for cross-modal retrieval of whole slide images (WSIs) and diagnosis reports. We introduce an anchor-based attention module that extracts hierarchical regional features of WSIs from micro to macro scales. Additionally, we design a prompt-based text representation learning scheme to guide the learning of semantic information in the reports. The proposed framework enables four types of retrieval tasks over the multi-modal database, corresponding to the four application scenarios listed below; a minimal retrieval sketch follows the list.

(1) Image-to-Image: The input is the WSI, or a sub-region of it, from the currently diagnosed case, and the retrieval returns semantically similar WSIs or regions from the database. This is the most widely discussed application in CBHIR studies.

(2) Image-to-Text: The input is the WSI, or a sub-region of it, from the currently diagnosed case, and the retrieval returns diagnosis reports of related cases from the database. This is useful when searching for cases whose reports have been archived in traditional form but whose slides have not yet been digitized.

(3) Text-to-Image: The input is a textual description, and the retrieval returns the most semantically similar WSIs or regions from the database. This is helpful when pathologists need a reference case matching a given description and must search a database that lacks diagnostic texts, for instance, a public online pathology communication platform.

(4) Text-to-Text: This application matches relevant cases through the textual modality, which offers strong summarization ability and high accuracy.
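Once WSIs (or their sub-regions) and reports have been encoded into a shared embedding space, all four scenarios reduce to nearest-neighbor search in that space. The PyTorch sketch below illustrates this retrieval step; the function name and interface are illustrative assumptions rather than the actual implementation of the proposed system.

```python
import torch
import torch.nn.functional as F

def retrieve(query_emb, database_emb, top_k=5):
    """Return indices of the top-k database entries most similar to the query.

    query_emb: (dim,) embedding of a WSI, a sub-region, or a text description.
    database_emb: (n, dim) embeddings of database entries (images or reports).
    Because both modalities share the same space, the same routine covers
    image-to-image, image-to-text, text-to-image, and text-to-text retrieval.
    """
    query_emb = F.normalize(query_emb, dim=-1)
    database_emb = F.normalize(database_emb, dim=-1)
    similarities = database_emb @ query_emb  # cosine similarity per entry
    return torch.topk(similarities, k=top_k).indices
```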

The contributions of this paper can be summarized in three aspects:

(1) We propose a novel fine-grained cross-modal representation learning model for four types of retrieval tasks between WSI and diagnosis reports, which can be trained automatically without handcrafted annotations. To our knowledge, this is the first study to tackle the problem of fine-grained cross-modal retrieval between histopathology WSIs and diagnosis reports.

(2) We propose a novel anchor-prompt alignment scheme to establish connections between region representations in the WSI and keywords in the diagnosis reports. Specifically, we introduce a hierarchical kernel attention module to learn region features of various sizes and construct a prompt list that aids the encoding of diagnosis texts. Additionally, we design a weakly supervised constraint based on prompts and anchors to realize fine-grained language-image alignment (an illustrative sketch of anchor-based attention is given after this list).

(3) We conducted extensive experiments on an in-house gastric dataset and the public GastricADC dataset (Kervadec et al., 2019). The ablation study demonstrates the effectiveness of the anchor-based attention module, and the visualization results illustrate the capability of our anchor-prompt alignment scheme to capture fine-grained semantic information of WSIs from the diagnosis reports. Furthermore, we benchmarked the proposed method against several state-of-the-art retrieval methods and obtained superior performance.
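To make the anchor idea in contribution (2) concrete, the sketch below shows one plausible form of anchor-based attention, in which a set of learnable anchor queries cross-attends over patch embeddings to produce region-level features. This is an illustrative assumption about the general mechanism, not the exact hierarchical kernel attention module proposed in this work.

```python
import torch
import torch.nn as nn

class AnchorAttentionPooling(nn.Module):
    """Illustrative anchor-based attention: learnable anchor queries
    cross-attend over patch embeddings to yield region-level features
    (an assumed simplification, not the paper's exact module)."""

    def __init__(self, dim, num_anchors=16, num_heads=8):
        super().__init__()
        # Learnable anchors act as queries over the bag of patch features
        self.anchors = nn.Parameter(torch.randn(num_anchors, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_feats):
        # patch_feats: (batch, num_patches, dim) embeddings of WSI patches
        queries = self.anchors.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        region_feats, _ = self.attn(queries, patch_feats, patch_feats)
        return region_feats  # (batch, num_anchors, dim) region-level features
```

Anchors of different receptive scopes could, in principle, be stacked to obtain regional features from micro to macro scales, but the precise hierarchical design follows the method section of the paper.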
