GenAI synthesis of histopathological images from Raman imaging for intraoperative tongue squamous cell carcinoma assessment

Patient cohort and demographics

This study involved a cohort of 15 patients diagnosed with TSCC who were admitted to our facility between January and December 2023. All participants had received pathological confirmation of TSCC before inclusion in this study, and none exhibited any other systemic disease or history of medication use. Prior to the commencement of the study, informed consent was obtained from all patients, and the research protocol was approved by the Ethics Committee of the First Affiliated Hospital of Xiamen University (Grant No. XMYY-2022KYSB030) and the Ethics Committee of West China Hospital of Stomatology, Sichuan University (Grant No. WCHSIRB-CT-2023-438). Among the 15 TSCC patients, 12 were admitted early in 2023 (January to July) and were assigned to the training and validation process. The remaining three patients, newly diagnosed and admitted in late 2023, were allocated to the external test set for model evaluation. Demographic information for the TSCC patients is detailed in Supplementary Table 1.

Preparation of tissue slides

Tissue samples from patients with TSCC and corresponding normal muscle tissues were obtained from surgical resection specimens in the operating room. The normal muscle tissues were specifically harvested from areas located 5 mm beyond the tumor margins. Both the tumor and muscle tissues were processed with a freezing microtome to generate standard sections, which were subsequently subjected to H&E staining for histopathological confirmation. Simultaneously, 6-µm-thick tissue sections from the same specimens were sliced with the freezing microtome and affixed to custom-made pure aluminum slides for subsequent Raman spectroscopic scanning. In total, 30 H&E-stained slides were prepared for histopathological imaging and confirmation, along with an additional 30 slides for Raman imaging, derived from the samples of the 15 patients diagnosed with TSCC.

Generation of H&E images

The fixed tissue samples underwent a systematic dehydration process utilizing an automatic dehydrator. The dehydration sequence involved sequential immersion in various alcohol concentrations, totaling 5 h and 20 min: 75% alcohol for 1 h, 85% alcohol for 1 h, 95% alcohol I for 50 min, 95% alcohol II for 50 min, 100% alcohol I for 50 min, and 100% alcohol II for 50 min. Subsequently, a mixture comprising 100% alcohol and xylene in a 1:1 ratio was applied for 20 min, followed by xylene I for 25 min and xylene II for 25 min. To achieve optimal embedding, the tissues underwent additional steps, including immersion in paraffin I for 1 h, paraffin II for 2 h, and paraffin III for 3 h. Following this comprehensive dehydration and embedding process, the tissues were prepared for sectioning.

For deparaffinization, sections underwent a meticulous process involving xylene treatment. This included immersion in xylene I for 5–10 min, followed by xylene II for 5–10 min. Subsequently, the sections underwent sequential immersions in absolute ethanol I for 5 min, absolute ethanol II for 5 min, 95% alcohol for 5 min, 85% alcohol for 5 min, and 75% alcohol for 5 min. Finally, the sections were soaked in ultrapure (UP) water for 5 min, completing the deparaffinization process. This systematic approach ensured complete removal of paraffin, preparing the sections for further analysis.

Following deparaffinization, the sections underwent a staining process to highlight cellular structures. Initially, the sections were stained with hematoxylin for 10–20 min, then rinsed with tap water for 1–3 min. Subsequent differentiation occurred in hydrochloric acid alcohol for 5–10 s, followed by another rinse with tap water for 1–3 min. The sections were then blued in warm water at 50 °C or a weak alkaline aqueous solution until achieving a blue coloration. After a thorough tap water rinse for 1–3 min, the sections were immersed in 85% alcohol for 3–5 min. Eosin staining followed, lasting 3–5 min, and another rinse with tap water for 3–5 s. The subsequent steps included dehydration in graded alcohols, clearing in xylene, and finally, mounting with neutral resin.

Raman imaging for tissue sections

Raman imaging of tissue slides was conducted using the Nanophoton Raman-11 laser Raman microscope (Nanophoton, Japan). The slides were positioned on a motorized XYZ stage to ensure precise spatial alignment. To achieve comprehensive coverage, eight regions of interest were systematically recorded across various sections of each slide. The 532 nm excitation laser beam was focused onto the sample through a 20×, 0.45-NA Nikon objective. Linear scanning imaging was employed to generate Raman images of the tissue samples, with the scanning parameters configured as follows: a lateral range of 400 µm and a vertical range of 50 µm, both at a resolution of 1 µm. Lateral linear scanning was executed with an exposure time of 3 s at a laser power of ~0.2 mW, and each region was scanned in 2.5 min.

ImgAlignNet model

As a key component of our methodology, the ImgAlignNet (Supplementary Fig. 2b), analogous to CLIP, captures and aligns features from Raman and H&E images. The ImgAlignNet was developed to address two challenges:

1. Unlike text-image models, which generally rely on spatial information from only the image modality, ImgAlignNet must manage the inherent spatial information in both Raman images and H&E images, focusing on aligning their local contents.

2. Limited data in clinical settings constrains Transformer architectures, which require extensive parameter tuning and benefit from training on larger datasets. ImgAlignNet therefore needs to handle small datasets.

To address the alignment challenge, the model adopts ViT’s segmentation strategy to partition the encoded spatial data into patches. It then leverages an approach inspired by VQ-VAE, in which the same seed is used to generate corresponding targets in both the Raman and H&E latent spaces, ensuring coherent alignment. Since the targets can be aligned by using the same seed, CLIP’s contrastive loss can be replaced by a basic binary classification loss for TSCC/normal classification, avoiding the issue of limited samples. To further address the limited sample size, an SVM-inspired classification framework, using cosine similarity as the distance measure between patches and their corresponding targets, was incorporated into the model.
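As a preview of the seed-to-target mechanism formalized in equations (1)–(4) below, the following is a minimal PyTorch sketch. The layer sizes follow the values reported under “Model implementations”; the module structure, the number of targets, and the use of separate per-modality MLPs sharing one set of seeds are assumptions for illustration.

```python
import torch
import torch.nn as nn

def target_mlp(seed_dim=64, target_dim=128):
    # Three fully connected layers, with LayerNorm + SiLU between each pair,
    # matching the layer layout reported under "Model implementations".
    return nn.Sequential(
        nn.Linear(seed_dim, target_dim), nn.LayerNorm(target_dim), nn.SiLU(),
        nn.Linear(target_dim, target_dim), nn.LayerNorm(target_dim), nn.SiLU(),
        nn.Linear(target_dim, target_dim),
    )

class SeedTargets(nn.Module):
    """Eq. (1): shared class-associated seeds generate aligned targets in both
    the Raman and H&E latent spaces (n_targets is an assumed value)."""
    def __init__(self, n_classes=2, n_targets=8, seed_dim=64, target_dim=128):
        super().__init__()
        # Seeds drawn from independent normal distributions, one set per class.
        self.seeds = nn.Parameter(torch.randn(n_classes, n_targets, seed_dim))
        self.f_raman = target_mlp(seed_dim, target_dim)  # f_R^C
        self.f_he = target_mlp(seed_dim, target_dim)     # f_HE^C

    def forward(self):
        # Both outputs have shape [n_classes, n_targets, target_dim]; using the
        # same seeds keeps the Raman and H&E targets in correspondence.
        return self.f_raman(self.seeds), self.f_he(self.seeds)
```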

In more detail, data from both modalities are first downsampled using a CNN-pooling architecture, reducing the dimensionality and thus easing the computational load for subsequent processes. The data are then segmented into patches using the ViT strategy, yielding \(Patch_{R,i}\) for Raman images and \(Patch_{HE,j}\) for H&E images, where \(i\) and \(j\) represent the indices of the respective patches. After segmentation, each patch is flattened into a one-dimensional vector of dimensionality \(d\), where \(d\) equals the product of the patch size \(h\times w\) and the number of channels \(c\) after the last convolution layer. The seeds are initialized with dimension \(d_{seed}\), and targets of dimension \(d\), matching the patches, are generated from the seeds through a few fully connected layers with nonlinearity operations. Specifically, the seeds are associated with the final classes (i.e., normal and cancer), leading to the generation of targets within their respective latent spaces:

$$Target_{M,k}^{C}=f_{M}^{C}\left(Seed_{C,k}\right)$$

(1)

Where \(C\in \left\{\mathrm{normal},\,\mathrm{cancer}\right\}\), \(M\in \left\{\mathrm{R},\,\mathrm{HE}\right\}\), and \(k\) is the index of seeds and targets. With the targets established, distances in each class can be calculated as follows:

$$dis_{M,i,k}^{C}=\cos \left(Patch_{M,i},\,Target_{M,k}^{C}\right)$$

(2)

Where \(i\) serves as a general index for patches in both modalities, corresponding to index \(i\) for \(\mathrm{R}\) and \(j\) for \(\mathrm{HE}\) in the original data. Based on the distances, the logits can be calculated using the equation:

$$logit_{M,i}^{C}=\sum _{k}w_{M,k}^{C}\,\exp \left(\lambda \left(1+dis_{M,i,k}^{C}\right)\right)$$

(3)

Where \(w_{M,k}^{C}\) and \(\lambda\) are trainable nonnegative parameters, denoting the weight and temperature factor, respectively. Equation \(\left(3\right)\) first scales the range of cosine similarity values from \(\left[-1,\,1\right]\) to \(\left[0,\,2\lambda \right]\). The exponential function applied after scaling predominantly amplifies larger values, making them more influential along the feature axis. Following this, a weighted summation across the target axis is performed to compute the logits. The final loss value is calculated as follows:

$$loss^{M}=\mathrm{CE}\left(softmax_{C}\left(logit_{M,i}^{C}\right),\,label^{M}\right)$$

(4)

The equations above ensure that the model’s assessment focuses on the larger cosine similarity values, reflecting the importance of strong similarity between patches and targets in the overall calculation.
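To make equations (2)–(4) concrete, here is a minimal PyTorch sketch of the distance, logit, and loss computation for one modality. The batch shapes, the mean pooling over patches before the cross-entropy, and the clamp used to keep the weights nonnegative are assumptions not specified above; the label smoothing value follows the one reported under “Model implementations”.

```python
import torch
import torch.nn.functional as F

def alignnet_loss(patches, targets, w, lam, labels):
    """Sketch of Eqs. (2)-(4) for one modality (shapes are assumptions).
    patches: [B, P, d] flattened patch vectors; targets: [C, K, d];
    w: [C, K] trainable weights; lam: scalar temperature; labels: [B]."""
    # Eq. (2): cosine similarity between every patch and every class target -> [B, P, C, K]
    dis = F.cosine_similarity(patches[:, :, None, None, :],
                              targets[None, None, :, :, :], dim=-1)
    # Eq. (3): shift [-1, 1] to [0, 2], scale by lam, exponentiate,
    # then weight-sum over the target axis -> [B, P, C]
    logits = (w.clamp(min=0) * torch.exp(lam * (1.0 + dis))).sum(dim=-1)
    # Eq. (4): softmax over classes happens inside the cross-entropy; pooling
    # patch logits by mean is an assumed choice, label smoothing follows 0.15.
    return F.cross_entropy(logits.mean(dim=1), labels, label_smoothing=0.15)
```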

Diffusion model

The diffusion model, another key component in our framework, is employed to generate H&E images (Fig. 2b). The diffusion model is trained using noised images and corresponding conditioning information, with the objective of transforming a random noise distribution into a coherent image structure during the inference process. In our approach, the primary effort involves integrating the ImgAlignNet model’s outputs into this diffusion process, which enhances the model’s ability to generate H&E images in a more controlled and less random manner. This enhancement is achieved by utilizing the features extracted from Raman images differently in the training and inference stages.

In more detail, during the training phase of the diffusion model, each diffused H&E image is combined with all corresponding \(Target_{HE,k}^{C}\) that share the same class label, serving as conditioning inputs. In the inference phase, by contrast, the conditioning is generated from a Raman image. Initially, the distances \(dis_{R,i,k}^{C}\) are computed according to equation \(\left(2\right)\), where \(i\) and \(k\) are the indices of patches and targets, respectively. Then, a masking mechanism is applied to filter out smaller distance values using the equation:

$$mask_{R,i,k}^{C}=\left\{\begin{array}{ll}1 & \mathrm{if}\;dis_{R,i,k}^{C}\, > \,thre_{C,i,k}\\ -\infty & \mathrm{otherwise}\end{array}\right.$$

(5)

In equation \(\left(5\right)\), \(thre_{C,i,k}\) can vary across the three axes, indicating a dynamic threshold. After experimenting with various approaches, a two-step top-K selection process was adopted for setting the thresholds: it first selects the K-th largest values along the patch axis, and then performs another top-K selection along the combined class and target axis (see the sketch after equation (6)). Having obtained the mask, the conditioning for the inference process is computed in conjunction with the aligned H&E targets using the equation:

$$con_{i}=\sum _{C,k}softmax_{C,k}\left(mask_{R,i,k}^{C}\cdot dis_{R,i,k}^{C}\right)\,Target_{HE,k}^{C}$$

(6)

Here, \(softmax_{C,k}\) denotes the SoftMax operation applied along the combined axis of classes and targets. Equations \(\left(5\right)\) and \(\left(6\right)\) follow the attention mechanism, where the attention matrix is computed using a temperature-modified cosine similarity distance, with a sparse filter applied to eliminate smaller values. After filtering retains only the larger distance values, a SoftMax operation emphasizes the most significant interactions, and the resulting weights are linearly combined with the aligned H&E targets to guide image generation in the reverse diffusion loops.
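A minimal sketch of equations (5) and (6) follows. The top-K sizes are assumed values, and the mask is applied by overwriting rejected entries with \(-\infty\) rather than by literal multiplication, which is equivalent for positive scores and numerically safer; it also assumes each patch retains at least one surviving entry.

```python
import torch

def two_step_topk_mask(dis, k_patch=16, k_ct=8):
    """Eq. (5): dynamic thresholds via two-step top-K (k_patch, k_ct assumed).
    dis: [P, C, K] cosine distances between Raman patches and class targets."""
    P, C, K = dis.shape
    # Step 1: per (class, target), threshold at the k_patch-th largest value
    # along the patch axis.
    thr1 = dis.topk(k_patch, dim=0).values[-1]            # [C, K]
    # Step 2: per patch, threshold along the combined class/target axis.
    flat = dis.reshape(P, C * K)
    thr2 = flat.topk(k_ct, dim=1).values[:, -1:]          # [P, 1]
    return (dis >= thr1) & (flat >= thr2).reshape(P, C, K)

def raman_conditioning(dis, keep, targets_he):
    """Eq. (6): attention-style conditioning from the masked distances.
    targets_he: [C, K, d] aligned H&E targets."""
    P, C, K = dis.shape
    # Rejected entries become -inf, so they contribute zero softmax weight.
    scores = torch.where(keep, dis, torch.full_like(dis, float('-inf')))
    attn = torch.softmax(scores.reshape(P, C * K), dim=1)  # [P, C*K]
    # Linear combination of aligned H&E targets guides the reverse diffusion.
    return attn @ targets_he.reshape(C * K, -1)            # [P, d]
```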

Model implementations

In total, for training ImgAlignNet, 88 × 20 = 1,760 H&E sub-images and 8 × 20 = 160 Raman sub-images were generated for the training dataset; 88 × 4 = 352 H&E sub-images and 8 × 4 = 32 Raman sub-images for the validation dataset; and 88 × 6 = 528 H&E sub-images and 8 × 6 = 48 Raman sub-images for the testing dataset. For the diffusion model, the number of generated H&E sub-images increased to 180 × 20 = 3,600 for the training dataset, 180 × 4 = 720 for the validation dataset, and 180 × 6 = 1,080 for the testing dataset. The number of generated Raman sub-images remained the same as those used for training ImgAlignNet.

The detailed aspects of our model development are depicted in Fig. 2b, and pseudocode is provided in the Supplementary Methods. First, the H&E images were resized, using the resize operation from the torchvision library, to a resolution at which 1 pixel corresponds to \(1\,\mu m^{2}\), matching the Raman images. An unfold-like operation was then developed to extract sub-images from the processed images. For the ImgAlignNet model, each H&E sub-image has the shape \(\left[3,\,256,\,256\right]\), extracted with a stride of 32 pixels in both height and width, and each Raman sub-image has the shape \(\left[1340,\,400,\,48\right]\), extracted without stride. For the diffusion model, the H&E sub-images were extracted at a different size, with the other settings kept the same.
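The resize-then-unfold extraction might be sketched as follows; the use of torchvision’s functional resize on tensors and the target-size argument are assumptions, while the window size and stride follow the ImgAlignNet values above.

```python
import torch
from torchvision.transforms.functional import resize

def extract_subimages(img, out_hw, size=256, stride=32):
    """Resize an H&E image so 1 pixel ~ 1 um^2, then extract overlapping
    sub-images. img: [3, H, W]; out_hw: assumed target (height, width)."""
    img = resize(img, list(out_hw), antialias=True)
    # Unfold along height, then width: [3, nH, nW, size, size]
    patches = img.unfold(1, size, stride).unfold(2, size, stride)
    # -> [nH * nW, 3, size, size]
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, img.shape[0], size, size)
```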

Secondly, the ImgAlignNet model employs separate initial convolution layers to process the Raman and H&E images, whose outputs are then individually passed through respective CNN downsampling blocks. Each downsampling block is composed of two residual CNN blocks followed by a downsampling layer. Each residual CNN block comprises two convolution layers, each preceded by a groupNorm layer and a SiLU activation layer, and a residual shortcut is implemented by summing the block’s input with the output of its last convolution layer. The final downsampling layer is a CNN layer with a stride of 2. Using the two downsampling blocks, the dimensions of the H&E images were reduced from \(\left[3,\,256,\,256\right]\) to \(\left[64,\,64,\,64\right]\), and those of the Raman images from \(\left[1340,\,400,\,48\right]\) to \(\left[64,\,100,\,12\right]\). Subsequently, ViT segmentation was employed to split the data into 256 and 75 patches of size \(\left[64,\,4,\,4\right]\) for the H&E and Raman images, respectively. Seed arrays were generated from independent normal distributions with a dimension of 64 and then expanded to a dimension of 128 through three fully connected layers, with a layerNorm layer and a SiLU layer inserted between each pair of fully connected layers. The patches were then flattened and projected to a size of 128 by another fully connected layer for the cosine similarity calculation and subsequent classification. The temperature factor \(\lambda\) was initialized at \(\mathrm{ln}\left(2\right)\approx 0.6931\), and a label smoothing mechanism with a value of 0.15 was employed for the final cross-entropy computation. Notably, after exploring various approaches, we decided to pre-train the CNN downsampling blocks for H&E images using an auto-encoder. This auto-encoder is composed of the CNN downsampling blocks, each paired with a corresponding upsampling block that differs only in the final layer, where the downsampling layer is replaced by an upsampling layer. When training the ImgAlignNet model, we extracted and froze the downsampling blocks from this pretrained auto-encoder to abstract the image features. This freezing proved crucial, as it prevented the overfitting that was otherwise evident in the validation loss values. By contrast, our tests found that using a pretrained auto-encoder for the Raman images adversely affected classification performance.
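A sketch of the residual and downsampling blocks described above, assuming 3×3 convolutions, eight normalization groups, and a 1×1 projection on the shortcut when channel counts change (none of which are specified in the text):

```python
import torch
import torch.nn as nn

class ResidualCNNBlock(nn.Module):
    """Two (GroupNorm -> SiLU -> Conv) units with a residual shortcut summing
    the block's input with the output of its last convolution layer."""
    def __init__(self, in_ch, out_ch, groups=8):
        super().__init__()
        self.block = nn.Sequential(
            nn.GroupNorm(groups, in_ch), nn.SiLU(),
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.GroupNorm(groups, out_ch), nn.SiLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
        )
        # 1x1 projection (assumed) so the shortcut matches the output channels.
        self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        return self.skip(x) + self.block(x)

def downsampling_block(in_ch, out_ch):
    """Two residual CNN blocks followed by a stride-2 downsampling convolution."""
    return nn.Sequential(
        ResidualCNNBlock(in_ch, out_ch),
        ResidualCNNBlock(out_ch, out_ch),
        nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1),
    )
```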

Finally, the diffusion model employs a UNET structure to synthesize the H&E image with the Raman targets as conditioning. It comprises five CNN downsampling blocks, each halving the image dimensions, and an equal number of upsampling blocks that sequentially restore the image to its original size. Following the common UNET design, outputs from the downsampling blocks also serve as shortcut inputs to the corresponding upsampling blocks, concatenated with the respective inputs of the upsampling pathway. Self-attention and cross-attention layers, placed before the down/upsampling layers, are implemented only where the H&E feature resolution is 16 × 16 or smaller, to balance the computational burden. When training the diffusion UNET, the Raman targets are integrated as conditioning through these cross-attention layers.
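As an illustration of how the Raman-derived conditioning could enter the UNET at low resolutions through cross-attention, here is a minimal sketch; the head count, normalization placement, and residual wiring are assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionConditioning(nn.Module):
    """Cross-attention from UNET feature-map tokens (queries) to the Raman
    conditioning sequence (keys/values), applied where H, W <= 16."""
    def __init__(self, channels, cond_dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, kdim=cond_dim,
                                          vdim=cond_dim, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x, cond):
        # x: [B, C, H, W] feature map; cond: [B, P, cond_dim] Raman conditioning.
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)         # [B, H*W, C]
        attended, _ = self.attn(self.norm(tokens), cond, cond)
        tokens = tokens + attended                    # residual connection
        return tokens.transpose(1, 2).reshape(b, c, h, w)
```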

The running environment and parameters

During training, the AdamW optimizer was used for both models with an initial learning rate of 1E-5 and a decay rate of 1E-5. MSE (mean squared error) loss was employed for both the auto-encoder and the diffusion model, while cross-entropy loss was used for the ImgAlignNet model. Since the ImgAlignNet model is designed as a multi-task classification model, the accuracy score was used to evaluate classification performance. Training was conducted on up to 16 NVIDIA RTX 3080 GPUs in a cluster, each with 10 GB of graphics memory. All models were developed using PyTorch, with version 1.13 in the cluster environment and version 2.0 for local development. Gradients were accumulated so that an optimizer update occurred every 64 samples, equivalent to an effective mini-batch size of 64 regardless of the number of GPUs, achieved by combining the distributed data parallel (DDP) module with gradient accumulation. For the ImgAlignNet model, considering the limited dataset size, we implemented an early stopping mechanism with a threshold of 300 epochs (about 10,000 steps), monitored by the minimal loss value on the validation set. In contrast, the diffusion model was trained for a total of 1,000 epochs (about 33,000 steps), with checkpoints saved every 25 epochs. The best validation checkpoint of the diffusion model was preserved as well, even though early stopping was not employed.
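A minimal sketch of the gradient-accumulation scheme, assuming one sample per loader step; the function name and loop structure are illustrative only.

```python
import torch
import torch.nn.functional as F

def train_epoch(model, optimizer, loader, device, accum=64):
    """Accumulate gradients so that optimizer updates occur every `accum`
    samples, giving an effective mini-batch of 64 regardless of GPU count."""
    model.train()
    optimizer.zero_grad()
    for step, (x, y) in enumerate(loader, start=1):
        loss = F.cross_entropy(model(x.to(device)), y.to(device))
        (loss / accum).backward()     # scale so accumulated gradients average
        if step % accum == 0:         # one update per 64 samples
            optimizer.step()
            optimizer.zero_grad()
```

Under DDP, the per-step gradient all-reduce between updates can additionally be skipped with the module’s `no_sync()` context to reduce communication.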
