Channel-wise attention enhanced and structural similarity constrained cycleGAN for effective synthetic CT generation from head and neck MRI images

Data acquisition and processing

One hundred and sixty nasopharyngeal carcinoma (NPC) patients who underwent volumetric-modulated arc radiotherapy (VMAT) between Dec 2020 and Dec 2021 in our hospital were enrolled in this study. The planning CT (pCT) images and simulation MRI were acquired on a CT scanner (SOMATOM Definition AS, Siemens Medical Systems) and an MRI scanner (Ingenia 3.0T, Philips Medical Systems), respectively. This study was approved by the medical ethics committee of our hospital (2022ky298). The original dataset was split into a training set (120 patients), a validation set (20 patients), and a test set (20 patients), which contain 5734, 954, and 935 images, respectively. The dimensions of both the CT and MR images were 512 × 512 on the axial plane, while the spatial resolution was 1.27 mm × 1.27 mm × 3.00 mm for the CT scans and 0.729 mm × 0.729 mm × 3.00 mm for the MR scans. All images were preprocessed with in-house developed software as follows. First, all MR and CT images were resampled to a resolution of 1 mm × 1 mm × 1 mm, and the CT and MRI were then rigidly registered to align them. After that, deformable registration from CT to MRI was performed. The pCT images were then used as the benchmark for the sCTs during training and testing.
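As an illustration of the resampling step, the sketch below uses SimpleITK; the authors used in-house software, so the library choice, file names, and interpolation settings here are assumptions, not the authors' actual pipeline.

```python
# Minimal sketch of resampling to 1 mm x 1 mm x 1 mm voxels (assumed
# SimpleITK-based; the paper's in-house software is not described in code).
import SimpleITK as sitk

def resample_isotropic(image, spacing=(1.0, 1.0, 1.0)):
    """Resample an image to the given isotropic voxel spacing."""
    original_spacing = image.GetSpacing()
    original_size = image.GetSize()
    new_size = [
        int(round(osz * osp / nsp))
        for osz, osp, nsp in zip(original_size, original_spacing, spacing)
    ]
    return sitk.Resample(
        image,
        new_size,
        sitk.Transform(),     # identity transform; registration is done later
        sitk.sitkLinear,      # linear interpolation (assumed)
        image.GetOrigin(),
        spacing,
        image.GetDirection(),
        0,                    # default value for voxels outside the image
        image.GetPixelID(),
    )

# Hypothetical file names for illustration only.
ct = resample_isotropic(sitk.ReadImage("pCT.nii.gz"))
mr = resample_isotropic(sitk.ReadImage("MR.nii.gz"))
```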

Architecture of cycleSimulationGAN

Figure 1 illustrates the overall workflow of our method. The original MRI is passed through a generator to produce sCT images, and the sCT images are then passed through a second generator to reconstruct the original MR images, thus forming a cycle. The original cycleGAN learns to discriminate between sCT and real CT (unpaired data) using the discriminator of the generative adversarial network under the constraint of the corresponding loss function, and the generator gradually learns to synthesize CT of better quality. In addition, cycleGAN constrains the reconstructed MR to be as similar to the original MR as possible by applying an L1 loss between the original MR and the reconstructed MR as a cycle loss [45]. On this basis, our method introduces the following two innovations, which effectively improve the structural preservation and the degree of detail recovery of the synthetic CT results; background interference is also effectively suppressed. The cycleSimulationGAN architecture is shown in Fig. 2.
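For concreteness, a minimal PyTorch sketch of this cycle-consistency constraint is given below; the generator names and the cycle-loss weight `lambda_cyc` are illustrative assumptions, not the authors' exact settings.

```python
# Minimal sketch of the cycle-consistency (L1) constraint of cycleGAN.
# G_MR2CT and G_CT2MR stand for the two generators; lambda_cyc is assumed.
import torch.nn as nn

l1 = nn.L1Loss()

def cycle_consistency_loss(real_mr, G_MR2CT, G_CT2MR, lambda_cyc=10.0):
    sct = G_MR2CT(real_mr)     # MR -> synthetic CT
    rec_mr = G_CT2MR(sct)      # synthetic CT -> reconstructed MR
    return lambda_cyc * l1(rec_mr, real_mr)
```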

Fig. 1 The overall process of this study

Fig. 2 Proposed cycleSimulationGAN architecture used to map an MRI image to a CT image

Structural similarity constraint based on contour mutual information loss

The contours of the original MRI and the synthesized CT are first extracted, and the contours of the synthesized CT are then constrained to be more similar to those of the original MRI by a mutual information loss function. To extract the contours of the original MRI and synthetic CT effectively, pixels of the original MR and synthetic CT with values greater than a certain threshold are set to 1, and pixels with values below the threshold are set to 0; this yields the body-region segmentation of the two images of different modalities while effectively preserving the body contour. Then, to effectively measure the similarity of the regional contours of the MRI and CT images, a contour loss based on mutual information is proposed. For the regional contours extracted from the original MRI and synthetic CT images, the following mutual information formulation is used to calculate contour similarity:

$$ \mathcal{L}_{MI}=\sum _{x\in C_{I_{MR}}}\sum _{y\in C_{G\left(I_{MR}\right)}}p\left(x,y\right)\text{log}\left(\frac{p\left(x,y\right)}{p\left(x\right)p\left(y\right)}\right)$$

(1)

where \( C_{I_{MR}} \) and \( C_{G\left(I_{MR}\right)} \) denote the contours extracted from the original MR image \( I_{MR} \) and the synthesized CT image \( G\left(I_{MR}\right) \), \( p\left(x\right) \) and \( p\left(y\right) \) represent the probability distributions of these two contours, respectively, and \( p\left(x,y\right) \) represents their joint probability distribution. It can be observed that, when calculating the loss between contours of different modalities, the above formula no longer attends to low-level per-pixel intensity differences but to the statistical distributions of the contours of the different modalities, and thus focuses on the similarity of the overall, higher-level contours, as sketched below.
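The sketch below illustrates the two steps just described: threshold-based body-mask extraction and histogram-based mutual information. The threshold value is an assumption (the paper does not state it), and this NumPy version is not differentiable as written; a differentiable variant (e.g., soft histograms) would be needed inside an actual training loss.

```python
# Illustration of contour extraction (thresholding) and Eq. (1): mutual
# information between two binary contour masks via their joint histogram.
import numpy as np

def body_mask(image, threshold):
    """Pixels above the (assumed, modality-specific) threshold become 1."""
    return (image > threshold).astype(np.float64)

def contour_mutual_information(mask_mr, mask_sct, eps=1e-10):
    joint, _, _ = np.histogram2d(mask_mr.ravel(), mask_sct.ravel(), bins=2)
    p_xy = joint / joint.sum()               # joint distribution p(x, y)
    p_x = p_xy.sum(axis=1, keepdims=True)    # marginal p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)    # marginal p(y)
    return np.sum(p_xy * np.log((p_xy + eps) / (p_x @ p_y + eps)))
```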

Channel-wise attention enhanced mechanism

To enhance the feature representation capability of the deep network and extract more effective image features, a channel-wise attention mechanism is used to selectively enhance the multi-channel features (the channel dimension is 256 in our experiment) extracted by the encoder of the U-Net-structured generator in the cycleGAN architecture. The channel-wise attention operation flow chart is shown in Fig. 3. Through end-to-end learning, this attention mechanism over the feature-channel dimension strengthens the feature channels that carry more information and suppresses those that are redundant.

Fig. 3 The channel-wise attention operation flow chart

The channel attention map \( \mathbf{X}\in \mathbb{R}^{C\times C} \) is calculated directly from the original features \( \mathbf{A}\in \mathbb{R}^{C\times H\times W} \). Specifically, \( \mathbf{A} \) is reshaped to \( \mathbb{R}^{C\times N} \) (where \( N=H\times W \)), and a matrix multiplication is then performed between \( \mathbf{A} \) and the transpose of \( \mathbf{A} \). Finally, a softmax layer is applied to obtain the channel attention map \( \mathbf{X}\in \mathbb{R}^{C\times C} \):

$$ x_{ji}=\frac{\text{exp}\left(A_{i}\cdot A_{j}\right)}{\sum _{i=1}^{C}\text{exp}\left(A_{i}\cdot A_{j}\right)}$$

(2)

where \( x_{ji} \) measures the \( i^{th} \) channel's impact on the \( j^{th} \) channel. In addition, we perform a matrix multiplication between the transpose of \( \mathbf{X} \) and \( \mathbf{A} \) and reshape the result to \( \mathbb{R}^{C\times H\times W} \). Then we multiply the result by a scale parameter \( \beta \) and perform an element-wise sum with \( \mathbf{A} \) to obtain the final output \( \mathbf{E}\in \mathbb{R}^{C\times H\times W} \):

$$ E_{j}=\beta \sum _{i=1}^{C}\left(x_{ji}A_{i}\right)+A_{j}$$

(3)

where \( \beta \) is a weight that is gradually learned from an initial value of 0. Equation 3 shows that the final feature of each channel is a weighted sum of the features of all channels plus the original features, which models long-range semantic dependencies between feature maps and helps to boost feature discriminability. Note that we do not employ convolution layers to embed the features before computing the relationships between two channels, since this maintains the relationships between the different channel maps. In addition, we exploit spatial information at all corresponding positions to model the channel correlations.
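A minimal PyTorch sketch of the channel-wise attention of Eqs. 2 and 3 is given below. It follows the equations as written; details such as batch handling and the spatial size in the usage example are assumptions.

```python
# Sketch of the channel-wise attention module of Eqs. (2)-(3).
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(1))  # beta learned from 0
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, a):                  # a: (B, C, H, W)
        b, c, h, w = a.size()
        a_flat = a.view(b, c, -1)          # reshape A to (B, C, N), N = H*W
        # Eq. (2): energy[j, i] = A_j . A_i; softmax over i gives x_ji
        energy = torch.bmm(a_flat, a_flat.transpose(1, 2))
        attention = self.softmax(energy)   # X in R^{C x C}
        # Eq. (3): E_j = beta * sum_i (x_ji * A_i) + A_j
        out = torch.bmm(attention, a_flat).view(b, c, h, w)
        return self.beta * out + a

# Usage on the 256-channel encoder features mentioned above
# (the 32 x 32 spatial size is an assumption for illustration):
features = torch.randn(1, 256, 32, 32)
enhanced = ChannelAttention()(features)
```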

Experimental evaluation and statistical analysis

Four criteria, the mean absolute error (MAE), root-mean-square error (RMSE), peak signal-to-noise ratio (PSNR), and structural similarity index (SSIM), were used to evaluate the performance of the different synthetic image generators in terms of pixel-wise HU accuracy, noise level, and structural similarity. The formulas of MAE, RMSE, PSNR, and SSIM are shown in Eqs. 4, 5, 6, and 7, respectively.

$$ \text{MAE}=\frac{1}{N_{x}N_{y}}\sum _{i=1}^{N_{x}}\sum _{j=1}^{N_{y}}\left|sCT\left(i,j\right)-CT\left(i,j\right)\right|$$

(4)

$$ \text{RMSE}=\sqrt{\frac{1}{N_{x}N_{y}}\sum _{i=1}^{N_{x}}\sum _{j=1}^{N_{y}}{\left(sCT\left(i,j\right)-CT\left(i,j\right)\right)}^{2}}$$

(5)

$$ \text{PSNR}=10\times {\text{log}}_{10}\left(\frac{{MAX}^{2}}{\frac{1}{N_{x}N_{y}}\sum _{i=1}^{N_{x}}\sum _{j=1}^{N_{y}}{\left(sCT\left(i,j\right)-CT\left(i,j\right)\right)}^{2}}\right)$$

(6)

$$ \text{SSIM}=\frac{\left(2{\mu }_{sCT}{\mu }_{CT}+c_{1}\right)\left(2{\sigma }_{sCT,CT}+c_{2}\right)}{\left({\mu }_{sCT}^{2}+{\mu }_{CT}^{2}+c_{1}\right)\left({\sigma }_{sCT}^{2}+{\sigma }_{CT}^{2}+c_{2}\right)}$$

(7)

where \( sCT\left(i,j\right) \) is the value of the pixel at \( \left(i,j\right) \) in the sCT, \( CT\left(i,j\right) \) is the value of the pixel at \( \left(i,j\right) \) in the pCT, \( N_{x}N_{y} \) is the total number of pixels in one slice, MAX is the maximum intensity in the sCT, \( {\mu }_{sCT} \) and \( {\mu }_{CT} \) are the means of the pixel values of the sCT and pCT images, \( {\sigma }_{sCT} \) and \( {\sigma }_{CT} \) are their standard deviations, \( {\sigma }_{sCT,CT} \) is their covariance, and \( c_{1} \) and \( c_{2} \) are small constants that stabilize the division. MAE and RMSE measure the difference between the generated image and the original image: the larger the value, the worse the quality, and the smaller the value, the better the accuracy of the prediction model. The denominator inside the PSNR logarithm is the energy of the difference between the generated image and the original image, which is equivalent to the noise; the smaller the noise, the higher the PSNR. MAE, PSNR, and RMSE are all computed from gray-value differences, while SSIM mainly considers image contrast, brightness, and structure information. The larger the SSIM value, the more similar the images, with a maximum value of 1.
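The sketch below computes these four metrics for one slice; delegating SSIM to scikit-image rather than re-implementing Eq. 7 is an implementation choice of this sketch, not something stated by the authors.

```python
# Per-slice evaluation metrics of Eqs. (4)-(7).
import numpy as np
from skimage.metrics import structural_similarity

def evaluate_slice(sct, ct):
    diff = sct.astype(np.float64) - ct.astype(np.float64)
    mae = np.mean(np.abs(diff))                                  # Eq. (4)
    rmse = np.sqrt(np.mean(diff ** 2))                           # Eq. (5)
    psnr = 10.0 * np.log10(sct.max() ** 2 / np.mean(diff ** 2))  # Eq. (6), MAX from sCT
    ssim = structural_similarity(sct, ct,                        # Eq. (7)
                                 data_range=ct.max() - ct.min())
    return mae, rmse, psnr, ssim
```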

To validate and evaluate the performance of the proposed cycleSimulationGAN, two additional approaches were carried out for comparison purposes: the original cycleGAN and the structural-similarity-constrained cycleGAN (SSC-cycleGAN). SSC-cycleGAN is derived from cycleGAN by adding the contour consistency loss function that explicitly imposes structural constraints between the reconstructed MR and the original MR, aiming to effectively preserve structural features such as contour shape during sCT generation, as described above as innovation 1 of cycleSimulationGAN. The three networks (original cycleGAN, SSC-cycleGAN, and cycleSimulationGAN) were trained on the same training and validation datasets under the same environment.

All related parameters were determined through extensive experiments and quantitative analysis over multiple rounds of tuning. The proposed deep network architecture was implemented in PyTorch, and training/validation was run on an Nvidia GeForce RTX 3090 GPU (24 GB memory) with CUDA acceleration. The Adam algorithm was chosen as the optimizer to minimize the L1 loss function, with a learning rate of 1 × 10⁻⁵; the maximum number of epochs was set to 200 and the convolution kernel size was set to 3 × 3. Training the generation model takes about 46 h of computation time; with the trained model, generating the sCT for a new case from MR data takes only a few tenths of a second. All statistical analyses comparing the three models used a paired t-test; a p-value ≤ 0.05 was considered statistically significant.
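The snippet below mirrors this stated training configuration; the placeholder network and dummy tensors are illustrative only and do not represent the authors' actual generator or dataset.

```python
# Optimizer and training-loop skeleton matching the stated configuration
# (Adam, learning rate 1e-5, up to 200 epochs, 3 x 3 convolutions).
import torch
import torch.nn as nn

generator = nn.Conv2d(1, 1, kernel_size=3, padding=1)  # placeholder network
optimizer = torch.optim.Adam(generator.parameters(), lr=1e-5)

mr = torch.randn(4, 1, 64, 64)   # dummy MR batch
ct = torch.randn(4, 1, 64, 64)   # dummy paired CT batch

for epoch in range(200):         # maximum number of epochs
    optimizer.zero_grad()
    loss = nn.functional.l1_loss(generator(mr), ct)  # L1 loss
    loss.backward()
    optimizer.step()
```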
