Ultra high speed SPECT bone imaging enabled by a deep learning enhancement method: a proof of concept

Subjects and image acquisition

Patients underwent whole-body bone imaging after the injection of 25–30 mCi (925–1110 MBq) of technetium-99m methylene diphosphonate (99mTc-MDP) at Shanghai East Hospital. This study was approved by the Institutional Review Board, and all patients signed informed consent before the examination. The SPECT/CT data were collected on a Siemens Symbia Intevo T16 with two consecutive protocols: one standard scan at 20 s per projection (referred to as the standard SPECT) and one fast scan at 3 s per projection (referred to as the 1/7 SPECT). Sixty projections were acquired in each scan. Projection data were then reconstructed with the ordered-subset conjugate gradient (OSCG) algorithm using the CT-based attenuation map, with 2 subsets and 28 iterations and no post-smoothing. Low-dose CT images were acquired at 130 kV and 10 effective mAs and reconstructed using the smooth attenuation-correction kernel B31s with a 3 mm slice thickness. The resolution of each SPECT image is 256 × 256, and 200 slices were collected per scan; each SPECT voxel represents a 1.95 mm × 1.95 mm × 1.95 mm space. The resolution of the CT images is 512 × 512, and 131 slices were collected for each patient; each CT voxel represents a 0.97 mm × 0.97 mm × 3 mm space. All patients were instructed to remain still during the examination to preserve image quality and spatial correspondence, and unmatched 1/7 SPECT and standard SPECT pairs were discarded. Twenty matched groups (11 males; mean age: 56 years; age range: 26–75 years) of fast and standard SPECT/CT images were collected for further research. Ten subjects were used for training the proposed deep learning model, while the remaining 10 subjects were reserved for testing the synthesized images. One example of the training data is shown in Fig. 1.

Fig. 1

One example from the training dataset. The left images are coronal and axial views of the fused SPECT and CT images; the middle images are coronal and axial views of the 1/7 SPECT; the right images are coronal and axial views of the standard SPECT

A National Electrical Manufacturers Association (NEMA) International Electrotechnical Commission (IEC) Body Phantom Set was used, which is a hollow plexiglass cylinder with a volume of 9700 ml containing 6 spheres of different diameters (10, 13, 17, 22, 28 and 37 mm). The centers of the spheres all lie on a circle 5 cm from the central axis of the cylinder, in a plane 70 mm from the upper surface of the cylinder. Quantitative SPECT/CT tomography was performed at 20 s and 3 s per view, with 60 views in total, at a sphere-to-background activity ratio of 12:1 one hour after filling. In total, 200 slices of SPECT images and 131 slices of CT images were collected. The first 80 SPECT slices and the 53 corresponding CT slices were added to the training dataset, and the remaining matched images were treated as testing samples.

Image preprocessing

The simultaneous SPECT and CT acquisition facilitates the integration of input data from the two modalities. The SPECT image provides diagnostic information but lacks the anatomic detail that the corresponding CT image can supply. Hence, we propose to combine the 1/7 SPECT image and the CT image as the input and take the standard SPECT as the network output. To facilitate this combination, each collection of CT images was resampled into a 200 × 256 × 256 matrix with the same shape as the SPECT volume. To flatten the differences in voxel values, all input and output images were divided by their own average. Corresponding SPECT and CT slices were then concatenated along the first (channel) dimension before being fed to the proposed U2-Net architecture.
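For concreteness, the following is a minimal sketch of these preprocessing steps (resampling the CT to the SPECT grid, mean normalization and channel concatenation); the function name and the use of SciPy's `zoom` for resampling are illustrative assumptions, not the authors' code.

```python
# Minimal preprocessing sketch; names and the choice of scipy.ndimage.zoom
# for resampling are assumptions, not taken from the paper's code.
import numpy as np
from scipy.ndimage import zoom

def preprocess_pair(spect_fast, spect_std, ct):
    """Resample CT to the SPECT grid, mean-normalize, and stack channels."""
    # Resample the CT volume (131 x 512 x 512) to the SPECT shape (200 x 256 x 256).
    factors = [s / c for s, c in zip(spect_fast.shape, ct.shape)]
    ct_on_spect_grid = zoom(ct, factors, order=1)

    # Divide each volume by its own average to flatten voxel-value differences.
    spect_fast = spect_fast / spect_fast.mean()
    spect_std = spect_std / spect_std.mean()
    ct_on_spect_grid = ct_on_spect_grid / ct_on_spect_grid.mean()

    # Concatenate corresponding SPECT and CT slices along the first (channel)
    # dimension: each axial slice becomes a 2 x 256 x 256 network input.
    inputs = np.stack([spect_fast, ct_on_spect_grid], axis=1)  # (200, 2, 256, 256)
    targets = spect_std[:, None]                               # (200, 1, 256, 256)
    return inputs.astype(np.float32), targets.astype(np.float32)
```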

The ablation study in the experiments demonstrates the effectiveness of this combination compared with using only the 1/7 SPECT as input.

Residual U-block and U2-Net

The 1/7 SPECT differs greatly from the standard SPECT in both the visibility of bone structures and the voxel values of lesions, so both local and global contextual features are important for this image synthesis task. U2-Net was originally proposed for salient object detection (SOD). Its architecture differs from modern CNN designs such as AlexNet [14], ResNet [15], GoogLeNet [16], DenseNet [17] and VGG [18], which were originally built for image classification tasks and prefer small convolutional filters with a size of 1 × 1 or 3 × 3 to extract features. U2-Net [19] is a simple yet powerful deep network architecture containing a novel two-level nested U-shaped structure [20]. Its residual U-block (RSU) mixes receptive fields of different sizes, which helps capture contextual information at different scales more efficiently, and uses pooling operations to increase the overall architecture depth without adding much computational cost. An RSU has three major components: an input convolutional layer, a U-Net-like symmetric encoder–decoder structure of height L, and a residual connection that fuses local and multi-scale features by summation. The RSU module with height L = 7 is shown in Fig. 2.
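The PyTorch sketch below illustrates one way an RSU of height L could be written; channel sizes, layer names and the dilation at the bottom are assumptions following the U2-Net paper rather than the authors' released code.

```python
# Simplified residual U-block (RSU) sketch; details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBNReLU(nn.Module):
    def __init__(self, in_ch, out_ch, dilation=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=dilation, dilation=dilation)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        return F.relu(self.bn(self.conv(x)))

class RSU(nn.Module):
    """Residual U-block of height L: input conv, a small U-Net, and a residual sum."""
    def __init__(self, height, in_ch, mid_ch, out_ch):
        super().__init__()
        self.conv_in = ConvBNReLU(in_ch, out_ch)              # input convolution layer
        self.enc = nn.ModuleList([ConvBNReLU(out_ch, mid_ch)] +
                                 [ConvBNReLU(mid_ch, mid_ch) for _ in range(height - 2)])
        self.bottom = ConvBNReLU(mid_ch, mid_ch, dilation=2)  # dilated bottom conv
        self.dec = nn.ModuleList([ConvBNReLU(mid_ch * 2, mid_ch) for _ in range(height - 2)] +
                                 [ConvBNReLU(mid_ch * 2, out_ch)])

    def forward(self, x):
        fx = self.conv_in(x)                                  # local feature F(x)
        feats, h = [], fx
        for i, layer in enumerate(self.enc):                  # encoder with pooling
            h = layer(h)
            feats.append(h)
            if i < len(self.enc) - 1:
                h = F.max_pool2d(h, 2, ceil_mode=True)
        h = self.bottom(h)                                    # multi-scale bottom stage
        for layer, skip in zip(self.dec, reversed(feats)):    # decoder with skip links
            if h.shape[-2:] != skip.shape[-2:]:
                h = F.interpolate(h, size=skip.shape[-2:], mode='bilinear',
                                  align_corners=False)
            h = layer(torch.cat([h, skip], dim=1))
        return fx + h                                         # residual fusion by summation
```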

Fig. 2

Illustration of the proposed U2-Net architecture. It consists of 4 residual U-block (RSU) encoders with height L, 3 bottom residual blocks (RS) and 4 symmetric RSU decoders. Skip connections are used to preserve spatial information between matched encoders and decoders. The 1/7 SPECT and the corresponding CT image are used as the input. The output of the network is generated from images convolved and upsampled from the second and third RS blocks and the follow-up decoders

The U2-Net is built from these RSU blocks. It consists of a 6-stage encoder, a 5-stage decoder and a fusion module attached to the decoders at different stages. (i) In the encoder stages, RSU-7, RSU-6, RSU-5 and RSU-4 are used, respectively, where '7', '6', '5' and '4' denote the heights (L) of the RSU blocks. Because the resolution of the feature maps in the middle part of U2-Net is relatively low, further downsampling of these feature maps would cause a loss of useful context. Hence, we use dilated convolutions to replace the pooling and upsampling operations; this special block is referred to as 'RS-L' and is shown in Fig. 2 with height L = 4. (ii) The decoder stages have structures similar to their symmetric encoder stages, starting from RS-4. There are 5 stages in total, and each decoder stage takes as input the concatenation of the upsampled feature maps from its previous stage and those from its symmetric encoder stage, as shown in Fig. 2. (iii) The final fusion module generates the synthesized SPECT image. First, the U2-Net generates six side-output synthesized SPECT images Sup(6), Sup(5), Sup(4), Sup(3), Sup(2) and Sup(1), which are upsampled to the same size as the input 1/7 SPECT image. These outputs are then fused by a concatenation operation and a 1 × 1 convolution layer, followed by a long skip connection with the 1/7 SPECT to generate the final synthesized SPECT image.
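As a sketch of step (iii), the module below upsamples the six side outputs, concatenates them, applies a 1 × 1 convolution and adds the 1/7 SPECT through the long skip connection; the class name and bilinear upsampling mode are assumptions.

```python
# Hedged sketch of the side-output fusion; names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionModule(nn.Module):
    def __init__(self, n_sides=6):
        super().__init__()
        self.fuse = nn.Conv2d(n_sides, 1, kernel_size=1)  # 1x1 fusion convolution

    def forward(self, side_outputs, spect_fast):
        # Upsample every side output Sup(1)..Sup(6) to the input resolution.
        ups = [F.interpolate(s, size=spect_fast.shape[-2:], mode='bilinear',
                             align_corners=False) for s in side_outputs]
        fused = self.fuse(torch.cat(ups, dim=1))
        # Long skip connection with the 1/7 SPECT input.
        return fused + spect_fast
```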

Lesion attention loss and deep supervision

To ensure the accuracy of the synthesized image values and the distinguishability of structures and important ROIs, we adopt a loss function combining the structural similarity index (SSIM) loss and the L1 loss. The total loss for each output SPECT at the different decoder stages is

$$L = L_{\text{L1}} + \alpha L_{\text{SSIM}}$$

where \(\alpha\) is a fixed weight (\(\alpha = 0.5\)) that balances the SSIM loss and the L1 loss.

To further improve the performance on lesion regions, we add lesion attention masks to emphasize the loss in these areas. The lesion masks were contoured on the standard SPECT by physicians. The improved lesion attention loss is defined as

$$L_{\text{lesion}} = L\left( I_{\text{syn}}, I_{\text{std}} \right) + \beta L\left( M \odot I_{\text{syn}}, M \odot I_{\text{std}} \right)$$

where \(\beta\) is a fixed weight (\(\beta = 100\)) that balances the lesion region loss and the whole-image loss, \(M\) is the lesion mask, \(\odot\) denotes element-wise multiplication, and \(I_{\text{syn}}\) and \(I_{\text{std}}\) are the synthesized and standard SPECT images, with \(L\) the combined loss defined above.

We also use a deep supervision strategy during training to speed up the training process and stabilize the intermediate layers. The total loss for training the U2-Net is defined as

$$\mathcal{L}_{\text{total}} = \sum_{m = 1}^{N} w_{\text{side}}^{(m)} \, \ell_{\text{side}}^{(m)} + w_{\text{final}} \, \ell_{\text{final}}$$

where \(\ell_{\text{side}}^{(m)}\) (\(N = 6\), corresponding to Sup1, Sup2, …, Sup6 in Fig. 2) is the loss of the \(m\)-th side output and \(\ell_{\text{final}}\) is the loss of the final output of the network. \(w_{\text{side}}^{(m)}\) and \(w_{\text{final}}\) control the weights of each component in the total loss. In the testing process, only the final output is used for synthesizing SPECT images.
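Putting the pieces together, the sketch below implements \(L = L_{\text{L1}} + \alpha L_{\text{SSIM}}\), the \(\beta\)-weighted lesion term and the deeply supervised sum. It assumes an SSIM helper such as torchmetrics' `structural_similarity_index_measure`, that the lesion term applies to every output, and uniform weights \(w_{\text{side}}^{(m)} = w_{\text{final}} = 1\); none of these choices are stated in the text.

```python
# Sketch of the training loss; the SSIM helper, uniform weights and the
# application of the lesion term to side outputs are assumptions.
import torch
import torch.nn.functional as F
from torchmetrics.functional import structural_similarity_index_measure as ssim

def base_loss(pred, target, alpha=0.5):
    """L = L_L1 + alpha * L_SSIM for one output (alpha = 0.5 as in the text)."""
    l1 = F.l1_loss(pred, target)
    l_ssim = 1.0 - ssim(pred, target, data_range=float(target.max()))
    return l1 + alpha * l_ssim

def lesion_attention_loss(pred, target, mask, beta=100.0):
    """Whole-image loss plus a beta-weighted loss restricted to the lesion mask M."""
    return base_loss(pred, target) + beta * base_loss(pred * mask, target * mask)

def total_loss(side_outputs, final_output, target, mask,
               w_side=1.0, w_final=1.0):
    """Deep supervision: weighted sum over the N = 6 side outputs plus the final output."""
    loss = w_final * lesion_attention_loss(final_output, target, mask)
    for sup in side_outputs:  # Sup(1)..Sup(6), upsampled to the target size
        loss = loss + w_side * lesion_attention_loss(sup, target, mask)
    return loss
```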

Implementation details

The proposed method is implemented in PyTorch 1.6.0 and trained on four NVIDIA GeForce RTX 3090 GPUs (24 GB). The network is trained for 100 epochs with a batch size of 4, using axial slices as the inference plane. A VGG network is used as the discriminator. The Adam optimizer is used with a learning rate of 0.0002 for both the generator and the discriminator, and the learning rate is divided by 10 after 80 epochs.
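These stated settings translate to roughly the following configuration; the two small convolutions below are placeholders standing in for the U2-Net generator and the VGG discriminator.

```python
# Hedged optimizer/scheduler sketch; the convolutions are placeholders
# for the actual U2-Net generator and VGG discriminator.
import torch
import torch.nn as nn

generator: nn.Module = nn.Conv2d(2, 1, 3, padding=1)      # placeholder network
discriminator: nn.Module = nn.Conv2d(1, 1, 3, padding=1)  # placeholder network

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

# Divide the learning rate by 10 after 80 of the 100 epochs.
sched_g = torch.optim.lr_scheduler.MultiStepLR(opt_g, milestones=[80], gamma=0.1)
sched_d = torch.optim.lr_scheduler.MultiStepLR(opt_d, milestones=[80], gamma=0.1)
```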

Quantitative assessment

To quantitatively evaluate the performance of the synthesized images, PSNR and SSIM are used as evaluation metrics. The PSNR of a synthesized image is defined as

$$\text{PSNR} = 10 \cdot \log_{10} \left( \frac{\text{MAX}_{\text{gt}}^{2}}{\text{MSE}} \right)$$

where \(\text{MAX}_{\text{gt}}\) is the maximum pixel value of the ground-truth standard SPECT and MSE is the mean squared error of the synthesized image with respect to the standard SPECT.

The SSIM between a synthesized image \(x\) and the standard SPECT \(y\) is defined as

$$\text{SSIM}\left( x, y \right) = \frac{\left( 2\mu_{x} \mu_{y} + c_{1} \right)\left( 2\sigma_{xy} + c_{2} \right)}{\left( \mu_{x}^{2} + \mu_{y}^{2} + c_{1} \right)\left( \sigma_{x}^{2} + \sigma_{y}^{2} + c_{2} \right)}$$

where \(\mu_{x}\) and \(\sigma_{x}^{2}\) are the mean and variance of the synthesized image, \(\mu_{y}\) and \(\sigma_{y}^{2}\) are the mean and variance of the standard SPECT, \(\sigma_{xy}\) is the covariance of the two images, and \(c_{1}\) and \(c_{2}\) are small constants. SSIM is calculated using the scikit-image package.
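Both metrics can be computed directly with scikit-image, which the text names for SSIM; the PSNR call below mirrors the definition above, with the data range taken from the ground-truth standard SPECT.

```python
# Evaluation-metric sketch using scikit-image; the wrapper function is ours.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(synth: np.ndarray, standard: np.ndarray):
    data_range = standard.max()  # MAX_gt in the PSNR definition above
    psnr = peak_signal_noise_ratio(standard, synth, data_range=data_range)
    ssim_value = structural_similarity(standard, synth, data_range=data_range)
    return psnr, ssim_value
```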

Clinical assessment

Two readers independently grade the 1/7 SPECT, the SPECT synthesized by the proposed method and the standard SPECT in terms of general image quality, detail of the 99mTc-MDP distribution, presence of artifacts and general diagnostic confidence using a 5-point Likert scale (1, unacceptable image quality; 2, suboptimal; 3, acceptable; 4, good; 5, excellent). The readers are blinded to the meta-information of the compared images.

The average scores for each kind of image are compared, and a paired t test is used to identify significant differences for each criterion.
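A minimal sketch of this comparison using SciPy's paired t test; the score arrays below are hypothetical placeholders, not the study's data.

```python
# Paired t test sketch; the Likert scores here are hypothetical placeholders.
from scipy import stats

scores_synthesized = [4, 4, 5, 3, 4, 4, 5, 4, 3, 4]  # per-subject reader scores
scores_standard    = [4, 5, 5, 4, 4, 5, 5, 4, 4, 4]

t_stat, p_value = stats.ttest_rel(scores_synthesized, scores_standard)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```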

Phantom study

Half of the phantom images are used to train the model, and the remaining half, which includes the central portion of the phantom, is used for testing. The slices through the sphere centers are used to distinguish the different spheres. SUVmax and SUVmean are measured and compared for each recognizable sphere. SUV is defined as

$$\text{SUV} = \text{tissue activity concentration} \times \text{body weight} / \text{total injected activity}$$
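For illustration, a small helper under the definition above; the unit handling (kBq/mL, kg, MBq) and all numbers are hypothetical.

```python
# Worked SUV sketch; units and values are hypothetical, not from the study.
def suv(activity_concentration_kbq_ml: float, weight_kg: float,
        injected_activity_mbq: float) -> float:
    """SUV = tissue activity concentration x body weight / injected activity."""
    # Convert weight to g and injected activity to kBq so the units cancel
    # (kBq/mL * g / kBq ~ g/mL, the conventional dimensionless SUV).
    return (activity_concentration_kbq_ml * weight_kg * 1000.0
            / (injected_activity_mbq * 1000.0))

print(suv(activity_concentration_kbq_ml=50.0, weight_kg=70.0,
          injected_activity_mbq=1000.0))  # -> 3.5
```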

PSNR and SSIM are calculated for the 1/7 SPECT and the generated SPECT.
