SPECT-MPI iterative denoising during the reconstruction process using a two-phase learned convolutional neural network

Problem formulation

In SPECT reconstruction, the measured data in the jth projection element, \(p_{j}\), can be formulated as below:

$$p_{j}=\sum_{i}M_{ij}\times q_{i}$$

(1)

where \(q_{i}\) is the intensity (or activity concentration) within the ith pixel of the image, and \(M_{ij}\) is the probability that a photon emitted from the ith pixel will be detected by the jth projection element. \(M\) is known as the detection probability matrix or system matrix and is very large: \(M\in {\mathbb{R}}^{l\times k}\), where l is the number of pixel elements (\(l=m\times n\), with \(m,n\) the image dimensions) and k is the number of projection elements (\(k=d\times \alpha\), with \(d,\alpha\) the number of detector elements and projection angles, respectively). The objective is to find the maximum likelihood estimate of the image, \(\widehat{Q}\in {\mathbb{R}}^{l}\), which can be written as:

$$\widehat{Q}=\underset{Q}{\mathrm{argmax}}\;L\left(P|Q\right)$$

(2)

where \(L\) represents the likelihood function and \(P\in {\mathbb{R}}^{k}\) is the projection count distribution measured from an estimated intensity distribution in image space, labelled \(Q\).

Assuming the measured data to be Poisson distributed and solving the maximum likelihood estimation problem with iterative reconstruction, function (2) is rewritten as below [15]:

$$\widehat{q}_{i}^{\,n+1}=\frac{\widehat{q}_{i}^{\,n}}{\sum_{j}M_{ij}}\times \sum_{j}M_{ij}\times \frac{p_{j}}{\sum_{t}\left(M_{tj}\times \widehat{q}_{t}^{\,n}\right)}$$

(3)

where n is the iteration number, \(\widehat{q}_{i}^{\,n}\), the ith pixel element of the image \(\widehat{Q}\) after n iterations, is an estimate of \(q_{i}\), and t is a pixel index analogous to i (since the denominator must be computed before the rest of the formula, t is used instead of i to avoid confusion).

After n iterations, each estimated pixel value \(\widehat{q}_{i}^{\,n}\) is related to the real pixel value \(q_{i}\) as follows:

$$\widehat{q}_{i}^{\,n}=q_{i}+e_{i}^{\,n}$$

(4)

where \(e_{i}^{\,n}\) represents the error magnitude after n iterations, which is influenced by both deterministic and random noise. It is anticipated that, in the absence of random noise, by accounting for all sources of deterministic noise and patient-induced attenuation and scattering in the system matrix M, \(e_{i}^{\,n}\) is significantly reduced as the number of iterations increases [16]. However, increasing the number of reconstruction iterations introduces higher-frequency components into the reconstructed image [16]. In practical scenarios, random noise contains high-frequency components; consequently, increasing the number of iterations results in higher noise levels in the reconstructed image [16]. One of the simplest strategies to suppress both deterministic and random noise is to increase the number of iterations while applying a random-noise-controlling filter, such as a low-pass filter, after the reconstruction [16]. The corresponding equation can be expressed as follows:

$$\Theta = f\left(\widehat{Q}\right)$$

(5)

where \(f:{\mathbb{R}}^{m\times n}\to {\mathbb{R}}^{m\times n}\) (\(m\times n\) being the image dimensions) is a filtering function that regulates the noise in the estimated image \(\widehat{Q}\), and \(\Theta \in {\mathbb{R}}^{m\times n}\) is a noise-suppressed estimate of the real image. Conventionally, \(f\) is modeled by a low-pass filter to reduce noise. However, the primary drawback of simple low-pass filters is that they blur the image, suppressing edges and altering contrast values during filtering. Recent studies have demonstrated that deep neural networks provide superior noise filtering performance while retaining spatial resolution and quantitative accuracy compared to classical filtering methods [13].

If the \(f\) function is applied after completing the iterations [17], the approach is known as post-reconstruction filtering/denoising. In our work, we propose to apply the \(f\) function after each iteration, so the number of times \(f\) is applied equals (or is less than) the number of iterations. This idea can be formulated as follows:

$$\widehat{q}_{i}^{\,n+1}=\frac{\theta_{i}^{\,n}}{\sum_{j}M_{ij}}\times \sum_{j}M_{ij}\times \frac{p_{j}}{\sum_{t}\left(M_{tj}\times \theta_{t}^{\,n}\right)}$$

(6)

where \(\theta_{i}^{\,n}\) is the ith pixel element of \(\Theta^{\,n}\) and \(\Theta^{\,n}=f\left(\widehat{Q}^{\,n}\right)\).

Here, the proposed \(f\) function is applied two fewer times than the number of iterations (f is not applied in the first two iterations). We have also explored the effect of using a specific \(f\) for each iteration. In this work, we propose a GAN to estimate \(f\) as a denoising function. To this end, the GAN has been trained with pairs of high- and low-noise images using three training approaches. Moreover, to reduce the number of iterations, the OSEM [18] reconstruction method is used; in this way, \(f\) is applied after all sub-iterations of an OSEM iteration have been completed.
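As an illustration only, the scheme of Eq. (6) can be sketched as below, with the OSEM update and the learned denoiser passed in as callables; the function names, the initialization, and the schedule parameter are assumptions for illustration, not the exact implementation.

```python
import numpy as np

def iteratively_denoised_osem(initial_image, osem_iteration, denoiser,
                              n_iterations=5, skip_first=2):
    """Sketch of Eq. (6): OSEM iterations interleaved with the learned
    denoiser f; f is skipped for the first `skip_first` iterations.

    osem_iteration : callable performing one OSEM iteration (all sub-iterations)
    denoiser       : callable implementing f (here, the trained TPL-CNN)
    """
    image = np.asarray(initial_image, dtype=float)
    for it in range(1, n_iterations + 1):
        image = osem_iteration(image)        # reconstruction update, Eqs. (3)/(6)
        if it > skip_first:
            image = denoiser(image)          # theta^n = f(q_hat^n), fed to the next iteration
    return image
```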

Network architecture

Our implemented network, which we call a two-phase learned convolutional neural network (TPL-CNN), is based on a GAN consisting of two parts: a generator and a discriminator. After exploring various architectures, the CNN model shown in Fig. 1 was adopted for both the generator and the discriminator.

Fig. 1 Architecture of the generator and discriminator networks in the GAN model of the TPL-CNN

In the generator, two convolutional and two deconvolutional layers were used. Ten 3 × 3 convolutional kernels with no padding and a stride of one were utilized in all layers. Each convolutional layer was followed by a batch normalization (BN) layer and a leaky rectified linear unit (leaky ReLU) activation with a slope of 0.3, except for the last convolutional layer, which used a standard ReLU activation. We also included skip connections, which directly concatenate the values from an input layer to the values at an output layer of the same dimension. These skip connections were introduced to address the potential loss of essential image details as the number of encodings increases; they provide a parallel pathway for conveying image features from shallower encoding layers to deeper decoding layers, effectively resolving this concern [19].

The discriminator comprised two convolutional layers, each with ten 3 × 3 kernels. Each convolutional layer was followed by a BN layer, and a leaky ReLU with a slope of 0.3 was used as the activation function. A flatten layer converted the spatial dimensions of the preceding layers into a single one-dimensional vector used as input to the last layer, a dense layer.

The details of the network architecture are presented in the Appendix under the caption “TPL-CNN Characteristics.”
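Under these descriptions, the two networks could be sketched in Keras as follows; the placement of the skip connection, the sigmoid output of the discriminator, and the function names are assumptions for illustration rather than the exact published configuration.

```python
from tensorflow.keras import layers, Model

def build_generator(input_shape=(64, 64, 1), n_filters=10):
    """Sketch of the generator: two valid-padded 3x3 convolutions, two 3x3
    transposed convolutions, BN + LeakyReLU(0.3) after the inner layers,
    ReLU on the last layer, and a skip connection between layers of
    matching spatial size (its exact placement is assumed)."""
    inp = layers.Input(shape=input_shape)
    e1 = layers.LeakyReLU(0.3)(layers.BatchNormalization()(
        layers.Conv2D(n_filters, 3, strides=1, padding='valid')(inp)))          # H -> H-2
    e2 = layers.LeakyReLU(0.3)(layers.BatchNormalization()(
        layers.Conv2D(n_filters, 3, strides=1, padding='valid')(e1)))           # H-2 -> H-4
    d1 = layers.LeakyReLU(0.3)(layers.BatchNormalization()(
        layers.Conv2DTranspose(n_filters, 3, strides=1, padding='valid')(e2)))  # H-4 -> H-2
    d1 = layers.Concatenate()([d1, e1])                                         # skip connection
    out = layers.Conv2DTranspose(1, 3, strides=1, padding='valid',
                                 activation='relu')(d1)                         # H-2 -> H
    return Model(inp, out, name='generator')

def build_discriminator(input_shape=(64, 64, 1), n_filters=10):
    """Sketch of the discriminator: two 3x3 convolutions with BN and
    LeakyReLU(0.3), then a flatten layer and a single dense output."""
    inp = layers.Input(shape=input_shape)
    x = inp
    for _ in range(2):
        x = layers.Conv2D(n_filters, 3, strides=1, padding='valid')(x)
        x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU(0.3)(x)
    x = layers.Flatten()(x)
    out = layers.Dense(1, activation='sigmoid')(x)
    return Model(inp, out, name='discriminator')
```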

Network implementation

In SPECT-MPI, the region of interest is the myocardium, which is much smaller than the entire reconstructed image; the vast majority of image pixels correspond to background regions (or other organs) outside the heart volume, which reduces the accuracy of the network. Hence, the network can be trained only on this ROI to improve its accuracy [10, 12]. On the other hand, the network must see the entire reconstructed images to improve its generalization. Therefore, the network is trained with a two-phase algorithm presented at the SNMMI 2022 conference [20].

In the first phase of this algorithm, the aforementioned network is trained with reconstructed images from high-noise and low-noise data that contain only the cardiac region as the input and output of the network. These smaller image pairs, each of size \(32\times 24\) pixels, are automatically prepared using the QCard-NM package as proposed in [21, 22]. This package identifies a fixed-size rectangle encompassing the cardiac region, and the ROI is then presented to the user. If the ROI does not properly surround the left ventricle, the user can manually adjust its position.

Subsequently, the cropped images are rotated by + 20° and − 20° and translated along the x and y axes in both positive and negative directions using Python libraries, expanding the image set to seven times its original size. The image order is then randomly shuffled before being fed to the neural network.
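A minimal sketch of this augmentation, assuming SciPy for the geometric transforms; the translation amount, interpolation order, and function names are assumptions, as they are not specified in the text.

```python
import numpy as np
from scipy.ndimage import rotate, shift

def augment_pair(high_noise, low_noise, shift_pixels=2):
    """Augment one high-/low-noise image pair: +/-20 degree rotations and
    +/- shifts along x and y, applied identically to both images."""
    pairs = [(high_noise, low_noise)]
    for angle in (+20, -20):
        pairs.append((rotate(high_noise, angle, reshape=False, order=1),
                      rotate(low_noise,  angle, reshape=False, order=1)))
    for dy, dx in ((shift_pixels, 0), (-shift_pixels, 0),
                   (0, shift_pixels), (0, -shift_pixels)):
        pairs.append((shift(high_noise, (dy, dx), order=1),
                      shift(low_noise,  (dy, dx), order=1)))
    return pairs   # 1 original + 2 rotations + 4 translations = 7 versions

# usage: augmented = [p for hn, ln in dataset for p in augment_pair(hn, ln)]
# the list is then randomly shuffled before training
```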

Once the random initial weights of the first network have been updated, they serve as the initial weights of the second network in a process called transfer learning. The second network has the same architecture as the first, except for the input and output layer sizes, which match the original reconstructed image size of \(64\times 64\) pixels. In transfer learning, the size of the filters (kernels) and the weights within the convolutional layers generally do not need to be altered when the input size changes. This process is illustrated in Fig. 2.
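A minimal sketch of this weight transfer, assuming the build_generator sketch shown earlier: because the generator is fully convolutional, its layer weights do not depend on the spatial input size and can be copied directly. The crop orientation and names are illustrative.

```python
# Phase 1: train on the 32 x 24 cardiac-ROI crops (orientation assumed)
gen_small = build_generator(input_shape=(32, 24, 1))
# ... train gen_small on the cardiac-ROI high-/low-noise pairs ...

# Phase 2: same architecture on full 64 x 64 reconstructed images,
# initialized with the weights learned in phase 1 (transfer learning)
gen_full = build_generator(input_shape=(64, 64, 1))
gen_full.set_weights(gen_small.get_weights())   # convolutional kernels are size-independent
# ... continue training gen_full on the full-size image pairs ...
```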

Fig. 2 Flowchart of the two-phase learning process of the TPL-CNN training

In our implementation, we first applied one iteration of OSEM with four sub-iterations and used the result as the input to the iterative denoising process, which reduces the noise of the input image over four iterations. In each iteration of this process, the input data are denoised by the TPL-CNN before being reconstructed for one more OSEM iteration. The overall algorithm flowchart is presented in Fig. 3.

Fig. 3 Flowchart of the iterative denoising algorithm. OSEM is used as the reconstruction tool; each iteration of the OSEM reconstruction consists of four sub-iterations

All image reconstructions in our work were performed in fully two-dimensional mode. The neural network was implemented with the Keras deep learning framework 2.8.0 on TensorFlow 2.8.2 in Python 3.9. All experiments were conducted on an NVIDIA GeForce GTX 1050 graphics processing unit with 4 GB of memory. During training, the generator was updated via a weighted sum of the adversarial and L2-norm losses, with the L2-norm loss calculated between the predicted images and the reference input images. The binary cross-entropy loss function was used to train the discriminator. The adaptive moment estimation (Adam) optimizer with a learning rate of \(1\times {10}^{-3}\) was used to minimize the loss functions.
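A rough sketch of one training update consistent with this description, with the L2-norm loss implemented here as a mean squared error and an assumed adversarial weighting factor (the actual weighting is not specified in the text); generator and discriminator refer to the architecture sketches above.

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()
gen_opt = tf.keras.optimizers.Adam(learning_rate=1e-3)
disc_opt = tf.keras.optimizers.Adam(learning_rate=1e-3)
ADV_WEIGHT = 1e-3   # assumption: the adversarial-vs-L2 weighting is not given in the text

def train_step(generator, discriminator, high_noise, low_noise):
    """One combined update: BCE loss for the discriminator and a weighted
    adversarial + L2 loss for the generator."""
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        denoised = generator(high_noise, training=True)
        real_score = discriminator(low_noise, training=True)
        fake_score = discriminator(denoised, training=True)

        d_loss = bce(tf.ones_like(real_score), real_score) + \
                 bce(tf.zeros_like(fake_score), fake_score)

        l2_loss = tf.reduce_mean(tf.square(denoised - low_noise))
        adv_loss = bce(tf.ones_like(fake_score), fake_score)
        g_loss = l2_loss + ADV_WEIGHT * adv_loss

    disc_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                                 discriminator.trainable_variables))
    gen_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                                generator.trainable_variables))
    return float(g_loss), float(d_loss)
```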

Dataset

In total, 247 patients were referred for MPI, and two serial SPECT acquisitions, one high-noise and one low-noise, were acquired immediately after each other to serve as the input and ground truth images. For each scan, 32 projections with a matrix size of 64 \(\times\) 64 were acquired, and the data were reconstructed using the OSEM algorithm with five iterations and four sub-iterations. The data were divided into three subsets: 1) 149 patients used for training, 2) 49 patients used as the validation set, and 3) the remaining 49 patients used for performance evaluation (the test set). Patients did not move between acquisitions, and cases were randomly divided into the training, validation, and evaluation sets. The high-noise and ground truth low-noise image pairs were obtained with acquisition times of 10 and 30 s per projection, respectively.

Experimental setup

Because the device cannot acquire data in list mode, we had to perform the high-noise scan serially. The experimental setup involved the standard (low-noise) scan, which took 10 min, followed by the fast (high-noise) scan, which took 3 min. Patients remained still on the bed for 13 min, starting one hour after injection, during which changes in activity distribution and decay were deemed insignificant under the assumptions of the myocardial scan. A Tc-99m sestamibi two-day protocol was used for the rest/stress scans, and the dataset contains both the rest and stress phases. The administered doses were based on patient weight (8 MBq/kg) [22]. Rest acquisition started 60 min after tracer injection. Pharmacological stress was induced by an infusion of dipyridamole; 3 min after the slow dipyridamole infusion, the radiotracer was injected, and stress acquisition started 45–90 min after the injection.

The SPECT studies were acquired with a dual-head detector camera, with low-energy, high-resolution collimators, a 20% symmetrical window at 140 keV, and a 64 \(\times\) 64 matrix size. For each scan, 32 projections with 10- and 30-s per projection were acquired.

CorCam Gamma Camera System (DDD-Diagnostic, Denmark) was used to acquire the images. No attenuation, scatter, and detector response corrections were applied to maintain the generality of the proposed approach.

Parameter selection

Parameter selection in neural networks involves determining various parameters to achieve optimal denoising. The learning rate, which determines whether the network converges, is considered the most important hyperparameter in training neural networks and was set to 0.001 in our experiments. We used Adam, an adaptive gradient descent optimizer, which generally works well with a fixed learning rate; moreover, we used normalization, which allows good results even with unusual learning rates. The batch size was set to 32 to avoid occupying a large amount of memory, which can slow down the training process. As the number of layers or filters increased, the computation time also increased, with only a slight improvement in the output; therefore, using fewer layers and filters was a reasonable balance between performance and computational cost, namely four layers and 32 filters. To select the number of epochs, a validation set was given to the network after each epoch, and the PSNR, RMSE, and SSIM metrics were measured at each epoch. An acceptable epoch was reached when there was no significant improvement in these metrics and their values oscillated around the same levels. The results are presented in Table S-1 in the Supplementary Information section.
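For illustration, the per-epoch validation check could look like the sketch below; the function name, the [0, 1] normalization, and the use of TensorFlow's built-in PSNR/SSIM are assumptions rather than the exact monitoring code used.

```python
import numpy as np
import tensorflow as tf

def validation_metrics(generator, val_inputs, val_targets):
    """Per-epoch monitoring sketch: RMSE, PSNR and SSIM on the validation set,
    assuming images are normalized to [0, 1] and shaped (N, H, W, 1)."""
    preds = generator.predict(val_inputs, verbose=0).astype(np.float32)
    targets = np.asarray(val_targets, dtype=np.float32)
    rmse = float(np.sqrt(np.mean((preds - targets) ** 2)))
    psnr = float(tf.reduce_mean(tf.image.psnr(preds, targets, max_val=1.0)))
    ssim = float(tf.reduce_mean(tf.image.ssim(preds, targets, max_val=1.0)))
    return {'RMSE': rmse, 'PSNR': psnr, 'SSIM': ssim}

# Training was stopped at the epoch where these values stopped improving
# and merely oscillated around the same levels.
```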

TPL-CNN training approaches

We considered three different approaches to train the TPL-CNN. The network architecture is the same in all approaches, but the data fed to the network for training differs.

Solo training

This method trains the TPL-CNN using reconstructed images that have undergone two iterations of OSEM, for both the input and the ground truth counterparts. The same trained network is used every time the denoising step is called during the iterative process, meaning that the function f in Eq. (5) remains constant throughout the whole process.

Mélange training

This method uses reconstructed image pairs acquired from different stages of the OSEM algorithm to train the TPL-CNN. The training data for this network is a mixture of image pairs obtained after two, three, four, and five iterations of OSEM. However, this single trained TPL-CNN is used consistently throughout the entire denoising process; as in the solo training approach, the function f in Eq. (5) remains constant throughout the procedure.

Distinctive training

This method employs several neural networks with identical architectures but distinct training data: a dedicated network is trained for each stage of the OSEM reconstruction. For instance, for the second OSEM iteration, the network is trained on image pairs reconstructed with two iterations of OSEM, and for the third OSEM iteration, a new network is trained on image pairs reconstructed with three iterations of OSEM. The remaining OSEM iteration stages follow the same pattern. Accordingly, in contrast to the two previous approaches, the function f in Eq. (5) changes after each OSEM iteration.
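A compact sketch of this setup, with hypothetical load_pairs and train_tpl_cnn callables standing in for the data preparation and training steps:

```python
def train_distinctive_denoisers(load_pairs, train_tpl_cnn, osem_stages=(2, 3, 4, 5)):
    """Distinctive training sketch: one TPL-CNN per OSEM stage, each trained on
    image pairs reconstructed with the matching number of OSEM iterations.
    `load_pairs` and `train_tpl_cnn` are hypothetical helpers."""
    denoisers = {}
    for n_iter in osem_stages:
        pairs = load_pairs(n_iter)               # high-/low-noise pairs after n_iter OSEM iterations
        denoisers[n_iter] = train_tpl_cnn(pairs)
    return denoisers

# during reconstruction, the f applied after the n-th OSEM iteration is denoisers[n]
```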

Evaluation

We compared the proposed method with the conventional reconstruction method (OSEM), the Gaussian post-filtering method, the iterative Gaussian filtering method, and a CNN denoising method. All Gaussian filters used in this study are two-dimensional and were created with the MATLAB function “imgaussfilt,” which is designed for two-dimensional objects such as images.

Conventional reconstruction method

This method is the conventional reconstruction with the OSEM algorithm applied to our acquired high- and low-noise data without any filtering.

Gaussian post filtering

This method applies a Gaussian filter directly to the reconstructed images. After the images are completely reconstructed using OSEM, a Gaussian filter with a full width at half maximum (FWHM) of 8.4 mm, as suggested by the device manufacturer, is applied as the post-reconstruction filter to reduce their noise level.
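As an equivalent sketch in Python (the study itself used MATLAB's imgaussfilt), the FWHM can be converted to the Gaussian sigma in pixels; the pixel spacing below is an assumed value, since it is not stated in the text.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

FWHM_MM = 8.4
PIXEL_SIZE_MM = 6.4   # assumption: the pixel spacing of the 64 x 64 matrix is not stated
sigma_pixels = FWHM_MM / (2.0 * np.sqrt(2.0 * np.log(2.0))) / PIXEL_SIZE_MM  # FWHM = 2.355 * sigma

def gaussian_post_filter(reconstructed_slice):
    """Apply the 2-D Gaussian post-filter with 8.4 mm FWHM to one reconstructed slice."""
    return gaussian_filter(reconstructed_slice, sigma=sigma_pixels)
```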

Iterative Gaussian filtering

In this method, a Gaussian filter is used as the smoothing filter, but not as a post-processing filter. The method consists of alternating reconstruction and filtering: after each step of the OSEM reconstruction algorithm, a Gaussian filter with an FWHM of 8.4 mm smooths the noise of the reconstructed image, and the next OSEM step is then applied to this smoothed image. In other words, like our proposed method, the image is reconstructed according to Eqs. (5) and (6), but here the function f is a Gaussian filter instead of the TPL-CNN.

CNN denoising

This method uses a GAN with an architecture similar to the TPL-CNN as a post-processing noise reduction tool instead of a conventional filter. To do this, a GAN is first trained to reduce the noise level of the given images. This trained GAN denoising filter is then used after the OSEM reconstruction algorithm as a post-filter: the output images of OSEM, run to completion for five iterations, are fed to it for filtering. High- and low-noise image pairs reconstructed with five iterations and four sub-iterations of OSEM are used as input to train this denoising network. This network is also trained in two phases: first, \(32\times 24\) images containing only the cardiac region are used to train the first network, and then its trained weights are used as the initial weights of the second network, whose architecture is identical except for the input layer, which accepts \(64\times 64\) images. This network is the same as the network used in the distinctive training method for image pairs from five OSEM iterations; therefore, the trainable parameters are updated using the same loss function and optimizer with the same learning rate, so the learned parameters reach their optimal values.

Qualitative and quantitative evaluations

Qualitative and quantitative evaluations of the proposed framework were performed on 49 subjects, referred to as the test dataset. Three SPECT images (obtained from solo, mélange, and distinctive training of the TPL-CNN) were compared with reference images reconstructed using the OSEM algorithm (five iterations with four sub-iterations) from the high-noise projections. The performance of the TPL-CNN was evaluated with five metrics: root mean squared error (RMSE), coefficient of variation in the background (COV_bkg), structural similarity index metric (SSIM), contrast-to-noise ratio (CNR), and edge preservation index (EPI), presented in Eqs. (7)–(11), respectively. These metrics were also calculated for the high-noise images to provide a baseline for assessing the methods. After selecting the appropriate training method for the TPL-CNN, its SNR and contrast were assessed and compared with those of the other methods.

RMSE

RMSE measures the differences between the predicted and the observed values using Eq. (7).

$$RMSE=\sqrt{\frac{1}{w}\sum_{g=1}^{w}{\left(I_{LN}\left(g\right)-I_{p}\left(g\right)\right)}^{2}}$$

(7)

where \(I_{LN}\) is the observed low-noise reconstructed image, \(I_{p}\) is the predicted image, \(w\) is the total number of pixels, and \(g\) is the pixel index.

COV_bkg

Coefficient of variation (COV) measures the relative variability or dispersion of data around the mean in a sample or population. When computed over a uniform region of interest (ROI), it represents the amount of random noise. Therefore, the COV of the same ROIs (lung region) in the background of the estimated outputs and their low-noise counterparts was considered and calculated using Eq. (8):

$$COV\_bkg=\frac{\sigma_{bkg}}{S_{bkg}}$$

(8)

where \(S_{bkg}\) and \(\sigma_{bkg}\) represent the mean and standard deviation of the background ROI, respectively.

CNR

CNR is also used to assess image quality. Unlike SNR, CNR accounts for a significant bias in the image by subtracting one term before calculating the ratio; as a result, an image may have a high SNR but a low CNR [23]. To compute it, in one of the slices, the left ventricle and its cavity were manually selected once and saved as the ROI and the background region, respectively, and the slice number for each patient was also saved. These saved locations in the specific slices were then used to calculate the CNR using Eq. (9) [23].

$$CNR=\frac{\left|S_{ROI}-S_{bkg}\right|}{\sigma_{ROI}+\sigma_{bkg}}$$

(9)

where \(S\) and \(\sigma\) represent the mean and standard deviation of the corresponding region (ROI or background), respectively.

SSIM

SSIM is a model that considers image degradation as perceived change in structural information, while also incorporating important perceptual phenomena, including both luminance masking and contrast masking terms. It is used to measure the similarity between two images and is defined as in Eq. (10).

$$SSIM\left(I_{p},I_{LN}\right)=\frac{\left(2\mu_{I_{p}}\mu_{I_{LN}}+c_{1}\right)\left(2\delta_{I_{p}I_{LN}}+c_{2}\right)}{\left(\mu_{I_{p}}^{2}+\mu_{I_{LN}}^{2}+c_{1}\right)\left(\delta_{I_{p}}^{2}+\delta_{I_{LN}}^{2}+c_{2}\right)}$$

(10)

where \(\mu_{I_{p}}\) is the mean of the predicted image, \(\mu_{I_{LN}}\) is the mean of the low-noise image, \(\delta_{I_{p}}^{2}\) is the variance of the predicted image, \(\delta_{I_{LN}}^{2}\) is the variance of the low-noise image, and \(\delta_{I_{p}I_{LN}}\) is the covariance between the predicted and the low-noise images. c1 and c2 are regularization constants that depend on the luminance dynamic range of the image (which depends on the data type of the input image) and are equal to \(0.01\times\) and \(0.03\times\) the luminance dynamic range, respectively. Since this range equals [0, 1] in our implementation, c1 and c2 are equal to 0.01 and 0.03.

EPI

EPI can effectively evaluate the degree of edge preservation in denoised images. Since the denoised estimates of the different methods are slice-based, two-dimensional images, this index was measured on all two-dimensional slices of the test set using Eq. (11) [24].

$$EPI=\frac{\sum \left(\Delta I_{d}-\overline{\Delta I_{d}}\right)\left(\Delta I_{n}-\overline{\Delta I_{n}}\right)}{\sqrt{\sum {\left(\Delta I_{d}-\overline{\Delta I_{d}}\right)}^{2}\sum {\left(\Delta I_{n}-\overline{\Delta I_{n}}\right)}^{2}}}$$

(11)

where \(\Delta I_{d}\) and \(\Delta I_{n}\) represent the high-pass filtered versions of the denoised and noisy images, respectively, obtained with a \(3\times 3\) pixel standard approximation of the Laplacian operator. \(\overline{\Delta I_{d}}\) and \(\overline{\Delta I_{n}}\) are the mean values of \(\Delta I_{d}\) and \(\Delta I_{n}\), respectively. A larger EPI value indicates a greater ability to preserve edges.
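For reference, a compact NumPy sketch of Eqs. (7)–(9) and (11) is given below; the SSIM of Eq. (10) is available in standard libraries such as scikit-image. The function names and mask-based region handling are illustrative, and the CNR denominator follows the form given in Eq. (9) above.

```python
import numpy as np
from scipy.ndimage import laplace

def rmse(pred, ref):
    """Eq. (7): root mean squared error between predicted and low-noise images."""
    return np.sqrt(np.mean((ref - pred) ** 2))

def cov_bkg(image, bkg_mask):
    """Eq. (8): coefficient of variation over a uniform background ROI."""
    roi = image[bkg_mask]
    return roi.std() / roi.mean()

def cnr(image, roi_mask, bkg_mask):
    """Eq. (9): contrast-to-noise ratio between the ROI and the background."""
    s_roi, s_bkg = image[roi_mask].mean(), image[bkg_mask].mean()
    return abs(s_roi - s_bkg) / (image[roi_mask].std() + image[bkg_mask].std())

def epi(denoised, noisy):
    """Eq. (11): edge preservation index based on Laplacian high-pass images."""
    d = laplace(denoised) - laplace(denoised).mean()
    n = laplace(noisy) - laplace(noisy).mean()
    return np.sum(d * n) / np.sqrt(np.sum(d ** 2) * np.sum(n ** 2))
```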

Statistical analysis

The mean and standard deviation of the above metrics were calculated and reported in the results. The one-way ANOVA test, a statistical method that compares the mean values of samples, was conducted to compare the image-derived metrics using IBM SPSS Statistics software Version 26. The p values obtained from the statistical tests are also presented in the results; a p value less than 0.05 was considered significant. In addition, the post hoc Tukey’s Honest Significant Difference (HSD) test was used to assess the significance of differences between pairs of group means.
