Analysis of a deep learning-based method for generation of SPECT projections based on a large Monte Carlo simulated dataset

Figure 1 shows an overview of the experimental setup used in this work. The individual steps will be explained in the following sections.

Fig. 1figure 1

Schematic representation of the experimental setup and analysis

Generation of a training dataset of simulated SPECT projections

Three-dimensional (3D) activity distributions of randomly arranged random shapes were generated as a basis to create a large database of projection datasets for training and assessment of different neural networks (Fig. 2). These voxelized 3D shapes were constructed with a random shape generator based on (i) a sphere perturbed with spherical harmonics [8], (ii) a 3D implementation of the super-formula [9], or (iii) volumetric Perlin noise with randomized parameters [10]. For each simulation, 10 to 25 of these shapes with volumes ranging from 8 to 64 voxels (voxel size 2.4 × 2.4 × 2.4 mm3) were chosen, randomly rotated in 3D, and then placed in a cylindrical water phantom (Jaszczak without inserts, diameter 21.6 cm, length 18.6 cm, volume 6.8 L). The result was a 256 × 256 × 256 binary activity mask that described the presence of activity in each voxel (0: no activity, 1: activity). An illustration of the generation of the binary activity masks is shown in Fig. 1. While the majority of the mask dataset (7500) consisted of these random activity distributions, a few special cases (2500) were added to further expand the diversity of the dataset, as further specified in Table 1. An illustration of the activity mask dataset is shown in Fig. 3. In addition, the graph in the lower right of Fig. 3 shows the distribution of the volumes of all 10,000 activity masks (solid blue curve). To generate realistic projections of the activity masks, MC-based SPECT simulations were performed using the SIMIND MC program [7]. The simulated system was a Siemens Intevo Bold SPECT/CT system with a 9.5-mm crystal, medium-energy collimator (Siemens medium-energy low penetration, MELP), and 9% energy resolution.

Fig. 2figure 2

Illustration of the creation of the binary activity masks. For each binary activity mask, a set of random shapes of different sizes is generated. Afterwards, a random 3D rotation is performed for each shape and it is placed at a random location inside the Jaszczak phantom

Table 1 Composition of the activity mask dataset (total of 10,000 masks)Fig. 3figure 3

Illustration of the activity mask dataset. For each activity mask, the transversal (top) and sagittal sections (bottom) are shown. First row from left to right: hot Jaszczak with cold random shapes, chessboard pattern, rod pattern, cross pattern, stripe pattern, NEMA phantom spheres. Second row from left to right: example random shape, example Fourier series F(x, y, z), random shape with non-uniform activity distribution. The plot shows the distribution of the maximum activity concentration (orange, bottom and left axes) and volumes (blue, top and left axes) of the activity distributions contained in the dataset. The dashed green curve represents the distribution of the maximum activity concentration of 104 peritherapeutic 177Lu SPECT/CT examinations (bottom and right axes)

To increase the realism of the simulated SPECT projections, the binary activity masks were transformed into heterogeneous activity distributions by a voxel-by-voxel multiplication with a spatially contiguous, non-uniform pattern. This function F(x, y, z) was constructed as Fourier series according to

$$F\left( \right) = \sum\limits_^ ^ ^ \frac \right)}} + \varphi_ } \right)}} + j^ + l^ } \right)^ }}} } } ,$$

(1)

where \(g_\) is a random Gaussian distributed number (mean = 0, SD = 1), \(\varphi_\) is a random phase between \(- \pi \) and \(\pi\), and (x, y, z) is the position of the voxel [11]. Here, the parameter β determines how quickly higher frequencies are attenuated, \(p\) is the period of \(F\), and \(M\) is the spatial cutoff frequency. To create a realistic activity distribution within a reasonable calculation time, the following parameters were empirically chosen: β was set to 0.9 according to [11], M was set to 8 to achieve a complex surface in a reasonable calculation time, and the period p was set to 50 voxels, so that the dimension of the activity distribution covered the dimension of the Jaszczak cylinder. After applying this function to the binary activity masks, the results were rescaled to integers between 0 and 100 (100: highest activity concentration, 0: no activity).

The resulting activity distributions served as input for the SPECT simulations with an analytical water-filled cylinder (Jaszczak dimensions) as attenuating and scattering medium. The simulations mimicked a SPECT acquisition with 120 projections of 22 s each, matrix size 128 × 128, pixel size 4.79 × 4.79 mm2, and a 20% main energy window at 208 keV following a non-circular orbit based on the one from a physical phantom measurement of the NEMA phantom. An analytical expression was used for modeling the MELP collimator using the specifications provided by the manufacturer [12], thereby excluding penetration and scattering in the collimator from the simulation. The simulations of all 10,000 projection sets were performed on a local high-performance computing cluster (High Performance Computing Cluster, University of Würzburg). The SIMIND MC program uses a number of variance reduction techniques to speed up simulations [13]. As a consequence, the MC noise in the simulated projections is not representative of the noise of the corresponding physical measurement. Hence, the normal mode of operation is to use a large number of histories in the simulations in order to achieve essentially noise-free projections (i.e., residual MC noise negligible compared with noise in a real measurement), which are then scaled to the desired projection time and activity before adding Poisson-distributed noise. The simulated SPECT projections before and after the addition of Poisson noise will be referred to as noise-free and noisy projections, respectively. The output of the simulations was scaled to maximum activity concentrations between 0.2 MBq/mL and 14 MBq/mL. This distribution was based on the maximum activity concentration of 104 peritherapeutic SPECT/CT examinations, which had been performed as part of 177Lu-PSMA-based endoradiotherapy at University Hospital Würzburg (DICOM Tag (0028,0107): Largest Image Pixel Value). The distributions of maximum activity concentrations of patients (dashed green curve) and heterogeneous activity distributions (solid orange curve) are illustrated in the graph in Fig. 3.

U-net architecture

A u-shaped 3D convolutional neural network (u-net) design was used to generate “synthetic” SPECT projections (output) based on “original” SPECT projections (input). Technically, the u-net was trained to calculate intermediate projections at shifted projection angles (e.g., shifts of 3°, 6°, or 9° for a circular arch of 12° between two projections in the input dataset, used to generate 3 × 30 projections from an input dataset of 30 projections) with respect to the original projections. For better readability, the technically imprecise terms “rotation of the projections” and “rotated projections” will be used when referring to these intermediate projections. As an example, u-net U1 was trained to rotate 60 original projections by 3° to generate 60 synthetic projections (0°, 6°, …, 354° → 3°, 9°, …, 357°).

The u-nets were based on the fastMRI architecture [14] and were implemented using the PyTorch library [15]. The network consists of two main components: a down-sampling path that compresses image information and acts as an encoder, and an up-sampling path that reconstructs the image from the compressed data and thus acts as a decoder. Both paths consist of four convolutional blocks, each executing two 3 × 3 × 3 3D convolutions with instance normalization and leaky rectified linear unit (leaky ReLU) activation function. In the encoder path, the number of channels is doubled after each convolutional block, whereas in the decoder path, the number of channels is halved. Skip connections between opposing blocks of the two paths ensure that the decoder can reconstruct an image using fine-grained features learned in the encoder phase. After each convolutional block, the image size in each spatial dimension is halved using a max-pool operation with stride 2 in the encoder phase. In contrast, the image size in each spatial dimension is doubled after each convolutional block in the decoder phase using transposed convolutions with kernel size 2 × 2 × 2 and stride 2. At the end of the up-sampling path, two 1 × 1 × 1 convolutions with ReLU activation function are performed to reduce the number of channels to one while maintaining the size of the image. An illustration of the u-net’s architecture can be found in Additional file 1.

Generation of a set of u-nets trained with different parameters

To understand the underlying mechanisms of the generation of synthetic projections by the u-net (with and without rotation), seven different u-nets were trained (U1–U7). Their properties such as the number of input and output projections, the size of the training and validation datasets, the Poisson noise realization used, and by which angle \(\theta\) the output projections were rotated relative to the input projections are listed in Table 2. As an example, u-net U1 was trained after adding Poisson noise realization A to 9,500 noise-free datasets. As described above, training was performed based on 60 input and output projections with a rotation of 3° between each input and output. To force a cyclical projection dataset, the first and last projections were added to the back and the front of the projection series, respectively, until the integer power of 2 was reached (e.g., the first 2 and the last 2 of 60 available projections [1, 2, …, 59, 60] were added at the front and at the back, respectively, to reach a total of 64 projections [59, 60, 1, 2, …, 59, 60, 1, 2]). This was done to ensure that there is an original projection on both sides of all intermediate projections to be generated, thereby avoiding extrapolation. Prior to being used as input for the u-net, each projection dataset was normalized to an interval between 0 and 1 by dividing each voxel by the maximum voxel value of the respective input projections. After applying the u-net, the projections were rescaled with the same value. The dataset was then separated into training and validation datasets of sizes 9000 and 500, respectively. Training was performed for 60 epochs using an Adam optimizer [16], a mini-batch size of 5 and an L1 loss function. The initial learning rate was set to 7 × 10−5, which was halved every 20 epochs. After every epoch, the mean L1 loss on the validation dataset was calculated and the network weights with the lowest validation loss were saved.

Table 2 Overview over all u-nets trained in this studyInfluence of noise

To investigate the influence of noise on the u-net performance, two different u-nets U1 and U2 with different Poisson noise realizations A and B, created by using different seeds for the random number generator, were trained (Table 2).

Influence of the size of the training dataset

To examine the effect of the amount of training data on the performance of the u-net, a third u-net, U3, was trained (Table 2), where the sizes of the training/validation datasets were similar to the training/validation dataset size of 352/37 used by Rydén et al. [6]. This was achieved by selecting 400/40 projection datasets from the training/validation datasets of u-net U1. The proportions of the different types of activity masks (Table 1) were retained. To ensure convergence of the training, the number of epochs was increased to 200 and a linearly decreasing learning rate from 1.2 × 10−5 to 0.8 × 10−5, as described in [6], was selected.

Influence of the number of input projections and rotation angle

To assess whether the number of input projections has an impact on the performance of the methodology, three additional u-nets were trained (U4, U5, and U6). Each of these u-nets generates 30 intermediate projections at differently shifted projection angles (3°, 6°, and − 3°) with respect to the 30 input projections.

Analysis of the u-net for denoising the projections

To determine whether and to what extent the synthetically created projections of u-nets U1 to U6 are denoised, an additional u-net U7 was trained. This network was trained to create 120 noise-free output projections from 120 noisy input projections, i.e., without shifting the projection angles between input and output (“no rotation”).

Evaluation criteria for quantifying the u-net performance

For all trained u-nets, a quantitative analysis was performed based on the same test dataset consisting of 500 projection sets with noise realization A. To determine the agreement between synthetic (e.g., projections for 3°, 9°, …, 357° for u-net U1) and ground-truth projections (i.e., acquired for the same angles as the output projections), the structural similarity index measure (SSIM) [17] and normalized root-mean-squared error (NRMSE) were calculated for all test data. The synthetic projections (i.e., generated by the u-net) were compared to both the noise-free and the noisy projections to assess and quantify the denoising effect of the methodology. Each comparison was made based on statistical hypothesis tests between the SSIM or NRMSE values of two u-nets to be compared. Since the NRMSE values are normally distributed (Shapiro–Wilk test), a paired two-sided t test was performed. For the non-normally distributed SSIM values, a paired two-sided Wilcoxon signed-rank test was performed. The following sections describe the aspects investigated with the various u-nets (see Table 2).

Influence of the detector orbit

All projection data used for training the u-nets were simulated with identical detector orbits. Therefore, it could be suspected that the generation of synthetic projections works only for SPECT images with that specific detector orbit. To determine the impact of the detector orbit on the performance of the u-nets, four additional MC simulations were run for each activity distribution in the test dataset. For these simulations, the distance of the detector at all angular positions was increased by 2 cm, 4 cm, 8 cm, and 16 cm, respectively, compared to the original NEMA detector orbit. To get an impression of the extent of detector orbit variations in real patient examinations, the orbits of a total of 436 177Lu-PSMA-based SPECT/CT examinations performed at University Hospital Würzburg were analyzed. For each angle, both the mean and the maximum distance of the gamma camera to the center were determined. The results are shown in Fig. 4 together with the NEMA detector orbit and its expanded versions (4 cm and 16 cm). The mean and maximum patient detector orbits have similar projection distances as the NEMA orbit enlarged by 4 cm and 16 cm, respectively.

Fig. 4figure 4

CT of NEMA body phantom placed on patient bed with different gamma camera orbits. The orange contour illustrates the original NEMA detector orbit, which is used for the simulations of all activity distributions in the dataset. The green and the red contours represent two expanded versions of the NEMA detector orbit, used for analyzing the influence of deviations in detector orbit between training and input data on the u-net performance. The blue and purple contours represent the mean and the maximum (maximum distance for every angle) patient detector orbits for all 436 SPECT/CT scans

Based on each of these four new detector orbits, synthetic projections were generated using u-net U1, which had been trained with data from the original orbit (i.e., without radial expansion). These synthetic projections were compared with the corresponding ground-truth projections (i.e., noisy simulations for the respective detector orbit).

Physical phantom-based verification of the simulation-based findings

Two phantom measurements were performed with the same clinical SPECT/CT system that had been used as template for the MC simulations. In the first measurement, a large Jaszczak cylinder without inserts was filled with a uniform 177Lu stock solution (activity concentration 59.9 kBq/mL). The second measurement was a water-filled NEMA phantom equipped with six 177Lu-filled spheres (activity concentration 1.99 MBq/mL). Both experiments were performed using a MELP collimator, 180° head configuration, auto-contouring, continuous mode, 60 views, 30 s per view, 128 × 128 matrix, and a 20% energy window around the 208 keV photopeak. After each SPECT acquisition, two low-dose CT scans were acquired (tube voltage 130 kV, 26–30 mAs, 1.0 × 1.0 mm2 in-plane pixel size, 1.5 pitch): in addition to a standard low-dose CT acquisition for attenuation correction (3.0 mm slice thickness), a high-resolution low-dose CT (1.0 mm slice thickness) was acquired for determining the phantom positioning.

The measurements were replicated in SIMIND as follows: The centers of the spheres of the NEMA phantom and the filled Jaszczak cylinder were determined using the high-resolution CT. Two simulations were then performed in SIMIND using the known dimensions of both phantoms (diameter of the NEMA spheres: 10, 13, 17, 22, 28, 37 mm; height and diameter of the Jaszczak cylinder: 186 and 216 mm). Attenuation and scatter were simulated based on the CT images of the physical phantom measurements: First, the attenuation CT was scaled by linear interpolation to the standard resolution of the simulations performed (256 × 256 matrix, 2.4 × 2.4 × 2.4 mm3 voxel size). Hounsfield units were then converted to mass density using a two-segment linear function according to Schneider et al. [18]. SIMIND simulations of both activity distributions were performed as described before, with the detector orbit adjusted to the actual non-circular orbits of the physical phantom measurements.

All SPECT/CT reconstruction in this work was performed based on 120 projections using OS-EM with 6 iterations and 8 subsets, employing compensation for attenuation and scatter using the ESSE method [19]. To convert the reconstructed counts into activity concentration, an image calibration factor (unit: counts-per-second-per-Megabecquerel) was determined as described in [20] based on the physical SPECT/CT measurement of the Jaszczak phantom described above. For these reconstructions, additional metrics were used to quantify the image quality. For the reconstructions of the Jaszczak phantom, the signal-to-noise ratio (SNR) was calculated in a cubic VOI (1519 mL) inside the cylinder:

where \(\overline\) is the mean activity concentration in the VOI and \(\sigma_}}\) is the standard deviation of the voxel-to-voxel activity concentrations within the VOI. For the reconstructions of the NEMA phantom, the recovery, defined as the SPECT-derived activity in the spheres divided by the activity derived at phantom preparation, was calculated for all six spheres. The SPECT-based activity in each sphere was calculated by multiplying interpolated SPECT/CT reconstructions (nearest-neighbor interpolation, 256 × 256 matrix) with a binary mask (256 × 256 matrix), which was created using the known positions and dimensions of the spheres inside the phantom.

留言 (0)

沒有登入
gif