Convolutional neural networks for automatic image quality control and EARL compliance of PET images

Datasets

Training and cross-validation

The dataset used for training and cross-validating the algorithm consists of 96 images from cancer patients acquired on three PET/CT systems: 36 images acquired on a Philips Gemini (Philips Medical Systems, Best, The Netherlands), 30 images on a Siemens Biograph mCT40 (Siemens, Knoxville, TN, USA), and 30 images on a General Electric Discovery system (General Electric, Boston, Massachusetts, USA). The training data included images from 17 lymphoma and 79 lung cancer patients. All images were reconstructed with (1) the locally clinically preferred, (2) EARL1, and (3) EARL2 compliant reconstruction settings and with a 120 s scan duration per bed position. The exact reconstruction settings for each scanner are displayed in Table 1. All data included in this study were taken from clinical routine. The use of the data was approved by the Institutional Medical Ethics Committees (case number VUMC 2018.029, UMCG 2017/489). All data were fully anonymized.

Table 1 Reconstruction settings for the different scanner types

Independent testing datasets

To test the CNN performance, 30 images prospectively acquired on the Siemens Biograph mCT40 were used. These images were also reconstructed with clinical, EARL1, and EARL2 compliant reconstruction settings.

Moreover, 24 images acquired on a scanner that was not included in the training data (Siemens Biograph Vision) were used for independent external validation. This dataset included 9 lung cancer, 7 lymphoma, and 8 head and neck cancer patients. To determine whether image noise has an impact on the CNN decision, all images acquired on the Biograph Vision were reconstructed with 30 s, 60 s, 120 s, and 180 s scan duration. The different scan durations were chosen to assess whether the networks would also perform well on images acquired in a hospital that scans its patients with scan durations other than 120 s per bed position.

Training and validation of the CNN

Data preparation and data analysis were performed with Python 3.4. All implemented code used in this study can be found on Zenodo (https://doi.org/10.5281/zenodo.5540390).

Data preparation and augmentation

First, all images were normalized to SUV units. Next, before the images were used for training or validation, they were converted to ‘edge images’. For this purpose, all images were blurred with a Gaussian kernel of 6 mm full-width-at-half-maximum (FWHM), and the blurred image was subtracted from the original image. The resulting ‘edge image’ emphasizes the edges of high-intensity areas while minimizing scanner-specific noise. Because these ‘edge images’ emphasize the resolution of an image, they resulted in higher training and validation accuracies than the original PET images (Additional file 1, Tables 1 and 2). An example is displayed in Fig. 1. The edge images were used for training, cross-validation, and independent validation of the CNNs.
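As an illustration of this preprocessing step, the snippet below sketches how such an edge image could be generated with scipy; the function name and the use of scipy.ndimage.gaussian_filter are our own choices and are not taken from the published code on Zenodo.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def make_edge_image(suv_image, voxel_size_mm=3.0, fwhm_mm=6.0):
    """Create an 'edge image' by subtracting a Gaussian-blurred copy
    from the original SUV-normalized PET image (illustrative sketch).

    suv_image     : 3D numpy array in SUV units
    voxel_size_mm : isotropic voxel size after resampling (3 mm here)
    fwhm_mm       : FWHM of the Gaussian kernel (6 mm in this study)
    """
    # Convert the FWHM (mm) into the Gaussian sigma (in voxels) expected by scipy
    sigma_voxels = (fwhm_mm / (2.0 * np.sqrt(2.0 * np.log(2.0)))) / voxel_size_mm
    blurred = gaussian_filter(suv_image, sigma=sigma_voxels)
    # Edges of high-intensity areas remain; smooth background and
    # scanner-specific noise are largely suppressed
    return suv_image - blurred
```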

Table 2 Training and cross-validation accuracy for the first CNN trained to separate clinical and EARL compliant reconstructions

Fig. 1 Original and edge-enhanced images for EARL1 and EARL2 compliant reconstructions from the three PET systems included in the training dataset

Before the conversion to edge images, all images were resampled to a cubic voxel size of 3 mm. After resampling, the images were cropped or expanded to an image size of 300 × 200 pixels. Images smaller than 300 × 200 were padded with a constant value of 0. Larger images were cropped by randomly choosing an image part of the required size: the row and column offsets of the crop were randomly assigned to lie between 0 and (original image height − 300) or (original image width − 200), respectively. This random cropping was performed to ensure that different body parts were present in the different images.
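A minimal sketch of this padding/cropping step is given below, assuming 2D sagittal slices stored as numpy arrays; the function and variable names are illustrative and do not correspond to the published implementation.

```python
import numpy as np

TARGET_ROWS, TARGET_COLS = 300, 200  # target slice size in pixels

def pad_or_random_crop(slice_2d, rng=np.random):
    """Pad with zeros or randomly crop a 2D slice to 300 x 200 pixels."""
    rows, cols = slice_2d.shape

    # Pad with a constant value of 0 if the slice is smaller than the target
    pad_rows = max(TARGET_ROWS - rows, 0)
    pad_cols = max(TARGET_COLS - cols, 0)
    if pad_rows or pad_cols:
        slice_2d = np.pad(slice_2d, ((0, pad_rows), (0, pad_cols)),
                          mode='constant', constant_values=0)
        rows, cols = slice_2d.shape

    # Randomly crop if the slice is larger than the target, so that
    # different body parts are present in the cropped images
    row_off = rng.randint(0, rows - TARGET_ROWS + 1)
    col_off = rng.randint(0, cols - TARGET_COLS + 1)
    return slice_2d[row_off:row_off + TARGET_ROWS,
                    col_off:col_off + TARGET_COLS]
```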

In order to avoid overfitting and to increase the number of training images, data augmentation was performed. Data augmentation included zooming the image (i.e., making it up to 10% larger or smaller), flipping the image in either the vertical or horizontal direction, and shifting the image by 10% of its height or width.
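One possible configuration of this augmentation, consistent with the settings described above, is sketched below using the Keras ImageDataGenerator; the exact augmentation implementation used in the study is part of the code published on Zenodo, and the fill mode chosen here is an assumption.

```python
from keras.preprocessing.image import ImageDataGenerator

# Augmentation roughly matching the description above: 10% zoom,
# horizontal/vertical flips, and 10% height/width shifts.
augmenter = ImageDataGenerator(
    zoom_range=0.1,          # make the image up to 10% larger or smaller
    horizontal_flip=True,    # flip in horizontal direction
    vertical_flip=True,      # flip in vertical direction
    width_shift_range=0.1,   # shift by up to 10% of the image width
    height_shift_range=0.1,  # shift by up to 10% of the image height
    fill_mode='constant', cval=0.0)

# Example usage (x_train with shape (n_slices, 300, 200, 1)):
# train_generator = augmenter.flow(x_train, y_train, batch_size=20)
```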

CNN architecture and training details

The CNNs were trained using the Keras library version 2.2.4 with TensorFlow backend. The trained CNNs, some example images, and a manual on how to train and validate the CNNs can be found on Zenodo. In this study, a 2D CNN is trained to classify single image slices (more details see below). The CNN architecture is displayed in Fig. 2. It consists of one convolutional block followed by a dropout layer and two dense layers. The convolutional block contains a convolutional layer with a kernel size of 3, a stride of 2, and 8 filters. The initial weights of the convolutional layer are drawn from a normal distribution. The dropout percentage of the dropout layer is set to 0.6 to avoid overfitting and to increase the generalizability of the CNN. The output of the dropout layer is flattened and fed into a dense layer with 8 units and a ReLU activation function. The second and last dense layer contains 2 units and a softmax activation function to perform the classification. The CNN is trained for at most 25 epochs with a batch size of 20. CNNs learn the features that are important for the classification task automatically. To obtain a generalizable CNN that can be applied to data not included in the training process, it is essential that the network learns only features representing the specific task and not other details such as scanner-specific noise. In order to improve the generalizability, training was stopped when the accuracy on the training set did not improve for 5 epochs. The CNN leading to the best validation accuracy during training is saved and used for further analysis.

Fig. 2 Workflow of the CNN used in this study: the convolutional block consists of a convolutional layer followed by a LeakyReLU layer with alpha set to 0.2. The convolutional layer contains 8 filters. The dropout percentage is set to 0.6. The first dense layer consists of 8 units, the second of 2 units
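Assuming the architecture maps onto standard Keras layers, a minimal sketch of the model and training configuration could look as follows. Layer sizes, dropout, epochs, batch size, and early stopping follow the text and Fig. 2, while the optimizer, loss function, and file names are illustrative assumptions; the exact configuration is contained in the code published on Zenodo.

```python
from keras.models import Sequential
from keras.layers import Conv2D, LeakyReLU, Dropout, Flatten, Dense
from keras.initializers import RandomNormal
from keras.callbacks import EarlyStopping, ModelCheckpoint

def build_cnn(input_shape=(300, 200, 1)):
    """Sketch of the architecture described above and in Fig. 2."""
    model = Sequential([
        # Convolutional block: 8 filters, kernel size 3, stride 2,
        # normally distributed initial weights, LeakyReLU (alpha = 0.2)
        Conv2D(8, kernel_size=3, strides=2,
               kernel_initializer=RandomNormal(),
               input_shape=input_shape),
        LeakyReLU(alpha=0.2),
        Dropout(0.6),                     # reduce overfitting
        Flatten(),
        Dense(8, activation='relu'),
        Dense(2, activation='softmax'),   # two-class output
    ])
    # Optimizer and loss are assumptions, not taken from the paper
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# Training: at most 25 epochs, batch size 20, stop when the training
# accuracy does not improve for 5 epochs, keep the best model.
callbacks = [EarlyStopping(monitor='acc', patience=5),
             ModelCheckpoint('best_cnn.h5', monitor='val_acc',
                             save_best_only=True)]
# model.fit(x_train, y_train, epochs=25, batch_size=20,
#           validation_data=(x_val, y_val), callbacks=callbacks)
```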

To identify EARL compliant images, two CNNs with the described architecture were trained. Two separate CNNs were used because a single CNN with three outcomes (clinical, EARL1, and EARL2) resulted in worse performance (see Additional file 1, Table 3). The first CNN was trained to separate images meeting EARL standards (EARL1 and EARL2) from images that are not EARL compliant. For training this CNN, the clinical reconstructions of all three scanners as well as the EARL2 compliant images from the GE and Siemens systems and the EARL1 compliant images from the Philips system were used. This subselection was performed to avoid data imbalance. The second CNN was trained to determine whether an image identified as EARL compliant meets the EARL1 or the EARL2 standard. For this task, EARL1 and EARL2 compliant images of all scanners were used for training.

Table 3 Training and cross-validation accuracy for the second CNN trained to classify EARL1 and EARL2 compliant reconstructions

Slice selection

In the present study, only sagittal slices were used for classification as they showed better results in initial experiments than axial or coronal slices (data not shown). For the first CNN, trained to separate clinical and EARL compliant images, the ten slices in the middle of the patient were chosen: the middle of the image was determined, and the five slices to its left and the five slices to its right were included in the training dataset. For the second CNN, trained to separate EARL1 and EARL2 compliant images, slices with a high gradient (i.e., with high uptake) were chosen. For this purpose, from the forty slices in the middle of the patient, the ten slices with the highest edge intensity values were selected. In both cases, the mean probability value of the ten slices was calculated and used for the final classification.
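The following sketch illustrates how such a slice selection could be implemented on the edge-image volumes; the axis convention (sagittal slices along the first axis) and the use of summed edge intensity as the per-slice score are assumptions, and the exact criterion is defined in the code published on Zenodo.

```python
import numpy as np

def middle_slices(edge_volume, n_slices=10):
    """CNN 1: the n_slices sagittal slices around the middle of the patient.
    edge_volume is assumed to have shape (sagittal, rows, cols)."""
    mid = edge_volume.shape[0] // 2
    return edge_volume[mid - n_slices // 2: mid + n_slices // 2]

def high_gradient_slices(edge_volume, n_candidates=40, n_slices=10):
    """CNN 2: from the n_candidates slices around the middle, select the
    n_slices with the highest edge intensity (i.e., high uptake)."""
    mid = edge_volume.shape[0] // 2
    candidates = edge_volume[mid - n_candidates // 2: mid + n_candidates // 2]
    # Summed edge intensity per slice as a simple measure of edge content
    scores = candidates.reshape(candidates.shape[0], -1).sum(axis=1)
    best = np.argsort(scores)[-n_slices:]
    return candidates[best]
```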

As described above, images were randomly cropped before they were used as input to the network. To assess whether the random crop had an impact on the CNN results, ten randomly selected images were analyzed using five different random crops. The mean, standard deviation, and coefficient of variation (COV, defined as the ratio of the standard deviation to the mean value) of the resulting CNN probabilities were calculated and compared.
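A small, hypothetical helper illustrating this variability analysis is given below; the function name and the example values are purely illustrative.

```python
import numpy as np

def crop_variability(probabilities):
    """Mean, standard deviation, and COV of CNN probabilities obtained
    for one image with different random crops (hypothetical helper)."""
    p = np.asarray(probabilities, dtype=float)
    mean, std = p.mean(), p.std()
    return mean, std, std / mean  # COV = standard deviation / mean

# Example: probabilities of one image analyzed with five random crops
# mean, std, cov = crop_variability([0.91, 0.93, 0.90, 0.92, 0.94])
```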

Model performance

For training and cross-validating the CNNs, the training dataset was split into five parts such that each part contained a comparable number of images from each scanner. Fivefold cross-validation was performed so that each part served once as the validation dataset. The performance of the trained CNNs was evaluated by analyzing the prediction accuracy. As the softmax layer of the CNN expresses the certainty of its decision as a probability (1 for a completely certain decision, 0.5 for an uncertain decision), the mean probability of the ten image slices was used for the final classification. For the first CNN, all images with a mean probability of 0.5 or higher of being clinical/EARL compliant were considered clinical/EARL compliant, respectively. For the second CNN, images with a mean probability of 0.4 or higher were considered EARL2 compliant, while images with a mean probability of 0.6 or higher were considered EARL1 compliant.
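Putting the two networks and these thresholds together, the final two-stage decision rule could be sketched as follows; the function signature and the assumed order of the softmax outputs are our own choices and not taken from the published code.

```python
def classify_image(cnn1, cnn2, slices_cnn1, slices_cnn2):
    """Two-stage decision rule sketched from the thresholds above.
    slices_cnn1 / slices_cnn2: preprocessed slice stacks of shape
    (10, 300, 200, 1); column 1 of the softmax output is assumed to be
    the EARL (stage 1) or EARL2 (stage 2) class."""
    # Stage 1: mean probability over ten slices of being EARL compliant
    p_earl = cnn1.predict(slices_cnn1)[:, 1].mean()
    if p_earl < 0.5:
        return 'clinical'

    # Stage 2: mean probability over ten slices of being EARL2 compliant;
    # >= 0.4 -> EARL2, otherwise (i.e., p(EARL1) >= 0.6) -> EARL1
    p_earl2 = cnn2.predict(slices_cnn2)[:, 1].mean()
    return 'EARL2' if p_earl2 >= 0.4 else 'EARL1'
```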

Last, the CNNs were trained on the whole training dataset and then applied to the independent validation datasets.
