Detection of pulmonary nodules in chest radiographs: novel cost function for effective network training with purely synthesized datasets

Artificial image creation, nodule creation and embedding

The methods for generating artificial normal chest X-ray images, creating artificial nodules, and embedding them were the same as those applied in our previous study [20]. Normal chest X-ray images were generated using the Glow algorithm [21], a flow-based generative model that is more robust against mode collapse than the more popular GAN-based methods. The Glow model was trained using a combination of 27,504 normal cases from the ChestX-ray14 dataset [22] and 18,304 domestic normal chest radiographs from the University of Tokyo Hospital. In this study, however, nodules were not created by generative deep-learning models, unlike in most related studies. Instead, each initial 3-D nodule shape was created as a simple union of overlapping spheres using our in-house, model-free algorithm. After the nodules were embedded in the lung field using a 3-D to 2-D projection that simulated an X-ray imaging system, the embedded images were modified using a latent-space interpolation technique, giving each nodule a more natural and indistinct appearance. In latent-space interpolation, each embedded image was first mapped to the latent vector space using the Glow algorithm. The latent representations of the embedded and original images were then interpolated using a random interpolation ratio, and the interpolated representation was mapped back to the image space to create the final interpolated image.
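The interpolation step can be sketched as follows; `encode` and `decode` below are hypothetical placeholders for the forward and inverse mappings of a trained Glow model (identity maps here, purely to illustrate the data flow):

```python
import numpy as np

# Sketch of the latent-space interpolation step. `encode` and `decode` are
# hypothetical stand-ins for the forward and inverse mappings of a trained
# Glow model; identity placeholders are used here only to show the data flow.

def encode(image):
    return np.asarray(image, dtype=np.float64)  # placeholder: image -> latent

def decode(z):
    return z  # placeholder: latent -> image

def interpolate_embedding(embedded_img, original_img, rng):
    """Blend a nodule-embedded image toward its original in latent space
    using a random interpolation ratio, as described above."""
    z_emb = encode(embedded_img)
    z_orig = encode(original_img)
    alpha = rng.uniform(0.0, 1.0)  # random interpolation ratio
    z_mix = alpha * z_emb + (1.0 - alpha) * z_orig
    return decode(z_mix)
```

With a real Glow model, the nonlinearity of `encode`/`decode` is what gives the blended nodule its natural, indistinct appearance; with the identity placeholders the result is an ordinary pixel-space blend.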

Because our training datasets were purely artificial, we created an open dataset containing 131,072 sets of positive and corresponding negative chest X-rays with pixel-level ground truth labels. This dataset is available at https://zenodo.org/records/10648433.

The baseline nodule detection system with U-Net and Dice loss

First, a simple U-Net framework was formulated for object detection. Suppose the image domain (the set of all pixel positions) of the given X-ray images is \(\Omega\). Let a given pixel position in an image be represented as \(\mathbf{x}\), where \(\mathbf{x}\in\Omega\). Suppose that \(I\) is the image function of the input image; i.e., \(I(\mathbf{x})\) represents the intensity of the pixel indicated by \(\mathbf{x}\). Consider a U-Net (or other image-to-image network) that receives an input \(I(\mathbf{x})\) and outputs the pixel-wise likelihood of lesions \(f(\mathbf{x})\). Suppose that we want to detect lung nodules (or any other local lesions) by \(f\). Let the range of \(f\) be \(f\left(\mathbf{x}\right)\in \left[0,+\infty \right)\) and let the resulting (detected) pixel set be \(C=\left\{\mathbf{x}\mid f\left(\mathbf{x}\right)>0\right\}\). Finally, the resulting output region \(C\) is divided into individual candidate regions \(C_1, C_2, \dots, C_j, \dots\) using connected component analysis.

During the evaluation phase, the likelihood of each candidate \(C_1, C_2, \dots, C_j, \dots\) must also be determined. In this study, we simply defined the likelihood \(L_j\) as the maximum output of the U-Net in the candidate region \(C_j\); that is, \(L_j = \max_{\mathbf{x} \in C_j} f(\mathbf{x})\).
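As an illustrative sketch (not the authors' implementation), candidate extraction and likelihood assignment could be written with a simple 4-connected flood fill:

```python
import numpy as np
from collections import deque

def candidate_likelihoods(f):
    """Split the detected set {f > 0} into 4-connected candidate regions C_j
    and return L_j = max_{x in C_j} f(x) for each region, in scan order."""
    mask = f > 0
    visited = np.zeros_like(mask, dtype=bool)
    likelihoods = []
    h, w = mask.shape
    for i in range(h):
        for j in range(w):
            if mask[i, j] and not visited[i, j]:
                # Flood-fill one connected component, tracking its peak output.
                best = 0.0
                queue = deque([(i, j)])
                visited[i, j] = True
                while queue:
                    y, x = queue.popleft()
                    best = max(best, float(f[y, x]))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not visited[ny, nx]:
                            visited[ny, nx] = True
                            queue.append((ny, nx))
                likelihoods.append(best)
    return likelihoods
```

In practice a library routine such as `scipy.ndimage.label` would replace the hand-written flood fill; the explicit version is shown only to make the definition of \(L_j\) concrete.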

To train the U-Net, a loss function must be minimized. The Dice loss function is well known for segmentation. Suppose that the binary label of the real (ground-truth) lung nodule is \(t\left(\mathbf{x}\right)\), where \(t\left(\mathbf{x}\right)\in \left\{0,1\right\}\), \(\mathbf{x}\in\Omega\). In this case, suppose that \(f\left(\mathbf{x}\right)\in [0,1]\). The Dice loss for a given input is expressed as follows:

$$\mathcal{L}_{\mathrm{Dice}}\left(f,t\right)=-\frac{2\sum_{\mathbf{x}\in\Omega }f\left(\mathbf{x}\right)\cdot t\left(\mathbf{x}\right)+\epsilon }{\sum_{\mathbf{x}\in\Omega }f\left(\mathbf{x}\right)+\sum_{\mathbf{x}\in\Omega }t\left(\mathbf{x}\right)+\epsilon }$$

(1)

where \(\epsilon\) is a small positive constant (e.g., 1) to avoid division by zero. The total loss is the sum of \(\mathcal{L}_{\mathrm{Dice}}\left(f,t\right)\) over all training images. If the real nodule region is defined as \(R=\left\{\mathbf{x}\mid t\left(\mathbf{x}\right)=1\right\}\), this loss can be rewritten as:

$$\mathcal{L}_{\mathrm{Dice}}\left(f,t\right)=-\frac{2\sum_{\mathbf{x}\in R}f\left(\mathbf{x}\right)+\epsilon }{\left(\sum_{\mathbf{x}\in\Omega }f\left(\mathbf{x}\right)\right)+\left(\sum_{\mathbf{x}\in R}1\right)+\epsilon }$$

(2)

Although the Dice loss works well for positive cases (i.e., \(R\ne \varnothing\)), it cannot learn negative cases effectively. Substituting \(R=\varnothing\) into Eq. (2) yields the following:

$$\mathcal{L}_{\mathrm{Dice}}\left(f,t\right)=-\frac{\epsilon }{\left(\sum_{\mathbf{x}\in\Omega }f\left(\mathbf{x}\right)\right)+\epsilon }$$

(3)

The magnitude of Eq. (3) is usually very small and is largely governed by the non-essential constant \(\epsilon\). Consequently, the Dice loss is ineffective for learning negative samples. In addition, with the Dice loss, the pairwise relationship between a positive example and its corresponding negative example cannot be exploited (Fig. 1).
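A minimal NumPy sketch of Eq. (1) makes this degeneracy concrete (illustrative only, not the training code):

```python
import numpy as np

def dice_loss(f, t, eps=1.0):
    """Dice loss of Eq. (1); f in [0, 1] per pixel, t a binary label map."""
    inter = float(np.sum(f * t))
    return -(2.0 * inter + eps) / (float(np.sum(f)) + float(np.sum(t)) + eps)

# Positive case: a perfect prediction attains the minimum value -1.
t = np.zeros((8, 8)); t[2:4, 2:4] = 1.0
print(dice_loss(t, t))          # -1.0

# Negative case (R empty): Eq. (3), i.e., -eps / (sum(f) + eps), a value
# dominated by the non-essential constant eps rather than the FP response.
f_fp = np.zeros((8, 8)); f_fp[5:7, 5:7] = 1.0   # four false-positive pixels
print(dice_loss(f_fp, np.zeros((8, 8))))        # -0.2
```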

Fig. 1 Outline of the nodule creation and embedding method

Negative case max output suppression (Necmos) loss

First, the simple use of negative cases to reduce false positives (FPs) is proposed (Fig. 2). Suppose that \(I_+(\mathbf{x})\) and \(I_-(\mathbf{x})\) are the corresponding positive and negative images, and let their U-Net outputs be \(f_+(\mathbf{x})\) and \(f_-(\mathbf{x})\), respectively. Let the ranges of \(f_+(\mathbf{x})\) and \(f_-(\mathbf{x})\) be \(\left[0,+\infty \right)\). Then, the following new term is added to the Dice loss:

Fig. 2 Outline of the proposed losses in the training phase

$$\mathcal{L}_{\mathrm{Necmos}}\left(f_-\right)=\underset{\mathbf{x}\in\Omega }{\max}\,f_-(\mathbf{x})$$

(4)

This term is named the negative case maximum output suppression loss, or Necmos loss. Its simple design effectively reduces the number of FPs, as described below. Note that in a negative example, all detection results are FPs. Recall also that our detector outputs the maximum value of \(f\), i.e., \(L_j = \max_{\mathbf{x} \in C_j} f(\mathbf{x})\), as the likelihood of each individual candidate region \(C_j\). In the negative case, \(\max_{\mathbf{x} \in \Omega} f_-(\mathbf{x})\) is therefore the only factor that determines whether the number of FPs (nFPs) is zero or nonzero: nFPs is zero if and only if \(\max_{\mathbf{x} \in \Omega} f_-(\mathbf{x}) \le 0\), and nonzero if and only if \(\max_{\mathbf{x}\in\Omega} f_-(\mathbf{x})>0\). In this sense, the design of the Necmos loss is intrinsic to FP suppression.
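In code, the Necmos term and its link to the FP count can be sketched as follows (an illustrative NumPy version, not the training implementation):

```python
import numpy as np

def necmos_loss(f_neg):
    """Necmos loss of Eq. (4): the single largest U-Net response on the
    lesion-free image. With outputs in [0, +inf), it is zero exactly when
    the negative image produces no detection, i.e., when nFPs = 0."""
    return float(np.max(f_neg))

f_neg = np.zeros((8, 8))
assert necmos_loss(f_neg) == 0.0   # no response anywhere -> no FP candidates

f_neg[3, 3] = 0.4                  # one spurious response
assert necmos_loss(f_neg) > 0.0    # nFPs is nonzero iff the maximum is > 0
```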

Normal-abnormal contrastive (Nac) loss

Second, another loss function was added that effectively enables the U-Net to learn from positive-negative pairs (Fig. 2). This is named the normal-abnormal contrastive (Nac) loss.

Recall that the true nodule region is \(R=\left\{\mathbf{x}\mid t\left(\mathbf{x}\right)=1\right\}\). The Nac loss is then defined as follows:

$$\mathcal{L}_{\mathrm{Nac}}\left(f_+,f_-,t\right)=-\min\left\{1,\left(\underset{\mathbf{x}\in R}{\max}\,f_+\left(\mathbf{x}\right)\right)-\left(\underset{\mathbf{x}\in R}{\max}\,f_-\left(\mathbf{x}\right)\right)\right\}$$

(5)

Here, \(\max_{\mathbf{x} \in R} f_+\left(\mathbf{x}\right)\) and \(\max_{\mathbf{x} \in R} f_-\left(\mathbf{x}\right)\) are the maximum U-Net outputs within the true lesion region \(R\) in the positive and the corresponding negative cases, respectively. The difference between these two values is clipped to the range \(\left(-\infty ,1\right]\) and then negated. It measures the discrepancy between the positive and negative images at a diseased lesion. Clipping to \(\left(-\infty ,1\right]\) is required to avoid too large a value of \(f_+\left(\mathbf{x}\right)\).

The design of \(\mathcal{L}_{\mathrm{Nac}}\) is justified as follows. Suppose that the positive image and the corresponding negative image of the same case are processed as individual images with the same U-Net. Assume that there is only one true lesion in the positive image. Suppose also that, in both the positive and negative cases, one of the detected candidate regions \(C_j\) is equal to the true disease region \(R\). Then, \(\max_{\mathbf{x} \in R} f_+\left(\mathbf{x}\right)\) becomes the likelihood of a particular (true positive) candidate in the positive image. Similarly, \(\max_{\mathbf{x} \in R} f_-\left(\mathbf{x}\right)\) becomes the likelihood of a certain (false positive) candidate in the negative image. Next, suppose that the results are analyzed using a free-response receiver operating characteristic (FROC) curve and the area under the FROC curve (AUC) among cases, including the target case. Under these assumptions and conditions, the FROC curve (and the AUC value) changes when the ordering of \(\max_{\mathbf{x} \in R} f_+\left(\mathbf{x}\right)\) and \(\max_{\mathbf{x} \in R} f_-\left(\mathbf{x}\right)\) is reversed. In other words, the AUC value changes when the sign of \(\left(\underset{\mathbf{x}\in R}{\max}\,f_+\left(\mathbf{x}\right)\right)-\left(\underset{\mathbf{x}\in R}{\max}\,f_-\left(\mathbf{x}\right)\right)\) changes. The Nac loss can thus be understood as an approximate evaluation of this reversal (the sign change of the difference of the maximum U-Net outputs in \(R\)).
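Eq. (5) can be sketched as follows (illustrative only):

```python
import numpy as np

def nac_loss(f_pos, f_neg, t):
    """Nac loss of Eq. (5): the peak-response margin inside the true lesion
    region R, clipped to (-inf, 1] and then negated."""
    R = t > 0
    margin = float(np.max(f_pos[R])) - float(np.max(f_neg[R]))
    return -min(1.0, margin)

t = np.zeros((4, 4)); t[1:3, 1:3] = 1.0        # true lesion region R
f_pos = np.zeros((4, 4)); f_pos[1, 1] = 2.0    # strong response on positive
f_neg = np.zeros((4, 4)); f_neg[2, 2] = 0.5    # weak response on negative
print(nac_loss(f_pos, f_neg, t))   # margin 1.5 is clipped to 1 -> loss -1.0
```

When the ordering reverses (the negative image responds more strongly inside \(R\) than the positive one), the margin becomes negative and the loss becomes positive, penalizing exactly the sign change discussed above.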

Combination of three losses

The Dice, Necmos, and Nac losses are combined to form the proposed loss function. A variable transformation is needed before the combination because the Dice loss was defined for \(f\left(\mathbf{x}\right)\in [0,1]\), whereas the Necmos and Nac losses were defined for \(f_+\left(\mathbf{x}\right)\in [0,+\infty )\) and \(f_-\left(\mathbf{x}\right)\in \left[0,+\infty \right)\). A simple transformation \(f\to \frac{f}{1+f}\) is therefore performed before the Dice loss calculation. The final loss function is

$$\begin{aligned}l=&\,\lambda_{\mathrm{Dice}}\cdot \mathcal{L}_{\mathrm{Dice}}\left(\frac{f_+}{1+f_+},t\right)\\&+\lambda_{\mathrm{Necmos}}\cdot \mathcal{L}_{\mathrm{Necmos}}\left(f_-\right)+\lambda_{\mathrm{Nac}}\cdot \mathcal{L}_{\mathrm{Nac}}\left(f_+,f_-,t\right)\end{aligned}$$

(6)

where \(\lambda_{\mathrm{Dice}}\), \(\lambda_{\mathrm{Necmos}}\), and \(\lambda_{\mathrm{Nac}}\) are hyperparameters yet to be determined.
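Assuming illustrative NumPy versions of the three component losses (not the training implementation), Eq. (6) with the squashing transform could be sketched as:

```python
import numpy as np

def dice_loss(f, t, eps=1.0):
    # Eq. (1): f in [0, 1] per pixel, t a binary label map.
    return -(2.0 * float(np.sum(f * t)) + eps) / (float(np.sum(f)) + float(np.sum(t)) + eps)

def necmos_loss(f_neg):
    # Eq. (4): largest response on the lesion-free image.
    return float(np.max(f_neg))

def nac_loss(f_pos, f_neg, t):
    # Eq. (5): clipped, negated peak-response margin inside R.
    R = t > 0
    return -min(1.0, float(np.max(f_pos[R])) - float(np.max(f_neg[R])))

def total_loss(f_pos, f_neg, t, lam_dice, lam_necmos, lam_nac):
    """Eq. (6). f_pos and f_neg take values in [0, +inf); the squashing
    f -> f / (1 + f) maps the positive output into [0, 1) for the Dice term."""
    squashed = f_pos / (1.0 + f_pos)
    return (lam_dice * dice_loss(squashed, t)
            + lam_necmos * necmos_loss(f_neg)
            + lam_nac * nac_loss(f_pos, f_neg, t))
```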

A U-Net with a deep residual network (ResNet) substructure [23] and nine multiresolution layers was used in this study. A max-pooling layer with a stride of 2 was inserted between each adjacent pair of encoding layers, whereas an up-sampling layer was inserted between each adjacent pair of decoding layers. A shortcut path was also placed between each pair of corresponding encoding and decoding layers. The input and output image size was 1024 × 1024 pixels. Flatten and fully connected layers were inserted at the bottom of the network. The final output values were passed through a ReLU function so that the output range of each pixel was \([0,+\infty )\).

Figure 2 illustrates the entire system of the proposed losses. During training, the positive and negative images are input into the same U-Net, and the resulting likelihood maps are \(f_+(\mathbf{x})\) and \(f_-(\mathbf{x})\), respectively. The former is used to calculate the Dice loss, and the latter the Necmos loss. In addition, both maps are used to calculate the Nac loss (Fig. 3).

Fig. 3 Outline of the LOCO fine-tuning and testing phases

Evaluation

This retrospective study was approved by our institutional review board.

Training

For training, the positive and corresponding negative images were input individually into the same U-Net, and the resulting outputs were fed into our loss function (Eq. 6). The U-Net was then trained using backpropagation and the Adam optimizer [24]. A computational node of the Wisteria/BDEC-01 supercomputer system (equipped with eight NVIDIA A100 GPUs, each with 40 GB of memory) was used for training. Each model was trained for 48 h.

Of the 131,072 artificially generated cases, 128 were used for internal validation. This internal validation set was used to determine the optimal epoch over the entire training phase. The criterion for choosing the best epoch was the R-CPM [18], i.e., the average of the four sensitivities at 1/8, 1/4, 1/2, and 1 FPs per case. The remaining 130,944 cases were used for training. Each candidate was classified as a true positive if the intersection over union (IoU) between the candidate and ground-truth regions was at least 0.1.
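The IoU criterion and the R-CPM average can be sketched as follows (illustrative helpers, not the evaluation code used in the study):

```python
import numpy as np

def iou(candidate, truth):
    """Intersection over union of two binary masks."""
    inter = np.logical_and(candidate, truth).sum()
    union = np.logical_or(candidate, truth).sum()
    return float(inter) / float(union) if union else 0.0

def r_cpm(sens_at_fp_rate):
    """R-CPM: mean of the sensitivities at 1/8, 1/4, 1/2, and 1 FPs per case,
    given as a mapping {fp_rate: sensitivity}."""
    return float(np.mean([sens_at_fp_rate[r] for r in (0.125, 0.25, 0.5, 1.0)]))
```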

In addition to training with the purely synthesized dataset, we attempted to fine-tune the trained U-Net model with the JSRT dataset [15] through the leave-one-case-out (LOCO) method. Starting from the model pretrained with our synthesized dataset, fine-tuning was performed using LOCO for five epochs. Only the Dice loss was used for fine-tuning, because the JSRT dataset does not contain corresponding negative images. In other words, although the input of the initial training phase was a pair of positive and negative images, the input for LOCO fine-tuning was a single image (with or without a lesion).

Testing

Because our proposed cost function was used only during training, no corresponding negative image was needed at test time. Therefore, the trained model could be applied to a single chest radiograph as its only input.

For the external evaluation, the JSRT dataset [15], in which the presence or absence of nodules was confirmed by CT, was used. All positive and negative cases were included in this study. FROC and ROC analyses were performed. For the latter, each case was judged only as positive (i.e., at least one lesion was present/detected) or negative (i.e., no lesion was present/detected in the image). The R-CPM and the sensitivity at 0.2 FPs per case were used as metrics in the FROC-based analysis, whereas the area under the ROC curve (AUROC) and the sensitivity at specificity = 0.8 were calculated in the ROC-based analysis. These metrics were compared across different settings of (\(\lambda_{\mathrm{Dice}}\), \(\lambda_{\mathrm{Necmos}}\), \(\lambda_{\mathrm{Nac}}\)) and also with the results of RetinaNet [25] and SSD [26]. When RetinaNet and SSD were trained, only artificial positive images were used because their results outperformed those obtained with both positive and negative images.

Turing test

We also undertook a Turing test in which three radiologists checked whether a given image (with a nodule) was real or fabricated (i.e., created with our proposed method). One hundred real chest radiographs with nodules from the NIH chest dataset [22] and 100 radiographs with created nodules were mixed and shuffled. They were then shown to the three radiologists, who individually judged each image as real or fabricated. Each radiologist's sensitivity and specificity were evaluated from the results.
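For illustration, each reader's sensitivity and specificity could be computed as below; treating "fabricated" as the positive class is an assumption, as the text does not specify which class was taken as positive:

```python
def turing_metrics(is_fabricated, judged_fabricated):
    """Per-reader metrics for the Turing test. Both arguments are lists of
    booleans; 'fabricated' is treated as the positive class here (an
    assumption made for illustration)."""
    tp = sum(1 for y, p in zip(is_fabricated, judged_fabricated) if y and p)
    tn = sum(1 for y, p in zip(is_fabricated, judged_fabricated) if not y and not p)
    n_pos = sum(is_fabricated)
    n_neg = len(is_fabricated) - n_pos
    sensitivity = tp / n_pos  # fabricated images correctly flagged
    specificity = tn / n_neg  # real images correctly accepted as real
    return sensitivity, specificity
```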
