Our database consists of lung CT-screening images obtained from 1500 examinees at Medical Imaging Clinic (Toyonaka, Osaka, Japan). It includes 1200 normal cases and 300 abnormal cases (147 lung cancers, 60 emphysemas, 49 pneumonias, and 44 pneumothoraxes). The image size is 512 × 512 pixels, and the number of slices per case is 72–124. The pixel size and the slice thickness are 0.55–0.86 mm and 3.75 mm, respectively. These images are resized to 256 × 256 pixels using bicubic interpolation, and the pixel values are rescaled to [0, 1] by min-max scaling.
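A minimal preprocessing sketch consistent with this description is shown below. The use of OpenCV for bicubic resizing and the volume-wise min-max scaling are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
import cv2  # assumption: OpenCV is used here only to illustrate bicubic resizing


def preprocess_volume(volume: np.ndarray) -> np.ndarray:
    """Resize each 512 x 512 slice to 256 x 256 (bicubic) and min-max scale to [0, 1]."""
    resized = np.stack(
        [cv2.resize(s, (256, 256), interpolation=cv2.INTER_CUBIC) for s in volume]
    ).astype(np.float32)
    # Min-max scaling of the CT values to [0, 1].
    v_min, v_max = resized.min(), resized.max()
    return (resized - v_min) / (v_max - v_min + 1e-8)
```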
The proposed network is trained and tested using sixfold cross validation. The 1200 normal cases are randomly divided into six subsets of 200 cases each. In each fold, four subsets are used as the training dataset, one subset as the validation dataset, and the remaining subset as the test dataset. The proposed network is trained and evaluated six times so that each of the six subsets is used once as the test dataset. Note that all 300 abnormal cases are included in the test dataset in every fold.
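The split can be sketched as follows; the random seed and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)              # illustrative seed
normal_ids = rng.permutation(1200)          # 1200 normal cases
subsets = np.array_split(normal_ids, 6)     # six subsets of 200 cases each

for fold in range(6):
    test_ids = subsets[fold]                          # one subset for testing
    val_ids = subsets[(fold + 1) % 6]                 # one subset for validation
    train_ids = np.concatenate(
        [subsets[i] for i in range(6) if i not in (fold, (fold + 1) % 6)]
    )                                                 # remaining four subsets for training
    # All 300 abnormal cases are added to the test set in every fold.
```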
The proposed method is developed and evaluated using PyTorch on a workstation (central processing unit: Intel Core i9-10900X processor, random-access memory: 128 GB, and graphics processing unit: NVIDIA RTX 3090).
2.2 VQ-VAE with SVDD
In this study, VQ-VAE with SVDD for anomaly detection is constructed by introducing SVDD into VQ-VAE. Figure 1 shows the architecture of VQ-VAE with SVDD. The original VQ-VAE consists of Encoder 1, Decoder 1, and the Embedding Space. VQ-VAE can generate an accurate reconstructed image by replacing the continuous latent space of VAE with a discrete embedding space. However, VQ-VAE is an image-generation model, not an anomaly-detection model. To apply VQ-VAE to anomaly detection, SVDD, which maps the normal latent variables into a hypersphere that is as small as possible in the latent space, is introduced into VQ-VAE. In VQ-VAE with SVDD, Encoder 2 and Decoder 2 are added between Encoder 1 and Decoder 1 of VQ-VAE. SVDD generally employs kernel functions to project the normal latent variables onto the latent space. In this study, the learnable Encoder 2 is used instead of a kernel function to make the projective transformation more suitable for the training data [20, 21].
Fig. 1 Architecture of VQ-VAE with SVDD
Figures 2 and 3 illustrate the architectures of Encoders 1 and 2 and Decoders 1 and 2. The characters f, k, and s in parentheses represent the number of filters, the kernel size, and the stride, respectively. The character in_f in the residual block is the number of filters input to the block. The number of embedding representations in the Embedding Space is set to 128, and the dimension of each representation is also set to 128. Encoder 1 compresses an input image \(x\) to the latent-variable map \(z_{e_{1}}(x)\in {\mathbb{R}}^{H\times W\times D}\), where \(H\) and \(W\) are the height and width of the latent-variable map and \(D\) is the number of channels. Encoder 2 further compresses the dimensionality of \(z_{e_{1}}(x)\) to the latent variables \(z(x)\) while mapping them into a hypersphere that is as small as possible in the latent space. Decoder 2 then up-samples \(z(x)\) to the latent-variable map \(z_{d_{2}}(x)\) with the same size as \(z_{e_{1}}(x)\). A new latent-variable map \(z_{q}(x)\) is obtained by replacing the latent variables in \(z_{d_{2}}(x)\) with the embedding representations \(e_{k}\) in the Embedding Space. In this replacement, defined by Eq. (1), the distances between the channel-direction vector at each pixel in \(z_{d_{2}}(x)\) and all embedding representations are first determined; the closest embedding representation is then selected and placed at the corresponding pixel of \(z_{q}(x)\).
Fig. 2 Architectures of Encoders 1 and 2
Fig. 3 Architectures of Decoders 1 and 2
$$z_{q}\left(x\right)=e_{k},\quad \mathrm{where}\quad k=\mathop{\mathrm{argmin}}_{j}{\Vert z_{d_{2}}\left(x\right)-e_{j}\Vert}_{2}$$
(1)
Decoder 1 reconstructs the image \(\hat{x}\) from \(z_{q}\left(x\right)\).
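The replacement in Eq. (1) amounts to a nearest-neighbour lookup in the Embedding Space. A minimal PyTorch sketch is given below; the tensor layout (channels-first maps, a codebook of shape K × D) is an assumption based on the standard VQ-VAE formulation.

```python
import torch


def quantize(z_d2: torch.Tensor, embeddings: torch.Tensor) -> torch.Tensor:
    """Replace each D-dimensional pixel vector of z_d2 (B, D, H, W) with its
    nearest embedding e_k from the codebook (K, D), as in Eq. (1)."""
    B, D, H, W = z_d2.shape
    flat = z_d2.permute(0, 2, 3, 1).reshape(-1, D)   # (B*H*W, D) pixel vectors
    dist = torch.cdist(flat, embeddings)             # distances to all e_j
    k = dist.argmin(dim=1)                           # index of the closest e_k
    z_q = embeddings[k].reshape(B, H, W, D).permute(0, 3, 1, 2)
    return z_q
```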
2.3 Training of VQ-VAE with SVDD
VQ-VAE with SVDD is trained in the following three steps:
Step 1: The embedding space must be trained to reflect the features of the latent variables created from normal CT images. Therefore, only the VQ-VAE part, consisting of Encoder 1, Decoder 1, and the Embedding Space, is first trained with the loss function of Eq. (2). Note that, in this step, \(z_{e_{1}}\left(x\right)\) in Fig. 1 does not pass through Encoder 2 and Decoder 2 and is used directly as \(z_{d_{2}}\left(x\right)\).
$${\mathcal{L}}_{1}={\Vert x-\hat{x}\Vert}_{2}^{2}+\left(1-MS\_SSIM\left(x,\hat{x}\right)\right)+{\Vert sg\left[z_{e_{1}}\left(x\right)\right]-e\Vert}_{2}^{2}+\lambda{\Vert z_{e_{1}}\left(x\right)-sg\left[e\right]\Vert}_{2}^{2}$$
(2)
The first term represents the L2 norm between the input image \(x\) and the reconstructed image \(\hat{x}\). MS-SSIM (multi-scale structural similarity) [22, 23] in the second term measures the similarity of brightness, contrast, and structure between two images. MS-SSIM is defined by Eq. (3).
$$MS\_SSIM\left(x,\hat{x}\right)={\left[L\left(x,\hat{x}\right)\right]}^{\alpha}\cdot\prod_{j=1}^{M}{\left[C_{j}\left(x,\hat{x}\right)\right]}^{\beta}\cdot{\left[S_{j}\left(x,\hat{x}\right)\right]}^{\gamma}$$
(3)
The terms for brightness \(L\left(x,\hat{x}\right)\), contrast \(C\left(x,\hat{x}\right)\), and structure \(S\left(x,\hat{x}\right)\) are defined by the following equations:
$$L\left(x,\hat{x}\right)=\frac{2\mu_{x}\mu_{\hat{x}}+C_{1}}{\mu_{x}^{2}+\mu_{\hat{x}}^{2}+C_{1}},\quad C\left(x,\hat{x}\right)=\frac{2\sigma_{x}\sigma_{\hat{x}}+C_{2}}{\sigma_{x}^{2}+\sigma_{\hat{x}}^{2}+C_{2}},\quad S\left(x,\hat{x}\right)=\frac{\sigma_{x\hat{x}}+C_{3}}{\sigma_{x}\sigma_{\hat{x}}+C_{3}}$$
(4)
\(j\) indicates the number of times \(x\) and \(\hat{x}\) are downsampled by a factor of 2. The contrast \(C_{j}(x,\hat{x})\) and structure \(S_{j}(x,\hat{x})\) are determined at the different resolutions \(j\). \(\mu_{x}\), \(\mu_{\hat{x}}\), \(\sigma_{x}\), \(\sigma_{\hat{x}}\), and \(\sigma_{x\hat{x}}\) are the means, the standard deviations, and the mutual covariance of \(x\) and \(\hat{x}\), whereas \(C_{1}\), \(C_{2}\), and \(C_{3}\) are normalization constants. In this study, \(C_{1}\), \(C_{2}\), and \(C_{3}\) are set to 0.0001, 0.0009, and 0.00045, respectively. \(\alpha\), \(\beta\), and \(\gamma\) are set to 1, whereas \(M\) is set to 5. In the third and fourth terms of Eq. (2), sg (stop gradient) is an operator whose argument is excluded from gradient computation in error backpropagation. When each latent variable in \(z_{e_{1}}\left(x\right)\) is replaced with an embedding representation \(e_{k}\), the distances between the latent variables and the embedding representations are determined in the embedding space; this process is not differentiable. The third term updates the embedding representations \(e_{k}\) toward \(z_{e_{1}}\left(x\right)\); note that the gradient of this term is not propagated to Encoder 1. The fourth term contributes to updating Encoder 1 so that \(z_{e_{1}}\left(x\right)\) approaches \(e_{k}\); note that the gradient of this term is propagated to Encoder 1. The coefficient \(\lambda\) is set to 0.25 empirically.
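A minimal sketch of the Step 1 loss of Eq. (2) is shown below. It assumes the third-party `pytorch_msssim` package for MS-SSIM and writes the squared L2 norms as means; `e` denotes the map of selected embedding representations with the same shape as \(z_{e_{1}}(x)\). The stop-gradient operator sg is realized with `detach()`.

```python
import torch
from pytorch_msssim import ms_ssim  # assumed third-party MS-SSIM implementation


def step1_loss(x, x_hat, z_e1, e, lam=0.25):
    """Eq. (2): reconstruction + (1 - MS-SSIM) + codebook loss + commitment loss."""
    recon = torch.mean((x - x_hat) ** 2)
    perceptual = 1.0 - ms_ssim(x, x_hat, data_range=1.0)
    codebook = torch.mean((z_e1.detach() - e) ** 2)   # sg[z_e1(x)]: updates only e
    commit = torch.mean((z_e1 - e.detach()) ** 2)     # sg[e]: updates only Encoder 1
    return recon + perceptual + codebook + lam * commit
```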
Step 2: To determine an initial hypersphere containing only the normal latent variables extracted from normal CT images, the proposed network without SVDD is trained using the loss function of Eq. (5). The initial weights of Encoder 1, the Embedding Space, and Decoder 1 are given by those learned in Step 1. Note that, in this step, Encoder 2 is not yet trained to map the normal latent variables into a hypersphere that is as small as possible.
$${\mathcal{L}}_{2}={\Vert x-\hat{x}\Vert}_{2}^{2}+\left(1-MS\_SSIM\left(x,\hat{x}\right)\right)+{\Vert sg\left[z_{e_{1}}\left(x\right)\right]-e\Vert}_{2}^{2}+\lambda{\Vert z_{e_{1}}\left(x\right)-sg\left[e\right]\Vert}_{2}^{2}+{\Vert z_{e_{1}}\left(x\right)-z_{d_{2}}\left(x\right)\Vert}_{2}^{2}$$
(5)
The first to fourth terms correspond to those in Eq. (2). The fifth term is the L2 norm between the latent-variable map \(z_{e_{1}}\left(x\right)\) input to Encoder 2 and the latent-variable map \(z_{d_{2}}\left(x\right)\) output from Decoder 2. The center of the initial hypersphere for SVDD is then determined as the average of the latent variables \(z(x)\) output from Encoder 2 of the trained network without SVDD.
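The initial center can be obtained by a single pass over the normal training images; the following is a sketch in which the encoder interface and data loader are illustrative assumptions.

```python
import torch


@torch.no_grad()
def init_center(encoder1, encoder2, train_loader, device):
    """Average the latent variables z(x) over all normal training images (Step 2)."""
    total, count = None, 0
    for x in train_loader:
        z = encoder2(encoder1(x.to(device)))      # z(x) for one mini-batch
        z = z.flatten(start_dim=1)                # one latent vector per image
        total = z.sum(dim=0) if total is None else total + z.sum(dim=0)
        count += z.shape[0]
    return total / count                          # center c of the initial hypersphere
```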
Step 3: The initial weights of VQ-VAE with SVDD are given by those learned in Step 2. The entire proposed network, including SVDD, is trained using the loss function of Eq. (6). The center of the hypersphere is updated every epoch using the average of the latent variables \(z\left(x\right)\) output from Encoder 2.
$${\mathcal{L}}_{3}={\Vert x-\hat{x}\Vert}_{2}^{2}+\left(1-MS\_SSIM\left(x,\hat{x}\right)\right)+{\Vert sg\left[z_{e_{1}}\left(x\right)\right]-e\Vert}_{2}^{2}+\lambda{\Vert z_{e_{1}}\left(x\right)-sg\left[e\right]\Vert}_{2}^{2}+{\Vert z_{e_{1}}\left(x\right)-z_{d_{2}}\left(x\right)\Vert}_{2}^{2}+{\Vert z\left(x\right)-c\Vert}_{2}^{2}$$
(6)
The first to fifth terms correspond to those in Eq. (5). The sixth term contributes to updating Encoder 2 so that the latent variables \(z(x)\) move close to the center \(c\) of the hypersphere.
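For completeness, the Step 3 loss of Eq. (6) could be written as below; this is a sketch under the same assumptions as the Step 1 sketch (squared L2 terms written as means, `pytorch_msssim` as the MS-SSIM implementation), not the authors' code.

```python
import torch
from pytorch_msssim import ms_ssim  # assumed MS-SSIM implementation, as above


def step3_loss(x, x_hat, z_e1, z_d2, e, z, c, lam=0.25):
    """Eq. (6): the five terms of Eq. (5) plus the SVDD term pulling z(x) toward c."""
    loss = torch.mean((x - x_hat) ** 2)                       # reconstruction
    loss = loss + (1.0 - ms_ssim(x, x_hat, data_range=1.0))   # 1 - MS-SSIM
    loss = loss + torch.mean((z_e1.detach() - e) ** 2)        # codebook term
    loss = loss + lam * torch.mean((z_e1 - e.detach()) ** 2)  # commitment term
    loss = loss + torch.mean((z_e1 - z_d2) ** 2)              # Encoder 2 / Decoder 2 consistency
    loss = loss + torch.mean((z.flatten(1) - c) ** 2)         # SVDD term: distance to center c
    return loss
```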
The same training dataset, consisting of only normal CT images, is used in Steps 1, 2, and 3. The hyperparameters for training are a batch size of 64, a learning rate of \(1\times {10}^{-4}\), and a maximum of 200 epochs. In all steps, Adam is employed to optimize the network weights.
2.4 Anomaly score and evaluation indices
The anomaly score \(S(x)\) for each slice image is defined as the combination of the difference between the input image \(x\) and the reconstructed image \(\hat{x}\) and the distance between the latent variables \(z\left(x\right)\) and the center \(c\) of the hypersphere.
$$S\left(x\right)={\Vert x-\hat{x}\Vert}_{2}^{2}+{\Vert z\left(x\right)-c\Vert}_{2}^{2}$$
(7)
The maximum anomaly score over all slice images of each examinee is taken as the representative score. If the representative score is higher than a threshold value, the examinee is classified as abnormal. Here, the threshold value is set to maximize the Youden index based on a receiver operating characteristic (ROC) analysis of all the representative scores. The classification accuracy, sensitivity, and specificity are also evaluated based on the representative scores.
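The per-examinee decision can be sketched as follows, using scikit-learn's ROC utilities; the variable names and the grouping of slice scores per examinee are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import roc_curve


def representative_scores(slice_scores_per_case):
    """Take the maximum slice-level anomaly score S(x) as the case-level score."""
    return np.array([np.max(s) for s in slice_scores_per_case])


def youden_threshold(labels, scores):
    """Threshold maximizing the Youden index (sensitivity + specificity - 1) on the ROC curve."""
    fpr, tpr, thresholds = roc_curve(labels, scores)  # labels: 1 = abnormal, 0 = normal
    return thresholds[np.argmax(tpr - fpr)]
```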
The fidelity of the reconstructed images is compared between normal and abnormal images, as well as between a conventional VAE and the proposed model. In this study, the root mean squared error (RMSE), peak signal-to-noise ratio (PSNR), and structural similarity index measure (SSIM) are employed as evaluation metrics [24, 25].
These evaluation metrics are determined for each subset of the sixfold cross-validation test, and the significance of the differences is assessed using a paired t-test.
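A sketch of how these metrics and the paired t-test could be computed with scikit-image and SciPy is given below; the choice of libraries is an assumption, as the paper does not state which tools were used.

```python
import numpy as np
from scipy.stats import ttest_rel
from skimage.metrics import peak_signal_noise_ratio, structural_similarity


def fidelity_metrics(x, x_hat):
    """RMSE, PSNR, and SSIM between an input slice and its reconstruction in [0, 1]."""
    rmse = np.sqrt(np.mean((x - x_hat) ** 2))
    psnr = peak_signal_noise_ratio(x, x_hat, data_range=1.0)
    ssim = structural_similarity(x, x_hat, data_range=1.0)
    return rmse, psnr, ssim


# Paired t-test over the six cross-validation folds, e.g. SSIM of VAE vs. the proposed model:
# t_stat, p_value = ttest_rel(ssim_vae_per_fold, ssim_proposed_per_fold)
```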