Dose prediction for cervical cancer VMAT patients with a full-scale 3D-cGAN-based model and the comparison of different input data on the prediction results

Patient database

The database collected in this research contained 118 cervical cancer patients treated with VMAT technology from 2017 to 2020 in Shandong Cancer Hospital. The studies involving human participants were reviewed and approved by Ethics Committee of Shandong First Medical University Affiliated Cancer Hospital (Approval ID: SDTHEC2020008005). Written informed consent for participation was not required for this study in accordance with the national legislation and the institutional requirements. The study was conducted in accordance with the principles of the Declaration of Helsinki and in accordance with local statutory requirements. All the raw data were exported from the Varian Eclipse Treatment Planning System (TPS) (Varian Medical Systems, Palo Alto, CA, USA), which included CT data, delineated contour information and clinical dose distribution map of each patient. The CT slices were obtained via scanning by a Siemens SOMATOM Confidence CT simulation positioning machine (Siemens Healthcare, Forchheim, Germany) while the patient was immobilized on a vacuum cushion. The PTV and OARs including the bladder, rectum and femoral heads on both sides, which revealed delineation information, were manually established and reviewed by experienced radiotherapists. The involved clinical dose distribution maps were acquired by iterative optimization according to empirical weight parameters and the radiotherapy plans based on that were approved by senior radiation physicists. The selected cases were all from the same group of radiotherapists, and they used the same delineation guidelines of NRG Oncology/RTOG [39,40,41]. The beam energy was 6 MV and all plans had been delivered and calculated through heterogeneity correction. The prescribed dose for all patients was 54 Gy, using a grading scheme: 180 cGy * 30 fractions. All the plans involved were formulated through three coplanar arcs, and no discrepancies existed in the clinical protocols and treatment criterion for any case.

Dataset preparation and classification

The voxel value in 3D CT images was truncated, easily for standardization and normalization, while preserving valuable information, in the range of − 600 ~ 1000 HU (Hounsfield unit). The interval contained effective information since the pelvic cavity was the research object, a value below − 600 HU was basically air, and the bone information was included. With regard to contour data, masks of the body, femur-head-R, femur-head-L, rectum, bladder, and PTV were assigned various label values of − 400, − 200, 200, 600, 800, and 1000, respectively, which independently existed as binary images with the value of blank area set to -600. The pixel value of the RT Dose data exported from the TPS was converted to the dose value of the voxel through the dose grid scale factor. Taking into account the voxel-to-voxel level based on the establishment of correlation mapping relations, 3D input images mentioned above were all resampled to a space size of 3.5*3.5*3.5 mm3, assuring that the voxels contained the same physical-size information. In addition, all 3D data was cropped to matrix with a size of 128*128*128, considering the burden of model training as well as the preservation of the voxels which could be plenty utilized.

For all patients, 20 cases were randomly selected as the testing set, and the remaining 98 cases were divided into training-validation in a ratio of 6:1, to implement seven fold cross-validation. In order to cater to the original intention of this research, three datasets were formulated through different input data combination schemes, as shown in Table 1. For each dataset, the clinical dose distribution map was set as the first channel, which was used to compute the loss items and train the discriminator. In addition, Dataset-A was composed of data with 8 channels, in which channels 2–8 contained binary masks for the PTV, body and OARs delineation information; Dataset-C only contained two channels of which the second channel was CT images; Dataset-B includes 3 channels based on Dataset-C, where the PTV delineation information was added as a third channel to take into account the radiotherapist’s clinical experience and individualized differences in patients.

Table 1 Data composition of four multi-channel datasetsModel architecture

With the focus of the establishment of correlation mapping relations between multichannel data consisting of anatomical features and medical images mentioned above with the radiation dose distribution at the voxel-voxel level, the network architecture adopted in this study mainly combined the two deep learning ideas of cGAN and full-scale fusion of feature information Encoding-Decoding structures. The overall structure of the model involved in this study which was presented as a 3D-cGAN-based framework, is shown in Fig. 1. The model was composed of a pair of discriminator and generator that confronted and promoted each other in the image generation process.

Fig. 1figure 1

The overall framework and workflow of the generator and discriminator networks: X denotes the stack data of various input data; GT and G(X) denote the 3D clinical and predicted dose distribution respectively

The discriminator used in this study was a 3D-Patch-GAN classifier discerning locally overlapping image blocks instead of a traditional classifier based on the entire image. The convolutional neural network architecture of the discriminator is shown in Fig. 2. The input of the discriminator was a concatenated array of multichannel mask or CT image data and a dose distribution map matrix. Additionally, the synthetically predicted dose distribution or the clinical one were distinguished by identification probability values in [0,1]. Probability values corresponding to overlapping image blocks constituted the 3D probability value matrixes \(_\) for the ground truth and \(_\) for the predicted dose map, and a combination of the binary cross entropy (BCE) and sigmoid function was adopted as the objective function of the discriminator :

$$_=BCE\left\=-\frac\sum _^\left(\frac^_}}\right) , _\in _$$

(1)

$$_=BCE\=-\frac\sum _^\left(1-\frac^_}}\right) , _\in _$$

(2)

where X denotes the input data and \(_\) or \(_\) denotes the loss items of the discriminator for the ground truth or prediction dose map, respectively. The discriminator’s output probability value \(D\left(X\right|GT)\)=1 represents the concatenated input data containing the clinical plan dose distribution map, while \(D\left(X\right|Pre)\)=0 represents the dose distribution from synthetic prediction, under ideal-trained conditions.

As a full-scale feature connection deep learning network model, UNet3 + was originally proposed for medical image segmentation [33]. However, in this study, the novel framework was modified adaptively to carry out the dose distribution prediction based on the multichannel input and play the role of the generator of the cGAN model. The network architecture is shown in Fig. 3. \(_^\)and \(_^\) represent steps of different levels of encoding and decoding, respectively, where a larger index i represents a higher level. At the beginning of the network, three convolutional layers with a convolution kernel size of 3*3*3 and stride of 1 were applied to extract image features, compared with a 7*7*7 convolution kernel, increasing the depth of the network to improve the nonlinear expression ability as well as to establish fewer parameters, which integrated 32 feature maps initially. In the subsequent encoding part of the network, the max-pooling layers preset were substituted by convolutional layers to avoid loss of feature information pertaining to location and intensity. The image was scaled to 1/8 of the original size as four downsamplings were performed during encoding. Between each two down-samplings, two convolution layers with a kernel size of 3*3*3 and stride of 1 were implemented. Furthermore, the number of feature maps was doubled, as conventional practice, each time downsampling was performed and the number of feature maps at each level of encoding step was 64, 128, 256, 512 respectively. As for decoding network, upsamplings instead of transposed convolutions were implemented in each step followed by convolution layers in order to prevent the checkerboard effect resulting from uneven overlap for the predicted dose distribution maps. The input of each step of the decoding network, apart from the decoding layer \(_^\), was associated with the output of the corresponding or low-level encoding and decoding steps to achieve the fusion of the full-scale features of delineation masks, CT images and clinical dose maps. The output of the encoding steps or the low-level decoding steps was upsampled or downsampled combined with the convolutional layers to keep the size of the feature maps consistent. Additionally, the number of multiscale feature maps to be concatenated was set to 32 equally. Since the encoding and decoding networks were all composed of 5 steps, 160 feature maps were separately fed to each decoding step. Considering the accuracy and stability of the generator for dose prediction, an objective function composed of three loss terms was proposed. First, for the purpose of joining the generator and the discriminator based on the overall structure of the cGAN, the adversarial loss term of the generator with the same form as the discriminator was adopted:

$$_=BCE\left\=-\frac\sum _^\left(\frac^_}}\right) , _\in _$$

(4)

where \(G\left(X\right)\) denotes the dose map predicted by the generator based on the input delineation or CT data, \(D\left(X|G\left(X\right)\right)\) denotes the identification probability value of the output result of the generator based on each overlapping image block by the discriminator, \(_\) represents a matrix constituted of probability values, and \(_\) denotes the adversarial loss term of the generator. Second, the adjacent voxels difference loss term (AVD-loss) was adopted:

$$_=\sum _^_-1}\left\_-_)-(_-_))}^\right\}+\sum _^_-1}\left\_-_)-(_-_))}^\right\}+\sum _^_-1}\left\_-_)-(_-_))}^\right\}$$

(5)

where \(_\), \(_\), and\(_\) denote the numbers of transverse, coronal and sagittal slices of the 3D dose distribution image, respectively. \(_\), \(_\), and \(_\) denote the dose value matrixes of certain slices of the three sections above for the clinical dose distribution map, and \(_\), \(_\), and \(_\) denote the predicted dose distribution maps of the generator. \(_\) denotes the loss term which attempts to keep the edge features by minimizing the adjacent voxel dose value difference. The AVD-loss also removes the blur problem in the predicted dose, and maintains the sharpness of the image. Finally, the \(_\) distance loss term was added to the optimization process of the generator:

where GT denotes the clinical dose, G(X) denotes the predicted dose, and \(_\) denotes the \(_\) loss term. Therefore, the total objective function of the generator can be obtained as:

$$L_ = \lambda _} \cdot L_} + \lambda _} \cdot L_} + \lambda _} \cdot L_}$$

(7)

where \(_\), \(_\) and \(_\) denote the weighting parameters for balancing the adversarial loss term, the adjacent voxels dose difference loss term and the L1 loss term respectively.

Fig. 2figure 2

The architecture of the discriminator contains a total of 3 convolution layers with a stride of 2 and two convolution layers with a stride of 1, as well as the related normalization layer and activation layer. Each element in the 3D probability value matrixes has a 70*70*70 receptive field

Fig. 3figure 3

The proposed encoding-decoding generator architecture improved by UNet3+. \(_^\)and \(_^\) represent the encoding and decoding steps respectively. The size and the number of channels are marked below the feature maps. The orange and blue arrows represent the convolution block and the downsampling blocks, and the straight lines with arrows in the figure represent the full-scale feature connection and fusion operation

The generator and the discriminator were trained by turns so that the capability of both could be mutually promoted in the adversarial process. For the discriminator, batch normalization and leaky rectified linear unit (Leaky ReLU) layers were adopted after each convolutional layer, while batch normalization and rectified linear unit (ReLU) layers were adopted for the generators. In particular, the hyperbolic tangent was used as activation function at the end of the generator. The total number of parameters of the involved generator and discriminator models were 11.077 M and 29.026 M respectively. For the whole deep learning network, Kaiming initialization was adopted. The Adam solver with high computational efficiency and stability was selected as an optimizer, in which the momentum parameters were set to \(_=\)0.5 and \(_=\)0.999. The initial learning rate of 0–200 epochs was set to 0.002, and attenuated linearly to 0 in 200–500 epochs. Two NVIDIA Tesla V100S GPUs were used for model training and testing in this study and the memory of each graphics card was 32GB. Considering to maximize the memory utilization of the graphics card, the batch size was set to 4. In addition, to increase the data sample and to prevent over-fitting, data augmentation was performed on the training set. Specifically, each input data was firstly randomly cropped as a 96*96*96 dose array and then randomly rotated by 90°, 180° or 270° in the transverse planes. The model training results of all epochs were recorded to research the impact of the training epoch on the model performance and to choose the optimal epoch to test.

Performance analysis

Multiple methods were adopted to evaluate the proposed dose prediction methods. Based on this, the results obtained from the three different input datasets mentioned above were compared. First, not only the dose difference statistical histogram but also the mean absolute error (MAE), \(MAE=_^\left|_-_\right|\), were calculated to compare the clinical and predicted dose distribution maps voxel by voxel, where i denotes the index of the voxel, and m denotes the total number of voxels. Second, the dose-volume histogram (DVH) curve was obtained to intuitively evaluate the consistency of the PTV and OARs dose distribution between the clinical and the dose maps predicted from three different datasets. Third, to specifically evaluate the clinical indicators of the clinical and predicted dose, the dosimetry indexes (DI) of the PTV or OARs were adopted, which included \(_\) and \(_\), where \(_\) denotes the minimum dose received by volume in percentage of p, and \(_\) denotes the maximum volume percentage receiving dose d. With respect to PTV, we calculated \(_\), \(_\), \(_\), \(_\), and \(_\). In addition, we also used the homogeneity index (HI) and conformity index (CI):

$$CI = \frac} ^ }} \cdot V_ }}$$

(9)

where \(_\) denotes the volume of the PTV, \(_\) represents the area covered by the prescribed dose, and \(_\) denotes the PTV area covered by the prescribed dose. Furthermore, \(_\) and \(_\), which denote the maximum and average dose respectively, are taken into consideration for OARs. In addition, the \(_\) of the bladder and rectum were considered. The paired-sample t-test was adopted, in which the threshold of statistical significance was set to P < 0.05, to statistically analyze the results of dose prediction. The Dice similarity coefficients (DSC) between the 3D isodose volumes of the clinical and predicted dose distribution images were calculated:

$$DSC\left(GT,Pre\right)=\frac__ }\bigcap __}|}__ }\right|+\left|__ }\right|}$$

(10)

where \(__ }\) denotes the clinical isodose volume and \(__}\) denotes the predicted isodose volume. Using the same deep learning architecture, Model-A, Model-B and Model-C were trained based on different input data of Dataset-A, Dataset-B and Dataset-C, respectively. In this study, all predicted dose distribution results based on three different models were compared with clinical dose maps in the same way. Finally, the method of gamma analysis was adopted to further evaluate the results of the dose distributions predicted by different models at the voxel level. The gamma passing rate (GPR) were calculated for PTV, OARs and the global, where the dose difference and distance-to-agreement (DTA) criteria were set to 3% and 3 mm respectively, with a dose threshold of 10% of the prescribed value. This facilitated us to simultaneously evaluate the location and dose value differences in 3D space.

留言 (0)

沒有登入
gif