Automatic detection and recognition of nasopharynx gross tumour volume (GTVnx) by deep learning for nasopharyngeal cancer radiotherapy through magnetic resonance imaging

Data collection and GTVnx contours

We retrospectively collected 2088 contrast-enhanced T1-weighted turbo spin echo MR images (512 × 512, axial slice thickness 5 mm) from 200 patients treated at The First Affiliated Hospital of Wenzhou Medical University between January 2020 and December 2021. All images were acquired on 3.0 T MRI scanners (GE Signa HDxt, USA, or Philips Achieva, the Netherlands). The final diagnosis of nasopharyngeal cancer was confirmed by pathology and immunohistochemical analysis. The macroscopic GTVnx contours were manually delineated by radiation oncologists with more than 10 years of clinical experience. To guarantee delineation accuracy, all GTVnx contours were reviewed and corrected by more senior radiation oncologists together with two other radiation experts. Once the entire cohort (2088 images from 200 patients) had been manually contoured, we randomly split the data into a training-validation set and a testing set following a 90%-10% rule. To keep the testing set independent of the training data, the 90%-10% random split was performed over the 200 patients rather than over the 2088 individual images. Splitting by patient prevents images from the same patient appearing in both the training-validation and testing sets, which would otherwise lead to overfitting and artificially high delineation accuracy.
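To make the patient-level split explicit, the following Python sketch illustrates one way such a 90%-10% split over patients could be implemented; the patient identifiers, random seed and helper function are illustrative placeholders, not the code used in this study.

```python
# Minimal sketch of a patient-level 90%/10% split as described above.
# `patient_ids` and `slices_by_patient` are hypothetical placeholders for
# the 200 patients and their 2088 MR slices.
import random

random.seed(42)                                   # fixed seed for a reproducible split
patient_ids = [f"patient_{i:03d}" for i in range(200)]
random.shuffle(patient_ids)

n_test = int(0.10 * len(patient_ids))             # 10% of patients form the test set
test_patients = set(patient_ids[:n_test])
train_val_patients = set(patient_ids[n_test:])

def split_slices(slices_by_patient):
    """Assign every slice to a split according to its patient, so no patient
    contributes images to both training/validation and testing."""
    train_val, test = [], []
    for pid, slices in slices_by_patient.items():
        (test if pid in test_patients else train_val).extend(slices)
    return train_val, test
```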

The proposed ACNC framework

Figure 1 shows the framework of our proposed Automatic Contour of Nasopharynx Cancer (ACNC) system. We first collect the nasopharynx cancer MRI dataset and manually annotate the tumour contours, generating a ground truth mask image for each raw image. The raw images are then fed into the training system for a specified number of epochs under the guidance of the ground truth masks, searching for the optimized weights of the training system. Our training system includes three deep neural networks: FCN8s, U-Net and DeepLabv3. Once the models are trained on the 180 training patients under the guidance of the corresponding ground truth, we feed the trained system with the remaining 20 patients to test its performance. As shown in Fig. 1, each raw image passes through the trained networks and the system classifies every pixel into a binary label, where "1" denotes tumour and "0" denotes background. Finally, the one-dimensional binary classification array is reshaped to the original size of the raw input image, producing the predicted tumour contour locations. To better visualize the prediction results, we merge the raw image, the ground truth mask and the predicted result into one image, with the ground truth in red and the predicted result in green.

Fig. 1

Automatic contour of nasopharynx cancer framework
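As a rough illustration of the last two steps of the framework (reshaping the per-pixel binary predictions back to the 512 × 512 image grid, then overlaying the ground truth in red and the prediction in green), the following NumPy/OpenCV sketch shows one possible implementation; all function and variable names are hypothetical and not taken from the ACNC code.

```python
# Sketch of reshaping a flattened binary prediction and merging it with the
# raw slice and ground truth mask for visual inspection.
import numpy as np
import cv2

def overlay_contours(raw_gray, gt_mask, pred_flat):
    """raw_gray: (512, 512) uint8 slice; gt_mask: (512, 512) binary array;
    pred_flat: flattened per-pixel binary prediction of length 512*512."""
    pred_mask = pred_flat.reshape(raw_gray.shape).astype(np.uint8)
    merged = cv2.cvtColor(raw_gray, cv2.COLOR_GRAY2BGR)
    merged[gt_mask == 1] = (0, 0, 255)       # ground truth shown in red (BGR)
    merged[pred_mask == 1] = (0, 255, 0)     # predicted tumour shown in green (BGR)
    return merged
```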

FCN8s

The Fully Convolutional Network (FCN) [14] was proposed for semantic image segmentation in 2015. Compared with a traditional Convolutional Neural Network (CNN), the fully convolutional network replaces the fully connected layers with convolutional layers. Our backbone uses the VGG19 architecture, which starts with two blocks of two convolution layers each followed by one max pooling operation, and then three blocks of four convolution layers each followed by one max pooling operation. All convolutional outputs are activated by the ReLU activation function. The network architecture is very similar to the VGG16 architecture shown in Supplementary Fig. S1.
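The following PyTorch sketch illustrates the core fully convolutional idea described above, with a VGG19 feature extractor and the fully connected classifier replaced by convolutions. It is a simplified FCN-32s-style fragment under assumed channel sizes, not the exact FCN8s model used in this work (FCN8s additionally fuses pool3 and pool4 skip features before upsampling).

```python
# Sketch: convert a VGG19 classifier into a fully convolutional segmentation head.
import torch
import torch.nn as nn
from torchvision.models import vgg19

class SimpleFCN(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.backbone = vgg19(weights=None).features          # VGG19 conv blocks only
        self.classifier = nn.Sequential(
            nn.Conv2d(512, 4096, kernel_size=7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(4096, 4096, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(4096, num_classes, kernel_size=1),       # per-pixel class scores
        )

    def forward(self, x):
        h, w = x.shape[-2:]
        score = self.classifier(self.backbone(x))
        # Upsample the coarse score map back to the input resolution.
        return nn.functional.interpolate(score, size=(h, w), mode="bilinear",
                                         align_corners=False)
```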

U-Net architecture

U-Net [15], designed mainly for biomedical image segmentation, was proposed in 2015, only two months after FCN. To date, it has been widely applied to biomedical image segmentation due to its simplicity and strong performance. U-Net is well known for its U-shaped architecture, with an encoding path on the left half of the "U" and a decoding path on the right half. The left half of the "U" extracts the feature details of our nasopharyngeal tumour images, and the right half, with skip connections from the left-path feature maps, recovers the location information of each pixel and outputs a tumour-contoured image of the same size as the original. Supplementary Fig. S1 shows the detailed architecture of the left, down-sampling half of the "U", which uses VGG16 as the backbone. The backbone consists of two types of repeating blocks: two convolutional layers followed by one max pooling layer, and three convolutional layers followed by one max pooling layer. For the kernel size, stride and padding details, please refer to Supplementary Fig. S1. The details of the up-sampling on the right half of U-Net are shown in Supplementary Fig. S2; Supplementary Fig. S1 also shows how the max pooling outputs are passed into the up-sampling path. As shown in Supplementary Fig. S2, the numbers in braces refer to the feature layers in Supplementary Fig. S1. For example, (4) and (9) in Fig. 2 are produced by the repeating blocks of two convolutional layers with ReLU activations followed by one max pooling layer, and the 64 and 128 beside (4) and (9) are the output channels after passing through these blocks. Each arrow path obtains its total channel count by concatenating the channels from the two paths. Furthermore, the features from the left path are randomly cropped for feature augmentation.

Fig. 2

Details of the skip connection and concatenation in U-Net
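The skip connection in Fig. 2 can be summarized by the following PyTorch sketch, in which the encoder feature map is cropped to the decoder spatial size and concatenated along the channel dimension before the next convolution block. A centre crop is shown here for simplicity (a random offset would give the random cropping mentioned above), and all names and shapes are illustrative.

```python
# Sketch of a U-Net skip connection: crop the encoder features, then concatenate.
import torch

def concat_skip(decoder_feat, encoder_feat):
    """Centre-crop the encoder features to the decoder size, then concatenate
    along the channel dimension."""
    _, _, h, w = decoder_feat.shape
    _, _, H, W = encoder_feat.shape
    top, left = (H - h) // 2, (W - w) // 2
    cropped = encoder_feat[:, :, top:top + h, left:left + w]
    return torch.cat([cropped, decoder_feat], dim=1)   # channel counts add up, e.g. 64 + 64

# Example: a 64-channel encoder map fused with a 64-channel up-sampled map
enc = torch.randn(1, 64, 136, 136)
dec = torch.randn(1, 64, 128, 128)
fused = concat_skip(dec, enc)                          # -> shape (1, 128, 128, 128)
```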

DeepLabv3

DeepLabv3 [16] was proposed for semantic image segmentation in 2017. Because it builds on the two earlier versions, we briefly introduce them first to give a general picture of the DeepLab architecture, and then highlight the key improvements of the third version. DeepLabv1 improved on the DCNN by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF) to address the poor localization inherent in DCNNs. In DeepLabv2, the VGG-16 backbone was replaced by ResNet, and Atrous Spatial Pyramid Pooling (ASPP) was proposed to handle objects at various scales. Supplementary Fig. S3 and Supplementary Fig. S4 show the ASPP blocks of DeepLabv2 and DeepLabv3 respectively, from which the improvements of the ASPP proposed in DeepLabv3 can be clearly seen. In DeepLabv2, the ASPP block consists of four identical 3 × 3 Conv2d layers. In DeepLabv3, it starts with one 3 × 3 Conv2d block followed by BN and ReLU, then another three 3 × 3 Conv2d blocks each followed by BN and ReLU, and finally a block of AdaptiveAvgPool2d followed by Conv2d, BN, ReLU and bilinear interpolation. Since ASPP is the main improvement of DeepLabv3, please refer to the original paper for the remaining details. In our experiment, however, we use ResNet-101 instead of the Xception backbone mentioned in [16].
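For reference, the following PyTorch sketch shows an ASPP block in the style of the original DeepLabv3 paper, with parallel atrous branches and an image-level pooling branch, each followed by BatchNorm and ReLU. The channel sizes and dilation rates are illustrative assumptions and may differ from the configuration used in our experiments.

```python
# Sketch of a DeepLabv3-style ASPP block.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, in_ch=2048, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        # One 1x1 branch plus three 3x3 atrous branches at increasing rates.
        self.branches = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, bias=False),
                           nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))]
            + [nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False),
                             nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
               for r in rates]
        )
        # Image-level branch: global average pool, 1x1 conv, then bilinear upsample.
        self.image_pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                        nn.Conv2d(in_ch, out_ch, 1, bias=False),
                                        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.project = nn.Sequential(nn.Conv2d(out_ch * 5, out_ch, 1, bias=False),
                                     nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [branch(x) for branch in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=(h, w), mode="bilinear",
                               align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))
```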
