Uncertainty estimation- and attention-based semi-supervised models for automatic delineation of the clinical target volume in CBCT images of breast cancer

Data acquisition

A total of 60 patients with breast cancer who received right-sided breast-conserving therapy in our hospital between February 2017 and September 2023 were collected, comprising 60 planning CT (PCT) and 380 CBCT scans. The CTV labels on CBCT in 52 cases were obtained by deformably registering the CTV labels on the CT images to the CBCT images and were then manually refined by senior clinicians. The CT and CBCT scans of a given patient were assigned exclusively to either the training set or the test set; the specific data distribution is shown in Table 1. Only patients who received whole-breast irradiation were included in this study; patients who received axillary or supraclavicular irradiation were excluded. All patients were female, with ages ranging from 30 to 72 years. Patients were positioned supine with the hands crossed above the head and fixed on a vacuum pad. The PCT images of all patients were acquired on a Siemens CT scanner (Somatom Force, Germany) with a matrix size of 512 × 512, an in-plane resolution of 0.98 mm × 0.98 mm, and a slice thickness of 5 mm. CBCT images were acquired with the XVI system of an Elekta Infinity accelerator (Elekta, Stockholm, Sweden) between 2 and 4 weeks after PCT acquisition. Compared with the standard chest M20 protocol, the fast chest M20 protocol increases the gantry speed from 180 to 360°/min and reduces the number of projections from 720 to 360, which shortens the scanning time and lowers the radiation dose for the patient, but also degrades the image quality to a certain extent [26]. The tube voltage was 120 kV and the tube current was 20 mA. The kV detector panel had a field of view of 42.5 cm × 42.5 cm, a reconstruction matrix of 410 × 410, and a pixel size of 1 mm × 1 mm. The acquired breast CBCT images exhibited truncation. This study was approved by the Medical Ethics Committee of our hospital (#2020KY154-01).

Table 1 Summary of patient characteristics

Contour delineation

To reduce the influence of inter-observer variability in CTV delineation on the network, one oncologist delineated the CTV for all included data according to a unified standard. (1) Upper margin: the upper border of the breast tissue, determined with reference to clinical markers and CT, not exceeding the level of the sternoclavicular joint. (2) Lower margin: the lower border of the breast tissue seen on clinical markers and CT, or the level of the inframammary fold. (3) Internal margin: the inner border of the breast tissue, determined with reference to clinical markers and CT, not extending beyond the parasternal line. (4) External margin: the outer border of the breast tissue visible on clinical markers and CT, or determined with reference to the contralateral breast. (5) Anterior margin: 5 mm below the skin surface (mainly including breast tissue; 3 mm below the skin may be considered if the breast volume is small). (6) Posterior margin: excluding the ribs, intercostal muscles, and pectoralis major muscle. When delineating on CBCT images, the CTV on the CT images was first deformably registered to the CBCT images; the clinician then compared the two image sets and delineated the CTV on the CBCT images according to the above standard to form the ground truth (GT) on CBCT.

Proposed methodology

Our proposed RCBA-UAMT is shown in Fig. 1. Labeled CT and CBCT images are input to the student model, and unlabeled CBCT images are input to both the student model and the teacher model, with different random noise perturbations added to each input. Features are randomly dropped in the teacher network, and N forward passes are performed to obtain N sets of prediction results. Therefore, for each pixel of the input image, N SoftMax probability vectors are obtained, from which the average probability vector is calculated; the information entropy of this average is then computed as the uncertainty measure. The supervised loss Lsup is calculated by the student model from the labeled images and their labels. The consistency loss Lcon is calculated from the outputs of the student and teacher models and is guided by the uncertainty map of the target, so that only reliable predictions contribute. The teacher model is updated by an exponential moving average (EMA) of the student model weights. In this section, a detailed explanation of the proposed RCBA-UAMT segmentation model is given.

Fig. 1 Schematic illustration of our RCBA-UAMT framework

Backbone architecture

The teacher and student models in RCBA-UAMT share the same architecture, as shown in Fig. 2a. In this study, the residual module [27] and the convolutional block attention module (CBAM) [28] are integrated into a 3D UNet [29] to optimize the network. The residual module connects feature information between layers, prevents the degradation problem caused by deepening the network, and improves segmentation performance. As shown in Fig. 2b, the CBAM module adjusts the attention weights of the output features along the channel and spatial dimensions to extract more effective feature information, enabling the network to adaptively attend to the most important information. In the encoder, each convolution operation consists of a 3 × 3 × 3 convolution, InstanceNorm [30], and Leaky ReLU [31], with a maximum pooling (MaxPool) layer used for downsampling. CBAM is composed of two serial modules: the channel attention module (CAM) and the spatial attention module (SAM). CAM applies attention weighting along the channel dimension of the input features; MaxPool and average pooling are applied to the input feature map to aggregate its spatial information. SAM applies attention weighting along the spatial dimension of the input features; a convolution followed by a sigmoid activation produces the spatial attention map, which is multiplied with the input feature map to obtain the final output feature map.
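To make the module design concrete, the following is a minimal PyTorch sketch of a 3D CBAM block as described above (channel attention with MaxPool/average pooling and a shared MLP, followed by spatial attention). The reduction ratio and spatial kernel size are illustrative assumptions, not values reported in this paper.

```python
import torch
import torch.nn as nn

class ChannelAttention3D(nn.Module):
    """CAM: aggregate spatial information by average and max pooling, then a shared MLP."""
    def __init__(self, channels, reduction=16):  # reduction ratio is an assumed hyperparameter
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv3d(channels, channels // reduction, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels // reduction, channels, kernel_size=1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3, 4), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3, 4), keepdim=True))
        return torch.sigmoid(avg + mx) * x

class SpatialAttention3D(nn.Module):
    """SAM: pool across channels, convolve, and gate the feature map with a sigmoid."""
    def __init__(self, kernel_size=7):  # kernel size is an assumed hyperparameter
        super().__init__()
        self.conv = nn.Conv3d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)
        mx, _ = torch.max(x, dim=1, keepdim=True)
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return attn * x

class CBAM3D(nn.Module):
    """CAM followed by SAM, applied serially as in Fig. 2b."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.cam = ChannelAttention3D(channels, reduction)
        self.sam = SpatialAttention3D(kernel_size)

    def forward(self, x):
        return self.sam(self.cam(x))
```

In the backbone, such a block would typically follow the residual convolution stage at each resolution level before downsampling.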

Fig. 2 a Architecture of the residual convolutional block attention 3D UNet, which is used as the backbone network in RCBA-UAMT. b Architecture of the 3D convolutional block attention module

The parameters of the teacher model are obtained from the student model through EMA, and the formula is expressed as follows:

$$z^{\prime}_{t} = \varepsilon z^{\prime}_{t - 1} + \left( 1 - \varepsilon \right)z_{t} ,$$

where \(z_{t}\) and \(z^{\prime}_{t}\) represent the parameters of the student network and the teacher network at training step t, respectively. \(\varepsilon\) is a fixed smoothing coefficient, set to 0.99 in this study. At each update, the teacher network keeps 99% of its own parameters, and 1% is transferred from the student network.
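For illustration, a minimal PyTorch sketch of this EMA update is given below; the function name and the assumption that the teacher and student expose identically ordered parameters are ours.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, epsilon=0.99):
    """Teacher keeps epsilon (99%) of its own weights and takes 1 - epsilon from the student."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.data.mul_(epsilon).add_(s_param.data, alpha=1.0 - epsilon)
```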

Uncertainty estimation

Given that the boundary between the CTV and normal tissue is fuzzy, the CTV edge is inevitably prone to uncertainty during automatic segmentation. In this paper, an uncertainty estimation method based on Monte Carlo dropout [32] is added to the network to provide reliable segmentation possibilities at different confidence levels and to explain incorrect predictions. In this method, the model is trained with dropout so that its parameters can be viewed as following a Bernoulli distribution: different outputs are generated for each input, and the variability of these outputs is used to quantify the uncertainty. Specifically, random noise is added to each input image, which is then passed through the teacher network multiple times; N forward passes are performed in the teacher network to obtain N sets of prediction results. Therefore, for each pixel of the input image, N SoftMax probability vectors are obtained, and the average probability vector is calculated. The formula is expressed as follows:

$$M_{c} = \frac{1}{N}\mathop \sum \limits_{t = 1}^{N} p_{t}^{c} .$$

The formula for calculating the uncertainty of the average probability is as follows:

$$U = - \mathop \sum \limits_{c} M_{c} \log \left( M_{c} \right),$$

where N is the number of forward passes, set to 8 in this study; c is the segmentation category; \(p_{t}^{c}\) is the probability map of category c at the t-th forward pass; \(M_{c}\) is the averaged probability map; and U is the information entropy, i.e., the probability-weighted entropy over all segmentation categories.
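The following is a minimal PyTorch sketch of this Monte Carlo dropout uncertainty estimation: N stochastic forward passes with noisy inputs through the teacher, averaged SoftMax probabilities, and voxel-wise predictive entropy. The noise magnitude and clamping range are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mc_dropout_uncertainty(teacher, x, n_passes=8, noise_std=0.1):
    """Average the SoftMax output of N stochastic teacher passes and compute its entropy."""
    teacher.train()  # keep dropout layers stochastic during these passes
    probs = []
    for _ in range(n_passes):
        noise = torch.clamp(torch.randn_like(x) * noise_std, -0.2, 0.2)  # assumed noise model
        probs.append(F.softmax(teacher(x + noise), dim=1))
    mean_prob = torch.stack(probs, dim=0).mean(dim=0)            # M_c, shape (B, C, D, H, W)
    uncertainty = -(mean_prob * torch.log(mean_prob + 1e-6)).sum(dim=1, keepdim=True)  # U
    return mean_prob, uncertainty
```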

Loss functions

The semi-supervised 3D segmentation model is trained to minimize the following joint objective loss function:

$$L = \mathop{\arg \min }\limits_{z} \mathop \sum \limits_{i = 1}^{M} L_{sup} \left( f\left( x_{i} ;z \right),y_{i} \right) + \lambda \mathop \sum \limits_{i = M + 1}^{M + Q} L_{con} \left( f\left( x_{i} ;z,\xi \right),f\left( x_{i} ;z^{\prime} ,\xi^{\prime} \right) \right),$$

where \(L_{sup}\) represents the supervised loss function; in this study, the Dice loss [33] combined with the cross-entropy loss [34] is used to evaluate the segmentation quality on the labeled data. \(L_{con}\) denotes the unsupervised consistency loss function [35]. \(f\) denotes the segmentation network, and \(z\) and \(z^{\prime}\) denote the parameters of the student and teacher networks, respectively. \(\xi\) and \(\xi^{\prime}\) are the different random noises applied to the student and teacher models. \(y_{i}\) is the label, M is the number of labeled cases, Q is the number of unlabeled cases, i is the data index, and \(\lambda\) is a weighting coefficient that regulates the trade-off between the unsupervised and supervised losses.

The consistency loss is only calculated in the region of low uncertainty, and the formula is expressed as follows:

$$L_{con} \left( f,f^{\prime} \right) = \frac{\mathop \sum \limits_{i} H\left( u_{i} < I \right)\left( f_{i} - f^{\prime}_{i} \right)^{2} }{\mathop \sum \limits_{i} H\left( u_{i} < I \right)},$$

where H is the indicator function (equal to 1 when \(u_{i} < I\) and 0 otherwise); \(f_{i}\) and \(f^{\prime}_{i}\) are the prediction results of the student and teacher networks at the i-th voxel, respectively; \(u_{i}\) is the uncertainty of the teacher network's prediction at that voxel; and \(I\) is the uncertainty threshold used to filter out uncertain voxels.
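A minimal PyTorch sketch of the joint objective is shown below, combining the Dice plus cross-entropy supervised loss with the uncertainty-masked consistency loss. The exact Dice formulation, tensor layouts, and the scheduling of the threshold I and weight λ are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def uncertainty_masked_mse(student_prob, teacher_prob, uncertainty, threshold):
    """L_con: voxel-wise MSE between student and teacher predictions,
    averaged only over voxels whose uncertainty is below the threshold I."""
    mask = (uncertainty < threshold).float()           # (B, 1, D, H, W), broadcasts over classes
    sq_diff = (student_prob - teacher_prob) ** 2
    return (mask * sq_diff).sum() / (mask.sum() + 1e-6)

def dice_loss(prob, target_onehot, eps=1e-6):
    """Soft Dice loss over the spatial dimensions (minimal version)."""
    dims = (2, 3, 4)
    inter = (prob * target_onehot).sum(dims)
    union = prob.sum(dims) + target_onehot.sum(dims)
    return 1.0 - ((2.0 * inter + eps) / (union + eps)).mean()

def total_loss(student_logits_lab, labels, student_prob_unlab,
               teacher_prob_unlab, uncertainty, threshold, lam):
    """L = L_sup (Dice + cross-entropy on labeled data) + lambda * L_con."""
    prob_lab = F.softmax(student_logits_lab, dim=1)
    onehot = F.one_hot(labels, num_classes=prob_lab.shape[1]).permute(0, 4, 1, 2, 3).float()
    l_sup = dice_loss(prob_lab, onehot) + F.cross_entropy(student_logits_lab, labels)
    l_con = uncertainty_masked_mse(student_prob_unlab, teacher_prob_unlab,
                                   uncertainty, threshold)
    return l_sup + lam * l_con
```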

Construction of comparative experiments

We compare and analyze our method against several advanced SSL segmentation methods, including the mean teacher (MT), uncertainty-aware mean teacher (UAMT) [21], and uncertainty rectified pyramid consistency (URPC) [36]. For a fair comparison, these methods were tested with the same network backbone (3D UNet) and the same number of epochs. In addition, the above networks were trained with 5%, 10%, and 20% labeled data to evaluate the effect of different labeled-data proportions on segmentation performance. In the labeled portion, the ratio of CT to CBCT data was 5:3.

Three sets of network experiments were constructed to evaluate the effects of the different modules on segmentation performance. The first group is UAMT with only a 3D U-Net backbone, the second group is Res-UAMT with residual blocks added to the backbone network, and the third group is our proposed RCBA-UAMT.

Experimental setup and evaluation metrics

This study was implemented in the PyTorch framework, and the SGD optimizer was used to update the network parameters. The initial learning rate was set to 0.001, the batch size was 2 (each batch consisting of 1 labeled image and 1 unlabeled image), and the number of training epochs was 100. A sub-volume of 400 × 400 × 48 at the center of the 3D image was cropped as the network input, and the final segmentation result was obtained using a sliding-window strategy.
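As an illustration of this configuration, the sketch below builds the SGD optimizer and the central sub-volume crop described above. The momentum and weight decay values, and the crop helper itself, are assumptions not stated in the text.

```python
import torch
import torch.nn as nn

def build_optimizer(model: nn.Module):
    """SGD with the learning rate stated in the paper; momentum/weight decay are assumed."""
    return torch.optim.SGD(model.parameters(), lr=0.001,
                           momentum=0.9, weight_decay=1e-4)

def crop_center_subvolume(volume: torch.Tensor, size=(400, 400, 48)):
    """Crop a central sub-volume from a 3D image tensor shaped (H, W, D)."""
    h, w, d = volume.shape
    ch, cw, cd = size
    sh, sw, sd = max((h - ch) // 2, 0), max((w - cw) // 2, 0), max((d - cd) // 2, 0)
    return volume[sh:sh + ch, sw:sw + cw, sd:sd + cd]
```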

In this study, four metrics were used for quantitative assessment: the Dice similarity coefficient (DSC), the Jaccard index, the average surface distance (ASD), and the 95% Hausdorff distance (95HD). DSC measures the similarity of two sets, and the Jaccard index measures the overlap between finite sample sets, reflecting their similarity and difference; the larger these two values, the higher the similarity between the samples. ASD measures the average distance between two surfaces, and 95HD measures the distance between two sets and is sensitive to the segmentation boundary region; the smaller these two values, the higher the similarity of the two sets. DSC, Jaccard, 95HD, and ASD are defined as follows:

$$\text{DSC} = \frac{2\left| \text{A} \cap \text{B} \right|}{\left| \text{A} \right| + \left| \text{B} \right|},$$

$$\text{Jaccard} = \frac{\left| \text{A} \cap \text{B} \right|}{\left| \text{A} \cup \text{B} \right|},$$

$$\text{95HD}\left( \text{A},\text{B} \right) = \max \left\{ \mathop{\text{P}_{95}}\limits_{\text{a} \in S\left( \text{A} \right)} \mathop {\min }\limits_{\text{b} \in S\left( \text{B} \right)} \left\| \text{a} - \text{b} \right\|,\ \mathop{\text{P}_{95}}\limits_{\text{b} \in S\left( \text{B} \right)} \mathop {\min }\limits_{\text{a} \in S\left( \text{A} \right)} \left\| \text{b} - \text{a} \right\| \right\},$$

$$\text{ASD} = \frac{1}{\left| S\left( \text{A} \right) \right| + \left| S\left( \text{B} \right) \right|}\left( \mathop \sum \limits_{\text{a} \in S\left( \text{A} \right)} \mathop {\min }\limits_{\text{b} \in S\left( \text{B} \right)} \left\| \text{a} - \text{b} \right\| + \mathop \sum \limits_{\text{b} \in S\left( \text{B} \right)} \mathop {\min }\limits_{\text{a} \in S\left( \text{A} \right)} \left\| \text{b} - \text{a} \right\| \right),$$

where A represents the predicted segmentation result, B represents the GT, S(A) and S(B) represent the surface voxels of sets A and B, respectively, and \(\text{P}_{95}\) denotes the 95th percentile.
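For reference, a minimal NumPy/SciPy sketch of these four metrics is given below. Surfaces are extracted by binary erosion and surface distances are computed with a Euclidean distance transform, which is one common implementation choice rather than the paper's exact code; voxel spacing defaults to 1 mm isotropic.

```python
import numpy as np
from scipy import ndimage

def dsc(a, b):
    """Dice similarity coefficient between two binary masks."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def jaccard(a, b):
    """Jaccard index between two binary masks."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()

def _surface_distances(a, b, spacing=(1.0, 1.0, 1.0)):
    """Distances from surface voxels of a to the surface of b, and vice versa."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    a_surf = a ^ ndimage.binary_erosion(a)
    b_surf = b ^ ndimage.binary_erosion(b)
    dt_to_b = ndimage.distance_transform_edt(~b_surf, sampling=spacing)
    dt_to_a = ndimage.distance_transform_edt(~a_surf, sampling=spacing)
    return dt_to_b[a_surf], dt_to_a[b_surf]

def asd(a, b, spacing=(1.0, 1.0, 1.0)):
    """Average surface distance."""
    d_ab, d_ba = _surface_distances(a, b, spacing)
    return (d_ab.sum() + d_ba.sum()) / (len(d_ab) + len(d_ba))

def hd95(a, b, spacing=(1.0, 1.0, 1.0)):
    """95% Hausdorff distance: maximum of the two directed 95th-percentile surface distances."""
    d_ab, d_ba = _surface_distances(a, b, spacing)
    return max(np.percentile(d_ab, 95), np.percentile(d_ba, 95))
```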
