Semi-supervised segmentation of cardiac chambers from LGE-CMR using feature consistency awareness

Methods background and design motivation

The mean teacher model has been widely used in semi-supervised learning tasks [39, 43,44,45]. Inspired by these works, we adopted the mean teacher model as the base architecture. The mean teacher model consists of two branches: a teacher and a student. During training, differently perturbed versions of the same image are fed to the two branches. By minimizing the difference between the teacher and student outputs, the model exploits unlabeled data for semi-supervised learning. We refer to this pattern as data-level consistency (consistency among data). However, current semi-supervised segmentation methods based on the mean teacher model typically compute the data-level consistency loss only in high-confidence regions [38]. To identify the distribution of low-confidence regions, this study applied Monte Carlo dropout, as used in previous research [38], to estimate prediction uncertainty. Specifically, for each input volume we performed multiple forward passes through the teacher model with random dropout and added Gaussian noise, calculated softmax probabilities for each voxel, and used predictive entropy as the uncertainty metric to assess the confidence of each prediction. As shown in Fig. 1, our analysis of the uncertainty map revealed that low-confidence regions were predominantly located around object edges, which are also areas prone to segmentation errors. To strengthen the focus on edge regions, this study explicitly introduced an edge prediction task to tighten the constraint on the segmentation boundary and designed a multi-task network architecture. The tasks complement each other, allowing the network both to capture global semantic information and to attend to fine-grained details of edge positions.

Fig. 1

Example of an uncertainty map from the left atrium dataset. The uncertainty map was obtained using Monte Carlo dropout, with highlighted regions indicating high uncertainty (low confidence). In this dataset, high uncertainty areas were primarily located along the edges of the atrium
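The uncertainty estimate described above can be sketched for a single voxel as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names and the toy logits are hypothetical, and in practice the T logit vectors would come from T stochastic forward passes of the teacher network (dropout active, Gaussian noise added to the input) computed with a tensor library over the whole volume.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predictive_entropy(stochastic_logits):
    """Entropy of the mean softmax over T stochastic forward passes.

    stochastic_logits: list of T logit vectors for ONE voxel (hypothetical
    input layout chosen for this sketch).
    """
    T = len(stochastic_logits)
    probs = [softmax(l) for l in stochastic_logits]
    n_cls = len(probs[0])
    # Average the class probabilities over the T passes.
    mean_p = [sum(p[c] for p in probs) / T for c in range(n_cls)]
    # Predictive entropy: high when the averaged prediction is ambiguous.
    return -sum(p * math.log(p + 1e-12) for p in mean_p)

# Toy example: passes that agree versus passes that disagree.
confident = predictive_entropy([[4.0, -4.0], [3.5, -3.5], [4.2, -4.2]])
uncertain = predictive_entropy([[2.0, -2.0], [-2.0, 2.0], [0.1, -0.1]])
```

Voxels whose stochastic predictions disagree receive a higher entropy, which is the behaviour the uncertainty map in Fig. 1 visualizes along the atrial boundary.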

To further leverage the potential of the edge prediction task, and inspired by the DTC network [36], this study introduced consistency between the segmentation task and the edge prediction task. Because the two tasks have different optimization objectives, the segmentation and edge prediction branches may focus on information at different scales, and this difference in focus acts as an additional perturbation. By mapping the segmentation results to edge predictions, we can enforce consistency regularization between the two tasks, thereby establishing task-level consistency (consistency among tasks).

Based on this task-level and data-level consistency design, our model encourages consistent segmentation masks for the same image and its perturbed versions between the teacher and student networks. Moreover, since the original image and its perturbed versions represent the same object, their semantic features should remain similar under different perturbations. This means that the feature embeddings produced by the feature extractors of the teacher and student networks should be close in the feature space, which corresponds to feature-level consistency (consistency among features). Simple feature-level consistency constraints can be imposed by applying an absolute error loss (L1) or a squared error loss (L2) to the encoder output features of the teacher and student networks [46]. However, in addition to ensuring that the encoder outputs of the teacher and student networks are similar, it is essential to ensure the contrastive property of the feature embeddings in the feature space. In other words, the feature embeddings of voxels belonging to the same category should cluster closely together in the feature space, while those from different categories should be pushed apart.

Contrastive learning is well suited to this requirement. However, currently popular contrastive learning approaches such as MoCo [47] and SimCLR [48], which treat entire images as instances, may not be optimal for medical image segmentation tasks. On the one hand, instance-level contrastive methods minimize the distance between augmented versions of the same image while maximizing the distance to other images. This may overlook the detailed structural information within each image that is critical for segmentation. On the other hand, using a large number of samples when constructing positive and negative pairs has been shown to be a critical factor in pretraining performance [48]. Due to resource limitations, it is challenging to follow methods like SimCLR [48] that increase the batch size to obtain a large number of contrastive sample pairs, particularly for 3D medical images with multiple slices. Inspired by previous research [33, 49,50,51], this study introduced voxel-level contrastive learning with a memory bank to enforce feature consistency. Specifically, a memory bank of fixed size stores and dynamically updates voxel features generated during training, so that the model can retrieve a large number of previously stored voxel features of different categories for voxel-level contrastive learning. This design eliminates the need to recalculate features for each contrastive sample and reduces the dependency on large batch sizes to gather a sufficient number of contrastive pairs. The memory bank effectively increases the number and diversity of contrastive samples without significantly increasing computational overhead.

Fig. 2

The overall architecture of the proposed model. The architecture follows the mean teacher model, where the student and teacher networks have the same structure. The model consists of an encoder (shown in deep cyan) for feature extraction and two task-specific output heads for segmentation (shown in red) and edge prediction (shown in deep blue). The network takes 3D medical imaging data as input, and the dual-task branches simultaneously generate segmentation probability maps and edge prediction results. The model parameters are optimized by minimizing the supervised loss (Sup-Loss, represented by red arrows) and three types of semi-supervised losses (Semi-Loss, indicated by yellow, green, and purple arrows) targeting consistency across data, tasks, and features. The teacher network is updated via the exponential moving average (EMA) of the student network’s weights. The “Erode” operation refers to the transformation from segmentation results to edge prediction

In summary, as shown in Fig. 2, this study designed a semi-supervised medical image segmentation network. The model takes 3D LGE-CMR as input and outputs object segmentation and edge prediction results. The overall framework consists of three main parts: a multi-task mean teacher structure (shown in khaki on the left), an inter-task transformation module (shown in green on the upper right), and a contrastive learning module for feature consistency (shown in purple on the left). These three parts impose consistency constraints at the data level, task level, and feature level, respectively.

In this manuscript, we defined the semi-supervised problem as follows. The semi-supervised training dataset \(D=\{D_L,D_U\}\) consists of N labeled data \(D_L=\{X_L,Y_L\}\) and M unlabeled data \(D_U=\{X_U\}\), with \(N \ll M\). \(X_L\) and \(Y_L\) represent the input images and the corresponding segmentation annotations from the labeled subset, and \(X_U\) represents the input images from the unlabeled subset. Given the model’s predicted segmentation output \(P_{seg}\), the semi-supervised approach computes the supervised loss \(\mathcal{L}_{sup}\) by comparing \(P_{seg}\) with \(Y_L\). Additionally, it calculates the unsupervised loss \(\mathcal{L}_{unsup}\) through consistency measures. The model optimizes its parameters under the joint constraint of the supervised loss \(\mathcal{L}_{sup}\) and the unsupervised loss \(\mathcal{L}_{unsup}\). The overall optimization objective of the model was to minimize the loss function in Eq. (1):

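$$\mathcal{L}=\mathcal{L}_{sup}+\lambda\mathcal{L}_{unsup}=\left(\mathcal{L}_{seg}+\beta\mathcal{L}_{edge}+\gamma L_{feat}^{L}\right)+\lambda\left(\mathcal{L}_{con}^{data}+\mathcal{L}_{con}^{task}+\mathcal{L}_{con}^{feat}\right)$$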

(1)

where \(\mathcal{L}_{sup}\) and \(\mathcal{L}_{unsup}\) represent the supervised loss and the unsupervised loss, respectively. \(\mathcal{L}_{seg}\) and \(\mathcal{L}_{edge}\) are the segmentation loss and the edge prediction loss in the multi-task framework (indicated by red dashed arrows), which are explained in the section "The mean teacher architecture of multiple tasks". \(L_{feat}^{L}\) is the feature consistency loss computed on the labeled data. \(\mathcal{L}_{con}^{data},\ \mathcal{L}_{con}^{task},\ \mathcal{L}_{con}^{feat}\) represent the consistency constraints at the data level, task level, and feature level (indicated by yellow, green, and purple arrows, respectively), which are introduced in turn in the sections from "The mean teacher architecture of multiple tasks" to "Voxel-level contrastive learning and feature-level consistency". λ, β and γ are weighting coefficients: λ balances the supervised loss against the unsupervised loss, β balances the segmentation loss against the edge prediction loss, and γ balances the feature consistency loss against the other losses.

The mean teacher architecture of multiple tasks

As shown in Fig. 2, the khaki-colored parts on the left represent the student and teacher networks, which had the same network structure but different parameter update strategies. The student network was updated using gradient descent to minimize the supervised loss on labeled data and the consistency loss on unlabeled data. In contrast, the teacher network was updated using the exponential moving average (EMA) of the student network’s weights. If we define the weight of the student network at time step t as \(\theta_t\), then the weight of the teacher network \(\theta'_t\) at time step t is:

$$\theta'_t=\alpha\theta'_{t-1}+\left(1-\alpha\right)\theta_t$$

(2)

where α is the update rate for EMA (typically set to 0.99), which balances how much of the teacher network weight \(\theta'_t\) at time step t comes from the student network’s weight \(\theta_t\) and how much from the teacher network’s previous weight \(\theta'_{t-1}\). This update strategy allows the teacher network to provide more stable and reliable predictions, incorporating the knowledge learned by the student network over time.
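The EMA rule in Eq. (2) can be illustrated with a short sketch. Here the network weights are represented as a plain dictionary of scalars, which is a simplification made for this example only; the function name `ema_update` and the toy values are hypothetical.

```python
def ema_update(teacher, student, alpha=0.99):
    """In-place EMA update: teacher <- alpha * teacher + (1 - alpha) * student."""
    for name, s_val in student.items():
        teacher[name] = alpha * teacher[name] + (1.0 - alpha) * s_val
    return teacher

# Toy example: the student weights are held fixed, and the teacher
# drifts slowly toward them over 100 update steps.
teacher = {"w": 0.0, "b": 1.0}
student = {"w": 1.0, "b": 1.0}
for _ in range(100):
    ema_update(teacher, student, alpha=0.99)
```

With α = 0.99 the teacher moves only 1% of the way toward the student at each step, which is why its predictions change more smoothly than the student's.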

In the mean teacher model, the architectures of the student and teacher networks were identical. For the segmentation branch, this study adopted V-Net, a classic encoder-decoder architecture widely used in medical imaging that has demonstrated excellent performance in various medical image segmentation tasks. As shown in Fig. 2, the segmentation branch consisted of four levels of encoders and corresponding decoders, connected through skip connections. Given an input image \(x\), the overall process of the segmentation task can be described as (3):

$$P_{seg}=\mathcal{F}_{seg}\left(x\right)$$

(3)

where \(\mathcal{F}_{seg}\) represents the target segmentation branch, and \(P_{seg}\) represents the obtained segmentation result.

For the edge prediction task, the shallow layers of the network tend to generate edges that are unrelated to the classes, while the deeper layers are responsible for detecting class-aware semantic edges. Taking the left atrium segmentation task as an example, we aimed to obtain edge prediction results that complemented the segmentation task and focused only on the edges of the left atrium rather than those of other cardiac cavities. Therefore, fusing features from shallow and deep layers was particularly important in the model design. Inspired by the DDS network [52], this study adopted a deep supervision-based edge prediction approach that fuses multi-scale features before producing the output. In the implementation, the feature maps from each stage of the encoder were upsampled by trilinear interpolation. The upsampled feature maps were then concatenated and processed by a 1 × 1 convolutional layer with a single output channel to generate the edge prediction map \(P_{edge}\). The trainable parameters of the edge prediction module were confined to this convolutional layer dedicated to channel transformation, so the module added very little burden to the overall model structure. The overall process can be described as (4):

$$P_{edge}=\mathcal{F}_{edge}\left(\left\{F^{1},up\left(F^{2}\right),up\left(F^{3}\right),up\left(F^{4}\right)\right\}\right)$$

(4)

where \(F^{i}\) represents the output feature map from the i-th stage of the encoder, \(i\in\left[1,4\right]\); \(\left\{F^{1},up\left(F^{2}\right),up\left(F^{3}\right),up\left(F^{4}\right)\right\}\) represents the result of upsampling and concatenating the feature maps from the four different scales. \(\mathcal{F}_{edge}\) represents the edge prediction branch, which was implemented using a 1 × 1 convolutional layer.

In summary, for the labeled data \(D_L=\{X_L,Y_L\}\), supervised learning can be performed using the segmentation branch and the edge prediction branch. Before calculating the loss, this study first extracted the target edges \(Y_{edge}\) matching the input \(X_L\) from the segmentation label \(Y_L\) using an edge extraction algorithm. Since a one-pixel-wide edge provides only a weak constraint, this study empirically extracted edges with a uniform thickness of 2 pixels as the supervision signal for optimization. The loss function for the segmentation task is a combination of the cross-entropy loss \(\mathcal{L}_{ce}\) and the Dice loss \(\mathcal{L}_{dice}\), given by (5):

$$\mathcal{L}_{seg}=0.5\times\left(\mathcal{L}_{ce}\left(P_{seg},Y_L\right)+\mathcal{L}_{dice}\left(P_{seg},Y_L\right)\right)$$

(5)

The loss function for the edge prediction task is the cross-entropy loss \(\mathcal{L}_{ce}\), as in (6):

$$\mathcal{L}_{edge}=\mathcal{L}_{ce}\left(P_{edge},Y_{edge}\right)$$

(6)
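The two supervised losses in Eqs. (5) and (6) can be sketched as below. This is an illustrative plain-Python version under stated assumptions: a binary (foreground versus background) setting, predictions given as flat lists of foreground probabilities, and a soft Dice formulation with a small smoothing term. The function names, the `eps` and `smooth` constants, and the toy inputs are hypothetical and not taken from the paper.

```python
import math

def cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over voxels (probs in [0, 1], labels in {0, 1})."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)

def dice_loss(probs, labels, smooth=1e-5):
    """Soft Dice loss: 1 - 2 * overlap / (sum of prediction + sum of label)."""
    inter = sum(p * y for p, y in zip(probs, labels))
    denom = sum(probs) + sum(labels)
    return 1.0 - (2.0 * inter + smooth) / (denom + smooth)

def segmentation_loss(probs, labels):
    """Eq. (5): equal-weight average of cross-entropy and Dice loss."""
    return 0.5 * (cross_entropy(probs, labels) + dice_loss(probs, labels))

def edge_loss(edge_probs, edge_labels):
    """Eq. (6): cross-entropy only for the edge prediction branch."""
    return cross_entropy(edge_probs, edge_labels)

# Toy example: a prediction close to the label versus one far from it.
labels = [1, 1, 0, 0]
good = segmentation_loss([0.9, 0.8, 0.1, 0.2], labels)
bad = segmentation_loss([0.2, 0.1, 0.9, 0.8], labels)
```

Combining the two terms is a common choice because cross-entropy penalizes each voxel independently while Dice measures region overlap, which is less sensitive to the foreground being small.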

For the unlabeled data \(D_U=\{X_U\}\), the labels \(Y_U\) are missing, so the loss cannot be calculated directly. Ideally, for the same input \(x\) under different perturbations, the outputs of the teacher network \(P^{T}\) and the student network \(P^{S}\) should be consistent. Therefore, within the mean teacher framework, this study introduced consistency constraints as an unsupervised loss, encouraging consistent outputs under different perturbations of the same input. During the forward pass of the network, this study applied noise to the input \(x\) and used Dropout operations in the network. The differences between the predictions of the teacher and student networks serve as unsupervised constraint signals for the parameter updates. Since the perturbations were mainly applied to the input images, this approach can be viewed as a consistency constraint at the data level. The loss term \(\mathcal{L}_{con}^{data}\) can be described as (7):

$$\mathcal{L}_{con}^{data}=\mathcal{L}_{c}\left(P_{seg}^{T},P_{seg}^{S}\right)+\mathcal{L}_{c}\left(P_{edge}^{T},P_{edge}^{S}\right)$$

(7)

where \(P_{seg}^{T},P_{seg}^{S},P_{edge}^{T},P_{edge}^{S}\) represent the segmentation results and edge prediction results from the teacher and student networks, respectively. \(\mathcal{L}_{c}\left(\cdot\right)\) is an unsupervised loss used to measure the consistency between the predictions of the teacher and student networks for the same input \(x\) with different perturbations. In this study, the Mean Squared Error (MSE) loss was chosen for computing the consistency loss.
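As a small worked example of Eq. (7), the data-level consistency loss is the sum of two MSE terms, one per task. The sketch below uses flat lists of probabilities for brevity; the function names and numbers are hypothetical, and in training the teacher outputs would be treated as fixed targets.

```python
def mse(a, b):
    """Mean squared error between two equally long lists of predictions."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def data_consistency_loss(seg_t, seg_s, edge_t, edge_s):
    """Eq. (7): MSE between teacher and student outputs, summed over both tasks."""
    return mse(seg_t, seg_s) + mse(edge_t, edge_s)

# Teacher and student see differently perturbed versions of the same input.
loss = data_consistency_loss(
    seg_t=[0.9, 0.1], seg_s=[0.7, 0.1],
    edge_t=[0.5, 0.5], edge_s=[0.5, 0.3],
)
```

Each term here evaluates to 0.02, giving a total of 0.04; the loss is zero only when the student reproduces the teacher's outputs for both tasks.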

Inter-task transformation module and task-level consistency

To apply task-level consistency, this study first implemented the transformation from segmentation results to edge prediction (“Erode” in Fig. 2), so that the difference between the outputs of the two tasks can be minimized. For a pixel point \(px\) belonging to the segmented object, the segmentation result can be transformed into an edge prediction using formula (8):

$$T\left(px\right)=\begin{cases}1, & px\in seg\;\&\;\min\left\{d\left(px,q\right):q\notin seg\right\}<D\\0, & \text{otherwise}\end{cases}$$

(8)

where \(\min\left\{d\left(px,q\right):q\notin seg\right\}\) describes the minimum distance from the current pixel point \(px\) to the background pixels (pixels outside the segmented object). D is the distance threshold, which can be regarded as the thickness of the edge. In this study, an equal-thickness region of D = 2 pixels was selected empirically as the target edge. An erosion operation was used to extract the edge from the segmented object, and it was implemented using max-pooling. The transformation did not interrupt gradient backpropagation, making it suitable for parameter optimization using gradient descent.

In this study, the consistency constraint between tasks was only applied to the unlabeled data \(D_U=\{X_U\}\). This loss \(\mathcal{L}_{con}^{task}\) can be described as (9):

$$\mathcal{L}_{con}^{task}=\mathcal{L}_{c}\left(P_{edge},Erode\left(P_{seg}\right)\right)$$

(9)
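The "Erode" transformation and the task-level loss in Eq. (9) can be sketched as follows. This is a simplified 2D, plain-Python illustration rather than the authors' code. It assumes that erosion is obtained as 1 − maxpool(1 − seg) with a square (2r+1) × (2r+1) window and stride 1, that the edge is the segmentation minus its erosion, and that cells outside the image are ignored; the radius `r`, the function names, and the toy grid are hypothetical. The text does not state the distance metric or kernel size, so r = 2 is used here only because it yields a band two pixels thick.

```python
def max_pool_same(grid, r):
    """Stride-1 max pooling with a (2r+1)x(2r+1) window; out-of-bounds cells ignored."""
    h, w = len(grid), len(grid[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            out[i][j] = max(
                grid[a][b]
                for a in range(max(0, i - r), min(h, i + r + 1))
                for b in range(max(0, j - r), min(w, j + r + 1))
            )
    return out

def erode_to_edge(seg, r=2):
    """Soft edge map: seg minus its erosion, with erosion = 1 - maxpool(1 - seg)."""
    inv = [[1.0 - v for v in row] for row in seg]
    pooled = max_pool_same(inv, r)
    eroded = [[1.0 - v for v in row] for row in pooled]
    return [[s - e for s, e in zip(srow, erow)] for srow, erow in zip(seg, eroded)]

def task_consistency_loss(edge_pred, seg_pred, r=2):
    """Eq. (9) with MSE: compare the edge branch output with Erode(segmentation)."""
    target = erode_to_edge(seg_pred, r)
    diffs = [(e - t) ** 2
             for erow, trow in zip(edge_pred, target)
             for e, t in zip(erow, trow)]
    return sum(diffs) / len(diffs)

# Toy example: a 5x5 foreground square inside a 7x7 image.
seg = [[1.0 if 1 <= i <= 5 and 1 <= j <= 5 else 0.0 for j in range(7)]
       for i in range(7)]
edge = erode_to_edge(seg, r=2)
```

Because the transformation uses only subtraction and max-pooling, the same construction written with a tensor library's 3D max-pooling remains differentiable, which matches the remark above that the transformation does not interrupt backpropagation.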

Voxel-level contrastive learning and feature-level consistency

Fig. 3

The schematic diagram of contrastive learning. (a) illustrates how the contrastive learning loss is applied. Unlabeled samples are simultaneously fed into both the teacher and student networks. The features generated by the student model’s encoder are projected through a mapping layer, and the segmentation results from the teacher model act as pseudo-labels, assigning class information to each voxel of the student model’s encoder output. The contrastive learning loss is then calculated. (b) illustrates the update rule for storing features in the memory bank. The memory bank functions as a fixed-length queue that operates on a first-in, first-out (FIFO) basis. When labeled data is input, the teacher model generates feature representations and segmentation results. High-quality feature vectors are selected based on these segmentation results and the corresponding labels and are pushed into the memory bank, while the oldest feature vectors are popped out

As previously mentioned, this study enforces feature consistency constraints through voxel-level contrastive learning. Unlike the two consistency constraints above, whose losses are calculated on the segmentation results, the feature consistency loss is calculated at the encoder output of the encoder-decoder structure, thus encouraging the encoder to extract consistent feature representations for the original image and its perturbed versions. As shown in Fig. 3(a), an unlabeled sample was simultaneously input to both the teacher and student networks. The encoder output of the student network after projection mapping was used as the feature representation of the image. The segmentation result obtained from the teacher network served as a pseudo-label, assigning class information to each voxel in the feature representation. The contrastive learning loss was then computed voxel-wise. In the calculation, this study treated voxel feature representations in the Memory Bank of the same class as positive samples and those of different classes as negative samples. With the voxel feature vectors from the student network as queries and the vectors from the Memory Bank as keys, the optimization objective of contrastive learning was to maximize the similarity between queries and keys of the same class and minimize the similarity between queries and keys of different classes. The loss function used in this study is shown in Eq. (10):

$$\mathcal{L}_{con}^{feat}=\sum_{c=1}^{C}\frac{1}{N^{+}}\sum_{i=1}^{N^{+}}-\log\frac{\text{sim}\left(q_c,k_i^{+}\right)}{\sum_{m=1}^{N^{+}}\text{sim}\left(q_c,k_m^{+}\right)+\sum_{j=1}^{N^{-}}\text{sim}\left(q_c,k_j^{-}\right)}$$

(10)

where C is the number of classes, \(q_c\) represents the feature representation of a single voxel from the student network, \(k^{+},k^{-}\) represent the voxel feature representations from the Memory Bank that belong to the same class as \(q_c\) and to different classes, respectively, \(N^{+}/N^{-}\) are the numbers of \(k^{+},k^{-}\) in the Memory Bank for \(q_c\), \(sim\left(\cdot\right)\) is the similarity function \(\text{sim}\left(q,k\right)=\exp\left(q^{T}k/\tau\right)\), and τ is the temperature coefficient. During training, minimizing \(\mathcal{L}_{con}^{feat}\) pulled voxels of the same class closer together and pushed voxels of different classes farther apart, enforcing the feature consistency constraints. This ensured the contrastive property between voxels of the same and different classes and compelled the encoder to learn a good feature representation.
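For a single query voxel, the contrastive term can be sketched as below. This is one InfoNCE-style reading of Eq. (10), offered as an assumption: each positive key is scored against the sum of similarities to all positive and negative keys, and the negative log ratios are averaged over the positives. The function names, the temperature value 0.1, and the toy unit vectors are hypothetical.

```python
import math

def sim(q, k, tau=0.1):
    """Similarity used in the text: exp(q . k / tau)."""
    return math.exp(sum(a * b for a, b in zip(q, k)) / tau)

def voxel_contrastive_loss(q, positives, negatives, tau=0.1):
    """Contrastive loss for ONE query voxel against memory-bank keys.

    positives: keys of the same class as q; negatives: keys of other classes.
    """
    pos = [sim(q, k, tau) for k in positives]
    neg = [sim(q, k, tau) for k in negatives]
    denom = sum(pos) + sum(neg)
    # Average the negative log ratio over all positive keys.
    return sum(-math.log(p / denom) for p in pos) / len(pos)

# Toy example with unit-length 2D features.
q = [1.0, 0.0]
same_class = [[1.0, 0.0], [0.8, 0.6]]
other_class = [[-1.0, 0.0], [0.0, 1.0]]
loss_good = voxel_contrastive_loss(q, same_class, other_class)
loss_bad = voxel_contrastive_loss(q, other_class, same_class)
```

The loss is small when the query is aligned with keys of its own class and large when it is aligned with keys of other classes, which is the clustering behaviour described above. Normalizing the feature vectors keeps the exponent bounded by 1/τ and avoids overflow.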

When calculating the loss, a crucial issue is the maintenance of a high-quality Memory Bank. As storage space is too limited to hold the voxel feature representations of all samples, and as this study focused on semi-supervised segmentation, the Memory Bank only stored high-quality voxel feature vectors selected from labeled data. The Memory Bank was designed as a fixed-length queue of length η, and its update rule, shown in Fig. 3(b), follows the first-in, first-out principle. The update process of the Memory Bank only involved the teacher network. Given labeled data as input, the network outputs the feature representation and the segmentation result of the image. A feature quality evaluation rule was applied to the segmentation prediction and the segmentation label to select high-quality feature vectors from the feature representation. The selected feature vectors were pushed into the Memory Bank, and the earliest stored feature vectors were popped out. Because the teacher network was updated using EMA, its output features were a smoothed representation over the current and previous time steps, so the Memory Bank stored relatively stable and reliable features. The feature quality evaluation rule was defined as follows: given the output feature representation \(F^{T}\), the segmentation result \(P_{seg}^{T}\), and the segmentation label \(Y_L\), the voxel points in \(F^{T}\) with high-quality feature vectors should satisfy the condition \(Y_L==\left(Sigmoid\left(P_{seg}^{T}\right)>\mu\right)\), where µ is the confidence threshold. The selected voxel points were then sorted by confidence, and the top K voxel points were used as the feature representation for updating the Memory Bank.
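The Memory Bank update rule described above can be sketched with a fixed-length queue. This is a minimal illustration under stated assumptions: a binary task, teacher outputs given as foreground probabilities (already passed through the sigmoid), and the confidence of a background voxel taken as one minus its foreground probability. The class name `MemoryBank`, its methods, and the default values of `mu` and `top_k` are hypothetical.

```python
from collections import deque

class MemoryBank:
    """Fixed-length FIFO store of (feature vector, class label) pairs."""

    def __init__(self, eta):
        # A deque with maxlen drops the oldest entry when a new one is appended.
        self.queue = deque(maxlen=eta)

    def update(self, feats, probs, labels, mu=0.5, top_k=2):
        """Push the top-K most confident, correctly predicted voxels.

        feats: voxel feature vectors from the teacher encoder
        probs: teacher foreground probabilities; labels: ground truth in {0, 1}
        """
        candidates = []
        for f, p, y in zip(feats, probs, labels):
            pred = 1 if p > mu else 0
            if pred == y:  # keep only voxels whose prediction matches the label
                conf = p if y == 1 else 1.0 - p
                candidates.append((conf, f, y))
        candidates.sort(key=lambda c: c[0], reverse=True)
        for _, f, y in candidates[:top_k]:
            self.queue.append((f, y))

    def keys(self, cls):
        """Return all stored feature vectors of the given class."""
        return [f for f, y in self.queue if y == cls]

# Toy example: a bank of length 3 receives two labeled batches.
bank = MemoryBank(eta=3)
bank.update([[1], [2], [3], [4]], [0.9, 0.2, 0.6, 0.4], [1, 0, 1, 1])
bank.update([[5], [6]], [0.95, 0.1], [1, 0])
```

After the second update the queue is full, so the oldest entry from the first batch is discarded, which mirrors the first-in, first-out behaviour shown in Fig. 3(b).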

To make better use of the labeled data \(D_L=\{X_L,Y_L\}\) in semi-supervised tasks, we also applied the feature consistency loss to \(D_L\). When calculating the loss \(L_{feat}^{L}\), we replaced the pseudo-labels generated by the teacher network with the actual labels \(Y_L\).
