Residual refinement for interactive skin lesion segmentation

The network structure consists of three main parts: (1) the feature encoder, which encodes features at different levels of abstraction; (2) SBox-Net, which takes the two levels of features obtained from the encoder, namely low-level and high-level features, highlights the high-level features by reducing the number of channels of the low-level feature maps, and produces a coarse segmentation prediction; and (3) Click-Net, whose main goal is to recover details according to the user's clicks. We generate a Gaussian distance map of the same size as the input image from the user's clicks and use it as the weight for the segmentation result of the final upsampling layer of Click-Net; the Click-Net segmentation is then refined according to this weight. The architecture of our proposed method is illustrated in Fig. 2.

Fig. 2

The overall architecture of our proposed method. It is composed of three major parts: a feature encoder for encoding features at different abstract levels, an SBox-Net for initial segmentation and a Click-Net for refinement. Using the feature encoder, we obtain feature maps at two levels of abstraction, namely, low-level features and high-level features. Our SBox-Net is used to predict segmentation at a coarse level; thus, we highlight the high-level features by reducing the number of channels of the low-level feature maps. In our Click-Net, by contrast, we reduce the channels of the high-level features, since its goal is to recover details according to user clicks. All of the channel reduction operations mentioned above are performed by 1 × 1 convolution. Finally, we simulate user clicks by sampling from the differences between the SBox-Net segmentation and the ground truth (denoted by ⊗ in the figure)

SBox-Net

Our SBox-Net is designed as a binary segmentation network. Except for the last inference layer, there is no difference between SBox-Net and a standard semantic segmentation network. Thus, SBox-Net can smoothly utilize a pre-trained state-of-the-art semantic segmentation network: we simply replace the top segmentation layer of the existing model with our binary segmentation layer and then fine-tune the network to fit our goal. This strategy saves considerable training time and computational resources. To simulate a user drawing a surrounding box, we take the bounding box of the ground-truth mask jittered randomly by up to 30 px in each direction, which models the randomness of user behaviour well.
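
As a concrete illustration, the surrounding-box simulation can be sketched as follows; this is a minimal example assuming the ground-truth mask is a binary NumPy array, with the 30 px jitter limit taken from the description above.

```python
# A minimal sketch of the surrounding-box simulation described above, assuming the
# ground-truth mask is a binary NumPy array; the 30 px jitter limit follows the text.
import numpy as np

def simulate_surrounding_box(gt_mask, max_jitter=30, rng=None):
    """Return (x0, y0, x1, y1): the ground-truth bounding box, with each edge
    jittered randomly by up to `max_jitter` pixels and clipped to the image."""
    rng = rng or np.random.default_rng()
    ys, xs = np.nonzero(gt_mask)
    x0, x1 = int(xs.min()), int(xs.max())
    y0, y1 = int(ys.min()), int(ys.max())
    jitter = rng.integers(-max_jitter, max_jitter + 1, size=4)  # one offset per edge
    h, w = gt_mask.shape
    x0 = int(np.clip(x0 + jitter[0], 0, w - 1))
    y0 = int(np.clip(y0 + jitter[1], 0, h - 1))
    x1 = int(np.clip(x1 + jitter[2], 0, w - 1))
    y1 = int(np.clip(y1 + jitter[3], 0, h - 1))
    return min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1)
```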

As shown in Supplementary Fig. 1, SBox-Net first concatenates the shallow and deep features extracted from the encoder. A 3 × 3 convolution then refines the concatenated features, giving them richer semantic content. During upsampling, bilinear interpolation with a factor of 4 recovers a pixel-wise prediction at the resolution of the image fed into the encoder. We define this prediction as the coarse prediction; in clinical segmentation, if the physician is satisfied with this result, no further manipulation is needed. Otherwise, Click-Net can be used for refinement. Sections “User interaction simulation” and “User interaction transformation” introduce some preliminary information about Click-Net, and section “Click-Net” describes Click-Net in detail.
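
A rough PyTorch sketch of this decoder head is given below. The channel sizes and normalization layers are illustrative assumptions rather than the exact configuration of our network; the 1 × 1 channel reduction, concatenation, 3 × 3 refinement and factor-4 bilinear upsampling follow the text.

```python
# Illustrative sketch of the SBox-Net decoder head; channel sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SBoxHead(nn.Module):
    def __init__(self, low_ch=256, high_ch=256, low_reduced=48):
        super().__init__()
        self.reduce_low = nn.Conv2d(low_ch, low_reduced, kernel_size=1)   # 1x1 channel reduction
        self.refine = nn.Sequential(                                      # 3x3 refinement
            nn.Conv2d(high_ch + low_reduced, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
        )
        self.classifier = nn.Conv2d(256, 1, kernel_size=1)                # binary segmentation layer

    def forward(self, low_feat, high_feat):
        low = self.reduce_low(low_feat)
        # bring the high-level features to the low-level spatial size before concatenation
        high = F.interpolate(high_feat, size=low.shape[-2:], mode="bilinear", align_corners=False)
        x = self.refine(torch.cat([low, high], dim=1))
        logits = self.classifier(x)
        # factor-4 bilinear interpolation back to the resolution of the encoder input
        return F.interpolate(logits, scale_factor=4, mode="bilinear", align_corners=False)
```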

User interaction simulation

Surrounding box simulation is quite straightforward, as stated in section “SBox-Net”. Click simulation requires slightly more caution.

Before delving into the details of click simulation, we need to go through the workflow of a typical interactive object segmentation process. First, a user draws a surrounding box around the target object. Based on the surrounding box, SBox-Net performs one pass of inference on the patch of the image cropped by the box. If the result needs to be refined, it typically contains two types of mistakes, namely, extra pixels and left-behind pixels (from the user's perspective). On these two types of mistakes, the user adds clicks to refine the segmentation result.

By separating our architecture into SBox-Net and Click-Net, we can faithfully simulate these two types of mistakes at training time. After a forward pass of SBox-Net, we obtain a preliminary result. We can then compute the differences between the preliminary result and the ground-truth mask, obtaining the false positives and false negatives of the preliminary result, which closely simulate the two types of mistakes mentioned above. Thus, we can directly sample clicks on the false positives and false negatives (see Fig. 2). Our strategy for simulating user clicks is simpler, more straightforward and more effective than that introduced by [24].
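
A minimal sketch of this sampling strategy is given below; the number of clicks drawn per error region is an illustrative assumption.

```python
# Positive clicks are sampled from false negatives (left-behind pixels) and negative
# clicks from false positives (extra pixels) of the SBox-Net prediction.
import numpy as np

def sample_clicks(pred, gt, n_clicks=3, rng=None):
    """pred, gt: binary masks of the same shape. Returns (positive, negative) click lists."""
    rng = rng or np.random.default_rng()
    false_neg = np.argwhere((gt == 1) & (pred == 0))  # left-behind pixels -> positive clicks
    false_pos = np.argwhere((gt == 0) & (pred == 1))  # extra pixels       -> negative clicks

    def pick(coords):
        if len(coords) == 0:
            return []
        idx = rng.choice(len(coords), size=min(n_clicks, len(coords)), replace=False)
        return [tuple(coords[i]) for i in idx]  # (row, col) pairs

    return pick(false_neg), pick(false_pos)
```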

User interaction transformation

At the inference time of our Click-Net, a user can provide positive and negative clicks to refine the results of SBox-Net. All user interactions can be grouped into two sets: a positive click set \( {S}^1 \), which contains all user-provided positive clicks, and a negative click set \( {S}^2 \), which contains all user-provided negative clicks. A Gaussian distance transformation is used to transform these two sets into two separate channels \( {G}^1 \) and \( {G}^2 \), respectively, both of which are initialized to zero. Let \( {G}_{m,n}^1 \) and \( {G}_{m,n}^2 \) be the elements at location (m, n) in matrices \( {G}^1 \) and \( {G}^2 \), respectively, which are calculated by:

$$ {G}_{m,n}^1=\underset{{s}_{i,j}\in {S}^1}{\max }{e}^{-\frac{{\left(m-i\right)}^2+{\left(n-j\right)}^2}{2{R}^2}} $$

(1)

$$ {G}_{m,n}^2=\underset{{s}_{i,j}\in {S}^2}{\max }{e}^{-\frac{{\left(m-i\right)}^2+{\left(n-j\right)}^2}{2{R}^2}} $$

(2)

where R is a radius parameter that controls the area of influence of a user click. After the transformation of user clicks, we concatenate the feature maps extracted from SBox-Net with \( {G}^1 \) and \( {G}^2 \), which are then fed into Click-Net for further processing.
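
The transformation in Eqs. (1) and (2) can be sketched as follows; the concrete value of the radius R is an illustrative assumption.

```python
# Each click set is turned into a Gaussian-centred map by taking, per pixel, the maximum
# over the clicks in that set, following Eqs. (1)-(2). The radius value is illustrative.
import numpy as np

def gaussian_click_map(clicks, shape, radius=10.0):
    """clicks: list of (row, col) positions; returns an H x W map initialized to zero."""
    g = np.zeros(shape, dtype=np.float32)
    rows = np.arange(shape[0])[:, None]
    cols = np.arange(shape[1])[None, :]
    for i, j in clicks:
        d2 = (rows - i) ** 2 + (cols - j) ** 2
        g = np.maximum(g, np.exp(-d2 / (2.0 * radius ** 2)))  # max over clicks
    return g

# G1 is built from the positive clicks and G2 from the negative clicks; both are then
# concatenated with the SBox-Net feature maps along the channel dimension.
```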

Click-Net

The workflow of Click-Net is shown in Supplementary Fig. 2. Built on top of the SBox-Net segmentation, Click-Net is designed specifically to respond to user clicks when the user seeks to refine the segmentation result. To achieve this, the training data for Click-Net must be collected carefully; the click simulation strategy is described in detail in section “User interaction simulation”. In Click-Net, we first transform the positive and negative clicks into two Gaussian-centred maps. We then concatenate these maps with the feature maps extracted from SBox-Net and feed the result into Click-Net to generate our final segmentation. In contrast to previous works [24,25,26], we do not concatenate the transformed user clicks with the raw images directly but with feature maps instead. The main motivation is to decouple the segmentation process from the refinement process. Moreover, user clicks are informative both semantically (positive or negative) and spatially (the absolute position of the click inside the surrounding box); thus, their level of abstraction is more compatible with high-level features than with low-level features such as raw pixels.
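
A rough PyTorch sketch of this input stage is given below; the channel sizes and layer configuration are illustrative assumptions, and only the decoupling idea (concatenating the click maps with channel-reduced SBox-Net features rather than with the raw image) is taken from the text.

```python
# Illustrative sketch of the Click-Net input stage; channel sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClickNetHead(nn.Module):
    def __init__(self, feat_ch=256, reduced_ch=48):
        super().__init__()
        self.reduce_high = nn.Conv2d(feat_ch, reduced_ch, kernel_size=1)  # 1x1 channel reduction
        self.refine = nn.Sequential(
            nn.Conv2d(reduced_ch + 2, 64, kernel_size=3, padding=1),      # +2 channels for G1, G2
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, kernel_size=1),                              # binary logits
        )

    def forward(self, sbox_features, g1, g2):
        x = self.reduce_high(sbox_features)
        # resize the click maps to the feature resolution before concatenation
        maps = torch.stack([g1, g2], dim=1)
        maps = F.interpolate(maps, size=x.shape[-2:], mode="bilinear", align_corners=False)
        logits = self.refine(torch.cat([x, maps], dim=1))
        # upsample back to the resolution of the cropped input patch
        return F.interpolate(logits, scale_factor=4, mode="bilinear", align_corners=False)
```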

Inspired by ResNet [19], which incorporates residual blocks to tackle the degradation problem in very deep networks and significantly boosts their performance, we designed our Click-Net as a residual refinement network. Before yielding the final segmentation, Click-Net fuses its output with that of SBox-Net, which makes it in effect a residual refinement network. The fusion process considers the number and positions of the user clicks: we transform the clicks into a weight map using a Gaussian distance transformation. Unlike in the user interaction transformation described in section “User interaction transformation”, we do not differentiate between positive and negative clicks. Moreover, instead of setting each pixel value to the maximum Gaussian distance over all click points, we sum those distances. Finally, the radius parameter R, which controls the area of influence of a user click, is set to a much larger value, allowing each click to adjust the weight of a much broader area. The final weight map is given as:

$$ {W}_{m,n}=\sum \limits_{{s}_{i,j}\in \left({S}^1\cup {S}^2\right)}{e}^{-\frac{{\left(m-i\right)}^2+{\left(n-j\right)}^2}{2{R}^2}} $$

(3)

In Formula 3, \( {W}_{m,n} \) represents the sum of the Gaussian distances between all the click points and the element at location (m, n) in matrix W. \( {e}^{-\frac{{\left(m-i\right)}^2+{\left(n-j\right)}^2}{2{R}^2}} \) represents the Gaussian distance from a single click point \( {s}_{i,j} \) in the set \( {S}^1\cup {S}^2 \) to the element at location (m, n) in matrix W.
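
A minimal sketch of the weight-map computation in Eq. (3) is given below; the concrete (larger) radius value is an illustrative assumption.

```python
# Positive and negative clicks are pooled into one set, their Gaussian contributions are
# summed rather than maximised, and the radius is larger than in Eqs. (1)-(2).
import numpy as np

def fusion_weight_map(pos_clicks, neg_clicks, shape, radius=40.0):
    w = np.zeros(shape, dtype=np.float32)
    rows = np.arange(shape[0])[:, None]
    cols = np.arange(shape[1])[None, :]
    for i, j in list(pos_clicks) + list(neg_clicks):      # all clicks in S1 ∪ S2
        d2 = (rows - i) ** 2 + (cols - j) ** 2
        w += np.exp(-d2 / (2.0 * radius ** 2))            # summed, per Eq. (3)
    return w
```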

After obtaining the weight map, we can fuse the SBox-Net result, denoted B, with the Click-Net result, denoted C, to produce our final result, denoted F, using the formula:

where ∗ is the element-wise multiplication operator and + is the element-wise addition operator.
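
As a minimal sketch, assuming a weighted fusion in which the Click-Net result dominates near clicks and the SBox-Net result dominates elsewhere, the combination could look as follows; this particular form is an illustrative assumption rather than the exact fusion formula.

```python
# Assumed weighted fusion of the SBox-Net result B and the Click-Net result C using the
# click-derived weight map W; the specific formula shown here is an illustration only.
import numpy as np

def fuse_results(b, c, w):
    """b: SBox-Net result, c: Click-Net result, w: weight map from user clicks."""
    w = np.clip(w, 0.0, 1.0)          # keep the summed Gaussian weights in [0, 1]
    return w * c + (1.0 - w) * b      # element-wise multiplication and addition
```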
