We describe our pipeline for KP-based geometric segmentation of lung fissures from lung CT images as depicted in Fig. 1.
Keypoint and feature extraction
First, the input CT image is abstracted into a KP cloud by selecting a tiny fraction of all voxels as fissure candidates. We employ the two best-performing methods from [3]: the generic Förstner KPs and the CNN-based pre-segmentation KPs. We restrict all points to lie inside the lung mask and select at most \(K=20\,000\) points per image; K was chosen heuristically to balance point cloud resolution and segmentation efficiency. All coordinates are normalized, yielding the point cloud \(\varvec{x}\in [-1, 1]^{K\times 3}\), which carries shape information about the fissures. Providing image information in addition to the shape features greatly improves point segmentation [3]. Therefore, we adopt the simplest and most effective method from [3], sampling \((5\times 5\times 5)\)-sized patches of normalized image intensity around each point.
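A minimal sketch of this patch-based feature extraction is shown below. It assumes integer voxel indices for the keypoints and a volume that is already intensity-normalized; the function name and the zero-padding at the lung border are illustrative choices, not taken from the original implementation.

```python
import torch
import torch.nn.functional as F

def sample_patch_features(volume: torch.Tensor, keypoints: torch.Tensor, patch: int = 5) -> torch.Tensor:
    """Extract flattened (patch**3)-dim intensity features around each keypoint.

    volume:    (D, H, W) CT intensities, already normalized to a fixed range.
    keypoints: (K, 3) integer voxel indices (z, y, x) inside the volume.
    Returns:   (K, patch**3) per-point image features.
    """
    r = patch // 2
    vol = F.pad(volume, (r, r, r, r, r, r))          # zero-pad so border keypoints stay valid
    # offsets covering the full patch neighborhood, shape (patch**3, 3)
    offs = torch.stack(torch.meshgrid(
        torch.arange(patch), torch.arange(patch), torch.arange(patch),
        indexing="ij"), dim=-1).reshape(-1, 3)
    # in the padded volume, keypoint k + offsets [0, patch) covers [k - r, k + r] of the original
    idx = keypoints[:, None, :] + offs[None, :, :]   # (K, patch**3, 3)
    return vol[idx[..., 0], idx[..., 1], idx[..., 2]]
```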
Förstner keypoints
Förstner KPs [16] describe locally distinctive points in an image. This operator is widely used in classical computer vision approaches. Since it is purely unsupervised and image-based, it is fissure-agnostic and does not require prior knowledge about the target structure. The KPs are detected as described in [17]: first, the distinctiveness measure is computed from the structure tensor built from first-order image gradients; then, the local maxima of distinctiveness in \((5\times 5\times 5)\) neighborhoods are extracted. This produces a rather uniformly distributed point cloud (cf. Fig. 1) whose points tend to be corners or blobs.
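The following sketch illustrates the general idea of structure-tensor distinctiveness with \((5\times 5\times 5)\) non-maximum suppression. The smoothing window, the regularization constant, and the exact distinctiveness formula (here the Förstner-style measure \(1/\mathrm{tr}(S^{-1})\)) are assumptions for illustration; the detector of [17] may differ in these details.

```python
import torch
import torch.nn.functional as F

def foerstner_keypoints(img: torch.Tensor, window: int = 5, mask: torch.Tensor = None) -> torch.Tensor:
    """Structure-tensor distinctiveness followed by (5x5x5) non-maximum suppression.

    img:  (1, 1, D, H, W) CT volume.
    mask: optional (1, 1, D, H, W) boolean lung mask.
    Returns (K, 3) voxel indices of local distinctiveness maxima.
    """
    g = torch.gradient(img[0, 0])                                  # first-order gradients (D, H, W) each
    # structure tensor entries, smoothed by average pooling as a stand-in
    # for the Gaussian window of the reference implementation
    entries = []
    for i in range(3):
        for j in range(3):
            e = (g[i] * g[j])[None, None]
            entries.append(F.avg_pool3d(e, window, stride=1, padding=window // 2))
    S = torch.stack(entries, dim=-1).reshape(*img.shape, 3, 3)     # (1, 1, D, H, W, 3, 3)
    # Förstner-style distinctiveness: 1 / trace(S^{-1}), regularized for flat regions
    dist = 1.0 / torch.diagonal(
        torch.linalg.inv(S + 1e-6 * torch.eye(3)), dim1=-2, dim2=-1).sum(-1)
    if mask is not None:
        dist = dist * mask
    # keep only local maxima within a (5x5x5) neighborhood
    is_max = dist == F.max_pool3d(dist, 5, stride=1, padding=2)
    return torch.nonzero(is_max[0, 0] & (dist[0, 0] > 0))
```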
CNN keypoints
We also perform fissure-specific KP extraction using a lightweight 3D CNN trained for pre-segmentation, which lets our method efficiently incorporate the dense image representation. We choose MobileNetV3-Large [18] as the CNN architecture and modify it by replacing 2D convolutional layers with 3D convolutions, keeping kernel sizes and channel dimensions the same. For pre-segmentation, we need a high recall of fissure points in this strongly imbalanced segmentation task. Therefore, during training we weight the cross-entropy loss with the false negative rate per class in each batch. This pushes the segmentation toward high recall while tolerating a loss in precision, effectively over-segmenting the fissures. We choose K points at random from the predicted fissure points in the segmentation map as the KP cloud. See Online Resource 1 for more details on the network architecture and training procedure. To reduce memory requirements, we apply the network to patches of size \((128\times 128\times 128)\) with at least 50 % overlap, as in [2].
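A plausible formulation of this recall-oriented weighting is sketched below; the exact scheme (here a per-class weight of one plus the batch false negative rate) is an assumption for illustration, since the precise formula is detailed in Online Resource 1.

```python
import torch
import torch.nn.functional as F

def fn_weighted_cross_entropy(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Cross-entropy with per-class weights derived from the false negative rate of the
    current batch, pushing the pre-segmentation toward high recall.

    logits: (B, C, D, H, W) raw network outputs.
    target: (B, D, H, W) integer class labels.
    This sketches the idea described in the text, not the authors' exact code.
    """
    num_classes = logits.shape[1]
    pred = logits.argmax(dim=1)
    weights = torch.ones(num_classes, device=logits.device)
    for c in range(num_classes):
        positives = (target == c)
        if positives.any():
            fn = (positives & (pred != c)).sum().float()
            weights[c] = 1.0 + fn / positives.sum().float()   # weight grows with missed voxels
    return F.cross_entropy(logits, target, weight=weights)
```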
Point cloud segmentation networks
The point cloud segmentation network decides the fissure or background label for each candidate point based on the shape and image information. There are different paradigms of GDL for such networks, and we compare their applicability in the medical context. PointNets [4] can be universal function approximators for point sets [19]; however, they do not take point neighborhoods into account. Graph convolutional networks (GCNs) extend PointNets with convolutions on neighborhood graphs, which facilitates local information propagation on irregular point clouds [5]. More recently, the self-attention operator from transformer networks was adopted into point cloud processing architectures [6], allowing even more expressive information exchange in point neighborhoods. We choose a representative from each of the three paradigms, as described in the following.
PointNet
PointNet [4] consists of per-point feature extraction with shared multi-layer perceptrons (MLPs) followed by a global max-pooling operation. The segmentation network then concatenates the global feature with point features and uses more MLPs to produce a point segmentation. The symmetric max-pooling function makes the network permutation-equivariant [4]. To stabilize training and since the structures of interest in the image are already roughly aligned, we omit the spatial transformer (T-Net) from PointNet.
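A minimal PyTorch sketch of this architecture is given below; the layer widths and the absence of normalization layers are simplifications for illustration, not the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class PointNetSeg(nn.Module):
    """Minimal PointNet-style segmentation head (no T-Net), assuming inputs of shape
    (B, C_in, N) with coordinates and image features already concatenated per point."""

    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.point_mlp = nn.Sequential(            # shared per-point MLPs (1x1 convolutions)
            nn.Conv1d(in_channels, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.ReLU())
        self.seg_mlp = nn.Sequential(              # per-point classifier on [local, global] features
            nn.Conv1d(1024 + 1024, 256, 1), nn.ReLU(),
            nn.Conv1d(256, num_classes, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local_feat = self.point_mlp(x)                             # (B, 1024, N)
        global_feat = local_feat.max(dim=2, keepdim=True).values   # symmetric max pooling
        global_feat = global_feat.expand(-1, -1, x.shape[2])       # broadcast global feature to every point
        return self.seg_mlp(torch.cat([local_feat, global_feat], dim=1))  # (B, num_classes, N)
```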
Dynamic Graph CNN (DGCNN)
The DGCNN [5] is a GCN that replaces PointNet’s MLPs with the EdgeConv graph convolution while keeping the architecture very similar. EdgeConv operates on point neighborhoods, extracting edge features that are combined with local point features. It uses the k-Nearest-Neighbor (kNN) graph \(\mathcal{G}\), which we construct once from \(\varvec{x}\) with \(k=40\). Note that we ignore the image features for graph construction and keep \(\mathcal{G}\) fixed for all EdgeConv layers. We also omit the T-Net from DGCNN.
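The following sketch shows a single EdgeConv layer on a precomputed kNN graph; the MLP depth and the helper names are illustrative.

```python
import torch
import torch.nn as nn

class EdgeConv(nn.Module):
    """Single EdgeConv layer on a fixed kNN graph: for each edge (i, j) a shared MLP sees
    [x_i, x_j - x_i]; edge features are max-aggregated per point. The graph is built once
    from the coordinates (k = 40 in the text) and reused in every layer."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * in_channels, out_channels), nn.ReLU())

    def forward(self, x: torch.Tensor, knn_idx: torch.Tensor) -> torch.Tensor:
        # x: (N, C) point features, knn_idx: (N, k) neighbor indices
        neighbors = x[knn_idx]                                   # (N, k, C)
        center = x[:, None, :].expand_as(neighbors)              # (N, k, C)
        edge_feat = self.mlp(torch.cat([center, neighbors - center], dim=-1))
        return edge_feat.max(dim=1).values                       # (N, C_out)

def build_knn_graph(coords: torch.Tensor, k: int = 40) -> torch.Tensor:
    """kNN indices from coordinates only (image features are ignored for graph construction)."""
    dists = torch.cdist(coords, coords)                          # (N, N) pairwise distances
    return dists.topk(k + 1, largest=False).indices[:, 1:]       # drop the point itself
```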
PointTransformer
The PointTransformer [6] combines the self-attention operation from Transformers with graph-based local processing. PointTransformer applies self-attention in kNN neighborhoods with \(k=16\). For more expressive local feature extraction, vector attention is chosen instead of scalar dot-product attention [6]. A parametrized position encoding is added to the attention vector as well as to the local point features, allowing the shape information to inform both the attention weights and the resulting representation. The segmentation model of PointTransformer follows a U-Net structure with skip connections between a contracting and a mirrored expanding path.
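A sketch of one vector-attention layer is shown below; the linear projections and the two-layer MLPs for the position encoding and the attention weights follow the general PointTransformer formulation, but the exact dimensions are assumptions.

```python
import torch
import torch.nn as nn

class VectorAttention(nn.Module):
    """PointTransformer-style vector attention in a kNN neighborhood (k = 16 in the text):
    attention weights are vectors (one per channel), and a learned position encoding is
    added both to the attention logits and to the aggregated values."""

    def __init__(self, channels: int):
        super().__init__()
        self.to_q = nn.Linear(channels, channels)
        self.to_k = nn.Linear(channels, channels)
        self.to_v = nn.Linear(channels, channels)
        self.pos_enc = nn.Sequential(nn.Linear(3, channels), nn.ReLU(), nn.Linear(channels, channels))
        self.attn_mlp = nn.Sequential(nn.Linear(channels, channels), nn.ReLU(), nn.Linear(channels, channels))

    def forward(self, feats: torch.Tensor, coords: torch.Tensor, knn_idx: torch.Tensor) -> torch.Tensor:
        # feats: (N, C) point features, coords: (N, 3), knn_idx: (N, k) neighbor indices
        q = self.to_q(feats)[:, None, :]                                # (N, 1, C)
        k = self.to_k(feats)[knn_idx]                                   # (N, k, C)
        v = self.to_v(feats)[knn_idx]                                   # (N, k, C)
        rel_pos = self.pos_enc(coords[:, None, :] - coords[knn_idx])    # (N, k, C) position encoding
        attn = torch.softmax(self.attn_mlp(q - k + rel_pos), dim=1)     # vector attention weights
        return (attn * (v + rel_pos)).sum(dim=1)                        # (N, C)
```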
Implementation details
Independent of the model architecture, we randomly sample \(N=2048\) of the K points for each forward pass during training. The network then outputs a point segmentation from the point coordinates \(\varvec{x}\in \mathbb{R}^{N\times 3}\) concatenated with the features \(\varvec{f}\in \mathbb{R}^{N\times 125}\). The loss function is the combined cross-entropy and Dice loss from [2]. Models are trained for 1000 epochs with the Adam optimizer [20], learning rate \(\eta =0.001\), and weight decay of \(10^\). The learning rate is successively lowered to \(\eta \cdot 0.05\) in the last epoch using a cosine annealing schedule [21]. Since the networks are not translation-, rotation-, or scale-equivariant, we apply random rigid data augmentation to the coordinates in \(\varvec{x}\). For inference, we run the forward pass 50 times with different random selections of N points and accumulate the segmentation scores, ensuring all K points are segmented.
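The inference-time accumulation can be sketched as follows, assuming a model that maps a (1, C_in, N) input to (1, num_classes, N) scores; the number of classes and the averaging of softmax scores are illustrative assumptions.

```python
import torch

@torch.no_grad()
def segment_full_cloud(model, coords, feats, n_points: int = 2048, passes: int = 50):
    """Accumulate softmax scores over repeated random subsets so that (almost surely)
    all K points receive a prediction; names and averaging scheme are illustrative."""
    K = coords.shape[0]
    num_classes = 4                                     # background + three fissures (assumed)
    scores = torch.zeros(K, num_classes)
    counts = torch.zeros(K)
    for _ in range(passes):
        idx = torch.randperm(K)[:n_points]
        inp = torch.cat([coords[idx], feats[idx]], dim=-1).t()[None]   # (1, C_in, n_points)
        scores[idx] += torch.softmax(model(inp)[0].t(), dim=-1)
        counts[idx] += 1
    # with 50 passes of 2048 out of ~20000 points, unseen points are very unlikely
    return (scores / counts.clamp(min=1)[:, None]).argmax(dim=-1)
```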
Mesh reconstruction
The point segmentation networks output sparse point clouds representing the target objects. We need to reconstruct dense fissure surfaces to use them as lobar boundaries in downstream image analysis. The boundaries do not contain any relevant volume [1]. Therefore, topologically, we model the fissures as open surfaces with a single boundary component and construct our meshes accordingly.
Poisson surface reconstruction (PSR)
A state-of-the-art technique for mesh reconstruction is PSR [7]. It solves the Poisson equation for the indicator function of the object implicitly described by a point cloud. The equation is based on point normals interpreted as samples of the indicator function's gradient. A schematic overview of PSR is shown in Fig. 1. Point normals are estimated with principal component analysis and consistently oriented using the Open3D library. PSR solves the Poisson equation on an underlying octree structure [7]. We set the octree depth hyperparameter to 6, striking a balance between smoothness and resolution of the resulting triangle mesh. Finally, we remove triangles with vertices outside the lung mask and keep only the largest connected component of the mesh. This last step produces an open surface matching the fissure topology; without the post-processing, PSR reconstructs a closed, watertight surface.
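With Open3D, the reconstruction and post-processing steps can be sketched roughly as follows; `lung_mask_fn` is a hypothetical callable that tests whether vertex coordinates lie inside the lung mask, and the neighborhood sizes for normal estimation are illustrative.

```python
import numpy as np
import open3d as o3d

def reconstruct_fissure_mesh(points: np.ndarray, lung_mask_fn, depth: int = 6):
    """PSR with post-processing, roughly following the steps described in the text."""
    pcd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(points))
    pcd.estimate_normals(o3d.geometry.KDTreeSearchParamKNN(knn=30))   # PCA-based normal estimation
    pcd.orient_normals_consistent_tangent_plane(30)                   # consistent normal orientation
    mesh, _ = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=depth)
    # remove triangles whose vertices lie outside the lung mask
    outside = ~lung_mask_fn(np.asarray(mesh.vertices))
    mesh.remove_vertices_by_mask(outside)
    # keep only the largest connected component
    tri_clusters, cluster_sizes, _ = mesh.cluster_connected_triangles()
    largest = int(np.argmax(np.asarray(cluster_sizes)))
    mesh.remove_triangles_by_mask(np.asarray(tri_clusters) != largest)
    mesh.remove_unreferenced_vertices()
    return mesh
```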
Point cloud to mesh autoencoder (PC-AE)
Previously, PSR mesh reconstruction made up most of the inference time in our pipeline [3]. To speed this up, we propose a PC-AE for learned mesh reconstruction as shown in Fig. 1. Apart from a shorter runtime, a shape model can be an effective anatomical prior for medical deep learning [22]. The architecture is inspired by FoldingNet [23], which uses a PointNet encoder and two folding operations for decoding coordinates. We follow [24] in using a DGCNN [5] as the encoder. The global feature vector is interpreted as the latent representation. For the decoder, we found that predicting 3D deformations of a template mesh as in [12, 13] yielded much better results than the folding operation. A benefit of our approach is that we can define an initial mesh homeomorphic to our target structure; in this case, we choose the plane mesh shown in Fig. 2a. As a side benefit, the deformed meshes of different inputs contain corresponding vertices, as illustrated in Fig. 2.
Fig. 2 A template mesh (a) is deformed by our point cloud autoencoder to fit an input point cloud (b–d). The color coding illustrates correspondences between the reconstructed meshes
We adopt \(N=2048\) input points and the latent vector \(\varvec{z}\in \mathbb{R}^h\) with \(h=512\) from [23]. The template mesh to be deformed by the decoder is a triangle mesh with M vertices, M being the closest square number to N. Its vertices \(\varvec{V}^{0}=(\varvec{x},\varvec{y},\varvec{z})\in \mathbb{R}^{M\times 3}\) are bilinearly sampled with \(x,y\in [-0.3,0.3]\). The decoder takes M copies of \(\varvec{z}\) in \(\varvec{Z}=(\varvec{z},\ldots ,\varvec{z})^\textrm{T}\in \mathbb{R}^{M\times h}\) concatenated with \(\varvec{V}^{0}\). The two deforming steps are shared MLP layers \(f_1, f_2:\mathbb{R}^{h+3}\rightarrow \mathbb{R}^3\) that predict residual displacements of \(\varvec{V}^{0}\)
$$\begin{aligned} \varvec{V}^{1}&= \varvec{V}^{0} + f_1(\varvec{Z},\varvec{V}^{0}) \\ \varvec{V}^{2}&= \varvec{V}^{1} + f_2(\varvec{Z},\varvec{V}^{1}). \end{aligned}$$
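A minimal sketch of this deformation decoder is given below; the hidden layer widths are illustrative, and the residual structure follows the two equations above.

```python
import torch
import torch.nn as nn

class DeformDecoder(nn.Module):
    """Two-step template-deformation decoder: each shared MLP predicts residual 3D
    displacements from the latent code concatenated with the current vertex positions.
    Layer widths are illustrative, not the authors' exact configuration."""

    def __init__(self, latent_dim: int = 512, hidden: int = 256):
        super().__init__()
        def block():
            return nn.Sequential(
                nn.Linear(latent_dim + 3, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 3))
        self.f1, self.f2 = block(), block()

    def forward(self, z: torch.Tensor, template_vertices: torch.Tensor) -> torch.Tensor:
        # z: (h,) latent code, template_vertices: (M, 3) planar template mesh vertices
        Z = z[None, :].expand(template_vertices.shape[0], -1)            # M copies of z
        v1 = template_vertices + self.f1(torch.cat([Z, template_vertices], dim=-1))
        v2 = v1 + self.f2(torch.cat([Z, v1], dim=-1))
        return v2                                                        # deformed vertices, (M, 3)
```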
We train the PC-AE with point clouds randomly sampled from our ground truth fissure meshes. Here, we make no distinction between the three fissures, so the network models the shape of all of them. As the training objective, we adopt the regularized mesh loss from [12] with the chamfer distance (CD) as the reconstruction loss and multiple regularization terms: normal consistency (NC), edge length (EL), and Laplacian smoothness (LS). The weights for the terms are \(w_\textrm{CD}=1\), \(w_\textrm{NC}=0.1\), \(w_\textrm{EL}=1\), and \(w_\textrm{LS}=0.1\) (cf. Online Resource 1 for details). For inference in our pipeline, N points are sampled from an input point cloud using farthest point sampling. To reconstruct the three fissures, we perform three separate forward passes.
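Using PyTorch3D as a stand-in, such a training objective could be assembled as sketched below; the choice of library, the number of sampled points, and the uniform Laplacian weighting are assumptions, not necessarily the configuration of [12].

```python
import numpy as np
import torch
from pytorch3d.loss import (chamfer_distance, mesh_edge_loss,
                            mesh_laplacian_smoothing, mesh_normal_consistency)
from pytorch3d.ops import sample_points_from_meshes
from pytorch3d.structures import Meshes

def regularized_mesh_loss(pred_vertices, faces, target_points,
                          w_cd=1.0, w_nc=0.1, w_el=1.0, w_ls=0.1):
    """Chamfer distance between points sampled from the predicted mesh and the ground
    truth cloud, plus the three regularizers (a sketch of the objective described above)."""
    mesh = Meshes(verts=[pred_vertices], faces=[faces])
    sampled = sample_points_from_meshes(mesh, num_samples=2048)      # (1, 2048, 3)
    cd, _ = chamfer_distance(sampled, target_points[None])
    return (w_cd * cd
            + w_nc * mesh_normal_consistency(mesh)
            + w_el * mesh_edge_loss(mesh)
            + w_ls * mesh_laplacian_smoothing(mesh, method="uniform"))
```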
Data and experiments
We choose the TotalSegmentator data set [25] for our experiments. It comprises CT images from clinical practice with various pathologies and semi-automatic segmentations, including pulmonary lobe labels. We select the 380 images that contain the lungs in their entirety. Fissure annotations are computed by finding voxels at the interface of two neighboring lobes. Ground truth fissure meshes are computed by first performing morphological binary thinning of the label maps [26, Ch. 9.5.5] and then applying PSR to the fissure voxels, viewed as a point cloud, with the procedure described in the “Poisson surface reconstruction (PSR)” section.
We perform a fivefold cross-validation of our pipeline in its different configurations. The results are compared to a powerful 3D-CNN trained in the nnU-Net framework [2], the current gold standard for medical image segmentation. We choose the 3D U-Net configuration and train it for 200 epochs. To create a common fissure representation with our pipeline, meshes are reconstructed from the predicted label maps by applying binary thinning followed by PSR. This allows us to compute surface distances between ground truth and predicted meshes. We report the average symmetric surface distance (ASSD), the standard deviation of surface distances (SDSD), and the Hausdorff distance (HD); definitions of the metrics are given in Online Resource 1. We further validate the generalization ability of our models on a data set of COPD patients.
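Given two point samples drawn from the predicted and ground truth surfaces, the three metrics can be computed as sketched below (a simplified illustration; the exact definitions follow Online Resource 1).

```python
import numpy as np
from scipy.spatial import cKDTree

def surface_distances(pred_pts: np.ndarray, gt_pts: np.ndarray):
    """ASSD, SDSD, and HD from two surface point samples, e.g., mesh vertices or
    points sampled on the meshes."""
    d_pred_to_gt = cKDTree(gt_pts).query(pred_pts)[0]       # nearest-neighbor distances
    d_gt_to_pred = cKDTree(pred_pts).query(gt_pts)[0]
    all_d = np.concatenate([d_pred_to_gt, d_gt_to_pred])
    assd = all_d.mean()                                     # average symmetric surface distance
    sdsd = all_d.std()                                      # standard deviation of surface distances
    hd = max(d_pred_to_gt.max(), d_gt_to_pred.max())        # Hausdorff distance
    return assd, sdsd, hd
```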
Table 1 Cross-validation of point segmentation networks compared to nnU-Net
Fig. 3 Qualitative results of our point-based pipeline with the PointTransformer (b–d) compared to the voxel-based nnU-Net (e); (a) shows ground truth meshes and the input image. Shown is case #70 from the TotalSegmentator data set. Top: reconstructed meshes, with our PC-AE in (d) and PSR otherwise. Bottom left: sagittal slices of the left lung with left oblique fissure overlay (red). Bottom right: right lung with right oblique (green) and right horizontal fissure (blue) overlays
We gauge the efficiency of all parts of our pipeline by measuring the average inference time. All models, the KP extraction, and the PC-AE are implemented in PyTorch 2.2.0 and use GPU acceleration with CUDA 12.1. Our test hardware is one NVIDIA A100 80 GB GPU and an AMD EPYC 7713P CPU. Note that in our previous study [3], we measured inference times on an NVIDIA RTX 2080Ti GPU with 11 GB, which is sufficient memory to run all experiments presented here. For hardware-agnostic comparison of the models, we also provide the number of multiply-accumulate (MAC) operations per forward pass.
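Inference times are measured along the lines of the following sketch; the warm-up and repetition counts are illustrative, and explicit CUDA synchronization ensures that asynchronous kernel launches are not mistaken for completed computation.

```python
import time
import torch

@torch.no_grad()
def average_inference_time(model, example_input, repeats: int = 50, warmup: int = 5) -> float:
    """Average GPU inference time in seconds for one forward pass."""
    model.eval()
    for _ in range(warmup):                 # warm-up passes (CUDA context, caching allocator)
        model(example_input)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        model(example_input)
    torch.cuda.synchronize()                # wait for all queued kernels before stopping the clock
    return (time.perf_counter() - start) / repeats
```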