Deformable registration of magnetic resonance images using unsupervised deep learning in neuro-/radiation oncology

The ConvUNet-DIR framework (Fig. 1) proposed in this study estimates the optimal parameterized mapping function \(\phi\) between a baseline pre-operative MR image (fixed image, \(I_f\)) and the follow-up MR image (moving image, \(I_m\)). The \(\phi\) is a nonlinear voxel-wise correspondence between \(I_f\) and \(I_m\), so that the deformed/warped image (\(I_m\circ \phi\)) obtained from \(I_m\) is registered to \(I_f\). Here, the global mapping function \(\phi \left(x\right) = x + s\left(x\right)\) is formed by an identity spatial transform and a deformation field \(s\). Once the framework has been trained, the deformation field \(s\) can be obtained directly for a new pair of MRI scans.

Fig. 1 The framework of our unsupervised convolutional U-Net based deformable image registration (ConvUNet-DIR) of pre-operative and follow-up magnetic resonance images of glioma patients

Dataset

A multi-institutional dataset (n = 160) of multi-parametric MRI scans, representing the training and validation sets of the Brain Tumor Sequence Registration (BraTS-Reg 2022) challenge, was used in this study [23]. The patients were diagnosed with glioma and clinically scanned with a multi-parametric MRI acquisition protocol comprising T1-weighted (T1), contrast-enhanced T1 (T1-ce), T2, and T2 Fluid Attenuated Inversion Recovery (T2-FLAIR, or simply FLAIR) sequences. The images were acquired at two time points: pre-operative (treatment-naïve) and follow-up, with the follow-up scans obtained 27 days to 37 weeks after the baseline scans. The images were already pre-processed by the challenge organizers: all scans were first transformed into the same coordinate system and rigidly co-registered to the same anatomical coordinate space using the greedy diffeomorphic registration algorithm [24], then skull-stripped by extracting the brain tissue and resampled to 240 × 240 × 155 voxels with 1 mm³ isotropic resolution. The brain extraction was performed using the Brain Mask Generator, a deep learning-based algorithm [25]; specifically, non-cerebral tissues such as the skull, scalp, and dura were removed from all MRI scans.

Preprocessing

We applied several preprocessing steps to the multi-parametric MRI data before using it to train the proposed model. First, we cropped the volumes to 224 × 224 × 155 by excluding peripheral voxels containing no information. Then, we resized the data to 128 × 128 × 128. Next, the image data were normalized to zero mean and unit variance and then scaled to the [0, 1] range. Accurate registration requires the input MRI scans to be normalized (i.e., voxel intensities ranging from 0 to 1) to produce consistent results across images acquired with different scanners and imaging protocols. Finally, we randomly split the data into a training set of 64% (n = 102), a validation set of 16% (n = 26), and a test set of 20% (n = 32).
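As an illustration of this pipeline, the sketch below applies the crop, resize, and normalization steps to a single volume with NumPy and SciPy. The exact crop offsets (a centered in-plane crop is assumed here) and the interpolation order are not specified in the text and are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess_volume(vol):
    """Crop, resize, and normalize one MRI volume of shape 240 x 240 x 155."""
    # Center-crop the in-plane dimensions to 224 x 224 (drop empty borders).
    vol = vol[8:232, 8:232, :]                       # -> 224 x 224 x 155

    # Resample to the network input size of 128 x 128 x 128.
    factors = [128 / s for s in vol.shape]
    vol = zoom(vol, factors, order=1)                # (tri)linear resampling

    # Zero-mean / unit-variance normalization ...
    vol = (vol - vol.mean()) / (vol.std() + 1e-8)

    # ... followed by min-max scaling to the [0, 1] range.
    vol = (vol - vol.min()) / (vol.max() - vol.min() + 1e-8)
    return vol.astype(np.float32)
```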

CNN architecture

Our proposed ConvUNet-DIR framework uses U-Net [22] as its core (Fig. 2). It takes a two-channel 3D image, the concatenation of \(I_f\) and \(I_m\), as input; the input size of the network is 128 × 128 × 128 × 2. The convolutional network consists of an encoder with four downsampling layers, a bottleneck (bridge), and a decoder with four upsampling layers. In the encoder, 3D convolutional layers (3 × 3 × 3 kernel size and 2 × 2 × 2 stride) with LeakyReLU activation extract features and halve the spatial resolution of the volumes at each resolution level. The bottleneck at the base of the model is composed of two convolutional layers (3 × 3 × 3 kernel size and 1 × 1 × 1 stride). The decoder is made up of four upsampling convolutional layers (3 × 3 × 3 kernel size and 1 × 1 × 1 stride) with LeakyReLU activation to reconstruct the deformation field. The encoder and decoder are connected via skip connections to combine information at different spatial levels. Three additional convolutional layers (3 × 3 × 3 kernel size and 1 × 1 × 1 stride) with LeakyReLU activation are added after the decoder. The final layer is a convolutional layer (3 × 3 × 3 kernel size and 1 × 1 × 1 stride) with a linear activation used for regression output; the linear activation allows both positive and negative values in the deformation field. The output of the network is a deformation field \(s\) of size 128 × 128 × 128 × 3.

Fig. 2 The convolutional U-Net-style architecture. Numbers at the bottom of the blocks represent the spatial resolution ratio of each volume to the input volume. Numbers inside the blocks indicate the number of extracted features. Arrows denote different operations
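A minimal Keras sketch of such an encoder-decoder is given below. The kernel sizes, strides, skip connections, refinement convolutions, and the 3-channel linear output follow the description above, whereas the per-layer filter counts (shown only in Fig. 2) and the LeakyReLU slope are assumptions chosen to resemble common VoxelMorph-style settings.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_block(x, filters, strides):
    # 3 x 3 x 3 convolution followed by LeakyReLU (slope 0.2 is an assumption).
    x = layers.Conv3D(filters, kernel_size=3, strides=strides, padding="same")(x)
    return layers.LeakyReLU(0.2)(x)

def build_convunet_dir(input_shape=(128, 128, 128, 2),
                       enc_filters=(16, 32, 32, 32),      # assumed filter counts
                       dec_filters=(32, 32, 32, 32),
                       extra_filters=(16, 16, 16)):
    inputs = layers.Input(shape=input_shape)               # concatenated (I_f, I_m)

    # Encoder: four stride-2 convolutions halve the resolution at each level.
    skips, x = [], inputs
    for f in enc_filters:
        skips.append(x)
        x = conv_block(x, f, strides=2)

    # Bottleneck: two stride-1 convolutions at the coarsest resolution.
    x = conv_block(x, enc_filters[-1], strides=1)
    x = conv_block(x, enc_filters[-1], strides=1)

    # Decoder: upsample, concatenate the skip connection, convolve.
    for f, skip in zip(dec_filters, reversed(skips)):
        x = layers.UpSampling3D(size=2)(x)
        x = layers.Concatenate()([x, skip])
        x = conv_block(x, f, strides=1)

    # Three refinement convolutions at full resolution.
    for f in extra_filters:
        x = conv_block(x, f, strides=1)

    # Final regression layer: 3-channel deformation field with linear activation.
    s = layers.Conv3D(3, kernel_size=3, padding="same", activation=None)(x)
    return Model(inputs, s, name="ConvUNet_DIR")
```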

Spatial transformation layer

The spatial transformation layer computes \(I_m\circ \phi\). For each voxel in the space of \(I_f\), the corresponding position in \(I_m\) is determined by \(\phi\), which is then used to warp \(I_m\) and obtain \(I_m\circ \phi\). To keep the spatial transformation layer differentiable, we used linear interpolation to estimate the voxel values of \(I_m\) at the warped x, y, and z coordinates. As a result, back-propagation of errors through the network is possible during training.
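The warping operation itself can be illustrated with the NumPy/SciPy snippet below, which samples the moving image at the displaced coordinates \(\phi(x) = x + s(x)\) using linear interpolation. This is only an illustration; the actual layer in the network performs the same lookup with differentiable TensorFlow operations so that gradients can propagate through it.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def warp_volume(moving, deformation):
    """Warp `moving` (D, H, W) with a deformation field of shape (D, H, W, 3)."""
    grid = np.meshgrid(np.arange(moving.shape[0]),
                       np.arange(moving.shape[1]),
                       np.arange(moving.shape[2]), indexing="ij")
    # phi(x) = x + s(x): sample the moving image at the displaced coordinates.
    coords = [g + deformation[..., i] for i, g in enumerate(grid)]
    # order=1 selects (tri)linear interpolation of the moving-image intensities.
    return map_coordinates(moving, coords, order=1, mode="nearest")
```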

Loss function

We implemented an unsupervised loss function (\(Loss\)) that evaluates the model using only \(I_f\) and \(I_m\circ \phi\). The loss function consists of a similarity loss (\(L_{sim}\)) and a regularization term (\(L_{smooth}\)) that penalizes non-smooth deformations to produce smooth registration fields. \(L_{sim}\) penalizes differences in appearance, whereas \(L_{smooth}\) penalizes the first derivative of \(s\) to produce a fine registration field. The \(Loss\) is defined as:

$$Loss\left(I_f, I_m\circ \phi \right)=L_{sim}\left(I_f, I_m\circ \phi \right)+\lambda \, L_{smooth}\left(s\right)$$

where \(\lambda\) is a regularization parameter.

The \(L_{sim}\) was set to the negative local normalized cross-correlation (NCC) coefficient [26] of \(I_f\) and \(I_m\circ \phi\). This intensity-based similarity measure has been found to be optimal for single-/mono-modality image registration, where the image pair shares a similar intensity distribution [2]. The \(NCC\) coefficients were calculated over local windows of 9 × 9 × 9 voxels. The \(L_{sim}\) is defined as:

$$L_{sim}\left(I_f, I_m\circ \phi \right)=-NCC\left(I_f, I_m\circ \phi \right)$$
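A sketch of such a windowed NCC loss in TensorFlow is shown below. It follows the commonly used VoxelMorph-style formulation, in which local sums over a 9 × 9 × 9 window are computed by convolving with a box filter of ones; the exact implementation used in the study may differ.

```python
import tensorflow as tf

def local_ncc_loss(fixed, warped, win=9, eps=1e-5):
    """Negative local NCC between two volumes of shape [batch, D, H, W, 1]."""
    ones = tf.ones([win, win, win, 1, 1])
    strides = [1, 1, 1, 1, 1]

    def win_sum(x):
        # Sum over the local 9 x 9 x 9 neighborhood of every voxel.
        return tf.nn.conv3d(x, ones, strides, "SAME")

    n = win ** 3
    I_sum, J_sum = win_sum(fixed), win_sum(warped)
    I2_sum, J2_sum = win_sum(fixed * fixed), win_sum(warped * warped)
    IJ_sum = win_sum(fixed * warped)

    u_I, u_J = I_sum / n, J_sum / n
    cross = IJ_sum - u_J * I_sum - u_I * J_sum + u_I * u_J * n
    I_var = I2_sum - 2 * u_I * I_sum + u_I * u_I * n
    J_var = J2_sum - 2 * u_J * J_sum + u_J * u_J * n

    cc = (cross * cross) / (I_var * J_var + eps)
    return -tf.reduce_mean(cc)
```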

Minimizing \(L_{sim}\) without applying constraints can lead to an \(I_m\circ \phi\) with unrealistic organ appearances. Obtaining a smooth \(s\) therefore requires a diffusion regularizer on the spatial gradients (\(\nabla\)) of the deformation \(s\), as follows:

$$L_{smooth}\left(s\right)=\sum_{p\in \Omega }{\left\| \nabla s\left(p\right)\right\| }^{2},$$

where \(\Omega\) denotes the set of all voxel locations.

The spatial gradients were approximated using differences between neighboring voxels.
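A corresponding TensorFlow sketch of this diffusion regularizer, approximating the spatial gradients of \(s\) with forward differences between neighboring voxels along each axis, could look as follows:

```python
import tensorflow as tf

def smoothness_loss(s):
    """Mean squared forward-difference gradients of a field [batch, D, H, W, 3]."""
    dz = s[:, 1:, :, :, :] - s[:, :-1, :, :, :]   # differences along depth
    dy = s[:, :, 1:, :, :] - s[:, :, :-1, :, :]   # differences along height
    dx = s[:, :, :, 1:, :] - s[:, :, :, :-1, :]   # differences along width
    return (tf.reduce_mean(dz ** 2)
            + tf.reduce_mean(dy ** 2)
            + tf.reduce_mean(dx ** 2))
```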

Training and validation

The ConvUNet-DIR model was trained in an unsupervised manner for deformable registration of pairs of MRI scans (\(I_f\), \(I_m\)) with volumes of 128 × 128 × 128 on the training set (n = 102). The Adam optimizer, implementing a gradient descent approach, was used to optimize the model with a learning rate of 0.0001. The regularization parameter \(\lambda\) was set to 1; we assessed different settings, including 0 (no regularization), 0.1, 0.2, 0.5, and 1, and \(\lambda\) = 1 was found to work best for our task. The model was trained for 150 epochs with a batch size of 1, as our GPU memory did not permit a larger batch size. During network training, each pair of MRI scans is concatenated into a 2-channel 3D image and fed into the 3D U-Net, and the deformation field \(s\) is computed through the convolutional layers of the U-Net. The spatial transformation layer then warps \(I_m\) into \(I_m\circ \phi\) using linear resampling. The network parameters are updated during training by minimizing the dissimilarity between \(I_f\) and \(I_m\circ \phi\), while \(s\) is penalized by the regularization term to encourage smoothness (i.e., to regularize the predicted deformation). During training, the model was validated on the validation set (n = 26) to assess its generalizability and to update the hyper-parameters.
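Putting the pieces together, a schematic training step under these settings (Adam with a learning rate of 0.0001, \(\lambda\) = 1, batch size 1, 150 epochs) might look like the sketch below. Here `spatial_transform` stands for the differentiable warping layer described earlier and `train_dataset` for a data pipeline yielding fixed/moving pairs; both are assumed helpers rather than code from the study, as are `build_convunet_dir`, `local_ncc_loss`, and `smoothness_loss` from the previous sketches.

```python
import tensorflow as tf

model = build_convunet_dir()
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
lam = 1.0                                   # regularization weight lambda

@tf.function
def train_step(fixed, moving):
    with tf.GradientTape() as tape:
        # Concatenate the pair into a 2-channel volume and predict s.
        s = model(tf.concat([fixed, moving], axis=-1), training=True)
        warped = spatial_transform(moving, s)           # I_m o phi
        loss = local_ncc_loss(fixed, warped) + lam * smoothness_loss(s)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# 150 epochs with batch size 1 over the 102 training pairs.
for epoch in range(150):
    for fixed, moving in train_dataset:                 # assumed tf.data pipeline
        loss = train_step(fixed, moving)
```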

The training was performed using the Keras API (version 2.10) with TensorFlow (version 2.10) as the backend in Python (version 3.10, Python Software Foundation, Wilmington, DE, USA) on an NVIDIA Quadro M1200 4 GB GPU. The trained model takes less than 2 s to register a pair of MRI scans for a new patient, making its deployment in clinical practice feasible.

Evaluation

The model was assessed on the registration of pairs of \(I_f\) and \(I_m\) from the test set (n = 32). One of the metrics used is the Dice similarity coefficient, which estimates the volume overlap of the brain fields determined using the generated brain masks. The Dice score of the warped brain mask of the moving image (\(M_m\circ \phi\)) and the fixed brain mask (\(M_f\)) of \(I_f\) is calculated as follows:

$$Dice\left(M_m\circ \phi ,M_f\right)=\frac{2\left|M_m\circ \phi \cap M_f\right|}{\left|M_m\circ \phi \right|+\left|M_f\right|}$$
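For reference, the Dice overlap of two binary brain masks can be computed as in the short NumPy sketch below.

```python
import numpy as np

def dice_score(warped_mask, fixed_mask, eps=1e-8):
    """Dice similarity coefficient of two binary brain masks."""
    warped_mask = warped_mask.astype(bool)
    fixed_mask = fixed_mask.astype(bool)
    intersection = np.logical_and(warped_mask, fixed_mask).sum()
    return 2.0 * intersection / (warped_mask.sum() + fixed_mask.sum() + eps)
```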

Another metric used to evaluate our proposed method is the structural similarity index (SSIM). This metric simulates the human-perceived quality of images by comparing two images. Mathematically, it is defined as:

$$SSIM\left(x,y\right)=\frac{\left(2\mu _x\mu _y+c_1\right)\left(2\sigma _{xy}+c_2\right)}{\left(\mu _x^2+\mu _y^2+c_1\right)\left(\sigma _x^2+\sigma _y^2+c_2\right)},$$

where \(\mu _x\) and \(\mu _y\) are the mean image intensities, \(\sigma _x^2\) and \(\sigma _y^2\) are the image variances, \(\sigma _{xy}\) is the covariance of the fixed \(\left(x\right)\) and moving \(\left(y\right)\) images, and \(c_1\) and \(c_2\) are constants added to stabilize the division when the denominator is weak. The performance of the ConvUNet-DIR model was also compared with the open-source VoxelMorph (VM1 and VM2) methods [7], which are deep learning-based algorithms. For a fair comparison, we trained both versions of VoxelMorph from scratch on the same BraTS-Reg dataset.
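As an illustration, the SSIM between a fixed and a warped volume scaled to [0, 1] can be computed with scikit-image's reference implementation; whether this particular implementation was used in the study is not stated.

```python
from skimage.metrics import structural_similarity

# `fixed` and `warped` are 3-D NumPy volumes scaled to [0, 1], so data_range
# is set to 1.0 to match that scaling.
ssim_value = structural_similarity(fixed, warped, data_range=1.0)
```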
