Focal cortical dysplasia lesion segmentation using multiscale transformer

Datasets

A publicly available dataset provided by the Department of Epileptology at the University Hospital Bonn (UHB dataset) [29] is used in this paper. Data from 85 patients diagnosed with FCD type II, hospitalized from 2006 to 2021, were retrospectively collected with approval from the university's ethics committee; 50 of the 85 patients (58.8%) are male. The scan age (28.9 ± 12.4 years) and epilepsy onset age (10 ± 8.3 years) distributions of the dataset are plotted in Fig. 2a.

Fig. 2 a Age distribution of the patients; b Lesion distribution in different brain lobes

Included in this dataset are MRI sequences for each patient, along with clinical information. MRI was performed using a 3 T MRI scanner (Magnetom Trio, Siemens Healthineers, Erlangen, Germany). T1-MPRAGE and fluid-attenuated inversion recovery (FLAIR) sequences were recorded.

Lesion ground truth was annotated on FLAIR images by two neurologists, based on a combination of diagnostic tools. A subset of 7 cases (8.3%) was categorized as MRI-negative or non-lesional. The lesions are located in the frontal lobe, temporal lobe, parietal lobe, occipital lobe, and insular lobe as shown in Fig. 2b.

Figure 3 shows the flowchart of the proposed method. Preprocessing is first applied to align and normalize the images, as shown in Fig. 3a, b; preprocessing details are given in Appendix 1. The dataset was randomly divided into training, validation, and test sets containing 62, 6, and 17 samples, respectively.

Fig. 3 Flowchart of the proposed method. a Dataset splitting; b Preprocessing procedure; c Architecture of the proposed model based on an encoder-decoder structure. Parallel transformer pathways, each consisting of m dual-self-attention (DSA) modules, are inserted to capture the global features on feature maps of different resolutions, ranging from 1/4 to 1/32. d Architecture of the DSA module, consisting of a spatial self-attention branch and a channel self-attention branch

Model architecture

The proposed multiscale dual-self-attention network (MS-DSA-NET) employs an encoder-decoder architecture with parallel transformer pathways connecting them. The specific architectural details are elaborated below and illustrated in Fig. 3c. The proposed model accepts a 3D patch \(x\in \mathbb{R}^{C_{in}\times D\times H\times W}\) as input and produces a lesion probability map of the same size as the input, where \(C_{in}\), \(D\), \(H\), and \(W\) represent the channels, depth, height, and width of the input patch, respectively.

The encoder within our proposed network is a six-level convolutional hierarchy. Each level incorporates a residual convolution module, consisting of two convolutional blocks with a residual connection. The feature representation at each level is mathematically expressed as:

$$f_{i}\in \mathbb{R}^{2^{i-1}\cdot C\times \frac{D}{2^{i-1}}\times \frac{H}{2^{i-1}}\times \frac{W}{2^{i-1}}},$$

(1)

where \(i=1,\cdots ,6\) corresponds to the level index and \(C\) denotes the number of filters in the first convolutional layer, which is fixed to 16 in our configuration.
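To make the encoder hierarchy concrete, the PyTorch sketch below shows one way a residual convolution level could be implemented and how six such levels might be stacked. The kernel size, normalization, activation, strided downsampling, and input channel count are illustrative assumptions, and `ResConvBlock` is a hypothetical name rather than the exact module of the original model.

```python
import torch
import torch.nn as nn

class ResConvBlock(nn.Module):
    """One encoder level: two 3x3x3 conv blocks with a residual connection.
    Kernel size, normalization, and activation are assumptions for illustration."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.InstanceNorm3d(out_ch), nn.LeakyReLU(inplace=True))
        self.conv2 = nn.Sequential(
            nn.Conv3d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.InstanceNorm3d(out_ch))
        # 1x1x1 projection so the residual matches shape when channels/stride change
        self.skip = nn.Conv3d(in_ch, out_ch, 1, stride=stride, bias=False) \
            if (in_ch != out_ch or stride != 1) else nn.Identity()
        self.act = nn.LeakyReLU(inplace=True)

    def forward(self, x):
        return self.act(self.conv2(self.conv1(x)) + self.skip(x))

# Six-level hierarchy: level i carries 2**(i-1)*C channels at 1/2**(i-1) resolution (C = 16).
# The single input channel (e.g., FLAIR only) is an assumption.
C, in_channels = 16, 1
encoder = nn.ModuleList(
    [ResConvBlock(in_channels, C)] +
    [ResConvBlock(2**(i - 2) * C, 2**(i - 1) * C, stride=2) for i in range(2, 7)])
```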

Transformer pathway

Our approach capitalizes on CNN feature maps at various resolutions to establish long-range relationships. Initially, the CNN feature map at each level is processed by a convolutional module that halves the feature channels and normalizes the output. The normalized features are denoted as \(x\in \mathbb{R}^{c\times d\times h\times w}\), where \(c\) represents the number of feature channels and \(d\times h\times w\) corresponds to the feature spatial size.

In contrast to the patch embedding used in ViT, the feature maps, denoted as \(\{x_{i}\in \mathbb{R}^{c_{i}\times d_{i}\times h_{i}\times w_{i}}\}\), are fed directly into parallel transformer pathways. Each pathway comprises \(m\) DSA modules; in our configuration, \(m\) is set to 3.
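A minimal sketch of how one such pathway could be wired is given below, assuming a 1×1×1 convolution for channel reduction and instance normalization; `TransformerPathway` and `dsa_factory` are hypothetical names, and the DSA module itself is sketched in the next subsection (an identity placeholder is used here).

```python
import torch
import torch.nn as nn

class TransformerPathway(nn.Module):
    """One parallel pathway: halve the CNN feature channels, normalize,
    then apply m stacked attention modules supplied by a factory."""
    def __init__(self, in_ch, dsa_factory, m=3):
        super().__init__()
        c = in_ch // 2
        self.reduce = nn.Sequential(
            nn.Conv3d(in_ch, c, kernel_size=1, bias=False),
            nn.InstanceNorm3d(c))
        self.blocks = nn.Sequential(*[dsa_factory(c) for _ in range(m)])

    def forward(self, f):                   # f: (B, in_ch, d, h, w)
        return self.blocks(self.reduce(f))  # -> (B, in_ch // 2, d, h, w)

# One pathway per encoder level with resolution 1/4 to 1/32 (channels 64..512),
# using an identity placeholder where the DSA module would go.
pathways = nn.ModuleList(
    [TransformerPathway(ch, dsa_factory=lambda c: nn.Identity(), m=3)
     for ch in (64, 128, 256, 512)])
```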

DSA architecture

The detailed structure of the DSA module is depicted in Fig. 3d. We reshape and transpose the feature map into a token sequence denoted as \(x\in \mathbb{R}^{n\times c}\), where \(n=d\times h\times w\) is the sequence length. A learnable positional embedding \(e\in \mathbb{R}^{n\times c}\) is added to the feature sequence to encapsulate the position information.

Subsequently, the sequence is normalized by a normalization layer. This is followed by two parallel self-attention modules that independently enrich spatial and channel features. Linear layers first map the feature sequence into four matrices: the query \(Q\in \mathbb{R}^{n\times c}\), the key \(K\in \mathbb{R}^{n\times c}\), the spatial value \(V_{s}\in \mathbb{R}^{n\times c}\), and the channel value \(V_{c}\in \mathbb{R}^{n\times c}\). \(V_{s}\) and \(V_{c}\) are employed in their respective self-attention modules, while \(Q\) and \(K\) are shared by both modules to avoid additional complexity.
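The tokenization and projection steps described above might look as follows in PyTorch; `DSAInput` is a hypothetical helper name, and the use of LayerNorm and a fixed token count `n` are assumptions.

```python
import torch
import torch.nn as nn

class DSAInput(nn.Module):
    """Tokenize a 3D feature map, add a learnable positional embedding, normalize,
    and produce the shared Q, K and the two values V_s, V_c."""
    def __init__(self, c, n):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, n, c))  # learnable positional embedding
        self.norm = nn.LayerNorm(c)                    # normalization type is an assumption
        self.to_q = nn.Linear(c, c)
        self.to_k = nn.Linear(c, c)
        self.to_vs = nn.Linear(c, c)   # value for spatial self-attention
        self.to_vc = nn.Linear(c, c)   # value for channel self-attention

    def forward(self, x):              # x: (B, c, d, h, w)
        b, c, d, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)        # (B, n, c), n = d*h*w
        tokens = self.norm(tokens + self.pos)
        q, k = self.to_q(tokens), self.to_k(tokens)  # shared by SSA and CSA
        v_s, v_c = self.to_vs(tokens), self.to_vc(tokens)
        return q, k, v_s, v_c
```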

Spatial self-attention (SSA) module

SSA is capable of emphasizing the salient regions of FCD lesions by capturing global inter-position dependencies. To decrease the complexity of the conventional self-attention mechanism, linear layers are utilized in SSA to project \(K\) and \(V_{s}\) into a lower-dimensional space:

$$\begin{array}{c}\bar{K}=\mathrm{Proj}\left(K\right)\in \mathbb{R}^{k\times c},\\ \bar{V}_{s}=\mathrm{Proj}\left(V_{s}\right)\in \mathbb{R}^{k\times c},\end{array}$$

(2)

where the projection dimension \(k\ll n\). The spatial similarity matrix is then computed by multiplying the query matrix \(Q\in \mathbb{R}^{n\times c}\) by the transpose of the projected key matrix \(\bar{K}\in \mathbb{R}^{k\times c}\), followed by rescaling and \(\mathrm{Softmax}\) for normalization:

$$A_{s}=\mathrm{Softmax}\left(\frac{Q\bar{K}^{T}}{\sqrt{s}}\right)\in \mathbb{R}^{n\times k},$$

(3)

where the rescaling factor \(s=c/n_{h}\) and \(n_{h}\) is the number of self-attention heads. The spatial attention map is then obtained by applying \(A_{s}\) to the projected value:

$$F_{s}=A_{s}\bar{V}_{s}\in \mathbb{R}^{n\times c}.$$

(4)

With the aid of the linear projections, the computational complexity of SSA is reduced from quadratic to linear in the sequence length \(n\). This makes it feasible to apply parallel SSA on feature maps at multiple resolutions, ranging from \(1/4\) to \(1/32\).
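A compact, single-head PyTorch sketch of this projected spatial self-attention, using the symbols defined above, is shown below; `SpatialSelfAttention` is a hypothetical name, and the linear projection layers and single-head simplification are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialSelfAttention(nn.Module):
    """Spatial self-attention with K and V_s projected to k << n tokens,
    so the cost grows linearly with the sequence length n."""
    def __init__(self, c, n, k, n_heads=1):
        super().__init__()
        self.proj_k = nn.Linear(n, k)   # projects along the token dimension
        self.proj_v = nn.Linear(n, k)
        self.scale = (c / n_heads) ** 0.5

    def forward(self, q, k, v_s):       # each: (B, n, c)
        k_bar = self.proj_k(k.transpose(1, 2)).transpose(1, 2)    # (B, k, c)
        v_bar = self.proj_v(v_s.transpose(1, 2)).transpose(1, 2)  # (B, k, c)
        attn = F.softmax(q @ k_bar.transpose(1, 2) / self.scale, dim=-1)  # (B, n, k)
        return attn @ v_bar                                       # F_s: (B, n, c)
```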

CSA module

CSA is designed to focus more acutely on FCD lesion-related features by capturing global inter-channel dependencies. CSA reuses the query \(Q\in \mathbb{R}^{n\times c}\) and the key \(K\in \mathbb{R}^{n\times c}\) shared with SSA to compute the normalized channel similarity matrix:

$$A_{c}=\mathrm{Softmax}\left(\frac{Q^{T}K}{\sqrt{s}}\right)\in \mathbb{R}^{c\times c},$$

(5)

where the rescaling factor \(s=c/n_{h}\). Subsequently, the channel attention map is obtained by reweighting the channel value with \(A_{c}\):

$$F_{c}=V_{c}A_{c}\in \mathbb{R}^{n\times c}.$$

(6)

The spatial attention map \(F_{s}\) and the channel attention map \(F_{c}\) are fused by addition, and the input feature sequence \(x\) is connected to the fused attention map via a residual connection:

$$x^{\prime}=x+F_{s}+F_{c}.$$

(7)

We reshape and transpose \(x^{\prime}\in \mathbb{R}^{n\times c}\) back to \(x^{\prime}\in \mathbb{R}^{c\times d\times h\times w}\), with the aim of recovering the spatial information and making it suitable for the following convolutional module. The convolutional module consists of a convolution block and a \(1\times 1\times 1\) convolution with a residual connection to generate the enriched feature maps.
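The channel branch and the subsequent fusion and reshaping could be sketched as follows, again in single-head form and with hypothetical helper names (`channel_self_attention`, `fuse_and_restore`).

```python
import torch
import torch.nn.functional as F

def channel_self_attention(q, k, v_c, n_heads=1):
    """Channel self-attention: a c x c similarity over channels reweights V_c.
    Single-head form; the exact head handling is an assumption."""
    c = q.shape[-1]
    scale = (c / n_heads) ** 0.5
    attn = F.softmax(q.transpose(1, 2) @ k / scale, dim=-1)  # A_c: (B, c, c)
    return v_c @ attn                                        # F_c: (B, n, c)

def fuse_and_restore(x, f_s, f_c, d, h, w):
    """Add the spatial and channel attention maps, apply the residual connection
    to the input tokens, and reshape back to a 3D feature map for the conv module."""
    out = x + f_s + f_c                                      # (B, n, c)
    b, n, c = out.shape
    return out.transpose(1, 2).reshape(b, c, d, h, w)
```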

Decoder architecture

The decoder's function is to fuse features from various hierarchical levels, an operation facilitated by deconvolution and dedicated Fusion blocks. A Fusion block receives skip features from the encoder side, either CNN features or transformer-pathway outputs, together with the feature maps from the next deeper level, and combines them by concatenation and convolution. The deeper-level feature maps are first upsampled via deconvolution with a kernel and stride of \(2\times 2\times 2\); this synchronizes feature maps across resolutions and simultaneously halves the number of feature channels prior to entering the Fusion block. Ultimately, the feature maps are refined through a \(1\times 1\times 1\) convolution that maps them to two output channels, immediately followed by a Softmax layer to yield a normalized probability map.
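A possible PyTorch rendering of one Fusion block and the final prediction head is sketched below; `FusionBlock` is a hypothetical name, and the convolution/normalization choices inside the block, as well as the 16-channel shallowest level feeding the head, are assumptions.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Decoder step: upsample the deeper feature map with a 2x2x2 deconvolution
    (halving its channels), concatenate it with the skip features, and fuse them
    with a convolution block."""
    def __init__(self, deep_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose3d(deep_ch, deep_ch // 2, kernel_size=2, stride=2)
        self.fuse = nn.Sequential(
            nn.Conv3d(deep_ch // 2 + skip_ch, out_ch, 3, padding=1, bias=False),
            nn.InstanceNorm3d(out_ch), nn.LeakyReLU(inplace=True))

    def forward(self, deep, skip):
        return self.fuse(torch.cat([self.up(deep), skip], dim=1))

# Final 1x1x1 convolution to two output channels followed by Softmax over channels,
# assuming the shallowest decoder level carries C = 16 channels.
head = nn.Sequential(nn.Conv3d(16, 2, kernel_size=1), nn.Softmax(dim=1))
```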

Loss functions

The optimization of the proposed model employs a hybrid loss function as the objective function, which comprises two distinct components: a regional term and a voxel term:

$$\begin{array}{c}L=L_{region}+\omega \cdot L_{voxel},\\ L_{region}=1-\frac{2\sum_{i}P_{i}G_{i}+\varepsilon}{\sum_{i}P_{i}+\sum_{i}G_{i}+\varepsilon},\\ L_{voxel}=-\sum_{i}\left(G_{i}\log P_{i}+\left(1-G_{i}\right)\log \left(1-P_{i}\right)\right),\end{array}$$

(8)

where the regional term \(L_{region}\) is the Dice loss, which evaluates the disparity between the predicted lesion map and the ground truth; the voxel term \(L_{voxel}\) is the cross-entropy loss, which evaluates the dissimilarity between the predicted probabilities and the ground truth labels at the voxel level; \(P\) is the prediction map, \(G\) is the ground truth, and the subscript \(i\) indexes voxels; \(\omega\) is a predefined weight, set to 1 in this study.
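A minimal sketch of this hybrid loss in PyTorch is given below; `hybrid_loss` is a hypothetical name, and the epsilon value, the mean reduction used for the cross-entropy term, and the assumption of binary foreground probabilities as inputs are all illustrative choices.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(pred, target, omega=1.0, eps=1e-5):
    """Dice (regional) + weighted cross-entropy (voxel) loss.
    `pred` holds foreground probabilities in [0, 1]; `target` holds binary labels."""
    p, g = pred.flatten(), target.flatten().float()
    dice = 1.0 - (2.0 * (p * g).sum() + eps) / (p.sum() + g.sum() + eps)
    ce = F.binary_cross_entropy(p, g)   # -mean(g*log p + (1-g)*log(1-p))
    return dice + omega * ce
```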

Evaluation metrics

We employed two sets of metrics: subject-level and voxel-level assessments. At the subject level, we used the detection sensitivity (\(Sens_{sub}\)) and the average number of false-positive clusters (\(FP_{c}\)). An FCD lesion was considered detected if there was at least one voxel of overlap between the prediction and the ground truth, a criterion consistent with that used in [19]. \(Sens_{sub}\) is defined as follows:

$$Sens_{sub}=\frac{TP}{TP+FN},$$

(9)

where \(TP\) denotes true positive subjects, and \(FN\) represents false negative subjects.

\(FP_{c}\) quantifies the number of false-positive lesion detections within FCD patients. Clusters are initially delineated based on voxel connectivity analysis, with any cluster lacking actual lesion voxels being considered a false-positive cluster.
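A sketch of this subject-level evaluation, using SciPy connected-component labeling, is given below; `subject_level_metrics` is a hypothetical name, and the choice of 26-connectivity for clustering is an assumption.

```python
import numpy as np
from scipy import ndimage

def subject_level_metrics(pred_mask, gt_mask):
    """Per-subject evaluation: a lesion counts as detected if prediction and ground
    truth overlap by at least one voxel; every predicted connected cluster with no
    true lesion voxel counts as a false-positive cluster."""
    detected = bool(np.logical_and(pred_mask > 0, gt_mask > 0).any())
    structure = np.ones((3, 3, 3))                    # 26-connected clusters (assumption)
    labels, n_clusters = ndimage.label(pred_mask > 0, structure=structure)
    fp_clusters = sum(
        1 for c in range(1, n_clusters + 1) if not (gt_mask[labels == c] > 0).any())
    return detected, fp_clusters
```

Detection sensitivity is then the fraction of detected subjects among all lesional subjects, and \(FP_{c}\) is averaged across patients.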

At the voxel level, we used the commonly applied metrics of precision (\(Pre\)), sensitivity (\(Sens\)), and the Dice coefficient (\(Dice\)) for evaluation. These metrics are defined as:

$$\begin{array}{c}Pre=\frac{TP}{TP+FP},\\ Sens=\frac{TP}{TP+FN},\\ Dice=\frac{2TP}{2TP+FP+FN},\end{array}$$

(10)

where \(TP\) refers to the number of true positive voxels, \(FP\) indicates the number of false-positive voxels, and \(FN\) denotes the number of false negative voxels.
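These voxel-level metrics can be computed directly from binary masks, as in the short sketch below; `voxel_level_metrics` is a hypothetical name.

```python
import numpy as np

def voxel_level_metrics(pred_mask, gt_mask):
    """Voxel-wise precision, sensitivity, and Dice from binary prediction and
    ground-truth masks."""
    pred, gt = pred_mask > 0, gt_mask > 0
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    dice = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0
    return precision, sensitivity, dice
```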
