A critical guide to the automated quantification of perivascular spaces in magnetic resonance imaging

The glymphatic system and perivascular spaces

In the human body, the lymphatic system is the main pathway for waste clearance (Cueni and Detmar, 2008). Generally, lymphatic organs are concentrated in regions of higher energy consumption and production of metabolic waste (Louveau et al., 2018). The lymphatic system, however, does not appear to extend to the nervous system, despite the brain’s vast metabolic demands (Iliff et al., 2013). Instead, the maintenance and waste management of the neuronal environment is governed by the glymphatic system (Hladky and Barrand, 2014; Bakker et al., 2016; Plog and Nedergaard, 2018). Aptly named, the glymphatic system eliminates waste in the brain with a system of glial cells and cerebrospinal fluid (CSF).

The prevailing understanding of glymphatic clearance involves the convective fluid flow of CSF through perivascular spaces (PVS), which are formed and lined by glial cells. At the microscopic scale, the glymphatic system comprises perivascular spaces surrounding the cerebral vasculature, also termed perivascular units (Figure 1; Troili et al., 2020). CSF produced in the choroid plexus traverses the subarachnoid space to irrigate perforating blood vessels. As these vessels penetrate deeper into the cortex, the meningeal layers become continuous with astrocytic endfeet, forming a CSF-filled chamber that encapsulates the cerebral vascular tree (Jessen et al., 2015). The astrocytic endfeet densely express aquaporin-4 (AQP4) channels, which facilitate the exchange of CSF and interstitial fluid (ISF) between PVS and extracellular spaces (Iliff et al., 2013; Huguenard et al., 2017).

Figure 1. Schematic of the perivascular unit adapted from Troili et al. (2020). (Top) Arteries and veins from the subarachnoid space penetrate the parenchyma, perpendicular to the cortical surface. (Middle) Metabolites are extruded from arterioles and into the extracellular space, whilst metabolic waste flows towards the perivenular spaces for glymphatic clearance from the neuropil. (Bottom left) Astrocytic endfeet enclosing a penetrating arteriole which forms the perivascular space. (Bottom right) Aquaporin-4 trans-membrane channels line the astrocytic end-feet and facilitate the exchange of cerebrospinal fluid and interstitial fluid between PVS and extracellular space. Reproduced under the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/).

Multiple models have been proposed to explain the fluid flow that occurs in and around perivascular spaces (Hladky and Barrand, 2014, 2022; Bohr et al., 2022). According to one widely held model, CSF flushes out from the periarteriolar spaces, via mechanisms including convective fluid flow and arterial pulsations, into the interstitial space to flush waste metabolites towards perivenular spaces (Figure 1; Troili et al., 2020). At the perivenular spaces, interstitial waste is directed to major areas of CSF clearance and fluid filtration via dural glymphatic drainage pathways such as the venous sinuses (Albayram et al., 2022). This is known as glymphatic flow (Xie et al., 2013; Rasmussen et al., 2018).

The glymphatic system

Much of our understanding of the glymphatic system stems from animal models. Animal models have provided evidence suggesting that the clearance of metabolic waste in the interstitial space is modulated by sleep/wake states, and that the restorative effects of sleep may largely be due to glymphatic processes. For example, Xie et al. (2013) used a mouse model to demonstrate that the glymphatic system is most active during sleep, compared to awake or anesthetized mice (Xie et al., 2013). Additionally, glymphatic flow appears to be facilitated by shrinkage of the neuronal environment leading to greater interstitial space (Jessen et al., 2015). The importance of AQP4 channels in glymphatic functioning was outlined by Iliff et al. (2012), when AQP4 knock-out mice showed reduced clearance of amyloid-β by 55% and developed memory deficits. Iliff et al. (2012) also showed that this system preferentially mediates the filtration of solutes of smaller molecular size (<3 kDa), having clear implications in neurodegenerative diseases characterized by the harmful accumulation of small neurotoxins such as amyloid-β (Iliff et al., 2012; Huguenard et al., 2017). In disease models, glymphatic functions appear to be attenuated. For example, mouse models of Alzheimer’s disease (AD) exhibited reduced clearance of amyloid-β from the extracellular space (Harrison et al., 2020). Similarly, in a diabetes mellitus mouse model, glymphatic flow was attenuated with delayed clearance of contrast agents from the interstitial space (Jiang et al., 2017; Zhang et al., 2019).

In humans, contrast-enhanced magnetic resonance imaging has been used to show greater clearance of contrast agents shortly after sleep, compared to waking activity, providing further evidence that the glymphatic clearance may be most active during sleep (Lee et al., 2021). Moreover, dysfunctions of the glymphatic system have been associated with many diseases in humans, including cerebral small vessel disease, AD, and multiple sclerosis (Ge et al., 2005; Mogensen et al., 2021; Natário et al., 2021; Benveniste and Nedergaard, 2022). For example, ex vivo examination of histological slices from the white matter of an AD patient revealed enlarged perivascular spaces (Figure 2) compared to an age-matched control (Roher et al., 2003). The majority of in vivo research in humans, however, uses magnetic resonance imaging to evaluate the extent of perivascular space enlargements.

Figure 2. A histological slice of the superior frontal gyrus from an Alzheimer’s disease patient with numerous enlarged perivascular spaces in the white matter (Roher et al., 2003). PVS appear as bright tubular structures and are mainly present in white matter as opposed to gray matter. Adapted from Roher et al. (2003) and reproduced with permission from Springer Nature, conveyed via the Copyright Clearance Center, Inc.

Characterization of perivascular spaces

Perivascular spaces were first discovered in the early 1800s and described as état criblé, or diffusely enlarged, for their widespread occurrence in the basal ganglia. They are also known as Virchow-Robin spaces, after Virchow and Robin who hypothesized them to be spaces that are continuous with perineuronal spaces (Woollam and Millen, 1955). Since the arrival of neuroimaging techniques in the late 1980s, enlarged PVS have been observed in vivo (Wardlaw et al., 2020).

Magnetic resonance imaging (MRI) is the current standard for in vivo assessment of PVS in humans. MRI is used to evaluate PVS visibility as a proxy measure of glymphatic dysfunction, and potential occlusion of drainage pathways (Rasmussen et al., 2018). PVSs, particularly when enlarged, are observable in MRI scans (Figures 3, 4; Kwee and Kwee, 2007). Structurally, PVS are long tubular structures that follow cerebral blood vessels (Wardlaw et al., 2013). On MR scans, their appearance depends on the viewing plane and the MRI weighting sequence. On both T1 and T2 weighted images, PVSs are CSF isointense. Thus, they are hypointense or dark on T1 images and hyperintense or light on T2 (Kwee and Kwee, 2007). When viewed along a parallel axis, such as the sagittal or coronal planes, PVS appear as long tubular shapes. On axial slices, PVS appear as small ovoid structures typically less than 3 mm in diameter (Figure 3; Wardlaw et al., 2013). In rare cases, giant tumefactive PVS can exceed 15 mm in diameter (Salzman et al., 2005).

Figure 3. PVS rating scales rely on representative axial slices to assess the severity of PVS enlargement. (Left) A sagittal view of the brain, from a T1-weighted MRI scan. The red lines indicate the axial slices that have been selected for visual PVS rating. The upper line was chosen to assess PVS in the centrum semiovale, and the lower line was chosen for the basal ganglia. The centrum semiovale is the mass of white matter above the lateral ventricles. (Middle) An axial slice of the centrum semiovale corresponding to the top line. (Right) An axial slice of the basal ganglia corresponding to the bottom line. Visible perivascular spaces are outlined by red circles. For an extensive review of rating scales, please refer to Paradise et al. (2020).

Figure 4. Axial slices of MRI-visible brain lesions, including PVS, white matter hyperintensities, microbleeds, and lacunes. PVS appear as hyperintense and tubular shapes in T2-weighted MRI scans (Left). White matter hyperintensities are prominent on FLAIR images (Middle-Left). Other lesions that can be confused with PVS include microbleeds (Middle-Right) and lacunes (Right). In FLAIR scans, lacunes are surrounded by a hyperintense rim, whereas PVS are not. Imaging artifacts such as Gibbs ringing, and motion artifacts can also hinder the automated detection of PVS. Figures of microbleeds and lacunes were adapted from Chesebro et al. (2021) and Li et al. (2019), respectively. The figures are reproduced under the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/). FLAIR, Fluid attenuated inversion recovery; SWI, susceptibility-weighted imaging.

The severity of PVS enlargement is graded by a rater according to established visual rating scales (Figure 3; Patankar et al., 2005; Adams et al., 2013; Potter et al., 2015a; Paradise et al., 2020). Subsequently, these severity scores are associated with features of interest including disease markers and risk factors. Using visual rating scales, the severity of PVS enlargement has been associated with, and subsequently proposed as a potential biomarker of, various neurodegenerative disorders such as cerebral small vessel disease, AD, neuroinflammation in multiple sclerosis, and cerebral amyloid angiopathies (Hansen et al., 2015; Bakker et al., 2016; Ramirez et al., 2016; Rasmussen et al., 2018; Granberg et al., 2020). Notably, a general weakness of T1 and T2 MRI sequences is that they cannot differentiate between perivenular and periarteriolar spaces (Wardlaw et al., 2020).

Perivascular spaces occur throughout the brain. The most commonly examined regions are the centrum semiovale (CS) and the basal ganglia (BG) (Figure 3). The CS is the mass of white matter (WM) superior to the lateral ventricles, whereas the BG are adjacent to the lateral ventricles and include the caudate and putamen. Importantly, the anatomy and pathology of PVS differ between regions. In the white matter, blood vessels from the subpial space that penetrate the cortical surface and enter the parenchyma are lined by a single leptomeningeal layer: the pia mater (Pollock et al., 1997). In the basal ganglia, penetrating blood vessels are lined by two leptomeningeal layers that connect to the subarachnoid space (Pollock et al., 1997). Thus, the pathophysiology of PVS differs substantially between these two regions (Wardlaw et al., 2020). For a review of the differences between cortical and basal perivascular spaces, please refer to Wardlaw et al. (2020).

Many grading scales have been developed to quantitatively assess the severity of PVS in MRI (Patankar et al., 2005; Adams et al., 2013; Wardlaw et al., 2013; Laveskog et al., 2018; Paradise et al., 2020). The most widely adopted is Wardlaw’s scale, which assesses PVS severity in the CS, BG and midbrain (Potter et al., 2015c). According to Wardlaw’s scale, the rater selects a single representative axial slice for each region. If PVS are observable in the midbrain, the midbrain is assigned a score of 1; otherwise, it is assigned a score of 0. For the CS and BG, each PVS is counted (Figure 3) and the region is rated according to a 5-point rating scale (0 = no PVS found; 1 = 1–10; 2 = 11–20; 3 = 21–40; 4 = more than 40 present). The 5-point rating scale has high inter-rater and test-retest reliability, and has been used to objectively link PVS and glymphatic dysfunction with markers of disease (Paradise et al., 2020).
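The count-to-score mapping described above can be written down in a few lines. The following is a minimal sketch, assuming the PVS count on the representative slice has already been obtained by a rater; the function names are hypothetical and are not taken from any published tool.

```python
def wardlaw_region_score(pvs_count: int) -> int:
    """Map a PVS count on one representative axial slice (CS or BG)
    to the 5-point score: 0 = none, 1 = 1-10, 2 = 11-20, 3 = 21-40, 4 = >40."""
    if pvs_count < 0:
        raise ValueError("PVS count cannot be negative")
    if pvs_count == 0:
        return 0
    if pvs_count <= 10:
        return 1
    if pvs_count <= 20:
        return 2
    if pvs_count <= 40:
        return 3
    return 4


def wardlaw_midbrain_score(pvs_visible: bool) -> int:
    """The midbrain is scored 1 if any PVS are observable, otherwise 0."""
    return 1 if pvs_visible else 0


# Counts of 21 and 40 fall into the same bin and receive the same score.
print(wardlaw_region_score(21), wardlaw_region_score(40))
```

Because each score covers a range of counts, two slices with quite different counts (for example 21 and 40) are indistinguishable once scored.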

Limitations of visual rating scales

By assigning a simple rating to different counts of PVS, the grading scale is a highly replicable and convenient method of assessing perivascular spaces (Paradise et al., 2020). However, the reduction of PVS counts to a simple severity score restricts deeper analyses of glymphatic dysfunction that may be region-specific or associated with specific morphological changes (Zong et al., 2016; Barisano et al., 2021b). For example, asymmetric distributions of PVS between hemispheres have been related to an increased risk of post-stroke and post-traumatic epilepsy (Duncan et al., 2018; Yu et al., 2022). When there is a large difference in PVS counts between hemispheres, Wardlaw’s scale instructs the rater to use the hemisphere with the higher PVS count (Potter et al., 2015b,c), so the asymmetry itself is not captured by the score. Moreover, longitudinal analyses of PVS are difficult to perform due to the coarseness of the grading scale, and the use of different scales between publications has led to difficulties comparing results and conducting meta-analyses (Granberg et al., 2020).

Multiple algorithms have been developed to facilitate or improve the consistency of PVS grading (Descombes et al., 2004; Zong et al., 2016; González-Castro et al., 2017; Dubost et al., 2019a,b; Jung et al., 2019; Sepehrband et al., 2019; Goryawala et al., 2021; Williamson et al., 2022). For example, Jung et al. (2019) applied a convolutional neural network (CNN) to increase the signal-to-noise ratio (SNR) of PVS, thereby enhancing its appearance and improving detection (Figure 5; Jung et al., 2019). Others have automated the quantification of PVS in axial slices to assign severity scores (Dubost et al., 2019a,b). However, these algorithms do not enable the volumetric or morphological examination of perivascular spaces visible in MRI (Valdes Hernandez et al., 2013). Thus, the remaining review will focus on methods for automatically segmenting perivascular spaces.

Figure 5. A convolutional neural network for enhancing PVS visibility in T2-weighted MRI images (Jung et al., 2019). © IEEE 2019.

Automated segmentation of perivascular spaces

Perivascular spaces are small structures that occur repeatedly throughout the brain. The average volume of a single MRI-visible perivascular space is less than 5 mm³, whilst the total PVS volume in a young and healthy individual can average 5,000 mm³ (Zong et al., 2016; Barisano et al., 2021b). At a resolution of 1 mm isotropic (each voxel is 1 mm in length, width, and height, i.e., 1 mm³ in volume), this would require manual labeling of roughly 5,000 voxels to complete PVS segmentation in the average subject. Thus, manual delineation of 3D PVS is laborious and time-consuming. Despite this, the manual segmentation of perivascular spaces is a worthy endeavor to which multiple research groups have dedicated time and resources, in order to develop high quality algorithms (Park et al., 2016; Zong et al., 2016; Zhang et al., 2017; Boespflug et al., 2018; Lian et al., 2018; Schwartz et al., 2019; Smith et al., 2020, Preprint; Sepehrband et al., 2021; Lynch et al., 2022, Preprint). The goal of automatic segmentation algorithms is to label such structures, eliminating the need for manual labeling, thereby expediting detailed analyses of PVS.
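The relationship between PVS volume, voxel size, and the number of voxels to be labeled can be sketched as follows. This is a minimal illustration, assuming a binary 3D mask and known voxel dimensions; the function names and the 0.5 mm example are hypothetical and only serve to show how the labeling burden scales with resolution.

```python
import numpy as np


def pvs_volume_mm3(mask, voxel_size_mm=(1.0, 1.0, 1.0)):
    """Total volume (mm^3) of all non-zero voxels in a binary mask."""
    voxel_volume = float(np.prod(voxel_size_mm))
    return int(np.count_nonzero(mask)) * voxel_volume


def voxels_for_volume(volume_mm3, voxel_size_mm=(1.0, 1.0, 1.0)):
    """Number of voxels needed to cover a given volume at a given voxel size."""
    return int(round(volume_mm3 / float(np.prod(voxel_size_mm))))


# Toy mask with five labeled voxels at 1 mm isotropic resolution.
toy_mask = np.zeros((10, 10, 10), dtype=bool)
toy_mask[2:7, 5, 5] = True
print(pvs_volume_mm3(toy_mask))                      # 5.0 mm^3

# A total PVS volume of 5,000 mm^3 at two voxel sizes.
print(voxels_for_volume(5000, (1.0, 1.0, 1.0)))      # 5000 voxels
print(voxels_for_volume(5000, (0.5, 0.5, 0.5)))      # 40000 voxels
```

Halving the voxel edge length multiplies the number of voxels to label by eight, which is one reason manual delineation becomes more demanding on higher-resolution data.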

Several studies have published automated methods for segmenting perivascular spaces (Table 1; Ramirez et al., 2016; Ballerini et al., 2017; Dubost et al., 2017; Lian et al., 2018). Broadly, these algorithms can be categorized as either (1) classical image processing techniques or (2) machine learning (ML) algorithms. Both have been applied with varying levels of success.

Table 1. Summary of PVS segmentation methods and their results.

In the classical approaches, a computerized pipeline with explicit parameters is set up and optimized to search for PVS (Ballerini et al., 2016, 2017). For example, seed clusters selected based on intensity thresholds within the white matter can be filtered based on their size, linearity, length, and width (Figure 6; Boespflug et al., 2018). One disadvantage of these methods is that they often require further optimization for different datasets depending on the imaging protocols (Ballerini et al., 2016).
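To make the idea of morphological constraints concrete, the sketch below filters a candidate cluster by size and linearity. It is an illustration under stated assumptions and not the published mMAPS implementation: linearity is defined here as the fraction of coordinate variance along the first principal axis, and the function names and thresholds are hypothetical.

```python
import numpy as np


def cluster_shape(coords):
    """Return (n_voxels, linearity) for an (N, 3) array of voxel coordinates.

    Linearity is ASSUMED here to be the share of variance along the first
    principal axis: close to 1 for a straight tube, about 1/3 for a sphere.
    """
    coords = np.asarray(coords, dtype=float)
    n_voxels = coords.shape[0]
    if n_voxels < 3:
        return n_voxels, 0.0
    covariance = np.cov(coords, rowvar=False)
    eigenvalues = np.clip(np.linalg.eigvalsh(covariance), 0.0, None)  # ascending
    total = eigenvalues.sum()
    linearity = float(eigenvalues[-1] / total) if total > 0 else 0.0
    return n_voxels, linearity


def keep_cluster(coords, min_voxels=5, max_voxels=200, min_linearity=0.8):
    """Keep a seed cluster only if it is PVS-sized and sufficiently linear.
    The thresholds are placeholders and would need tuning per dataset."""
    n_voxels, linearity = cluster_shape(coords)
    return min_voxels <= n_voxels <= max_voxels and linearity >= min_linearity


# A straight 10-voxel line passes; a compact 3 x 3 x 3 cube does not.
line = [(i, 0, 0) for i in range(10)]
cube = [(i, j, k) for i in range(3) for j in range(3) for k in range(3)]
print(keep_cluster(line), keep_cluster(cube))
```

The explicit thresholds illustrate why such pipelines tend to need re-optimization when the voxel size or contrast of the input data changes.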

Figure 6. Schematic of the mMAPS (multi-modal auto-identification of perivascular spaces) pipeline for the automated segmentation of PVS adapted from Piantino et al. (2020). It relies on the sequential application of image filters, intensity thresholds, and morphological constraints to delineate PVS voxels. (A) A T2-weighted MRI image is acquired. (B) Voxel intensities are normalized. (C) White matter (WM) mask (red) applied. (D) Holes in the WM mask are filled. (E) Edges of the WM mask are eroded. (F) Frequency map of the voxel intensities. (G) Map of the local intensity differences. (H) Seed clusters resulting from the previous steps are extracted and filtered with morphological constraints. (I) The final segmentation map. Reproduced with permission of the American Society of Neuroradiology, from Piantino et al. (2020), conveyed through the Copyright Clearance Center, Inc.

In comparison, ML-based algorithms undergo training with example data to perform a certain task (Bengio, 2013). In PVS segmentation, the algorithm is trained with manually labeled images, over many iterations or epochs, to learn the features associated with PVS. After training, ML algorithms can be used to label PVS structures on new and unseen data. Although the initial stages are computationally expensive, the results are generally worthwhile as ML algorithms can learn complex features that cannot be defined via classical techniques (Bengio et al., 2012).

Evaluation of segmentation performance

In the evaluation of image segmentation algorithms, the true positives, false positives, and false negatives are tallied to compute three main metrics: the Sørensen–Dice coefficient or F1 score, the sensitivity or detection rate, and the positive predictive value (PPV) (Equation 1) (Dice, 1945; Lian et al., 2018; Jung et al., 2019; Siddique et al., 2021).

$$\mathrm{DSC} = \frac{2\,TP}{2\,TP + FP + FN}; \quad \mathrm{SEN} = \frac{TP}{TP + FN}; \quad \mathrm{PPV} = \frac{TP}{TP + FP}$$

Equation 1. DSC, Sørensen–Dice similarity coefficient; SEN, sensitivity; PPV, positive predictive value; TP, true positives; TN, true negatives; FP, false positives; FN, false negatives.

The Dice score is a measure of overall segmentation performance (Dice, 1945). A higher Dice score means better overall performance. The sensitivity is the percentage of all PVS voxels that are detected by the algorithm, whereas the PPV is the percentage of predicted voxels that were correctly labeled as PVS. These metrics assess different aspects of the algorithm’s ability to delineate perivascular spaces. Together, they are useful for gauging the tendencies of an algorithm’s segmentation ability. For example, higher sensitivity than PPV indicates that the algorithm prioritizes detection over accuracy (it misses few PVS voxels but admits more false positives), whereas higher PPV than sensitivity indicates that the algorithm prioritizes accuracy or correctness over detection.
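The three metrics in Equation 1 can be computed directly from a predicted mask and a ground-truth mask. The following is a minimal sketch, assuming both masks are binary arrays of the same shape; the function name is hypothetical and the tiny arrays are toy values chosen only to make the counts easy to follow.

```python
import numpy as np


def segmentation_metrics(pred, truth):
    """Voxel-wise Dice (DSC), sensitivity (SEN) and PPV for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    if pred.shape != truth.shape:
        raise ValueError("prediction and ground truth must have the same shape")

    tp = int(np.count_nonzero(pred & truth))    # predicted and truly PVS
    fp = int(np.count_nonzero(pred & ~truth))   # predicted but not PVS
    fn = int(np.count_nonzero(~pred & truth))   # PVS but not predicted

    def ratio(numerator, denominator):
        # Undefined (NaN) when the denominator is zero, e.g. two empty masks.
        return numerator / denominator if denominator else float("nan")

    return {
        "dsc": ratio(2 * tp, 2 * tp + fp + fn),
        "sen": ratio(tp, tp + fn),
        "ppv": ratio(tp, tp + fp),
    }


# Toy example: TP = 2, FP = 1, FN = 2.
truth = [1, 1, 1, 1, 0, 0]
pred = [1, 1, 0, 0, 1, 0]
print(segmentation_metrics(pred, truth))
```

In this toy case the PPV (2/3) exceeds the sensitivity (1/2), the pattern described above for an algorithm that favors correctness over detection.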

However, the Dice coefficient is not infallible. A common problem in medical image segmentation tasks, and especially PVS segmentation, is that of noisy labels (Karimi et al., 2019). For a number of reasons, such as poor image quality, rater fatigue or time constraints, the ground truth can contain false positives and false negatives that affect the Dice coefficient (Wardlaw et al., 2013; Moses et al., 2022). Thus, the sources of error that affect the Dice score are:

(1) Error in the ground truth (human labels).

(2) Error in the algorithm prediction.

The Dice score aims to measure the latter, but if the ground truth is not reliable, then its ability to do so is impaired, and resultant Dice scores are distorted. In the ideal scenario, the ground truth is perfect and therefore any decrement in the Dice score is purely due to algorithm error. However, attaining this perfect ground truth would require high inter- and intra-rater agreement, and robust standards for PVS delineation. To a great extent this has been achieved with 2D rating scales, e.g., STRIVE and UNIVERSE, demonstrating high concordance across research groups (Wardlaw et al., 2013; Adams et al., 2015). Currently, there is no guideline for the voxel-wise segmentation of perivascular spaces in either T1 or T2-weighted MRI data.

Variations of the Dice metric have been employed, such as the Dice score with a 6-neighborhood connected components rule, wherein predicted PVS voxels are deemed true positives if they are adjacent to a real PVS voxel (Sudre et al., 2022, Preprint). These do not address the inherent issue of noisy labels in PVS segmentation; rather, they merely inflate the reported scores. Such modifications make it difficult to compare results with similar studies that have employed the traditional Dice metric (Equation 1). They also obscure the fact that PVS segmentation is a more difficult task than other medical imaging problems, where inter-rater Dice scores of 70%+ are commonplace, as opposed to PVS segmentation where inter-rater Dice scores are usually below 50% (Sudre et al., 2022, Preprint; Liu et al., 2022; Spijkerman et al., 2022). This is likely due to the nature of the task, as PVS are small, numerous and occur repeatedly throughout the brain, whilst other lesion detection tasks such as for tumors or stroke lesions require delineation of a single, large and prominent object. Thus, the Dice score is not a perfect metric, but it is the most rigorous one for the evaluation of algorithm performance in PVS segmentation.
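The inflating effect of an adjacency rule can be illustrated with a toy example. The sketch below is an assumption-laden approximation, not the exact procedure of any cited study: a predicted voxel counts as a true positive if it lies on or face-adjacent to a ground-truth voxel (6-neighborhood in 3D), and, as a further assumption, a ground-truth voxel counts as a false negative only if no predicted voxel lies on or next to it. All function names are hypothetical.

```python
import numpy as np


def dilate_faces(mask):
    """Grow a binary mask by one voxel along each axis (6-neighborhood in 3D)."""
    mask = np.asarray(mask, dtype=bool)
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    grown = padded.copy()
    for axis in range(mask.ndim):
        grown |= np.roll(padded, 1, axis=axis)
        grown |= np.roll(padded, -1, axis=axis)
    core = tuple(slice(1, -1) for _ in range(mask.ndim))
    return grown[core]  # crop the padding so nothing wraps around


def strict_dice(pred, truth):
    """Traditional voxel-wise Dice coefficient."""
    pred, truth = np.asarray(pred, bool), np.asarray(truth, bool)
    denominator = np.count_nonzero(pred) + np.count_nonzero(truth)
    return 2 * np.count_nonzero(pred & truth) / denominator if denominator else float("nan")


def relaxed_dice(pred, truth):
    """Dice with an adjacency tolerance (ASSUMED rule, see lead-in)."""
    pred, truth = np.asarray(pred, bool), np.asarray(truth, bool)
    tp = np.count_nonzero(pred & dilate_faces(truth))
    fp = np.count_nonzero(pred) - tp
    fn = np.count_nonzero(truth & ~dilate_faces(pred))
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else float("nan")


# A three-voxel tube predicted one voxel away from its true position.
truth = np.zeros((5, 5, 5), dtype=bool)
pred = np.zeros((5, 5, 5), dtype=bool)
truth[1:4, 2, 2] = True
pred[1:4, 3, 2] = True
print(strict_dice(pred, truth), relaxed_dice(pred, truth))  # 0.0 1.0
```

In this deliberately extreme case, a prediction that shares no voxel with the ground truth moves from a Dice score of 0 to a perfect score once adjacency is tolerated, which is why scores obtained under the two rules are not comparable.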

To our knowledge, no study has directly addressed this problem of noisy labels in PVS segmentation. One review that examined solutions to noisy labels in medical images suggests the use of semi-automatic annotations, wherein models are trained on incomplete data, then used to aid the human rater in the detection of undetected cases (Karimi et al., 2019).

Several studies have employed semi-automated methods, e.g., with the Frangi filter, to efficiently generate PVS maps (Park et al., 2016; Sepehrband et al., 2021; Langan et al., 2022; Ranti et al., 2022). Subsequently, these masks can be compared to the raw predictions to derive Dice similarity coefficients, and thus measure model performance. The downside is that segmentation performance is likely to be overestimated, compared to Dice scores derived with independently generated manual segmentations.

Whilst the evaluation metrics may be biased, the practical benefit is that considerable amounts of time and resources are saved, since correction of automated segmentations may be less time-consuming than purely manual segmentations. Whether this advantage outweighs the disadvantage of a biased model evaluation should be up to the individual researcher. Interestingly, if a machine learning model is trained on a small subset of purely manual PVS labels, and subsequently used to generate a large sample of semi-automated segmentations, then the bias might be diminished, compared to semi-automated Frangi labels, since the model was originally trained on purely human labels.

Furthermore, visual comparisons between model predictions and the ground truth are often conducted (Figure 7). Certain weaknesses of the algorithm can then be uncovered, and the algorithm adjusted retrospectively. Visual inspection is important, since PVS can be easily confounded with other lesions or imaging artifacts including lacunes, white matter hyperintensities (WMH) and microbleeds (Figure 4; Wardlaw et al., 2013; Lian et al., 2018).

Figure 7. Visual assessment of predicted PVS segmentations from different algorithms to the ground truth, adapted from Lian et al. (2018). Visual comparisons are used to compare algorithms to each other and the ground truth. Red voxels are PVS that have been manually labeled by a human rater. Cyan voxels were predicted to be PVS by the respective automated method. Yellow arrows and circles indicate low contrast perivascular spaces that can be used to differentiate segmentation ability of different methods. FT, Frangi filtering; SRF, structured random forest; M2EDN, multi-channel, multi-scale encoder-decoder network. Reprinted from Lian et al. (2018), with permission from Elsevier, conveyed through the Copyright Clearance Center, Inc.

Another common method of validating segmentation performance is to compare quantitative measures of PVS produced from segmentations, such as total PVS volumes or counts, with conventional PVS ratings (Boespflug et al., 2018; Schwartz et al., 2019). A high correlation coefficient is expected and ensures newer algorithms are commensurate with established methods. Importantly, the aforementioned methods of assessing algorithm performance are not infallible and can be misleading when incorrectly applied. These pitfalls are discussed in more detail later, alongside recommendations to avoid them.

Existing approaches

An example of a traditional approach to segmentation is the multi-modal auto-identification of perivascular spaces (mMAPS) algorithm by Boespflug et al. (2018). This method determines the likelihood of a voxel being a PVS, based on its intensity on co-registered T1, T2, FLAIR (fluid attenuated inversion recovery) and proton density (PD) weighted images. Voxels exceeding a certain probability are grouped into clusters, which are then deemed to be PVS if they are sufficiently linear in shape. Their application demonstrated strong correlations with visual rating scores by experts (r > 0.65) (Boespflug et al., 2018). Importantly, mMAPS required four imaging modalities: T1, T2, FLAIR, and PD sequences.

Schwartz et al. (2019) adapted the mMAPS algorithm to segment perivascular spaces using only two imaging modalities (T1-weighted and FLAIR images) from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset (Jack et al., 2008; Boespflug et al., 2018; Schwartz et al., 2019). Both the correlation coefficient, comparing PVS volumes to visual PVS ratings (r > 0.7), and PPVs (77.5–87.5%) were high. Worth noting is that neither Dice coefficients nor sensitivity scores were published. PPV alone is an incomplete assessment of the algorithm. For example, if an algorithm labeled only 100 voxels, all correctly, in an image with 1,000 PVS voxels, its PPV would be 100%. This figure ignores the remaining 900 voxels that were missed, which result in a sensitivity of 10% and a Dice coefficient of only 18%. Nevertheless, the application of mMAPS to commonly acquired T1-weighted images marks an important step towards automated PVS segmentation in clinical settings.
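The arithmetic behind this example follows from Equation 1, assuming that all 100 labeled voxels are correct (TP = 100, FP = 0) and that the 900 missed voxels are false negatives (FN = 900):

$$\mathrm{PPV} = \frac{100}{100 + 0} = 100\%; \quad \mathrm{SEN} = \frac{100}{100 + 900} = 10\%; \quad \mathrm{DSC} = \frac{2 \times 100}{2 \times 100 + 0 + 900} = \frac{200}{1100} \approx 18.2\%$$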

Another common approach utilizes image filters to highlight perivascular spaces. In this context, filters are operations that calculate the resemblance of a voxel and its neighbors to features of interest such as tubes or edges. For example, the Frangi filter calculates the “vesselness” of each voxel, taking into account the surrounding values, in order to highlight vessel-like structures (Frangi et al., 1998; Ballerini et al., 2016).
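The principle behind a vesselness filter can be sketched in two dimensions. The code below is a simplified, single-scale, 2D illustration of the Hessian-based measure, written with NumPy only; it is not the optimized 3D, multi-scale filtering used in the cited studies. The function names, the default parameters, and the choice of setting the structure-sensitivity constant to half the maximum Hessian norm are assumptions made for this example.

```python
import numpy as np


def gaussian_smooth(image, sigma):
    """Separable Gaussian smoothing (zero-padded at the image borders)."""
    radius = int(3 * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    out = np.asarray(image, dtype=float)
    for axis in range(out.ndim):
        out = np.apply_along_axis(np.convolve, axis, out, kernel, mode="same")
    return out


def vesselness_2d(image, sigma=1.0, beta=0.5, bright_on_dark=True):
    """Single-scale 2D vesselness from the eigenvalues of the Hessian."""
    smoothed = gaussian_smooth(image, sigma)
    gy, gx = np.gradient(smoothed)          # first derivatives (rows, columns)
    hyy, _ = np.gradient(gy)                # second derivatives
    hxy, hxx = np.gradient(gx)
    hxx, hxy, hyy = (sigma ** 2) * hxx, (sigma ** 2) * hxy, (sigma ** 2) * hyy

    # Closed-form eigenvalues of the symmetric 2 x 2 Hessian at every pixel.
    half_trace = (hxx + hyy) / 2.0
    root = np.sqrt(((hxx - hyy) / 2.0) ** 2 + hxy ** 2)
    e_a, e_b = half_trace + root, half_trace - root
    swap = np.abs(e_a) > np.abs(e_b)
    lam1 = np.where(swap, e_b, e_a)         # smaller magnitude: along the tube
    lam2 = np.where(swap, e_a, e_b)         # larger magnitude: across the tube

    blobness = np.abs(lam1) / (np.abs(lam2) + 1e-12)
    structure = np.sqrt(lam1 ** 2 + lam2 ** 2)
    c = 0.5 * structure.max() if structure.max() > 0 else 1.0  # assumed heuristic

    v = np.exp(-blobness ** 2 / (2 * beta ** 2)) * (1 - np.exp(-structure ** 2 / (2 * c ** 2)))
    # Bright tubes on a dark background have a negative cross-tube curvature.
    wrong_sign = (lam2 > 0) if bright_on_dark else (lam2 < 0)
    v[wrong_sign] = 0.0
    return v


# Toy image: a bright vertical line on a dark background.
toy = np.zeros((21, 21))
toy[:, 10] = 1.0
v = vesselness_2d(toy, sigma=1.0)
print(v[10, 10] > v[10, 3])  # True: the line responds, the background does not
```

For bright-on-dark structures such as PVS on T2-weighted images, `bright_on_dark=True` would apply; for dark PVS on T1-weighted images, the sign test is reversed.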

One notable approach combined both T1 and T2 scans from a single subject to produce an image with enhanced perivascular contrast or EPC (Figure 8; Sepehrband et al., 2019). By dividing the T1 image by its co-registered T2-weighted counterpart, the contrast between tissue types is enhanced, and PVS clusters are therefore more discernible from the white matter. Subsequently, optimized Frangi filtering was applied to automatically label PVS voxels (Frangi et al., 1998; Ballerini et al., 2016). With EPC, the PVS-to-white matter contrast was substantially greater than in either the T1 or T2 image alone, and the number of manually detected PVS was significantly increased (Sepehrband et al., 2019). However, compared to T1 or T2 alone, EPC did not significantly improve count correlations with expert evaluations (Sepehrband et al., 2019).
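The core operation of EPC, a voxel-wise ratio of co-registered T1 and T2 images, can be sketched as follows. This is a minimal illustration that assumes the images are already co-registered and that a brain mask is available; it omits the denoising and filtering steps of the published pipeline. The function name, the `eps` guard, and the toy intensities are hypothetical.

```python
import numpy as np


def epc_ratio(t1, t2, brain_mask, eps=1e-6):
    """Voxel-wise T1 / T2 ratio inside the brain mask; zero elsewhere."""
    t1 = np.asarray(t1, dtype=float)
    t2 = np.asarray(t2, dtype=float)
    brain_mask = np.asarray(brain_mask, dtype=bool)
    if not (t1.shape == t2.shape == brain_mask.shape):
        raise ValueError("T1, T2 and mask must be co-registered to the same grid")
    epc = np.zeros_like(t1)
    valid = brain_mask & (t2 > eps)     # avoid dividing by zero or near-zero
    epc[valid] = t1[valid] / t2[valid]
    return epc


# Toy intensities (arbitrary units, not measurements):
# first voxel ~ white matter, second voxel ~ PVS (dark on T1, bright on T2).
t1 = np.array([[100.0, 60.0], [100.0, 0.0]])
t2 = np.array([[50.0, 120.0], [50.0, 0.0]])
mask = np.array([[True, True], [False, False]])
print(epc_ratio(t1, t2, mask))
```

Because PVS are hypointense on T1 and hyperintense on T2, the division lowers the PVS value from both directions: in the toy example the white matter to PVS intensity ratio is 100/60 on T1 and 120/50 (inverted) on T2, and their product, 4, in the ratio image.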

Figure 8. The enhanced perivascular contrast (EPC) pipeline proposed by Sepehrband et al. (2019). In EPC, a T1-weighted image is combined with a co-registered T2-weighted image to improve the visibility of PVS. (A) Subsequently, non-local means filtering is applied and PVS are automatically segmented by a Frangi filter. (B) Comparisons between Frangi filtered PVS from a lone T1 image (top), a lone T2 image (middle), and combined modalities in EPC (bottom). Segmented voxels are labeled in red. Reproduced under the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/).

Typically, Dice scores are used to compare the prediction of an algorithm to a ground truth that was generated independently, usually by manual human segmentation. However, in the study by Sepehrband et al. (2021), the strong Dice scores reported (74–95%) compared the predicted PVS maps before and after manual correction. Therefore, these scores are biased, and the algorithm should be expected to score lower when compared against independently produced segmentations.

Importantly, Sepehrband et al.’s (2021) reported Dice scores highlight how the metrics can be misleading. Comparing algorithm predictions before and after manual correction yielded an average Dice score of 95% (Sepehrband et al., 2021). However, when the segmentations were further corrected with the aid of FLAIR images, the average Dice score was 74%. FLAIR images are often used to differentiate PVS structures from confounds such as WMH and lacunes, and can therefore help identify false positives (Wardlaw et al., 2013). Clearly, the quality of the segmentations improved after FLAIR-accompanied corrections, yet the Dice metric declined by more than 20 percentage points (Sepehrband et al., 2021). The implication here is that, without an accompanying FLAIR image, the PVS measurements would have been inflated by false positives, via inclusion of WMHs mistaken as PVS. It also suggests Frangi filtering may not be suitable for investigating PVS in disease cohorts without the additional FLAIR modality (Frangi et al., 1998; Ballerini et al., 2016). Notably, the correction for FLAIR WMHs may have removed PVS that reside within the WMHs.

Machine learning approaches

Machine learning algorithms differ from classical methods in that they automatically learn the features of their target object, such as its shape, intensity, and location. The main ML models that have been applied to PVS segmentation are random forests and CNNs.

The first instance of machine learning for automated segmentation of PVS was by Park et al. (2016), who used a random forest model. With the help of a Frangi filter, 17 MRI images were manually segmented and used to train random forest models. Compared to models trained with intensity thresholded images or vesselness filtered images, the random forest performed best when trained on normalized Haar filters, enabling it to learn discriminative PVS features (Park et al., 2016). Dice scores for the optimal model averaged 64%. Sensitivity was lower at 59%, and PPV was 73%. Importantly, high-resolution 7 Tesla (7T) MRI images, with both T1 and T2 modalities, were used. The labels used to train the random forests were initiated using a Frangi filter, then refined manually, i.e., generated semi-automatically. As such, the resultant random forests would likely be biased towards voxels that respond strongly to the Frangi filter.

Similarly, Zhang et al. (2017) used a structured random forest to delineate PVS in 7T images. Here, three different filters based on vascularity were used to differentiate PVS from background voxels. This was supplemented with an entropy-based sampling strategy to select regions of interest from the image. Similar to Park et al. (2016), an average Dice score of 66%, sensitivity of 65% and PPV of 68% were achieved (Zhang et al., 2017).

The first instance of automated PVS segmentation using a fully convolutional neural network was by Lian et al. (2018), who applied a multi-scale and multi-channel CNN architecture, named the M2EDN, for this task (Figure 9). The multi-scale feature enables the model to incorporate both small and large contextual details to improve PVS detection. Frangi filtering was also used as a secondary input channel, providing more information to the model, and lastly initial predictions from the network were used as a third channel of information to refine predictions. With 7T, T2-weighted data, an average Dice score of 77% was achieved (sensitivity = 74%, PPV = 83%), surpassing the previous machine learning approaches with similar data parameters (Lian et al., 2018).

Figure 9. The M2EDN neural network architecture proposed by Lian et al. (2018). M2EDN stands for multi-channel, multi-scale encoder-decoder network. The model features a Frangi filter segmentation as a second input channel. Conv, convolution; ReLU, rectified linear unit; Pool, max pooling. Reprinted from Lian et al. (2018), with permission from Elsevier, conveyed through the Copyright Clearance Center, Inc.

To extend this work, it is necessary to explore methods of automated segmentation with lower-quality datasets that are more accessible than 7T images. Boutinaud et al. (2021) employed another CNN, the u-net, together with an autoencoder to segment 3T, T1-weighted images. Typically, model parameters are initialized randomly. However, an autoencoder enables meaningful initialization of model parameters to optimize the performance of the final model (Kingma and Welling, 2013). In this case, it is unclear whether the inclusion of an autoencoder improved performance. Trained on 40 manually labeled images, the model achieved a voxel-wise Dice score of 51% in the white matter and 66% in the basal ganglia. Notably, for PVS clusters larger than 10 mm³, Dice scores above 90% could be reliably achieved (Boutinaud et al., 2021).

The noticeable decrement in performance compared to previous ML approaches can be attributed to the lower image quality and resolution of the data (7T vs. 3T) (Lian et al., 2018). However, Wang et al. (2021) recently published a study applying Lian et al.’s (2018) CNN to 3T, T2-weighted data with a Dice score of 70%. A similar performance level was achieved when the same CNN was applied to T1-weighted images, suggesting that the multi-channel and multi-scale architecture is superior to a regular u-net with an autoencoder (Kingma and Welling, 2013; Huang et al., 2021). For an in-depth discussion of each of the techniques, please refer to previous work (Barisano et al., 2022a; Moses et al., 2022).

Limitations of automated segmentations

With the automated delineation of perivascular spaces, new ways of understanding the glymphatic system are possible. However, these studies are based on algorithms with certain disadvantages. The conventional methods relying on image processing techniques require further parameter optimization for different datasets (Ballerini et al., 2016; Smith et al., 2020, Preprint; Boutinaud et al., 2021; Bernal et al., 2022). Currently, only one fully automatic segmentation pipeline is freely available, making it difficult to replicate previous methods (Boutinaud et al., 2021). Most machine learning approaches were trained or tested with data from high field (7T) MRI scanners that are not commonly used clinically (Park et al., 2016; Zhang et al., 2017; Lian et al., 2018). Several algorithms require multiple imaging modalities (Boespflug et al., 2018; Sepehrband et al., 2021). However, the primary MRI sequences acquired are T1 or T2-weighted images, with FLAIR sequences less commonly acquired (Schwartz et al., 2019).

Currently, there is no gold standard for automated 3D PVS segmentation in either T1 or T2 data, leading to highly heterogeneous methodologies and results across research groups. The issue is further compounded by the different scanner protocols and processing methods being utilized. Factors affecting the signal-to-noise (SNR) and contrast-to-noise (CNR) ratios of PVS, including field strength, signal weighting, and resolution, will affect PVS detection (Zong et al., 2016). Therefore, PVS detection techniques applied to either T1, T2, or both, and for different image qualities, are yet to be adopted as the gold standard.

The ideal PVS auto-segmentation tool should meet the following criteria:

(1) Open source and freely available, alongside optimized parameters or trained weights for datasets of different image qualities. Publication of algorithms in code repositories such as GitHub enables external reproducibility.

(2) High performance with regard to voxel-wise evaluations, including Dice coefficients, sensitivity and PPV metrics based on ground truth manual segmentations.

(3) Able to segment PVS throughout the brain, including the gray and white matter, basal ganglia, hippocampus, and brainstem.

(4) Robust to noise and able to differentiate PVS from structurally similar objects such as WMHs, lacunes, microbleeds, and imaging artifacts (Figure 4).

Recommendations for model development

To objectively adhere to the criteria, we make the following recommendations, which relate to establishing the validity, robustness and reproducibility of the model.

(1) Validate segmentation performance on multiple open access datasets.

(2) Benchmark against previously published and optimized algorithms.

(3) Evaluate against high quality manual segmentations.

(4) Publish PVS voxel counts and volumes, image quality metrics including CNR, field strength, and voxel size as descriptive variables.

(5) Report all relevant segmentation metrics: Dice scores, sensitivity and PPV.

(6) Evaluate algorithm robustness on a noisy dataset, including images with MRI artifacts, and PVS mimics, e.g., lacunes, microbleeds, and WMHs (Wardlaw et al., 2013).

(7) Publish visual examples of algorithm predictions with 3D renders and 2D slices, displaying PVS both cross-sectionally and lengthwise. Examples should be either the default contrast that was input into the algorithm or contrast adjusted to improve the visibility of PVS.
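As an illustration of recommendation (4), the descriptive variables for a ground truth mask can be derived directly from the mask and the voxel dimensions. The sketch below assumes a binary 3D mask and voxel sizes in millimetres. The function name `pvs_descriptives` and the example values are hypothetical.

```python
import numpy as np


def pvs_descriptives(mask, voxel_size_mm):
    """Return the PVS voxel count and total PVS volume (mm^3) of a binary mask."""
    mask = np.asarray(mask, dtype=bool)
    voxel_volume = float(np.prod(voxel_size_mm))  # mm^3 per voxel
    n_voxels = int(np.count_nonzero(mask))
    return {
        "voxel_count": n_voxels,
        "voxel_volume_mm3": voxel_volume,
        "pvs_volume_mm3": n_voxels * voxel_volume,
    }


# Hypothetical mask with 10 labeled voxels at 0.5 mm isotropic resolution.
mask = np.zeros((4, 4, 4), dtype=bool)
mask.ravel()[:10] = True
report = pvs_descriptives(mask, voxel_size_mm=(0.5, 0.5, 0.5))
print(report)
```

Reporting these values alongside Dice scores lets readers judge whether the ground truth contains a plausible amount of PVS for the cohort and image quality.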

Points 1–2 relate to establishing the reproducibility and benchmarking of the algorithm. Validation of the algorithm should be completed on multiple open access datasets of heterogeneous image quality. Segmentation of PVS can be conducted in tools such as ITKSnap, Freeview, or Osiris (Yushkevich et al., 2006; Fischl, 2012; Othman et al., 2016). Publication of manual segmentations and predictions on open access datasets would contribute significantly towards progress in PVS research. Moreover, publishing source code and pipelines would enable direct and fair comparisons between PVS detection methods. In this regard, ML approaches with trained weights should be compared to fully optimized versions of traditional methods, e.g., the Frangi filter.

Points 3–5 relate to validating the performance of the algorithm. Currently, as there is no consensus for automated methods of PVS segmentation, the gold standard of manual segmentation should be used as a benchmark for judging algorithm performance. Semi-automated methods of model validation may result in inflated and biased Dice scores (Sepehrband et al., 2021; Langan et al., 2022; Ranti et al., 2022). Another reason for inflated Dice scores may be the under-labeling of PVS, which lowers the number of PVS that the algorithm is required to detect. This can be observed as lower-than-expected average counts of PVS voxels and volumes. In the same way that a study publishes the sample size and the percentage of subjects that developed a disease, publishing counts of PVS voxels in the ground truth would make the resulting Dice scores more meaningful. An algorithm that performs well on images with a small number of PVS may not perform as well for MRI scans with greater volumes of PVS.

Noisy labels are inherent in neuroimaging segmentations. Many factors can cause PVS voxels to go undetected by manual raters, including experience, time constraints, image quality and contrast, and the inclusion/exclusion criteria. Therefore, an experienced neuroradiologist should validate the integrity of ground truth segmentations. A detailed guide for detecting PVS in T2 MRI images has been published by Potter et al. (2015c). As this guide primarily focuses on T2-weighted axial slices for the purpose of assigning PVS enlargement scores according to a rating scale, we make further recommendations to aid segmenters in producing high-quality 3D PVS segmentations:

(1) All voxels within a cluster should be labeled such that no hypointense or hyperintense voxels connected to that cluster remain visible, in T1 or T2-weighted images, respectively.

(2) Where there are cerebral blood vessels there are PVS, including the white matter, gray matter, basal ganglia, hippocampus, and brainstem (Figure 10).

Figure 10. 3D PVS segmentation from a T1-weighted MRI scan. PVS voxels are labeled in red (top row), and the corresponding unlabeled slices are in the bottom row. Basal ganglia PVS are visible in the axial (left), sagittal (middle), and coronal (right) slices.

(3) In the white matter, PVS are generally directed towards the ventricles (Figure 10).

(4) In the basal ganglia, PVS following the lenticulostriate arteries are most obvious in the sagittal views, appearing to travel superiorly with an anterior curvature (Figure 10).

(5) To distinguish PVS from WMHs, where available, FLAIR images should be used (Wardlaw et al., 2013). WMHs are visible on all three sequences: T1 (as hypointense), T2, and FLAIR. On FLAIR sequences, the PVS are not visible but WMHs are.

Given the numerosity and at times inconspicuous appearance of PVS in MRI, inexperienced raters may find it difficult to detect and label PVS clusters in their entirety (Wardlaw et al., 2013). PVS clusters should be labeled when there is sufficient evidence based on features such as voxel intensity, cluster size (e.g., >4 voxels), directionality, and brain region. Additionally, segmentation software such as ITKSnap allows for manual contrast adjustments (Yushkevich et al., 2006). Contrast adjustment can be very useful, as varying contrast or brightness levels may be required for different regions of the image, e.g., due to bias field inhomogeneities. Improving PVS visibility may require brightening in one region of the image and darkening in another.
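One way to apply the cluster-size criterion consistently is to measure the size of every labeled cluster and flag those at or below the chosen threshold for review. The sketch below is an assumption-laden illustration: it uses 6-connectivity (face-adjacent voxels) and a plain breadth-first search so that it depends only on NumPy, and the function name `cluster_sizes` is hypothetical. In practice, a connected-component routine from an image-processing library and a different connectivity may be preferred.

```python
from collections import deque

import numpy as np


def cluster_sizes(mask):
    """Voxel counts of 6-connected clusters in a 3D binary mask."""
    mask = np.asarray(mask, dtype=bool)
    visited = np.zeros_like(mask)
    # Face-adjacent neighbours only (an assumption; 26-connectivity is also common).
    offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    sizes = []
    for seed in zip(*np.nonzero(mask)):
        if visited[seed]:
            continue
        visited[seed] = True
        queue = deque([seed])
        size = 0
        while queue:
            x, y, z = queue.popleft()
            size += 1
            for dx, dy, dz in offsets:
                n = (x + dx, y + dy, z + dz)
                inside = all(0 <= c < s for c, s in zip(n, mask.shape))
                if inside and mask[n] and not visited[n]:
                    visited[n] = True
                    queue.append(n)
        sizes.append(size)
    return sizes


# Hypothetical labels: one elongated 5-voxel cluster and one isolated voxel.
mask = np.zeros((5, 5, 5), dtype=bool)
mask[1, 1, 0:5] = True
mask[3, 3, 3] = True

sizes = cluster_sizes(mask)
kept = [s for s in sizes if s > 4]      # clusters meeting the >4 voxel example
flagged = [s for s in sizes if s <= 4]  # clusters to re-inspect manually
print(sizes, kept, flagged)
```

Flagged clusters are not necessarily wrong. The size criterion is only one of the features listed above, so small clusters should be re-inspected rather than removed automatically.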

The visibility of PVS is strongly associated with image quality. The field strength of the MRI scanner has been shown to significantly affect PVS detection (Barisano et al., 2021a). However, this does not take into account preprocessing methods where images are enhanced to increase PVS visibility. For example, by combining 3T T1 and T2 images in the EPC pipeline, it is likely that the CNR for PVS is greater than that of 7T data without preprocessing (Sepehrband et al., 2019). Ultimately, CNR is the most direct and comparable metric of PVS visibility for a given dataset, as it reflects variables including scanner field strength and preprocessing methods. Publication of SNR and CNR values is useful for determining whether appropriate comparisons between studies with different datasets can be made, since models developed on data of lower SNR may not be applicable to higher SNR datasets. This will be further discussed in section “Recommendations for future research and clinical applications.”

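$$\mathrm{SNR} = \frac{\mathrm{MEAN}_{\mathrm{GM}}}{\mathrm{SD}_{\mathrm{AIR}}}; \qquad \mathrm{CNR} = \frac{\mathrm{MEAN}_{\mathrm{WM}} - \mathrm{MEAN}_{\mathrm{GM}}}{\mathrm{SD}_{\mathrm{AIR}}}$$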

Equations 2. SNR, signal-to-noise ratio; CNR, contrast-to-noise ratio; MEAN_GM, mean gray matter intensity; SD_AIR, standard deviation of the intensity of empty voxels outside the head; MEAN_WM, mean white matter intensity. These metrics should be calculated with respect to PVS voxels and within the bounds of the 3D brain extracted volume. These equations apply for T1-weighted images. For T2-weighted images, the MEAN_WM and MEAN_PVS are interchanged.
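The two ratios in Equations 2 can be computed from an image and three region masks. The sketch below follows the equations as printed (gray matter, white matter and air) and assumes the masks have already been produced by some tissue segmentation. The function name `snr_cnr`, the mask arguments and the toy intensities are hypothetical, and the population standard deviation is used as an assumption.

```python
import numpy as np


def snr_cnr(image, gm_mask, wm_mask, air_mask):
    """SNR = MEAN_GM / SD_AIR and CNR = (MEAN_WM - MEAN_GM) / SD_AIR."""
    image = np.asarray(image, dtype=float)
    gm_mask = np.asarray(gm_mask, dtype=bool)
    wm_mask = np.asarray(wm_mask, dtype=bool)
    air_mask = np.asarray(air_mask, dtype=bool)

    mean_gm = image[gm_mask].mean()
    mean_wm = image[wm_mask].mean()
    sd_air = image[air_mask].std()  # noise estimate from voxels outside the head
    if sd_air == 0:
        raise ValueError("air region has zero variance; noise cannot be estimated")

    snr = mean_gm / sd_air
    cnr = (mean_wm - mean_gm) / sd_air
    return snr, cnr


# Toy T1-like volume (hypothetical intensities): air, gray matter, white matter.
image = np.array([[[0.0, 2.0], [50.0, 50.0], [80.0, 80.0]]])
air_mask = np.zeros(image.shape, dtype=bool)
gm_mask = np.zeros(image.shape, dtype=bool)
wm_mask = np.zeros(image.shape, dtype=bool)
air_mask[0, 0, :] = True
gm_mask[0, 1, :] = True
wm_mask[0, 2, :] = True

snr, cnr = snr_cnr(image, gm_mask, wm_mask, air_mask)
print(snr, cnr)
```

Here the air voxels have a standard deviation of 1, so the SNR is 50 and the CNR is 30. To express contrast with respect to PVS voxels, as the caption suggests, the corresponding mask would be substituted for one of the tissue masks.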

Alternative methods of assessing segmentation performance include PVS count correlations or interclass concordance correlations. These are convenient assessments of the reliability and accuracy of PVS cluster quantification, but they do not assess voxel-wise segmentation performance and therefore do not validate voxel-based volumetric assessments of PVS. One critique of the Frangi filter is that its performance deteriorates as the size of a PVS cluster increases (Bernal et al., 2022). The implication for PVS research is that PVS counts, rather than volumes, are more likely to demonstrate statistical differences. Thus, count correlations do not convincingly demonstrate high PVS detection at the voxel level, and Dice metrics should be preferred for validating segmentation performance.

Findings from related fields of neuroimaging and lesion detection can offer useful insights applicable to PVS segmentation. In MRI lesion segmentation of patients with multiple sclerosis, inter-rater Dice scores (between two human raters) around 60% are typical (Egger et al., 2017). For these tasks, there are a small number of large lesions to be detected in a single brain scan. In comparison, PVS occur repeatedly and throughout the brain, numbering in the hundreds. Given that PVS are more numerous and diffuse, one would expect inter-rater Dice scores for the task of PVS segmentation to be much lower. Two publications have assessed inter-rater agreement in PVS segmentation, with median or average inter-rater Dice scores of 26.7 and 49% (Spijkerman et al., 2022), implying that segmentation algorithms exceeding this 50% Dice threshold have superseded human performance (Spijkerman et al., 2022). Necessarily, visual inspections should be conducted to verify that the model outperforms the human labels. Broadly, the algorithm should be critiqued based on its ability to consistently detect all PVS clusters labeled by the human, and whether it can reliably label all PVS voxels belonging to each cluster. In the ideal scenario, where the human segmentation is free of error and imperfection, a Dice score of 100% would indicate a perfect prediction. In practice, the segmentation of medical images often includes noisy labels, i.e., false positives and false negatives; thus, such Dice scores are not realistic.

If a model has genuinely outperformed manual segmentations, then presumably it has labeled not only all PVS detected by the rater, but also PVS that evaded human detection. In this case, a high Dice score should be observed alongside a very high sensitivity score and a lower PPV, as the model is presumably detecting more PVS clusters than were labeled manually. This may indeed be the case for Sepehrband et al.’s (2021) approach with non-local means filtering and the Frangi filter (Dice = 74%, Sensitivity = 98%, PPV = 61%) (Figure 8). Thus, all three metrics serve complementary roles in assessing model predictions. Furthermore, recording Dice scores for different regions ensures the model performs adequately throughout the brain, including the white matter, basal ganglia, midbrain, and hippocampus. Performance in one region might not be indicative of performance in another region, as the appearance and anatomy of PVS in the white matter differ substantially from PVS in the BG (Figure 10). Not only are these regions different structurally, but the morphology of PVS may differ between regions. For example, PVS inferior to the putamen tend to be larger in width and volume than those that are above the level of the putamen (Pullicino et al., 1995; Wardlaw et al., 2013; Bouvy et al., 2014).
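The complementary role of the three metrics follows from their definitions: for a single image, the Dice score is the harmonic mean of sensitivity and PPV, since 2TP / (2TP + FP + FN) = 2 × Sensitivity × PPV / (Sensitivity + PPV). The short sketch below checks this relation against the figures quoted above. The function name is illustrative, and because published values are typically averaged across subjects, the identity is only expected to hold approximately for reported means.

```python
def dice_from_sensitivity_ppv(sensitivity, ppv):
    """Dice as the harmonic mean of sensitivity and PPV (single-image identity)."""
    if sensitivity + ppv == 0:
        return 0.0
    return 2 * sensitivity * ppv / (sensitivity + ppv)


# Values quoted in the text for the non-local means + Frangi filter approach.
implied_dice = dice_from_sensitivity_ppv(sensitivity=0.98, ppv=0.61)
print(round(implied_dice, 3))
```

The implied value is about 0.75, close to the reported Dice of 74%. A high Dice driven by very high sensitivity and modest PPV is the pattern expected when a model labels more PVS than the rater did.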

Moreover, testing the robustness of the algorithm in noisy data with image artifacts and PVS confounds will be useful. This is especially important when the algorithm is to be applied in disease cohorts, where such lesions and PVS mimics are commonly observed. If algorithms are to be useful in characterizing PVS in disease, they need to be validated in similar data. Almost all published algorithms were developed on “clean” images without the PVS mimics that commonly occur in patients with neurodegenerative diseases. To our knowledge, only Sepehrband et al. (2021) have evaluated the effect of WMH presence on the Frangi filter. Co-registered FLAIR scans were used to manually correct PVS masks generated by the Frangi filter, and a substantial decrement of 21% in the Dice metric was found (Sepehrband et al., 2021). Thus, if findings in pathological cohorts derived from these algorithms are to be trusted, they need to be validated against noisy and pathological data.

Key biological findings from perivascular spaces segmentations

Over the course of the lifespan, changes in gray matter, white matter, and ventricular volumes are numerous and have been well documented (Bethlehem et al., 2022). The growth trajectory of perivascular spaces visible in MRI has not been as comprehensively characterized, but insights can be gleaned from multiple studies (Table 2).

Table 2. Summary of the PVS research that has resulted from 3D segmentations.

In adolescence (12–21 years old), PVS appear to be bilaterally symmetric, and tend to be visible more often in the frontal and parietal lobes compared to the temporal and occipital lobes (Piantino et al., 2020). Male adolescents (mean PVS count = 98.4) also had significantly greater WM-PVS counts than females (mean PVS count = 70.7) (Piantino et al., 2020). Thus, a spatial pattern of PVS enlargement and sex differences arise early on. The biological mechanisms behind these observations, and whether this spatial distribution is constant or changes with age, are unclear.

In young adults (21–37 years old), the mean diameter of PVS in the frontal, parietal-occipital, and temporal lobes and the subcortical nuclei (basal ganglia and thalamus) were significantly different, with those in the subcortical nuclei wider than those in the three WM regions (Zong et al., 2016). Of these four regions, the parietal-occipital lobe exhibited the highest PVS volume fraction, and the temporal lobe exhibited the lowest. Moreover, heatmaps of the spatial distribution of PVS location, length, and tortuosity have been generated (n = 50, age range = 27–78 years old) (Spijkerman et al., 2022). Table 3 summarizes the publications that have explored the quantitative and morphological attributes of PVS.

Table 3. Summary of the PVS metrics that have been investigated by previous publications.

In a cohort of healthy older adults (n = 160, mean age = 60.4 years old), both WM and BG-PVS have been associated with cerebral small vessel disease markers, such as WMHs (Wang et al., 2021). BG-PVS volumes, counts, and width, and WM-PVS counts were associated with hypertension (Wang et al., 2021). Moreover, greater WM-PVS sizes were associated with the presence of diabetes. In this cohort, median counts of 490 and 65 PVS clusters, with median PVS volumes of 2,371 and 166 mm³, were detected in the white matter and basal ganglia, respectively (Wang et al., 2021).

In the Lothian Birth Cohort (LBC1936, n = 533, mean age = 72.6 years old), analysis of Frangi-detected WM-PVS found mean PVS size (volume per cluster), length and width, as opposed to counts, to be associated with WMH (Ballerini et al., 2020a). PVS size and widths were also associated with hypertension and risk of stroke (Ballerini


FRONTIERS IN NEUROSCIENCE
