Outlier detection in multimodal MRI identifies rare individual phenotypes among more than 15,000 brains

1 INTRODUCTION

Outliers are defined as observations differing by a large amount from most other observations (Tan, Steinbach, & Kumar, 2006). By this definition, outliers constitute a small portion of a dataset and are exceptional patterns in some sense. Detecting outliers is of interest in brain imaging for two major reasons. First, outliers can occur due to imaging artifacts or noise. For example, head motion adversely affects brain morphometry, diffusion, and connectivity measurements (Power, Schlaggar, & Petersen, 2015; Reuter et al., 2015; Yendiki, Koldewyn, Kakunoori, Kanwisher, & Fischl, 2014) and causes outliers in these data. Second, and more importantly, some outliers represent unusual phenotypes that deserve special attention. For example, an anomalous MRI may indicate the presence of neurological disease that requires clinical attention. Certain unusual phenotypes may also be interesting for follow-up to determine the underlying mechanism for the large deviations of their brain MRI from the population.

Outlier detection methods applied in brain imaging can be categorized in many ways. One common way is based on whether the method makes use of labeled datasets to train the outlier detection model: supervised methods use labeled datasets that contain both labeled outliers and labeled non-outliers for training; semi-supervised methods use labeled datasets that only contain labeled non-outliers for training; and unsupervised methods use unlabeled datasets for training (Goldstein & Uchida, 2016). Using the available diagnostic labels for all subjects or at least the non-outlier subjects, outlier detection studies have employed a variety of algorithms, such as one-class support vector machine, Gaussian process regression, or autoencoders, and these have been applied in a supervised or semi-supervised manner to quantify the outlierness of healthy individuals or patients (Marquand, Rezek, Buitelaar, & Beckmann, 2016; Mourao-Miranda et al., 2011; Pinaya, Mechelli, & Sato, 2019; van Hespen et al., 2021). However, diagnostic labels are not always available, making the supervised or semi-supervised approaches challenging to implement across the board. Unsupervised outlier detection methods are needed for unlabeled brain imaging datasets, for example, the UK Biobank (UKB), an ongoing large epidemiological cohort (Miller et al., 2016).

The UKB is enrolling 500,000 subjects 40–69 years of age for extensive phenotyping and subsequent long-term monitoring of health outcomes (Allen et al., 2012). In this cohort, 100,000 subjects are currently being invited back for MRI, making it the largest multimodal MRI cohort in the world (Littlejohns et al., 2020). Because it enrolls a large population of this age range, this unlabeled brain imaging dataset includes healthy and presymptomatic subjects, as well as a small fraction of subjects with various clinically relevant diseases. Over time, many more subjects in this cohort will be identified with a clinically relevant disease (Miller et al., 2016). Given its large sample size, the UKB cohort provides a unique opportunity for developing unsupervised outlier detection methods to identify rare imaging phenotypes. These rare imaging phenotypes could be clinically relevant or informative for discovering new processes and mechanisms.

In the present study, a two-level outlier detection and screening methodology was developed to characterize individual outlying MRI results across multiple brain imaging phenotypes among more than 15,000 UKB subjects. We made use of the multimodal MRI data to derive ventricular, white matter, and gray matter-based imaging phenotypes of the brain (Figure 1a). Every subject was parameterized with an “outlier score” per imaging phenotype in an unsupervised manner without any prior labels (Figure 1b). This outlier score quantifies how far an individual deviates from most other subjects. Test–retest reliability of outlier scores of each imaging phenotype was characterized in the subjects that had repeat MRI scans, and any less reliable imaging phenotype was not used for further individual-level outlier screening. Individual extreme outlier subjects were categorized according to whether there were data collection/processing errors, or whether the individual had radiological findings or appeared normal as determined by a board-certified neuroradiologist (Figure 1c). Similar outlier detection and screening procedures were also carried out separately in the Human Connectome Project (HCP) dataset (Van Essen et al., 2013), and the extreme outlier subjects from this young adult cohort that might be interesting for follow-up are also described.

FIGURE 1 Schematic illustration of outlier detection and screening pipeline. (a) Brain imaging phenotypes used for outlier detection. (b) Primary screening: calculation of outlier scores. (c) Secondary screening: investigation of individual extreme outliers.

2 MATERIALS AND METHODS

2.1 Main dataset

The multimodal brain MRI data of 19,406 subjects (9,170 males and 10,236 females; age 44–80) at the initial imaging visit were downloaded from the UKB. This included T1-weighted (T1w) MPRAGE and T2-weighted (T2w) FLAIR structural MRI, spin-echo echo-planar imaging (EPI) diffusion MRI (dMRI), and gradient-echo EPI resting-state functional MRI (rsfMRI) data. Some subjects only had usable T1w data available in this sample, resulting in a reduced initial sample size of other MRI modalities. Following exclusions based on automatic quality control described below in Section 2.3, the final sample size for each imaging modality varied from 15,166 to 19,076 (hereafter referred to as UKB discovery group for this final sample). The detailed number of exclusions and the demographic information of the final sample are summarized in Table S1. The data were acquired on identical 3 T Siemens Skyra MRI scanners, and detailed acquisition protocols can be found elsewhere (Alfaro-Almagro et al., 2018). The UKB study was approved by the North West Multi-centre Research Ethics Committee, and informed consent was obtained from all participants. The present study was approved by the Office of Human Subjects Research Protections at the National Institutes of Health (ID#: 18-NINDS-00353).

2.2 Image preprocessing and extraction of imaging phenotypes

The following six commonly used brain imaging phenotypes were extracted from imaging preprocessing outputs: ventricular volume (VV), white matter lesion volume (WMLV), fractional anisotropy (FA), mean diffusivity (MD), cortical thickness (CTh), and resting-state functional connectivity (RSFC). The detailed procedures are described as follows.

The raw T1w MPRAGE and T2w FLAIR images were preprocessed by the HCP structural pipeline (v4) (Glasser et al., 2013) based on FreeSurfer (v6) (Fischl, 2012). For the subjects without usable T2w FLAIR images, the ventricles were segmented from their T1w images using FreeSurfer. The ventricular segmentations were manually inspected for each subject, and 213 subjects with large segmentation defects in their enlarged ventricles were reprocessed with the "-bigventricles" flag in FreeSurfer to correct the defects. Each subject's VV was calculated by summing the volumes of the lateral ventricles, temporal horns of the lateral ventricles, choroid plexuses, third ventricle, and fourth ventricle. WMLV was calculated by the Brain Intensity Abnormality Classification Algorithm (BIANCA; Griffanti et al., 2016), a k-nearest-neighbor-based automated supervised method, using both T2w FLAIR and T1w images as inputs. Unsmoothed CTh values in the standard CIFTI grayordinate space (with folding-related effects corrected) were averaged within the regions of interest (ROIs) of the HCP multimodal parcellation atlas (360 regions) (Glasser et al., 2016), and these ROI-wise CTh values were used for primary screening.

The dMRI data underwent FSL eddy-current and head-movement correction (Andersson & Sotiropoulos, 2016), gradient distortion correction, diffusion tensor model fitting using the b = 1,000 shell (Basser, Mattiello, & LeBihan, 1994), and Tract-Based Spatial Statistics (TBSS) analyses (Smith et al., 2006). The TBSS skeletonized images were averaged within the ROIs of the Johns Hopkins University white matter atlas (Mori et al., 2008). Here, the original MD values were multiplied by 10,000 to convert them to units of 10⁻⁴ mm²/s. The ROI-wise FA and MD values of 27 major white matter ROIs (Table S2) were used for primary screening.

The rsfMRI data were preprocessed by the UKB rsfMRI pipeline (v1) (Alfaro-Almagro et al., 2018), and the volumetric FIX-denoised data (Griffanti et al., 2014; Salimi-Khorshidi et al., 2014) were brought to the standard CIFTI grayordinate space using Ciftify (v2.3.2) (Dickie et al., 2019). For each subject, the standard deviation (SD) of the percent-change time series of each grayordinate was calculated, and grayordinates with an SD greater than 0.1 were considered noisy and were masked from further analyses. Using a well-established RSFC-based parcellation scheme (333 parcels) (Gordon et al., 2016), RSFC was quantified by the Pearson cross-correlation coefficient between the ROI-averaged time series of each pair of parcels, both with and without global signal regression. In addition, RSFC was quantified using partial correlations with Tikhonov regularization (ρ = 0.5; FSLNets) (Pervaiz, Vidaurre, Woolrich, & Smith, 2020). Due to the symmetry of the RSFC matrices, the upper triangular part of the matrix (333 × 332/2 = 55,278 elements) from each of these three RSFC evaluation methods was used for primary screening.
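
To make the RSFC construction concrete, the following is a minimal MATLAB sketch of the full-correlation variant, not the actual UKB/Ciftify pipeline code: it assumes a parcel-averaged time-series matrix, optionally regresses out the global signal, and vectorizes the upper triangle of the correlation matrix. The variable names and placeholder data are illustrative assumptions.

```matlab
% Illustrative sketch, not the actual pipeline code.
T = 490;  P = 333;                      % assumed number of time points and parcels
ts = randn(T, P);                       % placeholder parcel-averaged time series

% Optional global signal regression (one of the three RSFC variants)
gs = mean(ts, 2);                       % global signal
X  = [ones(T, 1), gs];
ts_gsr = ts - X * (X \ ts);             % residualize every parcel on the global signal

R = corr(ts_gsr);                       % 333 x 333 full-correlation RSFC matrix
mask = triu(true(P), 1);                % strictly upper triangle
rsfc_vec = R(mask);                     % 333*332/2 = 55,278 unique edges
```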

2.3 Automatic quality control

Recent research has shown the importance of quality control in big neuroimaging datasets (Maximov et al., 2021; Monereo-Sanchez et al., 2021). Exclusion of poor quality data was performed based on eight quality control metrics. First, because the preprocessing of all imaging phenotypes relied on usable T1w images (Alfaro-Almagro et al., 2018), subjects with low quality T1w images were excluded from further analyses. The quality of T1w images was evaluated quantitatively using the Computational Anatomy Toolbox (CAT12) (Dahnke, Yotter, & Gaser, 2013; Gaser & Dahnke, 2016), which generates a single aggregated metric for the overall quality of each T1w image on a 100-point scale, with 100 being the best possible score. T1w images with scores below 75 were excluded (Gaser & Dahnke, 2016; Gilmore, Buser, & Hanson, 2021).

For FA and MD, two head motion parameters and one registration quality parameter were used for quality control. The two head motion parameters were each subject's mean and largest volumetric movements between adjacent dMRI frames. The registration quality parameter was each subject's mean deformation in the TBSS nonlinear registration. For CTh, FreeSurfer's Euler number, which summarizes surface reconstruction quality (Rosen et al., 2018), was used for quality control. In addition, because T1w/T2w ratio myelin maps are sensitive to subtle errors of registration or surface placement (Glasser et al., 2013), following the multidimensional outlier detection method described below in Section 2.4, an outlier score of the myelin map was calculated per subject and was used for CTh quality control. For RSFC, two head motion parameters were used for quality control: each subject's mean and largest framewise displacement between adjacent EPI volumes. For each of the seven quality control metrics described above, data above the upper inner fence of that metric's distribution were excluded from further analyses. Here, the upper inner fence is the third quartile (Q3) plus 1.5 times the interquartile range (IQR) of the distribution; observations above it are commonly defined in statistics as mild outliers (greater than Q3 + 1.5 × IQR but smaller than Q3 + 3 × IQR) or extreme outliers (greater than Q3 + 3 × IQR) (Tukey, 1977). This upper inner fence threshold has previously been applied in the quality control of neuroimaging data (Monereo-Sanchez et al., 2021).
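
As a concrete illustration of this fence-based exclusion, the following MATLAB sketch flags values of a quality control metric above the upper inner fence Q3 + 1.5 × IQR; it is an assumed illustration, not the study's actual QC script, and the placeholder metric is simulated.

```matlab
% Illustrative sketch; qc_metric is an assumed per-subject QC vector
% (e.g., mean framewise displacement).
qc_metric = [abs(randn(1000, 1)); 5 + rand(10, 1)];   % placeholder data

Q3 = prctile(qc_metric, 75);
upper_inner_fence = Q3 + 1.5 * iqr(qc_metric);        % Tukey's upper inner fence

keep = qc_metric <= upper_inner_fence;                % subjects passing this check
fprintf('Excluded %d of %d subjects\n', sum(~keep), numel(keep));
```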

2.4 Primary screening: Calculation of outlier scores

In primary screening, every subject that passed quality control was parameterized with an outlier score per imaging phenotype. The outlier score quantified the degree of outlierness in that imaging phenotype, and extreme outliers were identified based on the outlier scores. In statistics, extreme outliers in a distribution are defined as observations above Q3 plus three times the IQR of that distribution (Tukey, 1977). For a unidimensional imaging phenotype (VV, WMLV), using VV as an example, the number of IQRs away from the Q3 of the VV distribution was used to define VV outlier scores:

outlier score = (VV − Q3) / IQR,    (1)

where Q3 and IQR are the third quartile and interquartile range of the VV distribution. In this way, the unit of the outlier score is the IQR, and an extreme outlier has an outlier score greater than 3. WMLV outlier scores were calculated similarly. For each multidimensional imaging phenotype (FA, MD, CTh, RSFC), an autoencoder was used to calculate the outlier scores (Hawkins, He, Williams, & Baxter, 2002). Denoting the dimensionality of the imaging phenotype as M and the number of subjects in the UKB discovery group as N, the inputs to the autoencoder were the values of that imaging phenotype across the whole group (M × N), and the autoencoder was trained to replicate this input at its output. By definition, outliers comprise only a small portion of a dataset; therefore, the trained autoencoder cannot replicate these outliers as well as the non-outliers, which results in larger replication errors for the outlying subjects. These replication errors (also known as "autoencoder reconstruction errors") were measured by the root mean square error between each input and the autoencoder-predicted output. Because these replication errors were unidimensional, outlier scores were defined analogously to the unidimensional imaging phenotypes, as the number of IQRs away from the Q3 of the replication error distribution:

outlier score = (replication error − Q3) / IQR,    (2)

where Q3 and IQR are the third quartile and interquartile range of the replication error distribution. Again, the unit of the outlier score is the IQR, and an extreme outlier has an outlier score greater than 3.
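
The following MATLAB sketch illustrates Equation (1) for a unidimensional phenotype; the simulated values are placeholders rather than UKB data, and the same computation applies to any measure, including the replication errors in Equation (2).

```matlab
% Illustrative sketch of Equation (1): outlier score = (value - Q3) / IQR.
vv = [normrnd(30, 8, 15000, 1); normrnd(120, 20, 50, 1)];   % placeholder VV values

outlier_score = (vv - prctile(vv, 75)) ./ iqr(vv);          % unit: IQR
is_extreme    = outlier_score > 3;                          % extreme outlier criterion
fprintf('%d extreme outliers (%.2f%%)\n', sum(is_extreme), 100 * mean(is_extreme));
```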

In the above analyses, to control for the effects of covariates (age, brain volume, and the image quality metrics described in Section 2.3) on outlier detection, their correlations with VV, WMLV, and the autoencoder replication errors of multidimensional imaging phenotypes were evaluated (Figure S1). The covariates with correlation >0.3 were regressed out from VV, WMLV, or the replication errors before applying Equation (1) or (2). As a result, age, brain volume, and CAT12's T1w image quality metric were regressed out from VV. Age was regressed out from WMLV. Age was also regressed out from the autoencoder replication errors of MD. FreeSurfer's Euler number was regressed out from the autoencoder replication errors of CTh.
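
A minimal sketch of this covariate adjustment is shown below; the variable names and placeholder data are assumptions, and the use of the absolute correlation is also an assumption. Only the 0.3 threshold comes from the text.

```matlab
% Illustrative sketch: regress selected covariates out of a measure before
% applying Equation (1) or (2).
n = 19000;                                            % placeholder sample size
age = 44 + 36 * rand(n, 1);
brain_volume = randn(n, 1);
t1_quality = randn(n, 1);
vv = randn(n, 1);                                     % placeholder measure

covs = [age, brain_volume, t1_quality];
r = corr(covs, vv);                                   % correlation of each covariate with the measure
sel = abs(r) > 0.3;                                   % 0.3 threshold from the text

X = [ones(n, 1), covs(:, sel)];
vv_adj = vv - X * (X \ vv) + mean(vv);                % residualized measure, mean restored
```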

Each autoencoder used in the present study comprised an input layer (M dimensions), a hidden layer of 10 neurons, and an output layer (M dimensions). A sparsity proportion of 0.05 was used, and the sparsity regularization coefficient was set to 1. The L2 weight regularization coefficient was set to 0.001. The sigmoid function was used as the activation function, and the mean squared error function adjusted for sparse autoencoders was used as the loss function. A scaled conjugate gradient algorithm (Moller, 1993) was used to train the autoencoders. The autoencoders were implemented using the "trainAutoencoder" function in MATLAB and were trained on a GPU cluster (https://hpc.nih.gov). When the input dataset was too large to fit into GPU memory, multiple autoencoders were used. In these scenarios, the input data were split into four to five smaller subgroups in a stratified manner, preserving the age and sex composition of each subgroup. For each subgroup, an autoencoder was trained using the data of that subgroup as the input. The trained autoencoders were then applied to the full dataset, and the output for the whole group was obtained by averaging the outputs of these autoencoders.
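
Below is a minimal MATLAB sketch of this step using trainAutoencoder; the placeholder input matrix and the choice to train on CPU are assumptions, while the hidden layer size, sparsity proportion, sparsity regularization, and L2 weight regularization follow the values stated above (the sigmoid activation, sparse mean-squared-error loss, and scaled conjugate gradient training are the function's defaults).

```matlab
% Illustrative sketch: sparse autoencoder on an M x N phenotype matrix
% (columns = subjects), followed by Equation (2) on the reconstruction RMSE.
X = rand(27, 2000);                                % placeholder, e.g., ROI-wise FA

autoenc = trainAutoencoder(X, 10, ...              % hidden layer of 10 neurons
    'SparsityProportion', 0.05, ...
    'SparsityRegularization', 1, ...
    'L2WeightRegularization', 0.001, ...
    'UseGPU', false);                              % set to true on a GPU node

Xhat  = predict(autoenc, X);                       % reconstructed (replicated) input
rmse  = sqrt(mean((X - Xhat).^2, 1))';             % one replication error per subject
score = (rmse - prctile(rmse, 75)) ./ iqr(rmse);   % outlier scores, Equation (2)
```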

2.5 Evaluation of reliability of outlier scores and elimination of less reliable imaging phenotype

A subgroup (1,391 subjects) of the UKB discovery group had a repeat MRI session (also known as "retest") 2–3 years after the initial imaging visit (also known as "test"). The test and retest data of these subjects were used to evaluate the long-term reliability of outlier scores. Unlike in the primary screening, in the reliability analysis the volume measurements of the unidimensional imaging phenotypes and the autoencoder replication errors of the multidimensional imaging phenotypes were not adjusted for covariates. For each unidimensional imaging phenotype, the Q3 and IQR were calculated from the full test data and applied to calculate outlier scores for both the test data and, for subjects who were scanned twice, the retest data. For each multidimensional imaging phenotype, the autoencoders trained on the full test data were applied to the retest data. Reliability was quantified by the intraclass correlation coefficient (ICC; Shrout & Fleiss, 1979) between the outlier scores of the test and retest data using a one-way random effects model:

ICC = (MSb − MSw) / (MSb + (k − 1) × MSw),    (3)

where MSb is the between-subject mean square, MSw is the within-subject mean square, and k is the number of observations per subject (McGraw & Wong, 1996). Reliability was defined as excellent (ICC > 0.8), good (0.8 > ICC > 0.6), moderate (0.6 > ICC > 0.4), fair (0.4 > ICC > 0.2), or poor (ICC < 0.2) in the present study.
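
A minimal MATLAB sketch of Equation (3) applied to test–retest outlier scores is given below; the simulated scores are placeholders and the variable names are assumptions.

```matlab
% Illustrative sketch of the one-way random-effects ICC in Equation (3),
% with k = 2 observations (test, retest) per subject.
test   = randn(1391, 1);                           % placeholder test outlier scores
retest = 0.8 * test + 0.2 * randn(1391, 1);        % placeholder retest outlier scores

Y = [test, retest];
[n, k]   = size(Y);
subj_mu  = mean(Y, 2);
grand_mu = mean(Y(:));

MSb = k * sum((subj_mu - grand_mu).^2) / (n - 1);      % between-subject mean square
MSw = sum(sum((Y - subj_mu).^2)) / (n * (k - 1));      % within-subject mean square

ICC = (MSb - MSw) / (MSb + (k - 1) * MSw);
```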

Any imaging phenotype with moderate/fair/poor outlier score reliability was excluded from further analysis of individual outliers. This resulted in the exclusion of RSFC (for details, see Section 3.2).

2.6 Secondary screening: Investigation of individual extreme outliers

The automatic quality control described in Section 2.3 excluded most data collection/processing errors. However, a small number of errors could remain in this large cohort. For example, potentially low quality T2w FLAIR images and potential segmentation errors of white matter lesions were not accounted for because of the lack of a well-established tool for automatic assessment of the quality of T2w FLAIR images or white matter lesion segmentation. To capture potential data collection/processing errors that may occur in extreme outliers, for each remaining imaging phenotype, the extreme outlier subjects were first manually inspected and the ones with the errors were eliminated. For each VV extreme outlier subject, ventricle segmentation was visually inspected by overlaying the border of the segmented ventricle mask on the T1w image. For each WMLV extreme outlier subject, white matter lesion segmentation was visually inspected by overlaying the border of the segmented lesion mask on the T2w FLAIR image. The FA or MD extreme outlier subjects were visually checked for registration and field of view (FOV) coverage. For CTh extreme outlier subjects, their white/pial surface segmentation was visually checked via HCP pipeline structural quality control scenes (https://github.com/Washington-University/StructuralQC; v1.4.0).

A subgroup (120 subjects) of the remaining non-artifactual extreme outlier subjects was radiologically reviewed. This subgroup included all top-ranked extreme outlier subjects and randomly sampled non-top extreme outlier subjects to ensure wide coverage (Figure S2). The T1w MPRAGE and T2w FLAIR images, as well as the ages of these subjects, were provided to a board-certified neuroradiologist (D. S. R.). The instructions to the neuroradiologist were to identify any major findings that might plausibly account for the extreme outlier score, not to identify subtle abnormalities that would have required dedicated review on clinical-grade display systems. When the neuroradiologist was uncertain of the diagnosis, UKB health outcomes data (UKB Category 1712) were used in an attempt to determine the diagnosis. These data record the first occurrence of various diseases, including neuropsychiatric and neurological disorders. Based on the radiological review results, the subjects in this subgroup were further divided into two groups: extreme outlier subjects with radiological findings (117 subjects) and extreme outlier subjects who appeared normal to the neuroradiologist (3 subjects). The cases from these two groups that would be interesting for follow-up were highlighted.

2.7 Evaluation of the relationships between outlier scores of different imaging phenotypes

The relationships between the outlier scores of different imaging phenotypes were quantified using Pearson cross-correlation coefficients in the UKB discovery group. Two representative relationships, WMLV versus VV and WMLV versus FA, were also visualized using scatterplots. In each scatterplot, three zones were defined to categorize extreme outlier subjects. For WMLV versus VV, Zone I covered the subjects who were VV extreme outliers but had normal WMLV (WMLV outlier score ≤ 3), Zone II covered the subjects who were extreme outliers in both WMLV and VV, and Zone III covered the subjects who were WMLV extreme outliers but had normal VV (VV outlier score ≤ 3). The density of extreme outlier subjects in each zone was then calculated (Equations (4)–(6)). To evaluate the differences in densities across the three zones, a bootstrap procedure with replacement on subjects was used to generate 100,000 bootstrap samples of the original sample size. For each bootstrap sample, the density of each zone was recomputed. A one-way analysis of variance (ANOVA) was then performed to evaluate the differences across the zones using the bootstrap samples. Similar analyses were also carried out to evaluate the relationship between WMLV and FA outlier scores.
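
The bootstrap comparison could look like the following MATLAB sketch. Here the density of each zone is simply taken as the fraction of subjects falling in it, which is an assumption standing in for the exact definitions in Equations (4)–(6), and the outlier score vectors and reduced bootstrap count are placeholders.

```matlab
% Illustrative sketch of the bootstrap-plus-ANOVA comparison of zone densities.
% The density definition used here (fraction of subjects per zone) is an
% assumption; see Equations (4)-(6) for the study's definition.
n = 18000;  nboot = 1000;                              % nboot = 100,000 in the study
vv_score   = randn(n, 1);                              % placeholder outlier scores
wmlv_score = randn(n, 1);

zones = @(w, v) [v > 3 & w <= 3, v > 3 & w > 3, v <= 3 & w > 3];   % Zones I-III

dens = zeros(nboot, 3);
for b = 1:nboot
    idx = randsample(n, n, true);                      % resample subjects with replacement
    dens(b, :) = mean(zones(wmlv_score(idx), vv_score(idx)), 1);
end

p = anova1(dens, {'Zone I', 'Zone II', 'Zone III'}, 'off');   % one-way ANOVA across zones
```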

2.8 Outlier detection and screening in the HCP dataset

Similar outlier detection procedures were carried out separately in the HCP dataset to identify interesting extreme outliers in this young adult cohort (for details, see Supplementary Methods). Briefly, 3 T MRI data from the 1,200 Subjects Release (1,113 subjects: 507 males and 606 females; age 22–37) were used (Glasser et al., 2016). Because of the lack of HCP T2w FLAIR data and poor WMLV segmentation accuracy when only using T1w images (Hotz et al., 2021), WMLV was excluded from the outlier detection of the HCP dataset. All the HCP extreme outliers (12 subjects) without data collection/processing errors were radiologically reviewed, and the cases that would be interesting for follow-up were highlighted.

3 RESULTS

3.1 Properties of outlier score distributions

The results presented throughout the rest of the manuscript were obtained using the UKB discovery group unless otherwise specified. The outlier score histogram of each imaging phenotype is shown in Figure 2. These distributions were all right-skewed and more leptokurtic than a standard normal distribution (see Table 1 for skewness and kurtosis values). The percentage of extreme outliers (referred to as "outliers" hereafter) ranged from a low of 0.2% in RSFC to a high of 3.9% in WMLV (Table 1). These percentages are all much higher than a standard normal distribution predicts, because the criterion of Q3 + 3 × IQR for defining extreme outliers in each distribution is equivalent to about 4.7 SD above the mean of a standard normal distribution, above which one would expect only about 0.0001% of the data. Taken together, the results suggest that the outlier score distributions were all more outlier-prone than a standard normal distribution.
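
The standard-normal comparison can be verified with a few lines of MATLAB; this is just a numerical check of the figures quoted above, not part of the study's pipeline.

```matlab
% Numerical check: for a standard normal distribution, Q3 + 3*IQR lies
% about 4.7 SD above the mean, and the expected tail beyond it is ~0.0001%.
q3     = norminv(0.75);                    % ~0.6745
iqr_sn = norminv(0.75) - norminv(0.25);    % ~1.349
cutoff = q3 + 3 * iqr_sn;                  % ~4.72
tail   = 1 - normcdf(cutoff);              % ~1.2e-6
fprintf('cutoff = %.2f SD, tail = %.4g%%\n', cutoff, 100 * tail);
```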

FIGURE 2 Outlier score histograms. (a) Ventricular volume (VV). (b) White matter lesion volume (WMLV). (c) Fractional anisotropy (FA). (d) Mean diffusivity (MD). (e) Cortical thickness (CTh). (f) Resting-state functional connectivity (RSFC). The zoom panels show the outlier score histograms of extreme outlier subjects.

TABLE 1. Summary of outlier score distributions for the main dataset (UKB discovery group)

Phenotype: VV | WMLV | FA | MD | CTh | RSFC^a
Number of subjects: 19,076 | 18,166 | 15,432 | 15,432 | 16,200 | 15,166
Skewness: 1.92 | 4.28 | 7.38 | 7.01 | 0.99 | 1.01
Kurtosis: 11.47 | 36.56 | 227.61 | 156.27 | 6.20 | 6.01
Number of outliers: 190 (1.0%) | 706 (3.9%) | 134 (0.9%) | 174 (1.1%) | 54 (0.3%) | 33 (0.2%)

Abbreviations: CTh, cortical thickness; FA, fractional anisotropy; MD, mean diffusivity; RSFC, resting-state functional connectivity; UKB, UK Biobank; VV, ventricular volume; WMLV, white matter lesion volume.

3.2 Long-term test–retest reliability of outlier scores

A subgroup of the discovery group subjects had a repeat MRI session 2–3 years after the initial visit. The outlier scores of test versus retest of each imaging phenotype are visualized in the scatterplots of Figure 3a–f, respectively. VV outlier scores had excellent test–retest reliability, as indicated by the close-to-one value of the ICC (ICC = 0.98) between test and retest outlier scores. The test–retest reliabilities of WMLV and FA outlier scores were lower than VV but still excellent (WMLV ICC = 0.82; FA ICC = 0.86). The test–retest reliabilities of MD and CTh outlier scores were lower than the former three but still in the range of good reliability (MD ICC = 0.72; CTh ICC = 0.64).

FIGURE 3 Long-term test–retest reliability of outlier scores. (a) Ventricular volume (VV). (b) White matter lesion volume (WMLV). (c) Fractional anisotropy (FA). (d) Mean diffusivity (MD). (e) Cortical thickness (CTh). (f) Resting-state functional connectivity (RSFC). For (a–f), in each scatterplot, each subject's outlier score of the initial imaging visit (also known as "test"; year 2014+) is plotted against the outlier score of the first repeat imaging visit (also known as "retest"; year 2019+). ICC: intraclass correlation between outlier scores of the two visits. Red dashed line: Q3 + 3 × IQR. (g) The scatterplot of test–retest global signal amplitude (GSA) change versus test–retest RSFC outlier score change. For (a–g), only the UK Biobank (UKB) subjects that had both test and retest data available are shown in these scatterplots. (h) The scatterplot of GSA versus RSFC outlier score (RSFC calculated using full correlations).

However, RSFC outlier scores had a low test–retest ICC (ICC = 0.40, Figure 3f). Because of this low reliability, among the subjects with available test–retest data, no subject was consistently identified as an RSFC outlier in both the test and retest data. The test–retest change in RSFC outlier scores was found to be correlated with the change in global signal amplitude (r = .43, Figure 3g). Here, global signal amplitude was defined as the SD of the global signal (Wong, Olafsson, Tal, & Liu, 2013). Indeed, the RSFC outlier score itself was correlated with global signal amplitude (r = .51, Figure 3h). This association was unlikely to be due to head motion, because the subjects with large head motion were excluded during the automatic quality control. The association between RSFC outlier scores and global signal amplitude also persisted when partial correlations were used to evaluate RSFC, although the correlation became negative in this case (r = −.69, Figure S3a). Global signal regression reduced the association, but the RSFC outlier score was still moderately correlated with global signal amplitude (r = .42, Figure S3b). Remarkably, similar analyses on the HCP dataset yielded very similar results (Figure S3c–h). Thus, RSFC was eliminated from further individual-level outlier screening due to its low individual test–retest reliability.

3.3 Summary of the screening results of individual outliers

The total number of outliers across all individual imaging phenotypes (excluding RSFC) was 1,258. Because there were subjects who were outliers in more than one imaging phenotype, there were 1,026 distinct subjects that made up these 1,258 outliers.

Through the screening of each outlier, 87 outliers were found to be associated with data collection/processing errors, despite the use of automatic quality control to exclude poor quality data before outlier detection. Interestingly, none of the VV outliers were associated with data collection/processing errors. Data collection/processing errors were more frequent in the WMLV, FA, and MD outliers than in the VV outliers (Table 2), and were most frequent (22.2%, 12/54) in the CTh outliers. Some errors occurred at the data acquisition stage, due to head motion artifacts (Figure S4a) or selection of an incorrect FOV (Figure S4b). Others occurred at the data processing stage, such as incorrect segmentation (Figure S4c) or incorrect registration (Figure S4d).

TABLE 2. Summary of radiological review results of the outlier subjects in the main dataset

Phenotype: VV | WMLV | FA | MD | CTh
Outliers without data issue: 190 (100%) | 640 (90.7%) | 129 (96.3%) | 170 (97.7%) | 42 (77.8%)
Outliers read by neuroradiologist: 41 | 62 | 37 | 37 | 18
Radiological comments:
  Normal: 2 | 1
  Large ventricles: 38 | 18 | 9 | 8 | 5
  White matter lesions: 27 | 62 | 29 | 31 | 9
  Mass: 2 | 1 | 1
  Cyst: 4 | 1 | 1 | 2 | 1
  Infarct: 6 | 16 | 9 | 11 | 4
  Encephalomalacia: 3 | 3
  Prominent sulci: 2 | 1 | 3 | 1 | 3
  Other findings: 4 | 9 | 11 | 10 | 3

Note: Empty entries are zeros.
Abbreviations: CTh, cortical thickness; FA, fractional anisotropy; MD, mean diffusivity; RSFC, resting-state functional connectivity; VV, ventricular volume; WMLV, White matter lesion volume.

Of the remaining 1,171 outliers (954 distinct subjects) without data collection/processing errors, 120 distinct subjects were reviewed by a board-certified neuroradiologist (Table 2). These 120 subjects included all top-ranked outliers and randomly sampled non-top-ranked outliers, and the outlier scores of this representative subgroup spanned almost the whole range above Q3 + 3 × IQR of the outlier score distribution in each imaging phenotype (see Figure S2 for details). In this subgroup, 117 subjects (97.5%, 117/120) were identified with radiological findings, covering a diverse range of phenotypes, such as large ventricles, masses, cysts, white matter lesions, infarcts, encephalomalacia, and prominent sulci. Representative individual outlier subjects are reported per imaging phenotype in the next few subsections.

3.4 Individual outliers of VV

As an example, Figure 4a shows a VV outlier subject versus a normal subject. The outlier subject had markedly enlarged lateral ventricles compared with the normal one (the two subjects were about 7.9 IQR apart in the VV outlier score distribution). Forty-one of the VV outliers were reviewed by the neuroradiologist. Thirty-eight of the reviewed VV outliers were identified with radiological findings of large ventricles. Some of them had a relatively clear etiology: a third ventricle mass (possibly a choroid plexus papilloma), a fourth ventricle mass (possibly an ependymoma), a colloid cyst, and a frontoparietal arachnoid cyst, all of which could cause obstructive hydrocephalus, were found in four VV outlier subjects, respectively (Figure 4b). The other major pathologies identified in the VV outliers were infarcts, nodules, agenesis of the corpus callosum, and white matter lesions (Figure S5a).

FIGURE 4 Individual outliers of ventricular volume (VV). (a) Structural images of an example of a VV outlier subject (left column) and an example of a normal VV subject (right column). (b) Structural images showing radiological findings in four representative VV outlier subjects. (c) Structural images of VV outlier subjects interesting for follow-up. Left column: A subject with large ventricles of unknown etiology. Right column: Structural images of a family in the Human Connectome Project (HCP) dataset (monozygotic twins and their non-twin brother). The twins had large ventricles of unknown etiology, but their non-twin brother had normal VV. The structural images in (a), (b), and the left column of (c) are reproduced by kind permission of UK Biobank ©.

In addition, a few of the reviewed VV outliers would potentially be interesting for follow-up because they had large ventricles of unknown etiology and did not present any other noticeable pathology (Figure 4c, left panel). Such VV outliers of unknown etiology were also present in the HCP dataset. In one family (Figure 4c, right panel), female monozygotic twins were both VV outliers, but their non-twin brother had normal VV; in another family (Figure S5b), one twin of a male monozygotic twin pair was a VV outlier, but the other twin and his non-twin brother both had normal VV. These twin data open the possibility of probing the genetic and environmental causes underlying anomalously large VV. Taken together, the results indicate that VV outliers were associated with multiple different brain pathologies, and some of them had uncertain etiology requiring additional follow-up investigation.

3.5 Individual outliers of white matter-based imaging phenotypes

Outlier detection of white matter-based imaging phenotypes was performed with WMLV, FA, and MD, respectively. As an example, Figure 5a shows a WMLV outlier subject versus a normal subject (the two subjects were about 26.7 IQR apart in the WMLV outlier score distribution). The outlier subject had irregular periventricular white matter lesions extending into the deep white matter with large confluent areas, whereas the example normal subject had only tiny lesions at the periventricular caps. Figure 5b shows regional FA deviation maps of an FA outlier subject versus a normal subject (about 9.9 IQR apart in the FA outlier score distribution). For this representative outlier subject, regional FA deviated negatively in all 27 white matter ROIs used in this study, whereas the FA of the representative normal subject showed almost no deviation. Figure S6a shows regional MD deviation maps of an MD outlier subject versus a normal subject (about 5.5 IQR apart in the MD outlier score distribution), in which a large positive MD deviation was observed in the left superior longitudinal fasciculus of the outlier subject.

FIGURE 5 Individual outliers of white matter-based imaging phenotypes.
