The German research consortium for the study of bipolar disorder (BipoLife): a quality assurance protocol for MR neuroimaging data

The BipoLife consortium involved nine neuroimaging centers across Germany. All data were acquired on 3 Tesla MR scanners from different manufacturers and with different hardware and software configurations (e.g., different head coils). MR sequence parameters were standardized across all sites to the extent permitted by each scanner. All subjects underwent a high-resolution T1-weighted anatomical scan and several functional measurements (three task-based fMRI paradigms and one resting-state measurement). Additionally, a phantom was measured after each subject. We deliberately measured the phantom not at the beginning but at the end of each session, i.e., after the human MRI data had been acquired. This ensured that the MR scanner was in a roughly comparable state each time. We acknowledge, however, that the acquisition protocol is EPI-heavy: by the end of a scanning session the gradients may have heated up, potentially causing artifacts (e.g., increased drifts) that were not yet present while the human MRI data were acquired. Future studies might therefore consider counterbalancing the order of phantom acquisitions. An extensive description of the MR scanners, their hardware configurations and software packages can be found in Table 1 [of note, the MR sequence parameters and the experimental design are described elsewhere (Vogelbacher et al. 2021)].

Table 1 MR scanners, their hardware configurations and software packages used in the BipoLife consortium

After the study protocol was set up at all centers, the participating sites were visited, the local staff was trained and compliance with the study protocol was verified. Each center performed a complete measurement of one subject in the presence of the coordinating team to clarify all open questions. Each center then carried out three further measurements on control subjects over the next few days to become familiar with the study protocol (e.g., preparation and execution of the MRI measurement, measurement of phantom data, data transfer to the coordinating center). The data was sent to the coordinating center. If there were no further objections, the study measurements could begin. Of note, for organizational reasons we did not have the possibility to measure the same subjects at each site (“traveling subjects”). This must be considered a missed opportunity to further establish inter-scanner reliability, and we strongly recommend that future studies also include such measurements.

During the project, all MRI data was sent to the coordinating center via the internet directly after each measurement. The transferred data was inspected at the coordinating center for potential errors in data acquisition. The inspection included a check of data completeness [with regard to human and phantom data, the logfiles and the clinical report form (CRF)] as well as a check of the correct positioning of the bounding box during measurement planning. If local staff changed, new team members received thorough training from the coordinating center.

Quality assurance was carried out on two levels. First, we used phantom data to evaluate the temporal stability of the MR scanners over time. Second, we assessed the quality of the human MRI data using a variety of metrics. In the following, we first give a detailed overview of the QA protocol for phantom MRI data (Section "QA protocol: Phantom MRI data") and for human MRI data (Section "QA protocol: Human MRI data"). We then show that the quality of structural human MRI data (high SNR vs. low SNR) can have a profound impact on the outcome of standard neuroimaging analysis pipelines (Section "Influence of the quality of MRI data on brain imaging analyses").

QA protocol: phantom MRI data

In this section, we describe the measurement and analysis of phantom MRI data (Section "Assessment of the quality of phantom MRI data"), show how the results of the phantom measurements can be used to characterize the various properties of MR scanners (Section "Properties of different MR scanners") and how these properties can change over time (Section "Temporal stability of MR scanners"). This information can be used both to exclude specific data due to poor quality and to determine long-term changes in the quality of an MR scanner (e.g., after technical changes, Section "Influence of major technical changes at MR scanners").

Assessment of the quality of phantom MRI data

The phantom was a 23.5 cm long, 11.1 cm diameter cylindrical plastic vessel (Rotilabo, Carl Roth GmbH+Co. KG, Karlsruhe, Germany) filled with a mixture of 62.5 g agar and 2000 ml distilled water. In contrast to widely used water-filled phantoms, agar phantoms are more suitable for fMRI studies. On the one hand, their T2 values and magnetization transfer characteristics are more similar to brain tissue (Hellerbach 2013); on the other hand, they are less vulnerable to scanner vibrations and thus avoid a long settling time prior to data acquisition (Friedman and Glover 2006).

Phantom data was acquired after each subject measurement, except when two subjects were measured consecutively; in this case, the MRI phantom was measured only once, between the two measurements. The phantom was aligned lengthwise, parallel to the z-axis, at the center of the head coil. The alignment was evaluated by the radiographer performing the measurement and, if necessary, corrected using the localizer scan. During measurement planning, the bounding box was manually centered on the phantom with the slice direction perpendicular to the phantom body (see supplementary material S1 for more details).

We decided to apply a T2*-weighted echo planar imaging (EPI) sequence because we were most interested in assessing the temporal stability of the MR scanners across fMRI measurements. We applied the same MR sequence parameters as in the resting-state measurement. The first 5 images were discarded from all analyses to account for equilibrium effects.

Various QA metrics can be calculated from phantom data, assessing for instance the strength of the signal, temporal stability and geometric distortions [for an overview, see Glover et al. (2012); Lu et al. (2019)]. We used QA metrics that covered various spatial and temporal aspects of the images, including the SNR, spatial inhomogeneity, ghosting artifacts, temporal fluctuations and scanner drift [the detailed mathematical description is presented in a previous publication of our research group (Vogelbacher et al. 2018)]. Data analysis was performed using the self-developed LAB-QA2GO software package (Vogelbacher et al. 2019).
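To make the computation of such metrics more concrete, the following sketch illustrates, under simplifying assumptions, how the SFNR and the scanner drift could be derived from a phantom EPI time series in the spirit of Friedman and Glover (2006). The file name, ROI size and linear detrending are illustrative choices only; the exact definitions used in BipoLife are those given in Vogelbacher et al. (2018), as implemented in LAB-QA2GO.

```python
# Illustrative sketch only: file name, ROI size and detrending order are assumptions,
# not the LAB-QA2GO implementation (see Vogelbacher et al. 2018 for exact definitions).
import numpy as np
import nibabel as nib

img = nib.load("phantom_task-rest_bold.nii.gz")   # hypothetical phantom EPI run
data = img.get_fdata()                            # shape (x, y, z, t)
data = data[..., 5:]                              # discard the first 5 volumes (equilibrium)

x, y, z, t = data.shape
roi = data[x // 2 - 7:x // 2 + 8, y // 2 - 7:y // 2 + 8, z // 2, :]  # central 15x15 ROI
vox = roi.reshape(-1, t).T                        # time x voxels

# Voxel-wise linear detrending
time = np.arange(t)
slope, intercept = np.polyfit(time, vox, 1)       # one linear fit per voxel column
trend = time[:, None] * slope + intercept

# SFNR: temporal mean divided by temporal fluctuation noise, averaged over the ROI
sfnr = (vox.mean(axis=0) / (vox - trend).std(axis=0)).mean()

# Drift: range of the fitted trend of the ROI mean signal, in percent of the mean
roi_ts = vox.mean(axis=1)
fit = np.polyval(np.polyfit(time, roi_ts, 1), time)
drift_percent = 100.0 * (fit.max() - fit.min()) / roi_ts.mean()

print(f"SFNR: {sfnr:.1f}, drift: {drift_percent:.2f} %")
```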

Data acquisition started in October 2015 and ended in December 2020. By that date, 431 phantom measurements had been performed, of which 426 data sets were complete and could be analyzed further (some data had to be excluded due to misplacement of the phantom; for an overview, see Table 2).

Table 2 Number of phantom measurements for each center

Properties of different MR scanners

In Fig. 1, we present the results of the phantom QA analyses for each center (see also Supplement S2 for detailed values). There are clear differences in each QA metric between the different MR scanners, both in mean and variance. While one might assume that all MR scanners of the same type share approximately the same characteristics, in reality there are substantial differences, even among relatively similar models (see Table 1 for an overview of the scanners used). The typical SNR values of the MR scanners in Berlin, Dresden and Marburg (all using a Siemens Tim Trio), for instance, are relatively similar, while the other scanners clearly differ. However, these MR scanners also differ strongly in other metrics (e.g., drift values). Thus, the overall behavior of each MR scanner, as characterized by the QA metrics, is unique.

Fig. 1

Overview of the distribution of various QA metrics for the phantom data. The overall performance, characterized by the QA metrics, differs strongly between the MR scanners, even between identical models. SNR signal-to-noise ratio, SFNR signal-to-fluctuation-noise ratio, PSC percent signal change, PSG percent signal ghosting [for details, see Vogelbacher (2020)]

Temporal stability of MR scanners

In Fig. 2, we show how the scanner drift, as one specific QA metric, developed over the course of the study at the centers of Marburg and Frankfurt. Marburg shows generally lower drift than the Frankfurt site, and its overall variability across time is also lower [coefficient of variation (CV) of Marburg = 0.18; CV of Frankfurt = 0.25]. In Table 3, we additionally show a year-by-year comparison of the variability of the drift for both centers. We believe that this year-by-year representation of QA metrics can help to better identify long-term trends in the change of MR scanner characteristics.
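As a hedged illustration, the following pandas sketch shows how such a year-by-year summary of the drift (mean, standard deviation, CV, as in Table 3 below) could be computed from a table of QA results; the file and column names are hypothetical and not the actual BipoLife file format.

```python
# Hypothetical QA result table with one row per phantom measurement
# (columns "date" and "drift"); adapt names to the local file layout.
import pandas as pd

qa = pd.read_csv("phantom_qa_marburg.csv", parse_dates=["date"])

per_year = qa.groupby(qa["date"].dt.year)["drift"].agg(["mean", "std", "count"])
per_year["cv"] = per_year["std"] / per_year["mean"]   # coefficient of variation
print(per_year.round(3))
```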

Fig. 2

The scanner drift values for the centers of Marburg (A) and Frankfurt (B) over time. For post-hoc outlier detection, we defined a range of ±2.5 standard deviations (horizontal lines). For Frankfurt, three outliers were identified (red arrows). A more detailed analysis showed, however, that these deviations were caused by incorrect placement of the phantom, not by MR scanner malfunctions

Table 3 Year-by-year comparison of the variability of the scanner drift for the Centers of Marburg and Frankfurt

Based on the fluctuation of this data, we can automatically calculate limit values [e.g., ±2.5 SD of the mean as a possible outlier criterion (Friedman and Glover 2006)] and thus assess whether a QA metric deviates too much from its usual values, possibly indicating impairments in the function of the MR scanner. We would like to note that these limit values are arbitrary; had we chosen narrower limits, we would simply have had to recheck more data points.
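A minimal sketch of this post-hoc outlier rule is given below, assuming the QA results are available as a table with one row per phantom measurement; file and column names are hypothetical, and the ±2.5 SD threshold follows Friedman and Glover (2006).

```python
# Flag drift values outside mean ± 2.5 SD (post-hoc, over all measurements of the study).
import pandas as pd

qa = pd.read_csv("phantom_qa_frankfurt.csv", parse_dates=["date"])  # hypothetical file

mean, sd = qa["drift"].mean(), qa["drift"].std()
lower, upper = mean - 2.5 * sd, mean + 2.5 * sd

outliers = qa[(qa["drift"] < lower) | (qa["drift"] > upper)]
print(f"limits: [{lower:.3f}, {upper:.3f}]; {len(outliers)} of {len(qa)} measurements flagged")
print(outliers[["date", "drift"]])
```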

In Fig. 2, we used, for illustrative purposes, a range of ±2.5 standard deviations around the mean (based on all data points acquired during the study) for post-hoc outlier detection. For Marburg, all measured values were within the permitted fluctuation range. For Frankfurt, three outliers could be detected (marked with red arrows). A closer inspection of the data, however, showed that these fluctuations were not caused by changes in scanner performance, but were related to a changed placement of the phantom in the MR scanner (placement differences are highlighted in supplementary material S3). A systematic and, most importantly, timely assessment of all QA values can help to detect potential scanner malfunctions early (as described in Section "Influence of major technical changes at MR scanners" below). For future measurements, we recommend, on the one hand, using a phantom holder to make the placement of the phantom as reproducible as possible and, on the other hand, calculating the QA metrics not only for one selected slice but for a larger volume.

Influence of major technical changes at MR scanners

The QA metrics are sensitive to technical changes of a scanner (such as the replacement of the gradient coil), changes of the QA protocol (e.g., the introduction of special phantom holders) or changes of MR sequence parameters. In Fig. 3, we show an example of how damage to the body coil at the Marburg site affected the QA metrics and how the functional failure could have been detected in advance. In June 2018, the MR scanner in Marburg failed suddenly. After extensive error diagnostics, the service technicians detected a defect of the body coil; after its replacement, the MRI system worked properly again. In a post-hoc analysis, we noticed that in the roughly two months before the failure, the metrics assessing ghosting artifacts in the MR images [such as “percent signal ghosting”, PSG, cf. Vogelbacher et al. (2018)] increased strongly (Fig. 3, top). Had we noticed this at the time, we might have been able to arrange a check of the MR scanner earlier. The other QA metrics did not show any systematic changes before the replacement of the body coil (shown exemplarily for SNR in Fig. 3, bottom); the technical properties of the MR scanner thus remained otherwise unchanged.
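To illustrate how such a timely assessment could be automated, the sketch below compares each new PSG value only against the statistics of the preceding measurements, so that a sudden increase, as observed before the body coil failure, would raise a warning at acquisition time. The file and column names, the minimal baseline of 20 measurements and the 2.5 SD factor are assumptions, not part of the BipoLife protocol.

```python
# Prospective monitoring sketch: warn when a new PSG value exceeds the baseline of all
# previous measurements by more than 2.5 SD. File/column names and thresholds are assumptions.
import pandas as pd

qa = pd.read_csv("phantom_qa_marburg.csv", parse_dates=["date"]).sort_values("date")

MIN_BASELINE = 20   # require a minimal history before raising warnings (assumption)
for i in range(MIN_BASELINE, len(qa)):
    history = qa["psg"].iloc[:i]
    value = qa["psg"].iloc[i]
    if value > history.mean() + 2.5 * history.std():
        print(f"{qa['date'].iloc[i].date()}: PSG = {value:.3f} exceeds "
              f"baseline {history.mean():.3f} + 2.5 SD -- check the scanner")
```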

Fig. 3

Percent signal ghosting (PSG) (A) and signal-to-noise ratio (SNR) (B) for the center of Marburg over time. The bisque-shaded area marks measurements before, and the light blue area measurements after, the body coil was replaced. No measurements took place between 17 June 2018 and 11 September 2018. The scanner shows generally stable performance before and after replacement of the body coil, as indicated by stable values in all QA metrics calculated from the phantom data. There was, however, an almost tenfold increase in PSG values before the body coil had to be replaced. If this had been noticed earlier, the scanner defect might have been detected sooner

QA protocol: human MRI data

In this section, we describe the quality assessment of human MRI data (Section "Quality assessment of human MRI data"). We then show various examples of anatomical (Section "Example 1: Evaluation of anatomical MRI data") and functional data sets (Section "Example 2: Evaluation of functional MRI data") that were reduced in quality by typical artifacts and incorrect measurements.

Quality assessment of human MRI data

In a first step, it was checked whether the data was complete, both with regard to the MRI data and the corresponding logfiles. In particular, the correct positioning of the bounding box during measurement planning was examined, since a wrong alignment of the measurement volume turned out to be a frequent error source in the MRI measurements. If something was not in line with the protocol, the neuroimaging centers received direct feedback and, if necessary, additional training. The quality assurance protocol did not only include a check of the MRI data, however, but also of the entire experimental procedure. For this purpose, a clinical report form (CRF, see supplementary material S4) was designed, on which the entire procedure was documented for each measurement (e.g., training of the subjects, performance of the neuropsychological tests) and unexpected events could be logged. This CRF had to be filled out for each measurement and helped to reconstruct the measurement and any problems that may have occurred.

After the initial check, the MRI data was converted into the BIDS format [using heudiconv, version v0.6.0, Halchenko et al. (2019)]. The data quality was assessed using the BIDS-App MRIQC [Magnetic Resonance Imaging Quality Control, version 0.15.2, Esteban et al. (2017)]. MRIQC assesses both structural T1-weighted MR images and blood oxygenation level dependent (BOLD) images of the brain by calculating a set of quality measures from each image. MRIQC provides different image quality metrics (IQMs) to characterize anatomical and functional MR images. For the anatomical images, the IQMs are often divided into four broad categories: the first comprises measures describing the impact of noise, the second contains metrics characterizing the spatial distribution of information, the third can be used to detect artifacts, and the fourth groups all remaining metrics, which characterize, for instance, the statistical properties of tissue distributions or the blurriness of the images. For the functional images, the IQMs are typically divided into three categories assessing spatial information, temporal information and the presence of artifacts (for an overview, see https://mriqc.readthedocs.io/).
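As an illustration of this conversion and quality-control step, the following Python sketch wraps the heudiconv and MRIQC command-line calls; the directory layout, heuristic file and subject label are hypothetical, and the exact flags may differ from the BipoLife setup (heudiconv v0.6.0, MRIQC 0.15.2).

```python
# Minimal sketch of the DICOM -> BIDS -> MRIQC chain; paths and the heuristic are assumptions.
import subprocess

subject = "001"

# DICOM to BIDS conversion with heudiconv (hypothetical heuristic and directory layout)
subprocess.run([
    "heudiconv",
    "-d", "dicom/{subject}/*/*.dcm",
    "-s", subject,
    "-f", "bipolife_heuristic.py",
    "-c", "dcm2niix",
    "-b",                       # write BIDS-compliant output
    "-o", "bids/",
], check=True)

# Participant-level quality metrics with MRIQC
subprocess.run(["mriqc", "bids/", "mriqc_out/", "participant",
                "--participant-label", subject], check=True)

# Group-level report once all participants have been processed
subprocess.run(["mriqc", "bids/", "mriqc_out/", "group"], check=True)
```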

MRIQC automatically creates visual reports for each anatomical image and each fMRI data set. Additionally, after all data was acquired, a group report was created. This report shows a plot of all individual IQMs, making it easy to identify outliers in each metric. These visual reports were checked by one member of the coordination team with respect to, e.g., movement, ghosting artifacts, positioning of the measurement volume, as well as the general quality of the data set. The quality of each anatomical image and each functional time series was finally rated as “good”, “intermediate” or “poor” (for an overview of all data of the study, see Table 4). The label “good” was given if the rater did not see any relevant artifacts. Images with a small amount of movement and minor artifacts were labeled “intermediate”. Images with major issues (in particular strong movement, wrong placement of the measurement volume, fold-over artifacts or ghosting artifacts) were categorized as “poor”.
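In addition to the visual inspection, the numerical IQMs from the MRIQC group tables can be screened automatically. The sketch below assumes the anatomical group table (e.g., group_T1w.tsv) and a few example metric columns; the exact column names depend on the MRIQC version and should be adapted accordingly.

```python
# Hedged sketch: flag subjects whose anatomical IQMs deviate strongly from the group.
# Column names (bids_name, snr_total, cjv, efc) are examples and may differ by MRIQC version.
import pandas as pd

iqms = pd.read_csv("mriqc_out/group_T1w.tsv", sep="\t")

for metric in ["snr_total", "cjv", "efc"]:            # example anatomical IQMs
    if metric not in iqms.columns:
        continue
    mean, sd = iqms[metric].mean(), iqms[metric].std()
    flagged = iqms[(iqms[metric] - mean).abs() > 2.5 * sd]
    for name in flagged["bids_name"]:
        print(f"{name}: {metric} deviates by more than 2.5 SD -- inspect visual report")
```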

Table 4 Number of MRI data sets classified “good”, “intermediate” or “poor” based on the quality assessment with MRIQC

Example 1: evaluation of anatomical MRI data

In Fig. 4, we present the background noise images of the MRIQC report for three selected structural MRI data sets. On top (A), we present a reference data set with no apparent artifacts. The middle (B) and bottom (C) images show clear artifacts and were both labeled as “poor” (for details, see figure legend).

Fig. 4

The background noise images of the structural MRIQC report for three selected subjects. The data set in A shows no visible artifacts and was labeled as “good”. Image B shows strong artifacts caused both by a poor positioning of the measurement volume (too low) and by a wrong phase encoding direction. This data set was labeled as “poor” and was excluded from further analyses. The data set in C shows strong movement artifacts extending into the prefrontal cortex and was also labeled as “poor”

Example 2: evaluation of functional MRI data

In Fig. 5, we present extracts of the MRIQC report (i.e., the averaged functional image) for two selected data sets acquired during the resting-state paradigm. On top (A), we again present a reference data set (labeled as “good”) with no apparent artifacts. At the bottom (B), one can clearly see that the measurement volume was wrongly aligned and did not cover the whole brain. The data set was consequently labeled as “poor”. As it turned out, misalignment of the bounding box during measurement planning was a major source of error in the study.

Fig. 5

Extracts of the MRIQC report for two resting-state fMRI data sets (averaged functional image). Data set A shows no visible artifacts and was labeled as “good”. Image B indicates a wrong alignment of the measurement volume and was consequently labeled as “poor”. Please note that in data set A, the lowest slice of the cerebellum was cut off during image acquisition: the measurement volume size specified in the study protocol was too small because the subject had a relatively large brain. For such cases, we specified that the measurement volume be positioned in such a way that the cerebrum would definitely be measured, even if parts of the cerebellum were not

In Fig. 6, we present the averaged functional image (A) and the standard deviation map (B) of yet another resting-state data set. This example illustrates that it is important to check more than one QA metric, because some artifacts are visible only in some metrics and not in others. More specifically, the averaged image does not show any artifacts, while in the standard deviation map a clear artifact is visible.

Fig. 6

Extracts of the MRIQC report for a selected resting-state fMRI data set. Unlike the averaged functional image (A), the standard deviation map (B) clearly indicates a strong artifact

Influence of the quality of MRI data on brain imaging analyses

The quality of the underlying data can strongly influence the results of MRI analyses [see Friedman and Glover (2006), Stöcker et al. (2005) and Vogelbacher et al. (2018) for various examples from other consortia; see Goto et al. (2016), Power et al. (2015) and Zaitsev et al. (2017) for the impact of motion artifacts on fMRI data]. It is not possible to assess in advance how large this influence will be for every conceivable form of analysis. The effect of data quality on the analysis results depends on many factors, e.g., the type of quality reduction, the number of subjects, the characteristics of the MR scanners involved and, of course, the specific analysis itself. Therefore, the general QA protocol, as outlined in the present article, has to be complemented by more specific QA assessments in subsequent projects. For instance, the adjustment of smoothness across MR scanners (Friedman and Glover 2006) or the introduction of specific covariates might be more important for some analyses than for others; there is no single solution for all such potential problems. In Fig. 7, we illustrate by example that the quality of MRI data can potentially influence the results of typical neuroimaging analyses. For this purpose, we randomly selected 40 subjects from the BipoLife dataset, ordered them according to the SNR values of their T1-weighted structural images and formed two groups: the low-SNR group contained the 20 subjects with the lowest values, the high-SNR group the 20 subjects with the highest values. We then tested whether a standard voxel-based morphometry (VBM) analysis showed differences between the two groups. In our example, we found significant differences in several regions (Fig. 7).
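For illustration, the sketch below shows how such an SNR-based group split and voxel-wise group comparison could be set up in Python with nilearn. This is not the CAT12/SPM12 pipeline used for Fig. 7; the input table, gray matter map paths and covariate coding are hypothetical, and a t-contrast is used instead of the F-test reported in the figure legend.

```python
# Illustrative nilearn sketch (not the CAT12/SPM12 analysis of Fig. 7). Assumes a table
# "subjects.csv" listing one preprocessed gray matter map per subject (column gm_map),
# the MRIQC snr_total value and the covariates age, tiv and numerically coded sex.
import pandas as pd
from nilearn.glm.second_level import SecondLevelModel
from nilearn.glm import threshold_stats_img

meta = pd.read_csv("subjects.csv").sort_values("snr_total")
low, high = meta.iloc[:20].copy(), meta.iloc[-20:].copy()    # 20 lowest / 20 highest SNR
low["group"], high["group"] = -1.0, 1.0
sample = pd.concat([low, high], ignore_index=True)

design = sample[["group", "age", "tiv", "sex"]].copy()
design["intercept"] = 1.0

model = SecondLevelModel(smoothing_fwhm=8.0).fit(
    list(sample["gm_map"]), design_matrix=design)
z_map = model.compute_contrast("group", output_type="z_score")   # high vs. low SNR

# Simple whole-brain FWE control via Bonferroni (the paper used peak-level FWE in SPM)
thresholded_map, z_threshold = threshold_stats_img(
    z_map, alpha=0.05, height_control="bonferroni")
print(f"corrected z threshold: {z_threshold:.2f}")
```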

Fig. 7

Voxel-based analysis of anatomical MRI data: Forty high-resolution, T1-weighted MR images of patients with major depression were drawn from the BipoLife data set. The data was preprocessed using the CAT12 toolbox (Computational Anatomy Toolbox, v1720, Structural Brain Mapping Group, Jena, Germany; https://neuro-jena.github.io/cat/), as implemented in SPM12 (Statistical Parametric Mapping, Institute of Neurology, London, UK) running on MATLAB (version R2017a, The MathWorks, Natick, Massachusetts, USA). Preprocessing steps included spatial normalization, segmentation (absolute threshold for gray matter set to 0.1) and spatial smoothing (8 mm full width at half maximum). The data set was divided into two groups (n = 20 each) based on the SNR values obtained from the MRIQC analysis. We did not balance for sex or site: the high-quality group comprised 3 females and 17 males, the low-quality group 7 females and 13 males. Gray matter segments of the high-quality and the low-quality group were compared using an F-test with age, total intracranial volume and sex as additional covariates. Significant differences (p < 0.05, family-wise error corrected for multiple comparisons at the peak level across the whole brain) were found in several brain regions including the precuneus (A), midfrontal gyrus (B) and inferior temporal gyrus (C). These differences can be attributed to the quality of the underlying MRI data. For visualization, we used MRIcroGL version 1.2.20220720 including the DARTEL template (Rorden and Brett 2000)
