Test–Retest Data for the Assessment of Breast MRI Radiomic Feature Repeatability

The use of radiomics to answer diagnostic, predictive, and prognostic questions has increased in recent years, especially in the field of oncology.1 Radiomics refers to the extraction of large amounts of high-throughput quantitative data from medical images using mathematical algorithms that have the potential to noninvasively reveal more information about the region of interest than can be captured by visual inspection alone.2 The extracted quantitative data, termed radiomics features, capture information regarding the shape, intensity, and texture of the chosen region of interest (ROI), which is usually the lesion or the affected organ. Radiomics features are intended to serve as biomarkers for the development of clinical decision support systems to enhance personalized medicine.3

In breast cancer research, multiple radiomics studies have shown promising results for diagnostic, prognostic, and predictive purposes.4-6 Despite these seemingly promising results, translation to clinical practice is limited.7 A major translational bottleneck can be attributed to the often unknown effect that multiple steps in the radiomics workflow have on feature values, including image acquisition, reconstruction, and preprocessing.8-11 For a radiomics feature to serve as a biomarker, and to be used reliably in clinical decision support systems, it must fulfill the criteria of repeatability and reproducibility.12 Repeatability can be defined as “the variability of the biomarker when repeated measurements are acquired on the same experimental unit under identical or nearly identical conditions” and reproducibility as the “variability in the biomarker measurements associated with using the imaging instrument in real-world clinical settings, which are subject to a variety of external factors that cannot all be tightly controlled.”12

Previous research has already identified several steps in the radiomics workflow that influence the reproducibility and repeatability of radiomics features. For example, image acquisition and reconstruction appear to cause variation in radiomic feature values in research performed on CT imaging.13, 14 Unlike CT, which has Hounsfield Units, MRI does not have absolute signal intensities. This can cause large differences between images and emphasizes the importance of inspecting and possibly adjusting image intensities before performing feature extraction.15 A test–retest MRI study of glioblastoma showed that both normalization and intensity quantization strategies affect radiomic feature repeatability and that the optimal strategy must be determined per feature group.16 Further test–retest studies assessing feature repeatability have been performed in cervical17 and prostate cancer18, 19 and have shown consistent results, although all studies state that translation of results to other tumor sites has not been confirmed. In contrast, Peerlings et al20 showed that 9.2% (122/1322) of the features, extracted from apparent diffusion coefficient (ADC) maps in ovarian, liver, and colorectal cancer patients, were repeatable among the different tumor sites.

The assessment of radiomics feature repeatability by test–retest studies in breast MRI exams is currently lacking. A potential reason for this lack of data is the variance present in a standard clinical breast MRI protocol, which means that scanning parameters may differ between patients scanned with the same clinical protocol. Therefore, this study investigated the repeatability of radiomics feature values extracted from breast MRI exams using a fixed clinical breast protocol comprising T2-weighted (T2W) images, T1-weighted (T1W) images, and diffusion-weighted images (DWI) with their derived ADC maps.

Materials and Methods

Study Population

The study was approved by the local medical ethical committee and written informed consent was given by all participants before participation. Eleven healthy female volunteers were recruited via college-wide advertisement. Participants were only included if they did not suffer from claustrophobia and met the requirements for admission to the MRI. Participants' height, weight, and the phase of the menstrual cycle were noted. The menstrual cycle of the included healthy volunteers was not taken into account during the MRI exams.

Imaging Acquisition

All MRI exams were performed using a 16-channel breast coil on a single 1.5 T scanner (Ingenia, Philips Healthcare, Best, The Netherlands) in the same research institution by the same technician. During imaging, the women lay in the prone position with both breasts in the openings of the breast coil and both arms above their head. The performed MRI protocol consisted of a T2-weighted turbo spin echo (T2W), native T1-weighted turbo gradient echo (T1W), and a single shot diffusion-weighted imaging (DWI) sequence using b-values of 0, 150, and 800 s/mm². A single corresponding ADC-map was derived from all three DWI sequences. All volunteers underwent MRI exams using the identical breast protocol, keeping as many parameters fixed as possible. The acquisition parameters for the different MRI sequences are shown in the supplementary material (Table S1). The shimbox, needed for the T1W and DWI sequences, was placed on the sternum by default. In case the technician judged the scan as clinically insufficient, the shimbox was placed on the breasts.

Study Design

A test–retest study was designed to assess the repeatability of breast-MRI extracted radiomic features. Three separate test–retest strategies were each performed twice, with 6–10 days between the two scanning dates. From here on, we will use ‘date 1’ to refer to the first scanning date of each healthy volunteer and ‘date 2’ to refer to the second scanning date. In each strategy, the complete breast MRI protocol was repeated three times with a 2-minute pause between each protocol. In the first strategy (S1) the participant remained in the MRI scanner the entire time (including the pauses) without movement, for the acquisition of the three breast MRI protocols. The second strategy (S2) differed from S1 only by moving the table out of the scanner (with the participant still in the same position without movement) during the 2-minute breaks. For the third strategy (S3) the participant got off the table during the 2-minute breaks (Fig. 1). In total, 18 different MRI exams were acquired for each healthy volunteer with a total scanning time of approximately 198 minutes.
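The design described above (2 scanning dates, 3 strategies, 3 protocol repetitions) can be enumerated as a short sketch. The `date.strategy.repeat` label format and all function names are illustrative assumptions, not taken from the study's own code.

```python
from itertools import combinations, product

DATES = (1, 2)          # date 1 and date 2
STRATEGIES = (1, 2, 3)  # S1, S2, S3
REPEATS = (1, 2, 3)     # three protocol repetitions per strategy


def exam_labels():
    """Return one label per MRI exam, e.g. '1.2.3' (assumed label format)."""
    return [f"{d}.{s}.{r}" for d, s, r in product(DATES, STRATEGIES, REPEATS)]


def exam_pairs(labels):
    """Return every unordered pair of exams for pairwise comparison."""
    return list(combinations(labels, 2))


def same_date(pair):
    """True when both exams of a pair were acquired on the same scanning date."""
    first, second = pair
    return first.split(".")[0] == second.split(".")[0]


labels = exam_labels()
pairs = exam_pairs(labels)
within_date = [p for p in pairs if same_date(p)]

# 18 exams, 153 pairs, of which 72 are within a date and 81 span both dates.
print(len(labels), len(pairs), len(within_date), len(pairs) - len(within_date))
```

Separating within-date from between-date pairs mirrors the distinction the analysis later draws between exams scanned on the same date and exams scanned on different dates.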

FIGURE 1. Visual representation of the three test–retest strategies.

ROI Segmentation

All images were visually checked for quality (including artifacts) by a dedicated breast radiologist with 14 years of experience (ML) before starting the analysis. The region of interest (ROI) was segmented by a medical researcher (RG) with 4 years of experience in breast MR imaging and validated by the same dedicated breast radiologist. The right breast was manually segmented in 3D. The segmentations were bounded by the sternum (medial side), the pectoral muscle (dorsal side), and the axilla (lateral side) in three dimensions using MIM software (version 7.1.3, Cleveland, OH, USA). Segmentations were performed for all volunteers on the T2W sequences of all MRI exams, as anatomical structures are best visible on this sequence. Subsequently, the T2W sequence was registered with the T1W sequence and the ADC map using rigid alignment within MIM software, followed by transfer of the segmentations (Fig. 2).

FIGURE 2. An axial slice of a 3D MRI exam of a healthy volunteer including right breast segmentation (red margin). (a) ADC map, (b) T2-weighted image, (c) T1-weighted image.

Image Preprocessing and Feature Extraction

All MRI exams including ROI segmentations were converted to the nearly raw raster data (NRRD) file format using Python (version 3.7.3) for subsequent analysis. Before feature extraction, multiple preprocessing procedures were applied to the images to study their impact on feature repeatability. First, feature extraction was performed without any image preprocessing as a baseline measurement. Second, N4 bias field correction was applied to the images prior to feature extraction.21 Lastly, the bias field corrected images were further preprocessed using the built-in image z-score normalization of the Pyradiomics software (version 2.2.0), with and without binning the voxel grayscale values using a fixed bin width of 32 and 64 (Pyradiomics suggests a bin width between 16 and 128).16, 22 Image preprocessing steps were performed in Python (version 3.7) using an in-house developed pipeline based on computer vision packages including OpenCV (version 4.1.0), SimpleITK (version 1.2.0), and NumPy (version 1.16.2). For each ROI, 91 original features were extracted using the Pyradiomics software (version 3.0.1), which is mostly compliant with the Image Biomarker Standardization Initiative.23 The extracted radiomics features included first-order statistics features, gray-level co-occurrence matrix features (GLCM), gray-level run length matrix features (GLRLM), gray-level size zone matrix features (GLSZM), neighboring gray tone difference matrix features (NGTDM), and gray-level dependence matrix features (GLDM). All texture features were extracted using default Pyradiomics settings. A detailed Pyradiomics feature description can be found online.24
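To make the two intensity-level preprocessing steps concrete, the following is a minimal pure-Python sketch of z-score normalization and fixed-bin-width discretization. It is not the in-house pipeline and not the Pyradiomics implementation. The formulas follow the conventions given in the Pyradiomics documentation (scaled z-score; bin index counted from the bin holding the minimum intensity), the use of the population standard deviation is an assumption, and the function names and the toy intensities are hypothetical.

```python
import math
from statistics import mean, pstdev


def zscore_normalize(values, scale=1.0):
    """Center intensities on their mean and divide by their standard deviation.

    Assumption: the population standard deviation is used; `scale` is an
    optional multiplier applied after normalization.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("constant intensities cannot be z-score normalized")
    return [scale * (v - mu) / sigma for v in values]


def discretize_fixed_bin_width(values, bin_width):
    """Map intensities to bin indices starting at 1, using a fixed bin width.

    index = floor(v / W) - floor(min(values) / W) + 1
    """
    lowest = math.floor(min(values) / bin_width)
    return [math.floor(v / bin_width) - lowest + 1 for v in values]


# Hypothetical voxel intensities inside an ROI.
roi = [10.0, 40.0, 70.0, 100.0, 130.0]

print(discretize_fixed_bin_width(roi, 32))  # [1, 2, 3, 4, 5]
print(discretize_fixed_bin_width(roi, 64))  # [1, 1, 2, 2, 3]
print([round(z, 3) for z in zscore_normalize(roi)])
```

With a fixed bin width, the number of occupied bins depends on the intensity range relative to the width, so any rescaling of the intensities before discretization changes the resulting gray levels.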

Statistical Analysis

To assess the repeatability of the extracted radiomic features for the various ROIs in the multiple test–retest strategies, the concordance correlation coefficient (CCC) was calculated using the epiR package (version 0.9-99) in R (version 3.6.3), run in RStudio (version 1.2.1335, Vienna, Austria).25 Radiomic features extracted from a given MRI exam were compared to radiomic features extracted from the remaining MRI exams in a pairwise manner. The CCC was used to evaluate the agreement in radiomic feature values, taking into account both the rank and the value of the measurements.26 This metric has the advantage of giving robust results in small sample sizes.26 The CCC takes values between −1 and 1, with 0 representing no concordance, 1 representing perfect concordance, and −1 perfect inverse concordance. Features with a CCC of >0.90 were defined as repeatable features, according to suggestions in literature.27 Feature concordance was assessed for each preprocessing procedure using the results of all test–retest strategies of both scanning dates as well as for the results collected on the separate scanning dates. To create an overview of repeatable features across all pairs for the different preprocessing procedures, the intersection of the repeatable features across pairs was determined.
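As an illustration of the metric and the selection rule, the sketch below computes Lin's CCC from its standard definition, 2·s_xy / (s_x² + s_y² + (mean_x − mean_y)²), and then intersects the features exceeding the 0.90 threshold across pairs. It is written in Python rather than R, uses 1/n moments as an assumption (the epiR implementation may differ in such details), and the function names and the toy values are hypothetical.

```python
from statistics import mean


def concordance_correlation(x, y):
    """Lin's concordance correlation coefficient between two measurement series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x) / n
    syy = sum((b - my) ** 2 for b in y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    denominator = sxx + syy + (mx - my) ** 2
    if denominator == 0:
        raise ValueError("CCC is undefined for two identical constant series")
    return 2 * sxy / denominator


def repeatable_features(ccc_by_pair, threshold=0.90):
    """Return the features whose CCC exceeds the threshold in every exam pair.

    ccc_by_pair maps an exam pair to a {feature_name: ccc} dictionary.
    """
    per_pair = [
        {name for name, value in features.items() if value > threshold}
        for features in ccc_by_pair.values()
    ]
    return set.intersection(*per_pair) if per_pair else set()


# Toy feature values for five subjects in one exam.
test = [1, 2, 3, 4, 5]
print(concordance_correlation(test, test))             # identical values
print(concordance_correlation(test, [2, 3, 4, 5, 6]))  # same ranks, constant offset
print(concordance_correlation(test, [5, 4, 3, 2, 1]))  # inverse ordering
```

The second call shows why the CCC is stricter than a correlation coefficient: a constant offset leaves the ranking perfectly intact but lowers the CCC below 1, because the squared difference of the means enters the denominator.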

Results

Patient Demographics

The median age of the 11 healthy female volunteers was 28 years (interquartile range 25–30 years). Table 1 summarizes the healthy volunteers' characteristics. Shimbox displacement occurred in 22.6% of the scanned sequences.

TABLE 1. Patient Characteristics

                                    Healthy Volunteers (n = 11)
Age (years) (median; IQR)           28 (25–30)
Height (cm) (median; IQR)           167 (167–172)
Weight (kg) (median; IQR)           60 (58–63)
Week of the menstrual cycle^a       Date 1/date 2
  Week 1                            1/5
  Week 2                            1/1
  Week 3                            3/1
  Week 4                            4/2
Days between scan (mean; range)     7 (6–9)

IQR: interquartile range.

Repeatable Radiomic Features

Due to a scanning error affecting all T1-weighted images and the ADC maps of one healthy volunteer during scanning date 1, all data of this participant were excluded from the analysis. In the T1W and T2W sequences as well as in the ADC maps, the number of concordant features in pairwise comparison varied per scanning date, per test–retest strategy, and per image preprocessing procedure (Figs. 3-5). Furthermore, for all preprocessing procedures, the lowest number of concordant features was observed between the MRI exams scanned on date 1 and the MRI exams scanned on date 2, seen as the reddest fields outside the black demarcations in Figs. 3-5.

FIGURE 3. Number of pairwise concordant radiomic features using a concordance correlation coefficient >0.90 for T1-weighted images with (a) no further preprocessing, (b) 32-bin grayscale discretization, (c) 64-bin grayscale discretization, (d) z-score normalization, (e) z-score normalization +32-bin grayscale discretization, and (f) z-score normalization +64-bin grayscale discretization. The black frame in the top left corner shows the MRI exams taken during the first scan date and the black frame in the bottom right corner shows the MRI exams taken during the second scan date. The numbers on the axis refer to the different MRI exams scanned, wherein the first number corresponds to the scan date and the second number to the test–retest strategy. In each test–retest strategy, three scans were examined which is represented by the last number. A total of 91 radiomic features was examined.

FIGURE 4. Number of pairwise concordant radiomic features using a concordance correlation coefficient >0.90 for T2-weighted images with (a) no further preprocessing, (b) 32-bin grayscale discretization, (c) 64-bin grayscale discretization, (d) z-score normalization, (e) z-score normalization +32-bin grayscale discretization, and (f) z-score normalization +64-bin grayscale discretization. The black frame in the top left corner shows the MRI exams taken during the first scan date and the black frame in the bottom right corner shows the MRI exams taken during the second scan date. The numbers on the axis refer to the different MRI exams scanned, wherein the first number corresponds to the scan date and the second number to the test–retest strategy. In each test–retest strategy, three scans were examined which is represented by the last number. A total of 91 radiomic features was examined.

FIGURE 5. Number of pairwise concordant radiomic features using a concordance correlation coefficient >0.90 for ADC maps with (a) no further preprocessing, (b) 32-bin grayscale discretization, and (c) 64-bin grayscale discretization. The black frame in the top left corner shows the MRI exams taken during the first scan date and the black frame in the bottom right corner shows the MRI exams taken during the second scan date. The numbers on the axis refer to the different MRI exams scanned, wherein the first number corresponds to the scan date and the second number to the test–retest strategy. In each test–retest strategy, three scans were examined which is represented by the last number. A total of 91 radiomic features was examined.

T1W Sequence

Across all pairs, regardless of scanning date and test–retest strategy, the highest number of concordant features was seen in the images without preprocessing, resulting in 15 of 91 (16.5%) concordant features. These 15 features consisted of 7 first-order, 1 GLCM, 2 GLRLM, 2 GLSZM, 2 GLDM, and 1 NGTDM features (Table 2). Applying grayscale discretization resulted in 13 of 91 (14.3%) and 14 of 91 (15.4%) concordant features for 32-bins and 64-bins, respectively. Compared to the images without preprocessing, fewer texture features were concordant. The z-score normalized images resulted in the lowest number, 4 of 91 (4.4%) concordant features. Applying grayscale discretization after z-score normalization improved the number of concordant textural features to 7 of 91 (7.7%) and 8 of 91 (8.8%) for 32-bins and 64-bins, respectively. The loss in the number of concordant features for z-score normalized images (with and without grayscale discretization), when compared to the images without preprocessing, was mainly due to the loss of concordant first-order features (6 of 91; 6.6%).

TABLE 2. Concordant Features Across All Pairs for the T1-Weighted MRI Exams, With A: No Preprocessing, B: 32-Bin Grayscale Discretization, C: 64-Bin Grayscale Discretization, D: z-score Normalization, E: z-score Normalization +32-bin Grayscale Discretization, and F: z-score Normalization +64-bin Grayscale Discretization

                                 A            B            C            D          E          F
Number of Concordant Features    15 (16.5%)   13 (14.3%)   14 (15.4%)   4 (4.4%)   7 (7.7%)   8 (8.8%)

firstorder_90Percentile × × ×
firstorder_InterquartileRange × × ×
firstorder_MeanAbsoluteDeviation × × ×
firstorder_Mean × × ×
firstorder_RobustMeanAbsoluteDeviation × × ×
firstorder_RootMeanSquared × × ×
firstorder_Skewness × × × × × ×
glcm_JointAverage ×
glrlm_GrayLevelNonUniformity × × × × ×
glrlm_RunLengthNonUniformity × × × × ×
glszm_GrayLevelNonUniformity × × × ×
glszm_SizeZoneNonUniformity ×
glszm_SmallAreaHighGrayLevelEmphasis ×
gldm_DependenceNonUniformity × × × × ×
gldm_GrayLevelNonUniformity × × × × × ×
ngtdm_Busyness × × × ×
ngtdm_Coarseness × × × × ×

For the majority of preprocessing strategies, the images collected during date 2 showed a higher number of concordant features (varying between 10/91 and 48/91 in images without bias field correction [BFC] and between 11/91 and 35/91 in BFC images) compared to images collected during date 1 (varying between 4/91 and 32/91 in images without BFC and between 9/91 and 14/91 in BFC images) (Table 3, Fig. 3), with these differences being greatest after applying grayscale discretization. Furthermore, for most image preprocessing procedures, the addition of BFC resulted in fewer concordant features compared to the images without BFC (Table 3, Table S2 in the Supplemental Material). For the BFC images without further preprocessing and for the BFC images with grayscale discretization, it was mainly the first-order features that showed a loss of concordance compared to not performing BFC.

TABLE 3. Number of Concordant Features Across All Pairs for the Entire Dataset (All) and Across All Pairs From the Separate Scanning Dates (Date 1 and Date 2) for All Sequences With and Without Bias Field Correction (BFC), With A: No Further Preprocessing, B: 32-bin Grayscale Discretization, C: 64-bin Grayscale Discretization, D: z-score Normalization, E: z-score Normalization +32-bin Grayscale Discretization, and F: z-score Normalization +64-bin Grayscale Discretization

                   Without BFC                 With BFC
Sequences          All    Date 1   Date 2      All    Date 1   Date 2
T1W        A       15     32       40          8      13       11
           B       13     19       45          10     11       30
           C       14     18       48          8      12       31
           D       4      4        10          4      9        12
           E       7      10       35          10     13       34
           F       8      9        38          8      14       35
T2W        A       11     31       16          0      1        60
           B       7      9        12          2      3        22
           C       7      9        11          1      3        23
           D       26     35       44          26     39       37
           E       4      7        7           6      11       17
           F       4      7        6           5      11       18
ADC        A       8      28       22          8      9        12
           B       7      15       13          6      9        12
           C       6      11       11          6      11       11
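The "All" columns of Table 3 can be tallied with a few lines of arithmetic to see how often BFC lowered, kept, or raised the number of concordant features. This is only a bookkeeping sketch over the published counts; the dictionary layout and function names are hypothetical.

```python
N_FEATURES = 91  # features extracted per ROI

# "All" columns of Table 3 as (without BFC, with BFC) per sequence and procedure.
ALL_PAIRS = {
    "T1W": {"A": (15, 8), "B": (13, 10), "C": (14, 8),
            "D": (4, 4), "E": (7, 10), "F": (8, 8)},
    "T2W": {"A": (11, 0), "B": (7, 2), "C": (7, 1),
            "D": (26, 26), "E": (4, 6), "F": (4, 5)},
    "ADC": {"A": (8, 8), "B": (7, 6), "C": (6, 6)},
}


def percent_of_features(count):
    """Express a concordant-feature count as a percentage of all features."""
    return round(100 * count / N_FEATURES, 1)


def bfc_effect(table):
    """Count settings where BFC gave fewer, equal, or more concordant features."""
    tally = {"fewer": 0, "equal": 0, "more": 0}
    for procedures in table.values():
        for without_bfc, with_bfc in procedures.values():
            if with_bfc < without_bfc:
                tally["fewer"] += 1
            elif with_bfc == without_bfc:
                tally["equal"] += 1
            else:
                tally["more"] += 1
    return tally


print(bfc_effect(ALL_PAIRS))    # {'fewer': 7, 'equal': 5, 'more': 3}
print(percent_of_features(15))  # 16.5
print(percent_of_features(26))  # 28.6
```

The same tally can be repeated for the Date 1 and Date 2 columns by swapping in the corresponding counts from the table.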

Figures S1–S6 in the Supplemental Material present the pairwise CCC values in scatterplots for all features in the different preprocessing procedures, wherein the different colors represent the use of all pairwise comparisons or only the pairwise comparisons between MRI exams scanned on the same day.

T2W Sequence

Across all pairs, regardless of scanning date and test–retest strategy, the z-score normalized images showed the highest number of concordant features, 26 of 91 (28.6%), of which 3 were first-order, 11 GLCM, 3 GLRLM, 0 GLSZM, 8 GLDM, and 1 NGTDM features (Table 4). Compared to the other preprocessing procedures, the difference in the number of concordant features lay mainly in the texture features, almost none of which were concordant for the other preprocessing procedures.

TABLE 4. Concordant Features Across All Pairs for the T2-weighted MRI Exams, With A: No Preprocessing, B: 32-bin Grayscale Discretization, C: 64-bin Grayscale Discretization, D: z-score Normalization, E: z-score Normalization +32-bin Grayscale Discretization, and F: z-score Normalization +64-bin Grayscale Discretization

                                 A            B          C          D            E          F
Number of Concordant Features    11 (12.1%)   7 (7.7%)   7 (7.7%)   26 (28.6%)   4 (4.4%)   4 (4.4%)

firstorder_10Percentile × × ×
firstorder_90Percentile × × ×
firstorder_InterquartileRange × × × × × ×
firstorder_MeanAbsoluteDeviation × × ×
firstorder_Mean × × ×
firstorder_RobustMeanAbsoluteDeviation × × × × × ×
firstorder_RootMeanSquared × × ×
glcm_JointAverage ×
