Standardised lesion segmentation for imaging biomarker quantitation: a consensus recommendation from ESR and EORTC

The panel

Selection of the experts from the broad, international, and multidisciplinary EIBALL and EORTC Imaging Group subcommittees avoided methodological weaknesses that can otherwise severely threaten the validity and reliability of Delphi results [26]. Because panel selection critically affects outcome, we sought to include a variety of primary stakeholders: only 50% of the respondents were radiologists and nuclear medicine physicians, with the other half drawn from related specialties, particularly medical physics and computer science, which are intimately involved with (automated) segmentation processes. We also ensured that 50% of the panel comprised members of both the EIBALL subcommittee and the EORTC Imaging Group subcommittee, all of whom are senior members of the imaging community with extensive experience of segmentation in various relevant contexts. A systematic review of 49 studies employing a Delphi process between 1978 and 2009 found that the median number of panel members was 17 and increased over time; panels included multiple stakeholders, who were healthcare professionals in 95% of cases [24]. More recently, Delphi studies involving imaging have convened multidisciplinary expert panels of 18–50 relevant stakeholders [27,28,29,30].

System performance

Variability in system performance is well recognised as a factor affecting image biomarker quantitation [31]. It must therefore be considered when selecting images for segmentation (a key component of most quantitation approaches), not only in clinical trials and research, but also when making longitudinal measurements in individuals where treatment decisions are based on the results. Variability in system performance can affect the perception of lesion boundaries by both humans and algorithms. For instance, when automated (e.g. machine learning based) segmentation algorithms have been trained on data from quality-controlled devices, data from other systems with variable performance are likely to compound segmentation error. The nuclear medicine community has a well-established system for device and site accreditation that ensures that systems meet recognised standards of performance and that regular quality assurance and control procedures are in place, so that these sites and devices are accredited to perform quantitative measurements within clinical trials. Such systems are not routinely in place for CT and MRI, which therefore require individual site review and approval for trial participation. Increasingly, however, such procedures are being implemented by triallists in order to pool multicentre data, e.g. CT trials for radiomic analyses [32] and MRI trials utilising imaging biomarkers in breast cancer [33, 34], prostate cancer [35], and ovarian cancer [36]. In our survey, there was consensus that certified system performance should be required for clinical trials, though consensus was not reached for research outside the trial setting.

Artefacts

Unlike segmentation for radiation therapy planning, where lesion delineation serves to direct therapy, segmentation for image biomarker quantitation requires detailed attention to the location of artefacts and their likely influence on the data derived from the segmented lesion. Non-ferromagnetic metal implants, which attenuate radiation, may for instance be particularly problematic in CT. Where artefacts obscure lesion boundaries and thus affect segmentation, it is inadvisable to extract quantitative information by extrapolating lesion edges. However, although the majority of respondents felt that any level of artefact was unacceptable, a tolerance of no artefact or up to 5% artefact was acceptable to 75%. Within clinical trials, this runs the risk of bias at the patient level, where sicker patients may be excluded because of artefact, or at the lesion level, where the segmented ROI might not encompass the entire lesion and its heterogeneity (for instance, if there are marked differences between a more vascular periphery and a more necrotic, cystic central region).

SNR, CNR, and TBR

There was consensus that SNR thresholds should be set based on modality, organ, and lesion size. Noise correction approaches have been compared at different SNR levels in terms of the reproducibility of diffusion tensor imaging (DTI) and diffusion kurtosis imaging (DKI) metrics [37]. Noise bias correction has a strong impact on quantitation for techniques with inherently low SNR, such as diffusion-weighted imaging (DWI), and noise bias can lead to erroneous conclusions in group studies. Noise bias correction significantly reduces noise-related intra- and inter-subject variability and should not be neglected in low-SNR studies such as DKI [37].
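To illustrate the principle, the sketch below applies the classic first-order Rician bias correction, in which the expected squared magnitude signal exceeds the true squared signal by 2σ²; this is one widely used scheme and not necessarily the specific method compared in [37]. The signal values and noise estimate are hypothetical.

```python
import numpy as np

def rician_bias_correct(magnitude, sigma):
    """First-order Rician noise bias correction for magnitude MR data.

    For Rician-distributed magnitude data, E[M^2] = A^2 + 2*sigma^2,
    so a bias-corrected estimate of the true signal A is
    sqrt(max(M^2 - 2*sigma^2, 0)).
    """
    corrected_sq = np.maximum(magnitude.astype(float) ** 2 - 2.0 * sigma**2, 0.0)
    return np.sqrt(corrected_sq)

# Hypothetical low-SNR DWI voxels where the noise floor inflates the signal
dwi = np.array([30.0, 45.0, 120.0])  # magnitude values (arbitrary units)
sigma = 20.0                         # noise SD, e.g. estimated from background air
print(rician_bias_correct(dwi, sigma))  # [10.0, 35.0, 116.62]
```

Note how the correction is largest, in relative terms, for the lowest-signal voxels, which is precisely where uncorrected noise bias distorts ADC and kurtosis estimates most.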

CNR and TBR profoundly affect lesion edge perception and hence directly impact segmentation and the resulting derived quantitative parameters. For example, in MRI, sequences that provide the highest contrast are generally used for segmentation, and the corresponding ROIs are then copied onto images where the contrast between lesion and background is less striking [38]. In PET, high TBR is similarly advantageous for metabolically active tumours, while lesions with low metabolic activity pose difficulties for segmentation. Where CNR or TBR is high, automated segmentation may be undertaken with greater confidence, as techniques such as thresholding, region growing, and machine learning all become more robust. Thresholding (fixed, adaptive, or iterative) converts a greyscale image into a binary image by defining all voxels greater than some value as foreground and all other voxels as background [39]. Partial volume effects (linked to a modality's spatial resolution relative to the size of a region of interest) critically affect selection of optimal thresholds. In many clinical studies, a value such as a standardised uptake value (SUV) of 2.5 (for PET) or an apparent diffusion coefficient (ADC) of 1.0 × 10⁻³ mm²/s (for DWI-MRI) is set as a pre-defined threshold to differentiate malignant lesions from benign, but there can be substantial variability across multiple studies even in the same tissue type [40, 41]. The data from this Delphi process indicate that 1.0 is a minimal CNR threshold for radiological images, and 2.0 an acceptable TBR for PET data.
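As an illustration of the thresholding approaches described above, the following minimal sketch implements a fixed threshold and a simple adaptive variant (a fixed fraction of the image maximum, as commonly done with SUVmax in PET); the SUV values are hypothetical.

```python
import numpy as np

def fixed_threshold(image, value):
    """Fixed thresholding: voxels above `value` become foreground (True)."""
    return image > value

def adaptive_threshold(image, fraction=0.4):
    """Adaptive variant: threshold at a fixed fraction of the image
    maximum (e.g. 40% of SUVmax, a common choice in PET)."""
    return image > fraction * image.max()

# Hypothetical PET SUV map; SUV > 2.5 is a commonly cited fixed threshold
suv = np.array([[0.8, 1.2, 3.1],
                [2.6, 4.0, 1.9],
                [0.5, 2.7, 3.3]])
print(fixed_threshold(suv, 2.5).astype(int))
print(adaptive_threshold(suv).astype(int))
```

The two masks differ at voxels near the cutoff, which is exactly where partial volume effects and the choice of threshold interact most strongly.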

Acquisition parameters that can be selected by the user in clinical practice can significantly impact CNR. In clinical trials, optimising these parameters to achieve a CNR that enables robust segmentation is thus desirable. Brambilla et al. [42] investigated the effect on CNR of varying acquisition parameters such as emission scan duration (ESD) and activity at the start of acquisition (A(acq)), as well as object properties such as target dimensions and TBR, which depend uniquely on the intrinsic characteristics of the object being imaged. They showed that ESD was the most significant predictor of CNR variance, followed by TBR and the cross-sectional area of the given sphere (as test object), with A(acq) the least important. Thus, raising ESD appears much more effective than raising A(acq) for increasing CNR to improve target segmentation. Moreover, when determining percentage thresholds for segmentation, target size was the most important factor in threshold selection (followed by TBR) for targets ≤ 10 mm in diameter, while for targets ≥ 10 mm, TBR was more important [42]. This is reflected in our recommendations, where selection of targets < 10 mm in diameter is not recommended for extracting quantitative imaging biomarker data.
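CNR itself can be defined in several ways; a frequently used form is the absolute difference between lesion and background mean intensities divided by the standard deviation of the background. A minimal sketch assuming that definition, with hypothetical intensities and masks:

```python
import numpy as np

def cnr(image, lesion_mask, background_mask):
    """Contrast-to-noise ratio: absolute difference of lesion and
    background means, divided by the background standard deviation
    (one common definition among several in use)."""
    lesion = image[lesion_mask]
    background = image[background_mask]
    return abs(lesion.mean() - background.mean()) / background.std(ddof=1)

# Hypothetical intensities within lesion and background ROIs
img = np.array([100.0, 105.0, 98.0, 50.0, 55.0, 45.0, 52.0])
lesion = np.array([True, True, True, False, False, False, False])
print(round(cnr(img, lesion, ~lesion), 2))  # ~12.0
```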

Spatial resolution

The spatial resolution of images selected for segmentation is not routinely cited, and our data indicate a need for this, although there was no consensus on the size thresholds that should be set. The majority of respondents indicated that 5 pixels or fewer was too few and that the lower limit should be set somewhere between 10 and 20 pixels within a region of interest in order to capture lesion heterogeneity and be sufficiently representative for biomarker quantitation. Such a lower size limit for the target lesion, which will also depend on the intrinsic resolution characteristics of the modality and instrument, is also linked to ensuring that partial volume effects do not significantly affect derived measurements.
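In practice, such a lower limit can be enforced as a simple eligibility check on the segmented ROI before biomarker extraction. The sketch below uses the 10-pixel lower bound discussed above as an assumed cutoff, not a consensus value.

```python
import numpy as np

def roi_eligible(mask, min_pixels=10):
    """Return True if the ROI holds enough pixels for biomarker
    quantitation; the panel discussed a lower limit of 10-20 pixels."""
    return int(np.count_nonzero(mask)) >= min_pixels

# A hypothetical 3-pixel ROI would be rejected as too small
tiny_roi = np.zeros((8, 8), dtype=bool)
tiny_roi[3, 3:6] = True
print(roi_eligible(tiny_roi))  # False
```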

Post-processing

Specifying post-processing methods did not achieve consensus, although it was agreed that organ- and modality-specific window and level settings should be used. Multiple methods appeared acceptable, without specification of whether filters for edge enhancement, smoothing, or noise reduction were used. However, within clinical trials, documentation of these parameters should be enforced, as they were deemed important or extremely important in the first round. These data are not currently recorded, not even in clinical trials.
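For reproducibility, the window and level actually applied during segmentation should be recorded alongside the resulting ROI. A minimal sketch of a linear window/level mapping follows; the soft-tissue CT settings shown are conventional values used purely for illustration.

```python
import numpy as np

def apply_window_level(image, window, level):
    """Clip and rescale intensities to a display window.

    `window` is the intensity range shown and `level` its centre;
    values are mapped linearly to [0, 1] with out-of-window values clipped.
    """
    lo, hi = level - window / 2.0, level + window / 2.0
    return np.clip((image - lo) / (hi - lo), 0.0, 1.0)

# Example: a typical soft-tissue CT window (window 400 HU, level 40 HU)
ct = np.array([-1000.0, -50.0, 40.0, 300.0])  # Hounsfield units
print(apply_window_level(ct, window=400, level=40))  # [0. 0.275 0.5 1.]
```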

Reference standards and validation

As phantoms provide the exact dimensions of the objects in the images, using them is one way to create a surrogate truth for measuring the performance of an algorithm or a complete imaging pipeline. Synthetic images effectively serve as digital phantoms, where the true object size and boundaries are known and can be perturbed by varying noise, artefacts, or other confounding effects [43]. Alternatively, manually segmented structures can be compared with algorithm-generated segmentations in terms of overlap or boundary differences [44]. This strategy is commonly used, but, because of variability in human perception, it is important to incorporate as many manual segmentations as possible and to combine them into a single statistical ground truth. The widely used Simultaneous Truth and Performance Level Estimation (STAPLE) method estimates the ground truth segmentation by weighting each expert observer's segmentation according to an estimated performance level [45]. Our study has emphasised the need for multiple operators for manual segmentation in order to generate a reference standard. The use of multiple operators, or a human operator supplemented by an automated process, was reinforced by our survey, particularly for validation. The input of human operators was deemed essential for validating automated processes during algorithm development and subsequent roll-out, together with training at regular intervals.
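For illustration, the sketch below gives a minimal expectation-maximisation (EM) implementation of binary STAPLE, alternating between estimating per-voxel ground-truth probabilities and each rater's sensitivity and specificity. Production work would use a validated implementation (e.g. within ITK); the rater masks here are hypothetical.

```python
import numpy as np

def staple_binary(segmentations, prior=0.5, n_iter=50):
    """Minimal EM implementation of binary STAPLE (after Warfield et al.).

    `segmentations`: (n_raters, n_voxels) array of 0/1 manual masks.
    Returns per-voxel foreground probabilities W and each rater's
    estimated sensitivity p and specificity q.
    """
    D = np.asarray(segmentations, dtype=float)
    n_raters = D.shape[0]
    p = np.full(n_raters, 0.9)  # initial sensitivities
    q = np.full(n_raters, 0.9)  # initial specificities
    for _ in range(n_iter):
        # E-step: probability each voxel is truly foreground
        a = prior * np.prod(np.where(D == 1, p[:, None], 1 - p[:, None]), axis=0)
        b = (1 - prior) * np.prod(np.where(D == 1, 1 - q[:, None], q[:, None]), axis=0)
        W = a / (a + b + 1e-12)
        # M-step: re-estimate each rater's performance against W
        p = (D @ W) / (W.sum() + 1e-12)
        q = ((1 - D) @ (1 - W)) / ((1 - W).sum() + 1e-12)
    return W, p, q

# Three hypothetical raters segmenting a 6-voxel lesion
raters = np.array([[1, 1, 1, 0, 0, 0],
                   [1, 1, 0, 0, 0, 0],
                   [1, 1, 1, 1, 0, 0]])
W, p, q = staple_binary(raters)
print(np.round(W, 2))  # consensus probability per voxel
```

Unlike simple majority voting, the EM weighting means a rater whose masks agree poorly with the emerging consensus contributes less to the final ground-truth estimate.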

Operator training

Manual segmentation is highly subjective, and intra- and inter-operator agreement rates (citing the years of relevant experience of the individual operators) are often presented in the literature to indicate both the reliability of the obtained surrogate truths and the level of difficulty of the segmentation problem. Moreover, a manual process is time-consuming and labour-intensive. In one study [46] involving 18 physicians from 4 different departments, agreement, defined as a volume overlap of ≥ 70%, was achieved by only 21.8% of radiation oncologists and 30.4% of haematologic oncologists. Smaller lesions (i.e. < 4 cm³) suffer much more from partial volume effects [47], which a fixed-threshold phantom study has shown to depend critically on lesion size (relative to imaging resolution), contrast, and noise [48], so that the challenges and consistency of operator segmentation are also related to these factors. Our work indicates that operator training is more important than board certification and years of experience, and that refresher training (e.g. using 20 or more data sets) was important within clinical trials on a per-trial basis and was also necessary for clinical research and clinical practice. We also obtained consensus that operator performance should be validated against the reference standard and should achieve a Dice similarity coefficient (as representative of such metrics) of at least 0.7.
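The Dice similarity coefficient referenced above is straightforward to compute from binary masks, as the sketch below shows; the reference and trainee masks are hypothetical.

```python
import numpy as np

def dice(mask_a, mask_b):
    """Dice similarity coefficient between two binary masks:
    2|A ∩ B| / (|A| + |B|)."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

# A trainee's mask vs. the reference standard; >= 0.7 would pass
reference = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 0]])
trainee   = np.array([[1, 1, 0], [1, 0, 0], [0, 0, 0]])
print(round(dice(reference, trainee), 2))  # 0.86, above the 0.7 threshold
```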
