Characterization of high-grade prostate cancer at multiparametric MRI: assessment of PI-RADS version 2.1 and version 2 descriptors across 21 readers with varying experience (MULTI study)

To specifically evaluate the characterizing value of the PI-RADSv2/v2.1 descriptors, we asked the readers to score the exact same corpus of lesions. To be clinically meaningful, this corpus had to include lesions with a wide range of degrees of suspicion. Therefore, we selected consecutive patients who underwent MRI and biopsy at our institution in 2015–2016. At that time, our biopsy policy required targeting all focal lesions, even those with a low degree of suspicion. Biopsy operators could omit systematic biopsy in PZ sextants that had undergone targeted biopsy, which allowed targeting several lesions without unreasonably increasing the number of cores taken. Hence, 92.5% (147/159) of the study patients underwent targeted biopsy, while the csPCa prevalence was only 39.6% and 33% at the patient and lesion levels, respectively. Furthermore, in accordance with the recommendations of the time [29], MRI was not used to select patients for biopsy but only to indicate the lesions to target, which limited selection bias.

This set of predefined lesions was first used to assess inter-reader agreement on lesion size and location. Agreement on size was excellent (CCC ≥ 0.80). The overall agreement on lesion location (PZ, TZ or CZ) was moderate-to-good (κ = 0.60–0.73). Only 59% (142/240) of the predefined lesions were localized in the same zone by all readers. This is problematic since PZ and TZ lesions are scored differently, using different dominant sequences. Additionally, CZ lesions are also assessed differently, at least using PI-RADSv2.1 descriptors [10]. Thus, any variability in lesion location can have major consequences on the final scoring agreement. Variability in lesion location can be explained by two main factors. First, due to the lack of well-defined anatomical landmarks between CZ and PZ, the number of lesions localized in CZ was highly variable from one reader to another. Second, partial volume effects in some locations (e.g. anterior horn of the PZ, extreme apex) made it difficult to distinguish between PZ lesions and TZ nodules protruding into the PZ. 3D T2w acquisitions with multiplanar reformations might facilitate lesion localization by reducing partial volume effects. Unfortunately, in this study, readers had access only to 2D T2w axial and sagittal imaging.
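For readers less familiar with the two agreement statistics quoted above, the following minimal Python sketch shows how Lin's concordance correlation coefficient (CCC) and Cohen's κ can be computed for a single pair of readers. It is an illustration only: the function names and the toy data are hypothetical, and the pairwise formulation is an assumption that may differ from the multi-reader statistical procedure actually used in the study.

```python
from collections import Counter


def lin_ccc(x, y):
    """Lin's concordance correlation coefficient for two paired measurement lists.

    Uses population (1/n) variances. Undefined (division by zero) if both
    lists are constant and have equal means.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return 2 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two readers assigning one categorical label per lesion.

    Undefined (division by zero) if chance agreement equals 1.
    """
    n = len(labels_a)
    # Observed agreement: fraction of lesions given the same label by both readers.
    p_observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    # Chance agreement: sum over labels of the product of marginal frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    p_chance = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical toy data, not taken from the study.
sizes_reader_1 = [1, 2, 3]   # lesion sizes measured by reader 1
sizes_reader_2 = [2, 3, 4]   # reader 2 measures each lesion 1 unit larger
zones_reader_1 = ["PZ", "PZ", "TZ", "TZ"]
zones_reader_2 = ["PZ", "TZ", "TZ", "TZ"]

print(lin_ccc(sizes_reader_1, sizes_reader_2))      # 4/7, about 0.571
print(cohen_kappa(zones_reader_1, zones_reader_2))  # 0.5
```

In the toy example, the two readers' size measurements are perfectly correlated but systematically shifted, which the CCC penalizes; this is why it is preferred to a simple correlation coefficient for assessing agreement on a continuous measure.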

Like others [30], we found that experienced seniors performed significantly better, mostly because they assigned lower scores to non-csPCa lesions. However, the impact of experience on inter-reader agreement was small and agreement remained moderate at best, even for experienced seniors. This is discordant with another study in which inter-reader agreement was substantial and better between dedicated uro-radiologists than between non-dedicated radiologists. However, in that study, all radiologists were from the same institution, which may have reduced interpretation variability, particularly among dedicated radiologists [15]. Taken together, our results suggest that, despite continuous efforts of standardization and clarification, most PI-RADS descriptors remain subjective. Distinguishing ‘marked’ from ‘non-marked’ abnormalities, ‘encapsulated’ from ‘mostly encapsulated’ nodules, or ‘focal’ from ‘non-focal’ enhancement is subjective but has a major effect on the final score. Interestingly, for PI-RADSv2.1 and PI-RADSv2, and for all groups of readers, κ values tended to be higher for T2-weighted and diffusion-weighted categories than for DCE categories. Although this finding should be interpreted with care since not all pulse sequences have the same number of categories, it may suggest that visually distinguishing positive from negative cases is difficult at DCE, especially in the presence of subtle enhancements from background.

Several solutions for improving MRI reproducibility can be suggested. Mentoring through systematic double reading with an experienced reader could probably accelerate the training of beginners, but this is made difficult by the heavy workload of radiologists [31]. Using quantitative thresholds for apparent diffusion coefficient or DCE-derived parameters may also improve prostate MRI accuracy and inter-reader agreement [16, 32,33,34], but there is still progress to be made on the reproducibility of MRI biomarkers [35,36,37,38]. Finally, assistance by Artificial Intelligence algorithms may facilitate prostate MRI reading in the future; however, conflicting results have been recently published on this matter [39,40,41,42,43,44,45].

Our sample size was not designed to statistically compare PI-RADSv2.1 and PI-RADSv2 performances, because the difference was expected to be small. Meaningful comparison would have needed an unrealistic number of patients. Yet, the strict application of PI-RADSv2.1 descriptors in predefined lesions tended to yield lower scores in non-csPCa lesions as compared to PI-RADSv2 descriptors. This was mainly observed in PZ lesions for which the PI-RADSv2.1 clarifications on Dw imaging categories 2, 3 and 4 seem to have favoured better characterization. However, this effect was too small and too heterogeneous across readers to induce a substantial difference between the AUCs of the two scores. Additionally, PI-RADSv2.1 clarifications did not improve inter-reader agreement.
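The link between lower scores in non-csPCa lesions and the AUC can be illustrated with a minimal Python sketch that computes the AUC as the probability that a randomly chosen csPCa lesion receives a higher score than a randomly chosen non-csPCa lesion, with ties counted as one half. The function name and the scores below are hypothetical and are not taken from the study data.

```python
def auc_from_scores(positive_scores, negative_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    correct = 0.0
    for p in positive_scores:
        for q in negative_scores:
            if p > q:
                correct += 1.0
            elif p == q:
                correct += 0.5
    return correct / (len(positive_scores) * len(negative_scores))


# Hypothetical scores assigned to three csPCa and three non-csPCa lesions.
cspca_scores = [4, 5, 3]
non_cspca_scores_a = [2, 3, 1]  # one non-csPCa lesion scored 3
non_cspca_scores_b = [2, 2, 1]  # same lesions, with that score lowered to 2

print(auc_from_scores(cspca_scores, non_cspca_scores_a))  # 8.5/9, about 0.944
print(auc_from_scores(cspca_scores, non_cspca_scores_b))  # 1.0
```

Lowering the score of a single non-csPCa lesion removes one tie and raises the AUC, but only modestly; when such shifts are infrequent and inconsistent across readers, the resulting AUC difference remains small.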

After assessing the predefined lesions, readers were allowed to describe additional suspicious lesions. This was designed to evaluate whether the new PI-RADSv2.1 upgrading rules in TZ increased the number of suspicious lesions as compared to PI-RADSv2. In accordance with other studies [12,13,14, 18], we found that such upgrades were rare. As a result, per-lobe analysis, which included predefined and additional lesions, showed results similar to those of per-lesion analysis: experienced seniors outperformed the two other groups of readers, and, in all groups of readers, PI-RADSv2.1 showed a trend toward improved specificity as compared to PI-RADSv2. Of note, the number of additional lesions was highly variable across readers, with juniors tending to describe more lesions than seniors.

In this study, experienced readers were a priori defined as having more than 5 years of experience. A recent European consensus suggested that a minimum of 1000 cases should be read to become an expert [31]. All our experienced seniors fulfilled that condition, and our results are in line with those of the European consensus.

Readers assessed PI-RADSv2 and PI-RADSv2.1 during the same session. This may have resulted in underestimating the differences between the scores. However, independent scoring is illusory; most readers were familiar with the PI-RADSv2 descriptors and would have kept them in mind when using the new PI-RADSv2.1 criteria. In addition, assigning the scores in two different sessions introduces intra-reader variability, which may be substantial [46, 47]. Because reading the cases needed approximately 15–20 h, we were also afraid that the second reading would be biased by fatigue and the gradual lack of involvement of the readers. Thus, we chose to ask the readers to concentrate, during the same reading session, on the assessment of each pulse sequence category by following as closely as possible the written PI-RADSv2 and PI-RADSv2.1 descriptors without minding the overall score that was calculated automatically.

Our study has limitations. Firstly, because we indicated the predefined lesions to the readers, the AUCs obtained herein do not fully assess the diagnostic performance of the PI-RADS score in clinical routine. The detection phase, which is also a source of interpretation variability, was outside the scope of this study. However, many other studies have already assessed the overall performance of the PI-RADS score [23,24,25]. Instead, we wanted to specifically evaluate whether the PI-RADS descriptors were specific enough to induce reproducible characterization of the same lesion across multiple readers. This allowed the evaluation of factors of variability (size, location, PI-RADS categories of each pulse sequence) that, to our knowledge, had not been studied before. Secondly, prostate biopsy, used as reference standard, may have missed some csPCas. However, the small proportion of aggressive cancers detected during follow-up suggests that the sensitivity of our biopsy technique was good. Thirdly, we included only biopsy-naïve patients. Our results may not be valid for other populations.

In conclusion, when assessing the same set of MRI lesions using PI-RADSv2.1 and PI-RADSv2 descriptors, experienced seniors performed significantly better in characterizing csPCa than the other groups of readers. PI-RADSv2.1 descriptors tended to be more specific than PI-RADSv2 descriptors, but did not improve inter-reader agreement.
