Evaluation of multiple-vendor AI autocontouring solutions

All four investigated contouring solutions received comparable physician scores, with notable exceptions among the three AI contouring solutions for the bladder, brain, femoral head, and spinal cord, as discussed in detail below.

Although SIE received slightly higher (worse) physician scores than MIM or RAD for the brain contours, this can be explained by a stylistic difference: SIE subtracts the brainstem from the brain contour, which is not consistent with our clinical practice. Similarly, for the femoral head contours, RAD contours only the femoral head and excludes the femoral neck, which is included in the physician, MIM, and SIE contours. Finally, for the spinal cord, SIE contours the true spinal cord, whereas AP, MIM, and RAD contour the spinal canal (or thecal sac) as a surrogate for the cord, in concordance with our clinical practice, as shown in Fig. 2. For the spinal cord contour, although similar DSCs were found (0.75 - MIM, 0.81 - RAD, 0.68 - SIE), MIM showed a larger average HD (18.3 mm - MIM, 6.0 mm - RAD, 7.2 mm - SIE) and MDA (3.6 mm - MIM, 1.0 mm - RAD, 1.6 mm - SIE). Closer investigation revealed that these larger distances occurred only in abdominal/pelvic patients, for whom MIM contoured the spinal cord only to the level of the L2 vertebra, whereas the physician and the other contouring solutions included the cauda equina in the spinal cord structure, as shown in Fig. 2B.
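
For reference, the DSC, HD, and MDA values quoted throughout this section can be computed directly from pairs of binary structure masks. The Python sketch below is a minimal illustration of these metrics, not the implementation used in this study; the function names are assumptions, it expects two same-shape boolean masks with a known voxel spacing, and it ignores edge cases such as empty masks. Note also that MDA definitions vary slightly between tools; here it is taken as the mean of the pooled bidirectional surface distances.

```python
# Minimal sketch of the overlap and surface-distance metrics (DSC, HD, MDA).
# Illustrative only; not the metric implementation used in this study.
import numpy as np
from scipy import ndimage


def surface_voxels(mask):
    """Return the surface of a binary mask (voxels removed by one erosion)."""
    return mask & ~ndimage.binary_erosion(mask)


def contour_metrics(a, b, spacing=(1.0, 1.0, 1.0)):
    """Compute DSC, Hausdorff distance (HD) and mean distance to agreement (MDA)."""
    a, b = a.astype(bool), b.astype(bool)

    # Dice similarity coefficient: 2|A ∩ B| / (|A| + |B|)
    dsc = 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

    # Euclidean distance (in mm) from every voxel to each structure's surface
    surf_a, surf_b = surface_voxels(a), surface_voxels(b)
    dist_to_a = ndimage.distance_transform_edt(~surf_a, sampling=spacing)
    dist_to_b = ndimage.distance_transform_edt(~surf_b, sampling=spacing)

    d_ab = dist_to_b[surf_a]  # distances from A's surface to B's surface
    d_ba = dist_to_a[surf_b]  # distances from B's surface to A's surface

    hd = max(d_ab.max(), d_ba.max())           # Hausdorff distance
    mda = np.concatenate([d_ab, d_ba]).mean()  # mean (symmetric) surface distance

    return dsc, hd, mda
```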

Fig. 2

A) Transverse and B) sagittal views with a “soft tissue” window/level, showing spinal cord contours. C) Sagittal view with a “soft tissue” window/level, showing bladder contours. D) Coronal view with a “lung” window/level showing left and right lung contours. Contours are labeled as approved by physicians (AP), generated using ProtégéAI+ (MIM), AutoContour (RAD), and DirectORGANS (SIE)

When examining the bladder contours, poor scores were found for some or all vendors when unusual anatomy was encountered. MIM, RAD and SIE all received average scores > 4.5 for one patient in whom a contrast agent had been placed within the bladder. Both MIM and RAD had average scores > 4.5, or “unusable”, for one female patient with advanced gynecological cancer, for whom SIE had an average score of 2.67, as shown in Fig. 2C. One male patient with metastatic prostate disease and an enlarged, trabeculated bladder also received average scores > 4.5 for MIM and SIE, whereas RAD received an average score of 3.00. When these three examples of unusual anatomy were excluded, the average physician scores improved by 0.70, 0.79 and 0.73 for MIM, RAD and SIE, respectively.

An example of the potential errors introduced by autocontouring solutions for patients with abnormal or nonstandard anatomy is shown in Fig. 2D. Here, the patient’s right lung was typical, while the left lung had partially collapsed. For the right lung, all autocontouring solutions performed well, with PS values between 1.67 and 2.56, DSC values ≥ 0.92 and MDA values ≤ 2.4 mm. For the left lung, however, only MIM matched the AP contour well, with a PS of 2.33, a DSC of 0.93 and an MDA of 1.1 mm, while both RAD and SIE produced unusable contours, with DSCs of 0.38 and 0.02, respectively, and PS > 4.

These examples highlight some of the challenges faced by vendors: the contouring atlases used to define specific organs can vary between research studies, between countries, and over time, which can lead to the stylistic differences noted above. Collaboration with users across a range of clinical practices is important to allow these autocontouring solutions to improve. Since we began this evaluation, Radformation has already updated its available models to allow users to select femoral head models that match the RTOG guidelines, which would be expected to improve the physician scores for this structure. New female pelvis atlases are also available, which may improve bladder contouring.

DSCs greater than 0.5 were found when comparing the AI-generated structures to the AP structures, with the exception of the RAD femoral head, owing to the contouring differences outlined above. Most structures had average DSCs between 0.7 and 0.9, indicating good agreement over the bulk of the structure but with room for improvement, especially at the periphery. Doolan et al. investigated five autocontouring solutions, including RAD, using volumetric methods [26]. Their work found similar DSC scores when averaged across all volumes for the various contouring solutions. They also investigated time savings and found that between 14 and 93 min could be saved, depending on the number and complexity of the contoured organs. The average HD and MDA were similar between the autocontouring solutions, with the exceptions noted above, and 41 of the 48 structures had an average MDA < 5 mm.
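
As an illustration of how per-structure summaries such as those above (for example, the count of structures with an average MDA < 5 mm) might be tabulated, the brief sketch below aggregates per-patient metrics by vendor and structure. The DataFrame column names are hypothetical, and this is not the analysis pipeline used in this study.

```python
# Rough sketch of cohort-level aggregation of similarity metrics.
# Column names ("vendor", "structure", "dsc", "hd_mm", "mda_mm") are assumed.
import pandas as pd


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Average each metric per vendor/structure and flag average MDA < 5 mm."""
    summary = (
        df.groupby(["vendor", "structure"])[["dsc", "hd_mm", "mda_mm"]]
        .mean()
        .reset_index()
    )
    summary["mda_under_5mm"] = summary["mda_mm"] < 5.0
    return summary
```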

When examining physician scores between contouring modalities, 11/16 (68.8%) of the manually generated, approved physician contours had average scores ≤ 2.5. MIM showed slightly worse results, with 10/16 (62.5%) of contours receiving average scores ≤ 2.5, while both RAD and SIE achieved better results, with 14/16 (87.5%) of contours receiving average scores ≤ 2.5. Bustos et al. compared one autocontouring solution to manually generated and atlas-based contours [27]. Their work also included a review of the AI-generated contours by a single radiation oncologist and found that, of the 140 contours evaluated, only 5 (3.6%) required major edits or needed to be completely redone. A total of 95 (67.9%) were judged to be clinically usable with no edits necessary, similar to the results of this study. We deemed contours with average physician scores less than 2.5 to be clinically usable, with only minor or stylistic differences. With most of the AI-generated contours achieving these scores, all investigated products can be deemed at least as good as physician contours for a subset of contours. This underscores the potential of AI-generated contours to simplify and streamline the contouring and treatment planning process.

As a result of this work, we decided to implement AutoContour (RAD) at all of our clinical sites, spanning five facilities, four CT simulators, eight LINACs and three HDR treatment units. While similar physician scores and similarity metrics were found for all vendors, RAD offered the largest number of available organ contours at the time of this work.
