The overlay validation was a substudy, independent of the analytical and clinical validations, designed to validate the use of the algorithm-generated overlays to assist the pathologist in reviewing the slide and AIM-MASH scores. Up to 160 frames or regions of interest within the whole-slide image (WSI) with a predefined area per feature (steatosis, lobular inflammation, hepatocellular ballooning, fibrosis, hematoxylin and eosin (H&E) artifact and trichrome artifact) were evaluated in this study (some frames were enrolled for multiple features). Distributions of frames based on slide-level score (GT scores) are listed in Extended Data Table 1, and distributions of frames based on frame-level scores (collected from the enrollment pathologist) are listed in Extended Data Table 2. For each frame and each feature, the pathologists indicated whether the feature was present (yes or no), shown in Extended Data Table 3.
The acceptance criteria for true positive (TP; evaluation of underestimation by overlay) success were met for all feature overlays except hepatocellular ballooning, for which they were narrowly missed, and the mean success rates were all above 0.85. The TP success rate was 0.97 (95% confidence interval (CI), 0.95–0.99) for H&E artifact, 0.99 (95% CI, 0.97–1.00) for trichrome artifact, 0.94 (95% CI, 0.92–0.96) for lobular inflammation, 0.96 (95% CI, 0.93–0.98) for steatosis and 0.97 (95% CI, 0.95–0.99) for fibrosis. For hepatocellular ballooning, the overall TP success rate was 0.87 (95% CI, 0.83–0.91). The acceptance criteria for false positive (FP; evaluation of overestimation by overlay) success were met for all six feature overlays. The FP success rate was 0.97 (95% CI, 0.95–0.99) for H&E artifact, 0.93 (95% CI, 0.90–0.96) for trichrome artifact, 0.99 (95% CI, 0.98–0.99) for lobular inflammation, 1.00 (95% CI, 0.98–1.00) for steatosis, 0.92 (95% CI, 0.90–0.94) for hepatocellular ballooning and 0.99 (95% CI, 0.99–1.00) for fibrosis.
The individual pathologist TP and FP success rates are listed in Table 1. The number of frames for which all three evaluating pathologists agreed on the presence of the feature (independent of any overlay) divided by the number of frames for which at least one pathologist indicated the presence of feature in a frame was 89% (132 of 148 frames) for H&E artifact, 55.1% for hepatocellular ballooning (65 of 118 frames), 80.0% (124 of 155 frames) for lobular inflammation, 99.4% (158 out of 159 frames) for steatosis, 72.0% (108 of 150 frames) for trichrome artifact and 96.8% (149 of 154 frames) for fibrosis. Given that the agreement for the presence of hepatocellular ballooning was the lowest (55.1%) and the TP success rate for ballooning was above 0.90 for two of three of the pathologists, the sources of variability between pathologists for this feature were further examined. For the 65 frames for which all three evaluating pathologists indicated the presence of hepatocellular ballooning, the TP success rate was calculated. Pathologists A and B identified underestimation in one and three of the 65 frames, respectively, resulting in TP success rates of 0.99 for pathologist A and 0.95 for pathologist B for those frames. However, pathologist C identified underestimation in ten of the 65 frames, showing a TP success rate of 0.85. Additionally, pathologist C identified a total of 111 frames that had some ballooned cells compared to 92 and 71 for pathologists A and B (Extended Data Table 3), indicating that pathologist C may identify more cells as ballooned hepatocytes than the other two pathologists and the algorithm. This is predictable given the lack of standardization across expert pathologists in both identifying and quantifying ballooned hepatocytes22.
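The agreement percentages above are ratios of two frame counts: frames in which all three pathologists marked the feature as present, divided by frames in which at least one did. The following is a minimal Python sketch of that ratio. The function name and the toy per-frame calls are hypothetical and only illustrate the calculation; the `reported` counts are the ones quoted in the text.

```python
def presence_agreement(calls):
    """calls: list of per-frame tuples of booleans, one entry per pathologist.

    Returns (frames where all marked present,
             frames where at least one marked present,
             ratio of the two).
    """
    all_present = sum(1 for frame in calls if all(frame))
    any_present = sum(1 for frame in calls if any(frame))
    ratio = all_present / any_present if any_present else float("nan")
    return all_present, any_present, ratio


# Hypothetical toy data: 4 frames, 3 pathologists (not study data).
toy_calls = [
    (True, True, True),
    (True, False, True),
    (False, False, False),  # nobody saw the feature: excluded from both counts
    (True, True, True),
]
toy_all, toy_any, toy_ratio = presence_agreement(toy_calls)

# Counts quoted in the text: (all three agree, at least one indicates presence).
reported = {
    "H&E artifact": (132, 148),
    "hepatocellular ballooning": (65, 118),
    "lobular inflammation": (124, 155),
    "steatosis": (158, 159),
    "trichrome artifact": (108, 150),
    "fibrosis": (149, 154),
}
percentages = {name: round(100 * a / b, 1) for name, (a, b) in reported.items()}
```

Frames in which no pathologist indicated the feature drop out of the denominator, so the ratio describes agreement on presence only, not agreement on absence.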
Table 1 TP and FP success rates per individual pathologist for overlay validation
Algorithm repeatability and reproducibility
For interday scanner repeatability (AIM-MASH deployment on the same glass slides on different scans from the same scanner on different days), the mean agreement rate between the AIM-MASH scoring on the three separate WSIs for steatosis was 0.93 (95% CI, 0.89–0.96; P < 0.0001), for lobular inflammation was 0.96 (95% CI, 0.94–0.99; P < 0.0001), for hepatocellular ballooning was 0.96 (95% CI, 0.93–0.98; P < 0.0001) and for fibrosis was 0.93 (95% CI, 0.89–0.96; P < 0.001) (Fig. 2a).
Fig. 2: Scanner repeatability and reproducibility of AIM-MASH. a, For scanner repeatability, a subset of 150 cases from the clinical validation were scanned multiple times using the same Leica Aperio AT2 scanner at ×40 magnification on three nonconsecutive days (intrasite, interscan). b, For scanner reproducibility, the same slides were scanned once at three different laboratories by three different operators using three different Leica Aperio AT2 scanners at ×40 magnification (intersite). Bootstrap percentile P values showing statistical significance for the one-sided hypothesis that the mean agreement rate between algorithm scores for each scan is greater than 0.85 are as follows: ***P < 0.0001; **P < 0.01; *P < 0.05; not significant (NS), P ≥ 0.05. Whiskers show the 95% CIs for mean agreement rate estimated using 2,000 bootstraps. Dashed lines indicate 85% agreement.
For intersite scanner reproducibility (AIM-MASH deployment on the same glass slides on different scans from three different sites), the mean agreement rate for hepatocellular ballooning was 0.91 (95% CI, 0.87–0.95; P = 0.02), meeting the acceptance criteria. The mean agreement rates for steatosis, lobular inflammation and fibrosis were approximately 0.85, but the lower bounds of their CIs fell slightly below the 0.85 acceptance criterion (steatosis, 0.86 (95% CI, 0.81–0.90; P = 0.39); lobular inflammation, 0.85 (95% CI, 0.80–0.89; P = 0.53); fibrosis, 0.87 (95% CI, 0.82–0.91; P = 0.21)) (Fig. 2b).
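The CIs and P values for the mean agreement rates are described as bootstrap percentile quantities based on 2,000 resamples, tested one-sided against 0.85. The following is a minimal Python sketch of one way such quantities can be computed. It is an illustration under stated assumptions, not the study's code: the function name is hypothetical, each case is assumed to contribute a single agreement value between 0 and 1, and the P value is taken as the fraction of bootstrap means at or below the threshold.

```python
import random


def bootstrap_mean_agreement(agree, threshold=0.85, n_boot=2000, seed=0):
    """agree: per-case agreement values in [0, 1] (assumed unit of resampling).

    Returns (observed mean, 95% CI lower, 95% CI upper, one-sided P value).
    """
    rng = random.Random(seed)
    n = len(agree)
    means = []
    for _ in range(n_boot):
        # Resample cases with replacement and record the resampled mean.
        sample = [agree[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    # Percentile interval: 2.5th and 97.5th percentiles of the bootstrap means.
    ci_low = means[int(0.025 * n_boot)]
    ci_high = means[int(0.975 * n_boot) - 1]
    # Assumed P value: share of bootstrap means that do not exceed the threshold.
    p_value = sum(1 for m in means if m <= threshold) / n_boot
    return sum(agree) / n, ci_low, ci_high, p_value


# Hypothetical toy data (not study data): 150 cases, 137 with full agreement.
toy_agree = [1.0] * 137 + [0.0] * 13
mean_rate, ci_low, ci_high, p_value = bootstrap_mean_agreement(toy_agree)
```

Under this reading, a mean close to 0.85 yields a P value near 0.5 even when the point estimate sits at or just above the threshold, which matches the pattern reported for steatosis, lobular inflammation and fibrosis in the intersite analysis.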
Pairwise inter-reader agreements were calculated between IMR pathologists across all cases (Supplementary Table 1) to explicitly compare reproducibility across study pathologists to reproducibility achieved by AIM-MASH across sites and scanners. For all histologic components, interscan, intrasite repeatability and interscan, intersite reproducibility were higher than for pathologist mean pairwise agreement (for pairs of pathologists who read at least ten common cases) (Table 2).
Table 2 Manual pathologist versus AIM-MASH repeatability and reproducibility
Accuracy of the algorithm alone and as a pathologist-assist tool
Non-inferiority of the accuracy of AIM-MASH (algorithm only and AI assisted) relative to IMRs was assessed in 1,481 cases by comparing the mean weighted kappa (WK) of IMRs with GT (workflow in Fig. 1b) to the WK of AIM-MASH with GT (workflow in Fig. 1b; results in Fig. 3).
Fig. 3: Accuracy concordance comparison of MASH histologic components and comparisons for MASH aggregate component scores (F2 and F3 versus other and MAS ≥ 4 with ≥1 in each score category versus other) and MASH resolution. a,b, Accuracy comparison, based on linear WK, between AIM-MASH (without pathology review) versus GT and IMR versus GT in a and between AI assisted (AIM-MASH with pathology review) versus GT and IMR versus GT in b for MASH components. c, Accuracy comparison, based on kappa, between AI assisted versus GT and IMR versus GT for aggregate components relevant to clinical trial enrollment and endpoint criteria, including the score-based enrollment requirement, MAS ≥ 4 with a score of at least one for each component, fibrosis score of 2 or 3, and the MASH resolution endpoint, defined as a ballooning score of 0, a lobular inflammation score of 0 or 1 and any score for steatosis. Point estimates are shown on top of each bar, with whiskers representing the 95% CIs estimated from 2,000 bootstrap samples. Non-inferiority (NI) was assessed using bootstrap percentile P values for testing the one-sided hypothesis that the LB of the 95% CIs of the difference in AIM-MASH versus GT or AI assisted versus GT and IMR versus GT is not smaller than −0.1. S (superiority) was assessed by testing the one-sided hypothesis that the LB of the difference is greater than 0. ***P < 0.0001; **P < 0.01; *P < 0.05; NS, P ≥ 0.05. ‘+’ in c indicates aggregate components where the LB of the 95% CIs for AI assisted versus GT kappa is greater than the upper bound of the IMR versus GT kappa.
For AIM-MASH only (Fig. 3a), the difference in WK for AIM-MASH and GT compared to mean WK for IMR and GT for hepatocellular ballooning was 0.15 (95% CI, 0.11–0.18; non-inferiority P < 0.0001) and for lobular inflammation was 0.12 (95% CI, 0.08–0.17; non-inferiority P < 0.0001) with P < 0.0001 for superiority for both components. The difference in WK for AIM-MASH only and GT compared to WK of mean IMR and GT for steatosis was 0.01 (95% CI, −0.02 to 0.03; non-inferiority P < 0.0001) and for fibrosis was −0.01 (95% CI, −0.04 to 0.02; non-inferiority P < 0.0001). Steatosis and fibrosis met non-inferiority but did not achieve superiority.
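The comparisons above rest on a linearly weighted kappa and on the difference between two such kappas, judged against the −0.1 margin given in the Fig. 3 legend. The following is a minimal Python sketch of a linearly weighted Cohen's kappa, followed by a simplified classification of the reported differences. The function names and toy score lists are hypothetical, and comparing the reported 95% CI lower bound with the margin is a simplification of the bootstrap P value test described in the legend.

```python
def linear_weighted_kappa(a, b, k):
    """Cohen's kappa with linear disagreement weights |i - j| / (k - 1).

    a, b: equal-length lists of integer scores in 0..k-1.
    """
    n = len(a)
    # Joint proportions of (score from a, score from b).
    obs = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        obs[x][y] += 1.0 / n
    row = [sum(obs[i]) for i in range(k)]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]

    def weight(i, j):
        return abs(i - j) / (k - 1)

    observed = sum(weight(i, j) * obs[i][j] for i in range(k) for j in range(k))
    expected = sum(weight(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    if expected == 0:
        return float("nan")  # undefined when both raters use a single category
    return 1.0 - observed / expected


def classify(ci_lower, margin=-0.1):
    """Simplified reading: compare the CI lower bound with 0 and with the margin."""
    if ci_lower > 0:
        return "superior"
    if ci_lower > margin:
        return "non-inferior"
    return "not shown"


# Hypothetical toy scores (not study data).
toy_perfect = linear_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], k=4)
toy_chance = linear_weighted_kappa([0, 0, 1, 1], [0, 1, 0, 1], k=2)

# Lower bounds of the 95% CIs for the WK differences quoted in the text (AIM-MASH only).
reported_lower = {
    "hepatocellular ballooning": 0.11,
    "lobular inflammation": 0.08,
    "steatosis": -0.02,
    "fibrosis": -0.04,
}
verdicts = {name: classify(lb) for name, lb in reported_lower.items()}
```

With these inputs, ballooning and lobular inflammation fall in the superior category and steatosis and fibrosis in the non-inferior category, in line with the conclusions stated in the text.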
For AI-assisted pathologist reading of the 1,481 cases (Fig. 3b), the difference in WK for AI assisted and GT compared to mean WK for IMR and GT for hepatocellular ballooning was 0.15 (95% CI, 0.11–0.19; non-inferiority P < 0.0001) and for lobular inflammation was 0.12 (95% CI, 0.08–0.17; non-inferiority P < 0.0001) with P < 0.0001 for superiority for both components. The difference in WK for AI assisted and GT compared to mean WK for IMR and GT for steatosis was 0.01 (95% CI, −0.02 to 0.04; non-inferiority P < 0.0001) and for fibrosis was 0.01 (95% CI, −0.02 to 0.03; non-inferiority P < 0.0001). Steatosis and fibrosis met non-inferiority but did not achieve superiority. For all MASH score components, WKs for AI assisted and GT were in the ranges of published CRN pathologist WKs8,14.
For composite histologic scores, accuracy was higher with AI-assisted pathologist reading than with IMRs (Fig. 3c). The WKs for AI assisted and GT and WKs for IMR and GT for fibrosis 2 and 3 (F2 and F3) versus other were equivalent, with WK for AI assisted and GT being slightly higher than WK for IMR and GT (0.57 versus 0.53, respectively; Fig. 3c). For the trial-relevant enrollment criteria of MAS ≥ 4 with ≥1 in each score category, the WK between AI assisted and GT was significantly higher than the WK between IMR and GT (0.63 versus 0.51, respectively, with a difference of 0.11 and a 95% CI of 0.07–0.16). For MASH resolution (defined as a hepatocellular ballooning score of 0, a lobular inflammation score of 0 or 1 and any steatosis score), the WK between AI assisted and GT was also significantly higher than the WK between IMR and GT (0.54 versus 0.37, respectively, with a difference of 0.16 and a 95% CI of 0.10–0.22) (Fig. 3c). Here, 'significantly higher' means that the lower bound (LB) of the 95% CI for AI assisted versus GT kappa was greater than the upper bound of the 95% CI for IMR versus GT kappa.
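The aggregate categories compared above are binary labels derived from the component scores. The following is a minimal Python sketch of that derivation. The function names are hypothetical, and MAS is assumed here to be the sum of the steatosis, lobular inflammation and hepatocellular ballooning scores; the thresholds follow the definitions given in the text.

```python
def meets_activity_enrollment(steatosis, inflammation, ballooning):
    """MAS >= 4 with a score of at least 1 in each component.

    Assumption: MAS is the sum of the three component scores.
    """
    mas = steatosis + inflammation + ballooning
    return mas >= 4 and min(steatosis, inflammation, ballooning) >= 1


def is_f2_or_f3(fibrosis):
    """Fibrosis stage 2 or 3 versus any other stage."""
    return fibrosis in (2, 3)


def is_mash_resolution(steatosis, inflammation, ballooning):
    """Ballooning 0, lobular inflammation 0 or 1, any steatosis score."""
    return ballooning == 0 and inflammation <= 1


# Hypothetical example cases (not study data).
case_enrolled = meets_activity_enrollment(steatosis=2, inflammation=1, ballooning=1)
case_no_ballooning = meets_activity_enrollment(steatosis=3, inflammation=1, ballooning=0)
case_resolved = is_mash_resolution(steatosis=3, inflammation=1, ballooning=0)
```

Because each label collapses several ordinal scores into a yes or no call, a one-point disagreement in a single component can flip the aggregate label.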
For AI-assisted evaluation against a median of a panel of pathologists (GT workflow described in Fig. 1c), non-inferiority was met for all histologic components for agreement of AI-assisted reads with median GT reads, compared to the agreement between median read scores derived from two different groups of pathologists (GT workflow in Fig. 1c, results in Fig. 4). For steatosis, the average WK for AI assisted versus GT was 0.68 and for manual median versus GT was 0.75, with a difference of −0.07; for lobular inflammation, the WK for AI assisted versus GT was 0.43 and for manual median versus GT was 0.44, with a difference of −0.02; for hepatocellular ballooning, the WK for AI assisted versus GT was 0.56 and for manual median versus GT was 0.53, with a difference of 0.04; and, for fibrosis, the WK for AI assisted versus GT was 0.65 and for manual median versus GT was 0.72, with a difference of −0.09.
Fig. 4: WK analysis for MASH components AI assisted and median panel comparisons. The same cohort of 1,481 cases used in analytical and clinical validation was used to determine the accuracy of AI-assisted reads against two panels of readers. Median GT (panel 1, using median scores instead of panel calls for consensus, described in Fig. 1c) and median IMR (panel 2, derived from a minimum of three IMRs) were determined. AI-assisted scores for each component met the non-inferiority performance criteria described in Statistical analysis (Methods). Superiority was not observed for any of the components. Whiskers represent 95% CIs estimated using 2,000 bootstrap samples. ***P < 0.0001; **P < 0.01; *P < 0.05; NS, P ≥ 0.05.