AI in Pathology: What could possibly go wrong?

Pathology is the study of the causes, progression, and consequences of disease. Advances in medical technology have enabled pathologists to increasingly use automated image analysis tools and algorithms to help improve their diagnoses and treatment options. However, there are also concerns about the potential dangers involved with artificial intelligence (AI) becoming increasingly pervasive in the medical field. There are several potential issues that could arise when using AI in pathology. Some of the key concerns include:

1. Lack of data diversity: If the dataset used to train an AI model is not diverse, the model may not be able to accurately identify pathology in a wide range of patients.

2. Bias in the data: AI models can perpetuate bias present in the training data, leading to inaccurate or unfair diagnoses for certain groups of patients.

3. Lack of understanding: AI models can be difficult to interpret, making it difficult for pathologists to understand how a diagnosis was made, and therefore difficult to trust the AI diagnosis.

4. Security: AI models and data must be protected from unauthorized access, especially when dealing with sensitive patient information.

5. Regulation: AI models will have to comply with data protection and healthcare regulations.

6. Accreditation: AI models will have to be validated and accredited by relevant regulatory bodies before they can be used in a clinical setting.

The section above was composed in its entirety by the OpenAI-developed, web-based language model GPT-3, which has been shown to produce human-like text.1 The prompt was, “Pathology and AI: What could possibly go wrong?” While large language models (LLMs) have burst on the scene with the recent release of ChatGPT and similar tools, we include this example here to demonstrate AI's potential contribution to (or intrusion into) human expert tasks across many areas, including, of course, pathology, as well as the already significant penetration of these concerns into mineable online texts. The machine-generated introduction above and the text-to-image example shown in Figure 1 managed to highlight many of the concerns that will be covered in the rest of this almost-completely-human-generated discussion.

As is increasingly evident, the worlds of pathology and AI are colliding in exciting and concerning ways—this phenomenon is working itself out not only in academia and in the research labs of well-funded tech companies, but also in the commercial and regulatory sectors. A number of histopathology-focused AI systems are currently entering the market worldwide; a few examples include Paige2, PathAI3, Aiforia4, and Proscia5. The first FDA approval came in September 2021 for Paige Prostate Detect6,7, a product designed to identify areas suspicious for carcinoma in prostate biopsy cases. Despite these examples, implementation, validation, and maintenance of these systems in a clinical environment present a number of challenges.

Many of these issues have already been confronted in other fields outside of pathology, particularly in radiology, which has been wrestling with them for some years now.8, 9, 10 Some angst was generated by an experiment a few years ago demonstrating that even pigeons were quite capable of delivering diagnoses on both histology and radiology images.11 In addition to some caustic emailing, the study did stimulate a reframing of some discussions of the role of AI and the possible threats presented12, and helped inform our understanding of the neural basis for [human] acquisition of pathology expertise.13

In Table 1 and in the discussion below, we provide an overview of the potential challenges and benefits at different stages of the AI pipeline, from data acquisition to post-approval maintenance, and their ultimate impact on human practitioners and society. The structure of the discussion is as follows:

Upstream issues of AI (data acquisition, algorithm training)

Downstream issues of AI (dataset shift, model recalibration, changing practice patterns)

Human issues related to AI (deskilling, dethrilling, burnout)

Societal implications (health equity, data inclusion)

The challenges in Table 1—and others discussed below—should instill caution. But we do not see things as relentlessly dire. The impressive potential impact of AI will be convincingly presented in other papers in this special edition, as well as in discussions published previously.14, 15, 16, 17, 18, 19 AI addresses a broad range of applications in pathology, from quality control, diagnosis, prognosis, and treatment recommendations to workflow improvements designed to increase practice efficiency. The potential benefits of AI are numerous—but as we try to show here, those benefits can be accompanied by opposing effects whose risks are plausible and should be considered. Indeed, what could possibly go wrong?

By “upstream,” we mean the tasks involved in data acquisition and algorithm training. Obtaining the data used to train AI tools involves capturing the right mix of clinical presentations and histologies across a sufficiently broad patient distribution. This task also includes the need to ensure that the images being submitted to the AI tools are generated with histology preparation and scanning techniques that match those used in the foundational databases. The second major challenge is the development of appropriate AI tools that can ingest, process, and interpret these data—and stay current. These problems are exacerbated by ongoing technological evolution, as well as the emergence of new diseases (e.g., COVID-19).

There are serious problems involved in developing pathology-enabling models. Access to high-quality data is critical, but significant issues arise that include generalization biases—the fact that different slide and staining processes are in place, not only between different medical centers within a single country, but even more so across countries. For example, it is not well-appreciated in the US that some European H&E staining solutions include saffron dye for improved fibrous tissue delineation. This may not even be indicated in the slide description, as the term H&E—rather than the more correct HES—may be employed as shorthand. Differences in procedures, technique, and quality exist between academic medical centers, community hospitals, and laboratories.20 Moreover, not infrequently, the histology processing may be performed by separate, third-party commercial services, some better than others, and hospitals can use different providers, sometimes making the switch without notice to affected pathologists or the AI team. Quality of metadata, that is, information associated with cases beyond just the digital image itself, is also a tremendous challenge. An anecdote from the UC Davis burn registry will illustrate the problem. There was a field for “location of burn,” and entries included: “my left leg”; “my right arm”; “the back yard”; and “Mexico.” [example courtesy Kent Anderson, UC Davis Health]
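To make the metadata problem concrete, the brief sketch below (Python, with an invented field name and controlled vocabulary chosen purely for illustration) shows the kind of lightweight validation that can keep free-text entries like those in the burn-registry anecdote from silently entering a training set.

```python
# Hypothetical illustration: validating a free-text metadata field against a
# controlled vocabulary before a case is admitted to an AI training set.
# The field name and vocabulary are invented for this example.

ALLOWED_BURN_SITES = {"head", "torso", "left arm", "right arm", "left leg", "right leg"}

def validate_burn_site(entry: str) -> bool:
    """Return True only if the entry maps cleanly onto the controlled vocabulary."""
    normalized = entry.strip().lower().removeprefix("my ").strip()
    return normalized in ALLOWED_BURN_SITES

# Entries like "my left leg" pass after light normalization; entries such as
# "the back yard" or "Mexico" are flagged for manual curation instead of
# silently entering the training metadata.
for entry in ["my left leg", "my right arm", "the back yard", "Mexico"]:
    print(entry, "->", "ok" if validate_burn_site(entry) else "needs review")
```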

It is not unusual for pathologists to encounter cases with poor histology quality—even at major medical centers—and of course the problems multiply when less equipped, understaffed, and/or underfunded histology facilities are in the mix, and can get exponentially worse in truly low-resource settings. Laboratories experiencing this issue must prioritize development of good histology practices before considering digital pathology and AI implementation. Even laboratories with adequate histology will need to practice stringent quality control techniques because digitized artifacts can flummox downstream AI tools if not anticipated. As an example, a slide with a small fold on a section may still be adequate for interpretation by a pathologist or technician using a microscope because focus is manually adjustable, and a human can disregard technically inadequate regions, as can some AI tools now (e.g., HistoQC).21, 22, 23
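As a rough illustration of this kind of screening, the sketch below (plain Python/NumPy, not HistoQC itself, with purely illustrative thresholds) flags tiles that appear out of focus or dominated by dark, fold-like artifact before they reach a downstream model.

```python
# A minimal sketch (not HistoQC) of tile-level quality screening: tiles that are
# out of focus or dominated by dark, fold-like artifact are excluded before
# reaching a downstream model. Thresholds here are illustrative only.
import numpy as np

def tile_quality_flags(tile: np.ndarray,
                       focus_threshold: float = 50.0,
                       dark_fraction_threshold: float = 0.2) -> dict:
    """tile: HxWx3 uint8 RGB tile extracted from a whole-slide image."""
    gray = tile.mean(axis=2)
    # Crude focus measure: variance of the squared gradient magnitude;
    # very low values suggest the region was scanned out of focus.
    gy, gx = np.gradient(gray)
    focus_score = float((gx ** 2 + gy ** 2).var())
    # Crude fold proxy: folds often appear as abnormally dark, dense regions.
    dark_fraction = float((gray < 80).mean())
    return {
        "blurry": focus_score < focus_threshold,
        "possible_fold": dark_fraction > dark_fraction_threshold,
        "usable": focus_score >= focus_threshold
                  and dark_fraction <= dark_fraction_threshold,
    }
```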

In practice, there are two main approaches to implementing digital pathology image acquisition: retrospective and prospective scanning. Retrospective scanning involves scanning glass slides after they have been received by pathologists, while prospective scanning involves incorporating digitization into the histology production process, resulting in a digitized slide being available immediately to the pathologist. Prospective scanning offers the most streamlined workflow, with benefits such as minimal added time for digitization, no need to clean slide surfaces, and consistent availability of digital cases for both routine diagnostic work and new AI-based workflows. However, it is important to note that adding digitization to an already complex histology workflow is not a straightforward task and will involve significant up-front and continuing costs. Practices already dealing with throughput issues should be cautious when implementing digital pathology solutions. Furthermore, digitization without proper implementation will create additional bottlenecks in glass slide production, which could further delay case sign-out. Delays could affect not only the reputation of the laboratory and pathologists but could also be deleterious to patients with diseases whose management depends on the availability of prompt pathology results.

AI systems perceive their environment in the form of data. These data can take the form of electronic health records, pathology reports, and whole-slide images. The data used to teach and train AI systems to make decisions may contain implicit biases arising from historical decisions and actions of humans, across racial, religious, ethnic, or gender dimensions. Datasets are just one source of bias in AI; if the teams actually building the AI algorithms are not sufficiently diverse, they are likely to introduce biases when setting up the problem (cognitive bias), designing the experiment (framing bias), or selecting the algorithms (selection bias). Marginalized groups are the most at risk, since they are under-represented in the datasets used to train and validate AI algorithms.

However, having diverse datasets will not address disparities in care that are embedded in electronic medical record (EMR) data; these disparities thus have the potential to be translated into the resulting algorithms. Counterintuitively, the explicit inclusion of race in the construction or evaluation of datasets can contribute to, rather than minimize, downstream error. This is because the notion of race (or a substitute variable, skin pigmentation) has been enshrined in the medical literature while having little or no underlying validity on its own. Far more determinative are associated variables such as income, geographical location, the impact of racially tuned medical assessments and treatments, and so on.24 Including race designations in the metadata accompanying medical images can create erroneous prior odds for various associations and affect machine learning output. While such identification may correlate with certain determinants of health status, simply capturing zip code data (as a proxy for income or environmental exposures) may in fact perform more reliably. Unfortunately, while proxies can allow one to avoid explicit use of race, the resulting algorithm may still impact one race more than others, so the fundamental concern is likely to remain.

While it is evident that AI tools will be able to contribute expertise (e.g., diagnostic capabilities) that might otherwise be completely unavailable in some low-resource settings, it is also possible that poorly configured AI would enshrine knowledge imbalances and thus perpetuate global inequities, as the diagnostic tools may poorly reflect the local prevalence and predilections of disease. For example, Oncotype DX predicts risk poorly when applied to breast cancer specimens from African-American women.25 While that was a study based on molecular phenotypes, the same phenomenon manifests itself when image-based structural biomarkers, such as multinucleation, are considered as predictive risk features.26

Mitigating and remediating biases in AI is challenging but necessary for creating a trusted foundation for more ethical and less biased AI functionalities. Typical development teams rarely have the expertise and perspective needed to fully address biases in real-world data. Social determinants are infrequently included as features for prediction, classification, or optimization; to remedy this, appropriately nuanced social scientists should be part of study design.

Competent AI requires large volumes of diverse data for training, traditionally achieved only by centralizing data from multiple sources. This process is risky, as it creates opportunities for breaches and typically requires careful, thorough, non-trivial de-identification before patient data leave the safety of hospital or laboratory firewalls.27,28 In pathology settings, high-volume breaches were previously limited by the simple fact that slides physically existed in only one location. The rapid digitization of the field (necessary for AI) makes such breaches both possible and potentially more damaging.29
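The sketch below gives a deliberately simplified flavor of metadata de-identification prior to any data leaving an institution; the field names are hypothetical, and real-world de-identification (slide label images, scanner metadata, burned-in text) is considerably more involved.

```python
# A simplified, hypothetical sketch of scrubbing direct identifiers from
# case-level metadata before images and labels leave the institution.
# Field names are invented; real de-identification is far more involved.
PHI_FIELDS = {"patient_name", "mrn", "date_of_birth", "address", "accession_id"}

def deidentify_case(case: dict, pseudonym: str) -> dict:
    """Return a copy of the case record with direct identifiers removed and
    a study-specific pseudonym substituted for the accession identifier."""
    clean = {k: v for k, v in case.items() if k not in PHI_FIELDS}
    clean["case_id"] = pseudonym
    return clean

example = {
    "patient_name": "Jane Doe",
    "mrn": "000123",
    "accession_id": "S22-4567",
    "diagnosis": "prostatic adenocarcinoma",
    "gleason_score": "3+4",
}
print(deidentify_case(example, pseudonym="STUDY-0001"))
```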

Federated learning is emerging as a leading technology for addressing many of the issues plaguing AI (e.g., poor performance, bias, model degradation) while decreasing the risk to patient privacy: training occurs across multiple institutions while patient data remain behind hospital firewalls.30 This not only accelerates development of higher-performance, more generalizable models, but also unlocks use cases only possible with access to real-time, rich data (e.g., continuous model monitoring to assess model drift). Federated learning has already had significant impact in healthcare AI, facilitating global collaborations of over 70 hospitals, and is being actively leveraged in pathology.30, 31, 32, 33, 34, 35

The basic principles of federated learning involve training AI models on local data at different sites; only the locally trained model weights are transferred from each site and collated via different aggregation techniques to create a consensus model (federated averaging is the most commonly used). The resulting model can achieve performance similar to one trained on fully centralized data and shows improved performance and generalizability compared with models trained at a limited number of sites.36,37 This approach also de-risks multi-party collaborations, especially where there is concern about exposing model IP to potentially nefarious actors. While a federated approach supports data privacy by design, privacy and security can be further enhanced by add-on technologies such as differential privacy, partial weight sharing, and homomorphic encryption.38,39
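For readers unfamiliar with federated averaging, the minimal sketch below shows only the core aggregation step: each site contributes its locally trained weights, and a coordinator forms a consensus model weighted by each site's case count. All of the surrounding machinery (local training, secure transport, scheduling, differential privacy) is omitted.

```python
# A minimal sketch of federated averaging (FedAvg): sites ship only their
# locally trained model weights, and the coordinator combines them, weighting
# each site by how many examples it trained on.
import numpy as np

def federated_average(site_weights: list[dict[str, np.ndarray]],
                      site_sizes: list[int]) -> dict[str, np.ndarray]:
    total = sum(site_sizes)
    consensus = {}
    for name in site_weights[0]:
        # Weighted average of each parameter tensor across sites.
        consensus[name] = sum(
            (n / total) * w[name] for w, n in zip(site_weights, site_sizes)
        )
    return consensus

# Example: three hospitals contribute weights for the same layer; the site with
# the most cases has the largest influence on the consensus model.
sites = [{"fc.weight": np.full((2, 2), v)} for v in (0.0, 1.0, 2.0)]
print(federated_average(sites, site_sizes=[100, 300, 600])["fc.weight"])
```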

For all multi-site training approaches there are nevertheless serious challenges to be overcome, such as varying data formats, differing data quality, inconsistent annotation, onerous data network infrastructure setup, ensuring that legal and regulatory compliance is in effect across different geographies, and aligning incentives among organizations that sometimes view each other as competitors. These challenges are being actively addressed by academic collaborations, startup companies, and governmental initiatives around the world.

Furthermore, federated learning is not the sole technology for privacy-preserving collaborations (others include swarm computing40 and secure multi-party computation), but it appears to be well suited for real-world settings given the drawbacks of other techniques (e.g., high compute costs, high network requirements).

Despite its impressive performance on specific tasks, it is clear that AI is still in its infancy. Are the current mathematical/statistical tools good enough? They are, in the sense that they can do a good job with the data they are given. This question is entangled with issues of data quality and model requirements, and with how these influence performance of the model on never-before-seen cases (the generalizability issue). In this regard, apparently sensible preprocessing steps can obliterate relevant content from the training set. An example from chest X-ray AI: an overenthusiastic lung segmentation step that cuts out regions near the heart can completely obliterate relevant signs of disease.41 Similarly, highlighting only areas that contain neoplastic cells in a strongly supervised histology dataset might incorrectly exclude stroma and peritumoral tissue that could house the most prognostically significant features.
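A toy example of the peritumoral-tissue point: dilating a tumor mask by a fixed margin retains a band of surrounding stroma that a strictly tumor-only mask would exclude from feature extraction. The mask and margin below are entirely illustrative.

```python
# Toy illustration of the trade-off described above: restricting analysis to a
# tumor mask discards peritumoral tissue, whereas dilating the mask retains a
# band of surrounding stroma. Mask geometry and dilation radius are invented.
import numpy as np
from scipy.ndimage import binary_dilation

tumor_mask = np.zeros((100, 100), dtype=bool)
tumor_mask[40:60, 40:60] = True  # hypothetical tumor region on a tile

# Dilating the mask by a fixed margin keeps peritumoral tissue that a strictly
# tumor-only mask would exclude from feature extraction.
with_margin = binary_dilation(tumor_mask, iterations=10)
peritumoral_band = with_margin & ~tumor_mask
print("tumor pixels:", tumor_mask.sum(), "peritumoral pixels:", peritumoral_band.sum())
```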

One general problem, namely overfitting, can affect analyses generated after ingestion of potentially hundreds of uncorrelated data variables, which can include not only a plethora of image-based features but also ancillary clinical and demographic data. A common finding is that inclusion of increasing numbers of features initially boosts, but then depresses, achievable accuracy. If too many features are included, fitting to noise can be the outcome.42 Proceeding to commercial implementation typically involves the deployment of proprietary, non-disclosed algorithms. A recent article in Nature has addressed some of these issues, including data inadequacies and lack of transparency around the actual code base used.43
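A small synthetic demonstration of this effect: as uncorrelated noise features are added to a fixed-size training set, training accuracy stays high while cross-validated accuracy degrades. The data and model below are illustrative only.

```python
# Synthetic demonstration of overfitting as uncorrelated noise features are
# added: training accuracy stays high while cross-validated accuracy degrades.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 120
informative = rng.normal(size=(n, 5))
labels = (informative[:, :2].sum(axis=1) + 0.5 * rng.normal(size=n)) > 0

for n_noise in (0, 50, 200):
    noise = rng.normal(size=(n, n_noise))
    X = np.hstack([informative, noise])
    model = LogisticRegression(max_iter=2000)
    cv_acc = cross_val_score(model, X, labels, cv=5).mean()
    train_acc = model.fit(X, labels).score(X, labels)
    print(f"{n_noise:>3} noise features: train acc {train_acc:.2f}, CV acc {cv_acc:.2f}")
```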
