Evaluating Artificial Intelligence in Clinical Settings—Let Us Not Reinvent the Wheel


Introduction

The last two decades have seen rapid growth in artificial intelligence (AI) initiatives in health care settings, driven by the promises of improved treatment, quality, safety, and efficiency []. AI systems are computer algorithms that mimic aspects of human intelligence to perform tasks, and they have the potential to improve clinical decision-making. However, there is currently a lack of high-quality evidence of effectiveness, and overoptimism regarding AI-based technologies in health care [,]. Many existing algorithms and applications fail to scale and migrate across settings [], potentially leading to missed benefits or compromised patient safety.

Evidence from other sectors, such as finance and retail, may have limited applicability given the particular social, economic, technical, and legal challenges of health and social care settings []. Across the digital economy, AI has been successfully applied to historical data, for example, in financial forecasting [] or retail marketing, where personalized advertisements have transformed consumer behavior []. These methods are harder to deploy in the more complex and sensitive settings of health and social care []. This is largely because developers and implementers focus on tool development and do not sufficiently draw on existing work to inform the conception and design of technologies, their use and optimization, or organizational strategies to implement them.

Theory-informed approaches to evaluation can help to ensure that technologies are effectively validated, implemented, and adopted. They can also help to ensure that systems do not result in unintended negative consequences, such as inappropriate or suboptimal care, exacerbated inequities, or clinician burnout []. Theories seek to explain complex relationships at an abstract level and can help to integrate a particular implementation with the empirical evidence base. As such, theory-informed evaluation frameworks can enable learning from experience, thus guiding developers, implementers, and evaluators through development, implementation, and optimization []. Ideally, the real-world experience gathered during this process is then also used to inform the refinement of evaluation frameworks.

Despite significant investments, there are currently only a few examples of the use of AI-based systems in health care, and most systems are only beginning to be rolled out and embedded [-]. This is in contrast to the finance and retail sectors, where processes and products are standardized. To date, most activity has focused on diagnostic image-based systems and text or language processing, while complex precision medicine efforts are in very early stages of development. We therefore call for increased use of theory-informed approaches to evaluation to help ensure that developed systems are safe and effective and can be adopted, scaled, and sustained within settings of use. Until now, this has not been done consistently, which has resulted in limited learning, limited ability to transfer learning across settings, and limited clinician and patient reassurance. If done appropriately, the implications for clinical settings are significant, as validated new knowledge can be disseminated and shared. This, in turn, obviates the need to learn through experience, which can be painful, dangerous, and costly.

Unfortunately, despite increasing attention in research, the current application of theory-informed strategy and evaluation in AI practice is relatively limited in both health care and other sectors []. This may be due to a lack of understanding surrounding the theoretical literature (ie, why theories are useful in practice and how they may be used by different stakeholders), and the immediate focus of developers on demonstrating that technology works. Politically and managerially, there may be a drive to show modernization processes rather than to make clinical and organizational decisions based on evidence-based outcomes. Where theories have been applied, they have been driven by business approaches to value creation in organizations [], or by approaches designed to influence consumer behavior []. In these contexts, they have been strategically used to help address a particular stakeholder need (eg, how to maximize value through implementing AI in organizations and how to get consumers to accept AI technology). In health care, however, the range of stakeholders and associated needs varies significantly from other sectors. While managers and policymakers may focus on value and efficiency, patients are likely to be concerned about avoidable illness, and practitioners may focus on workloads and potential liability.

It is therefore often difficult to know what needs (and consequently what theory) to focus on and in what context. For example, while developers of technology now increasingly draw on cocreation with users to promote the adoption of AI, these approaches may not consider organizational drivers, workflow integration, multiplicity of stakeholders, or ethical considerations in implementation, thereby limiting the scalability of emerging applications.

Theory-informed approaches to evaluation in health care must be considered within their specific context, recognizing their relative positions and identifying which needs they address at various stages of the technology lifecycle. We aim to begin this journey by providing a conceptual overview of existing theory-informed frameworks that could usefully inform the development and implementation of AI-based technologies in health care. Despite some differences in technological properties and performance between AI- and non–AI-based technologies (Table 1) [], many existing frameworks are likely to be applicable.

Table 1. Differences between artificial intelligence (AI)–based and non–AI-based health IT.

Health services management
AI-based: AI can help in optimizing resource allocation, scheduling, and workflow management by analyzing large data sets and identifying patterns and trends (eg, modeling of waiting times and their underlying reasons).
Evidence: Limited evidence in relation to impact, mainly proof-of-concept [,].
Non–AI-based: Typically rely on manual processes and human decision-making for resource management, scheduling, and workflow optimization (eg, patient flow management applications).
Evidence: High potential of data-driven approaches to improve organizational performance [,].

Predictive medicine
AI-based: AI algorithms can analyze patient data, genetic information, and medical records to predict disease risks, treatment outcomes, and responses to therapies, enabling personalized medicine and targeted interventions.
Evidence: Many proof-of-concept studies but limited evidence in relation to how outputs are incorporated into clinical decision-making [,].
Non–AI-based: Rely on statistical analysis and clinical expertise to make predictions about disease risks, treatment outcomes, and responses to therapies.
Evidence: Many proof-of-concept studies but limited evidence in relation to how outputs are incorporated into clinical decision-making [,].

Clinical decision support systems
AI-based: AI to analyze large amounts of medical literature, patient data, and clinical guidelines to support clinical decision-making.
Evidence: Area of most focus, especially in imaging applications; AI has the potential to improve practitioner performance [,], but there is limited evidence surrounding organizational impacts or patient outcomes.
Non–AI-based: Rely on the expertise and experience of health care professionals, along with clinical guidelines and published research, to make clinical decisions.
Evidence: Demonstrated benefits for practitioner performance and patient outcomes in some areas of use (eg, drug-drug interactions) [,].

Laboratory and radiology information systems
AI-based: Use of AI to detect abnormalities and to enhance the accuracy of diagnoses.
Evidence: Most progress has been made in relation to imaging [,], but limited attention has been paid to integration with organizational practices, as above [].
Non–AI-based: Typically rely on visual inspection by health care professionals and manual analysis of patient data.
Evidence: Associated with information overload, but does take account of contextual factors.

Patient data repositories
AI-based: AI algorithms can process patient data to identify trends, patterns, and risk factors.
Evidence: Promising proof-of-concept studies, but limited implementation [,].
Non–AI-based: Patient data are stored in a centralized repository.
Evidence: Some evidence that digitized records and repositories can lead to improved quality, safety, and efficiency, but benefits are hard to assess and take a long time to materialize [,].

Population health management
AI-based: Precision prevention approaches to identify populations at risk and tailor preventative interventions.
Evidence: Promising approaches to precision prevention in specific cohorts, but limited implementation [,].
Non–AI-based: Understanding factors that influence health outcomes and developing tailored interventions.
Evidence: Significant evidence of population health interventions [,].

Patient portals
AI-based: AI-based symptom checkers and triage tools.
Evidence: Inconsistent evidence in relation to symptom checkers and triage tools, with concerns in relation to diagnostic accuracy [,].
Non–AI-based: Access to generic informational resources.
Evidence: Tailored informational resources can improve satisfaction, involvement, and decision-making [,].

Telehealth and telecare
AI-based: Online health assistants and chatbots.
Evidence: Mixed evidence of effectiveness, usability, and user satisfaction [,].
Non–AI-based: Access to generic informational resources.
Evidence: Tailored informational resources can improve satisfaction, involvement, and decision-making [,].

Health information exchange
AI-based: Extracting and converting unstructured or semistructured data into a standardized format.
Evidence: The use of free-text data is still in its infancy but is promising; there are limited data on integration with existing ways of working and organizational functioning [,].
Non–AI-based: Coding and transfer into standardized formats are often done by health care staff.
Evidence: Increased workloads for health care staff, and coding is often not done accurately [,].

Here, we provide a conceptual overview of existing theory-informed frameworks, focusing on practical examples and their potential application to AI-based technologies in health care []. Frameworks were selected as examples illustrating the extracted categories. This work is not intended to be exhaustive but to provide a pragmatic introduction to the topic for nonspecialists [,].

To categorize frameworks in a meaningful way, we focused on their potential area of application and the particular interest or focus of various stakeholder groups who may need to draw on existing experience to inform their current efforts to develop, implement, and optimize AI-based technologies in health care settings.


Health IT Evaluation Frameworks and Their Potential Application to AI

Overview

The 3 distinct dimensions identified are illustrated in Table 2, along with potential applications of AI-based technologies in health care and example use cases. These include frameworks with a technology, user, and organizational focus. We discuss each of these categories, the application of exemplary frameworks, and practical implications for various stakeholders in the paragraphs below.

However, it is important to recognize that the categorization of frameworks provided here is a simplification. Various frameworks have common and, in some instances, overlapping elements. The categories presented are intended to facilitate navigation and application.

Table 2. Examples of the focus of existing health IT evaluation frameworks and their potential application to artificial intelligence (AI).

Technology focus
Area of application: Informing the conception and design of technologies, to help AI system developers design a system that is usable and useful within intended use settings.
Example theoretical lens: Human-centered design.
Practical implications: Actively and iteratively involve end users in system design and development.
Stakeholders: End users and developers.
Example: A team had developed an algorithm to predict atrial fibrillation from electrocardiograms, but prospective users stated that the information would not change their practice [].

User focus
Area of application: Informing and helping to optimize the use of technologies, to help developers and implementers understand the various contexts of use of AI as well as unintended consequences, and tailor systems to maximize benefits and minimize harms.
Example theoretical lens: Sociotechnical systems.
Practical implications: Plan with users to effectively integrate the system in their work practices and monitor progress over time.
Stakeholders: End users and implementers.
Example: IBM Watson encountered adoption-related issues, including the usability and perceived usefulness of its oncology software, which eventually led to its abandonment. The system increased the workloads of doctors and made treatment recommendations that doctors viewed as unsafe [].

Organizational focus
Area of application: Informing organizational strategies to implement technologies, to help AI system implementers integrate AI safely within existing organizational structures and processes.
Example theoretical lens: Institutional theory.
Practical implications: Plan and monitor how systems and their outputs are integrated within and across organizational units and existing technological and social structures.
Stakeholders: End users, organizational stakeholders, and implementers.
Example: Babylon Health UK (an AI-based remote service provider) failed because it did not fit with existing health system financing structures and cultures. Many patients from outside the local area enrolled in the service, which meant that the product was not commercially viable for local organizations [].

Frameworks With a Technology Focus

Many current AI applications in health care settings have been developed by AI specialists in laboratory settings. Consequently, they have struggled to successfully translate into clinical settings and deliver the performance achieved in research trials []. Frameworks with a technology focus can help to inform the “conception and design” of technologies, thereby helping to ensure that AI system developers design a system that is readily implemented and useful within intended use settings. For instance, techniques such as technology assessment and requirements analysis can help to identify use cases, constraints, and requirements that the new technology needs to fulfill.

Frameworks include, for example, design and usability frameworks such as the Health IT Usability Evaluation Model (Health-ITUEM) for evaluating mobile health technology []. This includes assessment of subjective properties of the technology from the perspective of users, which have been shown to be crucial to user adoption of technology, but that developers may not necessarily consider as a priority during the development process, including ease of use and perceived usefulness.

Frameworks With a User Focus

While use is crucial for the successful development of AI-based technology, empirical work has shown that systems may be used in ways other than intended, which may in turn result in unanticipated threats to organizational functioning and patient safety []. For example, users may develop workarounds to compensate for usability issues of technologies, but these workarounds may compromise the intended performance of a system []. Frameworks that focus on the user of the technology can help to address these issues and facilitate the “optimization of technology use”. In doing so, they can help developers and implementers understand the various contexts of the use of AI-based technologies, as well as unintended consequences, and tailor systems to maximize benefits and minimize harms. For instance, a contextual analysis can help to gain a deep understanding of the various contexts in which a technology will be deployed. This includes examining cultural and social factors, as well as user behavior, user expectations, and existing systems or practices.

An example framework in this context is the Health Information Technology Evaluation Framework (HITREF), which includes an assessment of a technology’s impact on quality of care as well as an assessment of unintended consequences [].

Frameworks With an Organizational Focus

AI-based technologies are not adopted in a vacuum but must be integrated within organizational contexts. Previous work has shown that organizational strategies to implement health IT (HIT) and organizational cultures can have significant consequences for adoption and use []. For example, lack of integration with existing health information infrastructures can slow down system performance and impede practical use, and hence, impact adversely on safety and user experience []. Frameworks with an organizational focus can facilitate the development of “organizational strategies” to implement new technologies. In doing so, they can help AI system implementers integrate AI safely within existing organizational structures and processes. For instance, these can help to inform communication strategies, training programs, and support mechanisms to help users understand the benefits and risks of AI technologies and adapt to new roles and responsibilities.

An example of a framework with an organizational focus is the Safety Assurance Factors for Electronic Health Record Resilience (SAFER) guides, which help implementing organizations identify existing risks and facilitate the development of mitigation strategies to promote the effective integration of technologies within organizational processes [].


Discussion

A range of theory-informed evaluation frameworks for diverse kinds of HIT already exist []. Although not all of these may be relevant for AI-based applications, many aspects of existing frameworks are likely to apply. Exploring the transferability of these dimensions, therefore, needs to be a central component of work going forward [].

Existing frameworks examine various aspects of technology design, implementation, adoption, and optimization. On the most basic level, they can be distinguished according to their focus, which then influences their application and context of use. A simplified overview of selected HIT evaluation frameworks and their potential application to AI is shown in Table 2. Frameworks with a technology focus can help to inform the conception and design of technologies through actively and iteratively involving end users, bridging the gap between technology development and application. This can, in turn, mitigate risks around nonadoption due to a lack of need or actionable system outputs. Frameworks with a user focus can help to ensure that systems are effectively embedded within adoption contexts and thereby mitigate the risk of systems not being used, or not being used as intended. Finally, frameworks with an organizational focus can help to ensure that systems fit with existing organizational structures, and thereby help to ensure sustained use over time and across contexts.

We recommend that researchers, implementers, and strategic decision makers consider the use of existing theory-informed HIT evaluation frameworks before embarking on an AI-related initiative. This can help to mitigate emerging risks and maximize the chances of successful implementation, adoption, and scaling. To achieve this, existing and emerging guidelines for the evaluation of AI must promote the use of theory-informed evaluation frameworks.

Although many of the frameworks are well-known in the academic clinical informatics community, there is an urgent need to incorporate them into general AI design, implementation, and evaluation activities, as they can help to facilitate learning from experience and ensure building on the existing empirical evidence base. Unfortunately, this is currently not routinely done, perhaps reflecting disciplinary silos, leading to lessons being learned the hard way. This, in turn, potentially compromises the safety, quality, and sustainability of applications. For example, although AI applications in radiology are now becoming more established, the existing evidence base focuses on demonstrating effectiveness in proof-of-concept or specific clinical settings (the technology dimension in Table 2) []. Wider organizational and user factors are somewhat neglected, potentially threatening the wider sustainability and acceptability of such applications.

Conclusions

We aimed to provide a conceptual overview of existing theory-informed frameworks that could usefully inform the development and implementation of AI-based technologies in health care, and we identified several frameworks with technological, user, and organizational foci. Future research could involve conducting a systematic review based on this pragmatic overview to synthesize existing evidence across evaluation frameworks, spanning the dimensions of technology, user, and organization.

Evaluation of AI-based systems needs to be based on theoretically informed empirical studies in contexts of implementation or use to ensure objectivity and rigor in establishing benefits and mitigating risks. This will ensure that systems are based on relevant and transferable evidence and can be implemented safely and effectively. Theory-based HIT evaluation frameworks should be integrated into existing and emerging guidelines for the evaluation of AI [-]. The examples of frameworks provided could also help to stimulate the development of other related frameworks that can guide further evaluation efforts.

Drawing effectively on theory-based HIT evaluation frameworks will help to strengthen the evidence-based implementation of AI systems in health care and help to refine and tailor existing theoretical approaches to AI-based HIT. Learning from the wealth of existing HIT evaluation experience will help patients, professionals, and wider health care systems.

The authors are members of the International Medical Informatics Association Working Group on Technology Assessment and Quality Development and the European Federation for Medical Informatics Working Group on Evaluation. This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Data sharing is not applicable to this article as no data sets were generated or analyzed during this study.

KC led on drafting of the manuscript and all authors (NDK, FM, RW, MR, MP, PK, ZSW, PS, CKC, AG, SM, JBM, and EA) critically commented on various iterations.

None declared.

Edited by T Leung; submitted 10.02.23; peer-reviewed by P Aovare, M Yusof, D Chrimes; comments to author 14.04.23; revised version received 20.04.23; accepted 02.03.24; published 07.08.24.

©Kathrin Cresswell, Nicolette de Keizer, Farah Magrabi, Robin Williams, Michael Rigby, Mirela Prgomet, Polina Kukhareva, Zoie Shui-Yee Wong, Philip Scott, Catherine K Craven, Andrew Georgiou, Stephanie Medlock, Jytte Brender McNair, Elske Ammenwerth. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 07.08.2024.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
