Large Language Models and User Trust: Consequences of the Self-Referential Learning Loop and the Deskilling of Health Care Professionals


Introduction

Overview

The integration of existing artificial intelligence (AI) models into health care—a field where trust in AI is crucial due to the significant impact of decision-making—is still a work in progress []. At the same time, efforts to develop standardized protocols for the deployment of AI in health care are underway but have not yet reached completion []. This endeavor is critical for ensuring AI’s safe and effective use in health care settings. Additionally, the challenge of evaluating AI in health care is exacerbated by a lack of comprehensive and standardized metrics []. Researchers and policymakers are actively working to address this void by creating robust evaluation frameworks that could be applied universally. The regulatory landscape has focused on policies around ethical considerations, data privacy, transparency, and patient safety, alongside frameworks that hold AI systems and their developers accountable for the outcomes of their use in patient care [].

Advent of Generative AI—Large Language Models in Health Care

Despite these ongoing challenges and developments, generative AI tools such as large language models (LLMs) are already being deployed in the public sphere [,] and are used by health care workers, researchers, and the public for a variety of health care–related tasks. Although LLMs have shown promise in medical assessments [-], scientific writing, eHealth care, and patient classification [-], their integration marks a paradigm shift that introduces new AI complexities [-]. Their rapid and early adoption highlights the critical need for continued discourse to ensure the safe and effective integration of LLMs into health care. Additionally, LLM characteristics such as stochasticity, emergent indeterminacy, and lack of consciousness reinforce the need for caution.

One fundamental aspect of LLMs that prompts special attention is their stochastic paradigm: these models operate based on probabilities and randomness, allowing them to generate varied outputs for a given input. This introduces a level of indeterminacy and unpredictability. LLMs can produce different responses under seemingly similar conditions, complicating their reliability. Such behavior can lead to unexpected results, which, while sometimes beneficial for generating creative solutions or insights, can also pose risks in critical domains like health care, where accuracy and predictability are paramount.
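To make this concrete, the following is a minimal, illustrative sketch of temperature-based sampling, one common mechanism by which LLMs introduce randomness when choosing the next token. The vocabulary, logit values, and temperature settings are hypothetical and serve only to show how the same input can yield different outputs across runs.

```python
import numpy as np

# Toy illustration (not a real model): the same input context yields a fixed set
# of logits, yet sampling with a nonzero temperature produces different "next
# tokens" on different runs, mirroring the stochastic behavior described above.
vocab = ["stable", "improving", "deteriorating", "unchanged"]
logits = np.array([2.0, 1.5, 1.4, 0.3])  # hypothetical model scores for the next word


def sample_next(logits, temperature=0.8, rng=None):
    rng = rng or np.random.default_rng()
    scaled = logits / temperature             # lower temperature -> more deterministic
    probs = np.exp(scaled - scaled.max())     # numerically stable softmax
    probs /= probs.sum()
    return vocab[rng.choice(len(vocab), p=probs)]


# Identical prompt, repeated sampling: outputs can vary from run to run.
print([sample_next(logits) for _ in range(5)])
# As temperature approaches 0, the output collapses to the single most likely token.
print([sample_next(logits, temperature=0.05) for _ in range(5)])
```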

Another critical risk characteristic of LLMs is their lack of inherent understanding of the content they parse and generate. Despite their ability to produce human-like text, LLMs do not possess consciousness, comprehension, or the ability to discern the truthfulness of their outputs. In other words, LLMs might generate plausible but incorrect content, presenting significant challenges in contexts where the veracity and relevance of information are critical [,].

Approaching AI integration in health care with a critical mindset is important. It is crucial for users to have a clear understanding of a technology’s actual performance, distinguishing it from the exaggerated expectations set by media hype. These risks underscore the importance of asking: are we and our health care system ready to integrate LLMs? If so, are policies in place that explicitly state in what capacity LLMs can be used to reduce clinical workload before their dissemination?

Objective

In this paper, we conceptually investigate the dynamics between clinicians’ growing trust in LLMs, the evolving sources of training data, and the resultant implications for both clinician competency and LLM performance over time. Our discussion highlights a potential feedback loop where LLMs, increasingly trained on narrower data sets dominated by their own outputs, may experience a decline in output quality coinciding with a reduction in user skills. While these phenomena are not yet fully realized, they represent anticipated challenges that coincide with the deeper integration of LLMs into the health care domain. We call for preemptive, focused dialogues concerning the integration of LLMs in medical settings, underscoring the importance of maintaining patient safety and the standard of care.

Presently, LLMs are developing at an accelerated pace, heavily reliant on human-generated data sets that are integral to their accuracy and to the consequent trust placed in them, particularly in the health care sector. This burgeoning dependency, although seemingly beneficial in terms of efficiency and productivity, may lead to an unintended erosion of clinician skills due to the habitual delegation of tasks to AI—as noted in the academic context [,]. This trend raises the possibility of an overreliance on LLM outputs, potentially diminishing the variety and depth of human insights within these models. The risk is a self-perpetuating cycle in which LLMs, learning mostly from their own creations, could see a degradation in their effectiveness and a narrowing of the breadth of human knowledge they were designed to emulate. Such an outcome would be counterproductive, possibly leading to a decline in both LLM effectiveness and human expertise.

Figure 1 illustrates our core arguments. The first panel presents a timeline showing an inverse correlation between clinicians’ escalating trust in AI and the preservation of clinical skills over successive time points (T1 to Tn), signaling an increase in AI reliance and a decrease in skill retention. The middle panel demonstrates the shift in training data for LLMs from predominantly human-generated to a growing proportion of AI-generated data, which in turn affects LLM performance and contributes to the feedback loop. The final panel plots LLM accuracy against time, displaying an initial increase as LLMs leverage a mix of data sources. However, upon reaching a tipping point—marked as the self-referential zone—accuracy declines in tandem with the onset of the user deskilling zone, emphasizing the dilemma of increased AI reliance degrading user capabilities. We underscore the need for strategic measures to address these impending challenges in the health care sector.

Figure 1. The dynamics of user skills, trust, data, and large language models. AI: artificial intelligence; LLM: large language model.

User Expertise and Trust in LLMs

Overview

User trust in LLMs is deeply intertwined with the individual’s subject matter expertise and their willingness to engage critically with AI outcomes. Expert users, with a robust understanding of their domain, are more likely to approach LLMs with a discerning mindset and a preparedness to review and validate their suggestions. Thus, trust in LLMs can be seen as a spectrum influenced by the user’s expertise and the effort they are willing to invest in ensuring the accuracy of the outcomes.

User Expertise: Ability to Detect Errors in LLMs

The use of LLMs presents a range of possibilities and challenges that vary depending on the user’s expertise and intent, delineating into 2 primary user categories—subject matter experts and those seeking assistance due to a lack of knowledge.

Subject matter experts (eg, doctors) may use LLMs to handle routine, time-consuming tasks, enabling them to allocate more time to complex or urgent issues, such as seeking a second opinion on complex medical diagnoses or performing patient triage. They have the advantage of being able to critically evaluate the LLM’s output, verify its accuracy (ie, its deviation from clinical standards), and make necessary corrections. The expertise of such users acts as a safeguard against potential errors, ensuring that the AI’s assistance enhances productivity without introducing risk.

On the other hand, individuals who turn to LLMs due to a lack of expertise in a particular area face a different set of challenges []. The ability of LLMs to generate fake but persuasive responses further exacerbates these risks, making users vulnerable to accepting erroneous information as fact []. For instance, a general practitioner faced with a dermatological case, such as an atypical presentation of psoriasis, can use an LLM to access detailed diagnostic criteria and treatment protocols. This capability can significantly assist in the management of the patient, particularly when the LLM’s suggestions are accurate and relevant. However, the inherent risk of LLMs generating incorrect suggestions cannot be overlooked. Such inaccuracies pose a heightened risk to patient safety, especially in scenarios where the clinician may lack the specialized dermatological knowledge required to critically evaluate the validity of the LLM’s output []—constituting an environment where trust becomes critical.

The crux of the problem lies in the user’s ability to verify the accuracy and relevance of the AI-generated content. The pivotal consideration, however, is whether the verification of LLM outcomes by health care staff negates the purported reduction in workload. If health care professionals are required to meticulously check each AI-generated output for accuracy, the time saved through automation may be offset by the time spent on verification. Maintaining the balance between productivity and accuracy is pivotal. For instance, LLMs can analyze vast data sets to identify patterns or treatment outcomes that may not be immediately apparent to human clinicians, thereby offering insights that can lead to more accurate diagnoses and personalized treatment plans. This capability, even if it requires additional time for verification of AI-generated recommendations, may be deemed a worthy trade-off for reducing long-term health care costs and improving care quality. However, this trade-off must be carefully managed to ensure that the pursuit of improved health outcomes does not lead to unsustainable decreases in productivity. Excessive time spent verifying AI recommendations could strain health care resources, leading to longer patient wait times and potentially overburdening health care staff.
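As a back-of-the-envelope illustration of this trade-off, the sketch below computes the net clinician time saved per task once verification is accounted for. All time values and task counts are hypothetical and serve only to show how growing verification effort can erase the workload benefit.

```python
# Back-of-the-envelope sketch of the verification trade-off described above.
# All values are hypothetical (minutes per task) and purely illustrative.
t_manual = 12.0      # clinician completing the task unaided
t_with_llm = 3.0     # clinician time when the LLM drafts the output
t_verify = 6.0       # clinician time spent checking the LLM output
tasks_per_day = 20

net_saving_per_task = t_manual - (t_with_llm + t_verify)
daily_saving = net_saving_per_task * tasks_per_day

print(f"Net time saved per task: {net_saving_per_task:.1f} min")
print(f"Net time saved per day:  {daily_saving:.0f} min")
# If verification time grows (eg, for high-stakes or low-trust outputs) so that
# t_with_llm + t_verify >= t_manual, the purported workload reduction disappears.
```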

To navigate this trade-off, health care systems might adopt strategies such as targeted use of LLMs in high-impact areas where they are most likely to enhance outcomes and the development of systems that prioritize clarity and actionability in their recommendations to minimize verification time. By carefully weighing the benefits of improved patient outcomes against the costs in terms of productivity, health care providers can make informed decisions about how best to integrate LLMs into their practices, ensuring that these technologies serve to enhance rather than hinder the delivery of patient care.

User Trust: Willingness to Review LLM Output

Trust in user engagement with LLMs, particularly in health care, is a multifaceted construct influenced by sociotechnical and psychological factors. We acknowledge that user trust in LLMs in health care can substantially depend on the context. Depending on the stakes (risk), the level of trust required may differ; for instance, LLMs used for diagnosis and treatment recommendations necessitate a higher trust level compared with applications for patient note summarization. Additionally, the degree of autonomy granted to the LLMs and the extent of clinical oversight are crucial determinants of trust.

Clinicians bring their own norms and expectations to the evaluation of trust in these systems, further complicating the landscape. Individual and cultural perspectives on risk tolerance and acceptance also play pivotal roles. Together, these factors create a complex environment where trust in LLMs is dynamic, varying according to the specific context of use and the interplay of diverse elements. In this section, we focus on user willingness to scrutinize LLM output as a precursor to trust.

A user may have the ability and necessary expertise but may not be willing to review LLM-generated outcomes due to factors including prior trust in the technology or biases. A doctor with high trust (blind trust) in the LLM might be more inclined to accept its suggestions without extensive further verification [,], exhibiting automation bias []. Automation bias, particularly in the context of clinicians’ interactions with LLMs, can manifest when clinicians exhibit an undue level of trust in these systems based on past experiences of accuracy and reliability.

Blind trust in LLMs can introduce 2 critical cognitive biases, precautionary [] and confirmation bias [], both of which alter clinician behavior in the presence of agreement or disagreement between human judgment and LLM outputs. When LLM recommendations align with a clinician’s initial diagnosis or treatment plan (agreement), confirmation bias can be reinforced. Clinicians may overlook or undervalue subsequent information that contradicts the LLM-supported decision, even if this new information is critical to patient care. This confirmation bias can lead to a narrowed diagnostic vision, where alternative diagnoses or treatments are not sufficiently considered. Conversely, in cases where there is a disagreement, precautionary bias can occur. The clinician, having developed a reliance on the LLM due to positive past experiences, might doubt their own expertise and perceive the LLM to be the safer alternative for decision-making. Such problems associated with blind trust might persist unchallenged until a point of failure or harm, which can have serious implications in health care.

Future Risk Considerations

Overview

As we delve deeper into the dynamics between technology and human expertise, the concepts of the LLM Paradox of Self-Referential Loop and the Risk of Deskilling emerge as pivotal to our discourse. Figure 1 not only illustrates the projected trajectory of clinician reliance on LLMs but also hints at the potentially cyclic nature of knowledge and skills within the health care industry. Concurrently, the risk of deskilling looms on the horizon, particularly for upcoming generations of health care professionals who might become overly reliant on LLMs, possibly at the expense of their diagnostic acumen and critical thinking abilities. This section explores these challenges and the strategies needed to mitigate them. Additionally, this section discusses the LLM accountability concern.

LLM Paradox of Self-Referential Loop (Learning From Itself)

In a scenario where LLMs become widely adopted in the health care industry for tasks like paper writing, educational material creation, clinical text summarization, and risk identification, the possibility of a self-referential loop does emerge as a significant concern. This paradox occurs when AI-generated human-like content becomes so widespread that the AI begins to reference its own generated content, potentially leading to an echo chamber effect where original, human-generated insights become diluted or harder to distinguish from AI-generated content. While this problem of a self-referential loop in AI-generated content, particularly in the health care industry, has not yet materialized, it represents a likely challenge as generative AI continues to proliferate. The consequence of a self-referential loop in LLMs can lead to several problematic outcomes, including the propagation of biases [], increased homogeneity in generated data, and ultimately, hindered performance. AI systems learn from the data they are fed, and if these data include biases, the AI is likely to replicate and even amplify these biases in its outputs []. In a self-referential loop, the problem becomes compounded. As the AI references its own biased outputs to generate new content, these biases can become more entrenched, making them harder to identify and correct.
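The following toy simulation, under the simplifying assumption that the "model" is just a Gaussian distribution refit to its own samples at each generation, sketches how such a self-referential loop can increase homogeneity: the spread of the generated data tends to shrink and its center tends to drift away from the original human-derived distribution. It is an illustrative analogue, not a claim about any specific LLM.

```python
import numpy as np

# Toy illustration of the self-referential loop described above: a "model"
# (here, a Gaussian fit) is repeatedly retrained on samples drawn from its own
# previous generation instead of fresh human-generated data. Finite-sample
# refitting tends, in expectation, to shrink the spread (outputs grow more
# homogeneous) and to let the mean drift from the original distribution.
rng = np.random.default_rng(0)

mu, sigma = 0.0, 1.0         # generation 0: fitted to "human-generated" data
samples_per_generation = 20  # small corpora make the effect visible quickly

for generation in range(1, 51):
    synthetic = rng.normal(mu, sigma, samples_per_generation)  # model's own outputs
    mu, sigma = synthetic.mean(), synthetic.std()              # refit on those outputs
    if generation % 10 == 0:
        print(f"generation {generation:2d}: mean={mu:+.3f}, std={sigma:.3f}")
```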

The issue of self-referential loops and the potential degradation of information quality are indeed significant concerns; however, when LLMs, such as the Medical Pathways Language Model (Med-PaLM) [], are specifically fine-tuned and tailored for health care applications, the severity of these issues can be mitigated through stringent quality assurance measures. This approach reduces the risk associated with the indiscriminate use of a broader corpus that may contain inaccuracies, outdated information, or irrelevant content. Despite these precautions, the risk of self-referential loops in health care contexts can shift toward a different concern: the reinforcement and entrenchment of specific clinical approaches and schools of thought. This occurs as a reflection of the biases present in the curated data sets, which are inherently influenced by the prevailing medical practices, research focus, and therapeutic approaches at the time of data collection.

Addressing this challenge requires a nuanced approach to developing and integrating LLM technologies into societal frameworks. It involves fostering a symbiotic relationship between human intellect and LLM capabilities, ensuring that AI serves as a tool for augmenting human intellect rather than replacing it. Strategies for maintaining the diversity and quality of training data, including the deliberate inclusion of varied and novel human-generated content, will be critical.

Risk of Deskilling

As individuals come to rely more on LLMs for routine tasks, such as the synthesis of patient information or the interpretation of medical data, there is a possibility that their skills in these critical areas may diminish over time due to reduced practice []. This situation is compounded by the AI’s ability to quickly furnish answers to medical inquiries, which might decrease the motivation for in-depth research and learning, consequently affecting the professionals’ knowledge depth and critical thinking capabilities.

It is crucial to note that the discussion here does not assert that LLMs will definitively lead to the deskilling of current practitioners in the health care sector. These professionals have developed their expertise through extensive experience and rigorous academic training, establishing a solid foundation that is not readily compromised by the integration of AI tools. Instead, the concern is more pronounced for the next generation of health care professionals, particularly medical students, who might increasingly use AI for educational tasks and learning activities; overdelegating such tasks to AI could attenuate the development of critical analytical skills and a comprehensive understanding of medical concepts, traditionally cultivated through deep engagement with the material [,]. The critical question emerges: “Will the ease of generating content with AI stifle the development of creativity and critical thinking in younger generations accustomed to technology providing immediate solutions?”

If future generations of clinicians grow accustomed to AI doing the bulk of diagnostic review and analysis, there is a risk that their own diagnostic skills might not develop as fully. More critically, should they be required to review patient charts manually—due to AI failures—they may find the task daunting or lack the detailed insight that manual review processes help to cultivate. The crux of the issue lies in ensuring that reliance on technology does not come at the expense of fundamental skills and knowledge. The challenge is to ensure that the deployment of AI technologies complements human abilities without diminishing the need for critical thinking, reasoning, and creativity.

What is needed is to adapt to the paradigm shift—failing to do so can adversely impact the health care industry. A dual focus on harnessing AI capabilities while enhancing unique human skills is pivotal for advancing patient care in the modern medical landscape. The advent of human-AI collaboration in health care prompts a shift in the skill set emphasized within medical disciplines. This transformation accentuates the value of unique human skills—such as problem-solving, critical thinking, creativity, and fostering patient rapport—over the traditional reliance on memory and knowledge-based tasks. As LLMs undertake roles in diagnostic assistance, literature synthesis, and treatment optimization, the medical profession should evolve to leverage AI for data-driven insights while prioritizing human-centric skills for patient care. This paradigm shift underscores the growing importance of critical engagement with AI outputs, necessitating that medical professionals adeptly interpret and apply AI-generated information within the complex context of individual patient needs.

LLM Accountability

Overview

The integration of LLMs in health care introduces medicolegal challenges concerning the allocation and apportionment of liability for outcomes, particularly in instances of negligent diagnoses and treatment. The complexity arises from the interaction between clinicians, health care institutions, and AI providers, each contributing differently to the health care delivery process.

Legal Framework and Liability Allocation

In the legal domain, traditional frameworks for medical liability often center on direct human actions, with established principles guiding negligence and malpractice claims. The introduction of LLMs used for diagnostic support or task delegation complicates these frameworks. Clinicians, operating at the interface of LLM recommendations and patient care, are generally seen as the final decision-makers, thus bearing the primary moral and legal responsibility for the outcomes of those decisions. This perspective is grounded in the principle that clinicians must integrate LLM outputs into a broader clinical judgment context, considering patient-specific factors and adhering to professional standards.

Shared Liability and AI Providers

However, the role of LLM providers in developing, deploying, and maintaining LLMs introduces questions about shared liability, especially when system errors or deficiencies contribute to adverse outcomes. Determining the extent of LLM provider liability hinges on factors such as the accuracy of the LLM’s training data, transparency regarding the system’s capabilities and limitations, and the adequacy of user training and support provided.

Institutional Responsibility

Health care institutions also play a critical role in mediating the use of LLMs, responsible for ensuring that these systems are integrated into clinical workflows in a manner that upholds patient safety and complies with regulatory standards. Institutional policies and practices, including the selection of AI tools, clinician training, and oversight mechanisms, are pivotal in mitigating risks associated with LLM use.

Algorithmic Accountability Act of 2023

The Algorithmic Accountability Act of 2023 and Artificial Intelligence Accountability Act [,] represent a critical legislative step toward ensuring the responsible use of algorithms. The act calls for the creation of standardized procedures and assessment frameworks to evaluate the effectiveness and consequences of these systems, reflecting an understanding of the complex ethical and regulatory challenges posed by AI in decision-making processes, particularly in health care. The act is in dialogue with the wider conversation on the ethics of AI, advocating for an approach that emphasizes response-ability—the capacity to respond ethically to the challenges posed by algorithmic decision-making. This perspective is crucial for developing impact assessments and frameworks aimed at promoting fairness and preventing discriminatory practices within algorithmic systems.

The implications of this act for the integration of LLMs in health care are profound, and ensuring transparency in LLMs can further enhance trust in these systems. Transparency can allow clinicians to verify errors and review outputs effectively. For example, an LLM providing a diagnostic suggestion would detail the medical literature and patient data informing its analysis, enhancing clinician trust by making the AI’s reasoning processes visible and understandable. This transparency combats algorithmic deference by encouraging health care professionals to critically assess LLM outputs against their expertise and patient-specific contexts. Moreover, transparency reduces the perceived infallibility of LLMs by highlighting their reliance on input data quality and inherent limitations, promoting a balanced use of LLMs as supportive tools in patient care.
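As a hypothetical illustration of such transparency, the sketch below packages a diagnostic suggestion together with the evidence the model claims to rely on, so that a clinician can trace and verify each element before acting on it. The field names and values are invented for illustration and do not reflect any vendor’s actual output schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical structure (not any vendor's actual schema) for packaging an LLM
# diagnostic suggestion together with the evidence behind it, so a clinician
# can trace and verify the reasoning rather than defer to an opaque answer.

@dataclass
class EvidenceItem:
    source: str   # eg, a guideline, journal article, or chart element
    excerpt: str  # the passage or finding the model relied on

@dataclass
class DiagnosticSuggestion:
    condition: str
    confidence: float                      # model-reported, not a calibrated probability
    supporting_evidence: List[EvidenceItem] = field(default_factory=list)
    limitations: str = ""                  # known gaps, eg, missing labs or atypical presentation

suggestion = DiagnosticSuggestion(
    condition="Plaque psoriasis (atypical presentation)",
    confidence=0.72,
    supporting_evidence=[
        EvidenceItem(source="Patient chart: dermatology note",
                     excerpt="well-demarcated erythematous plaques"),
        EvidenceItem(source="Clinical guideline (illustrative)",
                     excerpt="first-line topical therapy criteria"),
    ],
    limitations="No biopsy result available; differential includes eczema.",
)

# A clinician-facing view would surface each evidence item for review.
for item in suggestion.supporting_evidence:
    print(f"- {item.source}: {item.excerpt}")
```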

Conclusions

It is important to acknowledge that the performance of LLMs like ChatGPT (OpenAI) today does not guarantee their performance tomorrow. LLMs have the potential to be a substantial boon to the health care industry, offering to streamline workflows, enhance the accuracy of patient data processing, and even support diagnostic and treatment planning processes. Their value, however, is contingent upon a systematic and informed integration into health care systems. Recognizing that LLMs, like any technology, are fallible is crucial to their successful adoption. Their performance is temporal and will change as new data are fed to their algorithms. This acknowledgment underpins the necessity for robust oversight mechanisms, ongoing evaluation of AI-driven outputs for accuracy and relevance, and clear guidelines on their role as assistive tools rather than stand-alone decision-makers.

A thoughtful, deliberate approach to integrating generative AI into health care can mitigate risks associated with overreliance and deskilling, ensuring that it complements rather than compromises the quality of care. By leveraging AI’s strengths and compensating for its limitations through human oversight, health care can harness the benefits of this technology to improve outcomes, enhance patient care, and support health care professionals in their vital work. Thus, the path forward involves embracing generative AI’s potential while remaining vigilant about its limitations, ensuring that its integration enhances rather than diminishes the human element in health care.

This study is not funded by any internal or external agency.

None declared.

Edited by T de Azevedo Cardoso, G Eysenbach; submitted 25.01.24; peer-reviewed by M Saremi, D Hua; comments to author 08.03.24; revised version received 12.03.24; accepted 20.03.24; published 25.04.24.

©Avishek Choudhury, Zaira Chaudhry. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 25.04.2024.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
