Effects of interacting with a large language model compared with a human coach on the clinical diagnostic process and outcomes among fourth-year medical students: study protocol for a prospective, randomised experiment using patient vignettes

Introduction

Medical diagnostic errors, defined as wrong, delayed or missed diagnoses, pose a serious threat to quality of care and patient safety, affecting 5%–15% of patients who present to healthcare systems.1–3 In the 2015 landmark report ‘Improving Diagnosis in Health Care’, the US National Academy of Medicine warned that ‘most people will experience at least one diagnostic error in their lifetime, sometimes with devastating consequences’.4 Importantly, 84% of harmful diagnostic errors are preventable, yet diagnostic errors carry higher mortality than other types of error (29% vs 7%).5 6 In a systematic review of malpractice claims worldwide, diagnostic errors were the most common and most expensive type of claim, accounting for 26%–63% of all cases.7 Consequently, there is an urgent need to improve diagnostic decision-making in healthcare.

In recent years, specialised computerised diagnostic decision support systems such as differential diagnosis generators have been developed, showing the potential to improve the quality of diagnoses.8 Additionally, since large language models (LLMs) based on generative pre-trained transformer (GPT) methodology have been widely disseminated, applications such as ChatGPT (OpenAI) have raised hopes that such tools will become a valuable asset for (medical) education,9 10 as well as for consultation and clinical decision support.11–15 Recently, researchers have endeavoured to explore ChatGPT’s potential and limitations in the healthcare domain, testing its medical proficiency. Across countries, they have demonstrated its ability to successfully pass medical licensing exams,9 10 16 17 which may render ChatGPT-based chatbots a particularly useful resource for junior physicians. Thus, by leveraging their broad medical knowledge base, their capacity to engage in open-ended, natural conversations and their ability to process complex (patient) data, ChatGPT-based chatbots have the potential to augment diagnostic decision-making processes18 and assist learners in medical education settings.10

However, the novelty of LLMs in diagnostic decision-making introduces uncertainties regarding their impact. Clinicians unfamiliar with using LLMs in their professional context may rely on general positive or negative attitudes towards artificial intelligence (AI), potentially hindering thoughtful use and critical evaluation of their input, leading to either over-reliance and lack of critical thinking or the neglect of AI’s potential.19–23 It is, therefore, imperative to comprehensively explore the extent, application and constraints of LLMs in clinical decision support to guarantee their conscientious and efficient implementation in practice.12 18 24 25 To address these concerns, this prospective, randomised controlled clinical vignette study examines the influence of decision support using an LLM (ChatGPT) on the diagnostic process and outcomes compared with that of a human coach. This will advance the understanding of how human–AI collaboration can be leveraged to enhance diagnostic decision-making.

Leveraging AI for enhanced diagnostic decision-making

What makes an LLM such as ChatGPT a potentially useful coach during the diagnostic journey? In their review of recent literature on ChatGPT in clinical decision support, Ferdush et al 18 listed a number of relevant attributes: (a) LLMs can analyse patient data while taking relevant clinical guidelines into account, understand complex medical information and aid in data interpretation; using patterns identified in patient data, LLMs can propose relevant differential diagnoses with high accuracy,26 potentially counteracting premature closure.27 (b) Thanks to their vast knowledge base of similar cases reported in the medical literature, LLMs can remind professionals of rare or complex diseases that are typically at risk of being overlooked. (c) LLMs possess pertinent knowledge spanning multiple medical specialties and healthcare settings, making them a useful resource in any specialty and allowing the integration of information from different medical domains. (d) With LLMs, healthcare professionals can access clinical guidelines and best practices in real time and from one source, which supports them in making informed decisions.18 Last, (e) LLMs may take on the role of advisors,28 29 (peer) coaches or teachers30 31 who guide learners through the diagnostic process by reminding them of important steps to take or differential diagnoses to consider.

There are also potential drawbacks to consider in the context of diagnostic decision-making: (a) LLMs have been observed to occasionally miss relevant patient information, exhibit hallucinations (ie, confident yet wrong responses), display biases stemming from biased training data (eg, due to under-representation of certain demographics) and show limited contextual understanding.18 (b) Further, there is the fear that over-reliance on LLMs may lead to reduced learning opportunities11 and deskilling, and hence an increased risk of diagnostic errors in the long run. Last and contrary to this, (c) clinicians may dismiss insights provided by LLMs, as they tend to overlook the support offered by computerised diagnostic decision support systems.22

Thus, given the novelty of LLMs and the lack of experience with using GPTs in the diagnostic process and in medical education, a deeper exploration of the benefits, limitations and possible applications of LLMs for medical diagnosis and education is warranted. Our study, therefore, aims to (a) investigate the effects of an LLM (ChatGPT) on the diagnostic process, accuracy, number of diagnostic hypotheses and user confidence and (b) explore how the LLM is used during diagnosis. As LLMs generate human-like text responses in conversational settings, we compare assistance from ChatGPT with assistance from a more experienced human coach, the usual resource for junior physicians in medical education settings.32

The role of the hypothesis space for diagnostic error

Of the multiple reasons for diagnostic error (such as technical failures or poorly cooperating patients), cognitive factors such as faulty information synthesis most frequently contribute to diagnostic error.6 33 To illustrate, the largest study of diagnostic error malpractice claims found that 89% involved failures in clinical reasoning.34

Decades of research into clinical reasoning, diagnostic decision-making or one of its many synonyms provide some insights into possible causes and remedies of diagnostic error.27 It is now well established that clinicians generate diagnostic hypotheses within minutes of an encounter with a patient,35 36 sometimes even much faster.37 These initial hypotheses are of paramount importance for the accuracy of the final diagnosis because clinicians rarely add further hypotheses to those considered early on.35 This is an important point because—in contrast to the process of scientific inquiry—physicians tend to conduct diagnostic tests that confirm their initial hypothesis rather than potentially refuting it.35 38 Furthermore, they distort incoming additional findings in favour of the initial idea.39 40 What distinguishes expert diagnosticians from novices is not that they generate hypotheses faster or in greater numbers, but that their initial hypotheses are better.41 42 This understanding of the importance of the initial hypothesis for the accuracy of the final diagnosis aligns well with the observation that the most commonly observed biases in clinical reasoning—availability bias, confirmation bias, satisfaction of search and premature closure27 43–47—all relate to the space of initially considered differential diagnoses.

Given that broadening the differential diagnoses can mitigate diagnostic errors,48–51 it appears imperative to raise awareness among diagnosticians about this possibility. Furthermore, the quality of LLM output and advice is sensitive to the formulation of inquiries.52 53 Therefore, providing a single, brief training instruction that offers a rationale for expanding the hypothesis space in diagnostic decision-making, along with practical illustrations of how to effectively elicit information from the coach (whether human or ChatGPT), will likely enhance the coaches’ impact. We expect this training to improve participants’ reasoning and their ability to leverage the coach’s assistance, leading to better diagnostic outcomes such as a greater number and relevance of diagnostic hypotheses and higher accuracy of the final diagnosis. Consequently, we will examine the impact of instructional training (training vs no training) along with human versus AI assistance. We aim to provide insights that elucidate the necessary guidance for the effective use of LLMs in diagnostic decision-making.

Methods and analysis

This study seeks to elucidate the differential (or analogous) use patterns between users of ChatGPT and those using a human coach in the context of diagnostic decision-making, along with their respective impacts on the diagnostic process and outcomes as well as user confidence. There is also significant practical interest in examining whether ChatGPT exhibits a more pronounced beneficial effect on diagnostic accuracy and the quantity of differential diagnoses considered, potentially attributable to its heightened computational capabilities.12 Additionally, we seek to assess whether brief instructional training emphasising the importance of expanding the hypothesis space augments these effects. To achieve this, our primary focus is on modelling the dependent variables diagnostic accuracy and number of generated differential diagnoses using linear mixed-effects models54 in R.55

We have been collecting data during an online experiment with medical students at the Charité Medical School in Berlin. Students have been invited to participate via mailing lists in exchange for financial remuneration (€35 per participant). Data collection began on 22 April 2024 and is planned to last until the end of June 2024. The study is a randomised, single-blind experiment with a 2×2 factorial design, with the source of assistance (human coach vs ChatGPT) and training (training vs no training) as between-subjects factors (see figure 1). Participants are randomly assigned to the type of assistance they receive and to the training/no training condition.

Figure 1 Study design. AI, artificial intelligence; ChatGPT, OpenAI’s generative pre-trained transformer; LLMs, large language models; R, randomisation.

Sample size

A sample size of N=158 was determined using G*Power V.3.1.9.756 for a 2×2 analysis of variance (ANOVA) to detect a practically relevant medium effect size with α=0.05 and a power (1−β) of 0.80. Participants are randomly assigned to the four subgroups in approximately equal numbers.
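As a rough plausibility check, a comparable calculation can be approximated in R with the pwr package. The exact G*Power inputs are not restated in this protocol, so the assumed effect size (Cohen’s f=0.25, ie, f²=0.0625) and numerator degrees of freedom below are assumptions, and the sketch computes the power achieved with N=158 rather than re-deriving the sample size.

```r
# A minimal sketch, assuming Cohen's f = 0.25 ("medium") and one numerator df per
# main effect; it computes the power achieved with the planned N = 158.
library(pwr)

n_total <- 158
u <- 1                 # numerator df for one main effect in the 2x2 design (assumption)
v <- n_total - 4       # denominator df: total N minus the four cell means
pwr.f2.test(u = u, v = v, f2 = 0.25^2, sig.level = 0.05)  # solves for power (> 0.80)
```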

Inclusion and exclusion

All (N=640) fourth-year medical students (in a 6-year programme) from Charité Medical School in Berlin are eligible to take part in the study. Students are recruited via faculty mailing lists, posters and online platforms of the Charité Skills Lab. Students 18 years or older who sign the informed consent can be included. Coaches in the ‘human condition’ are two medical interns who have recently completed their sixth year of studies at the Charité Medical School, have passed their state examination and are now working in the hospital. Human coaches are thus 2 years more advanced than the participants. They are paid €20 per hour.

Main study procedures

Data collection is taking place remotely in two online sessions (see figure 1). In the first session, students provide their written informed consent (see online supplemental information) and watch a short general introduction video on the idea and methods of LLMs to level out potential differences in participants’ experience with LLMs. For this, a freely available, up-to-date introductory video was chosen (https://youtu.be/2IK3DFHRFfw?si=uSnEBQv2mhPmIOis). Then, participants fill in a short baseline survey (via https://www.soscisurvey.de) on their medical expertise, attitudes towards and experience with ChatGPT and other forms of AI, and their demographics (see online supplemental e table 1 for an overview of all questionnaires and our OSF repository https://osf.io/cbpr3/?view_only=e5e94231ddd546b491c2e07f43f02c88 for all original items and their English translation). To confirm that they completed the first session, participants are asked to email a codeword (‘Psychologie’), provided on the last slide of the survey, to the experimenter.

The second session is administered via MS Teams. Up to six students are invited to the same session. On arrival, participants are welcomed by the experimenter and receive a short introduction to the study. Then, participants are randomly assigned to the human or AI condition and training or no training subgroup by the experimenters using a computer-generated randomisation process. Participants are blinded to the training versus no training condition but are aware of the random allocation procedure to the human versus AI condition (from the general study information; see online supplemental information). Participants are sent to individual breakout rooms and receive a link to access their experimental session. They then work individually on the experiment in their breakout room with the opportunity to chat with the experimenter in case of problems or questions. After finishing, they return to the meeting room and are informed about the debriefing (which comes at a later date; see Debriefing below), thanked and dismissed. Experimenters note all deviations from the protocols, technical issues and participants’ comments so that the quality of data collection can be evaluated.
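The allocation itself is described only as computer generated; a minimal sketch of one possible implementation in R (permuted blocks of four) is shown below. The block size and seed are assumptions for illustration, not the study’s actual procedure.

```r
# Permuted-block randomisation to the 2x2 conditions (illustrative sketch)
set.seed(2024)                                     # arbitrary seed for reproducibility

conditions <- expand.grid(
  assistance = c("human coach", "ChatGPT"),
  training   = c("training", "no training")
)

n_participants <- 158
n_blocks <- ceiling(n_participants / nrow(conditions))

# Each block contains all four conditions in random order, keeping group sizes balanced
allocation <- do.call(rbind, replicate(
  n_blocks,
  conditions[sample(nrow(conditions)), ],
  simplify = FALSE
))
allocation <- allocation[seq_len(n_participants), ]
allocation$participant <- seq_len(n_participants)
head(allocation)
```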

Get to know

The experimental session starts with a get-to-know phase designed to acquaint participants with their respective mode of assistance, whether the human coach ‘Toni’ or ChatGPT. This short introduction highlights the strengths of each coach, such as Toni’s background in medicine, including successful completion of medical studies and practical medical experience, and ChatGPT’s expansive knowledge base (see online supplemental information). Participants are also made aware of the limitations inherent to each coach, such as Toni’s potential knowledge gaps compared with a senior physician and the possibility of ‘hallucinations’ with ChatGPT. This initial step is crucial in addressing participants’ onboarding needs, facilitating their evaluation of the capabilities and intentions of their human or ChatGPT coach.57 By establishing familiarity and understanding of the strengths and limitations, participants can begin to develop trust in their respective coach, which is vital for effective collaboration and decision-making.58 The get-to-know phase does not contain any examples of when and how to interact with the coach; such examples are provided only as part of the training.

Training

Afterwards, participants either see the training instructions on the screen (training condition) or not (no-training condition), depending on the subgroup they are randomly assigned to. The training instructions are designed to heighten awareness of the potential for diagnostic errors and delineate three prevalent contributing factors: limited knowledge, premature closure and overconfidence.1 59 These are briefly explained. Additionally, the instructions provide example inquiries that participants may pose to their respective coach (whether human or ChatGPT) to effectively navigate these three challenges (see online supplemental information for complete instructions). The training instructions are no longer available once the participant proceeds to the next page.

Task: diagnose cases

The main task is then to diagnose two patient cases (in random order). The cases are based on published cases of real patients43 60 and represent ambiguous emergency cases with a known correct diagnosis but a main competing diagnosis that has to be considered (case 1: pulmonary embolism vs myocardial infarction; case 2: aortic dissection vs stroke). On the patient case page, patient information including ECGs, laboratory results of blood samples and patient history is presented in a patient chart. On the same page, participants have access to a field in which to chat with their coaches, who reply in real time. Participants are instructed not to use any other sources of information than those on the screen. Participants are asked to record all differential diagnoses considered in a separate field on the same page. All clicks, chats and entries are logged with time stamps. Figure 2 shows a screenshot of a patient case page (in German). When leaving the patient case page, participants are asked to assess the likelihood of each diagnosis generated (on a Visual Analogue Scale of 0–100), to provide a reason for their most likely diagnosis (open answer) and to report their intended next steps if this were a real patient (open answer).

Figure 2 Screenshot of a patient case page. Starting on the left, there is a window showing the current step within the experiment and the patient chart with several subcategories, above the field for entering the differential diagnoses; on the right is the chat window (here, in the artificial intelligence condition).

Human versus AI coach

The LLM used in this study is OpenAI’s ChatGPT (version gpt-4-0613, DeploymentName=‘GPT-4’, MaxTokens=1000, Temperature=1.0f), accessed via the application programming interface provided by Microsoft Azure’s cloud platform (hosted in the ‘Switzerland North’ data centre).
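For illustration only, the configuration above can be exercised through Azure OpenAI’s REST interface as sketched below in R. The resource name, API version, key handling and example messages are placeholders (assumptions); the study’s actual integration runs inside the Blazor Server application described below, not in R.

```r
# Illustrative chat-completions request against an Azure OpenAI deployment
library(httr)

endpoint    <- "https://<your-resource>.openai.azure.com"  # hypothetical resource name
deployment  <- "GPT-4"                                     # deployment name as stated above
api_version <- "2023-05-15"                                # assumed API version

url <- sprintf("%s/openai/deployments/%s/chat/completions?api-version=%s",
               endpoint, deployment, api_version)

body <- list(
  messages = list(
    list(role = "system",
         content = "Act as a medical coach for fourth-year medical students ..."),  # abbreviated stand-in for the system prompt
    list(role = "user",
         content = "Which findings support or oppose pulmonary embolism in this case?")
  ),
  max_tokens  = 1000,   # matches MaxTokens = 1000
  temperature = 1.0     # matches Temperature = 1.0
)

res <- POST(url,
            add_headers(`api-key` = Sys.getenv("AZURE_OPENAI_KEY")),
            body = body, encode = "json")

content(res)$choices[[1]]$message$content  # the coach's reply shown in the chat window
```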

The human coach is randomly drawn from the two medical interns who serve as coaches and who received a 5-hour training on the study purpose, the chat system and the philosophy of peer teaching30 and deliberate reflection,61 as well as scripts with standard answers to frequent requests (as identified in a pilot phase) to ensure that they could reply quickly and in a standardised way. Both human coaches are introduced by the unisex name ‘Toni’ to avoid potential gender bias and to keep their identities confidential. Human coaches sit at their computers at home and chat with the participant via the experimental interface. The interface was created using Microsoft’s ‘Blazor Server App’ web framework. Both ChatGPT and the human coach received the instruction to act as a medical coach and accompany fourth-year medical students through the diagnostic process, including asking guiding questions such as ‘Which findings support/oppose your hypothesis?’ following the logic of deliberate reflection61 62 (for the complete instructions, see the system prompt in the online supplemental information).

Questionnaire per case

Following each patient case, participants respond to questions pertaining to their case perception, encompassing factors such as perceived difficulty and familiarity with the diagnosis, as well as their assessment of the competence and support provided by the coaches (online supplemental e table 1).

Table 1 Overview of variables of interest

Final questionnaire

A final questionnaire is administered after completion of both patient cases to assess the perceived usefulness of,63 satisfaction with64 and credibility of the coaches.65

Debriefing

On re-entering the virtual meeting room, participants are informed that a debriefing will follow at a later date, thanked and dismissed by the experimenter. Following the data collection phase, a comprehensive written debriefing will be provided. This debriefing will include solutions to the patient cases, an information package containing the training instructions (also in the no-training condition), as well as links to additional resources on clinical reasoning and LLMs.

Pilot study

In a pilot study involving N=11 fourth-year medical students and medical interns (mean age 26 years, SD 4.9; 55% female), the case material was tested for intelligibility and feasibility without assistance from a human coach or ChatGPT. Diagnoses were elicited as free-text responses. For case 1, the correct diagnosis (pulmonary embolism) was listed as the most likely diagnosis by 27% of participants, and for case 2, the correct diagnosis (aortic dissection) by 0%, confirming that the cases were sufficiently difficult to prevent ceiling effects.

Data to be analysed

Data will be in the form of questionnaires, process measures (eg, timestamps of clicks), chat protocols and ratings. Data will be entered into a web-based database that fulfils the requirements of the Swiss Human Research Act. Participants will be asked to generate a ‘study ID,’ which guarantees their anonymity but allows for matching baseline surveys with the data collected during the experimental session. All data will be digital. Only authorised study personnel will have access to personal information (eg, email address) during data collection. Any data shared with external parties (eg, collaborators) will be deidentified to remove all personally identifiable information. Only anonymised, coded data will be published together with DOIs in the OSF repository to make them findable. Primary and secondary endpoints as well as control variables are listed in table 1.

Statistical analyses

Data analysis will be conducted with R.55 For statistical analyses, we will use generalised linear mixed models (GLMMs), complemented by suitable post hoc techniques, particularly for subgroup analyses. Standard descriptive statistics and graphical representations will be employed, along with normality testing to assess the assumptions underlying parametric methods. Prior to data analysis, data quality will be checked using, for example, range checks for data values. To evaluate the success of the randomisation, we will compare the four groups regarding their demographics (eg, age, gender, prior experience with LLMs) with ANOVAs. To determine whether participants in the training condition read the training instructions, we will compare the time they spent on the page with a minimum reading-time threshold. This threshold will be set slightly below the average time spent on the page by participants in the no-training condition.
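A minimal R sketch of these two checks on simulated stand-in data follows; the column names are hypothetical, and the 90% cut-off used to operationalise ‘slightly below’ is an assumption.

```r
set.seed(1)
# Simulated stand-in for the per-participant data (hypothetical column names)
d <- data.frame(
  assistance = rep(c("human coach", "ChatGPT"), each = 79),
  training   = rep(c("training", "no training"), times = 79),
  age        = rnorm(158, mean = 24, sd = 2),
  page_time  = rexp(158, rate = 1 / 60)          # seconds spent on the instruction page
)

# (1) Randomisation check: compare the four subgroups on a demographic variable
summary(aov(age ~ interaction(assistance, training), data = d))

# (2) Minimum reading-time threshold, set slightly below the mean time that the
#     no-training group spent on the same page (here: 90% of that mean, an assumption)
threshold <- 0.9 * mean(d$page_time[d$training == "no training"])
d$read_instructions <- d$training == "training" & d$page_time >= threshold
table(d$read_instructions[d$training == "training"])
```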

To determine the accuracy of the differential diagnoses, they will first be automatically coded to International Classification of Diseases (10th revision; ICD-10) codes using a proprietary German-language natural language processing engine (Averbis Health Discovery, https://averbis.com), which maps unstructured text to ICD-10 German Modification codes. A randomly selected 50% of the diagnoses will be cross-checked by two expert raters, blinded to the condition, to ensure the accuracy of the automated ICD matching. If the accuracy of this automated matching turns out to be below 95%, the proportion selected for human cross-checking will be increased to 60%, 70% and so forth. The resulting codes will then be compared with the correct codes for the two cases. Accuracy will be calculated as the number of steps required within the ICD taxonomy to get from one diagnosis to the other, as described elsewhere.62
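To make the taxonomy-based distance concrete, the sketch below shows one simplified operationalisation in R: each ICD-10 code string is treated as a path of nested prefixes, and the distance is the number of steps up to the deepest common ancestor plus the steps back down. The published metric (reference 62) and the chapter/block structure of the real ICD-10-GM may differ, so this is an illustration rather than the study’s scoring routine.

```r
# Illustrative prefix-based distance between two ICD-10 codes
icd_path <- function(code) {
  code <- gsub("\\.", "", toupper(code))                           # e.g. "I26.0" -> "I260"
  sapply(seq_len(nchar(code)), function(i) substr(code, 1, i))     # "I", "I2", "I26", "I260"
}

icd_distance <- function(code_a, code_b) {
  a <- icd_path(code_a)
  b <- icd_path(code_b)
  depth  <- min(length(a), length(b))
  common <- sum(a[seq_len(depth)] == b[seq_len(depth)])            # depth of the common ancestor
  (length(a) - common) + (length(b) - common)                      # steps up plus steps down
}

icd_distance("I26.0", "I26.9")   # same category, different subcode -> 2
icd_distance("I26.0", "I21.9")   # pulmonary embolism vs myocardial infarction -> 4
```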

To assess the impact of the type of assistance and training on the primary and secondary outcome variables, we will fit successively more complex GLMMs,54 starting with participant ID and item ID as random intercepts, and gender and the experimental conditions as fixed effects. The dependent variables will include diagnostic accuracy, the number of differential diagnoses and the secondary endpoints (see table 1). Sensitivity analyses are planned to check the robustness of our findings. These will include alternative model specifications, assessing interaction effects, applying different methods for handling missing data (eg, imputation methods, complete case analysis) and subgroup analyses. For example, we will successively include additional control variables, such as participants’ medical competence41 42 and general trust in LLMs,58 66 to account for potential confounders and gain a deeper understanding of the conditions under which LLMs are most effective.
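A minimal sketch of the first model in this sequence, fitted with the lme4 package on simulated stand-in data, is shown below. The column names, the additive fixed-effects structure and the Poisson family for the count outcome are assumptions; the family for each endpoint will be chosen to match its distribution.

```r
library(lme4)

set.seed(1)
# Simulated stand-in for the long-format data: one row per participant x case
d_long <- data.frame(
  participant_id  = factor(rep(1:158, each = 2)),
  case_id         = factor(rep(c("case1", "case2"), times = 158)),
  assistance      = rep(c("human coach", "ChatGPT"), each = 158),
  training        = rep(rep(c("training", "no training"), each = 2), times = 79),
  gender          = sample(c("female", "male", "diverse"), 316, replace = TRUE),
  n_differentials = rpois(316, lambda = 3)
)

# Random intercepts for participant and case; gender and the two experimental
# factors as fixed effects (interaction terms are added in later, more complex models)
m_count <- glmer(
  n_differentials ~ assistance + training + gender +
    (1 | participant_id) + (1 | case_id),
  family = poisson,
  data   = d_long
)
summary(m_count)
```

For continuous endpoints such as the ICD-based accuracy score, lmer() with the same random-effects structure would be the analogous linear mixed model.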

In preparation for the qualitative analysis of prompts and usage patterns of the coaches, all chat interactions and open answers will be coded using MAXQDA software. Coding categories (eg, confirmatory or knowledge questions) will be derived inductively and deductively by trained raters with domain knowledge. Two trained raters will independently code the material, blinded to the conditions (human coach vs ChatGPT, training vs no training). Inter-rater agreement will be reported as the kappa coefficient. Exploratory analyses and subgroup analyses will be conducted to characterise successful and unsuccessful prompts and the differences between consulting a human coach versus ChatGPT. Further, we will explore the timing of coach use (early or late in the process), the frequency and type of errors made by the coaches, and the impact of the (correct or incorrect) diagnoses proposed by the coaches on the diagnoses listed by the participants.
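As a minimal illustration of the agreement statistic, unweighted Cohen’s kappa for two raters’ nominal codes could be computed in R as sketched below; the example data and the choice of the irr package are assumptions, and the protocol does not specify which kappa variant will be reported.

```r
library(irr)

# Hypothetical codes assigned by the two raters to the same five chat segments
codes <- data.frame(
  rater1 = c("confirmatory", "knowledge", "knowledge", "confirmatory", "other"),
  rater2 = c("confirmatory", "knowledge", "confirmatory", "confirmatory", "other")
)
kappa2(codes)   # unweighted Cohen's kappa for nominal coding categories
```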

Patient and public involvement

We intend to disseminate the main results to the participants and the public in a format that is suitable for a non-specialist audience. There was no patient or public involvement in the design and conduct of the study.

Discussion

Our study has several strengths. First, it is a prospective randomised controlled experiment involving advanced medical students diagnosing complex patient cases, allowing us to investigate both diagnostic outcomes and processes. Second, the study compares consultations with either an LLM or a human coach, both of which are practically relevant advisors for medical students solving complex cases. Third, the detailed analysis of both the diagnostic process and its outcomes will provide a deeper insight into the research findings.

Our study also has several limitations. First, it focuses solely on fourth-year medical students, which may restrict the generalisability of the results to a broader medical student population or to residents and practising physicians. Also, the study is set within a medical education context, involving complex cases that are challenging for this level of training. Second, only approximately half of our questionnaires have been validated by previous research. This is due to the lack of suitable instruments, given the novelty of our study’s focus. For instance, we were unable to find scientifically validated questions that assess trust in an AI chat partner. Third, although we plan to conduct in-depth qualitative analyses of the interactions between participants and either human coaches or ChatGPT, insights into the underlying mechanisms of how AI influences decision-making processes will still be limited to our setting. More research in various medical (education) contexts is needed to better understand the way users perceive and interact with AI tools.24 25 31 67 68 Last, we acknowledge that integrating AI into medical diagnostics is not just a technological upgrade but also introduces complex ethical dilemmas and practical implementation challenges that require thorough exploration.19 69 In our study, we point participants to the limitations and potential biases of ChatGPT (and human coaches), but any considerations to integrate ChatGPT into medical education need to be accompanied by additional ethical considerations and dedicated training programmes as part of the medical curriculum.
