Triage Performance Across Large Language Models, ChatGPT, and Untrained Doctors in Emergency Medicine: Comparative Study


Introduction

In recent years, machine learning techniques have been integrated into various aspects of medical care, serving as supportive diagnostic algorithms in clinical applications that include electrocardiograms [], skin lesion classification [], and radiological imaging []. With the recent boom of generative artificial intelligence (AI) and large language models (LLMs), thrust into the spotlight by the release of ChatGPT (OpenAI) [] in November 2022, the application of generative AI methods has sparked widespread discussion and experimentation in the medical domain.

To many medical professionals, it is remarkable that systems such as ChatGPT without any specialized training have been able to pass US medical exams [,]; provide answers to questions from a web-based platform that are preferred over doctors’ replies []; and in most cases, offer appropriate recommendations to questions on cardiovascular disease prevention []. The capabilities of text generation and summarization introduce numerous controversial research and clinical use cases such as creating discharge summaries [] or crafting scientific abstracts []. Recently, even commercial solutions for automated documentation of clinician-patient interactions have become available [,]. All these use cases carry the promise of improving quality and efficiency by offering second opinions [] and reducing the documentation burden on strained health care professionals [].

Emergency departments (EDs) frequently serve as the initial point of contact for patients in need of immediate medical attention. A crucial component of ED operations is the triage process, which aims to efficiently allocate often limited resources by prioritizing patients according to the severity and urgency of their conditions. This is typically done using established triage systems, such as the Manchester Triage System (MTS), which is popular in Europe and frequently used in Germany [,]. Taking place in a high-stress environment [,], triage processes have been shown to be highly variable in quality [] and influenced by personal traits of the rater such as experience and triaging fatigue [].

In light of the challenges faced by staff undertaking the triage process and the demonstrated medical abilities of language models, our study sought to assess the capability and potential of ChatGPT in the context of emergency triage. We evaluated its performance in triaging patient vignettes according to the MTS framework, comparing its results to those of both professional MTS raters and doctors working in an ED without triage training. Given the promising data of LLMs serving as a second opinion in other medical contexts [], we explored ChatGPT’s potential as a resource for providing external validation and second opinions to ED staff with less experience and without specific training in the MTS. We further compared ChatGPT and doctor triages to triage assessments of other currently available state-of-the-art LLMs such as Gemini 1.5 (Google) [], Llama 3 70B (Meta) [], and Mixtral 8x7b (Mistral AI) []. Additionally, we tested the performance of the LLM GPT-4 (accessed via the OpenAI application programming interface), on which the product ChatGPT is based. Our research is the first to compare rater expertise and different LLMs and ChatGPT versions while incorporating the MTS, building on findings of fair agreement between professional raters and Generative Pre-trained Transformer (GPT)–based models in smaller samples using a different triage system [] and similarities observed between the triage decisions of ophthalmology trainees and GPT-4–based ChatGPT [].


Methods

Case Vignette Creation and Triage Process

In this study, we compiled 124 independent emergency cases from a single, randomly selected day in the interdisciplinary ED of University Hospital Düsseldorf, Germany, which were then transformed into anonymized English case vignettes. These vignettes contained only medically relevant information, extracted by 1 doctor (LM) according to a predefined standard operating procedure. Patient ages were randomly adjusted within a range of –2 to +2 years where they were not pertinent to the case. Nonmedical information, including personal demographics such as race, was excluded. Vital signs or information about the conscious state of the patient were added where available or deemed clinically necessary. Clinical values were slightly altered (by up to 5% of the original value) or added where necessary for triage, with all adjustments made at the discretion of the overseeing doctor. Detailed information on laboratory test results or imaging was not included in the case descriptions. The resulting case vignettes (see Textbox S1 in ) were then reviewed by a second doctor (MP) to ensure the absence of potentially identifying data, in accordance with an internal standard operating procedure. Neither doctor was involved in the subsequent rating of the cases. The cases were independently assessed by 2 experienced, MTS-instructing staff members. When differing triage priorities were assigned, a third equally qualified and MTS-trained doctor mediated the discussion and had a tiebreaking vote to reach a consensus rating (consensus set). Triage decisions were made using the fifth version of the MTS (German version) [], which consists of the following categories: MTS Level 1 (“Red”) for immediate assessment, MTS Level 2 (“Orange”) for very urgent assessment within 10 minutes, MTS Level 3 (“Yellow”) for urgent assessment within 30 minutes, MTS Level 4 (“Green”) for standard assessment within 90 minutes, and MTS Level 5 (“Blue”) for nonurgent assessment within 120 minutes. The distribution of the determined triage levels for the analyzed day was compared to the long-term averages of the ED.
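To make the ordinal structure of these categories explicit, a minimal Python sketch of how the 5 MTS levels can be encoded for the downstream agreement analyses is shown below; the mapping and names are our own illustration and not part of the study materials.

# Illustrative ordinal encoding of the 5 MTS levels (our own sketch, not study code).
# Lower numbers indicate higher urgency, matching the MTS level numbering.
MTS_LEVELS = {
    "Red": 1,     # immediate assessment
    "Orange": 2,  # very urgent, within 10 minutes
    "Yellow": 3,  # urgent, within 30 minutes
    "Green": 4,   # standard, within 90 minutes
    "Blue": 5,    # nonurgent, within 120 minutes
}

def to_level(color: str) -> int:
    """Map an MTS color to its ordinal level (1 = most urgent)."""
    return MTS_LEVELS[color]

Such an ordinal encoding is what makes a quadratic-weighted agreement statistic meaningful, as disagreements between adjacent levels are penalized less than disagreements spanning several levels.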

Additionally, 4 MTS-untrained resident doctors regularly working in the ED were given these cases to assess according to MTS triage categories, albeit without access to the precise algorithm diagrams underlying the MTS. None of these doctors had formal MTS training; 2 were in their second year of residency and 2 in their third. We also presented the anonymized case vignettes to 2 versions of ChatGPT, based on either GPT-3.5 or GPT-4 (ChatGPT version: May 24, 2023). The prompt given to ChatGPT was derived after manually testing several prompt versions, with the most promising one selected for the final evaluation in a zero-shot setting. The models were tasked with stratifying the cases according to the MTS using this optimized prompt without further training or refinement (see Textbox S2 in ) and without access to the copyright-protected MTS algorithm diagrams. Owing to the probabilistic nature of LLMs [], both versions of ChatGPT were queried identically 4 times, with a new chat for each iteration. Afterward, the untrained doctors were presented with the answers (rating and explanation) from the overall best-performing ChatGPT instance as a second opinion and were asked to reconsider their initial answers (see Textbox S3 in ). A “hypothetical best” performance in this scenario was defined as the optimal integration of the untrained doctors’ decisions with ChatGPT responses, representing the maximum possible improvement achievable with ChatGPT input. For comparison, a variety of state-of-the-art LLMs (GPT-4, Llama 3 70B, Gemini 1.5, and Mixtral 8x7b) were similarly queried 4 times via their respective application programming interfaces in a zero-shot approach using a slightly adapted prompt (see Textbox S2 in ). Details regarding the models and parameters are described in Table S1 in . A flowchart of the study’s setup can be found in .
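ChatGPT itself was used through its product interface; for the models accessed via application programming interfaces, the repeated zero-shot querying can be sketched as follows. This is a minimal illustration assuming the OpenAI Python SDK (version 1.0 or later); the model name and prompt template are placeholders of our own and not the study’s actual prompt (see Textbox S2).

# Minimal sketch of repeated zero-shot querying via an API (not the study's actual code).
# MODEL and PROMPT_TEMPLATE are placeholders; the optimized prompt is provided in Textbox S2.
from typing import List
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
MODEL = "gpt-4"    # placeholder; other LLMs were queried via their respective APIs
PROMPT_TEMPLATE = (
    "Assign a Manchester Triage System color (Red, Orange, Yellow, Green, or Blue) "
    "to the following emergency case and briefly explain your reasoning:\n{vignette}"
)

def triage_case(vignette: str, n_repeats: int = 4) -> List[str]:
    """Query the model n_repeats times, each in a fresh context (analogous to a new chat)."""
    answers = []
    for _ in range(n_repeats):
        response = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(vignette=vignette)}],
        )
        answers.append(response.choices[0].message.content)
    return answers

Repeating identical queries in fresh contexts captures the run-to-run variability of probabilistic decoding, which can then be summarized as an SD across the 4 iterations.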

Figure 1. Summarizing flowchart of the study setup. The displayed flowchart summarizes the methodological approach and gives an overview of emergency department (ED) case vignette creation and triage performance assessment. Triage was carried out according to the Manchester Triage System (MTS) by doctors of different training levels and different Generative Pre-trained Transformer (GPT)–based versions of ChatGPT, as well as other large language models (LLMs).

Calculation of Interrater Agreement

For calculating the interrater agreement, we computed the quadratic-weighted Cohen κ for each rater against the consensus set and used the qualitative categories from the original work by Cohen []. To test for statistical differences between the κ values of different rating groups, we performed a 1-way ANOVA with a Bonferroni correction, followed by the Tukey honest significant difference test. Mean values are displayed with SDs throughout the paper. All calculations were executed in Python (version 3.8; Python Software Foundation) using the statsmodels package (version 0.13.2) []. For graphical representations, we used the matplotlib (version 3.1.3) [] and seaborn (version 0.12.0) [] packages.
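A minimal sketch of these analyses is shown below. It is our illustration rather than the study code: the paper used statsmodels, whereas the sketch computes the κ with scikit-learn's cohen_kappa_score (the same quadratic-weighted statistic) and includes one plausible operationalization of the "hypothetical best" combination defined above (per case, whichever of the doctor's or ChatGPT's rating lies closer to the consensus).

# Illustrative agreement analysis (our own sketch; the study used statsmodels 0.13.2).
import numpy as np
from sklearn.metrics import cohen_kappa_score               # quadratic-weighted Cohen kappa
from scipy.stats import f_oneway                            # 1-way ANOVA
from statsmodels.stats.multicomp import pairwise_tukeyhsd   # Tukey honest significant difference

def weighted_kappa(rater_levels, consensus_levels):
    """Quadratic-weighted Cohen kappa of one rater against the consensus set."""
    return cohen_kappa_score(rater_levels, consensus_levels, weights="quadratic")

def hypothetical_best(doctor_levels, gpt_levels, consensus_levels):
    """One plausible 'hypothetical best': per case, keep whichever rating is closer to consensus."""
    return [d if abs(d - c) <= abs(g - c) else g
            for d, g, c in zip(doctor_levels, gpt_levels, consensus_levels)]

def compare_groups(kappas_by_group):
    """ANOVA across groups of per-rater kappa values, followed by the Tukey HSD test.

    kappas_by_group maps a group name (e.g., 'untrained doctors') to a list of
    kappa values, one per rater or model iteration.
    """
    f_stat, p_value = f_oneway(*kappas_by_group.values())
    values = np.concatenate([np.asarray(v, dtype=float) for v in kappas_by_group.values()])
    labels = np.concatenate([[name] * len(v) for name, v in kappas_by_group.items()])
    tukey = pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.05)
    return f_stat, p_value, tukey

The quadratic weighting penalizes a disagreement between, for example, "Red" and "Blue" more heavily than one between adjacent levels, which suits the ordinal MTS scale.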

Ethical Considerations

The local ethics committee of University Hospital Düsseldorf (5794R and 2020-1263) granted approval for data collection. Given the retrospective nature of this study, the requirement for written informed consent was waived. Data were anonymized as explained above and no compensation was provided.


Results

The consensus triage distribution in the analyzed cases was broadly similar to the triage distribution of the same ED in 2022 and to published 2019 data []. However, there was a higher proportion of patients triaged as “Blue” (9/124, 7.3% vs 4.5% of all cases in 2022), “Yellow” (52/124, 41.9% vs 34.4%), “Orange” (24/124, 19.4% vs 9.5%), and “Red” (5/124, 4% vs 3%). Conversely, fewer patients were triaged as “Green” (34/124, 27.4% vs 48.7%).

The different versions of ChatGPT and the other LLMs were generally capable of providing triage colors as requested by the prompt and consequently assigned an MTS color for each case vignette (see Figures S1 and S2 in for a graphical depiction of all ratings of all raters). We found moderate to substantial agreement between the consensus triage and the triages made by various groups, including LLMs, different ChatGPT iterations, and untrained doctors. As might be anticipated, the professionally trained raters, whose individual ratings preceded and informed the consensus set, exhibited near-perfect alignment with it, reflected by an average quadratic-weighted κ of 0.91 (SD 0.054). The interrater agreement of the professional raters (quadratic-weighted κ of 0.82) was identical to that reported in the literature []. On the other hand, GPT-3.5–based ChatGPT demonstrated only moderate agreement with the consensus set (κ=mean 0.54, SD 0.024), a score that was significantly lower than those of both GPT-4–based ChatGPT (κ=mean 0.67, SD 0.037; P<.001) and untrained doctors (κ=mean 0.68, SD 0.056; P<.001). The latter 2 groups displayed substantial agreement with the consensus triage, with no significant difference discernible between them (P>.99). When the untrained doctors were given GPT-4–based ChatGPT responses as a second opinion, they achieved a slightly higher, although not statistically significant, average κ of 0.70 (SD 0.047; P=.97). Despite this improvement, their κ scores still fell short of the hypothetical best combinations of the GPT-4–based ChatGPT responses and the untrained doctors’ initial assessments. Such combinations could have led to near-perfect agreement (κ=mean 0.84, SD 0.021)—a level of performance more comparable to that of professional raters. The other tested LLMs showed varying results, with raw GPT-4 performing very similarly to GPT-4–based ChatGPT (κ=mean 0.65, SD 0.010; P=.98). Gemini 1.5 achieved results between those of GPT-4 and GPT-3.5–based ChatGPT, with a mean κ of 0.60 (SD 0.010), which was not significantly different from either model (P=.08 and P=.29, respectively) but significantly worse than the untrained doctors (P=.03). Llama 3 70B achieved a similar, albeit slightly lower, average κ of 0.52 (SD 0.004) compared with GPT-3.5–based ChatGPT (P=.98); this was worse than Gemini 1.5 (P=.04) but better than the Mixtral 8x7b model, which had a mean κ of 0.42 (SD 0.000; P=.006; see ; for the respective tables and P values [ANOVA: F8=54.01; P<.001], see Table S2 in ). The κ values of individual raters are shown in Figure S3 in .

In addition to the overall agreement with the consensus set, the distribution of assigned MTS levels varied considerably among the different rater groups. In comparison to the professional raters and the consensus set, ChatGPT and the other LLMs, except for Gemini 1.5, frequently assigned the highest triage category “Red” (mean 18.1%, SD 0.5% GPT-3.5–based ChatGPT; mean 12.5%, SD 2.5% GPT-4–based ChatGPT; mean 22.5%, SD 0.7% Llama 3 70B; mean 37.1%, SD 0% Mixtral 8x7b; 5/124, 4% consensus set), yet they seldom or never designated the lowest category “Blue” (only once across 4 batches with 124 questions [mean 0.2%] for GPT-3.5–based ChatGPT; mean 1.2%, SD 0.5% GPT-4–based ChatGPT; 9/124, 7.3% consensus set; Llama 3 70B, raw GPT-4, and Mixtral 8x7b never assigned the category). This pattern was more evident in GPT-3.5–based ChatGPT than in version 4 and less evident in Gemini 1.5 (mean 5.6%, SD 0% “Red” and mean 4%, SD 0% “Blue”). In contrast, untrained doctors assigned the “Blue” category much more frequently than their professionally trained counterparts (mean 17.3%, SD 8.4% untrained doctors vs 9/124, 7.3% consensus set), while also employing the “Red” category more often (mean 9.1%, SD 5.5% untrained doctors vs 5/124, 4% consensus set). Certain language models displayed notable triaging patterns: Gemini 1.5 predominantly categorized cases as “Orange” (mean 64.1%, SD 1.1% Gemini 1.5 vs 24/124, 19.4% consensus set), while Mixtral 8x7b frequently selected “Red” (mean 37.1%, SD 0%) and “Green” (mean 48.4%, SD 0%) and yielded identical results over all 4 iterations (see ). Accordingly, tendencies toward overtriage and undertriage were evident, with the GPT models leaning toward overtriage, untrained doctors more often undertriaging, and the hybrid doctor-ChatGPT ratings displaying a fairly balanced performance (see Figure S4 in ). Among untrained doctors, variance in triage strategies was noticeable, as demonstrated by the composition of individual raters’ triages (see Figure S5 in ). Importantly, all raters and LLMs accurately identified the most critical “Red” cases in the consensus set as “Red,” except for 2 instances where untrained doctors rated a “Red” case as “Orange.” Notably, 1 doctor corrected their initial misjudgment to “Red” after being provided the GPT assessment.

Figure 2. Quadratic-weighted Cohen κ compared to the consensus set. (A) This panel shows box plots of the quadratic-weighted Cohen κ for various triaging groups in relation to the consensus triage (blue). For the untrained doctors' group, both scenarios are depicted: rating alone (initial triage in blue) and rating with GPT-4 as a second opinion (orange). The green box plot represents the potential best combination of the doctors' initial triage with the GPT-4–based ChatGPT second opinion. The box plot's center line indicates the median, box limits indicate upper and lower quartiles, and whiskers indicate 1.5× the IQR. The black point indicates an outlier. (B) This panel illustrates the results of the Tukey honest significant difference test, performed based on an ANOVA with Bonferroni correction of the quadratic-weighted Cohen κ values for the different rater groups. The universal CIs of each group's mean were computed based on Tukey Q and plotted, with nonoverlapping intervals indicating statistically significant differences. As the professional raters' ratings were taken into account for the consensus set, their performance serves primarily as the reference point. API: application programming interface.

Figure 3. Average distribution of Manchester Triage System (MTS) levels among different rater groups. The figure presents stacked bar plots illustrating the average count of triage categories, arranged in ascending order of severity. The least severe level, Level 5 ("Blue"), is located at the bottom, while the most severe level, Level 1 ("Red"), is positioned at the top. API: application programming interface.

Discussion

With the rising number of acute cases in the ED and ongoing limitations of personnel resources [,], strategies for improving triage are urgently needed to increase the capacity, motivation, and opportunities of nurses and doctors. Past attempts to decrease workload, such as patient self-triage, have shown limited correlation when compared to triages conducted by nurses trained in the MTS []. However, recent results from algorithmic solutions demonstrate their potential in enhancing clinical decision processes and optimizing resource utilization [].

Given the swift advancements in natural language processing made by LLMs and their extensive applications in medical care, it is plausible that such models could further enhance ED operations in the near future. It is worth noting that in our scenario, no clarifying questions or similar interactions were allowed, which may not mirror the typical setting of an ED. A test of the consumer-facing, general LLM product ChatGPT appeared especially relevant, as many health care professionals have access to and have experimented with this product. Our study findings suggest that even a general-purpose system such as ChatGPT can match the performance of untrained doctors when triaging emergency cases and, similar to previous models [], consistently identify the most severe cases. Similar on-par performance between GPT-4–based ChatGPT and ophthalmology trainees has been reported in another medical context, further underlining this finding []. However, the findings clearly highlight that the tested versions of ChatGPT did not demonstrate gold-standard performance in triage, especially in more ambiguous cases, which is also in line with previous findings from a study from Turkey []. Our study found varying results for the other LLMs, with the performance of Gemini 1.5, Llama 3 70B, and Mixtral 8x7b being similar to or worse than that of GPT-4–based ChatGPT or GPT-4 itself. Importantly, this research employed a zero-shot approach; parameter tuning could not only enhance results but also address the occasional irregular triaging behavior, which warrants sophisticated testing in future studies.

The integration of LLMs in ED triage is promising not only in terms of decision accuracy but also in potentially boosting efficiency and saving time, a critical factor in emergency scenarios. However, it should be pointed out that this study does not empirically measure their efficiency against traditional triage methods. Future studies are needed to rigorously evaluate the time and effort implications of integrating LLMs and AI assistants into clinical settings, particularly in terms of data entry requirements and comparison with established triage practices by trained medical staff. Such evaluations should incorporate interdisciplinary approaches including nursing staff, who are often responsible for triage in the ED.

Despite the considerable potential of using such a system as a second-opinion tool for clinicians—as demonstrated in our study—the minimal actual improvement observed indicates an ongoing need to train medical personnel in more efficient usage of these resources []. While 1 untrained rater corrected an undertriage to the most critical color “Red” after seeing the ChatGPT ratings—a change that could be clinically significant—further studies are needed to explore potential quality improvements with better validated systems and professionally trained triage raters.

The substantial progress between ChatGPT using GPT-3.5 and GPT-4—with the latter performing comparably to untrained doctors—hints at further potential advancements in this area. Since a very basic prompting strategy was used, without background on the copyright-protected MTS or techniques such as retrieval-augmented generation, there appears to be ample room for further technical improvement. Given the option to fine-tune and further train LLMs on specific data, such as triage cases, it is conceivable that they could, in the future, yield results on par with those of professional raters and take a wider range of parameters into account than humans can. Furthermore, these systems could help decrease the workload of qualified medical staff by answering basic inquiries, collecting medical information, and offering further differential diagnoses, as previously shown []. When combined with recent advancements in automatic detection of patient decompensation using continuous sensor data in EDs [], these technologies could contribute to an efficient and human-centered ED experience for both staff and patients. Moreover, these systems might, in the future, assist patients in deciding between primary care and ED presentation. However, such a prediction warrants further investigation.

It is crucial to note that this study is purely exploratory, and at present, no clinical decisions should be based solely on recommendations made by an LLM or ChatGPT without proper validation studies [] in place. This is further underscored by the LLMs’ and ChatGPT’s performance deficiencies in this study, which would likely lead to suboptimal triage if used in a real-world setting. While this research emphasizes the potential and rapid progress of such models in health care and EDs, the authors want to highlight the ongoing debate about the appropriate regulation of these models for medical applications and about how data privacy concerns should be addressed [,]. Notwithstanding these challenges, as researchers navigate and promote responsible innovation, the potential of AI to transform health care and improve patient outcomes remains a compelling opportunity to strengthen our health care system. Future validation studies should prioritize large, representative, and multicenter data sets, which are paramount for the correct assessment of AI tools. In summary, despite rapid advancements in LLM technologies and associated products such as ChatGPT, this study confirms that, in a very basic setup, they currently do not meet the gold standard for ED triage, underscoring the urgent need for further development and rigorous validation.

The data assessed in this study are available from the corresponding author on reasonable request.

All authors report no conflicts of interest related to this study. LS, AS, TK, RJ, MM, MB, and LB also report no conflicts of interest outside the scope of this study. LM received honoraria for lecturing, consulting, and travel expenses for attending meetings from Biogen, Merck Serono, and Novartis, all outside the scope of this work. His research is funded by the German Multiple Sclerosis Foundation (DMSG) and the German Research Foundation (DFG). NH reports travel reimbursements from Alexion and meeting attendance fees from Novartis, outside the scope of this work. SGM received honoraria for lecturing and travel expenses for attending meetings from Almirall, Amicus Therapeutics Germany, Bayer Health Care, Biogen, Celgene, Diamed, Genzyme, MedDay Pharmaceuticals, Merck Serono, Novartis, Novo Nordisk, ONO Pharma, Roche, Sanofi-Aventis, Chugai Pharma, QuintilesIMS, and Teva. His research is funded by the German Ministry for Education and Research (BMBF), Bundesinstitut für Risikobewertung (BfR), Deutsche Forschungsgemeinschaft (DFG), Else Kröner Fresenius Foundation, Gemeinsamer Bundesausschuss (G-BA), German Academic Exchange Service, Hertie Foundation, Interdisciplinary Center for Clinical Studies (IZKF) Muenster, German Foundation Neurology, Alexion, Almirall, Amicus Therapeutics Germany, Biogen, Diamed, Fresenius Medical Care, Genzyme, HERZ Burgdorf, Merck Serono, Novartis, ONO Pharma, Roche, and Teva, all outside the scope of this study. MP received honoraria for lecturing and travel expenses for attending meetings from Alexion, ArgenX, Bayer Health Care, Biogen, Hexal, Merck Serono, Novartis, Sanofi-Aventis, and Teva. His research is funded by ArgenX, Biogen, Hexal, and Novartis, all outside the scope of this study.

Edited by B Puladi; submitted 02.10.23; peer-reviewed by A Sarraju, A Follmann, M Moebus, Q Jin, L Zhu; comments to author 29.02.24; revised version received 17.04.24; accepted 14.05.24; published 14.06.24.

©Lars Masanneck, Linea Schmidt, Antonia Seifert, Tristan Kölsche, Niklas Huntemann, Robin Jansen, Mohammed Mehsin, Michael Bernhard, Sven G Meuth, Lennert Böhm, Marc Pawlitzki. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 14.06.2024.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
