Use of artificial intelligence chatbots in clinical management of immune-related adverse events

WHAT IS ALREADY KNOWN ON THIS TOPIC

Large language model (LLM) chatbots can provide information on a wide variety of topics, including medical information. However, the utility of LLMs for complex questions about immune-related adverse events (irAEs) is unclear.

HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY

Background

The advent of new artificial intelligence chatbots such as ChatGPT, Google Bard, and many others (hereafter referred to as chatbots) has the potential to change medical diagnostics and treatment drastically. These chatbots, built around large language models, analyze diverse data sets procured from internet sources and learn from them before producing human-like answers to user queries.1 The answers generated by the chatbots evolve based on human feedback combined with the availability of new or updated sources of information. This allows the chatbots to provide more complex answers that are better aligned with the end-user's original intentions.

The ever-increasing extent and availability of medical information present substantial challenges to physicians. Increasingly, both physicians and patients are turning to chatbots to make medical information more digestible and accessible. Determining whether chatbot answers are accurate and reliable is important, especially given that patients increasingly rely on these answers to inform their medical decision-making.2 Several studies have shown that earlier versions of chatbots provide digestible and fairly accurate information, but may also provide incomplete, inaccurate, or out-of-date answers.3 4 Many of these studies, though, focus on multiple-choice or binary answers, which often do not reflect the open-ended nature of real-world medical practice. Lastly, chatbot responses may also lack the emotional aspects of healthcare, such as empathy, although some studies suggest they perform well in this regard.5–7

This study seeks to analyze the accuracy and completeness of chatbot-generated answers to complex, open-ended questions regarding immune-related adverse events (irAEs). These immune-related toxicities affect multiple organs,8 are treated algorithmically according to defined guidelines,9–11 and are common medical problems for physicians caring for patients with cancer. Further, the diverse range of organs affected, the often non-specific clinical presentations, and the multidisciplinary management required make this a challenging area for clinicians, and thus a potentially attractive area for chatbot-derived assistance.

Methods

This cross-sectional study was exempt from institutional review board review given the lack of patient data. Available guidelines for the management of irAEs were reviewed. Based on these guidelines, a total of 50 questions were generated by the senior author (DBJ) and refined/approved by other study authors as representative of common questions that arise in clinical settings. Five questions each from nine common irAE categories were generated (gastrointestinal, hepatic, pulmonary, dermatologic, thyroid, pituitary/adrenal, rheumatologic, neuromuscular, cardiac), with an additional five questions about general irAE management. All questions were designed to be descriptive and open-ended in nature (online supplemental table 1), but with clearly defined answers present in available guidelines from international committees with expertise in irAEs.9–11

Finalized questions were entered into two chatbots (ChatGPT (V.GPT-4) and Google Bard) by the first author (HB) on October 6, 2023. The resulting answers were then provided to the rating physicians. Rating physicians were either members of the Society for Immunotherapy of Cancer immune checkpoint inhibitor and cytokine-related adverse events subcommittee (n=5) or their colleagues with a strong focus on irAE management (n=3). All answers from both chatbots were graded by each rater for accuracy and completeness. Accuracy was graded on a 1–4 point Likert scale, with 1 signifying completely inaccurate, 2 mostly inaccurate, 3 mostly accurate, and 4 accurate. Raters were instructed to grade accuracy based on guideline content, not personal management style. Similarly, completeness was graded on a 1–4 point Likert scale, with 1 signifying incomplete, 2 missing multiple pieces of key information, 3 missing one piece of key information, and 4 complete. Raters were instructed to grade based on major pieces of key information rather than minor or optional items, with colitis given as a specific example (endoscopic evaluation being major; fecal calprotectin testing being minor/optional).

Grades were summarized with means, medians, and ranges for each chatbot overall and for each irAE category. Scores for completeness and accuracy were compared between chatbots using Wilcoxon signed-rank tests. Inter-rater agreement was assessed with Kendall’s coefficient of concordance since there were >2 raters. The two-sample binomial proportion test was used to compare the proportions of certain ratings between chatbots.
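
For illustration only, the sketch below shows how comparisons of this kind could be run in Python with NumPy/SciPy. It is not the authors' analysis code: the 8-rater by 50-question layout is taken from the study design, but the rating values are randomly generated placeholders, and the pairing of scores for the Wilcoxon test (per-question mean ratings) is an assumption rather than a detail specified in the methods.

```python
# Hypothetical sketch of the statistical comparisons described above.
# The ratings below are randomly generated placeholders, not study data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Assumed layout: 8 raters x 50 questions, scored on the 1-4 Likert scale
chatgpt_acc = rng.integers(3, 5, size=(8, 50))  # mostly 3s and 4s
bard_acc = rng.integers(2, 5, size=(8, 50))     # 2s, 3s, and 4s

# Wilcoxon signed-rank test comparing chatbots on paired per-question
# mean ratings (one reasonable pairing; the paper does not specify)
w_stat, p_val = stats.wilcoxon(chatgpt_acc.mean(axis=0), bard_acc.mean(axis=0))
print(f"Wilcoxon signed-rank: W={w_stat:.1f}, p={p_val:.4f}")

# Kendall's coefficient of concordance (W) across raters, uncorrected for ties
def kendalls_w(ratings: np.ndarray) -> float:
    """ratings: (n_raters, n_items) matrix of ordinal scores."""
    m, n = ratings.shape
    ranks = np.apply_along_axis(stats.rankdata, 1, ratings)  # rank items within each rater
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12.0 * s / (m ** 2 * (n ** 3 - n))

print(f"Kendall's W (hypothetical accuracy ratings): {kendalls_w(chatgpt_acc):.2f}")
```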

To further judge accuracy and completeness, 20 different clinical scenarios were generated by DBJ, approved by the other participating authors, and entered into ChatGPT (not Bard, given the time that had elapsed and its poorer performance) on March 20, 2024. Two scenarios were generated from each of the 10 categories and were judged by four of the rating physicians.

Results

Both chatbots were rated for accuracy and completeness on 50 questions from 10 different categories (see online supplemental file). Both chatbots had relatively high scores overall; ChatGPT scored a median of 3.88 for accuracy (mean 3.87) and 3.88 for completeness (mean 3.83) across all questions and raters, whereas Bard scored a median of 3.5 for accuracy (mean 3.5) and 3.5 for completeness (mean 3.46). Inter-rater agreement was fair across all raters (Kendall's coefficients of concordance for accuracy and completeness were 0.21 and 0.24 for ChatGPT and 0.27 and 0.24 for Bard).12 Overall, ChatGPT (V.GPT-4) received significantly higher ratings than Bard for both accuracy and completeness (p<0.001).

We then assessed scores stratified by category by pooling scores across the five questions in each category (maximum of 20 per category per rater). For ChatGPT, both mean and median scores for accuracy and completeness were between 19 and 20 in every category except the general immune checkpoint inhibitor (ICI) questions (table 1). Median scores for Bard ranged from 15.5 to 19, with a similar range for mean scores (16–18.5) (table 1). Scores in all categories were numerically higher with ChatGPT. This difference reached statistical significance (p<0.05) in one category for accuracy (cardiac) and five categories for completeness (hepatic, dermatologic, thyroid, pituitary/adrenal, and cardiac). An additional six categories for accuracy and two for completeness showed marginal statistical significance (p<0.1) favoring ChatGPT. By category, ChatGPT scored lowest on the "general" questions and generally high across the specific irAE categories, whereas Bard appeared to perform best in the dermatologic, rheumatologic, neuromuscular, and cardiac categories.
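
To make the pooling concrete, the minimal sketch below (hypothetical numbers, not study data) sums each rater's five question scores within one assumed category, giving one pooled score per rater with a maximum of 5 × 4 = 20, which can then be compared between chatbots.

```python
# Hypothetical example of per-category pooling: five question scores per
# rater are summed (maximum 5 x 4 = 20) before comparing chatbots.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
gpt_cardiac = rng.integers(3, 5, size=(8, 5))   # 8 raters x 5 questions, 1-4 scale
bard_cardiac = rng.integers(2, 5, size=(8, 5))

gpt_pooled = gpt_cardiac.sum(axis=1)            # one pooled score (<= 20) per rater
bard_pooled = bard_cardiac.sum(axis=1)
stat, p = stats.wilcoxon(gpt_pooled, bard_pooled)
print(f"pooled ChatGPT scores: {gpt_pooled}, pooled Bard scores: {bard_pooled}, p={p:.3f}")
```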

Table 1: Scores for accuracy and completeness for each engine in each category

Multiple questions received ratings of 4 from all eight reviewers: 22/50 (44%) for ChatGPT accuracy and 16/50 (32%) for ChatGPT completeness, compared with 2/50 (4%) for Bard accuracy and 1/50 (2%) for Bard completeness (p<0.001). Ratings of 1 (fully inaccurate or incomplete) were uncommon, given for 2/800 ChatGPT rater-responses (0.3%) and 9/800 Bard rater-responses (1.1%). Ratings of 2 (mostly incorrect or missing multiple key pieces of information) were similarly infrequent for ChatGPT (4/800 rater-responses, 0.5%), though more common for Bard (83/800 rater-responses, 10.4%) (p<0.001).
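
As a worked illustration of the proportion comparison above (unanimous accuracy ratings of 4: 22/50 for ChatGPT vs 2/50 for Bard), a pooled two-proportion z-test, one common form of the two-sample binomial proportion test, yields a p value well below 0.001. The sketch below is illustrative only and is not the authors' analysis code; the choice of the pooled z-test is an assumption.

```python
# Minimal sketch: pooled two-proportion z-test on the unanimous-rating
# counts quoted above (22/50 vs 2/50). Illustrative only.
from math import sqrt
from scipy.stats import norm

def two_proportion_ztest(x1: int, n1: int, x2: int, n2: int):
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)              # pooled proportion under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, 2 * norm.sf(abs(z))               # two-sided p-value

z, p = two_proportion_ztest(22, 50, 2, 50)
print(f"z = {z:.2f}, p = {p:.1e}")              # roughly z = 4.7, p ~ 3e-06
```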

To assess utility in specific clinical scenarios, we entered 20 different patient-specific scenarios (see online supplemental file) into ChatGPT. These answers were also rated highly; mean accuracy was 3.73 (median 4) and mean completeness was 3.61 (median 4). Of the 80 physician-answer ratings, scores were 4 (n=53), 3 (n=23), 2 (n=4), and 1 (n=0).

Discussion

In this study, we found that chatbots, particularly ChatGPT (V.GPT-4), provided generally accurate and complete information surrounding irAEs. Questions were open-ended (not multiple choice), mirroring real-life situations rather than board examinations. The median rating for many questions was 4 (fully accurate and complete), and egregiously wrong answers were uncommon. Thus, these engines appear promising as a source of guidance on irAEs.

Although both engines showed a reasonably high degree of accuracy and completeness, ChatGPT appeared further advanced than Bard in providing accurate and comprehensive information. Ratings of 3 or 4 predominated for ChatGPT (794 of 800 rater-responses), showing consistently high grades across physician raters. As a new technology, however, chatbots are likely to change and upgrade rapidly, so comparisons between engines may quickly become outdated. It is also likely that different engines will ultimately be optimized for distinct tasks and prioritize different capabilities (eg, accuracy vs comprehensiveness). In addition, chatbots may be designed to maximize other goals, such as conciseness (eg, avoiding extraneous information) or delivering information at a specific educational attainment level. These goals are also important for maximizing high-yield information delivery to busy clinicians. Of note, ChatGPT and other engines have shown promise in providing high-quality medical information across a range of medical conditions.13–15 This includes general immuno-oncology questions,16 urological cancers,17 and preoperative counseling for head and neck cancer surgery.18

Interestingly, ratings of 1 (fully incorrect or incomplete) were very uncommon, suggesting that outright "hallucinations" were very rare. When these technologies first emerged, this phenomenon appeared to occur with troubling frequency.19 The rarity of egregiously wrong answers in this data set suggests that such hallucinations may be a surmountable problem, at least for this type of focused question set with concrete answers available in publicly available guidelines. However, it could be argued that less frequent wrong answers may increase the impact of the residual incorrect information, since increasing trust in the output may decrease reliance on other, more validated sources.

Tempering this enthusiasm is the fact that most questions did not universally receive ratings of 4 (fully accurate and/or complete) from all raters. This could reflect subjective disagreement among highly experienced physicians, but could also suggest that these chatbots may not be reliable as stand-alone sources of medical information. A potentially important future direction could include training chatbots specifically on irAE and other cancer-specific guidelines, as has been done with other corpora of text. Until such advances, available guidelines remain the gold standard for making medical decisions. It is also important to note that ratings were subjective and could differ with different clinicians (and could be affected by the particular Likert scale used). It is also possible that new features or upgrades will worsen model performance; this will be difficult to assess.

In conclusion, current iterations of chatbots provide fairly accurate and complete information to many questions surrounding irAEs, though important differences are present between different chatbots. Additional research and validation are needed prior to using these engines as “stand-alone” resources.

Data availability statement

All data relevant to the study are included in the article or uploaded as supplementary information. All data associated with this manuscript have been provided as supplemental materials and can be found in online supplemental table 1.

Ethics statements

Patient consent for publication

Ethics approval

Not applicable.
