Evaluation of ChatGPT-generated medical responses: A systematic review and meta-analysis

In recent years, large language models (LLMs) have attracted attention for their potential to improve traditional approaches across diverse domains [1]. Within healthcare, ChatGPT is a notable example, showing a promising ability to generate text that resembles human communication [2]. These capabilities have led to exploratory uses of ChatGPT in tasks such as responding to medical inquiries and drafting detailed medical content. While growing interest surrounds ChatGPT's potential to assist in diagnosis [3], treatment recommendation [4], [5], patient education [6], and medical report interpretation [7], it also raises various concerns that warrant careful evaluation by practitioners and researchers [8], [9], [10], [11].

Since the launch of ChatGPT, numerous peer-reviewed scientific papers have been published on its appropriateness in addressing medical questions. ChatGPT's performance has been evaluated across various medical specialties, including basic medical disciplines [12], [13], [14], internal medicine [15], [16], surgical medicine [17], [18], obstetrics and gynecology [19], pediatrics [20], and radiology and laboratory medicine [21], [22]. ChatGPT has attracted considerable attention by achieving accuracies ranging from 36% to 90% across these studies.

However, it is crucial to interpret these findings with caution. The field is still developing and lacks standardized guidelines for evaluating LLM performance, which leads to inconsistent evaluation methodologies across studies. Studies differ in the source of questions posed to ChatGPT (e.g., online medical forums [16] vs. summaries of hospital patient records [23]), the questioning process (e.g., asking each question only once [12] vs. repeating each question multiple times [7]), and the evaluation metrics used (e.g., accuracy [24], safety [7], empathy [16], appropriateness [4]); some studies do not report evaluation details at all. Inadequate evaluation may lead to unrigorous conclusions, potentially misleading healthcare professionals and the public and resulting in inappropriate medical advice and decisions.

Given the growing attention to ChatGPT's ability to address medical inquiries, a systematic review of the approaches used to evaluate LLM performance is essential. Recent studies have reviewed advancements in LLMs and evaluated LLM performance in general [25], [26]. In healthcare specifically, several reviews have summarized the application of LLMs in education, research, practice, scientific writing, and ethical considerations [26], [27], [28], [29], [30], but a notable gap remains in the literature concerning how LLMs' performance in answering medical questions is evaluated, particularly for ChatGPT. In addition, ChatGPT, as a general-purpose artificial intelligence product, holds the potential to answer inquiries across various medical disciplines, populations, and languages. This diversity of medical contexts highlights the complexity of assessing ChatGPT's performance. A meta-analysis can synthesize these findings, address the limitations of individual studies, and provide a comprehensive view of ChatGPT's capability to answer medical questions.

In this study, we conduct a systematic review of the existing literature evaluating LLMs, with a focus on ChatGPT. Our objectives are to: (a) review the available evidence on ChatGPT's performance in answering medical questions, (b) synthesize these findings through meta-analysis, (c) examine the methodologies used in the literature, and (d) propose an evaluation framework for LLMs in addressing medical inquiries.
