Diagnostic performances of GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro in “Diagnosis Please” cases

In this study, we compared the diagnostic performance of flagship LLMs from three companies on Radiology Diagnosis Please cases. To ensure reproducibility and to compare the vendors' LLMs under conditions as similar as possible, we accessed each model through its API and specified comparable parameters. Each model was provided with the clinical history and imaging findings of each case.
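For illustration, the sketch below shows how the three models could be queried through their Python client libraries with matched decoding parameters. The model identifiers, temperature, token limit, and prompt wording are assumptions for this example, not the exact settings used in this study.

```python
# Minimal sketch (not the study's actual code) of querying the three vendor APIs
# with matched parameters. Model identifiers, temperature, and the prompt template
# are illustrative assumptions.
import os
import openai
import anthropic
import google.generativeai as genai

PROMPT = (
    "Based on the following clinical history and imaging findings, "
    "provide the most likely diagnosis.\n\n{case_text}"
)
TEMPERATURE = 0.0   # assumed value, chosen for reproducibility
MAX_TOKENS = 1024   # assumed value


def ask_gpt4o(case_text: str) -> str:
    client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT.format(case_text=case_text)}],
        temperature=TEMPERATURE,
        max_tokens=MAX_TOKENS,
    )
    return resp.choices[0].message.content


def ask_claude3_opus(case_text: str) -> str:
    client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
    resp = client.messages.create(
        model="claude-3-opus-20240229",
        messages=[{"role": "user", "content": PROMPT.format(case_text=case_text)}],
        temperature=TEMPERATURE,
        max_tokens=MAX_TOKENS,
    )
    return resp.content[0].text


def ask_gemini15_pro(case_text: str) -> str:
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel("gemini-1.5-pro")
    resp = model.generate_content(
        PROMPT.format(case_text=case_text),
        generation_config=genai.types.GenerationConfig(
            temperature=TEMPERATURE,
            max_output_tokens=MAX_TOKENS,
        ),
    )
    return resp.text
```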

The results showed that the models ranked, from best to worst, as Claude 3 Opus, GPT-4o, and Gemini 1.5 Pro. Statistically significant differences were observed in all pairwise comparisons between the models.
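As an illustration of how such pairwise comparisons can be performed on paired per-case outcomes, the sketch below applies McNemar's test with a Bonferroni-adjusted significance threshold. The choice of test, the correction, and the simulated data are assumptions for this example, not a restatement of the study's statistical methods.

```python
# Minimal sketch of pairwise significance testing on paired per-case
# correct/incorrect outcomes. McNemar's test with Bonferroni correction is
# assumed here for illustration only.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar


def pairwise_mcnemar(scores_a: np.ndarray, scores_b: np.ndarray) -> float:
    """scores_* are 0/1 arrays of per-case correctness for two models on the same cases."""
    table = np.array([
        [np.sum((scores_a == 1) & (scores_b == 1)), np.sum((scores_a == 1) & (scores_b == 0))],
        [np.sum((scores_a == 0) & (scores_b == 1)), np.sum((scores_a == 0) & (scores_b == 0))],
    ])
    return mcnemar(table, exact=True).pvalue


# Hypothetical data: three models scored on the same set of cases.
rng = np.random.default_rng(0)
results = {name: rng.integers(0, 2, size=300)
           for name in ("Claude 3 Opus", "GPT-4o", "Gemini 1.5 Pro")}

pairs = [("Claude 3 Opus", "GPT-4o"),
         ("Claude 3 Opus", "Gemini 1.5 Pro"),
         ("GPT-4o", "Gemini 1.5 Pro")]
alpha = 0.05 / len(pairs)  # Bonferroni-adjusted threshold

for a, b in pairs:
    p = pairwise_mcnemar(results[a], results[b])
    verdict = "significant" if p < alpha else "n.s."
    print(f"{a} vs {b}: p = {p:.4f} ({verdict} at corrected alpha)")
```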

Notably, as of this writing, no technical report for GPT-4o has been released. Claude 3 Opus, however, reportedly outperforms Gemini 1.5 Pro on eight text-based benchmarks covering reasoning, coding, and mathematics [15].

With respect to medical natural language processing, Claude 3 Opus, despite being a general-purpose LLM, achieved accuracies of 74.9% (0-shot) and 75.8% (5-shot) on PubMedQA [16], nearly equivalent to the performance of Google's Med-PaLM 2 [17], an LLM specialized for medicine.

Regarding Gemini 1.5 Pro, one of its stated design goals is extended context length [11]. The developers have also released Gemini 1.5 Flash, a lightweight, fast model with slightly reduced performance [15]. These points suggest that the Gemini 1.5 series may prioritize real-world deployment, such as integration into devices, over benchmark performance.

In this study, the accuracy of GPT-4o was lower than that of GPT-4 reported by Ueda et al. [5]. One possible reason for this discrepancy is the stricter grading criteria we applied. Because the official answer criteria for Radiology Diagnosis Please cases are not publicly available, grading necessarily involved our own judgment, which is a limitation of this study.

Another limitation of this study is potential data leakage, as the answers to the cases used here are available online. A previous study of GPT-4 [5] found no significant difference in accuracy between cases from within the model's training period and those from outside it. Nevertheless, given the vast amount of data on which these models are trained, some information from these cases may have been inadvertently included in their training data, which could lead to an overestimation of the LLMs' performance in this study.

In previous studies in which GPT-4 Turbo with Vision was tasked with solving the Japanese Board of Radiology examination [18] and "Freiburg Neuropathology Case Conference" cases from the journal Clinical Neuroradiology [19], the model, given both images and textual information, did not outperform GPT-4 Turbo provided with textual information alone. Similarly, Claude 3 Opus, the best-performing model in this study, has been reported to show significantly worse diagnostic performance when given only the clinical history and key images, without textual descriptions of the imaging findings, than when textual descriptions of both the history and the imaging findings were provided [13].

In conclusion, at least at present, the main role of LLMs is not to replace radiologists but to assist in diagnosis, building on radiologists' accurate interpretation and verbalization of imaging findings.

However, to effectively utilize rapidly evolving LLMs in diagnostic radiology and maximize their potential benefits, continued research and evaluation will be essential. As these models advance and new capabilities emerge, ongoing studies will be crucial to understanding their strengths, limitations, and optimal applications in clinical practice.
