Performance of GPT-4 and GPT-3.5 in generating accurate and comprehensive diagnoses across medical subspecialties

1. INTRODUCTION

Generative Pre-trained Transformer 4 (GPT-4) is the latest advancement in artificial intelligence (AI) and possesses outstanding problem-solving skills, as demonstrated by its passing the United States Medical Licensing Examination.1 GPT-4 also shows promising accuracy in diagnosing complex medical cases.2,3 A competent AI consultant should generate comprehensive differential diagnoses in addition to providing an accurate final diagnosis, an ability that has not been tested in GPT-4.

The performance of GPT-4 may differ across subspecialties. Our previous small-scale studies demonstrated that GPT-4 performed admirably in diagnosing patients with infections, rheumatic diseases, and adverse drug reactions, but poorly in diagnosing patients with impaired cognition.3,4 Further evaluation of GPT-4 with a larger dataset of complex medical cases is therefore necessary.

Finally, to our knowledge, no study has compared the performance of GPT-4 with that of its predecessor, GPT-3.5. Therefore, case records with complex patient histories were acquired from the New England Journal of Medicine (NEJM) to compare the performance of GPT-4 with that of GPT-3.5.

2. METHODS

Of the 210 case records published in the NEJM from 2018 to late March 2023, 81 cases under the subspecialties of “cognitive impairment” (CI), “infectious disease” (ID), “rheumatology” (RH), and “drug reactions” (DR) were selected for analysis. Clinical information was copied to the chat-bots (GPT-3.5 and GPT-4) with prompts (Supplementary File 1, https://links.lww.com/JCMA/A232). Text descriptions of imaging results were input without modification. Laboratory results presented only in tables, and not in the text, were inputted manually.

Primary AI diagnoses were considered correct if they matched the final diagnoses provided by the authors. Differential diagnoses discussed in the case reports were accepted unless explicitly ruled out by the authors. In cases where causative pathogens were identified, diagnoses were considered accurate if the chat-bots offered the correct class of pathogens. Suggested investigations were considered correct if they led to the primary diagnosis.

A new scoring system was proposed to assess the diagnostic accuracy and comprehensiveness of the differential diagnoses of GPT-3.5 and GPT-4. Only the primary diagnosis and the first four differential diagnoses were considered for scoring. Five points were awarded for the first-ranking diagnosis (ie, the primary diagnosis), four points for second-ranking diagnoses, three points for third-ranking diagnoses, two points for fourth-ranking diagnoses, and one point for fifth-ranking diagnoses. The points were doubled if the diagnosis was correct.
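The arithmetic of this scheme is illustrated below in a minimal Python sketch. This is one reading of the rules, not the study's actual scoring code: it assumes that each of the top five diagnoses earns its rank-based points when it matches a differential accepted in the case report, and that those points double when it matches the authors' final diagnosis. All function and variable names are illustrative.

```python
# One reading of the proposed scoring scheme (illustrative, not the study's code).
# Rank 1 earns 5 points, rank 2 earns 4, ... rank 5 earns 1; points double for
# the correct (final) diagnosis.

RANK_POINTS = {1: 5, 2: 4, 3: 3, 4: 2, 5: 1}

def score_response(ranked_dx, accepted_ddx, final_dx):
    """Score the top five chat-bot diagnoses for one case.

    ranked_dx    : diagnoses in the order the chat-bot listed them
    accepted_ddx : differentials discussed (and not ruled out) in the case report
    final_dx     : the authors' final diagnosis
    """
    total = 0
    for rank, dx in enumerate(ranked_dx[:5], start=1):
        if dx == final_dx:
            total += RANK_POINTS[rank] * 2  # correct diagnosis: points doubled
        elif dx in accepted_ddx:
            total += RANK_POINTS[rank]      # plausible differential: rank points
    return total

# Example: correct diagnosis ranked second, two other accepted differentials:
# 5 + (4 * 2) + 3 = 16
print(score_response(
    ["giant cell arteritis", "polymyalgia rheumatica", "infection", "neoplasm", "lymphoma"],
    {"giant cell arteritis", "infection"},
    "polymyalgia rheumatica",
))
```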

Data are expressed as means. Statistical analyses were performed using Prism 9 software (GraphPad Software, San Diego, CA). Chi-square and Fisher’s exact tests were used to compare accuracy between subspecialties. Scoring differences between GPT-3.5 and GPT-4 were analyzed using t tests.
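As a cross-check, the sketch below reproduces one of the subspecialty comparisons in Python with SciPy rather than Prism. The 2 × 2 table layout is an assumption about how the counts were tabulated, using the ID versus non-ID primary-diagnosis counts reported in Table 1 (21 of 50 ID cases and 10 of 31 non-ID cases with an accurate primary diagnosis).

```python
# Illustrative 2x2 comparison with counts from Table 1; SciPy stands in for Prism 9.
from scipy.stats import fisher_exact

# Rows: ID cases, non-ID cases; columns: accurate vs inaccurate primary diagnosis.
table = [[21, 50 - 21],
         [10, 31 - 10]]

odds_ratio, p_value = fisher_exact(table)
print(f"OR = {odds_ratio:.2f}, p = {p_value:.3f}")  # OR = 1.52, matching Table 1
```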

3. RESULTS

Eighty-one patients were identified and categorized into four subspecialties: CI (n = 13), ID (n = 50), RH (n = 13), and DR (n = 5). Supplementary File 1, https://links.lww.com/JCMA/A232, summarizes the chat-bot responses. The primary diagnostic accuracy of GPT-4 was 38.3%, which improved to 71.6% after including the suggested differential diagnoses. Primary diagnoses were reached in 84.0% of all cases following the investigations suggested by GPT-4 (Table 1). The mean scores achieved by GPT-3.5 and GPT-4 were 8.72 and 12.59, respectively.

Table 1 - Performance of GPT-4 with NEJM cases

Accuracy of GPT-4        Accurate primary Dx, No.   Primary Dx included in DDx, No.   Investigation leads to primary Dx, No.
Whole series (n = 81)    31                         58                                68
CI category (n = 13)     4                          6                                 10
ID category (n = 50)     21                         40                                42
RH category (n = 13)     2                          8                                 12
DR category (n = 5)      4                          4                                 4

Subspecialty performance comparison   Odds ratio (95% confidence interval)
CI (n = 13) vs non-CI (n = 68)        0.67 (0.21-2.22)    0.26 (0.08-0.89)a    0.57 (0.14-2.23)
ID (n = 50) vs non-ID (n = 31)        1.52 (0.62-3.73)a   2.89 (1.05-7.45)a    1.01 (0.33-3.10)
RH (n = 13) vs non-RH (n = 68)        0.24 (0.05-1.01)    0.58 (0.18-1.83)     2.57 (0.35-29.66)
DR (n = 5) vs non-DR (n = 76)         7.26 (1.08-90.28)   1.63 (0.24-20.75)    0.75 (0.11-9.87)

Odds ratios and 95% confidence intervals were calculated by Fisher’s exact test.

CI = cognitive impairment; DDx = differential diagnoses; DR = drug reactions; Dx = diagnosis; GPT-4 = Generative Pre-Trained Transformer 4; ID = infectious disease; NEJM = New England Journal of Medicine; RH = rheumatology.

a Odds ratios and 95% confidence intervals were calculated by the Chi-square test.

GPT-4 was superior in making the primary diagnosis in the DR category and in providing differential diagnoses that included the primary diagnosis in the ID category. In the CI category, GPT-4 underperformed in providing differential diagnoses (Table 1).

GPT-4 scored higher than GPT-3.5 overall and in the individual subspecialties, although the difference was not significant in the DR category (Table 2).

Table 2 - Comparison of performance between GPT-4 and GPT-3.5

                        GPT-4 score (mean)   GPT-3.5 score (mean)   Difference (95% confidence interval)   p
Whole series (n = 81)   12.59                8.72                   3.88 (2.89-4.86)                       <0.001
CI category (n = 13)    13.15                9.31                   3.86 (1.50-6.20)                       0.0064a
ID category (n = 50)    12.76                8.44                   4.32 (2.98-5.67)                       <0.001
RH category (n = 13)    12.46                9.23                   3.23 (1.12-5.34)                       0.0059
DR category (n = 5)     9.8                  8.6                    1.2 (−5.10 to 7.50)                    0.7977a

The p values were calculated using the paired t test.

CI = cognitive impairment; DR = drug reactions; GPT-4 = Generative Pre-Trained Transformer 4; GPT-3.5 = Generative Pre-Trained Transformer 3.5; ID = infectious disease; RH = rheumatology.

a The p values were calculated using the unpaired t test.


4. DISCUSSION

GPT-4 exhibited a diagnostic accuracy comparable to that reported in a previous study.2 Diagnostic accuracy can be further improved by conducting the investigations suggested by GPT-4.

GPT-4 performed best in infectious diseases and drug reactions but less favorably in cognitive impairment. These results are consistent with our previous findings.3,4 Infectious disease cases typically present quantitative information, which can be incorporated effectively into diagnoses. In contrast, cases of cognitive impairment involve complex qualitative descriptions, making it challenging to generate accurate diagnoses.

In addition to assessing the diagnostic accuracy of GPT-4, our scoring system accounts for the comprehensiveness of differential diagnoses in complex cases, which is critical for patient safety and as important as making an accurate final diagnosis.

GPT-4 outperformed GPT-3.5 except in the DR category, where the lack of a significant difference is likely explained by the small sample size.

GPT-4 offers reasonable differential diagnoses and investigations even when provided with limited information, such as when complete histories and investigation results are withheld. These observations warrant further evaluation of AI performance across the different stages of diagnosis.

Our study was limited by its small case volume, particularly with respect to drug reactions. Studies with larger sample sizes may provide more comprehensive conclusions.

GPT-4 offers comprehensive differential diagnoses and appropriate investigations, in addition to a reasonably accurate primary diagnosis. Its performance varies across subspecialties. Further studies may help usher in a new era in which AI can reliably support patient management.

APPENDIX A. SUPPLEMENTARY DATA

Supplementary data related to this article can be found at https://links.lww.com/JCMA/A232.

REFERENCES

1. Lee P, Bubeck S, Petro J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N Engl J Med 2023;388:1233–9.
2. Kanjee Z, Crowe B, Rodman A. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA 2023;330:78–80.
3. Shea YF, Lee CMY, Ip WCT, Luk DWA, Wong SSW. Use of GPT-4 to analyze medical records of patients with extensive investigations and delayed diagnosis. JAMA Netw Open 2023;6:e2325000.
4. Shea YF, Ma NC. Limitations of GPT-4 in analyzing real-life medical notes related to cognitive impairment. Psychogeriatrics 2023;23:885–7.