Evaluating the accuracy and reliability of AI chatbots in disseminating the content of current resuscitation guidelines: a comparative analysis between the ERC 2021 guidelines and both ChatGPTs 3.5 and 4

In response to inquiries about the five chapters, ChatGPT-3.5 generated a total of 60 statements, whereas ChatGPT-4 produced 32. For each chapter, the AIs generated fewer statements than there were key messages. In total, 172 key messages were compared with the AI outputs (Table 1).

Table 1 Number of messages addressing the different guideline chapters

Completeness and actuality of ChatGPT

When asked about the key messages, ChatGPT-3.5 provided a reference to the official ERC website for accurate and up-to-date information, noting that its knowledge cut-off was in September 2021. ChatGPT-4, on the other hand, began its response with a brief summary of the chapter without mentioning any access restrictions to the guideline text.

Among the 172 key messages, ChatGPT-3.5 addressed 13 key messages completely and failed to address 123, whereas ChatGPT-4 addressed 20 key messages completely and did not address 132. Both versions of ChatGPT more frequently addressed BLS key messages completely than they did key messages from other chapters. In all the other chapters, more than two-thirds of the key messages were not addressed at all (Fig. 1).

Fig. 1 Results of the performance analysis of two ChatGPT versions in addressing the key messages of clinically relevant ERC guideline chapters

ChatGPT-3.5 partially addressed 36 key messages, whereas ChatGPT-4 partially addressed 20. The error type “superficial” was assigned 28 times for ChatGPT-3.5 and 16 times for ChatGPT-4. The error type “inaccurate” was noted four times in ChatGPT-3.5 and twice in ChatGPT-4. Four ChatGPT-3.5 statements and two ChatGPT-4 statements did not distinguish between evidence and recommendation (Fig. 1).

Example of the error type “superficial”

Key message on post-resuscitation care: “Use multimodal neurological prognostication using clinical examination, electrophysiology, biomarkers, and imaging”.

Corresponding ChatGPT-4 statement: “Post-resuscitation care should include early coronary angiography and revascularization when indicated, targeted temperature management, seizure control, multimodal prognostication, and organ donation when appropriate.”

Example of the error type “not discriminating between evidence and recommendation”

Key message on BLS: “AEDs can be used safely by bystanders and first responders”.

Corresponding ChatGPT-3.5 BLS statement: “Public access defibrillation: Encourage the use of public access defibrillation programs to improve early defibrillation in the community.” “Use of AEDs: Promptly apply an AED when available, ensuring the correct pad placement and following voice and visual prompts.”

Conformity

ChatGPT-3.5

Of the 60 output statements of ChatGPT-3.5, 46 (77%) were in accordance with the guidelines. Thirteen statements were only partially in conformity with the guidelines; these statements were distributed across all the chapters. In the chapter on special circumstances, one statement did not conform with the guidelines because it addressed a symptom (burns) that is associated with cardiac arrest due to electrocution and lightning strike but lies outside the scope of the resuscitation guidelines (Fig. 2). Among the 13 statements rated as partially in conformity with the guidelines, three were deemed too “superficial,” seven were “inaccurate,” one was “hallucinated,” and two were “not addressed in this chapter or guideline version” (Fig. 2).

Fig. 2 Results of the performance analysis of two ChatGPT versions for the accordance of the AI output with the ERC guideline text for the relevant chapters

An example of an inaccurate statement is “Begin chest compressions as soon as possible in a ratio of 30 compressions to 2 rescue breaths for adult and pediatric patients.” (ChatGPT-3.5, Chapter BLS).

A comparison of this statement with the 2021 BLS key messages on the ERC guidelines page (cprguidelines.eu) demonstrated that this recommendation applies only to adults.

An example of a hallucinated statement is “Initiate CPR immediately, and once the patient is in a hospital setting, consider rewarming and further management of potential complications such as pulmonary edema or hypoxemia.” (ChatGPT-3.5, Chapter special circumstances).

Comparing this statement with the 2021 key messages on special circumstances (ERC Guidelines, cprguidelines.eu) demonstrates that this recommendation is not part of the guideline text and that hypoxia and pulmonary edema must be addressed immediately. The temporal separation is hallucinated.

ChatGPT-4

Of the 32 output statements from ChatGPT-4, 27 (84%) conformed with the guidelines. For BLS and ALS, some statements were only partially in conformity with the guidelines (Fig. 2). One statement was deemed too “superficial,” one was “inaccurate,” and three were “not addressed in this chapter/guideline version.”

An example of a superficial statement (because it does not take the patient’s clinical condition into account): “Foreign body airway obstruction should be managed by encouraging the victim to cough, followed by back blows and chest thrusts (or abdominal thrusts in obese or pregnant victims).” (ChatGPT-4, Chapter BLS).

An example of a message “not addressed in this guideline version”: “Minimizing interruptions in chest compressions and avoiding excessive ventilation are essential to optimize blood flow and oxygen delivery during CPR.” (ChatGPT-4, Chapter ALS).

Rater agreement

For the comparison of the key messages with the AI statements (completeness analysis), the interrater reliability was moderate for both versions (Cohen’s kappa: 0.48 for ChatGPT-3.5 and 0.56 for ChatGPT-4).

In terms of the conformity of the AI output with the guidelines (conformity analysis), the interrater reliability, as measured by Cohen’s kappa, was markedly better for ChatGPT-4 (0.76) than for ChatGPT-3.5 (0.36).
