Evaluation of Prompts to Simplify Cardiovascular Disease Information Generated Using a Large Language Model: Cross-Sectional Study

Abstract

In this cross-sectional study, we evaluated the completeness, readability, and syntactic complexity of cardiovascular disease prevention information produced by GPT-4 in response to 4 kinds of prompts.

J Med Internet Res 2024;26:e55388

doi:10.2196/55388

Introduction

Many web-based patient educational materials about cardiovascular disease (CVD) are inaccessible to the general public. Artificial intelligence (AI) chatbots powered by large language models (LLMs) are a potential source of public-facing CVD information. Generative language models present risks related to information quality but also opportunities for producing accessible information about CVD at scale, which could advance the American Heart Association’s 2020 impact goals related to health literacy. Recent studies have used LLMs to simplify medical information in different contexts, but quantitative comparison of prompt engineering strategies is needed to assess and optimize performance and to ensure that the rapid deployment of clinical AI tools proceeds in an equitable manner. In this cross-sectional study, we evaluated the completeness, readability, and syntactic complexity of CVD prevention information produced by GPT-4 in response to 4 kinds of prompts.


Methods

A set of 25 questions about fundamental CVD prevention topics was drawn from a previous study, which found that the GPT-3.5 version of ChatGPT provided generally appropriate responses. We devised 3 prompt strategies for generating simplified ChatGPT responses to these questions: a zero-shot prompt to use plain and easy-to-understand language, a one-shot prompt with a sample simplified passage on an unrelated subject, and a combined prompt to use simplified language and cover specific key points (which we termed “rubric prompting”). Responses to these 3 prompts were compared to baseline responses for which the prompt contained only the question about CVD. The full set of responses is provided in the supplementary material.
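The 4 prompt types described above can be sketched as simple templates. This is a minimal illustration only: the instruction wording, the sample passage, and the names `build_prompt` and `SAMPLE_PASSAGE` are assumptions made for this example and are not the prompts used in the study.

```python
# Hypothetical wording throughout: the study's exact prompts are not reproduced here.
SAMPLE_PASSAGE = (
    "Original: Photosynthesis is the process by which plants convert light "
    "energy into chemical energy.\n"
    "Simplified: Plants use sunlight to make their own food."
)


def build_prompt(question, strategy, key_points=None):
    """Return prompt text for one of the 4 strategies compared in the study."""
    if strategy == "baseline":
        # Baseline: the prompt contains only the question.
        return question
    if strategy == "zero_shot":
        # Zero-shot: the question plus an instruction to use plain language.
        return f"{question}\nUse plain and easy-to-understand language."
    if strategy == "one_shot":
        # One-shot: adds one simplified sample passage on an unrelated subject.
        return (
            "Here is an example of simplified writing:\n"
            f"{SAMPLE_PASSAGE}\n\n"
            f"{question}\nAnswer in similarly plain language."
        )
    if strategy == "rubric":
        # Rubric: plain-language instruction plus key points to cover.
        if not key_points:
            raise ValueError("rubric prompts need at least one key point")
        points = "\n".join(f"- {p}" for p in key_points)
        return (
            f"{question}\nUse plain and easy-to-understand language. "
            f"Be sure to cover these key points:\n{points}"
        )
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    # Placeholder key points; a real rubric would come from clinical experts.
    print(build_prompt("How can I quit smoking?", "rubric",
                       ["placeholder key point 1", "placeholder key point 2"]))
```

Keeping the question separate from the strategy makes it easy to generate all 4 variants for each of the 25 questions from a single list.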

For each question and prompt type, 3 independent responses were generated between April and June 2023, using the GPT-4 version of ChatGPT with default parameters, which was available from OpenAI through a ChatGPT Plus subscription. Two authors, who are preventive cardiologists (AS and NWK), scored the responses as “complete,” “incomplete,” or “inconsistent” according to a custom rubric; disagreements were resolved by consensus. For all generated responses, we calculated 5 readability scores, using Readability Studio Professional (version 2019.3; Oleander Software), and 2 measures of syntactic complexity, using the L2 Syntactic Complexity Analyzer (version 3.3.3), as described previously.
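As an illustration of what a readability score measures, the sketch below computes the Flesch-Kincaid Grade Level from its standard published formula, 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. It is not the implementation in Readability Studio Professional; the tokenization and the vowel-group syllable heuristic are simplifying assumptions, so scores will differ somewhat from those of dedicated software.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowels."""
    lowered = word.lower()
    count = len(re.findall(r"[aeiouy]+", lowered))
    # Heuristic: a trailing "e" is usually silent, except in "-le" endings.
    if lowered.endswith("e") and not lowered.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text):
    """Flesch-Kincaid Grade Level; lower values indicate easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


if __name__ == "__main__":
    print(round(flesch_kincaid_grade("The cat sat. The dog ran."), 2))
```

Shorter sentences and shorter words both reduce the score, which is why plain-language prompts tend to lower it.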

Differences from baseline completeness were assessed using the Fisher exact test, and 2-sample readability and syntactic complexity comparisons were done using the Wilcoxon signed rank test. Statistical significance was set at P<.05.
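The Fisher exact test for a 2×2 table of complete versus not-complete counts can be sketched with the standard library alone. This is a minimal sketch, assuming a two-sided test that sums the probabilities of all tables no more likely than the observed one; the function name `fisher_exact_two_sided` is introduced here for illustration and is not the authors' analysis code.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    observed = prob(a)
    # Sum over every table that is no more probable than the observed one;
    # the small relative tolerance guards against floating-point ties.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-7)
    )


if __name__ == "__main__":
    # Rows: prompt type; columns: complete, not complete.
    print(fisher_exact_two_sided(20, 5, 8, 17))
```

For example, `fisher_exact_two_sided(20, 5, 8, 17)` compares 20 of 25 complete responses with 8 of 25 and gives a P value of about .001.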


Results

Baseline responses to 80% (20/25) of the questions were scored as “complete” (Table 1). Completeness was significantly lower for both the zero-shot (8/25, 32%; P=.001) and one-shot (8/25, 32%; P=.001) simplification prompts but not for the rubric prompts (25/25, 100%; P=.05). All 3 prompts significantly improved readability according to every metric and lowered at least 1 measure of syntactic complexity (Table 2).

Table 1. Evaluation of the completeness of cardiovascular disease information generated using 4 large language model prompt strategies. Each cell shows the consensus grade for that prompt^a.

| Question | Baseline | Plain language (zero-shot prompt) | Plain language (one-shot prompt) | Plain language (rubric prompt) |
| --- | --- | --- | --- | --- |
| How can I prevent heart disease? | Complete | Complete | Complete | Complete |
| What is the best diet for the heart? | Complete | Complete | Complete | Complete |
| What is the best diet for high blood pressure and high cholesterol? | Complete | Complete | Complete | Complete |
| How much should I exercise to stay healthy? | Complete | Inconsistent | Incomplete | Complete |
| Should I do cardio or lift weights to prevent heart disease? | Complete | Inconsistent | Inconsistent | Complete |
| How can I lose weight? | Complete | Inconsistent | Inconsistent | Complete |
| How can I decrease LDL^b? | Inconsistent | Incomplete | Incomplete | Complete |
| How can I decrease triglycerides? | Complete | Complete | Complete | Complete |
| What is lipoprotein(a)? | Complete | Incomplete | Incomplete | Complete |
| How can I quit smoking? | Complete | Complete | Inconsistent | Complete |
| What are the side effects of statins? | Complete | Inconsistent | Complete | Complete |
| I have muscle pain with a statin. What should I do? | Inconsistent | Inconsistent | Complete | Complete |
| My cholesterol is still high and I’m already on a statin. What should I do? | Inconsistent | Incomplete | Incomplete | Complete |
| What medications can reduce cholesterol other than statins? | Complete | Complete | Inconsistent | Complete |
| What is ezetimibe? | Complete | Inconsistent | Incomplete | Complete |
| What are Repatha and Praluent? | Complete | Incomplete | Incomplete | Complete |
| What is inclisiran? | Complete | Incomplete | Incomplete | Complete |
| What are the side effects of Repatha and Praluent? | Complete | Complete | Inconsistent | Complete |
| Should I take aspirin to prevent heart disease? | Complete | Complete | Complete | Complete |
| My cholesterol panel shows triglycerides 400 mg/dL. How should I interpret this? | Complete | Inconsistent | Complete | Complete |
| My LDL is 200 mg/dL. How should I interpret this? | Inconsistent | Incomplete | Incomplete | Complete |
| What does a coronary calcium score of 0 mean? | Complete | Incomplete | Incomplete | Complete |
| What does a coronary calcium score of 100 mean? | Inconsistent | Inconsistent | Incomplete | Complete |
| What does a coronary calcium score of 400 mean? | Complete | Incomplete | Incomplete | Complete |
| What genetic mutations can cause high cholesterol? | Complete | Inconsistent | Incomplete | Complete |

^a For every prompt strategy, we generated 3 responses to each of the 25 questions about cardiovascular disease prevention. “Complete” indicates that all 3 responses received a full score according to our coverage rubric, “Incomplete” indicates that all 3 responses received less than a full score, and “Inconsistent” indicates that some responses were “Complete” and others were “Incomplete.” Grades shown were determined by consensus between 2 reviewers.

^b LDL: low-density lipoprotein.
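The grading rule in the first footnote to Table 1 can be expressed as a short function. This is a sketch under the assumption that each of the 3 responses has already been marked as receiving a full rubric score or not; the names `question_grade` and `completeness_rate` are hypothetical helpers introduced for illustration.

```python
def question_grade(full_scores):
    """Collapse per-response results into one grade for a question.

    full_scores: list of booleans, True if a response received a full
    score on the coverage rubric.
    """
    if not full_scores:
        raise ValueError("at least one response is required")
    if all(full_scores):
        return "Complete"
    if not any(full_scores):
        return "Incomplete"
    # Some responses received a full score and others did not.
    return "Inconsistent"


def completeness_rate(grades):
    """Fraction of questions whose grade is 'Complete'."""
    return sum(g == "Complete" for g in grades) / len(grades)


if __name__ == "__main__":
    print(question_grade([True, True, False]))  # mixed results
    print(completeness_rate(["Complete"] * 8 + ["Incomplete"] * 17))
```

Under this rule a question counts toward the completeness rate only when all 3 responses receive a full score, so an “Inconsistent” grade is treated the same as “Incomplete” in the rate.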

Table 2. Comparison of the readability and syntactic complexity of cardiovascular disease information generated using 4 large language model prompt strategies^a.

| Measure | Baseline, median (IQR) | Plain language (zero-shot prompt): value, median (IQR) | Plain language (zero-shot prompt): difference from baseline^b, median (IQR; P value) | Plain language (one-shot prompt): value, median (IQR) | Plain language (one-shot prompt): difference from baseline^c, median (IQR; P value) | Plain language (rubric prompt): value, median (IQR) | Plain language (rubric prompt): difference from baseline^d, median (IQR; P value) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Readability formulas | | | | | | | |
| FKGL^e | 13.4 (12.3 to 15.4) | 9.7 (7.6 to 11.1) | −4.2 (−5.7 to −3.1; <.001) | 3.8 (2.9 to 5.3) | −9.4 (−11.1 to −8.3; <.001) | 8.0 (7.3 to 9.5) | −5.3 (−6.6 to −4.0; <.001) |
| SMOG^f | 14.8 (13.7 to 16.5) | 12.1 (10.2 to 13) | −3.6 (−4.5 to −2.4; <.001) | 7.9 (7.2 to 9.2) | −7.1 (−8.2 to −5.7; <.001) | 10.9 (10.4 to 11.9) | −4.1 (−5.4 to −3.0; <.001) |
| GFI^g | 14.0 (12.1 to 17) | 11.3 (8.0 to 13) | −4.0 (−5.6 to −2.7; <.001) | 6.3 (5.4 to 7.6) | −7.5 (−10.3 to −6.0; <.001) | 10.2 (8.9 to 11.3) | −3.9 (−6.3 to −2.8; <.001) |
| FORCAST^h | 11.5 (11.2 to 11.9) | 10.2 (9.8 to 10.7) | −1.3 (−1.8 to −0.9; <.001) | 8.8 (8.2 to 9.4) | −2.7 (−3.4 to −2.3; <.001) | 9.7 (9.3 to 10.2) | −1.9 (−2.3 to −1.4; <.001) |
| CLI^i | 13.8 (13.2 to 15.1) | 10.4 (9.0 to 11.8) | −3.7 (−4.7 to −2.4; <.001) | 6.2 (5.1 to 7.3) | −7.9 (−9.0 to −6.5; <.001) | 9.4 (9.0 to 10.4) | −4.5 (−5.4 to −3.5; <.001) |
| Syntactic complexity^j | | | | | | | |
| MLC^k | 15.0 (12.7 to 16.6) | 12.3 (10.5 to 15.5) | −1.8 (−4.4 to 0.9; .01) | 8.7 (7.8 to 10.7) | −5.7 (−7.6 to −3.4; <.001) | 9.6 (8.9 to 10.3) | −4.2 (−6.9 to −3.1; <.001) |
| DC/T^l | 0.3 (0.2 to 0.5) | 0.3 (0.2 to 0.5) | 0 (−0.2 to 0.1; .36) | 0.2 (0.1 to 0.3) | −0.2 (−0.3 to −0.1; <.001) | 0.6 (0.4 to 0.7) | 0.2 (0.1 to 0.4; >.99) |

^a For every prompt strategy, we generated 3 responses to each of the 25 questions about cardiovascular disease prevention. Lower scores indicate higher readability.

^b Difference between responses to the baseline prompts and prompts for plain language. P values are from a 1-tailed Wilcoxon signed rank test.

^c Difference between responses to the baseline prompts and prompts for plain language with an example. P values are from a 1-tailed Wilcoxon signed rank test.

^d Difference between responses to the baseline prompts and prompts for plain language with coverage. P values are from a 1-tailed Wilcoxon signed rank test.

^e FKGL: Flesch-Kincaid Grade Level.

^f SMOG: Simple Measure of Gobbledygook.

^g GFI: Gunning Fog Index.

^h FORCAST: Ford, Caylor, Sticht formula.

^i CLI: Coleman-Liau Index.

^j MLC is a measure of elaboration at the clause level (ie, number of words per clause), and DC/T is a measure of subordination.

^k MLC: mean length of clause.

^l DC/T: dependent clauses/T-unit.


Discussion

We found that zero- and one-shot prompting of GPT-4 to produce simplified information about CVD generated more readable but less comprehensive responses. This loss of information, however, could be averted by combining a zero-shot simplification prompt with a short reminder to include critical information (rubric prompting). Our findings highlight the importance of optimizing prompts and incorporating expert clinical judgment when considering the use of LLMs to produce patient education materials, including AI-drafted replies to patient messages. Accordingly, prospective guidelines for the use of AI in medicine should address best practices for prompt engineering, standardized evaluation of model outputs, and outreach to clinicians and the public to cultivate relevant skills. Such guidelines will provide important parameters for clinician-in-the-loop information simplification systems, which have already been deployed to improve the accessibility of surgical consent forms.

The limitations of this study include the evaluation of a single model at a specific point in time and the absence of reading comprehension data from patients. Since the prompt strategies developed herein are not model specific, it should be straightforward to extend these strategies to other LLMs. Future research should further evaluate trade-offs between prompt engineering and fine-tuning of LLMs for medical applications using multiple models. It would also be useful to integrate ongoing user testing with structured health literacy assessment of generated responses to identify types of simplification that are especially important for improving patient understanding.

Acknowledgments

We thank Stephen Blackwelder, PhD (Duke University Health System), for helpful discussions and comments on the manuscript and Vasudha Mishra, MBBS (AIIMS Patna), for assistance with data collection. JPD was supported by a Harvard Data Science Fellowship and the Institute of Collaborative Innovation at the University of Macau. The funders had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.

Authors' Contributions

VM, AS, and JPD designed the study. VM and JPD generated the ChatGPT responses and performed the computational and statistical analyses. AS and NWK performed the completeness scoring. VM and JPD wrote the manuscript. All authors edited and reviewed the manuscript.

Conflicts of Interest

None declared.


Abbreviations

AI: artificial intelligence
CVD: cardiovascular disease
LLM: large language model

Edited by T de Azevedo Cardoso; submitted 11.12.23; peer-reviewed by R Mpofu; comments to author 12.01.24; revised version received 25.01.24; accepted 31.01.24; published 22.04.24.

Copyright

©Vishala Mishra, Ashish Sarraju, Neil M Kalwani, Joseph P Dexter. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 22.04.2024.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.

