Patient- and clinician-based evaluation of large language models for patient education in prostate cancer radiotherapy

This study investigated the potential benefits of employing LLMs for patient education in men undergoing radiotherapy for localized prostate cancer. Clinicians evaluated the quality of responses from five different LLMs to typical patient questions regarding definitive radiotherapy for prostate cancer; in addition, the responses of ChatGPT‑4 were evaluated by patients. The clinicians’ evaluations underscored the relevance, correctness, and completeness of most responses. However, some responses were critiqued for lacking certain details or containing inaccuracies, as evidenced by the incorrect statements presented in the results section. This is particularly relevant in oncology, where misinformation, even at low rates, can have severe consequences for patients. Overall, we found significant differences between the LLMs regarding relevance and completeness, whereas there were no significant differences regarding correctness.

These findings align with existing data, despite methodological differences between the studies. The appropriateness and accuracy of information provided by ChatGPT on uro-oncological topics have generally been rated as moderate to high [20,21,22,23,24]. This was also observed for queries related to radiotherapy [25,26,27]. The study by Alasker et al. compared ChatGPT‑3.5 with ChatGPT‑4 and Google Bard, the predecessor of Google Gemini, regarding their responses to prostate cancer questions and found overall accurate, comprehensive, and easily readable responses. Similar to our study, the Google LLM provided easier-to-read responses [28]. In another study evaluating LLM responses to prostate cancer questions, there were significant differences between ChatGPT‑3.5, Microsoft Copilot, and Google Gemini [29]. However, several authors found that the quality of LLM responses declined with increasing specificity and complexity of the questions [22, 23, 30]. This decline is attributable to the models being trained on general internet text rather than on specialized medical data, resulting in a lack of specialized knowledge. Additionally, ChatGPT incorporates real-time information from the internet only upon specific request when generating responses; otherwise, its answers are based on training data extending up to 2023. Another criticism of LLMs is the occurrence of so-called “hallucinations” (also referred to as “fact fabrication” to avoid inappropriate anthropomorphisms) [2], in which the LLM invents incorrect information and presents it as factual truth. In our study, inaccuracies were primarily due to imprecise or insufficiently differentiated responses from the LLMs, for example, the statement that marker implantation particularly increases targeting accuracy in brachytherapy. While this is true for EBRT, it is incorrect for brachytherapy. We would not classify this as a “hallucination”; nevertheless, the response is incorrect and could confuse patients.

Furthermore, because LLMs generate responses anew each time, answers to the same prompt are not necessarily identical. In our queries, however, the differences between repeated responses from the LLMs were only marginal and were not deemed relevant by CT and PR. Nonetheless, other studies have noted significant variability in responses between iterations [31, 32].

To our knowledge, previous studies have evaluated the performance of LLMs exclusively from the perspective of clinicians or investigators. In our study, we also examined the patients’ perspective. In summary, patients found the information relevant to their experiences and accurate with regard to prostate radiotherapy. Most patients expressed confidence in the information received and stated that it would have helped them feel more informed about their treatment. However, prior to treatment, all patients had been informed about the therapy using a standardized information sheet. Therefore, the information from ChatGPT‑4 was not new to them; rather, the different and more active engagement with the information may have led to patients feeling better informed.

Additionally, most patients indicated a willingness to use ChatGPT for future medical inquiries, although 26% were neutral about or disagreed with this statement. Interestingly, 94% found ChatGPT’s responses easy to understand. This contrasts with our readability analysis, which revealed that the text generated by ChatGPT may be challenging for some readers. This has also been demonstrated in other studies [20, 27]. A possible explanation is that we surveyed patients after their treatment, when they were already familiar with the topic and terminology. In this context, it is also noteworthy that the responses provided by LLMs may vary depending on the prompts. For example, response quality may increase when the LLM is asked to also take into account the results of an internet search, as is possible for ChatGPT, Gemini, and Copilot. Another way to enhance the utility of LLMs in this context is the use of modified prompts: LLMs can be asked to answer in a particularly complete or accurate way or to direct their responses to a specific audience. Hershenhouse et al. prompted ChatGPT to rephrase its answers for medical laypersons, which resulted in more readable answers regarding prostate cancer [33]. In our study, prompting ChatGPT‑4 to answer the questions in an easy-to-understand way also improved the readability of its responses.
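Readability in such analyses is typically quantified with formula-based indices. Purely as an illustrative sketch (the specific index applied in our analysis is not restated in this section), the widely used Flesch Reading Ease score weights average sentence length and average syllables per word:

$$\mathrm{FRE} = 206.835 - 1.015 \cdot \frac{\text{total words}}{\text{total sentences}} - 84.6 \cdot \frac{\text{total syllables}}{\text{total words}}$$

Lower scores indicate harder-to-read text; for German-language material, adapted coefficient sets such as the Amstad variant are commonly used.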

Alongside the positive aspects of using LLMs for patient information, it is crucial to recognize their limitations, including the risk of incorrect answers mentioned above. Ethical concerns as well as security and privacy issues are frequently raised [23, 34,35,36]. Another common criticism of LLMs is their lack of human touch and empathy [37]. However, a study comparing responses from clinicians and ChatGPT to patient questions posted in an online forum found the chatbot’s answers to be significantly more empathetic [38]. Furthermore, ChatGPT is not necessarily intended to replace the clinician, as predicted and feared in some articles [5, 39]. Instead, it could serve as a source of supplementary information before or after a medical consultation, as was the case in the setting of our study.

Our study has some limitations that warrant consideration. Firstly, the questions were formulated by the study team rather than by actual patients, which may limit the representation of diverse clinical scenarios; in real-world use, patients might phrase questions with incomplete or incorrect information, increasing the risk of inaccurate responses. Secondly, queries and responses were in German, which may affect the generalizability of our findings, as the performance of ChatGPT could vary across other languages. Thirdly, there are no standardized and validated criteria for assessing the accuracy and reliability of AI-generated responses. Fourthly, our study initially focused on ChatGPT because it was the most widely used LLM in practice and had consistently demonstrated superior response quality compared to other LLMs for medical topics [16, 40, 41]. The comparison with the other LLMs was added a few months later during the review process, so the timing of the queries to ChatGPT‑4 does not coincide with that of the queries to the other LLMs. Finally, it is worth noting that a study like this can only provide a snapshot at a specific point in time; given the dynamic development of LLMs, results may change quickly with new versions or models.
