Utilizing large language models in breast cancer management: systematic review

We reviewed the literature on LLM applications in breast cancer management and care. The applications described included information extraction from clinical texts, question-answering for patients and physicians, manuscript drafting, and clinical management recommendations.

We observed a clear disparity in performance across tasks. The models were proficient at information extraction and at answering structured questions, with accuracy rates between 88 and 98%. Their effectiveness, however, dropped to 50–70% when making clinical decisions, underscoring a gap in their application. Breast cancer care demands attention to detail. LLMs excel at processing medical information quickly, yet they currently appear less adept at navigating complex treatment decisions. Breast cancer cases vary greatly, each distinguished by a unique molecular profile, clinical stage, and patient-specific requirements, so it is vital that LLMs adapt to the individual patient. While these models can assist physicians with routine tasks, they require further development before they can support personalized treatment planning.
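To make the information-extraction use case concrete, the sketch below prompts a chat model to pull structured fields from a fictitious pathology report. The prompt wording, model name, and output schema are our own illustrative assumptions, not reproduced from any included study; the snippet assumes the openai Python package (v1 or later) and a configured API key.

```python
# Illustrative sketch: structured information extraction from a fictitious
# breast pathology report. Prompt, model choice, and schema are assumptions
# for demonstration only; requires `pip install openai` and an
# OPENAI_API_KEY environment variable.
import json
from openai import OpenAI

client = OpenAI()

report = (
    "Invasive ductal carcinoma, left breast, 1.8 cm. "
    "ER positive (90%), PR positive (60%), HER2 negative. Grade 2."
)

prompt = (
    "Extract the following fields from the pathology report and answer in "
    "JSON with keys: histology, laterality, size_cm, er_status, pr_status, "
    "her2_status, grade. Use null for any missing field.\n\n"
    f"Report: {report}"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; the reviewed studies used GPT-3.5/GPT-4
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},  # constrain output to valid JSON
    temperature=0,  # deterministic decoding aids reproducibility
)

fields = json.loads(response.choices[0].message.content)
print(fields)  # e.g. {"histology": "invasive ductal carcinoma", ...}
```

Extracted fields of this kind can then be compared against a manually curated reference standard to compute accuracy figures such as those reported above.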

Interestingly, half of the studies included data from real patients, as opposed to publicly available or fictitious data. Across the published literature on LLMs in healthcare, most publications evaluate performance on public data, including board examinations and guideline-based question-answering (Sallam 2023). Such analyses may suffer from data contamination, since LLMs were trained on vast amounts of text from the internet; for commercial models such as ChatGPT, the composition of the training data is not disclosed. Furthermore, these applications do not necessarily reflect how the models perform in real-world clinical settings.
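As a rough illustration of why contamination matters, one simple (and admittedly crude) screen is to measure verbatim n-gram overlap between a benchmark item and the public text it may derive from. The 8-gram size and 0.5 threshold below are arbitrary assumptions for this sketch; real contamination audits are considerably more involved.

```python
# Crude contamination screen: flag a benchmark question whose word 8-grams
# largely appear verbatim in a public text the model may have been trained on.
# The n-gram size and threshold are arbitrary choices for this sketch.
def ngrams(text: str, n: int = 8) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(question: str, public_text: str, n: int = 8) -> float:
    q = ngrams(question, n)
    return len(q & ngrams(public_text, n)) / len(q) if q else 0.0

# Fictitious benchmark item and public source text.
question = (
    "which adjuvant therapy is recommended for er positive her2 negative "
    "early breast cancer according to the guideline"
)
public_text = (
    "adjuvant endocrine therapy is recommended for er positive her2 negative "
    "early breast cancer according to the guideline"
)

if overlap_ratio(question, public_text) > 0.5:
    print("Possible contamination: the item may appear in training data.")
```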

While some claim that LLMs may eventually replace healthcare personnel, major limitations and ethical concerns currently suggest otherwise (Lee et al. 2023). Using such models to augment physicians' performance is more practical, albeit still constrained by ethical issues (Shah et al. 2023). LLMs enable the automation of tasks that traditionally required human effort: the ability to analyze, extract, and generate meaningful textual information could reduce physicians' workload and decrease human errors.

Reliance on LLMs and their potential integration into medicine should be approached with caution, a point the limitations discussed in the included studies underscore. These models can generate false information (termed "hallucination") that is seamlessly and confidently woven into real information (Sorin et al. 2020a, b). They can also perpetuate disparities in healthcare (Sorin et al. 2021; Kotek et al. 2023). The inherent inability to trace the exact decision-making process of these algorithms is a major challenge for trust and clinical integration (Sorin et al. 2023a, b, c). LLMs are also vulnerable to cyber-attacks (Sorin et al. 2023a, b, c).

Furthermore, this study highlights the absence of uniform assessment methods for LLMs in healthcare, underlining the need to establish methodological standards for evaluating them. The goal is to enhance the comparability and quality of research. Such standards are critical for the safe and effective integration of LLMs into healthcare, especially for complex conditions like breast cancer, where personalized patient care is essential.
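As one possible starting point, and purely as our own sketch rather than an established standard, a structured evaluation record could capture the minimum details needed to compare studies:

```python
# A minimal, hypothetical reporting schema for LLM evaluations in healthcare.
# The field set is our suggestion only; no established standard is implied.
from dataclasses import dataclass

@dataclass
class LLMEvaluationRecord:
    model: str               # model and version, e.g. "GPT-3.5 (June 2023)"
    task: str                # "information_extraction", "qa", "clinical_decision", ...
    data_source: str         # "real_patients", "public", or "fictitious"
    n_cases: int             # number of evaluated cases
    metric: str              # e.g. "accuracy"
    score: float             # metric value, e.g. 0.92
    reference_standard: str  # how ground truth was defined, e.g. "tumor board"

# Hypothetical record for a single study result.
record = LLMEvaluationRecord(
    model="GPT-3.5 (June 2023)",
    task="information_extraction",
    data_source="real_patients",
    n_cases=100,
    metric="accuracy",
    score=0.92,
    reference_standard="manual chart review",
)
print(record)
```

Reporting even these few fields consistently would allow results such as the 88–98% versus 50–70% accuracy ranges above to be pooled and compared across studies.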

This review has several limitations. First, owing to the heterogeneity of the tasks evaluated in the studies, we could not perform a meta-analysis. Second, all included studies assessed ChatGPT (GPT-3.5), and only one study also evaluated GPT-4; we identified no publications on other available LLMs. Finally, generative AI is a rapidly expanding field, so manuscripts and applications may have been published after our review was performed. LLMs are continually being refined, and so is their performance.

To conclude, LLMs hold potential for breast cancer management, especially in text analysis and guideline-driven question-answering. Yet their inconsistent accuracy warrants cautious use, with thorough validation and ongoing supervision.
