Proof-of-concept study of a small language model chatbot for breast cancer decision support – a transparent, source-controlled, explainable and data-secure approach

Main findings

Over the last two decades, significant improvements in diagnostic and therapeutic methods have enhanced treatment outcomes and survival rates in clinical oncology, producing a growing body of meaningful scientific evidence (Taylor et al. 2023; Subbiah 2023). Partially automated data processing and preparation with the help of AI is seen as a decisive accelerator in breaking down this flood of scientific knowledge for clinical practitioners (Meskó and Görög 2020; Johnson et al. 2021). Owing to their strength in text processing, Large Language Models (LLMs) are discussed as a meaningful technological solution to this problem (Benary et al. 2023; Sorin et al. 2024). Current barriers to the deployment of LLMs in clinical settings include inadequate control over the sources used for decision-making, a lack of transparency in the decision-making process, and concerns about the security of health data being processed via decentralized international servers (Sorin et al. 2024). Because they are adaptable and can be hosted on local servers, Small Language Models (SLMs) are gaining attention for their potential to address these issues (Schick and Schütze 2020; Dhunoo 2024).

This study is, to our knowledge, the first to adapt an open-source SLM to a clinical oncology guideline. It proves the concept of aligning an SLM with the national evidence consensus for an oncological entity, achieving reliable initial clinical accuracy for breast cancer by providing binary treatment decisions that are consistent with a conventional tumor board’s expert recommendations. Additionally, it achieves concordance levels comparable to those of publicly available LLMs such as ChatGPT4 and GPT3.5. This finding underscores the technical functionality of the SLM design concept and suggests that SLMs could offer a secure solution for health data processing by operating on local servers or computers, which may help to overcome the previously stated barriers to the deployment of LLMs in clinical care. In line with the Explainable AI (XAI) approach, the traceability of the decision-making process can be enhanced by restricting the model’s decision pathways and narrowing the AI system’s scope (Kundu 2021). Consequently, the developed BC-SLM remains transparent in its decision-making by disclosing the breast cancer guideline sources it references and how these inform its treatment recommendations. This adaptation of an open-source SLM offers a transparent, source-controlled, explainable, and data-secure approach for using language models in clinical oncology, enabling the processing of patient-specific health data in alignment with established national and international diagnostic and treatment standards.

Further findings

Expanding the potential of LLMs to SLMs in breast cancer care

In earlier work exploring the practical use of LLMs in breast cancer care, Rao et al. showcased the successful employment of GPT3.5 for radiology imaging evaluations, confirming its value in breast cancer care with regard to mammography analysis (Rao et al. 2023). Haver et al. showcased the capability of a chatbot to educate patients on breast cancer prevention and screening measures (Haver et al. 2023). Additionally, Choi et al. exhibited the potential of custom prompts for LLMs in retrieving clinical insights from extensive breast cancer patient records, encompassing multimodal data from pathology and ultrasound reports (Choi et al. 2023). In the context of decision support, Lukac et al. and Sorin et al. conducted explorative studies comparing the quality of decision-making between GPT3.5 and tumor boards (Lukac et al. 2023; Sorin et al. 2023). The recent review article by Sorin et al. synthesizes the current literature on the utilization of LLMs in breast cancer management (Sorin et al. 2024). The overview identifies the processing of textual data and disease-related question answering as the most promising application areas in breast cancer care. However, the authors conclude that the evidence regarding the deployment of LLMs in breast cancer management remains at an early stage of feasibility exploration, highlighting a critical need for rigorous clinical validation and continuous monitoring going forward (Sorin et al. 2024). This study ties into these findings by underscoring the potential of language models for textual processing and decision support and by extending them from Large to Small Language Models in the field of breast cancer care.

Nevertheless, when interpreting these results, it is crucial to note that this proof-of-concept represents only an early step in the further development of language models in the medical domain. As stated, the findings of this study demonstrate that the adaptation of an SLM may help to overcome prevailing issues with the use of LLMs. However, the proof-of-concept design entails significant limitations in the clinical interpretability of the results. An exploratory proof-of-concept study for a health technology tool aims to assess the initial viability, functionality, and potential impact of the tool in a controlled, oftentimes preclinical, setting, providing preliminary insights only. In contrast, an early feasibility study focuses on evaluating the tool’s safety, usability, and basic efficacy in a real-world context, while a clinical validation study rigorously tests its effectiveness and reliability in a larger, more diverse patient population to establish its clinical value. In the following, we address these limitations of the current state of knowledge and outline how corresponding studies can gradually increase the evidence level in the use of language models in breast cancer care and clinical oncology.

Limitations: iterative technological modification towards clinical validation

This study serves as a preclinical proof-of-concept, evaluating a newly developed technological model within a preclinical simulation environment and focusing on its initial clinical accuracy and technical functionality. It is crucial to emphasize that this study does not offer clinical validation of the performance of either LLMs or SLMs. The results may indicate patterns of concordance across different cancer subtypes or stages (e.g., DCIS, TNBC, etc.) and varying levels of agreement for specific treatment modalities (e.g., the relatively low concordance and non-significant Cohen’s kappa for GT), but these observations merely suggest possible differences. Owing to the exploratory nature of this proof-of-concept, such patterns are beyond the scope of this evaluation and warrant investigation in future studies. It is known that language models continue to have crucial problems with reliability and reproducibility (Sorin et al. 2024). Thus, it should not be directly inferred from the results that one model is better than the other or performs better or worse for different treatment modalities and different tumor stages or subtypes. As described by Sorin et al. in their recent literature review, the exploration of language models in breast cancer care is at an early stage of development and requires ongoing supervision and monitoring as the practical application of language models in clinical oncology evolves (Sorin et al. 2024). Following technological refinement, future feasibility and clinical validation studies should use designs that incorporate larger study populations and more diverse settings to allow for comprehensive validation. Additionally, preclinical studies should include simulation settings in which various users assess user-specific aspects and hybrid decision-making. Important limitations are explained in more detail below in order to avoid misinterpretation of the results through conclusions that are outside the scope of this study.
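
To make the concordance measure mentioned above concrete, the following is a minimal sketch of how Cohen's kappa can be computed for binary treatment decisions. It is not the analysis code of this study: the function name `cohen_kappa` is a placeholder, and the two decision vectors are invented purely for illustration (1 = treatment recommended, 0 = not recommended).

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who label the same cases.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from the marginal rates.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of cases")

    n = len(rater_a)
    # Observed agreement: share of cases with identical decisions.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of the marginal label frequencies, summed over labels.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(count_a) | set(count_b)
    p_e = sum(count_a[label] * count_b[label] for label in labels) / (n * n)

    if p_e == 1.0:
        # Both raters used a single identical label; kappa is undefined here.
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical decisions for ten patient profiles (NOT study data).
tumor_board = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
model = [1, 1, 0, 0, 0, 1, 1, 1, 0, 1]

kappa = cohen_kappa(tumor_board, model)
print(f"kappa = {kappa:.3f}")
```

In this invented example the two raters agree on 8 of 10 cases (p_o = 0.80), while the marginal rates give a chance agreement of p_e = 0.52, so kappa is roughly 0.58. With only a handful of cases per treatment modality, such a value carries a wide uncertainty, which is one reason why per-modality comparisons are not drawn from the present results.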

Firstly, the study uses a small number of patient profiles for testing. This process was chosen to comprehensively cover the spectrum of patho- and immunomorphological types of breast carcinoma in accordance with the recommendations of Sorin et al. and the findings of previous studies (Sorin et al. 2024). Nevertheless, this approach does not allow for a conclusive comparison of treatment modalities or cancer subtypes (i.e., DCIS versus invasive, HER2 positive versus HER2 negative, Luminal A versus Luminal B, TNBC versus HR+, or early-stage versus metastatic carcinoma), nor should one expect the cases to produce, for instance, an age distribution that aligns with the epidemiological or demographic data of a specific population, e.g., at the national level for a specific country. This consideration is crucial for the subsequent feasibility studies and the comprehensive clinical validation of the technology. A crucial step in further developing the system will be to test it with a more diverse or nationally representative study cohort, encompassing hundreds to thousands of patient profiles. This will provide more robust evidence for identifying particularly useful applications of language models in clinical oncology, specifically for determining whether language models offer significant performance benefits for certain treatment modalities or specific cancer subtypes and stages. Secondly, the study establishes the recommendations of a single multidisciplinary tumor board as the gold standard. Several international research groups, including the EURECCA and EUSOMA networks, have carried out extensive observational studies, uncovering significant differences in the treatment choices and outcomes for breast cancer between certified centers (Derks et al. 2018; van Walle et al. 2023). There is considerable scope for decision-making in breast cancer treatment and, therefore, future studies should incorporate a larger group of national and international centers to enable a more balanced basis for comparison (Derks et al. 2018). Thirdly, the study is based on the German breast cancer guideline and was carried out in a German gynecological center, whereas national standards and guidelines for breast cancer care decision-making vary significantly. The results should therefore be interpreted on the basis of German standards, although the intuitive interpretation may vary depending on the international background of the reader.

Research perspective: feasibility of guideline navigation and the perspective on SLM-powered oncological decision support

Facing the growing body of meaningful evidence in breast cancer care, clinical practitioners are confronted with increasingly lengthy and complex guidelines that they can use to guide their clinical decision-making in order to bring treatments in line with the current state of scientific knowledge (Porter et al. 2023). To improve accessibility, guideline organizations and medical societies are investing considerable financial and personnel resources in synthesizing this extensive research into guidelines (Boca et al. 2018). In German gynecological oncology, this results in extensive evidence syntheses, e.g., for breast (467 pages) or endometrial cancer (354 pages) (Leitlinienprogramm Onkologie 2021; Leitlinienprogramm Onkologie 2023). Other oncological specialties offer even more complex evidence syntheses, e.g., for lung (592 pages) and prostate cancer (473 pages) (Leitlinienprogramm Onkologie 2024a, b). These guidelines, which incorporate references to up to thousands of primary publications in their metadata, e.g., over 1600 primary publications for the lung cancer guideline, need to be updated on a regular basis to reflect the rapid advancement of medical knowledge.

The application of SLMs may offer a prospective solution to bridge the gap between cutting-edge oncological evidence and clinical practice. The study showcases how the localized, guideline-based chatbot provides an interactive platform that goes beyond simple keyword search and responds to specific queries, thereby facilitating quick navigation to pertinent sections within the extensive 467-page German breast cancer guideline. The future adaptation of guideline-based SLMs may provide an affordable and feasible solution that helps to lower the information asymmetry between state-of-the-art oncological research and clinical oncology through efficient guideline navigation. In qualitative assessment, the BC-SLM strictly conforms to the breast cancer treatment recommendations of the DGGG guideline while all data processing occurs on the local computer in the hospital. This can enable a transparent and explainable decision-making process in alignment with the XAI approach. Users can understand the decision-making process by consulting the specified guideline sections or by engaging with the chatbot, which is a necessity for building trust between the medical user and the AI (Kundu 2021). Owing to the simplified architecture of the SLM, the clinical outputs become more transparent and interpretable. The possibility of focusing the SLM on preselected evidence and high-quality scientific data allows for the adaptation of the model to a personalized and disease-specific patient pathway. A future area of exploration might be the dynamic coupling of the BC-SLM to existing machine-readable guideline corpora. For example, deploying an application programming interface to national or international guideline apps, e.g., the “Oncology Guidelines App” of the oncology guideline program (Leitlinienprogramm Onkologie) of the German Cancer Society (Deutsche Krebsgesellschaft, DKG) (Borchert et al. 2022), could allow for access to the most current evidence synthesis and the underlying metadata with its primary literature. In the longer term, this may also provide a valuable foundation for steering device modification towards more reliable oncological decision support. Based on the findings of the study, a further area of exploration might be the integration of predefined treatment algorithms, knowledge graphs and physicians' decision trees of the breast cancer patient pathway into the newly developed SLM design concept to minimize the prevailing challenge of language model hallucination and optimize decision reliability and accuracy (Ji et al. 2022; Benary et al. 2023; Sorin et al. 2024). Another area of exploration for SLM-powered decision support is its integration into preexisting care processes and information technology infrastructures. To enhance patient-centricity, dynamic coupling of a tailored SLM with digital health or telemonitoring applications could enable the incorporation of more personalized, multimodal real-world data (e.g., continuous vital parameters, patient-reported outcomes on psychosocial factors, environmental data) into its decision-making process. Additionally, SLMs could be integrated with existing data infrastructures within hospital information systems, such as electronic health records, histopathology, laboratory results and imaging data. Enabling future models to draw on a patient’s comprehensive history through multimodal data integration would allow them to consider more patient-specific criteria, thereby providing more personalized decision support. At the same time, this integration would provide more efficient support to clinicians by reducing the need for manual data entry and automating data processing within clinical data infrastructures.
