Optimal strategies for adapting open-source large language models for clinical information extraction: a benchmarking study in the context of ulcerative colitis research

Abstract

Background: Closed-source large language models (LLMs) such as GPT-4o have shown promise for clinical information extraction but are potentially limited by cost, data security concerns, and inflexibility. Open-source models have emerged as an attractive alternative, and many LLM adaptation strategies have been developed in the literature. However, it is currently unclear which adaptation strategies are optimal and how adapted open-source models ultimately compare to closed-source ones.

Methods: We studied the effects of three common LLM adaptation strategies: chain-of-thought prompting, few-shot prompting, and fine-tuning via quantized low-rank adaptation (QLoRA). Our target for information extraction was the Mayo Endoscopic Subscore (MES). We applied these strategies in all combinations to six open-source models (8-70 billion parameters) using an annotated set of colonoscopy procedure reports from two centers: the University of California, San Francisco (N=608) and San Francisco General Hospital (N=217). We analyzed the relationship between these strategies and several performance metrics with a mixed-effects model, accounting for variability between centers and LLMs. GPT-4o could not be fine-tuned with QLoRA because it is closed-source, but it was used as a comparator in our benchmarks. We also provide in-depth commentary on the cost-effectiveness of these open-source LLMs and GPT-4o for MES extraction.

Results: Across adaptation strategies, QLoRA significantly (p<0.001) improves the performance of open-source LLMs, by 8.3-15.6 percentage points across accuracy, precision, and recall. However, GPT-4o with prompt engineering outperforms the best open-source model by a margin of 2.5-5.4%. A simple cost-effectiveness analysis nevertheless suggests that GPT-4o is expensive compared to open-source models.

Conclusion: GPT-4o is currently the most performant LLM for MES extraction. If it is unavailable, open-source models optimized with QLoRA are a competitive alternative.
However, our results also suggest that current instruction-following LLMs, including GPT-4o, do not fully follow user-provided instructions, leaving room for improvement. More work is needed to achieve consistent, near-perfect performance in clinical information extraction by LLMs.
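To make the prompting strategies named above concrete, the sketch below shows one way few-shot and chain-of-thought prompting could be combined into a single prompt for MES extraction. This is an illustration only, not the authors' actual prompts: the report snippets, example reasoning, and function name are hypothetical.

```python
# Hypothetical few-shot examples pairing a report snippet with step-by-step
# reasoning that ends in an MES label (chain of thought + few-shot combined).
FEW_SHOT_EXAMPLES = [
    ("Mucosa shows marked erythema, absent vascular pattern, and friability.",
     "The report describes erythema, loss of vascular markings, and friability "
     "without spontaneous bleeding, consistent with moderate activity. MES: 2"),
    ("Normal mucosa throughout the examined colon.",
     "Normal mucosa indicates no endoscopic disease activity. MES: 0"),
]

def build_mes_prompt(report_text: str) -> str:
    """Assemble a few-shot, chain-of-thought prompt for MES extraction."""
    parts = [
        "You are extracting the Mayo Endoscopic Subscore (MES, 0-3) from a "
        "colonoscopy report. Reason step by step, then state 'MES: <score>'."
    ]
    # Each worked example demonstrates the expected reasoning-then-answer format.
    for report, reasoning in FEW_SHOT_EXAMPLES:
        parts.append(f"Report: {report}\nAnswer: {reasoning}")
    # The unlabeled report comes last; the model completes the final Answer.
    parts.append(f"Report: {report_text}\nAnswer:")
    return "\n\n".join(parts)

prompt = build_mes_prompt("Severe ulcerations with spontaneous bleeding noted.")
```

The resulting string would be sent to any of the benchmarked models; with QLoRA fine-tuning, the same format could instead be baked into the training examples.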

Competing Interest Statement

VAR receives research support from Alnylam, Takeda, Merck, Genentech, Blueprint Medicines, Stryker, Mitsubishi Tanabe, and Janssen. He also is a shareholder of ZebraMD. RPY, ADS, and SW have nothing to disclose.

Funding Statement

Research reported in this publication was supported by the National Library of Medicine of the National Institutes of Health under Award Number K99LM014099, the National Center for Advancing Translational Sciences, National Institutes of Health, through UCSF-CTSI Grant Number UL1 TR001872, as well as the UCLA Clinical and Translational Science Institute through grant number UL1TR001881. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the NIH. This research project has benefitted from the Microsoft Accelerate Foundation Models Research (AFMR) grant program through which leading foundation models hosted by Microsoft Azure along with access to Azure credits were provided to conduct the research.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

Ethics committee/IRB of UCSF waived ethical approval for this work.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data Availability

All data produced in the present work are contained in the Supplemental Appendix. The patient clinical data used to generate the results will not be made available to others.
