PANDORA: An AI model for the automatic extraction of clinical unstructured data and clinical risk score implementation

Abstract

Introduction: Medical records and physician notes often contain valuable information that is not organized in tabular form and usually requires extensive manual work to extract and structure. Large Language Models (LLMs) have shown remarkable abilities to understand, reason over, and retrieve information from unstructured data sources (such as plain text), presenting an opportunity to transform clinical data into accessible information for clinical or research purposes.

Objective: We present PANDORA, an AI system comprising two LLMs that extracts data and feeds it to risk calculators and prediction models to produce clinical recommendations as the final output.

Methods: This study evaluates the model's ability to extract clinical features from real clinical discharge notes from the MIMIC database and from synthetically generated outpatient clinical charts. We use the PUMA calculator for Chronic Obstructive Pulmonary Disease (COPD) case finding, which interacts with the model and the retrieved information to produce a score based on the 7 items of the PUMA scale and to classify patients who would benefit from further spirometry testing.

Results: The extraction capabilities of our model are excellent, with an accuracy of 100% on the MIMIC database and 99% on synthetic cases. The model's ability to interact with the PUMA scale and assign the appropriate score was high, with an accuracy of 94% for both databases. The final output is a recommendation regarding the patient's risk of COPD, classified as positive when the score meets the validated PUMA threshold of 5 points or higher. Sensitivity was 86% for MIMIC and 100% for synthetic cases.

Conclusion: LLMs have been used successfully to extract information in some cases, and there are descriptions of how they can recommend an outcome based on the researcher's instructions. However, to the best of our knowledge, this is the first model that successfully extracts information from plain, non-tabular data based on clinical scores or questionnaires designed and validated by human experts and provides a recommendation combining all these capabilities. It not only uses knowledge that already exists but also makes it available to be explored in light of the highest-quality evidence in several medical fields.
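The scoring and thresholding step described in the Methods and Results can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that each of the 7 PUMA items has already been converted to a non-negative integer number of points by an upstream extraction step, and it takes only the cutoff of 5 points from the abstract. The function names (`puma_total`, `recommend_spirometry`, `sensitivity`) and the reference labels used for the sensitivity calculation are hypothetical.

```python
# Illustrative sketch only; names and input format are assumptions.
PUMA_ITEM_COUNT = 7   # the abstract states the PUMA scale has 7 items
PUMA_THRESHOLD = 5    # the abstract states a positive result is >= 5 points


def puma_total(item_points):
    """Sum the per-item points (assumed already scored upstream)."""
    if len(item_points) != PUMA_ITEM_COUNT:
        raise ValueError(f"expected {PUMA_ITEM_COUNT} items, got {len(item_points)}")
    if any((not isinstance(p, int)) or p < 0 for p in item_points):
        raise ValueError("item points must be non-negative integers")
    return sum(item_points)


def recommend_spirometry(item_points):
    """Return True when the total reaches the threshold for further testing."""
    return puma_total(item_points) >= PUMA_THRESHOLD


def sensitivity(predicted, reference):
    """True positives divided by all reference positives."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    true_pos = sum(1 for p, r in zip(predicted, reference) if p and r)
    false_neg = sum(1 for p, r in zip(predicted, reference) if not p and r)
    if true_pos + false_neg == 0:
        raise ValueError("no positive reference cases")
    return true_pos / (true_pos + false_neg)
```

In this sketch, a patient whose item points sum to 5 or more is flagged for spirometry, and sensitivity is the fraction of reference-positive cases that the pipeline also flags.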

Competing Interest Statement

All authors have completed the ICMJE uniform disclosure form at www.icmje.org/coi_disclosure.pdf and declare: no support from any organization for the submitted work; all authors are employed at Arkangel AI; no other relationships or activities that could appear to have influenced the submitted work.

Funding Statement

This study did not receive any funding.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

The only source of human data in this study is the deidentified clinical notes publicly available in the MIMIC-IV-Note dataset (available at: https://www.physionet.org/content/mimic-iv-note/2.2/).

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data Availability

All data produced in the present study are available upon reasonable request to the authors.

medRxiv - Health Informatics
