Variation in medical practices and reporting standards across healthcare systems limits the transferability of prediction models based on structured electronic health record (EHR) data. We introduce GRASP, a novel transformer-based architecture that enhances the generalizability of EHR-based prediction by embedding medical codes into a unified semantic space using a large language model. We applied GRASP to predict the onset of 21 diseases and all-cause mortality in over one million individuals from UK Biobank (UK), FinnGen (Finland) and Mount Sinai (USA), all harmonized to the OMOP common data model. Trained on UK Biobank and evaluated in FinnGen and Mount Sinai, GRASP achieved an average ΔC-index that was 83% and 35% higher, respectively, than that of language-unaware models. GRASP also showed significantly higher correlations with polygenic risk scores for 62% of diseases. Notably, GRASP maintained robust performance even when datasets were not harmonized to the same data model, accurately predicting disease risk from ICD-10-CM codes without direct mappings to OMOP. GRASP enables accurate and transferable disease predictions across heterogeneous healthcare systems with minimal resource requirements.
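The abstract reports discrimination as a ΔC-index but does not define it. As a rough illustration only, the sketch below computes Harrell's concordance index for right-censored data and then takes the difference between two sets of risk scores. It assumes that Δ denotes a difference from a reference model, which is not confirmed by the text above. The function name `concordance_index`, the tie handling, and the toy data are hypothetical and are not taken from the GRASP code.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Harrell's C-index: the share of comparable pairs ranked correctly.

    times  -- follow-up time per individual
    events -- 1 if the event was observed, 0 if the individual was censored
    risks  -- predicted risk score (higher means earlier expected event)

    Simplification (assumption): pairs with tied follow-up times are skipped.
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter follow-up time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the shorter time is an observed event.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0   # earlier event received the higher risk
        elif risks[a] == risks[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): four individuals, the third one censored.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
model_risks = [0.9, 0.6, 0.5, 0.1]      # stand-in for a model's scores
reference_risks = [0.5, 0.6, 0.9, 0.1]  # stand-in for a reference model

c_model = concordance_index(times, events, model_risks)
c_reference = concordance_index(times, events, reference_risks)
delta_c = c_model - c_reference  # assumed meaning of the delta
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so a larger difference indicates a larger gain in discrimination over the reference scores.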
Competing Interest Statement
A.G. is the founder of Real World Genetics Oy.
Funding Statement
Funding sources are listed in the manuscript.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
Data collection was approved for UKB by the North West Multi-centre Research Ethics Committee (MREC) and for Mount Sinai data by an internal IRB of the Mount Sinai School of Medicine. Patients and control subjects in FinnGen provided informed consent for biobank research, based on the Finnish Biobank Act. Alternatively, separate research cohorts, collected before the Finnish Biobank Act came into effect (in September 2013) and before the start of FinnGen (August 2017), were collected based on study-specific consents and later transferred to the Finnish Biobanks after approval by Fimea (Finnish Medicines Agency), the National Supervisory Authority for Welfare and Health. Recruitment protocols followed the biobank protocols approved by Fimea. The Coordinating Ethics Committee of the Hospital District of Helsinki and Uusimaa (HUS) statement number for the FinnGen study is Nr HUS/990/2017.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability
The code for the project is available at https://github.com/mkirchler/grasp. The individual-level data in these studies are protected for data privacy; access is regulated through the biobanks. The Finnish biobank data can be accessed through the Fingenious services (https://site.fingenious.fi/en/) managed by FINBB. UK Biobank data are available through a procedure described at http://www.ukbiobank.ac.uk. Mount Sinai EHR data can be accessed via a use agreement with researchers at Mount Sinai.