Generating high-quality, real-world clinical and molecular datasets is challenging, costly and time-intensive. Such data should therefore be shared with the scientific community, but sharing carries the risk of privacy breaches. This risk hinders the community's ability to freely share and access high-resolution, high-quality data, which are especially important in the context of personalised medicine. In this study, we present an algorithm based on Gaussian copulas to generate synthetic data that retain the associations within high-dimensional (peptidomics) datasets. For this purpose, 3,881 datasets from 10 cohorts were employed, containing clinical, demographic and molecular (> 21,500 peptides) variables, as well as outcome data for individuals with a kidney or a heart failure (HF) event. High-dimensional copulas were developed to model the joint distribution of the clinical and peptidomics data in the dataset, and based on these distributions, a data matrix of 2,000 synthetic patients was generated. The synthetic data reproducibly retained the correlation of the peptidomics data with the clinical variables. Accordingly, the rho-values of individual peptides with eGFR were highly similar between the synthetic and the real-patient datasets, both at the single peptide level (rho = 0.885, p < 2.2e-308) and after classification with machine learning models (rho synthetic = -0.394, p = 5.21e-127; rho real = -0.396, p = 4.64e-67). External validation was performed using independent multi-centric datasets (n = 2,964) of individuals with chronic kidney disease (CKD, defined as eGFR < 60 mL/min/1.73 m²) or with normal kidney function (eGFR > 90 mL/min/1.73 m²). Similarly, the association of the rho-values of single peptides with eGFR was significantly reproduced between the synthetic and the external validation datasets (rho = 0.569, p = 1.8e-218).
Classifiers subsequently developed using the synthetic data matrices were highly predictive in external real-patient datasets (AUC values of 0.803 and 0.867 for HF and CKD, respectively), demonstrating the robustness of the developed method for generating synthetic patient data. The proposed pipeline represents a solution for sharing high-dimensional data while maintaining patient confidentiality.
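The copula-based generation step described above can be illustrated with a minimal sketch. This is not the authors' implementation (their package is in the repository listed under Data Availability); it assumes rank-based normal scores and empirical marginals, and the function names `fit_gaussian_copula` and `sample_synthetic`, as well as the toy "eGFR" and "peptide" columns, are hypothetical.

```python
from statistics import NormalDist

import numpy as np

_STD_NORMAL = NormalDist()
_to_normal = np.vectorize(_STD_NORMAL.inv_cdf)  # uniform -> normal score
_to_uniform = np.vectorize(_STD_NORMAL.cdf)     # normal score -> uniform


def fit_gaussian_copula(real):
    """Estimate the copula correlation matrix from a (patients x variables) array."""
    n_rows = real.shape[0]
    # Rank each column; dividing by n_rows + 1 keeps values strictly inside (0, 1).
    # Ties are not handled specially, which is acceptable for continuous toy data.
    ranks = real.argsort(axis=0).argsort(axis=0) + 1
    normal_scores = _to_normal(ranks / (n_rows + 1))
    return np.corrcoef(normal_scores, rowvar=False)


def sample_synthetic(real, corr, n_samples, rng):
    """Draw synthetic rows with the fitted dependence and the empirical marginals."""
    n_vars = real.shape[1]
    z = rng.multivariate_normal(np.zeros(n_vars), corr, size=n_samples)
    u = _to_uniform(z)
    # Invert each empirical marginal with a column-wise quantile lookup.
    columns = [np.quantile(real[:, j], u[:, j]) for j in range(n_vars)]
    return np.column_stack(columns)


# Toy stand-in for real data (assumption): one "eGFR" and two "peptide" columns.
rng = np.random.default_rng(0)
egfr = rng.normal(70.0, 20.0, size=500)
peptide_a = np.exp(-0.03 * egfr + rng.normal(0.0, 0.3, size=500))  # falls with eGFR
peptide_b = rng.gamma(2.0, 1.0, size=500)                          # unrelated to eGFR
real = np.column_stack([egfr, peptide_a, peptide_b])

corr = fit_gaussian_copula(real)
synthetic = sample_synthetic(real, corr, n_samples=2000, rng=rng)
```

Because the marginals are inverted through empirical quantiles, every synthetic value stays within the observed range of its variable, while the dependence between variables comes only from the fitted correlation matrix. A pipeline for more than 21,500 peptides would likely need a regularised or structured correlation estimate, which this sketch does not attempt.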
Competing Interest Statement
HM is the cofounder and co-owner of Mosaiques Diagnostics (Hannover, Germany). MAJC, MF, AL and JS are employees of Mosaiques Diagnostics. TK is the cofounder and co-owner of Atomic Intelligence (Zagreb, Croatia). SK, EA, VD, and DV are employees of Atomic Intelligence. All other authors declare no competing interests.
Funding Statement
MAJC holds a doctoral grant through the DisCo-I project that has received funding from the European Union's Horizon Europe Marie Sklodowska-Curie Actions Doctoral Networks - Industrial Doctorates Programme (HORIZON-MSCA-2021-DN-ID) under grant agreement No 101072828, funded by the European Union. This study was also in part funded by UPTAKE (urinary proteome analysis for the prediction of type, extent, prognosis, and therapeutic response of acute and chronic kidney diseases), funded by the Bundesministerium für Bildung und Forschung (BMBF; Federal Ministry of Education and Research) in the program Translational projects in personalized medicine under the grant numbers 01EK2105A, 01EK2105B, and 01EK2105C; by SIGNAL, funded by the BMBF (grant number 01KU2307) and by the Austrian Science Fund (FWF, Project number I 6464, Grant-DOI 10.55776/I6464); by Accurate-CVD (ZIM-KK5560002AP3), funded by the BMWK (Federal Ministry for Economic Affairs and Climate Protection); by ProSTRAT-AI (01DS23014), funded by the BMBF; by the PERMEDIK COST Action, supported by COST (European Cooperation in Science and Technology) grant no. CA21165; and by DC-ren (Horizon 2020 research and innovation programme under grant agreement No 848 011) and MULTIR (HORIZON-MISS-2023-CANCER-01-01; project number: 101136926), both funded by the European Commission. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority (HORIZON-MSCA-2021-DN-ID, Horizon 2020 Research and Innovation programme and HORIZON-MISS-2023-CANCER-01-01) can be held responsible for them.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
The study was conducted according to the guidelines of the Declaration of Helsinki and all datasets were fully anonymized. The ethics committee of the Hannover Medical School Germany waived ethical approval under the reference number 3116-2016 for all studies involving re-use of data from anonymized urine samples. The PROVALID study received ethical approval from the Institutional Review Boards in each participating country, with the Medical University of Innsbruck's ethics committee providing approval under reference number EK 1188/2020. The ethics committee of the Friedrich-Alexander-University Erlangen-Nuernberg approved the nephrological biobank (ethics approval code 264_20 B) and the urinary proteomics analysis (ethics approval code 221_20 B).
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability
The synthetic data generation package was written in Python. The script is freely available under the MIT license (https://github.com/Atomic-Intelligence/Peptide-synthesis.git). The results from the statistical analyses are all included in the Supplementary Files. The data that support the findings of this study are available from the corresponding author upon reasonable request.