Machine learning framework for predicting the presence of high-risk clonal haematopoiesis using complete blood count data: a population-based study of 431,531 UK Biobank participants

Abstract

Background: Clonal haematopoiesis (CH), the disproportionate expansion of a haematopoietic stem cell and its progeny driven by somatic DNA mutations, is a common age-related phenomenon that confers an increased risk of developing myeloid neoplasms (MN). At present, CH is identified by targeted sequencing of peripheral blood DNA, which is impractical to apply at population scale. The complete blood count (CBC) is an inexpensive, widely used clinical test. Here, we explore whether machine learning (ML) approaches applied to CBC data could predict individuals likely to harbour CH and prioritise them for DNA sequencing.

Methods: The UK Biobank was filtered to identify 431,531 participants with paired CBC and whole exome sequencing (WES) data. Somatic mutations had previously been identified from blood WES using Mutect2, allowing individuals with CH driver mutations to be classified. Using 18 CBC indices/features and basic demographics (age and sex), we trained a range of tree-based ML classifiers to infer, as a binary output, the presence or absence of CH.

Findings: Using Random Forest (RF) classifiers, we predicted the presence or absence of CH driven by mutations in one of five genes known to confer a high risk of incident MN (JAK2, CALR, SF3B1, SRSF2 and U2AF1). We subsequently developed a unified, optimised RF classifier for high-risk CH driven by any of these genes and assessed its performance (median AUC 0.85). However, the low prevalence of high-risk CH implies that our model cannot be generalised to population scale without compromising its sensitivity (20.1% using a stringent probability score cutoff).

Interpretation: We provide proof of concept that the presence of high-risk CH can be inferred from CBC perturbations using RF classifiers. The future integration of raw blood cell analyser data could help improve the performance of our model and facilitate its application at scale.
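The trade-off described in the Findings, a reasonable AUC alongside low sensitivity at a stringent cutoff, can be illustrated with a minimal sketch. It is not the study's pipeline: the scores, cutoffs, and prevalence value below are hypothetical toy numbers chosen only to show how the quantities relate, and the function names are illustrative.

```python
from typing import Sequence, Tuple


def auc(scores_pos: Sequence[float], scores_neg: Sequence[float]) -> float:
    """AUC as the chance that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney formulation).
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def sensitivity_specificity(
    scores_pos: Sequence[float], scores_neg: Sequence[float], cutoff: float
) -> Tuple[float, float]:
    """Call a sample positive when its score is at or above the cutoff."""
    true_pos = sum(1 for s in scores_pos if s >= cutoff)
    true_neg = sum(1 for s in scores_neg if s < cutoff)
    return true_pos / len(scores_pos), true_neg / len(scores_neg)


def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value from Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical classifier scores (NOT study data): 5 carriers, 10 non-carriers.
carriers = [0.9, 0.6, 0.4, 0.3, 0.2]
non_carriers = [0.7, 0.5, 0.35, 0.25, 0.1, 0.1, 0.05, 0.05, 0.02, 0.01]

print("AUC:", auc(carriers, non_carriers))

for cutoff in (0.3, 0.8):
    sens, spec = sensitivity_specificity(carriers, non_carriers, cutoff)
    # Hypothetical prevalence of 1%; the abstract does not state this figure.
    print(cutoff, sens, spec, round(ppv(sens, spec, 0.01), 4))
```

In this toy setting, raising the cutoff removes false positives, which matters most when carriers are rare, but it also discards most true carriers. This is the same direction of trade-off the abstract reports for its stringent cutoff.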

Competing Interest Statement

G.S.V. is a consultant to STRM.BIO and holds a research grant from AstraZeneca for research unrelated to that presented here. S.W. is an employee of AstraZeneca. M.A.F. is an employee and stockholder of AstraZeneca. The other authors declare no competing interests.

Funding Statement

WGD is funded by a Clinical Research Fellowship from the Cancer Research UK Cambridge Centre (CTRQQR-2021\100012). GSV is supported by a Cancer Research UK Senior Cancer Fellowship (C22324/A23015), and work in his laboratory is also funded by the Leukemia Lymphoma Society, Blood Cancer UK, European Research Council, Cancer Research UK, Kay Kendall Leukemia Fund, AstraZeneca and Wellcome Trust.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

This study used data available in the United Kingdom (UK) Biobank, accessed under approved application numbers 56844 and 69328. UK Biobank has Research Tissue Bank (RTB) approval from the North West Multi-centre Research Ethics Committee (MREC). Under this approval, researchers do not require separate ethical clearance and can operate under the RTB approval.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data Availability

All data used in this study are publicly available from the UK Biobank (https://www.ukbiobank.ac.uk/). Researchers may apply for access to the UK Biobank data via the Access Management System (https://www.ukbiobank.ac.uk/enable-your-research/apply-for-access).
