Genes & Health is a long-term, community-based study of British Pakistani and British Bangladeshi individuals aged 16 years and older living in the UK19. At recruitment, participants provide a saliva sample for genotyping, complete a short questionnaire on basic demographic information and consent to linkage for primary care, secondary care and national NHS EHRs. Since recruitment began in 2015, over 60,000 participants have been recruited, with linked genetic and EHR information available for 44,396 as of July 2023 (number of T2D cases = 9,771) and 50,556 as of February 2024. A participant flow diagram showing individuals included in analyses is shown in Fig. 1.
Ethical approval
We conducted this research under an approved application to the Genes & Health Executive. The Genes & Health study is approved by the London South East NRES Committee of the Health Research Authority (14/LO/1240).
Inclusion and exclusion criteria
We used no specific inclusion criteria. We excluded individuals with clinical codes consistent with type 1 diabetes, maturity-onset diabetes of the young (MODY) or causes of secondary diabetes, such as cystic fibrosis and pancreatectomy.
Genetic data processing and curation
Genotyping was performed on the Illumina Infinium Global Screening Array v3 with additional multi-disease variants. Quality control was performed following a standardized approach33. In brief, variants with call rates less than 0.99 and/or minor allele frequency (MAF) < 1% were excluded. We excluded individuals unlikely to have genetically inferred Pakistani or Bangladeshi ancestry. Imputation was performed using the TOPMed-r2 panel. We excluded SNPs with low imputation scores (INFO < 0.3) or MAF < 0.1%.
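The two variant-level filters described above can be sketched as simple predicates. This is an illustration only: the `Variant` record and the function names are hypothetical, and the actual quality control followed the standardized pipeline cited in the text.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    variant_id: str   # hypothetical identifier, e.g. an rsID
    call_rate: float  # fraction of samples with a genotype call
    maf: float        # minor allele frequency, as a fraction


def passes_genotype_qc(v, min_call_rate=0.99, min_maf=0.01):
    """Keep genotyped variants with call rate >= 0.99 and MAF >= 1%."""
    return v.call_rate >= min_call_rate and v.maf >= min_maf


def passes_imputation_qc(info, maf, min_info=0.3, min_maf=0.001):
    """Keep imputed SNPs with INFO >= 0.3 and MAF >= 0.1%."""
    return info >= min_info and maf >= min_maf
```

The thresholds are keyword arguments so that the same predicates can be reused if a different cut-off is required.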
Sex determination
We defined sex on the basis of XX (female) and XY (male) chromosomal presence in genotype data.
EHR data processing and curation
We curated routine UK NHS EHR data from primary care (Systematized Nomenclature of Medicine (SNOMED) coded) and secondary care (International Classification of Diseases, Tenth Revision (ICD-10) coded) sources. Data were combined without mapping between coding formats. For each clinical code, we took the earliest ever measure recorded in a participant’s medical records, excluding erroneous code dates preceding the participant’s recorded date of birth.
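As a minimal sketch of this curation step, the function below keeps the earliest date per clinical code after discarding dates that precede the date of birth. The record layout, a list of `(code, date)` pairs pooled from primary and secondary care, is an assumption made for illustration.

```python
from datetime import date


def earliest_valid_dates(records, date_of_birth):
    """Return {code: earliest date}, ignoring dates before the date of birth.

    `records` is an iterable of (code, date) pairs from any care setting.
    """
    earliest = {}
    for code, when in records:
        if when < date_of_birth:
            continue  # erroneous date preceding birth
        if code not in earliest or when < earliest[code]:
            earliest[code] = when
    return earliest
```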
Exposures
pPS construction and ancestry correction
We used PLINK to calculate pPSs for 12 diabetes-associated genetically determined endotypes described by Smith et al.16, derived from high-throughput genetic clustering techniques in European (78%), African (19%) and East Asian (2.1%) ancestry individuals, using only SNPs above the authors’ specified inclusion threshold (cluster weight > 0.78), weighted by their cluster weights16. These comprise three endotypes related to glucose sensing, insulin secretion and insulin production (Beta Cell 1, Beta Cell 2 and Proinsulin, respectively); three clusters related to insulin resistance and unfavorable adiposity (Obesity, Lipodystrophy 1 and Lipodystrophy 2); and six clusters with unclear effects on insulin resistance and deficiency (Liver/Lipid, Alkaline Phosphatase (ALP) Negative, Hyper Insulin Secretion, Cholesterol, Sex Hormone-Binding Globulin Lipoprotein A (SHBG/LpA) and Bilirubin).
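The idea behind a partitioned score can be sketched as a weighted sum of risk-allele dosages over the SNPs that pass the cluster weight threshold. This is not a reproduction of the PLINK computation, whose output scaling may differ; the dictionary inputs and the function name are hypothetical.

```python
def partitioned_score(dosages, cluster_weights, min_weight=0.78):
    """Weighted sum of allele dosages for one individual and one cluster.

    dosages: {snp_id: dosage in [0, 2]} for the individual.
    cluster_weights: {snp_id: weight} for the cluster.
    Only SNPs with weight > min_weight contribute to the score.
    """
    total = 0.0
    for snp, weight in cluster_weights.items():
        if weight > min_weight and snp in dosages:
            total += weight * dosages[snp]
    return total
```

SNPs missing from the individual's dosage table are skipped here; how missing genotypes were actually handled is not stated in the text.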
Regressing the effect of genetic PCs out of pPSs to allow direct comparison between British Pakistani and Bangladeshi individuals
PC analysis of genetic data shows distinct population structure for people of Bangladeshi and Pakistani ancestries35, and we observed differences in pPS distribution between these groups (Extended Data Fig. 1). Therefore, to maximize power and facilitate combined analyses of all individuals, we regressed out the effect of the first 10 genetic PCs from each pPS, using an approach described by Liu et al.35. In brief, we constructed residual PRSs and pPSs after regressing the first 10 genetic PCs out of each PRS and pPS separately in non-diabetic controls, after which no statistically significant differences between Pakistani and Bangladeshi distributions were observed (Extended Data Fig. 1). These residual PRSs and pPSs were used in all downstream analyses, except those exploring ancestry-specific pPS distributions (in which case the term ‘unmodified pPS’ is used). Although genetic PCs were subsequently included in sensitivity analyses downstream, these (as would be expected) had no statistically significant effect in any analysis employing residual PRSs or pPSs and were, therefore, not included or presented in principal analyses in this paper. When comparing distributions, we varied the applied test between ANOVA and Kruskal–Wallis depending on distribution normality.
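One possible reading of this residualization is sketched below: a linear model of score on the genetic PCs is fitted in non-diabetic controls, and the fitted PC effect is then subtracted from every individual's score. Whether the control-derived coefficients were applied to all individuals in exactly this way is an assumption, as are all function names. The sketch uses plain Python and the normal equations for transparency; a statistics package would normally be used.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(pcs, scores):
    """Least-squares coefficients for score ~ 1 + PCs (normal equations)."""
    design = [[1.0] + list(row) for row in pcs]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, scores)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def residualize_scores(pcs, scores, is_control):
    """Fit the PC model in controls only, then return residuals for everyone."""
    ctrl_pcs = [p for p, c in zip(pcs, is_control) if c]
    ctrl_scores = [s for s, c in zip(scores, is_control) if c]
    beta = fit_ols(ctrl_pcs, ctrl_scores)
    return [s - (beta[0] + sum(b * x for b, x in zip(beta[1:], p)))
            for p, s in zip(pcs, scores)]
```

Each row of `pcs` holds one individual's PC values (10 in the analysis described, any number in the sketch), so the same function applies to each PRS and pPS in turn.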
T2D polygenic score
We additionally calculated scores for a global T2D PRS using a previously published score comprising 1,091,608 variants derived in European ancestry individuals36. We selected this PRS by calculating scores for all T2D PRSs published on the PGS Catalog37 and comparing performance, assessed as area under the receiver operating characteristic curve (estimated using the R package pROC) and beta, both estimated from multivariable logistic regression models describing score associations with incident T2D, adjusted for age, sex, ancestry and the first 10 genetic PCs. Score performance is summarized in Supplementary Table 12; the best-performing scores in Genes & Health were similar to those in European ancestry populations38. We corrected this score for genetically determined ancestry using the same process as that described above for the pPS.
pPS ‘extremes’
We defined pPS ‘extremes’ as scores in the top or bottom 10% of each residualized pPS distribution and ‘combined extremes’ as individuals with scores in the top or bottom 10% of multiple pPS distributions.
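A minimal sketch of these definitions follows. The decile cut-offs use the inclusive interpolation method of `statistics.quantiles`, which is an assumption, since the text does not state how percentiles were computed; "multiple" is taken to mean two or more pPS distributions. Function names are hypothetical.

```python
from statistics import quantiles


def extreme_flags(scores):
    """Label each score 'bottom', 'top' or None using the 10th/90th percentiles."""
    cuts = quantiles(scores, n=10, method="inclusive")
    low, high = cuts[0], cuts[-1]
    return ["bottom" if s <= low else "top" if s >= high else None
            for s in scores]


def combined_extremes(score_table, min_scores=2):
    """Flag individuals who are extreme in at least `min_scores` pPS distributions.

    score_table: {pps_name: [score per individual, in a fixed order]}.
    """
    flags = [extreme_flags(scores) for scores in score_table.values()]
    n_individuals = len(flags[0])
    return [sum(f[i] is not None for f in flags) >= min_scores
            for i in range(n_individuals)]
```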
UK Biobank: cross-ancestry differences in genetic burden and viability for replication
We used the UK Biobank39 to compare the distribution of pPSs between individuals with T2D of European and South Asian ancestry; T2D was defined in line with established clinical codelists40. Differences in distribution of pPSs across ancestry groups were assessed using a t-test for normally distributed pPSs and Wilcoxon signed-rank testing for all other pPSs. Data from individuals of European ancestry in the UK Biobank were included in the T2D GWAS, which defined genetic variants that were partitioned as part of the pPS discovery study16, and in a subset of phenotype GWAS used to define these pPSs. However, South Asian ancestry individuals were not included, likely due to small sample size. We provide a population flow chart showing the numbers of T2D cases split by ancestry and number of recorded complications to determine suitability for replication analyses (Fig. 1). Analyses in the UK Biobank were conducted under application IDs 44448 and 153692.
Medications
Diabetes-controlling medication classes were defined according to method of action: insulin secretagogues (sulfonylureas and meglitinide), incretin mimetics (GLP1 receptor agonists and DPP4 inhibitors), insulin sensitizers (pioglitazone and thiazolidinediones) and renal tubular glucose reabsorption modifiers (SGLT2 inhibitors) in addition to metformin and insulin. Initiation of medication was defined as the first instance of medication prescription in the EHR. For treatment response analyses (described in further detail in the ‘Outcomes’ subsection below), concurrently prescribed medications were defined as medications from another class prescribed within a conservative window of 6 months before or after initiation of the medication being treated as the exposure.
Outcomes
Diabetes phenotypes and complications
Clinical phenotypes were defined on the basis of diagnostic codes present in the EHR. We used clinically curated ICD-10 and SNOMED codelists adapted from the AI-MULTIPLY resource41 using reproducible, consensus-derived methods to define all diabetes phenotypes and complications (Supplementary Table 13). Where appropriate, our curated codelists align to structured and incentivized clinical coding processes used in the UK NHS. Diabetes phenotypes included T2D, GDM and incident T2D after GDM. Diabetes-related complications were defined as microvascular (nephropathy, neuropathy and retinopathy) and macrovascular (coronary artery disease, cerebrovascular disease and peripheral vascular disease).
Diagnostic codes with unrealistic timestamps were removed (before or on date of birth or after the date of last data extraction). The earliest code date across primary and secondary care records was defined as the condition diagnosis date. Sex-specific codes applied to the wrong sex (for example, males with a diagnostic code of GDM) were removed.
T2D was defined as a clinical code of T2D in the EHR (Supplementary Table 13), entered after age 18 years, in the absence of excluder conditions (type 1 diabetes, MODY, cystic fibrosis, pancreatectomy, Cushing’s syndrome, Cushing’s disease and all documented cases of secondary diabetes). GDM was defined as a clinical code of GDM in the EHR of female participants; GDM codes occurring after a documented code of T2D, or any excluder conditions, were discounted. T2D after GDM was defined as incident T2D after GDM—that is, individuals for whom the earliest clinical code for GDM preceded the earliest clinical code for T2D. Individuals with GDM and T2D were not removed from T2D-specific analyses (Fig. 1).
Diabetes-related complications were defined as microvascular (nephropathy (n = 1,470), retinopathy (n = 4,764), neuropathy (n = 462)) and macrovascular (coronary artery disease (n = 2,606), cerebrovascular disease (n = 1,233) and peripheral vascular disease (n = 297)). Individuals with pre-existing complications at the time of T2D diagnosis were excluded from survival analyses; that is, only incident complications after T2D diagnosis were analyzed. Clinical codelists for these conditions were taken from the AI-MULTIPLY resource, a codelist tool developed using consensus methodology by local clinicians, including diabetologists and primary care doctors, designed to capture reasonable definitions of complications. For some complications that may be ambiguous (such as retinopathy and nephropathy, the latter lying on a spectrum of disease defined by estimated glomerular filtration rate and albuminuria), the AI-MULTIPLY codelist sought to capture codes harmonizing with the UK Quality and Outcomes Framework (QOF), that is, the incentivized and structured approach to coding of these diabetes-related complications in routine healthcare in the UK42. Although these conditions may be described differently across different healthcare systems, nations and populations, the use of a robust and well-defined clinical codelist algorithm to define conditions allows for reproducibility of results and alignment with other populations using EHRs in the UK.
Age at diagnosis
Date of diagnosis for each outcome was defined as the earliest recorded date in either primary or secondary care above the age of 18 years.
Quantitative traits
Quantitative outcomes were, unless otherwise stated, defined as the measure taken closest to the date of T2D diagnosis, within 1 year (before and/or after), and included age, BMI, waist circumference, HbA1c, fasting and random blood glucose, low-density and high-density lipoprotein cholesterol, serum triglycerides, alkaline phosphatase (ALP) and alanine transaminase (ALT). Because diabetes-related traits may rapidly change after diagnosis and/or initiation of treatment, for each trait the value closest to the time of diagnosis was used. In addition to traits at diagnosis, we explored the number of medication classes (as defined above) that an individual was prescribed in 5 years and the change in HbA1c from time of diagnosis to 5 years (the HbA1c closest in time to 5 years from diagnosis date was taken, and only values between 4 years and 6 years after diagnosis were included in the analysis). Quantitative traits were processed as previously described, including exclusion of outliers lying 6 or more s.d. above or below the mean43.
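The selection of the measurement closest to diagnosis and the outlier rule can be sketched as follows. This is an illustration under stated assumptions: 1 year is taken as 365 days, the s.d. is the sample s.d. across all values of a trait, and the function names are hypothetical; the processing actually used is the one previously described in the cited reference.

```python
from datetime import date
from statistics import mean, stdev


def closest_measurement(measurements, diagnosis_date, window_days=365):
    """Return the value measured closest to diagnosis within +/- window_days.

    measurements: iterable of (date, value). Returns None if none qualify.
    Ties in distance are broken by the smaller value (an arbitrary choice).
    """
    in_window = [(abs((when - diagnosis_date).days), value)
                 for when, value in measurements
                 if abs((when - diagnosis_date).days) <= window_days]
    return min(in_window)[1] if in_window else None


def drop_outliers(values, n_sd=6.0):
    """Remove values lying n_sd or more s.d. above or below the mean."""
    mu, sd = mean(values), stdev(values)
    if sd == 0:
        return list(values)
    return [v for v in values if abs(v - mu) < n_sd * sd]
```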
Response to glucose-lowering treatment
For treatment response analyses, medication data were extracted from the primary care EHR. In line with pharmacogenomic studies44, treatment response was defined as the percentage change between the most recent HbA1c in 6 months before medication initiation and the lowest HbA1c in the 1 year after initiation, as a proportion of pre-medication HbA1c (that is, percent change from before initiation).
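The treatment response definition can be written as 100 × (lowest post-initiation HbA1c − baseline HbA1c) / baseline HbA1c. The sketch below implements that formula under several assumptions not stated in the text: 6 months is taken as 182 days, a measurement on the day of initiation counts as pre-treatment, and a negative value indicates a reduction. The function name is hypothetical.

```python
from datetime import date, timedelta


def hba1c_response(measurements, initiation, pre_days=182, post_days=365):
    """Percent change in HbA1c from baseline to the lowest value after initiation.

    measurements: iterable of (date, hba1c). Returns None if either the
    pre-initiation or the post-initiation window contains no measurement.
    """
    measurements = list(measurements)
    pre = [(when, v) for when, v in measurements
           if initiation - timedelta(days=pre_days) <= when <= initiation]
    post = [v for when, v in measurements
            if initiation < when <= initiation + timedelta(days=post_days)]
    if not pre or not post:
        return None
    baseline = max(pre)[1]  # most recent pre-initiation measurement
    return 100.0 * (min(post) - baseline) / baseline
```

For example, a baseline of 64 mmol/mol and a lowest follow-up value of 48 mmol/mol gives a response of −25%.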
Diabetes-controlling medication classes were defined according to method of action: insulin secretagogues (sulfonylureas and meglitinide), insulin sensitizers (pioglitazone and thiazolidinediones) and renal tubular glucose reabsorption modifiers (SGLT2 inhibitors) in addition to metformin and insulin. Time to initiation of insulin was calculated as the time lag between the earliest T2D diagnostic code in the medical record and the earliest record of insulin prescription; insulin prescriptions with an earlier date than time of diabetes diagnosis were discarded.
Statistical analyses
Descriptive analysis
We calculated mean values for quantitative traits at diagnosis and 5 years after diagnosis, stratified by ancestral group, and compared these using ANOVA.
Multivariable analysis
We described the association of each pPS (the exposure) with each diabetes phenotype outcome, using multivariable logistic regression models adjusted for age, sex and ancestry, to estimate the per-s.d. increase in odds of diabetes phenotype between diabetes phenotype cases and non-diabetic controls. We estimated the association of each pPS (the exposure) with diabetes-related traits at the time of diagnosis (the outcome). To allow comparison of effects of pPS between quantitative traits at the time of diagnosis, each quantitative trait was scaled to a normal distribution, and the beta per s.d. of pPS was presented for each trait, estimated from multivariable linear regression models adjusted for age, sex and ancestry. Multivariable linear regression was used to estimate the effect of pPS on age of T2D diagnosis, adjusted for ancestry and sex; partial R2 was calculated with the R package ‘partialR2’. We assumed a priori that associations may differ between sexes and ancestry groups; because of this, sex-stratified and ancestry-stratified analyses were also performed. For pharmacogenomic analysis of treatment response, association between each pPS (the exposure) and HbA1c change in response to medication (the outcome) was estimated from multivariable linear regression models adjusted for age, sex and ancestry as well as concurrently prescribed anti-diabetic medication from all other classes within 6 months before or after initiation of each medication. We meta-analyzed treatment response analyses from the discovery and replication samples using fixed effects models with the R package ‘metafor’.
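The fixed effects meta-analysis step corresponds to inverse-variance weighting of the cohort-specific estimates. The sketch below shows that estimator in Python for illustration; it is not the ‘metafor’ call used in the analysis, and the function name is hypothetical.

```python
from math import sqrt


def fixed_effect_meta(betas, standard_errors):
    """Inverse-variance weighted pooled estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, betas)) / total
    pooled_se = sqrt(1.0 / total)
    return pooled, pooled_se
```

With two cohorts of equal precision the pooled estimate is the simple average of the two betas, and the pooled standard error shrinks by a factor of √2.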
Survival analysis
We constructed survival models starting from each individual’s date of diagnosis, running until last data extraction, for two outcome categories: initiation of insulin and progression to diabetes-related complications. We explored the association of each pPS with each complication outcome using Cox proportional hazard models adjusted for age, sex and ancestry. We calculated Schoenfeld residuals for each model to check assumptions of proportionality.
Bonferroni correction for multiple testing
Analyses reported in this paper tested associations of multiple polygenic scores with multiple outcomes. Where appropriate, we present analysis-specific Bonferroni-corrected P value thresholds, calculated as P = 0.05 / (number of associations tested in analysis).
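As a worked illustration of this calculation, the sketch below computes the threshold for a hypothetical analysis of 12 pPSs against 6 complications (72 tests); the number of tests is an example, not a figure reported in the paper.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Analysis-specific significance threshold: alpha / number of tests."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def is_significant(p_value, n_tests, alpha=0.05):
    """True if the P value falls below the Bonferroni-corrected threshold."""
    return p_value < bonferroni_threshold(n_tests, alpha)


# Hypothetical example: 12 pPSs x 6 complications = 72 associations tested.
threshold = bonferroni_threshold(12 * 6)
```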
Software and statistical computing
Genotype curation and pPS calculation were performed using PLINK version 2.0 (ref. 45). Statistical analyses were performed using R version 4.2.3.
Reporting
We report this study following the STREGA46 and SAGER47 guidelines.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.