Variability of polygenic prediction for body mass index in Africa

Cohorts and Biobanks used for the BMI GWAS

We assembled GWAS of BMI across diverse ancestry groups that were imputed to the 1000 Genomes Project or Haplotype Reference Consortium reference panels from the UK Biobank (UKBB), Biobank Japan (BBJ), the African Partnership for Chronic Disease Research (APCDR), Network and the Population Architecture and Genetic Epidemiology (PAGE) study. BMI was inverse rank normalized in all the studies considered for the meta-analysis. These discovery studies and the two target data sets AWIGen and EstBB are briefly described below.

Full details of BMI GWAS analyses in the UKBB have been previously reported [16]. The UKBB is a large-scale biomedical database comprised of half a million UK participants with de-identified genetic and health information. For our study, we considered 456,422 individuals of European ancestry. Imputation was performed using the 1000 Genomes Project (Phase 3) reference panel, resulting in 8,531,416 variants after excluding those with minor allele frequency (MAF) < 0.01 and missingness of > 0.1. Genetic association analysis was undertaken using FastGWAS in which linear mixed models were fitted for inverse rank normalized BMI residuals while adjusting for age, age2, and sex. The random effect for the genetic relationship was included to account for population structure and relatedness. These summary statistics are accessible through this link (https://yanglab.westlake.edu.cn/data/ukb_fastgwa/imp/pheno/21001) [16].

The Biobank Japan (BBJ) is a prospective genome biobank that recruited participants from 12 medical institutions in Japan. BBJ GWAS of BMI comprised 158,284 individuals of East Asian ancestry [6]. Imputation was conducted using East Asian populations in the 1000 Genomes Project (Phase 3) as a reference, and after quality control, there were 6,108,953 SNPs. Residuals fitted for BMI while adjusting for age, age2, and sex were transformed using the inverse rank normalization. Linear models were then fitted for the allele dosages while adjusting for the first 10PCs using mach2qtl. Summary statistics were accessed from the Japan Biobank via this link (https://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST004001GCST005000/GCST004904/) [6].

The APCDR Network conducted a meta-analysis of GWAS summary statistics from the Uganda, Durban Diabetes Study (DDS), Durban Diabetes Case–Control Study (DDC), and AADM cohorts in 14,126 individuals for multiple traits, including BMI [7]. Imputation was performed using a merged reference panel of the whole-genome sequences from the African Genome Variation Project, Uganda sequences, and the 1000 Genomes Project (Phase 3). An imputation info filtering threshold of 0.3 and a minimum MAF of 0.5% were applied, resulting in 24,423,923 SNPs after quality control. Before meta-analysis, the inverse rank normalized residuals of BMI were fitted in linear mixed models while adjusting for age, age.2, and sex. The Han-Eskin random effects meta-analysis approach implemented in METASOFT (RE2) was used to aggregate summary statistics from these four cohorts. The summary statistics are accessible at (https://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST009001GCST010000/GCST009057/) [7].

The PAGE study recruited individuals of diverse ancestries who reside in the USA [8]. In this study, 22,216 Hispanics/Latinos, 17,299 African Americans, 4680 Asians, 3940 Native Hawaiians, 652 Native Americans, and 1052 individuals of other ancestries, totaling 49,839 participants were enrolled. Imputation was conducted using the 1000 Genomes Project (Phase 3) reference panel, and SNPs with an imputation information score > 0.4 (39,723,562) were included in the analysis. Linear mixed models for the inverse rank normalized residuals for BMI were fitted while adjusting for 10PCs in a joint analysis of all the individuals of varied ancestries. The summary statistics were accessed from the GWAS catalog (https://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST008001GCST009000/GCST008025/) [8].

The AWIGen study recruited participants from four African countries that are representative of the East, West, and South regions of Africa [17]. Imputation was performed on the cleaned dataset (with 1,729,661 SNPs and 10,903 individuals, that remained after quality control, which included the removal of closely related individuals) using the Sanger Imputation Server and the African Genome Resources as reference panel. We selected EAGLE2 for pre-phasing and the default PBWT algorithm was used for imputation. After imputation, poorly imputed SNPs with info scores less than 0.6, MAF less than 0.01, and HWE P-value less than 0.00001 were excluded. The final QC-ed imputed data had 13.98 M SNPs.

The Estonian Biobank (EstBB) is made up of volunteers resident in Estonia [18, 19]. A total of 136,421 individuals were genotyped using the Illumina Global Screening Arrays (GSAs) and we imputed the dataset to an Estonian reference created from the whole-genome sequence data of 2244 participants Individuals with BMI values 12 > BMI > 65 were removed, quality control, which included the removal of related individuals was performed resulting in 84,578 individuals remaining for analysis. For this analysis, the ESTBB target data set was randomly split into validation (N = 8456) and testing (N = 76,096) datasets and then utilized in the PRS computation.

Assessment of lifestyle factors in AWI-Gen

Lifestyle factors were captured using questionnaires in AWI-Gen [17, 19]. Physical Activity was captured using the Global Physical Activity Questionnaire (GPAQ). Smoking status was categorized as never and ever smoked. The sum of household assets was used as a proxy of socioeconomic status. Alcohol use was categorized as; never consumed, current non-problematic consumer, current problematic consumer, or former consumer [20]. Problematic drinkers were defined as those who answered two of the following responses based on the CAGE (cut-annoyed-guilty-eye) questionnaire [21]: Have you ever felt that you should cut down on your drinking? Have people annoyed you by criticizing your drinking? Have you ever felt bad or guilty about your drinking? Have you ever had an alcoholic drink first thing in the morning to steady your nerves, or get rid of a hangover? In the past year, did you ever have 6 or more alcoholic drinks in a single morning, afternoon, or night?

Multi-ancestry meta-analysis

We aggregated GWAS summary statistics from the UKBB, BBJ, APCDR Network, and PAGE study using the fixed-effects inverse-variance weighted meta-analysis implemented in METASOFT to generate our multi-ancestry meta-analysis discovery dataset. Notably, we applied double genomic control to control for population structure. The square roots of the LDSC intercepts from the UKBB and Japan Biobank were multiplied with the standard errors of the individual studies for single genomic control. In view that PAGE is a multi-ancestry, it was not plausible to obtain an LDSC intercept representative of diversity. An initial run of the meta-analysis was run using the LDSC-corrected summary statistics. Double genomic controls were then implemented using the lambda from this initial analysis in the subsequent meta-analysis to correct for population structure. Overall, our meta-analysis included 678,671 individuals and 21,338,816 biallelic SNPs, each reported in at least two or more studies.

Associated locus definition in GWAS from UKBB and multi-ancestry meta-analysis

We selected lead SNPs attaining genome-wide significant evidence of association (p < 5 × 10−8) in the two discovery datasets — (1) UKBB (European only) and (2) multi-ancestry meta-analysis — that were separated by at least 1 Mb. Loci were then defined by the flanking genomic interval mapping 1 Mb up and downstream of lead SNPs.

Fine mapping

We performed fine-mapping to identify potential causal variants driving BMI association signals for each locus attaining genome-wide significance in the multi-ancestry meta-analysis using a Bayesian approach23. The Bayes’ factor (BF) for the ith SNP was computed as.

$$_=exp\left[\frac_^-log(_)}\right].$$

In this expression, Ki denotes the number of studies reporting summary statistics for the ith SNP, and \(_=\frac_}(_)}\), where \(_\) denotes the effect size, and SE (\(_)\) is the corresponding standard error for the ith SNP. We then calculate the posterior probability, \(_\), that the ith SNP is driving the association signal at a locus by.

where the summation in the denominator is of all SNPs at the locus. The 99% credible set for the locus was computed by ranking all SNPs according to their posterior probability πi from the highest to the lowest until their cumulative posterior probability reached or exceeded 0.99. We conducted fine-mapping using association summary statistics from the multi-ancestry meta-analysis in the UKBB (European ancestry-specific).

Polygenic score prediction in AWI-Gen and the Estonian Biobank (EstBB)

The PRSice 2 software implemented the clumping and threshold approach for developing PRS. Summary statistics from the UKBB and multi-ancestry meta-analysis were used as “base” datasets, while AWI-Gen (10,900 participants) genotype data were used as the “target” dataset. The optimal parameters (clumping distance and LD) were determined by computing the PRS in the combined dataset at various combinations of clumping distance and LD (Table S4). This target dataset was randomly split into validation (N = 1059) and testing (N = 9809) datasets while ensuring representation by sex and regions of Africa (Fig. 1, Table S1). A clumping distance of 250 kb and r2 of 0.8 where the optimal parameters in AWI-Gen were used to develop the PRS in the testing dataset whilst adjusting for age, sex, and 10 principal components. The best PRS was selected based on the BMI variance explained (see Fig. S1)in the AWI-Gen validation dataset and was used to compute PRS deciles. We used the same procedure to evaluate the performance of the multi-ancestry PRS and UKBB PRS in European ancestry participants from EstBB. The EstBB target data set was randomly split into validation (N = 8456) and testing (N = 76,096) datasets and similar PRS computation parameters were used as had been done in AWI-Gen. Using the same discovery datasets, we trained a PRS using the PRSCSx with the combinations of APCDR, UKBB, and BBJ. We compared this with a PRS trained without the African population (APCDR). In the PRSCx analysis, we used the ancestry-specific GWAS summary statistics with the 1000G references. We evaluated the best linear combination of the training dataset and then evaluated its predictive in the test dataset.

Fig. 1figure 1

The schematic diagram for the UKBB and MAMA discovery data sets that were used to train the UKBB, MAMA, in the Estonian Biobank and AWI-Gen target data set. The MAMA discovery data was used for the computation of the South and West PRS

Interaction of multi-ancestry PRS with sex and lifestyle variables

Boxplots were constructed to show differences in BMI distribution by sex in the AWI-Gen test dataset. Analysis of variance, stratified by sex, was then performed to compare the mean difference in BMI across the deciles of the multi-ancestry PRS. Linear models were used to test the interaction of the multi-ancestry PRS with physical activity, socioeconomic status, smoking status, and alcohol status while correcting for age, sex, and principal components.

PRS prediction across regions of continental Africa

We split the target dataset from AWI-Gen according to geographic region: South (N = 5270), West (N = 3870), and East (N = 1760). Boxplots were constructed to illustrate the distribution of BMI in each region. Then the multi-ancestry PRS prediction was evaluated separately in these three data sets using PRSice while adjusting for age, sex, and residual population structure using five principal components. PRS predictivity was indicated as incremental variance (full model with PRS − null model without the PRS). The distribution of physical activity patterns was evaluated across the African regions using boxplots. The interaction of the multi-ancestry PRS and physical activity in the AWI-Gen validation data set was explored using linear models that correct age, sex, and five principal components in the analysis. An interaction plot was computed using the interactions package in R.

Polygenic prediction of BMI in the West and South regions of Africa

We used the multi-ancestry meta-analysis as a discovery dataset to develop South and West region-specific PRS as they had the largest difference in prediction compared to the East as shown in Fig. 4. The optimal parameters (clumping distance and LD) were determined by computing the PRS separately in the East and South datasets at various combinations of clumping distance and LD (Table S5–S6). We split the South target dataset randomly into validation and testing datasets and then did the same for the West target dataset. A clumping distance of 250 kb and r2 of 0.8, the optimal parameters (Fig. S1), in AWIGen were used to develop the PRS in each testing dataset whilst adjusting for age, sex, and principal components. The best region-specific PRS was then selected based on BMI incremental variance explained (full model with PRS − null model without the PRS) in the region-matched validation dataset. We also tested the South African-specific PRS in the West African testing dataset and the West African-specific PRS in the South African testing dataset (Table S4–S6). We calculated the Pearson correlation coefficient between South and Western allele frequencies for SNPs in the South and West region-specific PRS. We also plotted LD r2 against physical distance in West and South Africa for the same SNPs.

留言 (0)

沒有登入
gif