Analysis of admixed Greenlandic siblings shows that the mean genotypic values for metabolic phenotypes differ between Inuit and Europeans

Study participants

We included Greenlanders from the population health surveys B99 (sample size 1953, recruited 1998–2001) [19], Inuit Health in Transition (IHIT; sample size 2807, recruited 2005–2010) [20] and B2018 (sample size 1236, recruited 2017–2019) [21]. All these cohorts were collected as part of a general population health survey of the adult (18+ years of age) Greenlandic population based on random population samples. Genomic data was utilized to estimate ancestry proportions and to identify full siblings. Based on that, different models were applied to estimate the phenotypic difference and difference in mean genotypic values between Inuit and European genetic ancestry. For comparison, we also included Europeans in the form of 6514 Danish individuals from the population-based Inter99 cohort [22] to directly calculate the phenotypic difference between Inuit and Europeans. Overview of the phenotypes and phenotypic values is available in Table S1.

Anthropometric and biochemical measurements

Anthropometric and biochemical data collection in the Greenlandic cohorts (B99, IHIT, B2018) and the Danish Inter99 cohort have been described previously [19,20,21, 23, 24]. In brief, body weight, height and waist and hip circumference were measured, and body mass index (BMI) and waist-to-hip ratio calculated in both the Greenlandic and the Danish cohorts, and in the Greenlanders also visceral and subcutaneous adipose tissue were measured and lean body mass and fat percentage calculated. In both the Greenlandic and the Danish cohorts fasting levels of HbA1c, serum total cholesterol, high-density lipoprotein (HDL) cholesterol, triglyceride, apolipoprotein A1 (ApoA1) and apolipoprotein B (ApoB) concentrations were measured, and levels of remnant cholesterol and LDL cholesterol were calculated. In the Greenlandic cohorts, participants not treated for diabetes with years of age >18 (IHIT) or ≥35 (B99, B2018), and in the Danish cohort, all participants without known diabetes, underwent an oral glucose tolerance test. To further eliminate the possible effect of treatment for diabetes, individuals with known diabetes were excluded from all analyses of measures of glucose homeostasis. Plasma glucose, serum insulin and serum C-peptide levels were measured at fasting and 2-h after ingestion of the glucose load. Information on birth weight and length were retrieved from midwife records for the Danish cohort, and from medical records at hospitals, including birth records, midwife records and outpatient records as well as information from the central birth register of Greenland for the Greenlandic cohorts.

Whole genome sequencing

A subset of 448 Greenlandic individuals from IHIT and B99, selected based on sampling location independently of phenotype and disease status, underwent whole genome sequencing (WGS; Illumina) with an average sequencing depth of ∼35×. Reads were cleared for adapters using biobambam tools [25] and mapped with BWA-MEM [26] to GRCh38 (bwa mem -t 24 -p -Y -K 100000000). After mapping, duplicated reads were marked. Genotype calling was done using GATK haplotype caller and variant quality score recalibration (VQSR) tools based on the GATK resource bundle [27]. Only variants in the T98 tranche and above were used. The sites were parsed through Plink (v1.90b6) [28], keeping the two most common alleles of multiallelic sites.

Genotyping and imputation

The Greenlandic samples that were not whole genome sequenced were genotyped using Multi-Ethnic Global Array (MEGA chip, Illumina), which was described previously in [7]. After quality control, it comprised about 1.6 million variants. To impute the missing data, we first inferred the relatedness among the WGS individuals using ancestry fractions and inferred ancestry-specific allele frequencies from Admixture [5, 28] with the assumed number of ancestral populations equal to two as input to NGSremix [29], which outputs pairwise relatedness coefficients. We then phased the WGS data using Shapeit2 [30] with trio information inferred from the relatedness results. Next, we used IMPUTE2 [30, 31] to pre-phase the participants with genotyping array data and then imputed them with merged reference of WGS and 1KGP (specific pops are CDX|CEU|CHB|CHS|GBR|TSI|IBS) [32]. The workflow is available at GitHub [33]. Sites in the imputed data with MAF below 0.05, missing call frequencies greater than 0.05 and high LD (r2 > 0.8) were filtered out, resulting in a dataset of 856,924 sites. We used this dataset to estimate ancestry proportions and pairwise relatedness.

Genetic ancestry inference

We estimated ancestry proportions for the Greenlandic sample using ADMIXTURE with the assumed number of ancestral populations equal to two, which is an approach we have previously shown leads to accurate estimates of Inuit and European ancestry proportions [6, 7]. Which of the two ancestral populations that corresponds to Inuit was identified by comparing our results to those from [6], in which a panel of 50 Danes were included to allow for an unambiguous genetic ancestry component assignment.

Relatedness estimation and sibling identification

With the ancestry proportions and genotype data as input, we estimated the relatedness between samples using the RelateAdmix method from [34] as implemented in NGSremix [29]. This method allows estimation of pairwise relatedness between admixed individuals as the fractions of loci sharing 0, 1, 2 alleles identical by descent (IBD; k0, k1, k2) for all pairs of individuals in a manner that takes admixture into account. Since the expected IBD sharing for full siblings as (k0, k1, k2) is (0.25, 0.5, 0.25), we visualised the distribution of k2 along k1 and chose the cluster with k1 between 0.25 and 0.75, and k2 between 0.15 and 0.5 as full siblings. Some pairs near the boundary, which had k1 between 0.4 and 0.6, and k2 above 0.125, were included because they are both full siblings to other individual(s) within the former cluster.

To estimate the genetic ancestry of the parents of the siblings, we also used NGSremix, which has an implementation of the method apoh [35] that provides such estimates.

Estimation of the difference in mean phenotype values and in mean genotypic values between Inuit and Europeans

To quantify the phenotype differences between Inuit and Europeans, we first regressed the phenotype values with ancestry proportions of all the individuals from the admixed full-sibling pairs using the following regression model:

$$_i=\alpha +_Y_i+_A Ag_i+_S Se_i+_i,$$

(model 1a)

where Yi and Qi denote the phenotype value and the Inuit ancestry proportion, respectively, for individual i. α is the intercept. Recruitment age Agei and sex Sexi for individual i are included as covariates and their effect size are denoted as βA and βS,respectively. The residual is represented as ϵi with ϵi ~ N(0,\(_^2\)) and importantly, βY is an estimate of the difference in the mean phenotype value (∆Y) between Inuit and Europeans. For clarity, we denote the estimated βY as \(\hat_}\) in the “Results” section.

Similarly, we estimated the difference in mean genotypic value between Inuit and Europeans using the following regression model:

$$\partial _j=\alpha +_G\partial _j+_\partial Ag_j+_ Ag_+_\partial Se_j+_ Se_+_j,$$

(model 1b)

where ∂yj is the phenotype value difference between the two siblings in admixed full-sibling pair j and ∂Qj is the difference of Inuit ancestry proportion between the two siblings in admixed full-sibling pair j. The difference in age (∂Agej), and the age of the first sibling (Agej1) are included as covariates. Similarly, their sexes are also included with ∂Sexj denoting the difference of sexes between them and Sexj1 denoting the sex of the first sibling. α is the intercept and the residual is ϵj ~ N(0,\(_^2\)). Importantly, here the estimate for βG is an estimate of the difference in mean genotypic value (∆G) between Inuit and Europeans. For clarity, we denoted the estimate of βG as \(\hat_}\) in the “Results” section.

The basic models above (model 1a and 1b) do not take into account that some of the sibling pairs are from the same family. Therefore, we also explored a model similar to the one recently used to evaluate within family polygenic score prediction [8]. Specifically, we explored the following regression model which adjusts for the family effect and overcome the correlation of multiple sibling pairs within a family:

$$_=\alpha +_W\left(_-\overline\right)+_B\overline+_ Ag_+_ Se_+_j+_,$$

(model 2)

where yij and Qij denote the phenotype value and Inuit ancestry proportion, respectively, for the admixed full sibling i in family j, and \(\overline\) is the mean Inuit ancestry proportion in family j. α is the intercept. Age and sex are included as covariates and their effect size are denoted as βage and βsex, respectively. γj denotes a family random effect with γj ~ N(0,\(_^2\)). The residual is represented as ϵij with ϵij ~ N(0,\(_^2\)). Notably, in this model βB is the difference in mean phenotypes for Inuit and Europeans (∆Y) and βW is the difference in mean genotypic value for Inuit and Europeans (∆G). For clarity, we denoted βB as \(\hat_}\) and βW as \(\hat_}\) in the “Results” section.

Simulation-based sensitivity analysis

First, we investigated the direction and extent of the potential bias induced by ancestry-by-environment interactions on the estimates of the difference in mean phenotypic values (∆Y) and mean genotypic values (∆G). We first simulated a scenario, where the environment is correlated with the family’s average ancestry proportion and each sibling’s phenotype value is as \(_\sim N\left(_\Delta G+_\overline,^2\right)\), where Eanc is the ancestry-by-environment interaction term. Because we do not expect siblings or their surroundings to know which of them have the higher or lower proportion of a certain ancestry, then we expect an ancestry-by-environment interaction to act on a family level and not on the individual level. However, if there is effect on the individual independently of the family, then we expect it to have an impact on our estimates. Therefore, we also simulated ancestry-by-environment interaction on the individual sibling’s admixture proportion level such that yij~N(Qij∆G + EancQij, σ2). We explored varying values for Eanc with ∆G equals to 1 and σ equals to 1. We estimated the ∆Y using model 1a and estimated the ∆G using model 1b and compared them to the true difference in mean genotypic values. We used estimated Qij from our real data. We performed 10,000 simulations for each value of Eanc.

To further explore possible bias, we simulated so-called participation bias where individuals with a higher phenotype value are less likely to participate in the study. This was done by choosing a threshold to define what a high phenotype value is. If an individual has a phenotype value above this threshold, there was a 20% chance that the individual would not participate in the study. If one individual does not participate, then the sibling pair containing this individual will be removed from estimation of ∆G. For families with more than 1 sibling pair, the remaining siblings who all participate will be kept. The phenotype values for all siblings were simulated as yij~N(Qij∆G, σ2). To keep the number of individuals and sibling pairs the same regardless of the degree of participation bias, a new phenotype was simulated for non-participating siblings based on their genetic ancestry until it was below the threshold phenotype value. We simulated different thresholds for having a high phenotype based on the quantiles of the simulated phenotypes. Similarly, we also explored another kind of participation bias where participation in the study depends on the phenotype and ancestry of the mother. The mothers’ ancestry proportion was estimated by NGSremix and was used to simulate the phenotype values of the mothers.

Finally, we performed simulations to investigate to what extent errors in the estimation of Q might affect our estimation of ∆G. The phenotypes were simulated as yij~N(Qij∆G, σ2), but errors were added to Qij when estimating ∆G. We performed 1 million simulations where the estimation error of Q was as \(_Q\sim N\left(0,_^2\ \right)\) and \(_^2\) varied between 0 and 5%. We quantify the estimation error of ∂Q in our data to be 0.24% based on F1s (see how the error of ∂Q was estimated below), so our simulation range also includes values of estimation errors many folds larger than our estimated error. From there we estimated ∆G and reported the deviation from the true ∆G along with the absolute mean error in the difference of Inuit genetic ancestry. We tried 100 values randomly between 0 and 0.05 for \(_^2\) and estimated ∆G using model 1b.

Estimation of average error in estimates of ∂Q

To estimate the average error in our estimation of the difference in ancestry proportions within each pair of full siblings, we took advantage of the fact that the dataset contained a number of F1 full-sibling pairs. For these pairs, each sibling must per definition have exactly 50% genetic ancestry from Inuit and exactly 50% genetic ancestry from Europe and true difference in their ancestry proportions must therefore be 0. This means that any estimated difference in genetic ancestry is due to error alone and we therefore estimated the average error in the estimated difference in ancestry proportions as the average absolute difference in genetic ancestry within all F1 sibling pairs.

留言 (0)

沒有登入
gif