The hazards of genotype imputation in chromosomal regions under selection: A case study using the Lactase gene region

1 INTRODUCTION

Imputation of missing Single Nucleotide Polymorphism (SNP) results (Marchini & Howie, 2010), originally made necessary by the inclusion of different SNPs in commercial genome-wide genotyping platforms, as a way of combining data from different sources, continues to be widely used in disease association studies (see e.g., Bycroft et al., 2018). Many claims have been made about its high level of accuracy (Howie et al., 2012; Howie et al., 2011). Imputation has also been used in the context of population genetics, (Hellenthal et al., 2014; Ilardo et al., 2018) and for ancient DNA where coverage is low (Allentoft et al., 2015; Martiniano et al., 2017). It has been suggested that incorrect imputation can occur for SNPs adjacent to or within regions of the breakdown of Linkage Disequilibrium (LD) (Weng et al., 2014).  Here we address another issue, namely that of variants that have become frequent only recently, in particular modern populations, due to selection. We investigate this in relation to variants present in an enhancer upstream of the lactase gene, LCT.

Functional regulatory variants in this LCT enhancer have emerged at a detectable frequency only in the last 5000 years or so, as assessed from ancient DNA samples (Mathieson et al., 2018), haplotype approaches (Bersaglieri et al., 2004; Coelho et al., 2005) and also using modelling approaches (Itan et al., 2009). The recent history of these variants, with little time for recombination, combined with the selection that has increased their frequency, means that they each occur on very extended haplotypes (Bersaglieri et al., 2004; Liebert et al., 2017; Poulter et al., 2003; Ranciaro et al., 2014). So far, five variants have been convincingly confirmed as functional ((Liebert et al., 2016) and references therein), of which one, rs4988235 (−13910*T), has been studied in most detail and is present predominantly in Europeans (reviewed in Ingram et al., 2009; Segurel & Bon, 2017). Imputation has been used to help assess its presence in ancient DNA samples with missing data (Allentoft et al., 2015; Ilardo et al., 2018). The apparent detection of lactase persistence (LP) in samples of bronze age pastoralists from the western Steppe region contributed to the belief held by some that LP arose in that population, even though the evidence for this is not strong, since the early occurrences (∼4000 BP) were found in western Europe as well as Ukraine (Mathieson et al., 2015, 2018).

Here we test the accuracy of imputation of 10 LP enhancer region SNPs by comparing with true genotyping results, using modern DNA samples. While 95% of the genotypes are correct the 5% that are wrongly imputed or fail to impute are non-random with respect to gain or loss of an allele or genotype. We examine the haplotype pairs of these individuals in an attempt to understand the basis of this bias.

2 METHODS 2.1 Strategy

The global 1000 Genomes Phase III data (1000 Genomes Project Consortium et al., 2012, 2015) were used as a reference panel for all the imputations we conducted, with all groups combined as recommended. For the study panels, we used two different datasets (Population studies 1 and 2). Each study provided a “validation” set of directly genotyped (true) genotypes that were then compared with the genotypes we obtained after imputation using the SHAPEIT2 +IMPUTE 2 approach (Marchini & Howie, 2010). The overall experimental strategy is shown in Figure S1A, B.

2.2 Population Study 1

Our genotype and sequencing data were available for 52 SNPs (Table S1A, B, which shows the full nomenclature of key SNPs) covering a 1.77Mb region (from rs1446525 at chr2:135637847 to rs6711718 at chr2:137407012; build GRCh37/hg 19), from multiple populations (Liebert et al., 2017). Figure 1 shows the positions of the genotyped SNPs and genes, in relation to a Linkage Disequilibrium Unit (LDU) map of the 1.77 Mb and 500 bp genomic intervals on chromosome 2. Forty-two of the 52 SNPs in the study panel were used as scaffold SNPs (blue diamonds) for imputing genotypes for the 10 markers (red diamonds) within the 500 bp segment covering the enhancer region. Scaffold SNPs were selected so that they were distributed to take into account LD, rather than physical distance (Liebert et al., 2017). Plotting the high-resolution genetic maps with distances expressed in LDU on the physical map shows the non-linear relationship between the two and the underlying structure of LD, which is typically a “Block-Step” structure. LDU blocks represent areas of conserved LD and low haplotype diversity, while Steps (increasing LDU distances) define LD breakdown, primarily caused by recombination since crossover profiles agree precisely with the corresponding LDU steps (Maniatis et al., 2007).

image (A) Diagram showing the positions of the genotyped SNPs and genes, in relation to the LDU map of the 1.77 Mb and 500 bp genomic intervals on chromosome 2 (build 37; hg19). Physical location on chromosome 2 is shown on the horizontal axis and genetic location in linkage disequilibrium units (LDU) on the vertical axis. The maps were constructed using HapMap data for the six populations YRI (grey), LWK (green) ASW (purple), MKK (blue), TSI (orange), CEU (turquoise). Upper plot. 52 SNPs used in Population study 1 are represented by blue and red diamonds. Blue SNPs were used as a scaffold for imputation and red SNPs were masked. The larger black diamond shows rs182549 Lower plot. Shows the position of the 10 analysis (masked) SNPs with rs IDs, in the 500 bp interval (maps ASW, MKK, TSI, CEU) within an intron of MCM6. Sequence position from Build 37. Note that there is no evidence of recombination (no change in LDU location) in this small region in any of the groups. (B) Core haplotype network showing the LP variants (with literature nomenclature) and haplotypes mentioned in this paper. Continuous lines represent a single nucleotide change; dotted lines represent several changes and /or recombinations. Core haplotype names follow those used previously (Liebert et al., 2017) and colour code follows colours in Table S7

The 802 individuals included in this study were of “European”, “Middle Eastern” and “African” origin (Liebert et al., 2017), and further details about these data collections are given in the Supporting Information of that paper (see Liebert et al., 2017, Table S2b for samples). Diplotypes of the samples tested are also shown in the Supporting Information of the current paper. For imputation, the three continental groups were phased and imputed independently.

2.3 Population Study 2

As an independent analysis, public datasets from HapMap were used. SNP genotype data for the same 1.77 Mb segment of chromosome 2 were obtained for individuals from the HapMap populations ASW, CEU, GIH, LWK, MXL, TSI, YRI, CHB and JPT, for which there was rs4988235 SNP data https://www.sanger.ac.uk/resources/downloads/human/hapmap3.html. These data were used [rather than whole genome sequence (WGS) data] to mimic a more realistic imputation experiment where the study panel is of lower resolution than the WGS reference panel. Quality control (QC) cut-offs were applied to each population, MAF 0.01,  geno (missingness/SNP) 0.1, mind (missingness per individual) 0.1, HWE 0.001 and applied to all scaffold SNPs. These filtered populations were then merged to produce one population (n = 475) in which the 490 overlapping SNPs were retained and used for imputing rs4988235. Note that the reference panel includes most of these same HapMap samples which should provide a better opportunity for the software to impute the missing data correctly.

2.4 Imputation

The scaffold SNPs for the Population studies 1 and 2 were phased with SHAPEIT2 (Delaneau et al., 2013) and all the untyped and masked SNPs in the full 1.77 Mb region were then imputed with IMPUTE2 (Howie et al., 2011) using the 1000 Genomes data as the reference panel (1000 Genomes Project Consortium et al., 2012, 2015; Delaneau, Marchini, and 1000 Human Genomes Project Consortium, 2014). Imputed genotypes were called with GTOOL (http://www.well.ox.ac.uk/~cfreeman/software/gwas/gtool.html) using a calling threshold of 0.9 (so-called “hard” calls), and later for comparison using a less stringent threshold of 0.8.

2.5 Haplotype Pairs

To obtain individual haplotype pairs for Population study 1, 52 SNPs (Table S1A,B) were also phased using PHASE (Stephens & Donnelly, 2003; Stephens et al., 2004) and assigned to the core haplotype groups of Hollox and colleagues (2001) as reported in Liebert et al. (2017) using the key markers listed in Table S1A. The three continental groups were combined as done previously (Liebert et al., 2017). The diplotypes can be found in the Supporting Information.

For Population study 2 the 40 SNPs spanning 653 kb (Chr2:136324225-136976936, GRCh37/hg19) in the HapMap dataset were also phased as one group, using PHASE, and haplotypes assigned to those of Hollox (Hollox et al., 2001), Table S1A, using informative overlapping SNPs (rs2304370 and rs3739022, as well as rs4988235). As an independent test of phasing, and for direct comparison with dataset 1, we extracted from the 1000 Genomes sequence data (https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502) all the populations in which there were erroneous or missing imputations (CEU, ASW, GIH, MXL, TSI) as well as three others not included in the SNP dataset, GBR, FIN and IBS that segregate rs4988235 at good frequency. We used 44 of the same 52 SNPs as in Population study 1 (i.e., excluding the rare enhancer SNPs) from the full 1.77 Mb region (see Table S1A). The aim was to generate the longer phased 1.77 Mb haplotypes as done for our own data, using the same method, namely PHASE.  Thirty-one additional SNPs in the LCT/MCM6 region were added to improve accuracy (total of 75 SNPs, see Table S1A), while keeping computational time manageable.

3 RESULTS

For Population Study 1 pairwise D’ values across the core LCT/MCM6 haplotype region show the previously reported similarity and very low level of recombination across this chromosomal region, not only across continental groups, but when subdivided by linguistic group (Fig. S2). Figure 1 likewise shows the similarity of the LD profile in different Hapmap population groups.

3.1 Imputation of the Masked SNPs of Population Study 1

Initially, all 42 phased “scaffold” study SNPs were used for imputation. The first important observation was that two of the 10 SNPs that were to be imputed (rs41525747 and rs869051967), of which there were 66 and 32 derived allele occurrences, respectively, in the directly genotyped data, somewhat surprisingly were not present in the reference 1000G data. Therefore they could not be imputed within the test sample (Table S2), leaving eight SNPs that could be imputed.

For these eight SNPs there was a total of 98 wrongly imputed genotypes in the 802 samples. While the overall discordance rate is quite low (1%–3% per SNP; Table 1, Table S2), this is not a complete reflection of the true situation, as summarised in Table 1. In most cases, the errors were asymmetric, so that in the case of rs4988235 most were due to incorrect imputation of the presence of the derived allele where it is not present in the directly genotyped data (14/16), while other SNPs were in the other direction (i.e., failure to impute the derived allele), for example, rs56348046 (11 out of 15) and rs41380347 (six out of seven).

TABLE 1. Incorrect allele calls and missing data in the first imputation study, Population study 1. Data expressed as gains and losses of derived allele, and % error or missingness shaded in grey; total counts for each group in bold. * See Supporting Information for details of total counts for each locus; genome positions are shown. Details of the correct genome nomenclature are shown in Table . For example −13910 C>T is rs4988235, or genomic reference NC_000002.11:g.136608646G>A or ClinVar ref. NM_005915.6:c.1917+326C>T (MCM6:c.1917+326C>T) SNP literature name −13495 C>T −13603 C>T −13730 T>G −13910 C>T −13913 T>C −13915 T>G −14010 G>C −14011 C>T SNP rs ref rs4954490 rs56348046 rs4954492 rs4988235 rs41456145 rs41380347 rs145946881 rs4988233 Overall Minor Allele Frequency 0.240 0.083 0.027 0.063 0.004 0.128 0.017 0.002 Incorrect allele calls and percentages Population subdivision N gain|loss % gain|loss % gain|loss % gain|loss % gain|loss % gain|loss % gain|loss % gain|loss % South Europe 15 0|1 7 - - - - 2|0 13 - - - - - - 0|1 7 Northwest/Central Europe 38 1|0 3 - - - - 2|0 5 - - - - - - - - East/Southeast Europe 39 6|0 15 1|0 3 - - 6|0 15 - - - - - - - - Total Europe 92 8|0 9 1|0 1 - - 10|0 11 - - - - - - 0|1 1 Total Middle Eastern 327 4|5 3 1|6 2 4|0 1 4|1 2 0|2 1 1|4 2 1|0 0 0|2 1 North Africa 102 1|1 2 1|3 4 1|1 2 0|1 1 0|1 1 0|2 2 1|0 1 2|0 2 East Africa 241 2|3 2 1|1 1 3|2 2 - - 0|3 1 - - 4|1 2 5|0 2 West Africa 40 1|0 3 0|1 3 1|0 3 - - - - - - - - - - Total Africa 383 4|4 2 2|5 2 5|3 2 0|1 0 0|4 1 0|2 1 5|1 2 7|0 2 Total incorrect allele calls 98 16|9 3 4|11 2 9|3 1 14|2 2 0|6 1 1|6 1 6|1 1 7|3 1 Grand Total 802* 25 15 12 16 6 7 7 10 Missing genotype counts and percentages Total missing genotypes 218 27 4 26 3 15 2 2 0.2 0 0 80 10 17 2 51 6 Missing; carrier|non-carrier 13|14 7|19 15|0 2|0 0|0 79|1 6|11 1|50

Most dramatically, the samples that evaded imputation (and returned as “missing”) were also skewed with respect to genotype. In addition to the 2 SNPs that could not be imputed at all, a total of 218 further genotypes were not imputed, that is, 0%–10%, (Table 1), and were thus missing from the imputed dataset (at a calling threshold 0.9), and in most cases this was not evenly distributed between carriers and non-carriers of the derived alleles. This effect was particularly prominent for rs41380347 (–13915T>G), where 79 out of 80 samples that failed imputation were either heterozygous (n = 53), or homozygous (n = 26) for the derived allele. The summary of this is shown in Table 1 and details in Table S2. It should be noted that this anomaly occurred notwithstanding having an “info score” of more than 0.9 in many cases (Table S3, see particularly rs41380347 for Middle Eastern and African samples). Table S3 also shows allele frequencies in the different continental groups compared with the 1000 Genomes samples, where the MAF for rs41380347 is 0.0006 in the reference panel.

In order to check the effect of a probability threshold for imputation, we lowered the stringency of the cut-off to 80% for calling imputed genotypes. The results were varied. For example, in the case of rs41380347, the missing calls for almost all derived allele carriers were replaced by correct calls. However, this was not the case for most other SNPs, where the number of incorrect calls increased. In the case of rs4988235 the results were identical. This is summarised in Table S4.

3.2 Examination of haplotype background of rs4988235 (−13910C>T): Population Study 1

Haplotypes determined by PHASE showed that all 104 chromosomes carrying the derived allele for rs4988235 share the same core “A” haplotype (Hollox et al., 2001; Poulter et al., 2003), which is extended over more than 200 kb in most cases (Liebert et al., 2017). This haplotype also carries the derived allele of rs4954490, an SNP that is within the analysis region, and in 103/104 cases also the derived allele of rs182549 C>T (also known as –22kb G>A in the lactase literature), an SNP outside the analysis region (Fig. 1).

3.3 rs4954490, rs182549 and LCT core haplotypes in relation to errors in imputation of rs4988235

The haplotype pairs determined by PHASE and assigned to the core haplotypes of Hollox (Hollox et al., 2001; Poulter et al., 2003) for each of the 16 discordant individuals are shown in Table 2.

TABLE 2. LCT core haplotype pairs of the 16 discordant individuals in Population study 1, showing sequenced and imputed genotypes for rs4988235, and PHASE derived haplotypes. More detail is shown in Table . The capital letters denote the Hollox core haplotype defined by the SNPs shown in Table . A-der = + derived allele rs4988235 +derived allele rs182549; rA-anc = ancestral allele rs4988235 +derived allele rs182549; C-der = + derived allele rs41380347 Person Continent Sequenced genotype Imputed genotype Haplotype 1 Haplotype 2 1 South Europe GG AG rA-anc P-like 2 South Europe GG AG B rA-anc 3 Northwest/Central Europe AG AA rA-anc A-der 4 Northwest/Central Europe AG AA A-der rA-anc 5 East/Southeast Europe GG AG rA-anc D 6 East/Southeast Europe GG AG rA-anc C 7 East/Southeast Europe GG AG B rA-anc 8 East/Southeast Europe GG AG rA-anc A-anc 9 East/Southeast Europe GG AG rA-anc A-anc 10 East/Southeast Europe GG AG rare rA-anc 11 Middle East GG AG rA-anc P 12 Middle East GG AG P rA-anc 13 Middle East GG AG C rA-anc 14 Middle East GG AG B rA-anc 15 Middle East AG GG C-der A-der 16 North Africa AG GG A-der P

留言 (0)

沒有登入
gif