We have developed a customized SEA population-specific reference panel consisting of 2,550 samples via cross-panel imputation, resulting in 113,851,450 variants. Our reference panel successfully imputed the genotyping data for OA and other SEA populations. Our analysis revealed that BEAGLE5 imputed more rare variants; however, owing to its capability of including monomorphic variants, IMPUTE5 was included in our preferred pipeline.
Although genotype imputation offers a cost-friendly way to generate dense data for genomic analyses, there is no appropriate reference panel that allows accurate and rare variant-rich imputation for underrepresented populations from SEA. Imputation of rare variants is of special interest because they are often difficult to impute, despite being biologically important for disease association studies owing to the effect size and impact they often carry26,27,28. To address the lack of a reference panel and of a comprehensive imputation study for underrepresented populations, especially the SEA populations, this study evaluated and compared the performance of different combinations of imputation tools and reference panels.
The reference panel in this study was developed using a hierarchical, or cross-panel, imputation scenario, whereby one reference panel is imputed with another reference panel, and vice versa. This method not only increases the volume of each dataset, but also allows variants that are specific to each dataset to be transferred to the other. Without imputation, taking only the overlaps between datasets may also generate a reference panel, but the resulting dataset would have far fewer variants, as shown in Supplementary Table 4, which may imply that the impact of the studied sample size offsets the impact of samples with closer genetic ancestry.
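The difference between a plain overlap and a cross-imputed overlap can be illustrated with a toy sketch. It is not the pipeline used in this study: the variant identifiers, the `well_imputed` set (standing in for variants that pass a quality filter after imputation) and the helper `cross_impute` are all hypothetical, and real imputation operates on phased haplotypes rather than on sets of variant IDs.

```python
def cross_impute(target, reference, well_imputed):
    """Toy stand-in for imputation: the target panel gains every
    reference variant that is assumed to be imputed well enough."""
    return target | (reference & well_imputed)


# Hypothetical variant IDs carried by two panels.
panel_a = {"v1", "v2", "v3", "v4"}
panel_b = {"v3", "v4", "v5", "v6"}

# Hypothetical set of variants that would pass a post-imputation filter.
well_imputed = {"v1", "v2", "v5"}

# Overlap taken directly, without any imputation.
naive_overlap = panel_a & panel_b

# Each panel is imputed with the other, then the overlap is taken.
a_imputed = cross_impute(panel_a, panel_b, well_imputed)
b_imputed = cross_impute(panel_b, panel_a, well_imputed)
cross_overlap = a_imputed & b_imputed

print(sorted(naive_overlap))
print(sorted(cross_overlap))
```

In this toy case the cross-imputed overlap retains panel-specific variants (such as `v1` or `v5`) that the direct overlap discards, which is the behaviour the paragraph above describes.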
The overlaps of each cross-imputed dataset were taken as a new reference panel and used to impute other datasets. To test the performance of this reference panel, we imputed a set of 46 masked OA whole genome sequencing samples and measured their concordance against the true called genotypes. With the same number of samples in the reference panel, our analyses provided supporting evidence that a panel of similar ancestry imputes genotyping data with better outcomes and confidence scores. The merged dataset of GA100K and SG10K exhibited a lower NRD rate than the 1KGP dataset (Supplementary Fig. 3, Supplementary Table 7). Across rare and common variants, in general, our reference panel had a ~3-5% lower NRD rate for rare variant imputation and a ~2% lower rate for common variants, consistent with observations in other studies29,30. Based on this result, we proceeded to merge the 46 OA whole genome sequencing samples into the GA100K-SG10K panel via cross-imputation and then used the combined dataset as the SEA-specific reference panel for genotyping array data imputation.
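As an illustration of how an NRD rate can be obtained from masked genotypes, the sketch below assumes the commonly used definition of non-reference discordance, in which correctly imputed homozygous-reference calls are excluded from the denominator. The function name `nrd_rate`, the 0/1/2 genotype coding and the example values are assumptions made for illustration; the exact computation used in this study is described in the Methods, not here.

```python
def nrd_rate(true_gts, imputed_gts):
    """Non-reference discordance for biallelic genotypes.

    Genotypes are coded as the number of alternate alleles (0, 1, 2);
    None marks a missing call and is skipped.
    NRD = mismatches / (mismatches + matching het + matching hom-alt),
    so matching hom-ref calls do not inflate the apparent accuracy.
    """
    mismatches = 0
    matching_non_ref = 0
    for truth, imputed in zip(true_gts, imputed_gts):
        if truth is None or imputed is None:
            continue
        if truth != imputed:
            mismatches += 1
        elif truth in (1, 2):
            matching_non_ref += 1
    denominator = mismatches + matching_non_ref
    return mismatches / denominator if denominator else float("nan")


# Hypothetical masked truth set and imputed calls for one sample.
truth = [0, 0, 1, 1, 2, 2, 0, 1]
imputed = [0, 0, 1, 2, 2, 2, 1, 1]
nrd = nrd_rate(truth, imputed)
print(f"NRD = {nrd:.3f}")
```

Excluding matching homozygous-reference calls matters most for rare variants, where nearly all genotypes are homozygous reference and a naive concordance would look high regardless of imputation quality.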
We also considered other methods that similarly combine multiple study cohorts into reference panels, one of which is meta-imputation. Instead of “merging” the reference panels, meta-imputation combines the imputed results from multiple reference panels. We performed this imputation to benchmark our approach and weighed the pros and cons of the method, as detailed in our supplemental materials (Supplementary Table 5, Supplementary Table 11). We observed that for marginalized populations like OA, cross-panel imputation is preferred, while for SGVP or other mainstream populations, the performance is comparable. Hence, we acknowledge that there is no perfect method to generate a reference panel, and the choice depends on the individual researcher’s preference.
Our findings also supported the notion that imputation using a reference panel that shares close genetic ancestry produces a better imputation yield than the 1KGP or TOPMED panels. We acknowledge, however, that the ‘SEA-specific reference panel’ has a larger number of variants than the 1KGP and thus plausibly allows more variants to be imputed into the genotyping dataset. To normalize this difference, we measured the coverage of the imputation and concluded that the proportion of the imputed dataset is of high confidence and passed the internal filter threshold. Our analysis also suggested that the SEA-specific and 1KGP reference panels have a comparable proportion of accurately imputed variants, supporting the notion that 1KGP performs well in the imputation of common variants in the OA population. We also benchmarked the performance of the SEA-specific reference panel against GA100K and SG10K individually (Supplementary Fig. 8, Supplementary Table 4). For a fair comparison, we took the full SG10K dataset (N = 4,810) and randomly subsampled it to 2,500 samples; for GA100K, the available samples were limited to 1,099. Despite its smaller sample size, the GA100K data performed comparably to the 1KGP dataset, again supporting the importance of ancestry in the reference panel. When compared with an even larger database such as TOPMED, the result consistently showed that ancestry relatedness between the reference panel and the study dataset plays a crucial role. TOPMED, with more samples and denser variants, did not outperform the SEA-specific panel, imputing only close to 4.4 million variants, even fewer than 1KGP. Interestingly, however, this imputation was more accurate than that with 1KGP, with an average NRD of approximately 4% versus 7% for 1KGP. This implies that imputation accuracy may be influenced by a larger and more diverse reference panel.
Note that imputation with TOPMED was performed on the Michigan Imputation Server with Minimac4.
When the imputed datasets were segregated into different allele frequency bins, imputation using the SEA-specific reference panel generated 56.25% more lower-frequency variants, with higher confidence. In addition, the average confidence score in each bin showed a clear accuracy difference between the two panels for the imputed OA dataset. This is because rare variants found in a population are more likely to occur in stretches of haplotype that belong to the same ancestry8, underscoring the advantage of the SEA-specific reference panel over the 1KGP. We also evaluated the impact of the marker density of the target genotypes (Supplementary Table 10). When the OMNI and AFFY data were imputed individually and then merged after imputation, more SNPs were generated, with better INFO scores. When more variants are present before imputation, genotypes can be predicted more confidently, because there are more ‘hints’ for the imputation algorithm to infer what the haplotypes are. This is expected, as imputation works by haplotype matching between the imputed dataset and the reference panel. However, in terms of the rare alleles produced, imputing the merged dataset provided more rare variants. We believe that this observation is due to phasing accuracy, which increases with the number of samples phased31,32,33, whereas utilizing more variants may not be very informative if they provide similar information to the imputation tool34. Better phasing accuracy can then lead to better imputation performance. In addition, when the SEA panel was used as the reference panel, the initial markers before imputation played a less important role, and the number of imputed SNPs was similar. This has interesting implications for studying marginalized populations and for cost planning, as cheaper, less dense chips can perhaps be used more effectively.
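The per-bin comparison described above can be sketched as follows. This is a minimal illustration, not the analysis code of this study: the bin edges in `BIN_EDGES`, the `info_threshold` of 0.8 and the example (MAF, INFO) pairs are hypothetical values chosen only to show the bookkeeping.

```python
from collections import defaultdict

# Hypothetical minor allele frequency (MAF) bin edges.
BIN_EDGES = [0.0, 0.001, 0.01, 0.05, 0.5]


def maf_bin(maf):
    """Return the (low, high] bin containing maf, or None if the variant
    is monomorphic (maf == 0) or outside the expected range."""
    for low, high in zip(BIN_EDGES, BIN_EDGES[1:]):
        if low < maf <= high:
            return (low, high)
    return None


def summarize_by_bin(variants, info_threshold=0.8):
    """Count variants, count those at or above the threshold, and
    average the confidence score within each MAF bin."""
    scores = defaultdict(list)
    for maf, info in variants:
        bin_key = maf_bin(maf)
        if bin_key is not None:
            scores[bin_key].append(info)
    summary = {}
    for bin_key, values in sorted(scores.items()):
        summary[bin_key] = {
            "n": len(values),
            "n_pass": sum(1 for v in values if v >= info_threshold),
            "mean_info": sum(values) / len(values),
        }
    return summary


# Hypothetical imputed variants as (MAF, INFO score) pairs.
imputed = [(0.0005, 0.6), (0.0008, 0.9), (0.005, 0.85),
           (0.03, 0.95), (0.2, 0.99), (0.4, 0.97)]
summary = summarize_by_bin(imputed)
for bin_key, stats in summary.items():
    print(bin_key, stats)
```

Running the same summary on the output of two reference panels and comparing `n_pass` and `mean_info` bin by bin gives the kind of contrast reported for the lower allele frequency bins.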
We also imputed the SGVP genotyping data as a comparison to the OA genotyping data. In the SGVP genotyping dataset, imputation with the SEA-specific reference panel and with 1KGP yielded comparable results (Table 4), but the SEA-specific reference panel distinguished itself from the 1KGP panel in capturing rarer variants. We propose two possible explanations. First, almost all the populations are represented in the 1KGP reference panel except the Malays, who share a significant portion of their genetic make-up with East Asian and SEA ancestries26; the panel is therefore able to capture the majority of the common variants. Second, the SGVP genotyping dataset being imputed contained populations of South Asian and East Asian ancestry in addition to the Malays, who also carry the ancestry component of mainland SEA26. The same pattern was observed when the SGVP was imputed with the TOPMED reference panel. Although TOPMED is a larger resource, its imputation result is comparable to those of 1KGP and 1KGP-HGDP (Supplementary Table 9). Again, the inclusion of the Malay population could be the factor that differentiates imputation performance, by utilizing haplotype segments unique to the Malay population. Furthermore, we observed that the impact of the choice of reference panel may be diminished, provided that the reference panel contains a sufficiently large number of variants.
When we compared the OA and SGVP datasets, the importance of reference panel choice became clearer. For an underrepresented population, such as the SEA populations, a panel made specifically of populations with closer genetic ancestry can boost the imputation yield by approximately 30% across the whole genome and by about 60% for variants with lower allele frequencies (Table 2, Fig. 2). This finding is also supported by a higher average confidence score in the lower allele frequency bins, where the imputation of rare variants is known to be more difficult (Fig. 1).
Previous studies have shown that OA have a unique genetic makeup owing to long periods of isolation, admixture, population bottlenecks, and a complex population history20. In such cases, complex indigenous populations are hard to represent in a reference panel using other populations; the OA population has to be imputed with an OA panel, because their haplotypes have become unique to them. This situation is often worsened by a lack of funding in many low- and middle-income countries to generate sequencing data large enough to be included in a reference panel. However, with the reference panel creation method described here, panels for less-studied populations become attainable: researchers may cross-impute a sequencing dataset of a population of interest into a larger panel, gaining the benefits of a large number of samples together with population-specific haplotypes and haplotype diversity.
In terms of the tools used, BEAGLE5 and IMPUTE5 performed comparably. BEAGLE5 yielded more polymorphic variants in almost all imputation scenarios when samples were merged before imputation. However, when individual genotype data were imputed, the performance of rare variant imputation dropped. This may hint that the BEAGLE5 algorithm favors a larger number of samples over variant density28; after all, not all SNPs are informative of the haplotype34. We would also like to highlight that BEAGLE5 excludes monomorphic variants. While in most practices the monomorphic markers in the study dataset are not considered, these variants may be polymorphic in other populations when datasets are merged, as shown in Supplementary Table 12. This is one underlying reason why IMPUTE5 was preferred when merging GA100K and SG10K. IMPUTE5 produces more monomorphic SNPs owing to its “Lazy imputation” procedure20, which is useful for capturing overlapping variants in a hierarchical imputation setting. However, we acknowledge that the imputation of monomorphic variants remains a challenge that is still relatively poorly studied, as they can be “hit-or-miss”30. As such, we limit the relevance of monomorphic variants solely to merging datasets between populations. We do not discount the advantages of BEAGLE5, especially its capability of inferring more rare variants.
While the performance appears promising, we acknowledge the limitations of the SEA-specific reference panel in this study. Essentially, there is no ideal approach to estimating the imputation accuracy of the panel. While our reference panel yielded more variants after filtering, we note that the TOPMED reference panel yielded a lower discordance rate. In addition, estimated quality scores (INFO, DR2, R2) are helpful for distinguishing well-imputed from poorly imputed variants, but may not be comparable across different reference panels, especially when the numerical differences are small. Larger reference panels such as TOPMED may have more conservative quality scores because more observed information and more possibilities lead to less confidence (lower quality scores). Perhaps the gold standard would be to leverage sequencing data for comparison; however, such data were not available for this study.
In conclusion, we have developed a SEA-specific reference panel containing 2,550 samples with 113,851,450 SNPs. We also provided further supporting evidence that the choice of reference panel is an essential component of the imputation process, especially when studying underrepresented populations such as OA. The SEA-specific reference panel that we have developed is expected to perform arguably better when imputing Southeast Asian populations, as demonstrated by the genotyping data of OA, Malays, and other SEA populations. Although the performance of the SEA-specific reference panel was simulated only with SNP-array data, we believe that imputation can be performed on whole-genome sequencing datasets with comparable quality. On a separate note, we acknowledge that the current imputation reference panel is limited by the number of available SEA representative datasets. The imputation accuracy may have been compromised, as shown by the lower discordance rate of the TOPMED reference panel. More representative whole genome sequencing data of OA and other native populations from SEA would further increase the imputation power. A larger study to collect various sequencing datasets from diverse indigenous and underrepresented populations is warranted. We therefore recommend more initiatives for open, collaborative studies that would enable access to these sequencing datasets.