A robust pipeline for ranking carrier frequencies of autosomal recessive and X-linked Mendelian disorders

Selection of deleterious variants in recessive genes

In this study, sequencing results from the publicly available Genome Aggregation Database (gnomAD) were extracted and analyzed as the discovery cohort. gnomAD aggregated high-quality, uniformly processed whole genome sequencing (WGS) data from 76,156 individuals and whole exome sequencing (WES) data from 125,748 individuals12. These are primarily unaffected individuals from case-control studies of common adult-onset diseases. Particularly, samples from second-degree or more closely related individuals and from individuals as well as their first-degree relatives known to have severe childhood-onset diseases were removed12. Therefore, the variant data from gnomAD fulfills the requirement of our analysis on the carrier frequencies of recessive diseases in a generally healthy population. Variant information from gnomAD was extracted and only high-quality variants were retained and annotated (see Methods). Furthermore, gnomAD represents individuals with different self-defined ethnicity background, including African/African American (AFR), Latino/Admixed American (AMR), Ashkenazi Jewish (ASJ), East Asian (EAS), Finnish European (FIN), non-Finnish European (NFE), and South Asian (SAS) (Supplementary Dataset 1). Ethnicity-specific allele frequencies of each variant could be obtained.

2699 known disease-causing recessive genes documented in the Online Mendelian Inheritance in Man (OMIM) database were considered in this study, including 2525 autosomal recessive and 174 X-linked genes (Supplementary Dataset 2, see Methods). As shown in the workflow (Fig. 1), high-quality gnomAD variants aligned to the GRCh38 human genome assembly reference in each gene were processed. Overall, 48,198,273 gnomAD variants were found in the 2699 genes. Among these variants, we aimed to select those that either have been reported in affected patients or could potentially induce deleterious effects on gene function. Four categories of variants were retained based on the following criteria:

Fig. 1: Analysis workflow of ranking carrier frequencies in all recessive genes.figure 1

Steps in the discovery (gnomAD) cohort are illustrated. EAS East Asian, SAS South Asian, AFR African/African American, AMR Latino/Admixed American, ASJ Ashkenazi Jewish, NFE European (non-Finnish), FIN European (Finnish), VCR variant carrier rate, GCR gene carrier rate.

Type 1 variants – known pathogenic changes in ClinVar

ClinVar is a well-established database used to document the pathogenicity of genetic alterations in the context of human diseases13. Variants in ClinVar that have been categorized as “pathogenic” were either reported in the medical literature or submitted by clinical diagnostic laboratories following observation in affected individuals. Therefore, ClinVar is a generally trustworthy resource for disease-causing variants in patients.

We have defined 28,017 pathogenic ClinVar variants in our list of 2699 genes as Type 1 variants for downstream calculations (see Methods for detailed selection criteria). ClinVar variant of uncertain significance (VUS) and undocumented variants were further filtered as potentially deleterious Type 2 to Type 4 changes based on the criteria described below. Notably, any variant with a homozygous call in gnomAD (including hemizygous chromosome X variants in males and homozygous chromosome X variants in females) were removed. Further, variants with alternative allele frequencies (AF) ≥ 0.005 were removed following previous studies5,9. As a result, 45,354,101 ClinVar VUS and undocumented variants were further analyzed.

Type 2 variants – presumed loss-of-function (LoF) changes

Presumed LoF changes include five types of variant consequences designated as HIGH impact alterations in Ensembl: stop gained (nonsense), start lost, frameshift, splice acceptor and splice donor. Notably, if a specific variant results in different consequences in different transcripts due to alternative splicing, the consequence with the most severe impact was considered. Additional filtering was applied to exclude nonsense changes within 50 bp of the final exon junction that could potentially result in an escape of nonsense-mediated decay14. In total, 119,452 Type 2 variants were identified (Supplementary Dataset 3).

Type 3 variants – predicted deleterious missense changes

We obtained prediction scores of missense variant pathogenicity from dbSNFP v3.5a15, which included results from fifteen analysis tools (CADD, DANN, FATHMM, GERP, LRT, M-CAP, MetaLR, MetaSVM, MutationAssessor, MutationTaster, Polyphen2, SIFT, VEST3, fathmm_MKL_coding and phastCons). These tools predict whether an amino acid substitution is deleterious based on evolutionary conservation, the amino acid sequence and protein structure, the derived allele versus de novo simulation, etc. To select the most informative tools, all gnomAD missense variants categorized as pathogenic (P, 7384 variants), likely pathogenic (LP, 3688 variants), benign (B, 27,759 variants) and likely benign (LB, 17,944 variants) in ClinVar were used to evaluate their performance. Among the fifteen analysis tools, seven tools (CADD, DANN, fathmm_MKL_coding, phastCons, Polyphen2, SIFT and VEST3) clearly distinguished ClinVar pathogenic missense variants from benign missense changes (Fig. 2, Supplementary Fig. 1, Supplementary Dataset 4). Of the seven tools, combination of five tools, namely CADD, DANN, Polyphen2, SIFT and phastCons, were reported to be effective to determine the deleteriousness of missense variants9. We also calculated the mean scores of ClinVar pathogenic missense variants for CADD (mean ± SD = 28.04 ± 5.76), DANN (mean ± SD = 0.99 ± 0.06), fathmm_MKL_coding (mean ± SD = 0.90 ± 0.18), phastCons (mean ± SD = 0.84 ± 0.29), Polyphen2 (mean ± SD = 0.90 ± 0.25), SIFT (mean ± SD = 0.04 ± 0.12) and VEST3 (mean ± SD = 0.78 ± 0.24) (Supplementary Dataset 4). We applied these mean scores as cut-offs with which to differentiate deleterious from non-deleterious variants when evaluating the other missense variants in the list of 2699 genes. If a missense variant receives a score equal to or above (or below for SIFT) the mean values for at least five out of the seven tools, it was retained for downstream filtering. Additional criteria were applied to the filtering (see Methods). Overall, 48,634 Type 3 variants were identified (Supplementary Dataset 3).

Fig. 2: Violin plots comparing scores of the selected seven variant analysis tools.figure 2

All gnomAD missense variants defined as pathogenic/likely pathogenic and benign/likely benign in ClinVar were evaluated. Violin plots illustrate the median value (dot) and the 25th to 75th percentile range (black line). In each plot, benign variants are on the left and pathogenic variants are on the right. The deleteriousness scores are along the y-axis. Higher values indicated a higher probability that the variant is damaging in all scores except for SIFT, where a low score is associated with deleteriousness. The y-axis for CADD is a logarithmically transformed score, and the rest are linear probabilities. The x-axis represents the probability density of variants along the range of scores. The CADD plot appears different because its y-axis is on a logarithmic instead of linear scale. Calculated mean scores with standard deviations are listed in Supplementary Dataset 4. See also Supplementary Fig. 1.

We employed an alternative tool, EVE, to assess the deleteriousness of missense variants identified. EVE is a model for the prediction of clinical significance of human variants based on sequences of diverse organisms across evolution, which used fully unsupervised deep learning trained on amino acid sequences of over 140 K species16. EVE classifications were available for missense variants in 2208 genes, among which 17,136 variants from 650 genes overlapped with our defined Type 3 variant list. Reassuringly, only 0.54% (92 out of 17,136) of these variants were classified as Benign by EVE, demonstrating the reliability of our method in determining the pathogenicity of missense changes.

Type 4 variants – potentially harmful in-frame insertion and deletion mutations (INDELs)

In-frame INDELs are much less common than LoF or missense changes. Following our Type 3 variant analysis strategy, only genes with known ClinVar pathogenic in-frame INDELs were included. If there is no ClinVar pathogenic in-frame INDEL variant for a specific gene, there would be zero Type 4 variant. Consequently, 8887 such variants in 1654 genes remained (Supplementary Dataset 5). Further, variants that are evolutionarily conserved (CADD score > 2017), located in functionally critical domains and in close proximity to known pathogenic in-frame INDELs were denoted as Type 4 variants (see Methods). Altogether, 535 Type 4 variants were identified (Supplementary Dataset 3).

Ethnicity-specific ranking of carrier frequencies in the discovery cohort

A combined list of Type 1 through Type 4 gnomAD variants were generated for each of the 2699 genes (Fig. 1). Ethnicity-specific variant carrier rate (VCR) was calculated for all filtered variants in each gene, and the ethnicity-specific gene carrier rates (GCR) were subsequently deduced (see Methods for calculation formula). Because seven sets of VCRs were identified for each ethnicity in each gene, this calculation yielded seven sets of ethnicity-specific GCRs (Fig. 1 middle panel, Supplementary Dataset 6). Genes were then sorted based on the descending GCR values for each ethnicity. As a result, genes with the highest GCRs, and therefore those with the highest probabilities of causing recessive diseases in offspring, were at the top of the list for each population (Fig. 1 lower panel). GCRs for the top ten genes in each population were illustrated (Fig. 3a).

Fig. 3: Gene carrier rates (GCRs) for the top ten genes of each ethnicity within each cohort.figure 3

Genes are listed on the vertical axis, and the GCR values are shown on the horizontal axis. a gnomAD cohort containing GCRs from seven subpopulations. ALL all gnomAD samples, AFR African/African American, NFE European (non-Finnish), ASJ Ashkenazi Jewish, EAS East Asian, FIN European (Finnish); SAS South Asian, AMR Latino/Admixed American. Note that ethnicity-specific top gene(s) not in any other ethnicity’s top 10 gene lists were highlighted in bold with purple color. b Singapore cohort containing GCRs from three subpopulations. c ChinaMAP cohort and d WBBC cohort both composed of Chinese population GCRs.

Since prediction by in silico tools is prone to error, to evaluate the contribution of Type 3 and Type 4 variants to the overall result, we calculated GCR rankings based on Type 1 and Type 2 variants only and compared the values to GCR rankings based on all Type 1 to Type 4 variants in the gnomAD database (Supplementary Fig. 2). Nearly identical rankings were observed (Pearson correlation coefficient R = 0.95 to 0.98, Pearson correlation coefficient R = 0.88 to 0.93), presumably due to the stringent filtering criteria set for Type 3 and Type 4 changes, which resulted in small amount of such variants being retained and their marginal contribution to the overall result. Therefore, we included all Type 1 to Type 4 variants for the remaining analysis to avoid neglecting missense and inframe INDELs changes.

Ranking of carrier frequencies in validation cohorts

To verify the above pipeline in ranking carrier frequencies, we further analyzed variants from three independent genome databases with WGS data on large scale East Asian (Chinese) and South Asian (Malay and Indian) populations. These databases included the Singapore 10 K Genome Project (SG10K)18, the China Metabolic Analytics Project (ChinaMAP)19 and the Westlake BioBank for Chinese (WBBC) pilot project20.

WGS data of 4810 SG10 K samples from three Asian subpopulations (2780 Chinese, 903 Malays and 1127 Indians) were obtained. After removing outlier samples from each subpopulation (see Methods), WGS data from 2613 Chinese, 721 Malay and 1001 Indian were retained for downstream analysis (Supplementary Fig. 3). Type 1 to Type 4 variants were identified based on the criteria described above (Supplementary Dataset 3), and subpopulation-specific GCRs were calculated for the 2699 genes and ranked subsequently (Supplementary Dataset 6, Fig. 3b). We compared the rankings of the 2699 genes in SG10 K subpopulations with those in the gnomAD database. As expected, among the seven gnomAD ethnicities, the ranking in SG10K Chinese correlated with gnomAD East Asian the most (Fig. 4a, Supplementary Fig. 4, Spearman rank correlation coefficient R = 0.76, Pearson correlation coefficient R = 0.52), while the ranking in SG10K Indian correlated with gnomAD South Asian the most (Fig. 4a, Supplementary Fig. 4, Spearman R = 0.63, Pearson R = 0.25). Ranking in SG10K Malay only showed slightly higher correlation with gnomAD South Asian than other ethnicities, presumably due to its small sample size (721 individuals only) and the distinct genetic background between Singapore Malay and self-reported South Asian resided in the U.S.

Fig. 4: Comparison of carrier frequencies among different cohorts.figure 4

a Comparison between gnomAD populations and SG10K subpopulations, ChinaMAP Chinese or WBBC Chinese. Spearman’s rank correlation coefficient and Pearson correlation coefficient R scores are illustrated in each cell and color coded. Red indicates high correlation, while blue indicates low correlation. b Comparison between SG10K subpopulations and ChinaMAP Chinese or WBBC Chinese. c Comparison between ChinaMAP Chinese and WBBC Chinese. See also Supplementary Fig. 4 for comparison scatter plots and corresponding statistical P values. d Spearman’s correlation coefficients generated between our calculated carrier frequencies of gnomAD populations and actual ethnicity specific carrier frequencies recorded from NGS based ECS (Taber et al. 2022 study, ref. 6). AFR African or African-American, ASJ Ashkenazi Jewish, MWH Mixed or Other White, FCA French Canadian or Cajun, EAS East Asian, FIN Finnish, AMR Hispanic (corresponding to the Latino/Admixed American population in gnomAD), MEA Middle Eastern, NEU Northern European, SAS South Asian, SEA Southeast Asian, SEU Southern European. The color stands for row-wise Z score scaled Spearman’s correlation coefficient. See also Supplementary Fig. 5 for Pearson correlation coefficients.

We also calculated the GCR ranking of the 2699 genes in the ChinaMAP cohort (Supplementary Dataset 6, Fig. 3c). The ChinaMAP database aggregated WGS data from 10,588 Chinese individuals, majority of which (9043) were Han Chinese. Again, among the seven gnomAD ethnicities, the ranking in ChinaMAP correlated with gnomAD East Asian the most (Fig. 4a, Spearman R = 0.80, Pearson R = 0.78). Similarly, GCR ranking in the WBBC cohort containing WGS data from 4480 Chinese individuals (Supplementary Dataset 6, Fig. 3d) also showed high correlation with gnomAD East Asian (Fig. 4a, Spearman R = 0.79, Pearson R = 0.78). In addition, ChinaMAP ranking and WBBC ranking resembled that from the SG10K Chinese, but not that from the SG10K Indian or Malay (Fig. 4b). Notably, when comparing between ChinaMAP and the WBBC cohorts, both of which contained Chinese population resided in China and thus the most similar genetic background among all the databases, their GCR rankings demonstrated exceedingly high correlation (Fig. 4c, Pearson R = 0.92, Spearman R = 0.78), confirming the robustness of our analysis pipeline.

Comparison with carrier frequencies from previous studies

We further compared our rankings in the gnomAD cohort to actually observed carrier frequencies from a retrospective NGS based ECS study6. This ECS study evaluated carrier frequencies of 176 conditions in >460,000 individuals across 11 self-defined ethnicities. Again as expected, comparison of populations with the same ethnic background between the two cohorts demonstrated high correlation, e.g., gnomAD ASJ versus ECS ASJ (Spearman R = 0.78, Pearson R = 0.81), gnomAD EAS versus ECS EAS (Spearman R = 0.75, Pearson R = 0.66), gnomAD NFE versus ECS MWH (mixed or other white), NEU (Northern European) and SEU (Southern European) (Spearman R = 0.84, 0.83 and 0.79, respectively) (Fig. 4d, Supplementary Fig. 5). These results further validated the reliability of our analysis pipeline.

We also compared our findings to previously published estimation based on genome-wide sequencing data of unaffected individuals (Supplementary Fig. 6). Majority of previous studies on estimated carrier frequencies focused on a single gene or one disease spectrum. We were able to find three studies consisting of analysis on hundreds of recessive genes using gnomAD or gnomAD plus in-house databases, including one analysis on 185 genes associated with AR retinal diseases10, one on 249 genes with AR mitochondrial disorders11 and one on 415 genes with severe recessive conditions9. Except for the relatively low correlation in the Ashkenazi Jewish (ASJ) population between our ranking and the 249 mitochondrial disorder genes study (Spearman R = 0.38; no ASJ data was available for the 185 retinal disease genes study), rankings in other ethnicities demonstrate moderate to high correlations, particularly for the NFE population (Spearman R = 0.76 when compared with the 185 retinal disease genes study, 0.75 with the 249 mitochondrial genes study and 0.72 with the 415 severe condition genes study).

Additionally, we compared GCRs of two representative genes, CFTR and ABCA4, to previously studies (Supplementary Dataset 7)6,7,9,10. We observed similar carrier frequency values for CFTR compared to several published results; while for ABCA4, our GCRs are comparable yet slightly higher. We also calculated the predicted genetic prevalence at the gene level (pGPg) based on GCRs from various cohorts (see Methods for calculation equation). Disease prevalence for 405 genes associated with AR monogenic diseases were estimated (Supplementary Dataset 8).

留言 (0)

沒有登入
gif