Genetic constraint at single amino acid resolution in protein domains improves missense variant prioritisation and gene discovery

Homologous Missense Constraint precisely identifies pathogenic variants

First, we showed that HMC can distinguish pathogenic variants from benign variants in ClinVar. We note that ClinVar pathogenic missense variants are significantly enriched in Pfam domains (RatePathogenic vs Benign = 2.68, 95% CI = 2.63–2.73), and in our defined assessable regions (Pfam domains with multiple copies across the genome) (RatePathogenic vs Benign = 2.10, 95% CI = 2.06–2.15) compared with ClinVar benign variants, indicating domains are hotspots for pathogenic missense variants.

Applying HMC within protein domains, we found that ClinVar pathogenic variants are more likely to occur at constrained domain positions (HMC < 1; RatePathogenic vs Benign = 3.9, 95% CI = 3.5–4.2, P-value = 1 × 10–304) compared with unconstrained domain positions (HMC ≥ 1; RatePathogenic vs Benign = 0.62, 95% CI = 0.61–0.64, P-value = 1 × 10−311) (Fig. 2a). The strength of the association increases as domain residues are under stronger genetic constraint (HMC < 0.5; RatePathogenic vs Benign = 37.9, 95% CI = 15.7–91.3, P-value = 1 × 10−41) indicating that variants with lower HMC scores are more likely to be disease-causing.

Fig. 2figure 2

HMC accurately distinguishes pathogenic variants from benign variants. a Highly constrained positions within protein domains are enriched for pathogenic variants and unconstrained variants are depleted for pathogenic variants. The rate of ClinVar Pathogenic variants vs Benign variants (risk ratios) within various HMC constrained/unconstrained bins are shown with 95% CI. Rate of ClinVar Pathogenic vs Benign variants (RatePathogenic vs Benign) was calculated as N in the bin,pathogenic/N total,pathogenic vs N in the bin,benign/N total,benign. A total of 13,009 ClinVar Pathogenic and 3914 ClinVar benign variants were assessed. b Missense de novo mutations observed in a cohort of individuals with neurodevelopmental disorders (NDD; n = 5264) are found at more highly constrained residues than de novo mutations observed in unaffected controls (n = 2179). The cumulative rate of constrained de novo mutations in cases (N DNMs with HMC<X in cases/N total DNMs in cases) was plotted to compare with that in controls (N DNMs with HMC<X in controls/N total DNMs in controls). In total, 1209 DNMs in cases and 337 in controls are assessed in all genes. c Missense de novo mutations affecting highly constrained domain positions are significantly enriched in NDD cases versus unaffected controls. The rate of DNMs in cases was compared with that in controls in HMC-constrained/unconstrained bins. The rate of DNMs in cases vs controls was calculated as N in the bin,cases/N total,cases vs N in the bin,controls/N total,controls. d In 285 genes associated with developmental disorders, HMC prioritises damaging de novo missense mutations with a comparable effect size as protein-truncating variants (PTV) in 31,058 parent-proband trios of developmental disorders (DD). We compared the prevalence of missense de novo mutations (DNM) in established DD-associated genes in individuals with DD against that of expected de novo mutations predicted by context-based mutability, and plot the ratio (“burden”) for missense DNMs stratified by HMC score. The burden (Obs/Exp) ratio was calculated as N observed DNMs in the bin, in cases/Nexpected DNMs in the bin, in cases. As baseline references, the dotted lines show the burden (Obs/Exp) ratio for synonymous DNMs (ORSyn = 1.1, 95% CI = 1.0–1.2), missense DNMs (without HMC stratification; ORMis = 5.8, 95% CI = 5.6–6.0), missense DNMs within annotated Pfam domains (ORMis in domains = 13.5, 95% CI = 12.9–14.3), and protein-truncating DNMs (ORPTV = 32.4, 95% CI = 30.8–34.0). Missense DNMs at the most highly constrained residues (HMC < 0.6) show an association signal similar to that of protein-truncating DNMs. e Highly constrained (HMC < 0.8) or nominally constrained missense variants (HMC < 1) have increased association with hypertrophic cardiomyopathy compared with controls. We calculated the odds of carrying a rare missense variant for individuals with hypertrophic cardiomyopathy, and for the gnomAD reference population, and show the odds ratio for all rare missense variants, and for rare missense variants stratified by HMC scores. Constrained variants in MYBPC3 are more strongly disease associated. Data are sparser for the other three genes shown, which are much rarer causes of hypertrophic cardiomyopathy, but the trend is concordant. f The association between HMC-constrained missense variants (highly constrained HMC < 0.8 or nominally constrained HMC < 1) and MAVE pathogenic classification in DD-related genes measured by Odds Ratio and its 95% CI. All the seven assessable genes show a positive association with HMC highly constrained variants (HMC < 0.8) and five of them show significant association: BRCA1, CBS, HRAS, KRAS, PPM1D, PSAT1 and YAP1

Next, we asked whether HMC could prioritise deleterious de novo mutations (DNMs). We analysed published DNMs identified in 5264 probands ascertained with severe neurodevelopmental delay (NDD) and 2179 unaffected controls [23]. We found that de novo missense mutations in highly constrained domain positions (HMC < 0.8) are significantly enriched in NDD cases (RateNDD cases vs controls = 4.1, 95% CI = 2.4–6.9, P-value = 3.1 × 10−10; Fig. 2b–c). Similarly, highly constrained DNMs are significantly enriched in probands ascertained with autism spectrum disorders (RateASD cases vs controls = 2.2, 95% CI = 1.3–3.7, P-value = 0.0028; Additional File 1: Fig. S4). In a larger trio cohort with 31,058 probands of developmental disorders (referred hereafter as “the 31 K DD cohort”) [24], we further evaluated the enrichment of constrained DNMs (the ratio of observed to background expectation [20, 25]) in 285 dominant DD-associated genes that showed statistical enrichment of DNMs in that cohort. While missense variants located in annotated domains have a higher burden than those located elsewhere (Obs/Exp = 13.6, 95% CI = 12.9–14.3), HMC can further narrow down to a subset as [20, 25]highly constrained (< 0.8) with an effect size close to that of protein-truncating variants (Obs/Exp = 27.6, 95% CI = 25.5–29.7 vs PTV: Obs/Exp = 32.4, 95% CI = 30.1–34.0; Fig. 2d).

We also tested the ability of HMC to predict deleterious variants causing adult-onset disorders. We performed a case–control gene burden test in 6327 patients with hypertrophic cardiomyopathy from the SHaRe registry [23] using the 125,748 gnomAD v2.1.1 exomes as controls. For four sarcomere genes with HMC assessable variants, cases carry more HMC-constrained variants than controls compared to unconstrained or unclassified variants (Fig. 2e), though this is only individually significant for MYBPC3 (P-value = 1 × 10−121), likely due to limited power given a relatively low total number of variants in assessable positions of other genes. We expect HMC to have more power and a narrower confidence interval in genes with domains from a large Pfam family with more assessable positions. As shown by the examples of cardiomyopathy genes, the I-set (Pfam ID: PF07679) and FN3 (Pfam ID: PF00041) domains in MYBPC3 belong to large domain families with 785 and 597 homologous copies respectively in the exome, while domains from the other three tested genes belong to domain families with fewer than 72 copies in the exome (the largest domain family: EF-hand_1 (Pfam ID: PF00036) in MYL2).

As a further independent evaluation, we compared HMC with functional data from multiplex assays of variant effect (MAVEs) obtained from ProteinGym [26]. There are 17 genes that we can both evaluate using MAVE data and HMC scores including 14,813 variants. Across these genes, HMC highly constrained classification (HMC < 0.8) shows a significant association with MAVE functional classification (OR = 1.1, 95% CI = 1.0–1.2) and medium correlation (mean Spearman r = 0.19 and mean AUC = 0.59) (Additional File 1: Table S1). For seven genes related to rare monogenic developmental disorders (in the DDG2P panel), HMC has both a stronger association (OR = 1.5, 95% CI = 1.3–1.7) and stronger correlation with MAVE data (mean Spearman r = 0.21 and mean AUC = 0.65) on the ranking of variant pathogenicity (Fig. 2f). We infer that the strength of the relationship between the underlying reproductive fitness and the functions measured by MAVE assays influences apparent performance evaluation. Genes under strong negative selection, as exemplified by developmental disorder genes, are expected to have a higher association with HMC (Additional File 1: Fig. S5).

HMC is highly precise and is complementary to existing metrics to prioritise missense variants

We next evaluated the performance of HMC against existing pathogenicity scores, using ClinVar variants as a reference set. We first compared HMC to existing sub-genic constraint models that aim to predict deleterious variants without supervised training on known pathogenic variants including Constrained Coding Region [4] (CCR), Regional Missense Constraint [5] (RMC), and a homologous-residue-based conservation metric para_zscore [27]. As each approach generates predictions on different areas of the exome, we analysed the intersection of ClinVar variants that can be scored by all four methods (3661 pathogenic and 537 benign variants). Among HMC, CCR, RMC and para_zscore, CCR has the highest area under the Precision-Sensitivity curve (Fig. 3a). Within the authors’ recommended thresholds of classifying deleterious variants, HMC achieves the highest precision > 98.6% (under the high constraint threshold < 0.8). HMC’s precision is lower for variants with constraint score between 0.8 and 1 as expected, but HMC applied at this more lenient threshold remains the second-best precision among the four scores, with preserved sensitivity. At the threshold for HMC > 1, precision quickly decreases indicating its limited usage in this range.

Fig. 3figure 3

HMC has greater precision than other constraint metrics, and comparable performance to meta-predictor pathogenicity scores. a Using ClinVar variants, the precision-sensitivity curve demonstrates that HMC has higher precision over the constraint- and homologous-residue-based methods in top-ranked variants and within authors’ recommended thresholds (dots with larger size). The recommended threshold and the corresponding precision and sensitivity for each tool are HMC < 0.8 (98.6%, 48.8%), CCR > 95 (98.6%, 34%), RMC < 0.6 (93.5%, 67.4%) and para_zscore > 0 (92.4%, 89.3%). b HMC has comparable precision as existing state-of-the-art supervised meta-predictors. Dots with larger size indicate the performances (precision, sensitivity) using authors’ recommended thresholds: HMC < 0.8 (98.0%, 29.0%), M-CAP ≥ 0.025 (89.4%, 98.2%), REVEL > 0.5 (94.9%, 92.9%), CADD ≥ 10 (no variants are scored as deleterious), MPC ≥ 2 (97.6%, 31.2%), AlphaMissense ≥ 0.5 (96.5%, 88.3%), EVE ≥ 0.5 (95.3%, 82.8%), ESM1b ≤  − 7.5 (93.1%, 92.8%). cd Using the 31 K trio data of DD, the burden of de novo mutations (observed/expected) in DD-associated genes is compared to evaluate the precision of predicting damaging variants (the higher the burden, the more likely the variants are damaging). For a given tool, variants are grouped into bins based on their percentile of predicted pathogenicity among all assessed variants. As baseline references, the dotted lines show the burden ratio for synonymous DNMs, missense DNMs, missense DNMs within annotated Pfam domains, and protein-truncating DNMs as shown in Fig. 2d. c Comparison between HMC with existing constraint-based and homologous residue-based scores using DNM burden and its 95% confidence interval; d Comparison between HMC with existing state-of-the-art meta-learners using DNM burden and its 95% confidence interval

We also compared HMC to the state-of-the-art machine-learning-based variant pathogenicity predictors: M-CAP [28], MPC [5], REVEL [29], CADD [30], AlphaMissense [31], EVE [31, 32] and ESM1b [33] (Fig. 3b). A set of 9187 ClinVar pathogenic variants and 1465 benign variants that could be assessed by all methods were used to evaluate performance. AlphaMissense performs the best with the highest precision-sensitivity area under the curve among all the tools. Using the high constraint threshold (< 0.8), HMC has the highest precision at the authors’ recommended classification thresholds, with reduced precision (as would be expected) at the nominal constraint threshold (< 1). We note comparable performance, with most tools achieving very high precision (≥ 97%) at the left of the PRC curve. Here we caution that benchmarking against the supervised meta-predictors (including M-CAP, MPC, REVEL and CADD) using ClinVar variants in this context might be biased since they leverage multiple features and previously developed scores, some of which have been trained directly or indirectly on ClinVar, thus potentially leading to an inflated performance of these tools in this evaluation. In contrast, HMC remains independent of the well-established pathogenic variant sets in the score construction and evaluation.

We further compared HMC with the above existing pathogenicity scores for prioritising deleterious DNMs. Using the 31 K DD cohort, HMC outperforms all ten existing tools, being able to prioritise a subset of DNMs with the highest enrichment in the 285 dominant DD genes (top 5%, Fig. 3c and d), with an effect size as strong as protein-truncating variants. This highlights that HMC is highly precise in identifying de novo missense variants causing DD compared with existing approaches.

After confirming that HMC has favourable precision compared with existing scores, we ask whether HMC is complementary to them. As a measure of genetic constraint within human populations, HMC is methodologically independent of other lines of computational evidence, such as conservation across species and structural effect predictions. To evaluate the predictive power added by HMC compared with the existing constraint metrics, we assessed the number of HMC-constrained variants that could be missed by existing missense constraint metrics including gene-level constraint (MOEUF) [2], and sub-genic regional constraint scores (CCR and RMC) using the intersection of all possible exome missense variants that could be evaluated by the two scores compared. Constrained homologous residues detected by HMC are distributed across full ranges of these existing metrics in either constrained or unconstrained genes/regions. We find if a gene/region is more constrained as a whole, on average it also has more constrained residues compared to a less constrained gene/region (Fig. 4). However, there are substantial numbers of highly constrained missense variants uniquely classified by HMC (< 0.8): 893,063 not prioritised by either of the sub-genic metrics, of which 351,175 are not prioritised by any of the existing metrics. In summary, HMC provides a novel and independent effect prediction that could be combined with other classes of computational evidence to best interpret variants.

Fig. 4figure 4

Comparing the distributions of HMC score with existing gene-level and regional-level constraint scores. Here we show that HMC is not co-linear with other metrics, and therefore is likely to provide additional information when used in combination. Bar plots in the first column display the proportion of variants identified as constrained by HMC across genes or sub-genic regions (N HMC-constrained variants in the bin/N variants in the bin). Providing a further detailed view of the bar plots, 2D-histogram plots in the second column display the number of HMC-constrained variants within various ranges across gene/region constraint scores. ab The relationship between HMC and a gene’s MOEUF score (gene-level constraint of missense variants; a lower value indicates higher constraint). A gene with MOEUF ≥ 1 (grey dashed line) is considered as nominally unconstrained. cd The relationship between HMC and CCR (a higher percentile indicates higher constraint). A region with CCR percentile < 95% (grey dashed line) is considered unconstrained recommended by authors [4]. ef The relationship between HMC and RMC (a lower value indicates higher constraint). A region with RMC > 0.8 (grey dashed line) is considered unconstrained as recommended by authors

HMC prioritises newly significant genes associated with developmental disorders

We have analysed distributions of HMC scores across all assessable missense variants in different gene categories, including known autosomal dominant and autosomal recessive disease genes, as well as genes causing developmental disorder (with definite confidence from DDG2P), and genes with high GDI [34] (genes with high missense load in the general population and less likely to be disease-causing). We also included genes with high intolerance of variants measured by gnomAD LOEUF (intolerance of loss-of-function variants; < 0.6) and MOEUF (intolerance of missense variants; < 0.6).

Comparing the median of the HMC distributions, we found genes that are intolerant of missense variants (MOEUF constrained) or loss-of-function variants (LOEUF constrained) and disease-causing dominant genes, which are likely under stronger selective pressure, tend to have variants with lower HMC scores compared to genes with high GDI or autosomal recessive ones (two-sample Brown-Mood median test P-value < 2.2 × 10–16) (Additional File 1: Fig. S5). This reflects that the distribution of HMC across genes corresponds to the selective pressure acting at the gene level.

We further investigate whether HMC could improve gene discovery in developmental disorders given our above analyses showing that (1) HMC represents an orthogonal measure of variant deleteriousness, (2) is highly accurate in predicting disease-causing DNMs in known DD genes, and (3) genes under strong negative selection have variants with lower HMC scores. HMC prioritises a subset of missense DNMs that show a significant excess burden in the 31 K DD probands, in genes that are not previously known to be associated with DD (HMC < 0.8: Obs/Exp = 1.37 95% CI = 1.25–1.51; Additional File 1: Fig. S6), suggesting its potential to discover unknown DD genes. We updated a gene-specific de novo enrichment test (DeNovoWEST) [18] by incorporating HMC to weight missense variants (see Supplementary Methods) in the 31 K DD cohort. Consequently, we observed an increased DNM burden of up-weighted variants and a decreased DNM burden of down-weighted ones, indicating the improved separation of pathogenic from benign variants after adding HMC to the test (Additional File 1: Table S2). Our upgraded tests identified 286 disease-associated genes in the full cohort, and 97 in those previously undiagnosed (probands who do not carry pathogenic variants in consensus diagnostic genes, as previously defined [24]) at genome-wide significance (Bonferroni adjusted P-value < 0.05/(2 × 18,762)).

Compared with the original study [24], there are seven newly significant genes across the two tests, which carry at least one constrained missense variant, confirming that their elevated significance signal is driven by HMC (Additional File 1: Table S3–S4). Four of these genes have previously been published in association with DD via other lines of evidence and are currently included in the Developmental Disorders Genotype-to-Phenotype Database [35] (DDG2P), indicating that our results provide independent support for their gene-disease association. Three of these genes (BMPR2, KCNC2, and RAB5C) have not yet been included in the DDG2P. BMPR2 is known to cause pulmonary arterial hypertension [36, 37]. KCNC2 has been independently suggested to be a new candidate epilepsy gene [38,39,40]. Importantly, the newly significant genes all have more constrained missense DNMs than protein-truncating DNMs, suggesting the potential involvement of a gene-function-altering mechanism (Fig. 5).

Fig. 5figure 5

De novo variants identified in 31,058 parent-proband trios reveal seven genes associated with developmental disorders at genome-wide significance for the first time in the full DD cohort (a) and the previously undiagnosed subset (b). Four of these genes have been previously curated as DD genes on the basis of other lines of evidence, and are already included in the G2P database as established Developmental Disorder genes (blue), while three genes represent new candidate DD genes (red). Numbers of constrained missense DNMs classified by HMC and protein-truncating DNMs were compared. The newly significant associated genes likely act through altered function mechanisms as there are more constrained missense variants than PTVs

留言 (0)

沒有登入
gif