Clinical variant interpretation and biologically relevant reference transcripts

Mapping ClinVar variants to coding exon sets

We mapped ClinVar variants to the coding exons (CDS) of three sets of reference transcripts: APPRIS “principal transcripts”, MANE Select transcripts and the longest CDS of each coding gene (see methods for details). The mapping and analysis are described in detail only for the APPRIS principal transcripts, but we carried out the same process of computer analysis and manual curation for all three sets.

Only a handful of pathogenic variants map to alternative exons

Just 2.43% of GENCODE v37 [17] coding nucleotides are wholly alternative (do not overlap APPRIS principal transcripts), and only 1.6% of ClinVar variants map to these nucleotides, considerably fewer than would be expected by chance. Variants can be distinguished by clinical significance (Fig. 2). We found that the more damaging the ClinVar clinical significance label, the fewer variants mapped to alternative exons. For example, while 2.3% of variants tagged as “Benign” mapped to alternative exons, the same was true of just 1.31% of “Uncertain Significance” variants. Very few “Pathogenic” variants mapped to alternative exons (0.37%). Pathogenic variants that have undergone expert curation are even less likely to map to alternative exons than ordinary pathogenic variants. Just five of 9,491 variants reviewed by an expert panel (0.05%) mapped to alternative nucleotides.

Fig. 2: The percentage of ClinVar variants that map exclusively to alternative exons.

The small percentage of variants that do not map to APPRIS principal exons. Variants are grouped by ClinVar labels; the labels correspond to the CLIN_SIG entry. “Expert review” are variants labelled as “reviewed by expert panel”. Alternative exons make up 2.43% of all coding nucleotides, but <0.5% of all variants labelled with the word “pathogenic” fall in alternative exons.

Restricting variants to those supported by PubMed references magnifies the differences. The proportion of pathogenic and likely pathogenic variants mapping to alternative exons decreases for variants with PubMed references (from 0.37% to 0.22% for variants tagged as “Pathogenic”, and from 0.46% to 0.25% for variants tagged as “Likely pathogenic”), while the proportion of “Benign” variants in alternative exons increases (2.29% to 2.94%).

Manual curation of pathogenic variants in APPRIS alternative exons

We mapped ClinVar pathogenic variants with PubMed support to exons and splice sites from APPRIS principal transcripts (see methods). We defined “pathogenic variants” as those variants tagged as Pathogenic, Likely pathogenic or Pathogenic/Likely pathogenic in ClinVar. There were 115,508 pathogenic variants, 17.6% of all variants annotated in ClinVar (Fig. 3). Coding exons from APPRIS principal transcripts captured 114,387 of these variants (99.03%).
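The definition of “pathogenic variants” given above can be sketched as a simple filter over VCF records. This is only an illustration, not the pipeline used in the analysis: it assumes that the clinical significance is stored under an INFO key named CLNSIG, and the helper names and example records are hypothetical.

```python
# The three labels treated as "pathogenic" in the text, lower-cased and with
# underscores replaced by spaces so that "Likely_pathogenic" also matches.
PATHOGENIC_LABELS = {
    "pathogenic",
    "likely pathogenic",
    "pathogenic/likely pathogenic",
}


def parse_info(info_field):
    """Turn a VCF INFO string ("K1=V1;K2=V2;FLAG") into a dict."""
    info = {}
    for item in info_field.split(";"):
        key, _, value = item.partition("=")
        info[key] = value
    return info


def is_pathogenic(info, key="CLNSIG"):
    """True if the label is exactly one of the three pathogenic labels.

    The INFO key name is an assumption; adjust it to the file being parsed.
    """
    label = info.get(key, "").replace("_", " ").strip().lower()
    return label in PATHOGENIC_LABELS


# Hypothetical INFO strings, for illustration only.
records = [
    "CLNSIG=Pathogenic;GENEINFO=GENE1:1",
    "CLNSIG=Likely_pathogenic;GENEINFO=GENE2:2",
    "CLNSIG=Benign;GENEINFO=GENE3:3",
]
pathogenic = [r for r in records if is_pathogenic(parse_info(r))]
print(len(pathogenic))
```

Exact matching is a deliberate choice in this sketch: compound labels such as conflicting interpretations are excluded rather than counted as pathogenic.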

Fig. 3: The process of validating the pathogenic variants that map to alternative exons.

The process of mapping pathogenic variants from the ClinVar VCF file (version 4th April 2021) to the APPRIS principal and alternative transcripts (left side of the flow chart), and the breakdown of the manual analysis of the 76 pathogenic variants that map uniquely to APPRIS alternative exons (right side of the chart). We found that just 48 pathogenic variants had a direct effect on the expressed alternative protein. We tagged these pathogenic variants as “validated”.

Most of the 1,121 pathogenic variants that map to alternative coding exons are found in 5′ or 3′ exon extensions within just a few bases of principal coding exon splice sites. Although they map to alternative coding exons, these variants are much more likely to affect the splice site of the principal transcript. To account for this, we carried out a mapping of ClinVar variants that was aware of intronic splice site motifs by extending coding exons in APPRIS principal transcripts by three nucleotides at the 5′ end, and by five nucleotides at the 3′ end. The 667 pathogenic variants that mapped to these extended principal CDS were counted as mapping to reference transcripts rather than to alternative exons, meaning that the total of pathogenic variants captured by APPRIS principal transcripts rose to 115,054 (99.61%). Alternative exons captured just 454 pathogenic variants and 76 of these had PubMed support.
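The splice-site-aware mapping described above can be illustrated with a minimal sketch. It assumes 1-based inclusive genomic coordinates and treats every exon boundary as a splice site (it does not special-case the first and last coding exons); the function names and the example intervals are hypothetical.

```python
def extend_cds_exon(start, end, strand, ext_5prime=3, ext_3prime=5):
    """Return the exon interval widened at both ends.

    Coordinates are 1-based and inclusive, with start <= end on the genome.
    On the minus strand the 5' end of the exon is its higher coordinate.
    """
    if strand == "+":
        return start - ext_5prime, end + ext_3prime
    if strand == "-":
        return start - ext_3prime, end + ext_5prime
    raise ValueError("strand must be '+' or '-'")


def classify_variant(pos, principal_exons, strand):
    """Classify a variant position against principal CDS exons.

    Returns "principal" if the position lies in an unextended exon,
    "principal_extended" if it is only captured after the extension,
    and "alternative" otherwise.
    """
    for start, end in principal_exons:
        if start <= pos <= end:
            return "principal"
    for start, end in principal_exons:
        ext_start, ext_end = extend_cds_exon(start, end, strand)
        if ext_start <= pos <= ext_end:
            return "principal_extended"
    return "alternative"


# Hypothetical principal CDS exons on the plus strand.
exons = [(100, 200), (300, 400)]
for position in (150, 203, 97, 96, 250):
    print(position, classify_variant(position, exons, "+"))
```

In this sketch, variants classified as "principal_extended" correspond to those that the text counts as mapping to reference transcripts rather than to alternative exons.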

The ClinVar database does not assign pathogenicity labels. Instead, these are determined by the submitting group following guidelines issued by the ACMG. The submission guidelines have changed over time. Given the changing guidelines and the possibility of human error, some pathogenic variants may be erroneous. We carried out a manual curation of these 76 variants, reviewing the supporting publications to determine whether the variant was correctly transferred between research paper and clinical database, and whether it affected the translated protein.

We removed seven pathogenic variants from the list because they were not mentioned in the PubMed papers to which they were linked (Fig. 3). In addition, two of the variants appear to be annotation errors. The coordinates of the pathogenic variant in KCNQ2 [18] seem to have been erroneously mapped to an alternative exon, and in PTCH1 the authors cannot have carried out confirmatory experiments using the equivalent of the alternative exon in fish [19] because this exon is not conserved outside of primates.

We eliminated six pathogenic variants because the supporting PubMed papers found that the effect of the variant was actually on the splicing or the expression of the main transcript [20,21] rather than the alternative transcript, while a further six pathogenic variants affected non-coding features, including three that mapped to a nonsense-mediated decay (NMD) exon in SNRPB [22]. The variant in HBD mapped to a GATA1 binding site [23], and the variant in SDCCAG8 affected an exonic splicing enhancer [24]. In all these cases, the alternative coding exons to which the variants mapped are only conserved in primates and have little transcript support.

Finally, we believe the authors of seven research papers have mistakenly classified the variant as having a pathogenic effect via the alternative protein. One example is in the gene ACTG2, which produces smooth muscle actin, a protein that is 95% identical even between vertebrates and invertebrates. The predicted pathogenic variant [25] affects a novel 3′ exon derived from a LINE2 transposon that is conserved only in chimpanzee. The isoform produced from this novel exon would be missing almost three-quarters of the actin fold (Fig. 4). It is hard to imagine how a variant in an exon that produces a truncated protein isoform could affect megacystis-microcolon-intestinal hypoperistalsis syndrome, a severe disorder that affects bladder and intestine muscles [25], especially since the transcript appears not to be expressed in any tissue [26], and certainly not in bladder or intestines. The pathogenicity of this variant is solely supported by association studies, and the authors even admit that “the data from this family suggests but perhaps do not prove entirely that the alternative exon 4… is functionally important”.

Fig. 4: The effect of ACTG2 alternative splice event on protein structure.

a The sequence of the alternative isoform from the novel exon mapped onto the structure of chicken smooth muscle actin (PDB: 3W3D, 28). The region that is maintained in the alternative isoform is shown in blue, while the 34 residues that would be replaced by the novel primate exon with the pathogenic variant are shown in green. The remainder of the structure (in yellow) would be lost from this presumed protein. The ATP and calcium bound by the chicken actin protein (lost in the alternative isoform) are in orange space fill. Mapping was carried out using HHPRED [47]. b A model of the same truncated ACTG2 isoform generated by AlphaFold [45]. The maintained sequence is in blue, the novel predicted region is in grey. Despite the AlphaFold prediction, the substituted sequence is unlikely to fold into an extended helix, and will certainly not bind ATP. Both images were generated with PyMol.

The variant in the primate-derived alternative exon of LDB3 would change an isoleucine for a methionine residue. Expression of the exon is limited to testis, and there is no evidence it is expressed in heart [26]. This is incongruent, given that the variant is supposed to cause dilated cardiomyopathy [27]. LDB3 reference sequences in both UniProtKB and the Locus Reference Genomic incorporate this exon, presumably because of this unlikely pathogenic variant.

The variant in the NOBOX alternative transcript, ENST00000467773.1, was assumed by the authors to be pathogenic [28] because it falls within a conserved homeobox domain. However, the variant, a conservative serine to threonine swap, maps to a little expressed [26], primate-derived alternative exon that itself inserts into the region that produces the homeobox domain (Fig. 5). The inserted exon would almost certainly disrupt the domain, particularly since the inserted exon is adjacent to the conserved asparagine and arginine residues that bind DNA (Fig. 5). This novel exon almost certainly would eliminate DNA binding with or without the variant. ENST00000467773.1 is the longest CDS. It is also the MANE Select transcript and produces the UniProtKB display isoform, in part because of this erroneous variant.

Fig. 5: The effect of the NOBOX 3′ splice site extension on protein structure.

a The image shows the crystal structure of Drosophila melanogaster Aristaless and Clawless homeodomain proteins bound to DNA (PDB: 3A01). The Aristaless protein (chain F, yellow) has 57% identity to the NOBOX homeobox domain. The residues where the 32 amino acid insert would break the NOBOX homeobox domain are highlighted in red. The insertion is right next to the conserved DNA-binding residues of the homeobox domain and this primate-derived exon would almost certainly abolish the NOBOX homeobox DNA-binding function. b The same PDB structure with the inserted exon modelled by AlphaFold for the UniProtKB NOBOX display isoform grafted onto the structure. The inserted exon is in red. The predicted effect of the insertion would be to extend the helix (again) and to interfere with the DNA-binding of the human Aristaless homologue. Mapping was carried out using HHPRED and both images were generated with PyMol.

At the end of our analysis, we found that PubMed publications validated a pathogenic effect on the alternative protein product for 48 of the annotated pathogenic variants (see Supplementary Table 1). The 48 variants mapped to 30 different alternative transcripts. These 48 “validated” variants are just 0.138% of all 34,833 PubMed-supported pathogenic variants.

Almost all of the 28 PubMed-supported pathogenic variants that mapped to alternative exons but did not affect the alternative protein were in primate-derived exons, many of them conserved only across higher primates. Recently evolved alternative exons are highly unlikely to have gained sufficient functional importance for variants to have a pathogenic effect on the protein product [29], and variants in these exons should have enhanced supporting evidence in order to be considered pathogenic. For example, although the alternative exon that houses the REEP6 pathogenic variant evolved recently, within the eutherian clade, its pathogenicity is also supported by expression data. The variant is predicted to cause autosomal-recessive retinitis pigmentosa [30], and the alternative isoform is the main isoform in retina [31].

Pathogenic variants in exons alternative to MANE Select and longest CDS transcripts

We also analysed the relationship between pathogenic variants and the two other methods for selecting reference sequences, MANE Select transcripts and the longest CDS (see Supplementary Tables 2, 3). We mapped the 115,508 pathogenic variants from the ClinVar VCF file to both these sets of reference transcripts as we had with the principal transcripts (Fig. 3). Prior to manual curation, all but 67 of the 33,736 pathogenic variants with PubMed support mapped to MANE Select transcripts rather than alternative transcripts (Fig. 6). This was similar to APPRIS, and indeed most of the pathogenic variants in alternative exons coincided. The longest CDS captured all but 160 of the pathogenic variants supported by PubMed publications (Fig. 6).

Fig. 6: Pathogenic variants not captured by reference transcripts.

ClinVar Pathogenic, Likely pathogenic and Pathogenic/Likely pathogenic variants with PubMed support that are not captured by reference transcripts are here termed uncaptured pathogenic variants (UPVs). The method for reference transcript selection is shown in the legend. “All UPVs” are all pathogenic variants with PubMed citations that are not captured by the reference transcripts, “Validated UPVs” are those uncaptured pathogenic variants with PubMed support that were validated by manual curation, “Genes with UPVs” are the number of distinct genes with validated uncaptured pathogenic variants.

As we had with the pathogenic variants that mapped to APPRIS alternative exons, we validated the likely pathogenic effect on the alternative protein products for variants that were in exons alternative to MANE Select transcripts, and those that were in exons alternative to the longest CDS. We validated 47 of the 67 pathogenic variants that mapped to MANE Select alternative exons (Fig. 6). This was 0.139% of PubMed-supported pathogenic variants, similar to the 0.138% of validated pathogenic variants that mapped to APPRIS alternative transcripts. Unsurprisingly, there was no significant difference between the proportion of pathogenic variants captured by APPRIS principal transcripts and those captured by MANE Select transcripts.

The longest CDS was less successful at capturing validated pathogenic variants. After curation, 143 pathogenic variants from 60 genes (Fig. 6) mapped to transcripts with shorter CDS (0.411% of pathogenic variants with PubMed support). The longest CDS transcripts miss three times as many validated pathogenic variants as the APPRIS principal and MANE Select transcripts, even though they cover substantially more nucleotides. This difference is clearly significant (two-tailed Fisher exact test, p < 0.00001). In fact, over the whole of ClinVar, the longest CDS fails to pick up 535 pathogenic variants.
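A two-tailed Fisher exact test of this kind can be sketched with the standard library alone. The 2×2 table used here is an assumption, since the text does not state which counts entered the test: it compares the 48 validated variants missed by APPRIS principal transcripts with the 143 missed by the longest CDS, each out of 34,833 PubMed-supported pathogenic variants. The function names are illustrative.

```python
from math import exp, lgamma


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    log_denominator = log_comb(row1 + row2, col1)

    def log_prob(x):
        return log_comb(row1, x) + log_comb(row2, col1 - x) - log_denominator

    observed = log_prob(a)
    p_value = 0.0
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        lp = log_prob(x)
        if lp <= observed + 1e-7:  # small tolerance for floating-point ties
            p_value += exp(lp)
    return min(1.0, p_value)


# Assumed table: validated variants missed versus captured by each reference set.
total = 34833
missed_appris, missed_longest = 48, 143
p = fisher_exact_two_tailed(missed_appris, total - missed_appris,
                            missed_longest, total - missed_longest)
print(p < 0.00001)
```

Working in log space keeps the binomial coefficients from overflowing at these sample sizes.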

APPRIS principal and MANE Select agree on the reference transcript over a total of 2,985 genes that have PubMed-supported pathogenic variants. Within these genes, the MANE and APPRIS supported reference transcripts captured 31,259 of 31,291 PubMed-supported pathogenic variants (99.9%). Just 32 PubMed-supported pathogenic variants mapped to alternative exons (Supplementary Table 4), and of these, we validated just 13 (0.042%) via their PubMed references. When they agree, APPRIS principal and MANE Select transcripts capture almost all annotated clinically important variants.

The same is not true for the longest CDS. Over the set of genes where MANE Select and APPRIS principal transcripts coincide, the longest CDS fails to capture 113 validated pathogenic variants. Ten of these are also missed by APPRIS/MANE. That means that over those genes where the longest CDS does not agree with the APPRIS principal and MANE Select reference, the longest CDS fails to capture 103 pathogenic variants, and the MANE Select/APPRIS principal reference just three (in REEP6, SLC25A3 and TCF3). Even counting the pathogenic variant in TCF3 (where the longest CDS is almost certainly not biologically relevant), this is still a ratio of 34 to 1.

Extending the analysis to all pathogenic variants

To quantify the alternative transcripts with ClinVar pathogenic variants, we extended our analysis to include all pathogenic variants, regardless of whether they were supported by a publication. To guarantee that pathogenic variants mapped to alternative transcripts, we did not use MANE or APPRIS to select reference transcripts, and instead tried to map as many pathogenic variants as possible to a single transcript for each gene. Pathogenic variants that were not captured by this transcript were deemed to map to alternative exons.
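The selection of a single transcript per gene that captures as many pathogenic variants as possible can be sketched as follows. This is a minimal illustration under stated assumptions: coordinates are 1-based and inclusive, ties are broken by transcript identifier (the text does not say how ties were resolved), and the transcript names and positions are hypothetical.

```python
def pick_reference_transcript(transcript_exons, variant_positions):
    """Pick the transcript whose coding exons contain the most variants.

    transcript_exons maps a transcript ID to a list of (start, end) intervals;
    variant_positions is a list of genomic positions in the same gene.
    Returns the chosen transcript ID and the positions it fails to capture.
    """
    def captured(exons):
        return {p for p in variant_positions
                if any(start <= p <= end for start, end in exons)}

    # max() keeps the first maximal item, so sorting the IDs first makes
    # the tie-break deterministic (an assumption of this sketch).
    best_id = max(sorted(transcript_exons),
                  key=lambda tid: len(captured(transcript_exons[tid])))
    missed = sorted(set(variant_positions) - captured(transcript_exons[best_id]))
    return best_id, missed


# Hypothetical gene with two transcripts and four pathogenic variants.
gene = {
    "T1": [(100, 200), (300, 400)],
    "T2": [(100, 200), (250, 280)],
}
best, alternative_only = pick_reference_transcript(gene, [150, 260, 350, 390])
print(best, alternative_only)
```

The positions returned as missed are the ones that, in the terms used above, would be deemed to map to alternative exons.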

In this analysis of all ClinVar pathogenic variants, these variants mapped to alternative exons in just 67 transcripts. Because many of these pathogenic variants were without PubMed support, we “validated” the likely effect on the protein by determining the relative age and expression levels of the alternative exons that held the variants. As we have already shown, pathogenic variants in recently evolved exons with little or no transcript support are not likely to have an effect on protein products.

Pathogenic variants mapped to primate-derived exons in 34 alternative transcripts. These 34 primate-derived exons had little or no transcript support [26]. For 10 of the variants, we have already shown that their PubMed references do not support an effect on the alternative protein (ACTG2, HBD, etc.). Ten of the 34 derived from primate transposons, seven were NMD targets, including all three alternative exons in SCN1A, and one is no longer annotated as coding. We eliminated these alternative transcripts as “not validated”.

Twelve of the 33 remaining alternative transcripts with pathogenic variants had PubMed support. However, a literature search for the variant in the alternative transcript in BNC2 turned up an annotation error. BNC2 has two annotated pathogenic variants (Fig. 7). One variant is in the final coding exon of ENST00000380672.9, the APPRIS principal and MANE Select transcript. This variant produces a histidine for arginine swap at amino acid residue 888 and would abolish the zinc binding of the second of four C-terminal zinc-binding motifs. The other pathogenic variant is in an inserted exon in transcript ENST00000418777.5. The inserted exon leads to a frame change and a premature stop codon. The variant is predicted to produce a premature stop codon at residue 852 of the alternative isoform (Fig. 7); a premature stop codon in an already truncated transcript. Both truncations would eliminate three of the four C-terminal zinc-binding motifs (Fig. 7).

Fig. 7: Mis-annotation of a pathogenic variant in BNC2.

The figure represents the 3′ coding exons of two BNC2 transcripts (not to scale). Exon numbers (shown inside each exon) are from their position in the GENCODE v37 reference set. The position of annotated and experimental pathogenic variants (stars) is marked next to the corresponding transcripts. Exons 8 and 12 in ENST00000380672.9 (the APPRIS principal and MANE Select transcript) code for four zinc finger motifs. Inserted exon 9 in transcript ENST00000418777.5 leads to a frame change and a premature stop codon, which would eliminate three of the motifs. Below the two exons, the motifs are represented by the PDB structures of 3MJH and 1WJ0, mapped using HHPRED. BNC2 has two pathogenic variants in ClinVar (red stars). One is annotated in ENST00000380672.9 and produces a histidine for arginine swap at amino acid residue 888. The other is annotated in exon 9 of transcript ENST00000418777.5, and is predicted to affect residue 852 of the alternative isoform. The experimentally determined pathogenic variant (purple star), which would affect both transcripts, is reported to change an arginine to a stop codon at residue 853 of the principal isoform.

Although this variant is not supported by any publication, there is a third experimentally determined pathogenic variant for BNC2 that is not annotated in ClinVar. It too produces a premature stop codon, but at arginine 853 of the principal isoform, not arginine 852 of the alternative isoform [32]. This stop codon would also almost certainly be pathogenic since it would remove all four C-terminal zinc binding motifs. It seems that this pathogenic variant was mapped to ENST00000418777.5 erroneously during the lift-over from GRCh37 to GRCh38 [33]. We also removed BNC2 from the set of genes with pathogenic variants in more than one distinct transcript.

The genes with validated pathogenic variants in alternative transcripts are detailed in Supplementary Table 5, while those genes with pathogenic variants that we did not validate are listed in Supplementary Table 6. There are 32 alternative transcripts in total, and three genes, GCNT2, GNAS and PCDH15, have pathogenic variants in three distinct transcripts (Table 1). So, validated pathogenic variants map to more than one transcript in just 29 of the 11,132 genes with pathogenic variants in ClinVar.

Table 1 List of human alternative transcripts that harbour validated pathogenic variants.

The clinical relevance of alternative isoforms

We have shown that APPRIS principal and MANE Select transcripts capture almost all pathogenic variants. However, some alternative protein isoforms are also biologically relevant [34,35]. How can researchers predict which alternative isoforms are clinically significant? Genes with validated pathogenic variants in alternative transcripts provide clues. The most obvious feature is that almost all alternative exons with pathogenic variants are ancient. For example, the alternative exon in SLC25A3 pre-dates the earliest vertebrates [36].

More than half of the 32 cases involve the alternative splicing of highly conserved tandem duplicated exons [36,37], even though tandem duplicated exon substitutions make up <0.5% of annotated splice events [36]. Finally, alternative splicing is linked to tissue specificity at transcript [38] and at protein level [31]. Tissue specificity also appears to be a characteristic of the alternative exons in this set [31]. Manual analysis found that only 20 of the 32 exon pairs (Table 1) had sufficient expression to determine tissue specificity [26], but out of these 20 pairs of exons, 18 are clearly tissue specific. Two thirds of the set have both cross-species conservation and expression support.

TRIFID functional importance scores predict clinical importance

A small number of clinically important alternative transcripts are labelled as MANE Plus Clinical [14] based on ClinVar annotations. For example, the alternative transcript in SLC25A3 is annotated as MANE Plus Clinical. Of the 58 MANE Plus Clinical transcripts from MANE v1.0, 23 map to the 32 pairs of Ensembl/GENCODE transcripts with pathogenic variants we validated (Supplementary Table 5). Another 21 correspond to genes in which the MANE Plus Clinical transcript has a uniquely mapped pathogenic variant, but the corresponding MANE Select transcript does not (as is the case of SLC25A3). Finally, 12 transcripts either do not have uniquely mapped ClinVar pathogenic variants, or have variants that do not affect the alternative protein product (including PTCH1, for example).

We have demonstrated that reference transcripts produce the most clinically important protein isoform. Beyond this, the functional importance of alternative splice isoforms is clear only in a small number of genes [35]. We found certain features correlated with functional importance, so we developed TRIFID, a machine learning method that predicts the biological relevance of protein isoforms [39]. TRIFID inputs include conservation and expression data, and annotations from the Ensembl and APPRIS databases. We have shown that the TRIFID score can distinguish alternative exons that are under selective pressure from those that are not [39]. Here, we evaluated the ability of TRIFID to distinguish clinically important alternative transcripts.

We combined the 32 alternative exons from the set of genes with validated pathogenic variants in distinct transcripts with the 30 APPRIS alternative exons that have PubMed-supported pathogenic variants. Eleven exons appeared in both lists, so there were 51 alternative exons in all. To each of these exons we assigned the TRIFID score of the best-scoring transcript in which it was found. Just over half (27) had a TRIFID score of over 0.8, while only one had a TRIFID score below 0.2.

The distribution of TRIFID scores for all 76,134 APPRIS-defined alternative coding exons in GENCODE v37 was radically different. Again, we counted only the transcript with the best TRIFID score for each exon. Here, just 3.3% of alternative exons scored >0.8, and the overwhelming majority (almost 85%) had TRIFID scores below 0.2.

We binned the TRIFID scores and calculated the proportion of the alternative exons with validated pathogenic variants in each bin (Fig. 8). We find that the higher the TRIFID score, the more likely an alternative transcript houses a validated pathogenic variant. In fact, exons from the highest scoring TRIFID alternative transcripts have 690 times as many validated pathogenic variants as the 85% of exons in the lowest scoring bin. More than 1% of alternative exons from transcripts with TRIFID scores >0.8 were annotated with validated pathogenic variants, but this fell to just 0.003% for exons from transcripts with TRIFID scores below 0.2.
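The binning step can be sketched as below. This is an illustration only: it assumes five equal-width bins over [0, 1], with each exon scored by its best-scoring transcript as described above. The handling of scores that fall exactly on a bin edge is an assumption, and the exon identifiers and scores are invented toy values, not data from the analysis.

```python
def best_score_per_exon(exon_to_transcripts, transcript_scores):
    """Assign each exon the highest score among the transcripts that contain it."""
    return {exon: max(transcript_scores[t] for t in transcripts)
            for exon, transcripts in exon_to_transcripts.items()}


def bin_index(score, n_bins=5):
    """Map a score in [0, 1] to an equal-width bin; a score of 1.0 joins the top bin."""
    return min(int(score * n_bins), n_bins - 1)


def validated_fraction_by_bin(exon_scores, validated_exons, n_bins=5):
    """Return, per bin, (number of exons, number validated, percentage validated)."""
    totals = [0] * n_bins
    hits = [0] * n_bins
    for exon_id, score in exon_scores.items():
        i = bin_index(score, n_bins)
        totals[i] += 1
        if exon_id in validated_exons:
            hits[i] += 1
    return [(t, h, 100.0 * h / t if t else 0.0) for t, h in zip(totals, hits)]


# Toy data, for illustration only.
exon_to_transcripts = {"e1": ["t1"], "e2": ["t1", "t2"], "e3": ["t3"], "e4": ["t4"]}
transcript_scores = {"t1": 0.05, "t2": 0.90, "t3": 0.10, "t4": 0.95}
scores = best_score_per_exon(exon_to_transcripts, transcript_scores)
table = validated_fraction_by_bin(scores, validated_exons={"e2"})
for low, row in zip((0.0, 0.2, 0.4, 0.6, 0.8), table):
    print(low, row)
```

Dividing the percentage in the top bin by the percentage in the bottom bin gives the kind of enrichment ratio quoted in the text, provided the bottom bin contains at least one validated exon.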

Fig. 8: Validated pathogenic variants in alternative exons binned by TRIFID score.

APPRIS alternative exons from the human reference set were binned by the TRIFID score of the best-scoring transcript in which they are annotated. For each bin, we show the percentage of exons that are annotated with validated pathogenic variants. The 51 exons with validated pathogenic variants tend to fall in the highest scoring bins. Differences between the bins were large, and significant despite the low numbers of validated pathogenic variants. Fisher exact tests showed that the percentage of exons with validated pathogenic variants among exons with best TRIFID scores exceeding 0.8 was significantly higher than in all other bins (two-tailed Fisher’s exact test, p = 0.0111 against the 0.6–0.8 bin, p = 0.001 against the 0.4–0.6 bin, and p < 0.00001 for the other two bins), and that the percentage among exons with TRIFID scores below 0.2 was significantly lower than in all other bins (two-tailed Fisher’s exact test, p < 0.00001 for all four bins).

Limitations

The interpretation of pathogenic variants in ClinVar depends on the submitter, since submissions are not curated. There are a number of factors that might affect the quality of the interpretation, such as the date of submission and whether the variants are submitted by single submitters or large-scale prediction programs. In this analysis we included all pathogenic variants with a PubMed reference. This does not guarantee quality, but we manually curated pathogenic variants that mapped to alternative exons to create more reliable sets.

What we could not do was question the experimental process presented in each analysis. We did not reclassify the variants using ACMG criteria [1,2]. We did not attempt to reinterpret the analysis of the variants to determine whether they are truly pathogenic. This analysis should be carried out by the submitting group, or by clinical experts.

Manual curation of the papers allowed us to cross-check the evidence to see if it matched what was in ClinVar. For example, we could show that the coordinates in the paper that supported the pathogenic variant for BNC2 at residue 853 [32] did not match the coordinates in ClinVar. For some variants we could check whether the assumptions made as part of the definition of pathogenicity were correct (as in the case of NOBOX). We could also check whether the pathogenic effect was reported for the annotated coding exon or for some other feature (as was the case with HBD). Finally, we could use evidence that was not considered by the authors to suggest that the pathogenic label is erroneous (as with the variant in ACTG2).

The ACMG criteria were first published in 2008 [1], so one possibility is that authors were laxer with their submissions prior to 2008. We found no evidence of this in our (limited) set. Just two possible pathogenic variant misannotations pre-dated these recommendations (KCNQ2 and LDB3), and one of these was a simple mis-annotation of protein coordinates [18].
