DNA and RNA base editors can correct the majority of pathogenic single nucleotide variants

We downloaded the ClinVar22 database, which included 1,103,629 mutations, of which 984,981 were SNVs. Of these SNVs, 973,996 were located in genes; only 98,513 were reported to be pathogenic. The distribution of the SNVs used in our analysis is shown in Fig. 1. To further illustrate the variant-correcting potential of the base editing approach, two examples of well-known pathogenic SNVs in severe genetic diseases are shown in Fig. 2. The first is achondroplasia, the most common cause of marked short stature (dwarfism). One of the most frequent missense variants is c.1138G>A in the FGFR3 gene. Whereas the reference codon is GGG, which is translated to glycine, the mutant codon is AGG, which is translated to arginine. Therefore, correcting the variant is possible by direct A-to-I(G) editing.

Fig. 1: Visualization of the mutations reported in ClinVar and utilized in our analysis, displayed based on mismatch type and molecular consequences.

a The 98,513 pathogenic SNVs located in genes. This set was utilized in our analysis of DNA base-editing. b The 18,873 pathogenic SNVs located in genes’ non-coding regions. This subset was utilized in our analysis of RNA base-editing in non-coding regions. c The 79,640 pathogenic SNVs located in genes’ coding regions, out of which 78,835 were annotated by the RNA sequence reference. This subset of 78,835 SNVs was employed in our analysis of RNA base editing within coding regions and in the assessment of amino acid improvements.

Fig. 2: Two examples of common pathogenic SNVs that could be corrected by direct A-to-G editing.

a The known missense variant causing achondroplasia syndrome (FGFR3):c.1138G>A (p.Gly380Arg). Direct base editing could revert the mutant A to a G. b The most common nonsense variant among Ashkenazi Jews causing severe cystic fibrosis (CFTR):c.3846G>A (W1282X). Direct base editing could revert the mutant A to a G, thereby converting the stop codon to tryptophan (W). However, the G nucleotide 5’ to the edited A may hinder this process, as can be gleaned from the ADAR motif.

A-to-I(G) editing can also be leveraged to correct nonsense variants, as in the second example, which concerns cystic fibrosis (CF). This multisystemic disorder is caused by a defect in the ion transporter encoded by the CFTR gene. Over a thousand mutations in the CFTR gene have been described worldwide, and the most frequent one among Ashkenazi Jews is the stop variant W1282X. Whereas the reference codon is TGG, which is translated to tryptophan, the mutant codon is TGA, which results in a stop codon. Reverting this stop codon to tryptophan is possible by direct editing. In this case, however, editing is more complicated if the endogenous ADAR enzyme is recruited, since the ADAR motif requires the absence of a G 5’ to the edited A, and here the mutant A is directly preceded by a G.

Pathogenic SNVs that could be amended by RNA base editing

RNA base-editing techniques are usually designed to target mRNA sequences in the cytoplasm and are therefore suitable for targeting variants located in the coding regions of genes. Of the 78,835 pathogenic SNVs in exons, 21,032 were suitable for direct editing: 15,608 G>A and 5424 T>C SNVs. Fig. 3 depicts all the direct editing possibilities, together with data regarding the variants suitable for A-to-I(G) or C-to-U editing. As is evident, 3497 nonsense variants can be reverted by A-to-I(G) editing. The findings regarding A-to-I(G) editing are pertinent to both Cas13 and endogenous-ADAR BEs, whereas C-to-U editing can only be applied to the former.

Fig. 3: All direct base-editing possibilities at the RNA level.

I: RNA A-to-I(G) base-editing, as a method to revert G>A SNVs. a An arbitrary example. b All possible amino-acid substitutions by A-to-I(G) editing. The amino acids are presented according to their chemical properties (purple = nonpolar, aliphatic R groups; green = nonpolar, aromatic R groups; yellow = positively charged R groups; orange = polar, uncharged R groups; blue = negatively charged R groups; red = termination). The left circos plot presents all the possible substitutions according to the amino acids, and the right circos plot presents the same data according to the codons. c The number of G>A pathogenic SNVs that could be corrected by RNA A-to-I(G) base editing, presented according to molecular consequence. d The distribution of these SNVs based on their number of detected off-target hits at the RNA level. e The distribution of these SNVs based on their number of bystander changes predicted to be likely pathogenic. f The percentage of SNVs in which the ADAR motif is detected. II: RNA C-to-U base-editing, as a method to revert T>C SNVs.

When designing an RNA BE, an associated gRNA is programmed to confer target specificity by base pairing to the targeted sequence. The main concern is gRNA binding to other highly similar targets, resulting in undesired off-target changes. To search for off-target regions that resemble the nucleotides surrounding the variant (the area to which the programmed gRNA binds), we constructed a sequence query encompassing the 40 bases surrounding the variant. Using the human BLAT23 program, we aligned the query to the RNA reference. All hits with at least 85% identity and an alignment length of at least 20 bases were deemed off-target sites. We found that for 91% of the G>A and T>C SNVs, zero potential off-target sites were detected, indicating that they are safe therapeutic targets in this respect.
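The hit-filtering step can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the `Hit` record and its fields are hypothetical (they are not the BLAT output format), the query is assumed to span 20 bases on each side of the variant, identity is assumed to be matches divided by aligned length, and both thresholds are read as minimums.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One alignment hit (hypothetical record, not the BLAT output format)."""
    target: str       # name of the transcript that was hit
    start: int        # start coordinate of the hit on that transcript
    matches: int      # number of identical aligned bases
    aligned_len: int  # total number of aligned bases


def build_query(transcript: str, pos: int, flank: int = 20) -> str:
    """Return the variant base plus `flank` bases on each side (clipped at the ends).

    Assumption: "the 40 bases surrounding the variant" means 20 per side.
    """
    start = max(0, pos - flank)
    return transcript[start:pos + flank + 1]


def off_target_hits(hits, on_target, min_identity=0.85, min_len=20):
    """Keep hits that pass both thresholds, skipping the on-target locus itself.

    `on_target` is a (target, start) pair identifying the intended site.
    """
    kept = []
    for h in hits:
        if (h.target, h.start) == on_target:
            continue  # the query always aligns to its own locus
        if h.aligned_len >= min_len and h.matches / h.aligned_len >= min_identity:
            kept.append(h)
    return kept
```

A variant would count as having zero potential off-target sites when `off_target_hits` returns an empty list for its query.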

Other concerning off-target changes can theoretically occur in proximity to the edited nucleotide: once the deaminase approaches its target, it can unintentionally edit other nucleotides of the same type in the immediate vicinity. To tackle this issue, we adopted a rigorous method, concentrating on the 20 nucleotides surrounding the variant. We examined the number of potentially editable nucleotides in this area and assessed the projected impact of such modifications on the resulting protein. These assessments were based on predictions provided by the AlphaMissense project24 for each possible nucleotide alteration. Based on these predictions, the analysis includes the number of potential bystander edits per variant that are likely to be pathogenic. For the G>A SNVs located in coding regions, an average of 4.6 surrounding A nucleotides were identified, of which 1.2 on average were anticipated to have a pathogenic impact. Notably, in 6726 variants (43%), no likely-pathogenic effect was observed. In the same manner, for the T>C SNVs, 5.3 surrounding C nucleotides were found on average, and only 1.0 on average were anticipated to be pathogenic. In 2431 variants (45%), no likely-pathogenic effect was observed.
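The bystander count can be illustrated with a short sketch. It assumes that the 20-nucleotide window means 10 bases on each side of the variant, and it takes the pathogenicity call as a caller-supplied function (`is_pathogenic` is a hypothetical stand-in for a lookup of per-position predictions; it is not an AlphaMissense API).

```python
def bystander_positions(seq, pos, base="A", flank=10):
    """Indices of editable bases of the same type near the target.

    Assumption: the window covers `flank` bases on each side of `pos`.
    The target position itself is excluded.
    """
    start = max(0, pos - flank)
    end = min(len(seq), pos + flank + 1)
    return [i for i in range(start, end) if i != pos and seq[i] == base]


def count_likely_pathogenic(positions, is_pathogenic):
    """Count bystander positions whose edit is predicted likely pathogenic.

    `is_pathogenic` is a hypothetical callable mapping a position to a bool.
    """
    return sum(1 for i in positions if is_pathogenic(i))
```

For a G>A SNV the scan would use `base="A"`, and for a T>C SNV `base="C"`.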

For 69% of the G>A SNVs, an ADAR motif, which prefers the absence of a G 5’ to the edited A, was found. These are therefore suitable targets for BEs based on an ADAR enzyme.
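The motif test reduces to inspecting the single base upstream of the edited adenosine. The sketch below makes that explicit; it assumes the sequence is given 5’ to 3’ on the transcript strand and treats only the immediate 5’ neighbor, which is a simplification of the ADAR preference described here.

```python
def has_adar_motif(seq, pos):
    """Return True when the base 5' of the edited A is not a G.

    `seq` is read 5'->3' and `pos` is the index of the A to be edited.
    Only the immediate 5' neighbor is checked (a simplifying assumption).
    """
    if seq[pos].upper() != "A":
        raise ValueError("the target base must be an A")
    if pos == 0:
        raise ValueError("no 5' neighbor available for the target base")
    return seq[pos - 1].upper() != "G"
```

Applied to the CFTR W1282X codon TGA, the target A at index 2 is preceded by a G, so `has_adar_motif("TGA", 2)` returns `False`, in line with the caveat noted for that variant.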

RNA base-editing techniques can be further expanded to target RNA sequences in the nucleus, for instance by leveraging the endogenous ADAR p110 isoform, which is abundant in the nucleus. This expands the scope of variants that can be corrected at the RNA level, since introns and other non-coding regions are transcribed in the nucleus as well. An analysis of the 18,873 pathogenic variants that are located in genes, but not in exons, identified another 7945 variants that are suitable for direct editing: 6603 G>A and 1342 T>C SNVs. In 58% of the G>A variants, an ADAR motif was found (Fig. 3).

Pathogenic SNVs that could be amended by DNA base editing

All the abovementioned G>A and T>C SNVs that are potential therapeutic targets for RNA BEs, either in the cytoplasm or in the nucleus, could also be targeted by DNA BEs. Since the BE is directed to the error nucleotide and reverts it to the reference one, we refer to this approach as direct editing. However, unlike RNA BEs, DNA BEs can also correct C>T and A>G SNVs by targeting the complementary strand, termed complementary editing. As DNA BEs act in the nucleus by definition, for this analysis, we considered all C>T and A>G SNVs located in genes, regardless of whether they are located inside or outside exons.

Figure 4 depicts all the direct and complementary editing possibilities and data regarding the variants relevant to each at the DNA level. We identified 22,333 G>A and 6803 T>C SNVs suitable for direct editing and 21,228 C>T and 7446 A>G SNVs suitable for complementary editing.
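The mapping from mismatch type to editing strategy can be written down compactly. The sketch below is illustrative only: the dictionary names and the plain "A-to-G"/"C-to-T" labels are ours, and the counts are the four figures quoted above, summed to give the total number of DNA-editable SNVs.

```python
# (reference base, mutant base) -> (editor chemistry, targeted strand) at the DNA level
DNA_OPTIONS = {
    ("G", "A"): ("A-to-G", "direct"),
    ("T", "C"): ("C-to-T", "direct"),
    ("C", "T"): ("A-to-G", "complementary"),
    ("A", "G"): ("C-to-T", "complementary"),
}

# At the RNA level only the direct options exist (A-to-I(G) and C-to-U).
RNA_OPTIONS = {
    ("G", "A"): "A-to-I(G)",
    ("T", "C"): "C-to-U",
}


def dna_option(ref, alt):
    """Return (editor, strand) for a ref>alt SNV, or None if not revertible."""
    return DNA_OPTIONS.get((ref, alt))


def rna_option(ref, alt):
    """Return the RNA editing type for a ref>alt SNV, or None if not revertible."""
    return RNA_OPTIONS.get((ref, alt))


# Counts reported in the text for each mismatch type at the DNA level.
DNA_COUNTS = {("G", "A"): 22333, ("T", "C"): 6803, ("C", "T"): 21228, ("A", "G"): 7446}
TOTAL_DNA_EDITABLE = sum(DNA_COUNTS.values())  # 57810
```

Any mismatch type absent from these tables (for example C>A) returns `None`, which corresponds to the variants that cannot be reverted by either strategy.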

Fig. 4: All direct and complementary base-editing possibilities at the DNA level.

I: Direct A·T-to-G·C base-editing, as a method to revert G>A SNVs. a An arbitrary example. b The amount of G>A pathogenic SNVs that could be corrected by DNA A·T-to-G·C base-editing, and the subset of those possessing NGG PAM sequences. c The distribution of this subset of SNVs based on their MIT specificity score, which summarizes all genomic off-targets into a single numerical value. A score above 50 is indicative of a unique sequence and is considered acceptable for therapeutic purposes. d The distribution of this subset of SNVs based on the count of bystander changes within the NGG PAM editing window, with a focus on those predicted to be likely pathogenic. II: A·T-to-G·C base-editing directed at the complementary strand, as a method to revert C>T SNVs. III: Direct C·G-to-T·A base-editing, as a method to revert T>C SNVs. IV: C·G-to-T·A base-editing directed at the complementary strand, as a method to revert A>G SNVs.

When designing a DNA BE, it is essential to consider the presence of the desired protospacer/PAM sequence near the targeted nucleotide. Numerous motifs are available, based on the specific Cas9 variant in use.

In this study, we focused on the most common PAM motif, derived from Streptococcus pyogenes (PAM: NGG), which is required to be present 12–16 bases away from the target nucleotide25. The motif was present in 17,119 (30%) of the editable SNVs (specifically in 32% of G>A, 28% of T>C, 28% of C>T, and 28% of A>G variants). It is important to remember that many other motifs are being developed and used; these numbers therefore underestimate the actual potential of DNA BEs.
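A PAM scan of this kind might look like the sketch below. The geometry is an assumption on our part: the PAM is taken to lie 3’ of the target on the strand being edited, and "12–16 bases away" is read as 12 to 16 bases lying strictly between the target and the N of the NGG. For complementary editing, the same check would be run on the reverse complement.

```python
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (uppercase A/C/G/T only)."""
    return "".join(_COMPLEMENT[b] for b in reversed(seq))


def has_ngg_pam(strand, pos, min_gap=12, max_gap=16):
    """Check for an NGG PAM downstream of the target base at index `pos`.

    Assumption: `gap` counts the bases strictly between the target and the
    first PAM base, and the PAM is on the same strand, 3' of the target.
    """
    for gap in range(min_gap, max_gap + 1):
        pam_start = pos + gap + 1
        pam = strand[pam_start:pam_start + 3]
        if len(pam) == 3 and pam[1:] == "GG":
            return True
    return False
```

Other Cas variants would need a different motif test and distance range, which is why the NGG-only figures are a lower bound.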

Concerning bystander edits, the calculation again depends on the chosen PAM and its associated window. We found that for the G>A SNVs in which the NGG motif is present, an average of 0.9 surrounding A nucleotides were identified in the respective 5-base editing window, of which 0.2 on average were anticipated to have a pathogenic impact. For the T>C SNVs, 1.0 surrounding C nucleotides were found on average, and 0.1 on average were anticipated to be pathogenic. In the same manner, for the C>T SNVs the numbers were 0.6 and 0.2, respectively, and for the A>G SNVs 1.2 and 0.3, respectively.

Regarding distant off-target sites, we referred to the UCSC Genome Browser CRISPR track (“CRISPR/Cas9 Sp. Pyog. target sites”)26, which utilizes the CRISPOR prediction tool for MIT specificity score27,28. This score summarizes all CRISPR/Cas9 genomic target sites into a single number ranging from 0 to 100. A guide with an MIT score above 50 is recommended for ensuring off-target safety. Out of the 17,119 variants, a score above 50 was detected in 13,183 (77%).

Table S1 in the supplementary section summarizes all ClinVar pathogenic SNVs including our added data regarding the relevant editing options, bystander and off-target hits, and ADAR and NGG motif detection.

Base editing opportunities when reverting the pathogenic SNV is not possible

Since 10 of the 12 mismatch types cannot be corrected by direct editing, a different approach is required. We suggest that the phenotype of a genetic disease could be improved by substituting the mutant (deleterious) AA with a novel AA that is more similar, though not identical, to the reference AA. This could be achieved by base editing the codon that encodes the mutant AA. Guided by this assumption, we scanned all the possible options for editing any of the three nucleotides that encode a given mutant AA, as well as all the options for editing more than one nucleotide of the same codon. (Of note, as natural RNA editing tends to occur in clusters, cases of endogenous A-A or A-A-A editing are abundant, indicating that deamination of more than one adenosine is highly feasible.) For example, as shown in Fig. 5, by using A-to-G editing, a mutant codon AAC could be edited to GAC, AGC or GGC. For each deleterious AA, we calculated the best option based on an AA substitution prediction tool (see the “Methods” section). See Fig. 5 for an illustrative example of applying the BLOSUM62 substitution matrix29 and all the possible editing options.

In our AA-improvement analysis, we used the sorting intolerant from tolerant (SIFT)30 prediction tool to evaluate the effect of an AA substitution on a specified protein, based on sequence homology and physical properties (Fig. 6). We investigated a total of 57,510 variants (all missense or nonsense pathogenic SNVs other than G>A or T>C) and found 4043 variants that could be improved, though not corrected to the original reference amino acid. Of these, 3095 could be modified by A-to-G editing, 900 by C-to-T editing, and 48 by combining A-to-G and C-to-T editing.

For example, the missense variant (PTEN):c.464A>C (p.Tyr155Ser) causes Cowden syndrome, an inherited condition characterized by multiple non-cancerous growths (i.e., hamartomas), due to a translation of TCT (serine) instead of TAT (tyrosine). By C-to-T editing, the deleterious codon could be turned into TTT (phenylalanine), resulting in a SIFT score of 1, indicating that this conversion is highly tolerated. A more complicated example is the missense variant (HNF1A):c.441C>A (p.His147Gln), which causes an inherited type of diabetes (maturity-onset diabetes of the young, MODY type 3). This variant changes the CAC (histidine) codon into the mutant CAA (glutamine) codon, resulting in a SIFT score of 0.05. By applying both A-to-G and C-to-T editing to the mutant codon (three editing actions at once), it could be modified to TGG (tryptophan), thereby increasing the SIFT score to 0.25.
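The codon scan described above can be sketched as follows. This is a simplified illustration, not the study's implementation: the function names are ours, and the scoring function is supplied by the caller (it could wrap a BLOSUM62 lookup or SIFT predictions; neither is bundled here). The sketch enumerates every codon reachable by editing one or more bases of the mutant codon and keeps the candidate that scores highest, provided it beats the mutant itself.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

EDITS = {"A-to-G": {"A": "G"}, "C-to-T": {"C": "T"}}


def edited_codons(codon, edit_types):
    """All codons reachable by editing one or more bases with the given editors."""
    allowed = {}
    for name in edit_types:
        allowed.update(EDITS[name])
    # Each position either stays as is or, if editable, may be converted.
    choices = [(b, allowed[b]) if b in allowed else (b,) for b in codon]
    return sorted({"".join(c) for c in product(*choices)} - {codon})


def best_edit(codon, ref_aa, edit_types, score):
    """Return (codon, score) of the best improving edit, or None.

    `score(ref_aa, aa)` is a caller-supplied similarity function
    (hypothetical here); higher means closer to the reference AA.
    """
    baseline = score(ref_aa, CODON_TABLE[codon])
    best = None
    for cand in edited_codons(codon, edit_types):
        s = score(ref_aa, CODON_TABLE[cand])
        if s > baseline and (best is None or s > best[1]):
            best = (cand, s)
    return best
```

With A-to-G editing alone, `edited_codons("AAC", ["A-to-G"])` yields AGC, GAC and GGC, the three options named in the text, and with both editors enabled the mutant CAA can reach TGG, as in the HNF1A example.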

Fig. 5: The improvement algorithm in combination with the BLOSUM62 substitution matrix.

a An example of a variant improvement by the algorithm. Base-editing cannot revert the C>A SNV. However, by applying A-to-G editing, the mutant codon can be edited in three different ways, each of which results in a different amino acid. The algorithm chooses the option that leads to the amino acid with the highest score according to the BLOSUM62 substitution matrix. b The known BLOSUM62 substitution matrix. c–e All amino acid substitution possibilities according to the algorithm for the cases of A-to-G editing only (c), C-to-T editing only (d), and A-to-G and C-to-T editing in the same BE (e). A gray arrow represents an alteration from the reference codon to the mutant codon. A black arrow represents the best editing option from the mutant codon to the novel codon. The thickness of the black arrow correlates to the difference between the novel score and the mutant score.

Fig. 6: The improvement algorithm in combination with the SIFT prediction tool.

The 4043 pathogenic variants that could be improved by the algorithm based on the SIFT score are presented according to the editing technique taken (a) and the mismatch type and molecular consequence (b), as is their score improvement by the post-editing score, colored according to the mismatch type (c).

On average, the improvement in the SIFT score was 0.22 per variant.

Mutant stop codons (TAA, TAG, or TGA) can only be converted to tryptophan (TGG) by direct editing. We hypothesize, however, that this result is always preferable to premature termination of the protein. For instance, the nonsense variant (NF1):c.4107C>G (p.Tyr1369Ter) causes neurofibromatosis type 1, one of the most common neurocutaneous syndromes, because the TAC (tyrosine) codon is replaced by TAG (stop codon). By A-to-G editing, the deleterious stop codon could be modified to TGG (tryptophan), resulting in a SIFT score of 0.42. In total, we found 1195 nonsense variants that could be improved. A detailed list of each variant and the selected editing option is provided as part of Table S1 in the supplementary section.

Many of the SNVs that can be base edited represent common genetic conditions

We identified, in total, 57,810 variants that could be corrected and 4043 variants that could be improved by base editing. We next sought to identify the most clinically relevant variants on this list. This is a complicated endeavor, as it depends on determining the variants’ frequency in the population, which remains an unresolved challenge for two main reasons. First, the general population is genetically heterogeneous, and current databases do not fully represent human genetic diversity. Second, most SNVs are very rare and are therefore seldom found when sequencing samples of the population. As a result, such estimations are not accurate, especially if diverse populations are not represented.

Bearing this in mind, and still aiming to provide an initial account, we used the ClinVar parameter recording the number of submitters who reported each variant. Although far from an accurate reflection of the real frequency of variants, a high number of submitters is an indirect indicator that a given variant is more common. We considered the number of submitters to be high when a variant was reported by at least three different submitters.

In total, 19,079 (19.4%) pathogenic SNVs were reported by a high number of submitters. According to our analysis of these pathogenic SNVs, 4998 are located in exons and thus can be corrected by cytoplasmic RNA editing, 13,558 by DNA editing, and 707 can be improved.

Next, we investigated the frequencies reported in gnomAD for the pathogenic SNVs31. Since this database is known to include individuals with no apparent genetic disease, it is not designed to detect rare pathogenic SNVs. Yet, it is reasonable to assume that variants that do appear in gnomAD are likely to be more frequent, with the limitation that this holds true only for populations for which sequence data are available. In total, 23,599 pathogenic SNVs had reported frequencies in gnomAD. According to our analysis, of these pathogenic SNVs, 5919 can be corrected by cytoplasmic RNA editing, 17,430 by DNA editing, and 866 can be improved.

Lastly, we investigated the list of the 70 most common monogenic diseases in the population published by Apgar et al.32 and found 12,366 reported pathogenic SNVs for 41 of these disorders. Our analysis indicates that of these SNVs, 2579 can be corrected by cytoplasmic RNA editing, 6545 by DNA editing, and 488 can be improved. Ranking these disorders by the percentage of SNVs that could be edited revealed the common diseases autosomal-dominant polycystic kidney disease (ADPKD), beta-thalassemia and Brugada syndrome among the top five editable disorders. In the same manner, ranking the disorders by the percentage of SNVs that could be corrected by a cytoplasmic endogenous ADAR revealed osteogenesis imperfecta and congenital adrenal hyperplasia among the top 10. The table of the sorted disorders is available in the supplementary section (Table S4).

Analysis of base editing’s suitability for the correction of liver and brain pathogenic SNVs

To date, base-editing efforts have mainly focused on hepatic diseases, since delivering therapies directly to the liver is feasible by various approaches based on intravenous injection. In the same manner, new approaches for targeting brain tissue, based on intrathecal injection, are now emerging. Seeking to identify the pathogenic SNVs of relevance to hepatic and central nervous system diseases, we extracted from the genotype-tissue expression (GTEx) database the genes highly expressed in these tissues. This analysis revealed 581 genes that are highly expressed in the liver and 3242 in the brain (Tables S2, S3). In ClinVar, 4073 pathogenic SNVs were located in the 581 liver genes. According to our analysis, of these pathogenic SNVs, 961 can be corrected by cytoplasmic RNA editing, 2385 by DNA editing, and 194 can be improved. Of the 15,102 pathogenic SNVs located in brain genes, according to our analysis, 2950 can be corrected by cytoplasmic RNA editing, 8544 by DNA editing, and 703 can be improved. This further demonstrates the applicability of base editing to treating a variety of genetic diseases.
