Over the past decade, significant progress has been achieved in the field of genome editing, largely attributed to the utilization of CRISPR (clustered regularly interspaced short palindromic repeats) and its Cas (CRISPR-associated) proteins (reviewed in [1]). The Cas9 protein creates double-stranded breaks (DSBs) that are subsequently repaired by the cell’s repair mechanisms, usually through non-homologous end joining (NHEJ), leading to the potential introduction of indels. One of the areas where these advancements have occurred pertains to gene silencing, with the primary objective being the selective inhibition of a specific gene without affecting others. Various methods for gene silencing using CRISPR exist, including expression inhibition through CRISPRi (CRISPR interference), the introduction of point mutations, and the insertion of premature stop codons. Another common approach involves the use of Cas9 proteins to induce mutations in start codons, thereby inhibiting translation initiation [2,3,4].
In most computational models predicting CRISPR activity, researchers have shown interest in the DSB’s location and likelihood, as well as the identity of the resulting mutation [5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22]. This paradigm assumes that if a mutation is induced, then the gene’s expression will significantly decrease; however, this is not always the case. For example, a gene with a mutated start codon can still be translated due to an alternative in-frame start codon, so the resulting protein remains functional [23] (Fig. 1A). Furthermore, transcription and splicing are influenced by various properties of the DNA sequence [24, 25] and not every change to the original sequence will result in a measurable change in the gene’s expression. Therefore, mutated off-target genes, i.e., genes mutated by CRISPR even though the single-guide RNA (sgRNA) was not meant to target them, do not necessarily have their expression affected. On the other hand, even if each exon’s DNA sequence remains unmutated, the gene’s expression could be affected by intronic or intergenic mutations (Fig. 1B). Thus, mutations in all positions require careful evaluation to determine whether they exert a discernible effect on expression.
Fig. 1Overview of the tool’s rationale and structure. A A given gene’s canonical start codon (blue) is deleted by a CRISPR mutation to silence the gene. However, since an alternative ATG in the same reading frame exists (green), translation initiation still occurs, effectively keeping the gene unsilenced. B Part of intron 1 (red) is deleted by a CRISPR mutation; although each exon’s pre-mRNA sequence remains intact, the deletion causes a mis-splicing event, resulting in exon 2 not being spliced; thus, the mRNA includes only exons 1 and 3. C EXPosition pipeline. First, the tool either predicts the most likely mutations induced in a CRISPR target site or receives specific mutations from the user; it then searches for potentially affected genes and assesses the mutation’s effect on their splicing, transcription, and translation initiation. In addition, the VBC and GuidePro scores are calculated on the target site. These scores are fed along with the predicted cutting efficiency and the gene expression estimates into an SVM classifier that predicts whether the sgRNA will cause gene knockout. The user can also insert specific genes/transcripts for analysis, thus disregarding other genes/transcripts which are affected by the mutations. Green boxes represent EXPosition's main components
Recently, a few tools designed to predict gene knockout following CRISPR were created [26, 27]. These tools focus on sgRNAs that target the coding region of a gene and use amino acid composition, conservation and frameshift likelihood. However, a tool designed to address sgRNAs’ effect on gene expression, that can evaluate sgRNAs targeting non-coding regions of genes, is needed.
Here we present a tool that addresses these issues by explicitly considering CRISPR’s effect on the target site’s phenotype rather than the genotypic change. Our tool predicts the impact of CRISPR’s action on three aspects of gene expression: transcription, splicing, and translation initiation. It then incorporates these estimations with predicted cutting efficiency and predictions from tools designed to predict gene knockout to better assess gene knockout. Since not every mutation will significantly affect gene expression, researchers using our tool can save time and money when deciding which sgRNA is most likely to achieve the desired change in gene expression (e.g., silencing) without affecting the expression of off-target genes.
ImplementationGeneral pipeline of the toolOur tool, called EXPosition, accepts a CRISPR target site location and predicts whether targeting that site will silence the target gene. It accomplishes this by combining post-CRISPR gene expression estimations with gene knockout predictions from other models that rely on other features. Importantly, EXPosition differs from previous tools by also providing information about the likely phenotypical effect of CRISPR at that site, i.e., the effect of the induced mutation on gene expression. The tool’s modules are summarized in Fig. 1C. In short, EXPosition accepts an sgRNA and a gene of interest; predicts the likelihood of the guide’s cut and NHEJ-induced mutations using CRISPRedict [28] and Lindel [29], respectively; predicts the mutations’ effect on transcription, splicing, and translation initiation using Xpresso [24], Oncosplice [30], and TITER [31], respectively; and combines them with the scores from VBC [26], GuidePro [27], and CRISPRedict [28] to provide a prediction of gene knockout using that sgRNA. The source code for our tool is available at https://github.com/shaicoh3n/EXPosition [32].
How to run EXPosition is hereby described in more detail. Firstly, the user chooses which sub-models to run (Fig. 1C:a): transcription, splicing, and translation initiation. Then, the user inputs a target site location (Fig. 1C:b). Using CRISPRedict [28] and Lindel [29], the tool predicts the most likely mutations to be induced by CRISPR, along with their probabilities (Fig. 1C:I). Alternatively, the user can specify the mutations and their probabilities (Fig. 1C:c). In both cases, these mutations are then analyzed in the chosen sub-models. In each sub-model (Fig. 1C:III-V), we evaluate the effect of each mutation on one aspect of expression for each human gene, thereby finding the genes affected by the mutation. We then multiply each mutation’s phenotypic score by the probability of the mutation, and sum over all mutations to arrive at an expected value of effect for that aspect of expression for a certain gene (Fig. 1C:f–h). If multiple genes have been affected by the mutations, the gene that received the highest score (i.e., most detrimental effect on expression) will be the output of the sub-model. Thus, the final score for each sub-model would be:
$$\mathrm\;\mathrm\;=\max_\;\mathrm}\}}\sum_i^}}p_i\ast s_$$
(1)
where \(j\) ranges from 1 to \(_}\), which is the number of genes affected by the predicted mutations and is usually 1–2 (Fig. 1C:e); \(_}\) is the number of top mutations (i.e., most probable mutations) considered (by default, \(_}=4\)); \(_\) is the probability of mutation \(i\) occurring; and \(_\) represents the expected effect of mutation \(i\) on gene \(j\) according to the sub-model, i.e., its effect on transcription, splicing, or translation initiation. \(_\) is normalized to range between 0 and 1 in the following way:
$$s_=\min\left(1,\frac_}-\mathrm}}\right)$$
(2)
where \(}_\) represents the highest raw sub-model score for mutation \(i\) over all gene \(j\)‘s transcripts, and \(_-\text}\) is the maximal score predicted by the sub-model on all ClinVar mutations (see the section “Mutations with high gene expression scores are overrepresented in ClinVar-designated pathogenic mutations”).
Finally, the scores from EXPosition are combined with the predicted cutting efficiency (Fig. 1C:i), VBC score (Fig. 1C:j), and GuidePro score (Fig. 1C:k) into a Support Vector Machine (SVM) classifier (Fig. 1C:VIII). to provide a binary classification of whether the sgRNA would cause a gene knockout (Fig. 1C:I).
Importantly, instead of evaluating every human coding gene/transcript affected by the mutations, the user can input a gene or transcript of interest (Fig. 1C:d); if they do so, the mutation’s effects in the selected sub-models will be checked only against that gene or transcript. If a transcript of interest was set, \(}_\) is the score of that transcript (instead of the maximal score over all the associated gene’s transcripts).
In the transcription sub-model (Fig. 1C:III), the score (\(_\)) represents the predicted relative change in mRNA levels caused by a mutation; this change is predicted using Xpresso [24]. In the splicing sub-model (Fig. 1C:IV), the score signifies the predicted loss of functionality of the gene’s proteins (based on evolutionary conservation) caused by a mutation, while considering any mis-splicing events and changes in the position of their start codon; this is accomplished using Oncosplice [30]. In the translation initiation sub-model (Fig. 1C:V), the score denotes the predicted relative change in the start codon’s efficiency of translation initiation, while considering any mis-splicing events; this is achieved using TITER [31]. The following sections detail the different parts of EXPosition.
Predicting the genotypic outcome of CRISPR’s DSBAs a first step, based on the user’s cut site location, we extract the target site along with its flanking sequences to predict the DSB’s probability and resulting indels using CRISPRedict [28] and Lindel [29] respectively (Fig. 1C:I). CRISPRedict is a linear regression model that predicts the probability of cutting by sgRNAs; it takes as input a 30-nt-long sequence surrounding the cut site: 4 nt upstream to the cut site, 20 nt of the site, and the following 6 nt downstream to site. Lindel is a logistic regression model that predicts the likelihood of NHEJ mutations induced by CRISPR; its input is a 60-nt sequence centered around the cut site, and its output consists of the predicted probabilities of 557 possible mutations: deletions around the cut site of up to 30 nt, every possible insertion of 1–2 nt, and a single collective mutation for any insertions of ≥ 3 nt.
We analyze only the \(_}\) (by default \(_}=4\)) most likely mutations predicted by Lindel in our tool, as checking all possibilities is not feasible timewise. We normalize the probabilities of these mutations so that their sum equals 1. We also exclude insertions longer than 2 nt, as Lindel does not provide explicit mutations for such cases, which have been demonstrated to be exceedingly rare [22, 29].
Thus, the probability for each mutation is calculated as follows:
$$p_i=\frac}}_i}}}\;p_}_}}\ast}}_i$$
(3)
where \(_}}_\) and \(_}_}\) represent the probabilities from Lindel and CRISPRedict of mutation \(i\), respectively; and \(_}\) is the number of most probable mutations taken from Lindel.
The predicted mutations and their probabilities serve as input for the three sub-models described in the following sections. Alternatively, users can manually input specific mutations of interest along with their probabilities, which can include both indels and substitutions.
Transcription sub-modelTo predict the effect of a mutation on transcription (Fig. 1C:III), we employ Xpresso [24], a fast and accurate deep learning model (Additional file 1: Fig. S1) that predicts mRNA steady-state abundance based on the nucleotide context around a transcription start site (TSS). While more accurate models exist such as Enformer [33], which can consider mutations up to 100k base pairs away, they are too slow and heavy computationally to be incorporated into EXPosition. Enformer can take ~ 5 min to evaluate a mutation as compared to less than a second needed by Xpresso. More details about Xpresso can be found in supplementary Sect. 1. Its input consists of a 10.5-kb context around the TSS, while the output is the log10 mRNA expression level of the respective gene (Fig. 2A).
Fig. 2Overview of the gene expression sub-models. A The transcription model takes as its inputs the sequences surrounding the transcription start site (TSS), which include both the mutated and unmutated sequences. Specifically, it considers 7000 base pairs upstream and 3499 base pairs downstream of the TSS. The mutated and unmutated sequences are each fed into Xpresso, which then predicts the mRNA levels for the mutated and unmutated genes. The model’s output is the relative change in mRNA levels compared to the unmutated gene. B The splicing model receives the gene’s predicted mutations and uses them in SpliceAI to identify potential aberrant splicing events. Accounting for any mis-splicing events detected, variant mRNAs are generated. These variant mRNAs are subsequently assessed by our translation initiation model to identify a suitable start codon. The variant mRNA sequences are translated into amino acids and employed to calculate the gene’s conservation score. C The translation initiation model accepts both the mutated and unmutated transcripts as inputs. In an iterative procedure, the mutated transcript is searched for in-frame codons within a defined window. These codons’ initiation capabilities are predicted by TITER. If a suitable codon is identified, defined as one with better efficiency than the canonical start codon or ranking within 5% of its efficiency when compared to all human transcripts, it is selected and returned. If no suitable start codon is found in the initial window, the iterative process continues with an expanded window size. This process continues until a suitable start codon is discovered or until a maximum window size is reached, at which point the best available codon is returned. Simultaneously, the unmutated mRNA’s start codon undergoes analysis by TITER, and its efficiency ranking is used to calculate the relative change in initiation efficiency ranking between the canonical start codon and the best new start codon identified in the mutated sequence
For a given mutation \(i\) and gene \(j\), we examine whether the mutation could potentially impact the gene’s transcription levels by considering its transcripts’ TSSs and checking if the mutation falls within 7 kb upstream or 3.5 kb downstream of them (i.e., in the region that Xpresso considers when evaluating a TSS). For each potentially impacted transcript, we calculate the Xpresso score of the 10.5 kb sequence around its TSS before and after the mutation (denoted \(_}\) and \(_}\), respectively). Each transcript’s final transcription score reflects the relative change in mRNA transcription levels following the mutation:
$$s_}=\frac}-r_}\right|}}}$$
(4)
\(}_\) in Eq. 2 is the maximal \(_}\) caused by mutation \(i\) over gene \(j\)’s transcripts. If a specific transcript of interest was provided, the output score pertains solely to that transcript.
Finding mRNA isoforms following a mutationBoth the splicing model (Fig. 1C:IV) and the translation initiation model (Fig. 1C:V) analyze isoforms of transcripts following mutations and potential aberrant splicing events. We obtained splice site annotations by Ensembl [34]. To predict mis-splicing events, we utilize SpliceAI [25], a deep learning tool (Additional file 1: Fig. S2) that predicts the change in a position’s probability to function as a splicing donor/acceptor site following a mutation. More information can be found in supplementary Sect. 2. We use these annotated splice sites and the splicing changes predicted by SpliceAI to generate all possible mRNA isoforms following the mutation by concatenating donor and acceptor splice sites (Additional file 1: Fig. S3). Further details are available in supplementary Sect. 3.
Splicing sub-modelThe splicing sub-model (Fig. 1C:IV) assesses the impact of a mutation on a gene’s viability by examining the isoforms generated for each of the gene’s transcripts. To gauge the effect of a mutation on a protein’s functionality, we use Oncosplice [30]. This model receives a mutation and a gene as input and predicts how much the mutation disrupts the gene’s protein function. This disruption is scored using evolutionary conservation information (Fig. 2B).
For each isoform, a sliding window is employed to identify the most conserved area that is affected by the mutation; the window’s length is set to the average domain length of all human proteins. The score of each transcript is the average score of its isoforms (Additional file 1: Fig. S4); and finally, the gene’s score is the maximal transcript score, i.e., the transcript whose function was most significantly disrupted. If a specific transcript of interest was provided, the output score pertains solely to that transcript. For further information, please refer to supplementary Sect. 4.
Translation initiation sub-modelThe translation initiation sub-model (Fig. 1C:V, Fig. 2C) assesses the ability of mutant variants to initiate translation by searching for suitable start codons within the isoforms identified by the splicing model (see the section “Finding mRNA isoforms following a mutation”). The translation initiation score of the suitable codons is determined using TITER [31], a deep learning tool (Additional file 1: Fig. S5) that integrates a deep learning algorithm with known codon compositions of translation initiation sites (TISs) to predict TIS functionality. For additional information, please consult supplementary Sect. 5.
For each isoform, we locate the start of the coding sequence through local alignment with the original transcript’s coding sequence’s start. An iterative process is then initiated where TITER examines all in-frame NUG and ACG codons (where N can be any nucleotide) within a window surrounding the coding sequence’s start, with the window size increasing in each iteration. The process concludes when either the best new codon is discovered (with a TITER score sufficiently close to that of the canonical start codon) or when the maximum window size is reached. Further information can be found in supplementary Sect. 6. We then calculate the WT start codon’s TITER score rank, compared to the TITER scores of all human canonical start codons; this rank is denoted as \(_}\). We repeat this calculation for the best new start codon found for the isoform, whose rank is denoted \(_}\). The isoform's score is defined as the relative change in the isoform start codon's TITER score rank:
$$s_}=\frac}-i_}\right|}}$$
(5)
Like the splicing sub-model, we calculate the transcript’s initiation score as the average of the isoform scores and the gene’s initiation score (i.e., \(s'_\) from Eq. 2) as the highest score among all its transcripts (Additional file 1: Fig. S4). Likewise, if a specific transcript of interest is provided, we provide the score exclusively for that transcript.
Vienna Bioscore CRISPR (VBC)VBC [26] (Fig. 1C:VI) is a tool that predicts gene knockout given an sgRNA and the position of the target site. It outputs its score using linear regression with the following features: (A) indel formation predictions from inDelphi [12]; (B) a “Bioscore” calculated using protein features like Pfam domains, DNA and amino acid conservation, amino acid identity, and gene structure; and (C) sgRNA activity prediction obtained using predictors like Azimuth and similar models. Together, these components form a comprehensive score that captures key processes in CRISPR–Cas9 mutagenesis and can be used to estimate gene knockout effectiveness following CRISPR. For more information about this tool and its performance, please review the original paper and the results in this paper.
GuideProGuidePro [27] (Fig. 1C:VII) is another tool that predicts gene knockout of sgRNAs targeting protein-coding exons. Knockout efficiency is governed by three key factors: (A) sgRNA activity score attained with DeepHF [35], Azimuth [36] and SSC [37]; (B) frameshift probability acquired with inDelphi [12], Lindel [29], and FORECasT [14]; and (C) amino acid sensitivity score which is evaluated using conservation, Pfam domain annotations, post-translational modifications (PTMs), and secondary structures. Each of these scores is created with an SVM, and these scores are then fed into another SVM to estimate gene knockout. For more information about this tool and its performance, please review the original paper and the results in this paper.
SVM sgRNA gene knockout classifierEXPosition utilizes an SVM classifier (Fig. 1C:VIII) with an RBF kernel using EXPosition’s gene expression estimations, along with the predicted cutting likelihood and the scores from VBC and GuidePro as features. The model was trained using all sgRNAs from all the datasets in this paper. In cases when the scores from VBC and/or GuidePro are not available for an sgRNA, EXPosition uses one of similarly trained SVM classifiers that don’t require GuidePro and/or VBC.
Mutations with high gene expression scores are overrepresented in ClinVar-designated pathogenic mutationsWe wanted to further validate each of our gene expression sub-models, even though their main components were already validated elsewhere, and demonstrate our ability to predict functional effect of mutations. Thus, we used our tool’s gene expression component to analyze mutations from the ClinVar dataset [38], which contains mutations and their phenotypes accumulated from laboratories and researchers globally. We analyzed ~ 325k mutations tagged as benign (192k) or pathogenic (133k). Although using these data is not ideal, and not every mutation would lead to a functional effect, we expect a monotonic association between the score of the model and the number of pathogenic mutations.
For each of EXPosition’s gene expression sub-models, we examined the 3254 mutations in each percentile range (i.e., 99–100%, 98–99%, etc.) and calculated the fraction of pathogenic mutations in that set (Fig. 3; the score thresholds for each percentile are detailed in supplementary Sect. 7, Additional file 1: Table S1). We denote this fraction \(_}\). A higher fraction indicates better recognition of pathogenic mutations by the sub-model. All sub-models provided meaningful rankings on the ClinVar dataset from a certain threshold. The thresholds of the transcription/splicing/translation initiation sub-models correspond to the 92/67/97 percentiles. \(_}\) values for the higher percentiles exhibited significant enrichment of pathogenic mutations in almost all top 5 percentiles (\(^, ^\), \(^\) for the transcription, splicing, and translation initiation sub-models respectively using the hypergeometric test).
Fig. 3ClinVar analysis. \(}_}\) values of the transcription, splicing, and translation initiation sub-models (yellow, blue, orange lines, and dots, respectively) for different percentiles. Asterisks denote significant p-values (p < 0.05) and empty circles denote non-significant p-values (p ≥ 0.05). The highest p-value (i.e., closest to 0.05) calculated for each sub-model is listed in the legend. All p-values were corrected using FDR. Note that for each sub-model, the dots begin at the first percentile which corresponds to a non-zero score, meaning that 52/80/95% of mutations received a score of 0 in the transcription/splicing/translation initiation sub-models respectively
Following this analysis, each sub-model’s ClinVar threshold was set as the lowest percentile in which we observed a significant enrichment of pathogenic mutations (e.g., the 97th
留言 (0)