Genome-aware annotation of CRISPR guides validates targets in variant cell lines and enhances discovery in screens

Off-target effects are common in CRISPR libraries

CRISPR targets are specified by base-pairing between guide RNAs and endogenous DNA sequences. Ideal guides specify their targets uniquely to ensure that effects on phenotype are causally related to CRISPR knockout at a single genomic locus. Off-target effects occur when guides result in CRISPR knockout of an unintended gene. We used Exorcise to assess the extent of off-target effects in 55 commercially available CRISPR screen libraries (Fig. 2A) and found that all inspected libraries contain guides with off-target effects.

Fig. 2

Assessment of Addgene pooled CRISPR-spCas9 libraries for human and mouse. A Library guides were re-annotated with Exorcise using genome assembly GRCh38 (human) or GRCm38 (mouse). Re-annotations were compared with original annotations to identify off-target effects, missed-target effects, and boundary effects. B Mis-annotations identified by Exorcise. Other-locus off-target effects are caused by multiple targeting of guides to multiple loci. Same-locus off-target effects are caused by multiple knockouts by a guide due to overlapping features at a single locus. False non-targeting effects are missing annotations of a valid guide target. Missed-target effects are caused by a library guide being mis-annotated as targeting. Boundary effects are caused by mis-annotations with adjacent exons or genes. C Performance of Addgene libraries. Library guides were analysed with Exorcise with RefSeq exons. Bars indicate the proportion of guides which have any off-target effects (peach), any missed-target effects (cyan), or only on-targets (green). D Distribution of VBC scores among Addgene libraries. Dashed line indicates the median VBC score across all guides in all libraries. E Off-target effects of library guides in Addgene libraries. Distribution of number of off-target effects per guide by library. Inset: distribution of other-locus and same-locus off-target effects by library. F Missed-target effects in Addgene libraries. Distance between Cas9 cut site, if any, and the nearest exon in linear distance in nucleotides. Inset: zoom plot between 0 and 50 nucleotides

We defined off-target guides as those guides which target exons of more than one gene with perfect complementarity (Fig. 2B). We found that off-target effects account for up to 7.4% of library guides within RefSeq exons (Fig. 2C, Additional file 2: Table S1), rising to 12.9% after Exorcise with GENCODE Comprehensive (Additional file 1: Fig. S1A, Additional file 2: Table S33). Since overlapping gene features (such as gene bodies with readthrough transcripts and antisense RNAs) and gene duplication events would both be counted as off-target effects, we next decomposed off-target effects into “other-locus” and “same-locus” effects: the former resulting from Cas9 recruitment to multiple exonic loci and the latter resulting from Cas9 recruitment to a single exonic locus with more than one feature, for example, a readthrough transcript. Other-locus off-target effects did not exceed 5.1% of total library guides in either RefSeq (Fig. 2E, Additional file 2: Tables S2 and S3) or GENCODE (Additional file 1: Fig. S1B, Additional file 2: Tables S34 and S35) annotations, but same-locus off-target effects accounted for up to 2.2% and 9.3% of library guides for RefSeq and GENCODE, respectively. This difference between references is due to the permissive nature of GENCODE Comprehensive annotations, which include lower-confidence transcripts that are not included in RefSeq, resulting in more features at loci but not more loci with features. In general, off-target effects are expected due to the repetitive nature of sequences that underwent gene duplication.
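The decomposition into same-locus and other-locus effects can be illustrated with a minimal sketch. It assumes that perfect-match alignments and exon intervals are already available as plain tuples; the function name `classify_guide`, the data layout, and the rule that a guide is other-locus when no single locus accounts for all of its target genes are illustrative assumptions, not Exorcise's actual interface or implementation.

```python
def classify_guide(cut_sites, exons):
    """Classify one guide from its perfect-match alignments.

    cut_sites: list of (chrom, position) tuples, one per genomic alignment.
    exons: list of (chrom, start, end, gene) tuples, half-open [start, end).
    Returns (genes, same_locus, other_locus).
    """
    # Collect the set of genes whose exons contain each cut site.
    genes_at_locus = []
    for chrom, pos in cut_sites:
        genes = {g for c, s, e, g in exons if c == chrom and s <= pos < e}
        if genes:
            genes_at_locus.append(genes)

    all_genes = set().union(*genes_at_locus)
    # Same-locus: a single exonic locus carries more than one gene feature.
    same_locus = any(len(genes) > 1 for genes in genes_at_locus)
    # Other-locus (assumed rule): no single locus accounts for every target gene.
    other_locus = len(all_genes) > 1 and not any(
        genes == all_genes for genes in genes_at_locus
    )
    return sorted(all_genes), same_locus, other_locus
```

Under this sketch a guide can carry both flags at once, and a guide with no exonic hit returns an empty gene list.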

Since guide re-annotation by Exorcise is agnostic to prior annotations and is determined by genome alignment and exome specification alone, it is meaningless to refer to an on-target gene when considering a guide’s off-target effects: all targets would be equally valid, barring guide efficiency differences [5,6,7]. Off-target guides in CRISPR libraries for which only one of the possible valid targets is annotated therefore have missing annotations for all the other valid targets. We term these missing annotations “false non-targeting effects”.

Therefore, we tolerate the design of guides which target more than one gene provided that the analysis is aware of all targets. Omitting a guide simply because it has off-target effects is not a valid strategy given the non-random nature of genome evolution and the restrictive nature of CRISPR guide design, which requires consideration of a PAM. Including a guide that has off-target effects but omitting annotations with its off-targets introduces false non-targeting effects in which a guide exists that targets a gene, but its signal is ignored. Exorcise re-annotates guides with all the targets that it detects, thereby eliminating false non-targeting effects.

Exorcise does not assess off-target potential of guides and instead uses sequence alignment to identify off-targets. This approach is much faster than computing off-target metrics by considering mismatches and nucleotide chemistry. We compared Exorcise’s capability of decomposing off-targeting guides into “same locus” and “other locus” against the off-target method CRISPRoff [61] and found that Brunello other-locus off-target guides had consistently lower CRISPRspec scores compared to same-locus off-target guides and on-target guides (Additional file 1: Fig. 2B). This indicates that Exorcise’s sequence alignment approach is sufficient to identify problematic off-target guides.

Missed-target effects are more prevalent in libraries using permissive design strategies

Next, we asked whether any targeting guides in the 55 commercially available libraries miss their targets. We defined missed-target guides as those guides that do not target any exons. Exorcise with RefSeq revealed that missed-target effects account for up to 16.1% of library guides, falling to 9.6% after Exorcise with GENCODE Comprehensive. This difference is expected because of the additional lower-confidence annotations available in GENCODE but not RefSeq. We found that this fraction falls most sharply in libraries using permissive design strategies. For instance, the EKO library was designed on putative protein-coding regions in AceView and GENCODE, and so Exorcise of EKO against GENCODE recovered guides that are missed targets in RefSeq. However, even after Exorcise with GENCODE, EKO still contains 7.8% missed-target guides.

To test whether missed-target guides were due to mis-design, we measured the distance between the computed cut site for each missed-target guide target, if any, and the nearest exon. We found a tendency for missed targets to have cut sites within 100 nucleotides from the nearest exon, with marked enrichments flush against and one nucleotide away from an exon boundary (Fig. 2F, Additional file 2: Table S4). This was most strongly observed in the MinLibCas9 and TKOv1 libraries and the observation was retained after Exorcise with either RefSeq or GENCODE (Additional file 1: Fig. S1C, Additional file 2: Table S36). The presence of missed targets within this interval indicates either inconsistent exon boundaries between references, or a design decision to accept cut sites outside but adjacent to an exon boundary as opposed to within it. Cut sites not explicitly within the bounds of an exon enable the possibility that indels acquired by repair at the cut site might leave the exon boundary, and therefore the sequence, the transcript, and the protein product, intact, thereby violating the assumption that successful CRISPR targeting results in a genetic knockout.
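The distance measurement can be sketched as follows. The sketch relies on the well-established property that SpCas9 cuts 3 bp upstream of the PAM; the coordinate convention (0-based, half-open intervals, with the cut expressed as a boundary index) and the function names `cut_boundary` and `distance_to_nearest_exon` are assumptions made for illustration and may differ from the convention used by Exorcise.

```python
def cut_boundary(protospacer_start, strand, guide_length=20):
    """Return the plus-strand boundary index at which SpCas9 cuts.

    protospacer_start: 0-based leftmost plus-strand coordinate of the
    protospacer (PAM excluded). Boundary index b lies between bases b-1 and b.
    """
    if strand == "+":
        # PAM lies to the right; the cut is 3 bp to the left of the PAM.
        return protospacer_start + guide_length - 3
    if strand == "-":
        # PAM lies to the left; the cut is 3 bp to the right of the PAM.
        return protospacer_start + 3
    raise ValueError("strand must be '+' or '-'")


def distance_to_nearest_exon(boundary, exons):
    """Distance in nucleotides from a cut boundary to the nearest exon.

    exons: iterable of (start, end) half-open intervals on one chromosome.
    Returns 0 when the cut lies inside an exon or flush against its boundary.
    """
    return min(max(start - boundary, boundary - end, 0) for start, end in exons)
```

With this convention, a distance of 0 for a missed-target guide corresponds to a cut flush against an exon boundary, and a distance of 1 to a cut one nucleotide away from it.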

The presence of missed-target effects violates the assumption that introduction of targeting guides causes a change in the coding sequence of a gene. Missed-target effects can occur due to design of guides with excessively permissive reference sets or by accepting cut sites at exon boundary positions where the repair outcome is unclear. In both cases, targeting may be successful but would not translate to a detectable phenotype. Since subject cell lines, especially those with genomic instabilities such as cancer cell lines [62], are unlikely to reflect every CRISPR target identified in permissive reference exomes with low-confidence transcripts, we recommend that CRISPR targets be verified in the cell line to be investigated. Exorcise helps with this by re-annotating missed-target guide RNAs given a user-defined cell line genome and exome, thus assuring the assumption that guides in the library possess valid CRISPR targets in the cell line under investigation.

Choice of reference sequence affects CRISPR hit calling

Exorcise supports re-annotation of guides with the user’s desired reference. Owing to differing guide design strategies, it follows that Exorcise is most conservative when used with references most similar to that used to design the library. We asked whether Exorcise reference choice has an impact on the outcomes of a CRISPR screen analysis. To address this, we re-annotated Brunello guides using Exorcise with GRCh37, GRCh38, and T2T [63] references (Additional file 1: Fig. 2A), and repeated DrugZ analysis on published CRISPR data using the Brunello library [64] (Additional file 1: Fig. 2C). We found that the results were largely unchanged when moving between GRCh37 and GRCh38. Results of the analysis after Exorcise with the T2T reference varied but retained the strongest hits. Using Exorcise with a specified reference enforces the assumption that the cell line investigated is well represented by that reference. This shows that the outcomes of a CRISPR screen analysis depend on the reference chosen, and therefore on how well that reference represents the cell line under investigation.

De novo library design with re-annotation retains favourable on-target efficacy

Re-annotation by Exorcise identifies missed-target effects that are removed from the library in the re-annotation. When using Exorcise for library design, missed-target guides should be replaced with on-target guides to ensure constant numbers of guides per gene. We therefore asked whether a library designed with Exorcise would retain on-target efficacy. We obtained and re-annotated the top 20 guides per gene by Vienna Bioactivity CRISPR (VBC) score, an on-target metric for CRISPR guide design, in human GRCh38 and mouse GRCm38 genomes, separately, using RefSeq exomes. Exorcise revealed very few missed-target guides due to VBC itself being designed on RefSeq exons; we removed them as well as all other-locus off-target guides. From the remaining guides, we accepted the top six guides per gene by VBC score into new libraries, which we designate “VBC Ideal Human” and “VBC Ideal Mouse”, respectively (Additional file 2: Tables S5 and S6). These libraries had distributions of VBC scores and off-target fractions comparable with the other libraries we assessed (Fig. 2D). Because we explicitly removed other-locus off-target guides in the design of the libraries, this fraction decreased after moving from 20 guides per gene to six. Subsequent Exorcise of the final libraries with GENCODE Comprehensive exomes retained freedom from missed-target guides, while off-target fractions increased. However, this is expected due to the GENCODE reference being more permissive. Taken together, we show that Exorcise is an attractive method for validating CRISPR guide targets for library design due to its ability to identify and correct off-target and missed-target effects. Combining Exorcise with library design yields balanced libraries with a uniform number of guides per gene while validating CRISPR guide targets.
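The selection procedure described above can be summarised in a short sketch. It assumes each candidate guide is a dictionary already carrying its VBC score and the two re-annotation flags; the field names and the function `design_library` are hypothetical and serve only to make the order of the filtering steps explicit.

```python
from collections import defaultdict


def design_library(guides, shortlist=20, per_gene=6):
    """Select guides per gene by VBC score after removing flagged guides.

    guides: list of dicts with keys 'seq', 'gene', 'vbc',
    'missed_target' (bool) and 'other_locus' (bool).
    Returns {gene: [guide sequences]}.
    """
    by_gene = defaultdict(list)
    for guide in guides:
        by_gene[guide["gene"]].append(guide)

    library = {}
    for gene, candidates in by_gene.items():
        # Step 1: shortlist the best-scoring candidates for this gene.
        top = sorted(candidates, key=lambda g: g["vbc"], reverse=True)[:shortlist]
        # Step 2: drop missed-target and other-locus off-target guides.
        kept = [g for g in top if not g["missed_target"] and not g["other_locus"]]
        # Step 3: accept the best remaining guides.
        library[gene] = [g["seq"] for g in kept[:per_gene]]
    return library
```

A gene with fewer than `per_gene` surviving guides ends up under-represented in this sketch; a fuller implementation would need a policy for such genes.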

Simulations of mis-annotation reveal impacts on CRISPR screens

To appraise the effects of common mis-annotations in CRISPR guides on screens, we generated a synthetic chemo-genetic dataset with prescribed gene-drug interaction values (Fig. 3A, Additional file 2: Table S7). We assigned guides to genes with either the ground truth or one of four mis-annotated exome schemes (Additional file 2: Tables S8 and S9) and challenged each scheme to capture the prescribed interactions from simulated CRISPR data (Additional file 2: Table S10). The ground truth scheme mapped guides to their intended gene uniquely, prescribing three targeting guides per gene. A “false non-targeting” scheme randomly switched targeting guides as non-targeting, resulting in between one and three guides being assigned per gene. A “missed targets” scheme assigned the same three correct guides per gene plus up to seven additional non-targeting guides per gene. A “boundary” scheme randomly shifted the exon boundaries into adjacent gene bodies such that each gene may be assigned up to seven guides, of which up to four may be off-target guides. Finally, a “random” scheme was designed where mappings were randomly created between guides and genes, even if this resulted in discontinuous gene bodies.
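Two of the mis-annotation schemes can be sketched compactly. The sketch below assumes a toy genome of ten guides per gene, of which the first three are targeting, and reproduces only the “false non-targeting” and “missed targets” schemes; the boundary and random schemes are omitted, and all function and identifier names are hypothetical rather than those used to generate the published dataset.

```python
import random


def ground_truth(n_genes, guides_per_gene=10, targeting=3):
    """Return {gene: (targeting_guides, non_targeting_guides)}."""
    genome = {}
    for i in range(n_genes):
        ids = [f"g{i}_{j}" for j in range(guides_per_gene)]
        genome[f"gene{i}"] = (ids[:targeting], ids[targeting:])
    return genome


def false_non_targeting(genome, rng):
    """Randomly drop targeting guides, keeping between one and all of them."""
    return {
        gene: rng.sample(target, rng.randint(1, len(target)))
        for gene, (target, _) in genome.items()
    }


def missed_targets(genome, rng):
    """Keep all targeting guides and add a random number of non-targeting ones."""
    return {
        gene: target + rng.sample(non_target, rng.randint(0, len(non_target)))
        for gene, (target, non_target) in genome.items()
    }


rng = random.Random(0)  # fixed seed so the example is reproducible
genome = ground_truth(n_genes=5)
fnt_scheme = false_non_targeting(genome, rng)
mt_scheme = missed_targets(genome, rng)
```

In this sketch each gene receives one to three guides under the false non-targeting scheme and three to ten guides under the missed targets scheme.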

Fig. 3

Simulated CRISPR screen on synthetic data, DrugZ analysis. A Upper: chemo-genetic interaction values were randomised for each gene-drug pair across 4000 genes and 12 drugs. A control drug was defined with chemo-genetic interaction value 1. Essential genes were modelled with chemo-genetic interaction value 0.1 with all drugs and the control drug. Lower: a synthetic genome was constructed with ten guides per gene, three of which target exons. A ground truth exome was constructed that reflected the exon structure of each gene correctly. Mis-annotated exomes were constructed: a false non-targeting exome excluded targeting guides from exons; a missed-targets exome included non-targeting guides in exons; a boundary effects exome included guides from adjacent genes in exons. Red regions in the mis-annotated exomes indicate differences from the ground truth. CRISPR read counts were simulated according to chemo-genetic interactions and targeting guide designations. Simulated read counts were annotated with Exorcise using the synthetic genome and the ground truth or mis-annotated exomes. B Upper: bundle analysis. DrugZ normZ scores for simulated gene knockouts in the simulated drug ausostam versus control. Each point is one gene. Small points indicate essential genes. Shown are genes annotated with the ground-truth exome versus those annotated with the false non-targeting exome (labels 1 and 2), the ground truth exome (label 3), or missed-targets exome (labels 4–10). Labels indicate number of guides per gene: 1 and 2 indicate missing targeting guides (false non-targeting); 4–10 indicate additional non-targeting guides (missed targets). Lower: example rank plots for genes represented at 1, 3, and 9 guides per gene, plotted on identical scales. C Simulated CRISPR screen as in B. Upper: rank plots for each exome annotation. Lower: biplot of normZ scores of genes with each exome annotation versus ground truth. NQZU3 (ground-truth resistance signal) and NTIN60 (ground-truth hypersensitivity signal) are shown in all plots. Points are coloured by chemo-genetic interaction value: red, resistance; blue, hypersensitivity. Small points indicate essential genes. Ausostam shown; representative results across 12 independent simulations. D Left, summary statistics of the receiver-operator characteristic (ROC) curve analysis. Right, ROC curves showing the performance of mis-annotated schemes to recover the discoveries made by the ground truth scheme. AUC, area under the ROC curve. Ausostam shown; representative results across 12 independent simulations

As expected, the ground truth scheme captured the correct chemo-genetic interactions after DrugZ (Fig. 3C, Additional file 1: Figs. S3 and S4, Additional file 2: Tables S15–S24) and MAGeCK (Additional file 1: Figs. S5–7, Additional file 2: Tables S37–S46) analysis of the dataset. The boundary scheme only captured the strongest interactions and discovery of weaker interactions was impaired. The missed targets and false non-targeting schemes recovered chemo-genetic interactions best when the number of guides per gene was similar to that in the ground truth scheme; that is, three. Mis-annotation by introduction of non-targeting guides into exons or removal of targeting guides from exons impaired recovery of chemo-genetic interactions. Finally, as expected, the random scheme did not successfully capture chemo-genetic interactions.

In the missed targets and false non-targeting schemes, we found that supernumerary and subnumerary guides per gene affected discovery strength but preserved discovery direction (that is, conferring drug hypersensitivity or resistance) (Fig. 3B, Additional file 2: Tables S11–S14). This was observed on plots of ground truth versus mis-annotation normZ scores as a “bundling” effect (Fig. 3C, Additional file 2: Tables S21 and S22) created by loci of normZ scores falling on straight lines with different gradients. Individual straight lines within a bundle represented genes with a distinct number of guides per gene. Bundle line gradient, and therefore discovery strength, increased as guides per gene approached the ground truth value of three. Deviations from three guides per gene due to addition of non-targeting guides or removal of targeting guides attenuated discovery strength but largely preserved direction and order.

We next quantified discovery strength by considering receiver-operator characteristic (ROC) curves for mis-annotated schemes to act as classifiers for the ground truth. We defined actual positives by using the ground truth scheme and challenged each mis-annotated scheme to recover the actual positives and exclude false positives (Fig. 3D). The false non-targeting scheme performed the best, with an average area under the ROC curve (ROC-AUC) of 0.932 over 12 independent simulations. It recovered actual positives almost to the exclusion of false positives (precision = 0.994) but did not recover all the actual positives (recall = 0.720). The missed targets scheme performed worse, recovering only the very strongest actual positives before supernumerary guide dilution impaired exclusion of false positives. The boundary scheme performed the worst (ROC-AUC = 0.721) apart from random control, although it did recover the most actual positives (recall = 0.824). Taken together, we found that mis-annotations that introduce additional guides per gene—that is, missed targets and boundary effects—represent the largest penalty to discovery strength. Mis-annotation by omission of targeting guides (false non-targeting) reduces the number of discoveries but does not impair discovery precision, and this demonstrates why libraries designed with few guides per gene, such as Gattinara, still perform well.
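The classifier metrics used here can be computed as in the following sketch. It assumes that each gene has a score under a mis-annotated scheme (for example, an absolute normZ), a binary hit call under that scheme, and a binary label from the ground truth scheme; the thresholds and scoring actually used in the analysis are not specified in this section, so the inputs are purely illustrative. The AUC is computed through its rank-sum interpretation, the probability that a randomly chosen actual positive outscores a randomly chosen actual negative.

```python
def roc_auc(scores, truth):
    """Area under the ROC curve via pairwise comparison; ties count half."""
    pos = [s for s, t in zip(scores, truth) if t]
    neg = [s for s, t in zip(scores, truth) if not t]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_recall(called, truth):
    """Precision and recall of binary hit calls against ground-truth labels."""
    pairs = [(bool(c), bool(t)) for c, t in zip(called, truth)]
    tp = sum(c and t for c, t in pairs)
    fp = sum(c and not t for c, t in pairs)
    fn = sum(t and not c for c, t in pairs)
    return tp / (tp + fp), tp / (tp + fn)
```

A scheme that calls few hits, all of them correct, scores high precision and low recall in this sketch, which is the pattern reported for the false non-targeting scheme. Both functions raise `ZeroDivisionError` when a class is empty, which a fuller implementation would handle explicitly.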

Taken together, as demonstrated by simulations, all common mis-annotations have an adverse effect on discovery. Mis-annotations that introduce incorrect guides have a larger adverse effect than mis-annotations that remove correct guides. We therefore recommend that CRISPR libraries be validated for the cell line under investigation to ensure that valid CRISPR targets in the cell line genome are considered appropriately to avoid false non-targeting, missed target, and boundary effects. Exorcise validates CRISPR library guides by alignment to a user-defined genome.

Re-annotation of DepMap cancer cell line CRISPR screens

Next, we demonstrated the applicability of Exorcise to generating personalised re-annotations of library guides for cancer cell lines in the Cancer Dependency Map (DepMap) [60]. Cancer cells undergo genomic rearrangement and instability, so it is inadequate to assume that all CRISPR targets designed on standard genome assemblies such as GRCh38 are valid in a cancer genome. We addressed this issue by deducing cancer cell line exomes from RNA-seq data and re-annotating based on transcript abundance (Fig. 4A). We assumed that transcripts expressed at a level of at least one transcript per million (TPM) were present in the exome, and we included the associated RefSeq exons for those transcripts in the exome. We compared this TPM-based strategy with a parallel strategy using all RefSeq exons regardless of transcript abundance and compared both with the published CRISPR dependency scores on DepMap.
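The TPM-based exome inference amounts to a threshold filter followed by a union of exons, as in the sketch below. The input dictionaries, the function name `infer_exome`, and the exon tuple layout are assumptions for illustration; only the threshold of one TPM is taken from the text.

```python
def infer_exome(tpm_by_transcript, exons_by_transcript, threshold=1.0):
    """Return the sorted, de-duplicated exons of transcripts at or above threshold.

    tpm_by_transcript: {transcript_id: TPM}
    exons_by_transcript: {transcript_id: [(chrom, start, end), ...]}
    """
    exome = set()
    for transcript, tpm in tpm_by_transcript.items():
        if tpm >= threshold:
            # Transcripts without exon annotations contribute nothing.
            exome.update(exons_by_transcript.get(transcript, ()))
    return sorted(exome)
```

Exons shared between an expressed and a non-expressed transcript are retained in this sketch, because a single expressed transcript is treated as sufficient evidence for the exon.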

Fig. 4

Re-annotation of DepMap CRISPR screens with transcriptomes. A Schematic of exome inference from transcriptomic data and re-annotation with Exorcise. B Representative example of re-annotated DepMap CRISPR screens. Values indicate normalised gene dependency scores (DepMap) or normalised normZ scores (Exorcise). Differential genes are highlighted. Points are coloured by normalised gene dependency score on the y-axis. C ROC curves of screens in B showing the performance of Exorcise to recover discoveries made without Exorcise. AUC, area under the ROC curve

Both TPM-based and RefSeq exome strategies agreed strongly with DepMap dependency scores, indicating retention of discoveries regardless of strategy (Fig. 4B, Additional file 2: Tables S25 and S26). For both strategies, we saw strong retention of discoveries at the extreme tails, as indicated by initially steep curves on the ROC plot (Fig. 4C). Exorcise with RefSeq enabled some additional discoveries in the intermediate range as shown by a tapered curve after the initial steep section, but this was less pronounced when using the TPM-based strategy. Uplift in the magnitude of intermediate normZ scores indicates correction of missed-target effects for guides whose targets were absent in the cancer cell line genome, as evidenced by low TPM expression.

Taken together, we demonstrate concordance between gene dependency scores computed by Exorcise and published on DepMap. We further demonstrate that exome estimation for Exorcise re-annotation is possible from transcriptomics where genomics is not available. We posit that Exorcise corrects missed target effects by excluding guides in the library when evidence from TPM expression suggests that the target is not expressed and therefore is not in the exome. However, since lack of transcript expression does not directly indicate whether the target exists in the genome, whole genome sequencing is required to validate this assumption.

Re-annotation of published DDR CRISPR screens identifies improved signal in intermediate hits

Finally, we explored whether existing DNA damage response (DDR) CRISPR screens would benefit from reanalysis with a library re-annotated by Exorcise. We subjected all of the CRISPR screens covered in the DNA damage response CRISPR screen portal (DDRcs) [4] to Exorcise with RefSeq GRCh38 and compared DrugZ normZ scores with and without Exorcise. Across all screens, we found 105 genes that exhibited an absolute normZ improvement of at least 3 (Fig. 5A, Additional file 2: Table S27). Among these 105 genes were paralogues and multiple members of gene families, for example, NBPF, NPIPA, PRR20, TBC1D3, and USP17L, whose appearance is not surprising given the extent of sequence identity among family members, leading to numerous false non-targeting mis-annotations and increased number of guides per gene after re-annotation. However, we also observed benefit in genes that presented in the absence of family members, such as Aicda, EIF3C, Polr3k, and TAF9.

Fig. 5

Re-annotation of DDRcs with select examples. A DrugZ normZ score shift (Exorcise normZ − original normZ) by gene. Each point indicates one experiment in the DDRcs. Shown only are genes in which at least one experiment has a normZ shift greater than 3 or less than − 3. Blue points indicate experiments in which that gene had a normZ shift greater than 3 or less than − 3. B Biplots of select re-annotated experiments in the DDRcs. Top to bottom: talazoparib from experiment start, Gattinara library, DeWeirdt (2020) [16]; hydroxyurea (acute) at matched timepoints, TKO v3 library, Olivieri (2020) [65]; DMSO day 22 versus no treatment day 14, Yusa Mouse v2 library, Lloyd (2021) [66]; MCL1 inhibitor S63845 at matched timepoints, DeWeirdt secondary library, DeWeirdt (2020) [16]. Blue points indicate more negative normZ shift (decreased with Exorcise); red points indicate more positive normZ shift (increased with Exorcise). Arrows indicate re-annotations by Exorcise: blue arrows mean guides added to the gene; red arrows mean guides removed from the gene. C ROC curves of screens in B showing the performance of Exorcise to recover discoveries made without Exorcise. AUC, area under the ROC curve. D DrugZ normZ score shift as in A but by library. Blue points indicate gene/experiment pairs using that library with a normZ shift greater than 3 or less than − 3 in the same direction as without Exorcise. E Genes that became significant (DrugZ FDR ≤ 0.05) with Exorcise but not without. Pseudogenes, readthrough transcripts, antisense transcripts, and uncharacterised genes are excluded

Polr3k, for instance, showed improved negative normZ score in the untreated case after Exorcise in the Yusa Mouse V2 library due to two of its guides being re-annotated from targeting to non-targeting (Fig. 5B, Additional file 2: Tables S28–S31). Our original analysis had identified Polr3k as an essential gene where depletion resulted in a fitness defect [66]. Analysis after Exorcise re-annotation suggested a stronger fitness defect.

TAF9 in the TKO v3 library was re-annotated to include additional guides for TAF9B that also target TAF9 (a false non-targeting mis-annotation) to improve the discovery of TAF9 hits. In a screen for resistance and sensitivity to acute hydroxyurea treatment, the authors identified mostly non-DDR genes as hits in their original analysis [65], but not the accessory transcription factor gene TAF9. After Exorcise, analysis revealed TAF9 depletion as hypersensitising to acute hydroxyurea treatment, consistent with hydroxyurea’s role in inducing transcriptional changes related to activation of the DDR [67,68,69].

In Gattinara, guides targeting POLR2J and POLR2J2 also target POLR2J3. Exorcise corrected these false non-targeting mis-annotations so that they also target POLR2J3. Furthermore, 14 guides targeting H2AC18, H2BC21, H3C14, and H4C14 were re-annotated to correct false non-targeting mis-annotations for H4C15. In a screen for sensitisers and resistance to talazoparib, using the Gattinara library [16], both POLR2J3 and H4C15 exhibited more negative normZ scores after Exorcise. Although neither of these genes were hits in the original screen, their relevance in the re-annotation should be considered with care due to the large increase in guides representing each gene compared to the two guides per gene design of Gattinara.

Another screen by the same authors inspected the effect of CRISPR knockouts in MCL1-inhibited cells with a custom secondary library, in which intermediate sensitising hits in Meljuso and OVCAR8 cells included genes involved in ribosome biogenesis, cell cycle checkpoints, or ubiquitylation [16]. Exorcise re-annotation revealed a new intermediate hit, RFC5, for which one guide targeting WSB2 in the DeWeirdt secondary library was also a false non-target for RFC5. The appearance of this gene as a moderate sensitiser to MCL1 inhibition is consistent with the other hits originally identified.

Among all screens, the strongest hits were consistently retained after Exorcise, as indicated by an initially steep curve on the ROC plot (Fig. 5C). Additional Exorcise hits were obtained in the intermediate range, shown on the ROC plot as plateaus. This is in line with simulations modelling false non-targeting and missed-target mis-annotations (Fig. 3D), which exist in the libraries concerned (Fig. 5B, Additional file 2: Tables S28–S31).

Across all re-annotated experiments in the DDRcs, we were able to identify genes that in multiple experiments became significant hits only after Exorcise (Fig. 5D, Additional file 2: Table S32). Among these genes were multiple members of the same family (for example, TBC1D, SPDYE, and NPIPB families) for which re-annotation enabled correction of false non-targeting errors among family members, thereby increasing the number of guides representing the same gene in the analysis. The appearance of genes in the absence of other family members (for example, SPOUT1, DERPC, and TIMM23B) indicated a benefit of Exorcise not related to correction of false non-targeting errors between family members.

We also investigated whether some libraries benefitted more than others after Exorcise by inspecting the distribution of normZ shifts across all experiments in the DDRcs by library (Fig. 5E, Additional file 2: Table S27). We found that almost all libraries had at least one experiment in which at least one gene had a normZ shift of at least three units in the original direction after Exorcise, indicating universal applicability of the algorithm. We also found a bias towards discovery of hits on the hypersensitising side of the analysis. This was an effect also seen in our simulations, where the strongest hits and their ordering on the hypersensitising side were more sensitive (Fig. 3C) and less susceptible to missed discovery due to mis-annotation (Fig. 3D). We believe that this is an artefact of the DrugZ analysis tool selected, as we do not see this bias in analysis of the same simulations using MAGeCK (Additional file 1: Fig. S4).

Taken together, our re-annotations of published screen data indicate a benefit by Exorcise for the enhanced discovery of intermediate hits. We have demonstrated that those hits may hold relevance in the context of stronger hits, which are retained after Exorcise, and we posit that they should not be ignored. Exorcise is able to reveal these intermediate hits.
