Functional analysis of structural variants in single cells using Strand-seq

NO classifies cell types and predicts gene activity changesStrand-seq data reveals NO

We hypothesized that NO patterns derived from MNase fragmentation during Strand-seq library preparation could represent a readout to allow functional characterization of SVs (Fig. 1a and Extended Data Fig. 1). To test this, we evaluated whether Strand-seq data revealed nucleosome positioning through comparison with bulk MNase-seq data. We used the NA12878 lymphoblastoid cell line (LCL), which has both datatypes available, and pooled 95 Strand-seq libraries (sequenced to a median of 540,379 mapped nonduplicate reads per single cell31; Supplementary Table 1), into a ‘pseudobulk’ track, allowing direct comparison with the corresponding MNase-seq dataset (sequenced to 19-fold genomic coverage32). We measured NO along the genome (Methods) and found Strand-seq and MNase-seq were highly concordant in terms of uniformity of coverage and inferred nucleosome positions at DNase I hypersensitive sites (Spearman’s r = 0.68) (Fig. 1b,c). Nucleosome positioning near the binding site of CTCF26,28 (a key chromatin organizer) closely matched between both assays (Fig. 1d and Supplementary Fig. 1), and estimated nucleosome repeat lengths28 were highly concordant (Supplementary Fig. 1). In addition, both assays measured NO in all 15 chromatin states identified by the Roadmap Epigenome Consortium33. Among these chromatin states, Strand-seq and MNase-seq revealed the highest NO signals on average for the polycomb-repressed state and the bivalent enhancer state, whereas the lowest average NO signals were consistently seen for the active transcription start site (TSS) state (Extended Data Fig. 2). This indicates that Strand-seq enables direct measurement of NO to reveal a ‘molecular readout’. We thus developed the scNOVA framework, which harnesses Strand-seq to measure NO genome-wide and couples this with SVs discovered in the same sequenced cell (Fig. 1a).

Fig. 1: Haplotype-aware single-cell multiomics to functionally characterize SVs.figure 1

a, Leveraging Strand-seq, scNOVA performs SV discovery and then, using phased NO tracks, identifies functional effects of SVs locally (via evaluation of haplotype-specific NO) and globally (clone-specific NO). Orange, Strand-seq reads mapped to the Watson (W) strand; green, reads mapped to the Crick (C) strand. b, Strand-seq-based NO tracks in NA12878 reveal nucleosome positions well-concordant with bulk MNase-seq, depicted for a chromosome 12 locus with relatively regular nucleosome positioning92. Red, NO tracks mapping to haplotype 1 (H1); blue, H2; black, combining phased and unphased reads; gray, MNase-seq. The y axis depicts the mean read counts at each bp in 10 bp bins. c, Correlated NO at consensus DNase I hypersensitive sites33 for NA12878. d, Averaged nucleosome patterns at CTCF binding sites34 in NA12878, using pseudobulk Strand-seq and MNase-seq. e, FCs of haplotype-resolved NO in gene bodies plotted for chromosome X and chromosome 7 (a representative autosome) in NA12878. FCs of haplotype-resolved RNA expression measurements are shown to the right. f, Pseudobulk haplotype-phased NO track of exons of the representative chromosome X gene SH3KBP1 based on Strand-seq. Boxplots comparing H1 and H2 use two-sided Wilcoxon rank sum tests followed by Benjamini–Hochberg multiple testing (FDR) correction (boxplots defined by minima = 25th percentile – 1.5 × interquartile range (IQR), maxima = 75th percentile + 1.5 × IQR, center = median and bounds of box = 25th and 75th percentile; n = 47 single cells). Bar charts show haplotype-specific RNA expression of SH3KBP1 (two-sided likelihood ratio test followed by FDR correction; n = 4 biological replicates; data are presented as mean values ± s.e.m.). g, Inverse correlation of NO at gene bodies and gene expression. NO is based on pseudobulk Strand-seq libraries from RPE-1. Gene bodies were scaled to the same length. h, Cell-typing based on NO at gene bodies (AUC = 0.96). Cell line codes: Blue, RPE-1; Purple, BM510; Magenta, C7; LV, latent variable. i, Receiver operating characteristics for inferring altered gene activity by analyzing NO at gene bodies, using pseudobulk Strand-seq libraries from in silico cell mixing.

As Strand-seq resolves its measurements by haplotype31, we considered that haplotype-specific differences in NO (haplotype-specific NO) resulting from random monoallelic expression, germline SNPs and local effects of SVs could be harnessed for scNOVA. To assess the utility of haplotype-resolved NO, we phased 24,652,658 of 49,205,197 (50.1%) of the NA12878 Strand-seq read fragments, and pooled these reads to generate pseudobulk NO tracks for each chromosomal haplotype (denoted ‘H1’ and ‘H2’, respectively; Fig. 1b). Using the female-derived NA12878 cell line, we compared haplotype-specific NO to haplotype-resolved gene expression measurements from bulk RNA-seq data34 (Methods). We identified a significant increase of NO in gene bodies mapping to H1 compared with H2 across the X chromosome (adjusted P = 0.0012; Wilcoxon rank sum test), suggesting that H1 represents the inactive X chromosome. These data were consistent with haplotype-resolved gene expression measurements at loci subject to X-inactivation35, whereas genes escaping X-inactivation did not exhibit haplotype-specific NO (Fig. 1e,f and Supplementary Fig. 3). We also investigated whether Strand-seq data is informative of haplotype-specific NO at cis-regulatory elements (CREs), and identified a 1.4-fold enrichment for allele-specific CRE binding on the X chromosome (P = 0.015; hypergeometric test; based on 718 CREs with haplotype-specific NO genome-wide; 10% false discovery rate (FDR)) (Supplementary Fig. 2). Moreover, CREs with haplotype-specific NO were significantly over-represented near genes showing allele-specific expression in the genome (P < 0.0018, hypergeometric test; Supplementary Fig. 2). These data suggest that haplotype-specific NO, a signal directly obtained from Strand-seq datasets, reflects biological gene regulation patterns in the genome.

Cell-typing

Since NO within gene bodies reflects gene activity in MNase-seq data28, we hypothesized that Strand-seq based NO patterns could be used to infer gene expression. To investigate this, we tested whether NO globally reflects cellular gene expression patterns in the retinal pigment epithelium-1 (RPE-1) cell line, for which we previously generated both Strand-seq and RNA-seq data24. To profile NO globally, we pooled 33 million read fragments (including phased and nonphased reads) from 79 Strand-seq libraries into pseudobulk NO tracks. We identified an inverse correlation between NO at gene bodies and gene expression (P < 2.2 × 10–16; Spearman’s r of up to −0.24; Fig. 1g and Supplementary Fig. 4), where highly expressed genes showed significantly lower NO within their gene bodies (and vice versa). We next explored the utility of NO for cell-type inference (‘cell-typing’), based on the activity of lineage-specific genes, by implementing a multivariate dimensionality reduction framework. We performed in silico mixing of Strand-seq libraries from different LCLs and RPE cell lines, and built a classifier that separates distinct cell types by partial least squares discriminant analysis (PLS-DA). We used a training set of 179 mixed libraries, and initially considered 19,629 features, which reflect ENSEMBL36 genes with sufficient read coverage (Methods). After feature selection, 1,738 features were retained. We then used a nonoverlapping set of 123 cells to assess performance, all of which scNOVA classified accurately (area under the curve (AUC) = 1; Extended Data Fig. 3). Our framework also discriminated between cells from three related RPE cell lines derived from the same donor, which exhibit distinct SV landscapes24,37 (AUC = 0.96; Fig. 1h) indicating that scNOVA enables accurate cell-typing.

Gene activity changes between cell populations

Having established that scNOVA can use the expression of lineage-specific genes for cell-typing, we evaluated if it could predict gene expression differences between defined cell populations, such as subclones bearing distinct SVs. We devised a module that integrates deep convolutional neural networks and negative binomial generalized linear models (Supplementary Figs. 5 and 6), to measure differential gene activity between two defined cell populations. To benchmark this module, we mixed Strand-seq libraries from different cell lines in silico, creating ‘pseudoclones’, and evaluated the predicted changes in gene activity between defined pseudoclones (each composed of cells from one cell line) by analyzing NO at gene bodies (Supplementary Fig. 7 and Extended Data Fig. 4). We first compared RPE-1 to the HG01573 LCL line, and defined the ground truth of expression using RNA-seq. We found that the differential gene activity score of scNOVA (Methods) was highly predictive of the ten most differentially expressed genes, where analyses of pseudoclones comprising 156 RPE-1 and 46 HG01573 libraries revealed an AUC of 0.93 (we observed a similar performance when analyzing the 50 most differentially expressed genes; Fig. 1i). Gene activity changes inferred included well-known markers of epithelial (for example, EGFR, VCAN) and lymphoid (for example, CD74, CD100) cell types (Supplementary Table 2). The scNOVA predictions were informative also when we simulated minor subclones present with clonal frequency (CF) = 20%, CF = 5% and CF = 1.3%, resulting in AUCs of 0.92, 0.79 and 0.68, respectively (Extended Data Fig. 4). We obtained similar results when applying scNOVA to pseudoclones derived from different (genetically related) RPE cell lines (Supplementary Fig. 7). These benchmarking exercises suggest that scNOVA can accurately infer gene activity changes between defined cell populations, suggesting that this framework can be used to functionally characterize subclonal SVs.

Functional outcomes of SVs in cell lines

To test this, we set out to investigate the functional outcomes of somatic SV landscapes in a panel of LCL samples38 (N = 25) from the 1000 Genomes Project39 (1KGP). Single-cell SV discovery in 1,372 Strand-seq libraries generated for this panel (Supplementary Table 1) discovered 205 somatic SVs, with 24 of 25 (96%) LCLs showing at least one SV subclone—a sevenfold increase compared to a previous report40 (Supplementary Table 3 and Supplementary Data). Of all the cell lines, 13 (52%) contained an SV subclone above 10% CF. This included the widely used NA12878 cell line34,39, in which we discovered a subclonal 500 kb deletion at19q13.12 (CF = 21%) that was mutually exclusive with two 22q11.2 deletions seen at CFs of 21% and 57%, respectively (Supplementary Figs. 9 and 10). The 22q11.2 SVs mapped to the well-known site of IGL recombination occurring during normal B cell development41. We hence focused on the 19q13.12 event, which resulted in the loss of a copy of ZNF382—a tumor suppressor and repressor of c-Myc42. Application of scNOVA measured significantly increased activity of ERCC6—a target gene of the c-Myc/Max transcription factor (TF) dimer43—and decreased activity of PIEZO2 and TRAPPC9, in cells harboring this deletion (10% FDR; Supplementary Table 2).

To validate these findings, we reanalyzed Fluidigm and Smart-seq single-cell RNA-seq (scRNA-seq) datasets generated for NA12878 (refs. 44,45). We employed several established tools for SCNAs discovery from scRNA-seq data46,47,48 (Supplementary Table 4), all of which failed to discover any of the SV subclones seen in this cell line (Supplementary Table 4). Yet, upon directly inputting the respective SV breakpoint coordinates into the CONICSmat tool46, we succeeded in identifying the 19q13.12 deletion (denoted ‘19q-Del’) through ‘targeted SCNA recalling’. We next pursued differential gene expression analyses by scRNA-seq, comparing 19q-Del cells to unaffected (‘19q-Ref’) cells, and verified overexpression of ERCC6 in 19q-Del cells (10% FDR; Supplementary Fig. 10). For PIEZO2 and TRAPPC9, the scRNA-seq-based expression trends were consistent with scNOVA (Supplementary Fig. 10), but did not reach the FDR threshold. A search for the over-represented TF targets amongst the differentially active genes identified c-Myc and Max as the most over-represented TFs in 19q-Del cells (10% FDR; Supplementary Fig. 10). These results indicate that scNOVA can functionally characterize SVs inaccessible to scRNA-seq-based SCNA discovery.

We next focused on NA20509, the LCL with the most abundant SV subclone (85% CF). Somatic SVs in NA20509 arose primarily through the breakage-fusion-bride-cycle (BFB) process24,49 involving a 49 Mb terminal duplication on 5q, and a 2.5 Mb inverted duplication on 17p with an adjacent terminal deletion (terDel) (Fig. 2a). The 5q and 17p segments became fused into a derivative chromosome of around 115 Mb (Supplementary Fig. 13), which probably stabilized the BFB. We searched for global gene activity changes in this ‘17p-BFB’ subclone compared with the nonrearranged cells (‘17p-Ref’) and identified 18 dysregulated genes (Fig. 2b). Testing for gene set over-representation50 (Methods) revealed an enrichment of the target genes of c-Myc/Max heterodimers (10% FDR; Fig. 2c), that is, the same TFs we observed in the 19q-Del subclone in NA12878. Consistent with this, we identified somatic copy-number gain of MAP2K3, which encodes a gene activating c-Myc/Max51, resulting from the BFB (Fig. 2a).

Fig. 2: Linking subclonal SVs to their functional consequences in LCLs.figure 2

a, Complex SVs in NA20509, with BFB-mediated rearrangements (17p) and a terminal dispersed duplication (5q) present with CF = 85%, shown for representative single cells. Ref, cells lacking complex SVs; InvDup, inverted duplication; terDel, terminal deletion. Reads are mapped to the W (orange) or C (green) strand. Gray, single cell IDs. b, Heatmap of 18 genes with altered gene activity amongst subclones, based on scNOVA (‘17p-BFB’, SV subclone; ‘17p-Ref’, 17p not rearranged). Asterisks denote TF targets of c-Myc and Max. c, Gene set over-representation analysis for TF target genes showing significant enrichment of c-Myc and Max targets in the 17p-BFB subclone. Padj., adjusted P value. Right, Model for c-Myc/Max target activation in NA20509 based on scNOVA, combined with previous knowledge. d, Mean RNA-seq expression Z scores of c-Myc/Max target genes across 33 LCLs. e, Fishplot showing CF changes over long-term culture from 23.3% (7 of 33 cells; p4) to 100% (30of 30 cells; p8). f, qPCR verifies clonal expansion of the BFB clone in p8 compared with p4 (P value based on FDR-corrected two-sided unpaired t-tests; n = 3). HG1505, control cell line with a somatically stable MAP2K3 locus. Note that for both NA20509 and HG1505 the germline copy number of the MAP2K3 locus was consistently estimated to be three. Data are presented as mean values ± s.e.m. g, RNA-seq shows significant increase of MAP2K3 at p8 versus p4 (FDR-corrected two-sided Wald test, based on DESeq2; n = 5 and three biological replicates for p4 and p8, respectively). h, Mean RNA expression Z scores of c-Myc/Max target genes in NA20509 (differences between p4 and p8 were evaluated using a two-sided Wilcoxon rank sum test; n = 5 and three biological replicates for p4 and p8, respectively). Boxplot was defined by minima = 25th percentile – 1.5 × IQR, maxima = 75th percentile + 1.5 × IQR, center = median and bounds of box = 25th and 75th percentile (gh).

We performed several orthogonal analyses to validate these findings. First, we verified all somatic SVs using deep WGS data generated for the 1KGP sample panel52 (Supplementary Fig. 13). Second, we analyzed RNA-seq data38 for this LCL panel, which revealed that NA20509 exhibits the highest MAP2K3 expression and the highest c-Myc/Max target expression (Supplementary Fig. 14 and Fig. 2d). Third, we followed the 17p-BFB subclone in culture, by subjecting early (p4) and late passage (p8) cells to Strand-seq, which revealed outgrowth of the 17p-BFB subclone (CF = 23% at p4, CF = 100% at p8; P < 0.00001, Fisher’s exact test; Fig. 2e), suggesting these cells have a proliferative advantage. Quantitative real-time PCR experiments verified this clonal outgrowth pattern (Fig. 2f).

Since the functional impact of SVs on clonal expansion is unexplored in LCLs, we more deeply characterized the molecular phenotypes of 17p-BFB cells by pursuing RNA-seq in p4 and p8 cultures. We observed increased MAP2K3 expression (1.39-fold, 10% FDR) at p8, consistent with MAP2K3 dysregulation as a result of copy-number gain in the 17p-BFB subclone (Fig. 2g and Supplementary Note). Pathway-level analysis showed deregulation of c-Myc/Max target genes following clonal expansion (P = 0.036; Wilcoxon rank sum test; Fig. 2h and Supplementary Fig. 14). Collectively, these data link the outgrowth of SV subclones to the deregulation of c-Myc/Max targets, which could represent a common driver of clonal expansion in LCLs.

Local effects of copy-balanced driver SVs in leukemia

To deconvolute the effects of driver SVs in patients, we applied scNOVA to analyze the local consequences of balanced SVs, which are widespread in leukemia3,53. We analyzed primary cells from a patient with acute myeloid leukemia (AML) (32-year-old male; patient-ID = AML_1) bearing a balanced t(8;21) translocation that results in RUNX1-RUNX1T1 gene fusion54. We sorted CD34+ cells from AML_1 (Supplementary Fig. 15), and sequenced 42 Strand-seq libraries. SV discovery revealed a 46,XY,t(8;21)(q22;q22) karyotype (Fig. 3a, Supplementary Fig. 16 and Supplementary Table 3) consistent with clinical diagnosis. We fine-mapped the translocation breakpoint to intron 1 of RUNX1T1 and intron 5 of RUNX1 (Supplementary Fig. 17), and subsequently identified haplotype-specific NO at 11 genes, genome-wide (10% FDR; Supplementary Table 2). This included RUNX1T1, which showed reduced NO on the derivative (H2) haplotype (Fig. 3b), consistent with increased gene activity mediated as a local effect of the translocation55. The remaining genes did not reside near a detected somatic SV, suggesting other factors (such as germline SNPs; Supplementary Fig. 17) may have affected their NO.

Fig. 3: Haplotype-specific NO analysis shows local effects of a copy-neutral driver SV in AML.figure 3

a, Balanced t(8;21) translocation in AML_1, discovered based on strand cosegregation (P value = 0.00003 for translocation discovery using strand cosegregation24, FDR-adjusted Fisher’s exact test; Supplementary Fig. 16). The SV breakpoint was fine-mapped to the region highlighted in light blue. Composite reads shown were taken from all informative cells in which reads could be phased (WC or CW configuration; Methods). b, A violin plot demonstrates haplotype-specific NO at the RUNX1T1 gene body (10% FDR; two-sided Wilcoxon rank sum test followed by Benjamini–Hochberg multiple correction; n = 17 single cells; boxplot was defined by minima = 25th percentile – 1.5× IQR, maxima = 75th percentile + 1.5× IQR, center = median and bounds of box = 25th and 75th percentile), consistent with aberrant activity of the locus on der(8). c, Haplotype-specific NO around the SV breakpoint. FCs of haplotype-specific NO, measured between the RUNX1-RUNX1T1 containing derivative chromosome (der(8)) and corresponding regions on the unaffected homolog (Ref), are shown in black, and –log10(P values) in light blue. Enhancer-target gene physical interactions based on chromatin conformation capture56,93 are depicted in orange (interactions involving RUNX1 and RUNX1T1) and gray (involving other loci). d, Significant CREs located within the distal peak region, demonstrating haplotype-specific absence of NO on der(8) at 10% FDR, suggesting increased CRE accessibility on der(8). Within the segment around 0.8 to 1.1 Mb upstream of RUNX1, which showed pronounced haplotype-specific NO, we tested 69 CREs for haplotype-specific NO, which identified two significant CREs. e, Haplotype-specific NO measured between der(8) and corresponding regions of the unaffected homolog. Red, regions corresponding to the fused TAD. f, A beeswarm plot shows that the fused TAD (red) is an outlier in terms of haplotype-specific NO on der(8) (P values based on Kolmogorov–Smirnov tests; n = 83 TADs in der(8); boxplot was defined by minima = 25th percentile – 1.5× IQR, maxima = 75th percentile + 1.5× IQR, center = median and bounds of box = 25th and 75th percentile).

To systematically investigate potential local effects, we used a sliding window (Methods) to measure NO on both sides of the translocation breakpoint. We observed decreased NO, suggesting increased chromatin accessibility, from the breakpoint junction up to the respective nearest topologically associating domain (TAD) boundaries (Fig. 3c). This signal was most pronounced in an enhancer-rich region around 0.8 to 1.1 Mb upstream of RUNX1 originating from chromosome 21 (P < 0.003; likelihood ratio test, adjusted using permutations; Fig. 3c), found to physically interact with the RUNX1 promoter in CD34+ cells56. Within this segment, we identified two CREs with significantly reduced NO (10% FDR; Exact test) (Fig. 3d and Supplementary Table 5), which may foster RUNX1-RUNX1T1 expression. Chromosome-wide analysis showed haplotype-specific NO patterns were restricted to the fused TAD (Fig. 3e,f), in line with these patterns resulting from the translocation.

We also revisited Strand-seq datasets with previously reported copy-neutral SVs, including the BM510 cell line in which copy-neutral interchromosomal SVs resulted in TP53–NTRK3 gene fusion24. In agreement with the oncogenic role of TP53–NTRK3 (ref. 24), scNOVA identified NTRK3 upregulation as the only significant local effect (10% FDR), consistent with allele-specific TP53–NTRK3 expression measured on the rearranged haplotype (Extended Data Fig. 5). Second, we revisited a 2.6 Mb inversion mapping to 14q32 in a T-cell acute lymphoblastic leukemia (T-ALL) patient-derived xenograft (T-ALL_P1)24. scNOVA discovered downregulation of BCL11B, a known haploinsufficient T-ALL tumor suppressor57, as a significant local effect of this balanced inversion, supporting allele-specific silencing of BCL11B on the rearranged haplotype as measured by RNA-seq24 (Extended Data Fig. 6). These data collectively show that scNOVA allows linking balanced SVs to their local functional consequences—a functionality not provided by any previous single-cell multiomic method20.

Dissecting functional effects of heterogeneous somatic SVs

We next set out to functionally dissect a leukemia sample with unknown genetic drivers, by characterizing B-cells from a 61-year-old patient with chronic lymphocytic leukemia (CLL) (CLL_24)58. Analysis of 86 Strand-seq libraries revealed an unprecedented level of somatic SVs, with 11 different karyotypes represented by 13 SVs occurring in subclones with CFs of 1–5% (Supplementary Table 3). This vastly exceeds intrapatient diversity estimates for CLLs from the Pan-Cancer Analysis of Whole Genomes (PCAWG), where maximally three subclones were reported59, highlighting how Strand-seq provides access to SVs escaping discovery by WGS3,24. Chromosome 10q showed especially pronounced subclonal heterogeneity; we identified seven partially overlapping deletions ranging from 2 to 31 Mb in size, and residing proximal to the fragile site FRA10B60 (Fig. 4a and Supplementary Fig. 18). These SVs clustered into a 1.4 Mb ‘minimal segment’ at 10q24.32, arising independently from both haplotypes (Fig. 4b). While previous studies reported somatic 10q24.32 deletions in 1–4% of CLLs61,62,63, molecular analysis of this recurrent somatic SV has so far been lacking.

留言 (0)

沒有登入
gif