Single-cell long-read sequencing-based mapping reveals specialized splicing patterns in developing and adult mouse and human brain

Ethics statement

All experiments were conducted in accordance with relevant National Institutes of Health (NIH) guidelines and regulations related to the Care and Use of Laboratory Animals. All animal procedures were approved by the Institutional Animal Care and Use Committee of Weill Cornell Medicine and were in accordance with the 2011 Eighth Edition of the NIH Guide for the Care and Use of Laboratory Animals. Human tissue samples were acquired through the NIH NeuroBioBank and were compliant with research ethics stated by the NIH. All donors completed University of Maryland Institutional Review Board-approved consent documents. They were informed via these consent documents that the donated tissue would be used for distribution to qualified researchers and that such distributions could be made at any time in the future. These consent documents also assured that the identity of the donor would remain unknown to any tissue recipients and those reviewing the results of their work.

Experimental design and analysis considerations

The null hypothesis of the study was that brain regions and developmental time points of wild-type and healthy animals had no impact on alternative splicing patterns. No experimental manipulations of mice were performed. The study design was hence observational (known samples collected at different time points) and did not require randomization of experimental or control groups. Additionally, data collection and analysis were not performed blind to the conditions of the experiments. No animals were excluded from the analysis; however, in cases where adequate data points were not available (for example, minimum number of reads per gene), then those points were excluded. No statistical methods were used to predetermine sample sizes (for example, cell number in single-cell experiments), but our sample sizes are similar to those reported in previous publications65,66. Statistical tests used in this manuscript largely involve Fisher’s exact test and Wilcoxon rank sum test, which do not make assumptions about the underlying data distribution. Some analyses used χ2 tests, for which the χ2 criterion was checked before testing32.

ScISOr-Seq2: single-cell isoform RNA sequencing from adolescent and adult mouse brainTissue acquisition

C57BL/6NTac mice (male, P14: N = 2, P21: N = 2, P28: N = 2, P56: N = 6; see Supplementary Table 1) were housed in groups of three to four per cage with a 12 h-light/12-h-dark cycle and ad libitum access to food and water. Ambient temperature and humidity were centrally regulated. Mice were perfused with 25 ml of ice-cold and carbogen-treated 1× partial sucrose cutting solution containing 5 µg ml–1 actinomycin D. The remaining 1× partial sucrose cutting solution and Earle’s balanced salt solution were oxygenated. Dissection of specific brain regions was conducted by using the mouse three-dimensional coronal sections from Allen Brain Atlas as a reference map for coordinates. The brain slices were collected on a vibratome (Leica) at a thickness of 300 µm per slice and kept in ice-cold 1× partial sucrose cutting solution. For the hippocampus, eight to ten mouse coronal slices (300 µm) were collected from the caudal region of the brain after removing the cerebellum. The hippocampus region was dissected out based on the mouse coronal sections (images 62~89). Note, each image section on Allen Brain Atlas is spaced at 100-µm intervals. Eight to ten slices can cover almost the whole hippocampus region. The visual cortex was collected based on images 79–100. The first one or two slices from the caudal region of the brain were discarded, and subsequently five to six continuous slices were collected. For the striatum, five to six mouse coronal slices were collected from the rostral region of the brain after removing the olfactory bulb. The striatum was dissected out based on the mouse coronal sections (images 39~59). Dissection of the cerebellum does not require any vibratome sectioning. The cerebellum was dissected based on the location and structure with forceps and minced into small pieces with a scalpel. Slices were transferred to a slicing chamber with bubbling 1× partial sucrose containing small-molecule mix B at room temperature, and slices were allowed to recover for ~30 min. The 1× cutting solution contained the following components: 93 mM N-methyl-d-glucamine (Acros Organics, AC126841000), 2.5 mM KCl (Sigma, 44675), 1.2 mM NaH2PO4 (Sigma, S5011), 30 mM NaHCO3 (Sigma, S5761), 20 mM HEPES (Gibco, 15630106), 25 mM glucose (Sigma, G7021), 5 mM sodium ascorbate (Sigma, A4034), 2 mM thiourea (Alfa Aesar, AAA1282822), 3 mM sodium pyruvate (Gibco, 11360070), 10 mM N-acetyl-l-cysteine (Alfa Aesar, AAA1540914), 0.5 mM CaCl2 (Sigma, 223506) and 10 mM MgSO4 (Sigma, M2643), pH 7.2–7.4.

Single-cell disassociation

Tissue sections were dissociated by using a previous protocol with modification from the Smit lab at Vrije Universiteit, the Netherlands. Regions of interest were dissected on a Sylgard-coated plate with a dark background in 1–2 ml of carbogen-treated cutting solution. Tissue pieces were transferred to 5 ml of 2 mg ml–1 activated papain (Worthington, LK003150) and incubated for 15–25 min at 37 °C with gentle mixing. After the incubation, the tissue was cut into tiny pieces and gently triturated 15–20 times using large- to small-sized Pasteur pipettes until no obvious chunks were observed. Pasteur pipettes with different opening sizes (large: 0.6–0.7 cm, medium: 0.3–0.4 cm, small: 0.15–0.2 cm) were created by flame polishing disposable glass Pasteur pipettes (Thermo Fisher) and assembled with rubber bulbs. After undissociated tissue chunks settled down, the supernatant was taken and filtered using a 30-μm cell strainer (Miltenyi Biotec, 130-041-407) into a nuclease-free collection tube. The supernatant was then centrifuged at 300–400g for 5 min at room temperature. After discarding the supernatant, the cell pellet was resuspended in 3 ml of 10% ovomucoid protease inhibitor solution (150 µl of DNase I, 300 µl of ovomucoid protease inhibitor solution and 2.55 ml of Earle’s balanced salt solution; Worthington, LK003150). Next, the cell suspension was slowly and gently added to the top layer of 5 ml of ovomucoid protease inhibitor solution (Worthington, LK003150) without interfering with the bottom layer. The cells were spun down by centrifugation at 70–100g for 6 min at room temperature. After removing all the supernatant, cells were suspended in 1 ml of fluorescence-activated cell sorting (FACS) buffer (1× HBSS (Gibco, 14175079) containing 0.2% bovine serum albumin (Thermo Scientific, 37525), 25 mM glucose (Sigma, G7021), 3 mM sodium pyruvate (Gibco, 11360070) and 0.2 U µl–1 RNase inhibitor (Ambion, AM2682)). After incubation for 15 min in FACS buffer with 0.1 µg ml–1 DAPI (Sigma, D9542), viable cells were collected as a DAPI-negative population using a Sony MA900 sorter with FlowJo version 10 software. Sorted viable cells were centrifuged and subsequently diluted to 1,000–1,500 cells per μl in FACS buffer for capture on the 10x Genomics Chromium controller.

SnISOr-Seq: single-nucleus isoform sequencing in frozen human tissueSample acquisition

Six healthy human brain samples (hippocampus, three male and three female; Supplementary Table 5) used for this study were requested through the NIH NeuroBioBank and were obtained from the University of Maryland Brain and Tissue Bank according to Institutional Review Board-approved protocols. No donors had pre-existing neurodegenerative or neurological diseases. Tissues were flash-frozen and maintained at −80 °C until processing.

Single-nucleus isolation

Approximately 30 mg of frozen tissue from each sample was dissected in a sterile dish on dry ice and transferred to a 2-ml glass tube containing 1.5 ml of nuclei pure lysis buffer (MilliporeSigma, L9286) on ice. Tissue was completely minced and homogenized to a nuclei suspension by grinding samples with with Dounce homogenizers (MilliporeSigma, D8938-1SET) with 20 strokes with pestle A and 18 strokes with pestle B. The nuclei suspension was filtered by loading through a 35-µm-diameter filter, followed by centrifuging for 5 min at 600g and 4 °C. The nuclei pellet was collected and washed with cold wash buffer (1× PBS (Corning, 46-013-CM), 20 mM DTT (Thermo Fisher Scientific, P2325), 1% bovine serum albumin (New England Biolabs, B9000S) and 0.2 U µl–1 RNase inhibitor (Ambion, AM2682)) three times. After removing the supernatant from the last wash, the nuclei were resuspended in 1 ml of 0.5 µg ml–1 DAPI (MilliporeSigma, D9542) containing wash buffer to stain for 15 min. The nuclei suspension was prepared for sorting by filtering cell aggregates and particles out with a diameter of 35 µm. After removing myelin and fractured nuclei by sorting, the nuclei were collected by centrifuging for 5 min at 600g and 4 °C and resuspended in wash buffer to reach a final concentration of 1 × 106 nuclei per ml after counting in trypan blue (Thermo Fisher Scientific, T10282) using a Countess II cell counter (Thermo Fisher Scientific, A27977).

Linear/asymmetric PCR (LAP) to remove nonbarcoded cDNA

The first-round PCR protocol (95 °C for 3 min, 12 cycles of 98 °C for 20 s, 64 °C for 30 s and 72 °C for 60 s) was performed by applying 12 cycles of linear/asymmetric amplification to preferentially amplify one strand of the cDNA template (30 ng of cDNA generated by using a 10x Genomics Chromium Single Cell 3′ GEM kit) with the primer ‘Partial Read1’, and the product was purified with 0.8× SPRIselect beads (Beckman Coulter, B23318) and washed twice with 80% ethanol. The second-round PCR was performed by applying six cycles of exponential amplification under the same conditions with forward primer ‘Partial Read1’ and reverse primer ‘Partial TSO’, and the product was purified with 0.6× SPRIselect beads, washed twice with 80% ethanol and eluted in 30 µl of buffer EB (Qiagen, 19086). The following primer sequences were used: 5′-CTACACGACGCTCTTCCGATCT-3′ (Partial Read1) and 5′-AAGCAGTGGTATCAACGCAGAGTACAT-3′ (Partial TSO). KAPA HiFi HotStart PCR Ready Mix (2×; Roche, KK2601) was used as the polymerase for all the PCR amplification steps in this paper, except for the 10x Genomics 3′ library construction.

Exome capture (CAP) to enrich for spliced cDNA

Exome enrichment was applied to the cDNA purified from the previous step by using a SSELXT Human All Exon V8 probe kit (Agilent, 5191-6879) for human samples or SureSelectXT Mouse All Exon (Agilent, 5190-4641) for mouse samples. The reagent kit SureSelectXT HSQ (Agilent, G9611A) was used according to the manufacturer’s manual. First, the block oligonucleotide mix was made by mixing an equal amount (1 µl of each per reaction) of Partial Read1 primer (5′-CTACACGACGCTCTTCCGATCT-3′) and Partial TSO primer (5′-AAGCAGTGGTATCAACGCAGAGTACAT-3′) with a concentration of 200 ng µl–1 (Integrated DNA Technologies), resulting in 100 ng µl−1. Next, 5 µl of 100 ng µl–1 cDNA diluted from the previous step was combined with 2 µl of block mix and 2 µl of nuclease-free water (New England Biolabs, AM9937), and the cDNA block oligonucleotide mix was incubated on a thermocycler under the following conditions to allow the block oligonucleotide mix to bind to the 5′ end and the 3′ end of the cDNA molecule: 95 °C for 5 min, 65 °C for 5 min and 65 °C on hold. For the next step, the hybridization mix was prepared by combining 20 ml of SureSelect Hyb1, 0.8 ml of SureSelect Hyb2, 8.0 ml of SureSelect Hyb3 and 10.4 ml of SureSelect Hyb4 and kept at room temperature. Once the reaction reached 65 °C on hold, 5 µl of probe, 1.5 µl of nuclease-free water, 0.5 µl of 1:4 diluted RNase Block and 13 µl of the hybridization mix were added to the cDNA block oligonucleotide mix and incubated for 24 h at 65 °C. When the incubation reached the end, the hybridization reaction was transferred to room temperature. Simultaneously, an aliquot of 75 µl of M-270 Streptavidin Dynabeads (Thermo Fisher Scientific, 65305) was prepared by washing three times and was resuspended with 200 µl of binding buffer. Next, the hybridization reaction was mixed with all the M-270 Dynabeads and placed on a Hula mixer for 30 min at room temperature. During the incubation, 600 µl of wash buffer 2 (WB2) was transferred to three wells of a 0.2-ml PCR tube and incubated in a thermocycler on hold at 65 °C. After the 30-min incubation, the buffer was replaced with 200 µl of WB1. The tube containing the hybridization product bound to M-270 Dynabeads was put back into the Hula mixer for another 15-min incubation at low speed. Next, WB1 was replaced with WB2, and the tube was transferred to the thermocycler for the next round of incubation. Overall, the hybridization product bound to M-270 Dynabeads was incubated in WB2 for 30 min at 65 °C, and the buffer was replaced with fresh preheated WB2 every 10 min. When the incubation was over, WB2 was removed, and the beads were resuspended in 18 µl of nuclease-free water and stored at 4 °C. Next, the spliced cDNA, which bound with the M-270 Dynabeads, was amplified with primers Partial Read1 and Partial TSO by using the following PCR protocol: 95 °C for 3 min, 12 cycles of 98 °C for 20 s, 64 °C for 60 s and 72 °C for 3 min. The amplified spliced cDNA was isolated from M-270 beads as supernatant, followed by a purification with 0.8× SPRIselect beads.

10x 3′ library preparation and sequencing

A single-cell/single-nucleus suspension containing 10,000 cells/10,000 nuclei was loaded on a Chromium Single Cell B Chip (10x Genomics, 1000154). Specifically, 75 µl of master mix + nuclei suspension was loaded into the row labeled 1, 40 µl of Chromium Single Cell 3′ Gel Beads (10x Genomics, PN-1000093) was loaded into the row labeled 2, and 280 µl of partitioning oil (10x Genomics, 2000190) was loaded into the row labeled 3. This was followed by GEM generation and barcoding after GEM-RT cleanup and cDNA amplification. Then, 100 ng of purified cDNA derived from 12 cycles of cDNA amplification was used for 3′ Gene Expression Library Construction by using Chromium Single Cell 3′ GEM, Library & Gel Bead kit v3 (10x Genomics, 1000092) according to the manufacturer’s manual (10x Genomics, CG000183 Rev C). The barcoded short-read libraries were measured using a Qubit 2.0 with a Qubit dsDNA HS assay kit (Invitrogen, Q32854), and library quality was assessed on a Fragment analyzer (Agilent) using a high-sensitivity NGS Fragment kit (1–6,000 base pairs (bp); Agilent, DNF-474-0500). Sequencing libraries were loaded on an Illumina NovaSeq6000 with PE 2 × 50 paired-end kits by using the following read lengths: 28 cycles read 1, 8 cycles i7 index and 91 cycles read 2.

Library preparation for PacBio

HiFi SMRTbell libraries were constructed according to the manufacturer’s manual by using a SMRTbell Express Template Prep kit 2.0 (PacBio, 100-938-900). For all samples, ~500 ng of cDNA obtained by performing linear/asymmetric PCR followed by exome capture (LAP-CAP) from the previous step was used for library preparation. The library construction includes DNA damage repair (37 °C for 30 min), end repair/A tailing (20 °C for 30 min and 65 °C for 30 min), adaptor ligation (20 °C for 60 min) and purification with 0.6× SPRIselect beads.

Library preparation for ONT

For all samples, ~75 fmol of cDNA was processed with LAP-CAP and underwent ONT library construction by using a Ligation Sequencing kit (ONT, SQK-LSK110), according to the manufacturer’s protocol (Nanopore Protocol, Amplicons by Ligation, version ACDE_9110_v110_revC_10Nov2020). The ONT library was loaded onto a PromethION sequencer by using a PromethION Flow Cell (ONT, FLO-PRO002) and was sequenced for 72 h. Base calling was performed with Guppy by setting the base quality score to >7.

Short-read assignment of cell types (mouse)

Fastq files were obtained from the Illumina sequencing reads by running bcl2fastq. Gene × cell matrices processed with CellRanger V3.1.0 were loaded into Seurat V3.2.3 (ref. 36) and preprocessed individually using cutoffs described in Supplementary Table 4. After filtering for high-quality cells, they were scaled and normalized using default parameters and clustered using the Louvain algorithm. Doublet clusters were discarded. Subsequently, all samples from the hippocampal developmental lineage were processed together, as were the samples from the visual cortex lineage. After combining the data without any integration approaches and using the Seurat merge function, the data were scaled and normalized, and variable genes were identified. Integration of the data to control for sample-specific batch effects was performed using Harmony. Cell types were assigned using marker genes in the following three levels of granularity: broad, cell type and cell subtype. These were then assigned to each single cell along with the information on replicate, brain region and age. For the spatial axis, that is, the cerebellum, striatum and thalamus, the two replicates were integrated with Harmony38 before assigning cell types in the same three levels of granularity as described above. Finally, the entire dataset was merged together into a single object for visualization purposes and to obtain summary statistics in Fig. 1. This was also done using Harmony while controlling for region-specific differences in gene expression.

Short-read assignment of cell types (human)

Fastq files were obtained from the Illumina sequencing reads by running bcl2fastq. Gene × cell matrices processed with CellRanger v3.1.0 were loaded into Seurat 3.2.2 and preprocessed individually. After filtering for high-quality cells, they were scaled and normalized using default parameters and clustered using the Louvain algorithm. Doublet clusters were discarded. Subsequently, all samples were processed together. After combining the data without any integration approaches and using the Seurat merge function, the data were scaled and normalized, and variable genes were identified. Integration of data to control for sample-specific batch effects was performed using Harmony38. Cell types were assigned using marker genes in the following three levels of granularity: broad, cell type and cell subtype. These were then assigned to each single cell along with the information on sample ID.

Generation of PacBio circular consensus reads

Using the default SMRT-Link (v8.0.0.78867) parameters, we performed circular consensus sequencing (8.0.0.80529) with IsoSeq3 with the following modified parameters: maximum subread length of 14,000 bp, minimum subread length of 10 bp and minimum number of passes of 3.

Preprocessing of long-read sequencing data with PacBio

Subread fastq files were obtained from circular consensus sequencing performed using the parameters described above. Data were processed using the scisorseqr32 pipeline by first aligning to the genome using STARlong (v2.7.0). Cellular barcodes were assigned using the cell-type and sample information from the short-read analysis as input to the GetBarcodes() function. Subsequently, uniquely mapped, spliced barcoded reads were obtained using the MapAndFilter() and InfoPerLongRead() functions in scisorseqr.

Preprocessing of long-read sequencing data with ONT

Reads were basecalled using MinKNOW Core (v4.0.5), Bream (v6.0.10) and guppy (v4.0.11) on the PromethION machine. Reads were aligned using minimap2 (ref. 67; v2.17-r943-dirty), and data were preprocessed using the scisorseqr (v0.1.9)32 package. Cell barcodes were assigned using the cell-type and sample information from the short-read analysis.

PacBio transcript assignment using IsoQuant

IsoQuant (v2.3.0) was run using default PacBio parameters on an aggregate of all barcoded PacBio reads with GENCODE v21 as annotation. Multiexonic transcripts classified as ‘novel in catalog’ were then used to create an enhanced annotation

ONT transcript assignment using IsoQuant

This enhanced annotation gtf file was used on each of the ONT samples to correct incorrectly assigned splice sites in multiexonic barcoded reads. Isoquant (v3.1) was run using default parameters for ONT data. These corrected splice sites were then reassigned to reads in the AllInfo file, which was then filtered for unique molecular identifiers and used as input in subsequent analyses. More information is available in Supplementary Note 3.

Obtaining full-length isoform variability across three axes

To find regional variability, we first calculated the percent inclusion (Π) for each isoform in a region by summing the inclusion values over all cell subtypes and time points for that region. A Π value is only calculated if the minimum number of reads in a gene for that region is at least 10. If at least two regions have Π values, we obtain an isoform × region matrix of Π values. We then define the variability per isoform as the max(Π) – min(Π). The brain region contributing the most to the region-specific variability is recorded as having the Π with the highest divergence from the median Π value.

The same procedure was performed to calculate age and subtype variability of a given cell type provided that the minimum read number requirements per age and subtype in a gene were met. More formally, for a cell type (CT), all possible clusters can be represented as a combination of brain region (a1), age (a2) and subtype (a3).

$$\begin},\; }\; } & = & },\\ & & }_}}} \sim a_}.a_}.a_}a_}\in \left\}\,}},}}.\,.\right\},\\ & & \qquad\qquad\qquad\quad a_}\in \},},.\,.\}\\ & & \qquad\qquad\ }}\,a_}\in \left\}},}},.\,.\right\}\end$$

Per gene and cell type (CT), a matrix (G) can be constructed with i rows containing the isoforms of the gene and j columns containing the cell clusters. So,

$$G \sim _}\;}}\,}}\,j\in }_}}}\,}}\,i\,}}\,}}\,}}\,$$

So, for a brain region (x), a subset of the matrix can be obtained containing only the subtypes originating from that brain region,

$$^ } \sim _},\;}}^ }}}}\,j \sim x.a_}.a_}$$

Thus, for a cell type (CT), the brain region (BR) variability for isoform (i) is defined as

$$\begin}}_}}}}}}}}=\max \left(\overline_}_1}}\right)-\min \left(\overline_}_1}}\right)\,}}\end$$

$$\dot_}_1=}}}=\frac}\in } } }_},}}}} }_}}}\,$$

Similarly, the age variability is defined as

$$\begin}}_}}}}}}}}=\max \left(\overline_}_2}}\right)-\min \left(\overline_}_2}}\right)\,}}\end$$

$$\dot_}_2=}}}=\frac}\in } } }_},}}}_}}}$$

and the subtype variability is defined as

$$\begin}}_}}}}}}}}=\max \left(\overline}}}}\right)-\min \left(\overline_}}}\right)\,}}\end$$

$$\dot_}_3=}}}=\frac}\in } } }_},}}}_}}\,$$

Thus, the raw isoform variability for a cell type (CT) is

$$}}_}}}}}=[}}_}}}}}}}},}}_}}}}}}}},}}_}}}}}}}}]$$

For each isoform, we thus had a raw value for age, region and subtype variability. If an isoform did not have a reported variability value for any of the three axes, that isoform was excluded from further analysis. If any of these values were at least 0.1, that is, exhibiting at least a 10% change in isoform usage across an investigated axis, then the values were normalized to add up to 1 and represented in the ternary plot. For values where the normalized regional variability was greater than 0.5 but the age and subtype variability was less than 0.5, then the isoform was considered to be region specific for a particular cell type.

Bootstrapping to evaluate cell-type differences in isoform variability

To reduce bias in our comparisons between cell types, reads were downsampled per gene for every cluster considered within an axis to have exactly ten reads. Thus, although it is possible to have a variable number of cell subtypes and time point clusters for different cell types, the Π calculation is performed on an equal number of reads. Variability per isoform and axis was performed as described above, and a matrix was obtained per iteration. One hundred such iterations were performed. To test if excitatory neurons exhibited variability not recapitulated in other cell types, that is, to test if the number of genes with isoforms in multiple (greater than or equal to two) triangles was greater than that for other cell types, we performed a Fisher’s exact test by constructing a 2 × 2 contingency matrix of counts for each iteration. This was performed per cell type compared to excitatory neurons. We then counted the number of iterations wherein the P value was significant and reported the summarized statistic.

Obtaining full-length isoform variability across three axes in pseudobulk

A similar analysis to the one described above was performed, except that the cell subtype axis was replaced by a cell-type axis. Therefore, to find regional variability, we first calculated Π for each isoform in a region by summing the inclusion values over all cell types and time points for that region, and the variability values were obtained by subtracting the min(Π) from the max(Π). Age and cell-type variability were calculated similarly.

Differential isoform expression analysisBetween-replicate variability estimation using downsampling

For each of the five brain regions, we obtained Π values of isoform inclusion per cell type and replicate. For each cell type, we selected genes wherein there were at least 100 reads per gene for both replicates. For each gene, we subsampled 50 reads from both replicates. To perform differential isoform expression analysis, we used the two-sample framework using χ2 tests of abundance, as described in scisorseqr32, between replicates. P values from a χ2 test were reported per gene, along with a ΔΠ value per gene. The ΔΠ value was constructed as the sum of change in percent isoform (Π) of the top two isoforms in either the positive or negative direction. After these numbers were reported for all testable genes for a comparison, the Benjamini–Hochberg correction for multiple testing with a false discovery rate of 5% was applied to return a corrected P value. If this false discovery rate P value was ≤0.05, then the percentage of significant genes was reported for ΔΠ values ranging between 0.1 and 0.5. For each value of ΔΠ, the average percentage of significant genes across all five brain regions was reported. This process was repeated 100 times to obtain a confidence interval. This gave us an idea of within-sample variability in isoform expression.

To obtain intersample (that is, regional) differences, we repeated the same process (downsampling followed by testing) for each cell type by considering one brain region and comparing it to the aggregated counts across the other four brain regions. This process was also repeated 100 times to get an estimate of intersample variability.

Cell types wherein the intrasample differences were much lower than the intersample differences were considered for further differential expression analysis.

Differential isoform expression for one brain region versus all others

We performed the same testing framework as described above by comparing one brain region to all others on all reads and reported the corrected P values and ΔΠ per cell type.

Differential TSS and poly(A) expression for one brain region versus all

For each sequenced long read, we assigned a known CAGE peak if the start of the read fell within 50 bp of an annotated peak. Similarly, we assigned a poly(A) site to a read if it fell within 50 bp of a known site. Counts for each known TSS and poly(A) site were obtained per cell type, and reads where a TSS/poly(A) site could not be assigned were discarded. Counts for brain region comparisons were aggregated in a similar fashion to the full-length transcript analysis described above to obtain an n × 2 matrix. Differential isoform expression was performed using scisorseqr, and the ΔΠ was recorded for each gene.

Obtaining exon counts using corrected splice sites

Using all exons appearing as internal exons in a read, we calculated the following:

1.

The number of long-read molecules containing this exon with identity of both splice sites: Xin

2.

The number of long-read molecules assigned to the same gene as the exon, which skipped the exon and ≥50 bases on both sides: Xout

3.

The number of long-read molecules supporting the acceptor of the exon and ending on the exon: Xacc_In

4.

The number of long-read molecules supporting the donor of the exon and ending on the exon: Xdon_In

5.

The number of long-read molecules overlapping the exon: Xtot

Nonannotated exons with one or two annotated splice sites, ≥70 bases of nonexonic (in the annotation) bases, were excluded as intron-retention events or alternative acceptors/donors.

We then calculated

\(_}}}=\displaystyle\frac_}}}+_}}}}}}+_}}}}}}\,}_}}}+_}}}}}}+_}}}}}}+_}}}}\)

\(_}}}=\displaystyle\frac_}}}+_}}}}}}\,}_}}}+_}}}}}}+_}}}}\)

\(_}}}=\displaystyle\frac_}}}+_}}}}}}\,}_}}}+_}}}}}}+_}}}}\)

If

\(0.05\le _}}}\le 0.95\), where condition ∈

\(\displaystyle\frac_}}}+_}\_}}}+_}\_}}}+_}}}}_}}}}\ge 0.8\)

the exon was kept.

We then calculated the Ψoverall for each cell type from all long-read unique molecular identifiers for that cell type if and only if Xtot ≥ 10 for the exon and cell type in question. Otherwise, Ψoverall for the exon and cell type was set to ‘NA’.

Identifying hVExs

We first obtained a matrix of Ψ values for all major cell types for each of the 11 samples by summing the counts over replicates. This yielded 44 triads defined by the age, region and cell type of origin. We then considered four lineages to calculate differences of exon inclusion values (ΔΨ values) across the following:

For a matched cell type and region, the developmental time-specific changes in exon inclusion were obtained by calculating pairwise ΔΨ values. All comparisons that yielded ΔΨ ≥ 0.25, that is, showed a 25% change in inclusion for an exon between two time points, were reported.

For a given time point and region, the cell-type-specific changes in exon inclusion were obtained by calculating pairwise ΔΨ values. All comparisons that yielded ΔΨ ≥ 0.25 were reported.

For a matched cell type at P56, adult brain region-specific changes in exon inclusion were obtained by calculating pairwise ΔΨ values, and comparisons yielding ΔΨ ≥ 0.25 were reported.

For a given brain region at P56, the cell-type-specific changes in exon inclusion were obtained by calculating pairwise ΔΨ values, and comparisons yielding ΔΨ ≥ 0.25 were reported.

The exons obtained from the four lineages described above were classified as hVExs. The ΔΨ value for all pairwise comparisons of 44 triads, that is, for 946 comparisons, were then calculated for these hVExs and reported. To enable hierarchical clustering of this matrix of comparisons × exons, exons with too many NA values due to lack of depth for Ψ calculations in many triads were filtered out.

Identifying EVExs

For the exons classified as highly variable (see above) and for each of the four lineages considered, we retained the comparison with the highest ΔΨ value. This yielded a matrix with four columns, one for each of the lineages, and 5,931 rows, 1 for each hVEx. Of these, exons with ΔΨ ≥ 0.75 in any of the four columns were retained. These were defined as EVExs, wherein at least one comparison between triads across the four lineages displayed 75% or more change in exon inclusion.

Mapping orthologous exons in human data

The TransMap67 projection alignment algorithm was used to map exons between human and mouse assemblies. LASTZ68 genomics alignments between the human GRCh38 and mouse GRCm39 reference assemblies were used to map reference transcript annotations between assemblies. TransMap was used instead of University of California Santa Cruz Genome Browser liftOver69 as it produces base-level alignments, allowing observation of insertions/deletions and other differences between the LASTZ chain and net alignments files70. These were obtained from the University of California Santa Cruz Genome Browser site, along with the below-mentioned programs to process them. Syntenic genomic alignments were obtained by filtering the net files to obtain the syntenic nets using ‘netFilter -syn’ and then using ‘netChainSubset -wholeChains’ to obtain a set of syntenic chain alignments for mappings. GENCODE71 human v35 and mouse vM26 were mapped to the other assembly using the pslMap program.

Obtaining cell-type variability in the human hippocampus

We obtained Ψ values for each cell type in the human data using the same strategy as for the mouse data. Orthologous exons that were unambiguously and reciprocally mapped between humans and mice and shared sequence homology and length were selected. For EVExs in groups E1–E5 identified in mice, exon Ψ values were obtained per cell type for exons where sufficient depth for the exon’s ortholog was available. Because only one brain region (hippocampus) and time point (adult) were available, variability was defined as the max(Ψ) – min (Ψ) among the major cell types. Similarly, for all exons in the human data that had orthologs in mice, Ψ values were obtained per major cell type, and exons were classified as ‘highly’ variable if eVar ≥ 0.5 and invariable but alternative if eVar ≤ 0.2. To allow for a higher number of exons to be queried, we then obtained the Ψ values for the broad categories of neurons and glia from both mouse and human data and reported them.

Effects of RBP on exon inclusion

We accessed ENCODE III data from two human cell lines, K562 and HepG2, wherein short hairpin RNA or CRISPR was used to deplete individual RBPs, followed by RNA sequencing of both the KD and control samples56,57. A total of 263 RBPs were profiled in this study, and about one-third were annotated as being involved in splicing regulation and RNA processing. We obtained data processed with rMATs and identified cassette exons that were significantly altered (P ≤ 0.05 and change in percent spliced index ≥ 0.1) in their splicing profiles from all the RBP KD experiments and recorded both the exon and the associated RBP. We intersected these exons with the human orthologs of the mouse EVExs. We could thus characterize the number of exons that were influenced by RBP expression and the number of RBPs that influenced each exon. More information is available in Supplementary Note 5.

Identifying sQTLs associated with exons of interest

We accessed the GTEx v8 (ref. 58) genome-wide association study (GWAS) catalog and obtained a list of sQTLs that were associated with intron removal ratios for brain regions included in our study. From our mouse data, we isolated exons that either flank or are contained within these significant introns identified by GTEx. Finally, we cross-referenced the genes within which these single-nucleotide polymorphisms were located with the GWAS catalog and manually identified the list of genes that are associated with central nervous system traits.

Obtaining developmental modalities of variability

Considering the second of the four lineages above, that is, the cell-type-specific differences in exon inclusion for a given time point and brain region, we calculated exon variability. For exons where we had sufficient depth to calculate the Ψ values for at least two cell types, we calculated the exon variability as the max(Ψ) – min(Ψ) for each time point. Thus,

$$\begin}}}_}}}=\max \left(\overline}}\right)-\min \left(\overline}}\right)}}\,} \\\sim [_}},_}},_}},_}}]\end$$

We then considered the first lineage from the hVEx paradigm described above, that is, developmental time-specific changes. Here, we looked at the change in variability between time point transitions, that is, from P14 to P21, from P21 to P28 and from P28 to P56. If the change in eVar (ΔeVar) was less than 0.1 in all three transitions, indicating less than a 10% change in cell-type-specific variability over time, then the exon was classified as invariable (Supplementary Fig. 21).

$$\begin\Delta }}=}}}_}}-}}}_}}\,}}\,x > y\,}}\,x,y\in \}\}\\}}}_}}}\le 0.1\forall }}\in \left(}\to }\to }\to }\right) \\\sim }}\text\,}}}}\end$$

Otherwise, the exon was classified as variable. Z scores were calculated per row of the matrix of exon × eVarTP by centering and scaling the data to have the mean at 0 and standard deviation of 1. This normalized matrix was used for clustering and obtaining nine developmental modalities.

Getting protein superfamily annotations per exon

We developed a computational pipeline to identify protein superfamily annotations per exon. For each exon, we used the genomeToProtein() function of the ensembldb package and extracted the Ensembl ID, coordinates, and residue sequence of the protein identified. We filtered the obtained protein identifiers based on their corresponding Ensembl transcript IDs and limited the search to the principal isoforms from the APPRIS72 database. For each protein sequence, we ran the SUPERFAMILY73,74 tool that uses the hidden Markov model to identify the structural-defined SCOP protein domain families and the domain boundaries. The tool was implemented in InterProScan75,76,77. Protein regions not associated with domains were considered interdomain linkers. Subsequently, for each exon, the superfamily annotation associated with the protein residue for those coordinates was identified and used for further analysis.

Enrichment of protein superfamily annotations

We considered EVExs in clusters E1–E5 (Fig. 3) as well as a background set consisting of exons with low variability across the entire dataset. For exons that had a domain associated with the coordinates of the coding sequence, we extracted the superfamily rather than individual and similar domains, given that the broader classification would allow for better grouping. We then counted the number of exons associated per superfamily and group and reported a percentage. The superfamilies for which a value was obtained only for the background set were discarded, yielding a total of 55 superfamilies. For better interpretability, we retained superfamilies that were associated with at least 4% enrichment in any group, yielding a total of 15 superfamilies.

GO analysis for genes associated with variable exon categories

For exons in the four hVEx categories (H1–H4), genes to which these exons belonged were extracted per category. Only unique genes per category were retained, meaning that if two exons from a gene belonged to two different categories, the gene was discarded from the analysis. GO biological process enrichment analysis was performed using the function enrichGO() from the clusterProfiler78 package. GO terms with q values of ≤0.1 were reported, and the enrichment value was defined as the ratio of genes in the category being considered to those in the background. Similarly, for the invariable exons and the nine variable developmental categories (G0–G9), unique genes containing the exons in each category were identified. GO biological process analysis was performed as described above using a list of brain-expressed genes obtained from SynGO79 as the background set. GO terms with q values of ≤0.1 were reported, and the enrichment value was defined as the ratio of genes in the category being considered to those in the background. In the same vein, genes of adult brain region-specific EVExs (group E4) were identified, and the same steps were performed.

Pseudotime trajectory analysis

Isoquant v3.1 was run on the ONT data, and full-length isoforms were grouped by barcode to obtain an isoform × cell sparse matrix similar to the CellRanger pipeline across the dataset. Cell barcodes corresponding to astrocytes and oligodendrocytes from the hippocampus and visual cortex developmental lineage were isolated, and a subset of a matrix containing these cellular barcodes as columns was obtained. This matrix was then processed using Seurat36 (v3.2.3) to obtain a UMAP representation of the cells. Each cell was colored according to the original short-read cell-type assignments. Slingshot (v1.6)63 was then used to obtain a pseudotime trajectory along these clusters, with the initial point specified at OPCs. Lineages obtained were then reported.

Testing for exon coordination

Testing for exon coordination can be done at the pseudobulk level or at the cell-type level. For every exon pair passing the criteria for sufficient depth, a 2 × 2 matrix of association for a given sample (that is, cell type or pseudobulk) was generated. This matrix contained counts for inclusion of both exons (in–in), inclusion of the first exon and exclusion of the second (in–out), exclusion of the first exon and inclusion of the second (out–in) and exclusion of both exons (out–out).

The co-inclusion score of an exon was defined as the double inclusion (in–in) divided by the total counts for that exon pair. An exon pair deemed ‘coordinated’ was assessed using the χ2 test of association. The effect size was calculated as the | log10 (odds ratio) |. The odds ratio was calculated by setting 0 values to 0.5 and dividing the product of double inclusion and double exclusion by the product of single inclusion, that is, [(in–in) × (out–out)]/[(in–out) × (out–in)].

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

留言 (0)

沒有登入
gif