Mutational and splicing landscape in a cohort of 43,000 patients tested for hereditary cancer

The authors confirm that we have complied with all ethical regulations and that this study was carried out in accordance with the recommendations of the Western Institutional Review Board (Puyallup, Washington), which granted an IRB waiver stating “research does not include human subjects” based on federal regulation 45 CFR 46.102(f) and the associated guidance entitled “Guidance on Research Involving Coded Private Information or Biological Specimens”, under which coded private information or biological specimens would not be considered to involve human subjects. All patients described were evaluated by genetic counselors and provided informed consent for testing. Individuals who declined participation in de-identified research were excluded from this study.

The paired DNA and RNA sequencing workflow is depicted in Supplemental Fig. 2. Genomic DNA was isolated from patients’ whole blood or saliva using QIAsymphony (Qiagen). Isolated DNA quantity and quality were assessed using absorbance at 260 nm. A total of 1 μg of genomic DNA was used as input into library preparation for dual index sequencing on Illumina platforms using a commercially available kit (KAPA Biosystems, Roche). Briefly, DNA was sheared enzymatically, end-repaired, and ligated to standard Illumina dual index adapters (IDT). After subsequent library amplification (10 cycles), libraries were pooled at equal concentrations and target enriched using hybrid capture. Custom-designed biotinylated probes (IDT xGen Lockdown) covering the coding regions of the tested cancer predisposition genes were hybridized overnight to the libraries and captured with streptavidin beads (Life Technologies). Captured libraries were subsequently amplified and prepared for sequencing on NextSeq 500 or NovaSeq using the SP flow cell (Illumina). Initial data processing and base calling are performed using the NextSeq Control Software (NCS) and RTA 2.4.11 (Real Time Analysis NCS v2.0.2.1).

Sequences are aligned to the hg19 reference genome and variants are called using the third-party software Genome Analysis Toolkit (GATK), followed by annotation via an internally developed pipeline. While DNA sequencing is routinely performed through a minimum of 30 nucleotides of each intron, masking is applied to variants generated by the bioinformatics pipeline based on the genes included in the test ordered and the analytical reporting range (>5 nucleotides beyond each coding exon). Variants with a Q score ≤30 and an allele fraction <10% are filtered out. Regions with <20× coverage on NGS are followed up with Sanger analysis. Variants in regions complicated by pseudogene interference, variant calls not satisfying the depth of coverage and variant allele frequency quality thresholds, and potentially homozygous variants are verified by Sanger sequencing. Single nucleotide variants and small insertions/deletions (≤3 nucleotides) that have an allele frequency of >35% and 100× coverage are not verified by Sanger sequencing. Large deletions and duplications are detected using a combination of a read-depth-based machine learning method and a split-read method, and/or targeted microarray or MLPA as needed.
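
The filtering and Sanger verification rules above can be summarized as a small decision sketch. This is an illustration only, not the authors' pipeline: the `VariantCall` container and its field names are hypothetical, the filter is read literally (both conditions must hold), and "100× coverage" is assumed to mean at least 100×.

```python
from dataclasses import dataclass


@dataclass
class VariantCall:
    """Hypothetical container for one NGS variant call (not from the paper)."""
    q_score: float                 # variant quality score
    allele_fraction: float         # fraction of reads supporting the variant, 0-1
    coverage: int                  # read depth at the variant position
    indel_length: int = 0          # 0 for an SNV, else inserted/deleted nucleotides
    pseudogene_region: bool = False
    potentially_homozygous: bool = False


def is_filtered_out(v: VariantCall) -> bool:
    # Literal reading of the text: Q score <= 30 AND allele fraction < 10%.
    return v.q_score <= 30 and v.allele_fraction < 0.10


def needs_sanger(v: VariantCall) -> bool:
    # Pseudogene interference and potentially homozygous calls are always verified.
    if v.pseudogene_region or v.potentially_homozygous:
        return True
    # Regions below 20x coverage are followed up with Sanger analysis.
    if v.coverage < 20:
        return True
    # SNVs and small indels (<= 3 nt) with allele fraction > 35% and
    # (assumed) >= 100x coverage are exempt from Sanger verification.
    exempt = (
        v.indel_length <= 3
        and v.allele_fraction > 0.35
        and v.coverage >= 100
    )
    return not exempt
```

For example, a heterozygous SNV at 250× with an allele fraction of 0.48 would be reported without Sanger verification under this reading, while the same call at an allele fraction of 0.30 would be sent for verification.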

Total RNA was isolated from an additional patient specimen (blood, PAX tube) using standardized methodology and quantified as described previously4. Briefly, total RNA is fragmented and undergoes first strand synthesis using random hexamers, with subsequent ribosomal RNA depletion (Kapa Biosystems, Roche). After second strand synthesis and amplification of libraries ligated with standard Illumina dual index adapters, sequence enrichment of the targeted coding exons and adjacent intronic nucleotides is carried out by a hybrid-capture methodology using long biotinylated oligonucleotide probes, followed by a subsequent library amplification and next-generation sequencing (NextSeq 500 or NovaSeq SP flow cell, Illumina).

RNA calls detected on RNA-seq were confirmed via RT-PCRseq if the total coverage for an associated splicing event (PSI denominator) was <500×. In addition, RNA calls with conflicting data, inconsistency in PSI among carriers, or any other quality concerns were also confirmed by RT-PCRseq. For RT-PCRseq, total RNA is converted to complementary DNA (cDNA) by reverse transcriptase polymerase chain reaction (RT-PCR) using a one-step approach with custom-designed primers for the target region (Superscript IV one-step RT-PCR kit, Thermo). Primer sequences are available upon request. Libraries are then prepared from the RT-PCR amplicons for standard Illumina paired-end sequencing on MiSeq (Illumina) using a commercially available kit (Kapa Hyper Plus Kit, Roche).
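
The confirmation criteria above amount to a simple rule, sketched here for illustration. The function and parameter names are hypothetical; only the 500× threshold and the three qualitative triggers come from the text.

```python
def needs_rtpcr_confirmation(
    psi_denominator: int,
    conflicting_data: bool = False,
    inconsistent_psi_among_carriers: bool = False,
    other_quality_concern: bool = False,
    min_coverage: int = 500,
) -> bool:
    """Return True if an RNA-seq splicing call should be confirmed by RT-PCRseq."""
    # Total coverage for the splicing event (the PSI denominator) below 500x.
    low_coverage = psi_denominator < min_coverage
    # Any of the qualitative flags also triggers confirmation.
    return (
        low_coverage
        or conflicting_data
        or inconsistent_psi_among_carriers
        or other_quality_concern
    )
```

Under this sketch, a call supported by 320 reads in the PSI denominator is sent for confirmation, as is a well-covered call flagged for conflicting data.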

RNA samples passed sequencing quality control if the percentage of Q30 bases was >75%, the mean base quality was >30, and the percentage of perfect index was >85%. Reads from samples passing QC were aligned using STAR 2.0 (CITE). An additional QC threshold was applied, requiring that ≥85% of exons from the 18 genes have an average coverage of ≥50×. DNA variants were evaluated for association with abnormal splicing events. Percent Spliced Index (PSI) and its comparison with a control pool of 345 healthy donors were calculated as previously described1,4. Relative to the canonical RefSeq transcript isoform annotation, the PSI value was defined as the number of reads supporting the alternative splicing event divided by the number of all reads in the region covering the splicing event. Bar plots and box plots were generated using the ggplot2 package (v3.1.1) from R v3.6.1 with default settings. Violin plots were generated using Seaborn (v.0.11.0) within Python v.3.8.3. Five types of splicing events were considered: Exon Skipping Full (ESF; skipping of at least one full exon), Exon Skipping Partial (ESP; i.e., an alternative 5′ or 3′ splice site that results in exclusion of part of an exon), Exon Skipping Full and Partial (ES; a combination of at least one ESF and ESP), Intron Inclusion Partial (IP; i.e., an alternative 5′ or 3′ splice site that results in inclusion of intronic sequence flanking the exon), and Intron Inclusion Cryptic (IC; i.e., a cryptic exon). Technical limitations in our assay may have prevented the detection of full intron retention (IR) events. Association of a splicing event with a DNA variant was evaluated manually by a team of variant assessment scientists. The 18 genes included in the RNA sequencing analysis pipeline are: APC, ATM, BRIP1, BRCA1, BRCA2, CDH1, CHEK2, MLH1, MSH2, MSH6, MUTYH, PALB2, PMS2, PTEN, NF1, RAD51C, RAD51D, TP53. Reference isoforms are listed in Supplemental Table 1.
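
The two QC gates described above can be sketched as follows. The function names and input shapes are hypothetical (the actual pipeline is not described); the thresholds are those stated in the text, with percentages given on a 0-100 scale.

```python
from typing import Sequence


def passes_sequencing_qc(
    pct_q30: float, mean_base_quality: float, pct_perfect_index: float
) -> bool:
    """Run-level gate: Q30 bases > 75%, mean base quality > 30, perfect index > 85%."""
    return pct_q30 > 75 and mean_base_quality > 30 and pct_perfect_index > 85


def passes_coverage_qc(
    exon_mean_coverage: Sequence[float],
    min_coverage: float = 50,
    min_fraction: float = 0.85,
) -> bool:
    """Exon-level gate: at least 85% of exons with average coverage >= 50x."""
    if not exon_mean_coverage:
        return False  # no exons measured, so the sample cannot pass
    n_covered = sum(1 for c in exon_mean_coverage if c >= min_coverage)
    return n_covered / len(exon_mean_coverage) >= min_fraction
```

A sample would be analyzed only if both gates pass, for example `passes_sequencing_qc(80, 33, 90) and passes_coverage_qc(per_exon_coverage)`, where `per_exon_coverage` is an assumed list of mean coverage values across the exons of the 18 genes.
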
Additional considerations: (i) with regard to blood versus tissue samples, the assay has been previously calibrated to identify abnormal splicing in blood by contextualizing putative pathogenic splicing events with previously identified P/LP variants known to affect splicing4,5, and the strength of the evidence applied to RNA data for DNA variant curation varies based on blood/tissue expression and alternative splicing identified in controls; (ii) the assay has been previously calibrated to identify NMD-targeted transcripts in blood by contextualizing the PSI of putative pathogenic splicing events versus the PSI of transcripts identified in individuals heterozygous for P/LP variants known to affect splicing and to be targeted by nonsense-mediated mRNA decay (NMD)4,5. Analysis of allele skewing using SNPs within exons to indirectly assess NMD is also used, depending on SNP availability. Sashimi plots depicted in figures only include reads in the PSI calculation for each abnormal splicing event described.
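
As a worked illustration of the PSI definition used in these methods, the ratio can be computed as below. This is a minimal sketch: the function name is hypothetical, and expressing the result as a percentage (rather than a fraction) is an assumption based on the term "Percent Spliced Index". The comparison with the control pool follows previously described methods and is not reproduced here.

```python
def percent_spliced_index(event_reads: int, total_reads: int) -> float:
    """PSI = reads supporting the alternative splicing event / all reads
    in the region covering the event, expressed here as a percentage."""
    if total_reads <= 0:
        raise ValueError("total_reads (the PSI denominator) must be positive")
    if not 0 <= event_reads <= total_reads:
        raise ValueError("event_reads must lie between 0 and total_reads")
    return 100.0 * event_reads / total_reads


# 150 event-supporting reads out of 600 reads covering the region
psi = percent_spliced_index(150, 600)  # 25.0
```

The denominator here is the same "total coverage for an associated splicing event" that determines whether RT-PCRseq confirmation is required.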

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.
