SUMMER: an integrated nanopore sequencing pipeline for variants detection and clinical annotation on the human genome

SUMMER: a multi-functioning integrated pipeline

SUMMER is a Python-based (version 3.11.3) bioinformatics pipeline that integrates Docker and Singularity for comprehensive management and analysis of Nanopore data. The pipeline processes raw sequencing reads to generate outputs including sequencing coverage and depth, variant calls (SNVs, indels, SVs, TRs, and MEIs), and clinical annotations of SVs. SUMMER employs a modular, multi-stage workflow incorporating several specialized variant callers and tools (Fig. 1): (i) alignment of the reads by Minimap2[30]; (ii) evaluation of coverage and depth, i.e. the percentage of the genome sequenced at a given read depth, by PanDepth[31]; (iii) identification of SNVs and indels by Clair3[19]; (iv) identification and combination of SVs by Sniffles2[20], SVIM[22], cuteSV[21] and combiSV[32]; (v) genotyping of TRs by Straglr[24]; (vi) identification of mobile element insertions by TLDR[26]; and (vii) clinical annotation of SVs by SvAnna[33]. This structured workflow enables comprehensive variant detection and annotation from long-read sequencing (LRS) data with a single command. SUMMER is also designed with flexibility and customization in mind. Each step can be executed independently or as part of the complete workflow, allowing users to supply intermediate files, such as a user-defined alignment BAM file, to meet specific analytical objectives. Additionally, intermediate outputs can be downloaded for further downstream analysis or customization (see “Methods”). This versatility makes SUMMER an efficient and adaptable tool for LRS data analysis.
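The modular design described above can be pictured as a small dependency graph over stages (i) to (vii). The following is a minimal sketch only: the stage names, the dependency edges, and the `plan` helper are hypothetical illustrations and are not SUMMER's actual module names, options, or scheduling logic.

```python
# Hypothetical stage graph mirroring steps (i)-(vii) in the text.
# Stage names and edges are assumptions made for illustration.
STAGE_DEPS = {
    "align": [],
    "depth": ["align"],
    "small_variants": ["align"],
    "sv_calling": ["align"],
    "sv_merge": ["sv_calling"],
    "tandem_repeats": ["align"],
    "mobile_elements": ["align"],
    "sv_annotation": ["sv_merge"],
}


def plan(targets, provided=()):
    """Return the stages to run, in dependency order.

    Stages listed in `provided` (for example an alignment the user
    already has as a BAM file) are treated as done and skipped.
    """
    done = set(provided)
    order = []

    def visit(stage):
        if stage in done:
            return
        for dep in STAGE_DEPS[stage]:
            visit(dep)
        done.add(stage)
        order.append(stage)

    for target in targets:
        visit(target)
    return order


# Complete workflow: every stage, starting from alignment.
full_run = plan(STAGE_DEPS)

# Partial run: the user supplies a BAM file and only wants annotated SVs.
from_bam = plan(["sv_annotation"], provided=["align"])
```

In this sketch, supplying an existing alignment removes the alignment stage from the plan, which is the behaviour the text describes for user-defined intermediate files.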

Fig. 1

Overview of SUMMER, from alignments to coverage evaluations, types of variant detections and SV annotation

SUMMER identifies structural variants with high accuracy

To demonstrate the performance of SUMMER in SV detection, we benchmarked the pipeline using Nanopore LRS data from GIAB HG002 [34]. Following alignment of the raw reads to the GRCh38 (hg38) reference genome, sequencing coverage and depth were assessed across all chromosomes. Coverage exceeded 80% on all chromosomes except chr22 and chrY. Depth on the autosomes ranged from 35× to 50×, while the sex chromosomes showed approximately half the autosomal depth, consistent with HG002 being a male sample (Figure S1).
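As a minimal sketch of the two quantities reported here, coverage can be taken as the fraction of reference positions with at least a minimum read depth, and depth as the mean number of reads per position. The function name, the `min_depth` default, and the toy depth values below are assumptions for illustration; PanDepth's own definitions and output format may differ.

```python
def coverage_and_depth(depths, min_depth=1):
    """Summarize per-base read depths for one chromosome.

    Returns (coverage, mean_depth), where coverage is the fraction of
    positions with depth >= min_depth and mean_depth is averaged over
    all positions, covered or not.
    """
    n = len(depths)
    if n == 0:
        raise ValueError("no positions supplied")
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / n, sum(depths) / n


# Toy chromosome of ten positions; two positions have no reads.
toy_depths = [0, 0, 40, 42, 38, 45, 41, 39, 44, 43]
coverage, mean_depth = coverage_and_depth(toy_depths)
```

Under these assumptions, a chromosome with large unaligned regions shows reduced coverage even when the covered positions are sequenced deeply, so the two numbers are best read together.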

For structural variant (SV) detection, SUMMER incorporates three state-of-the-art SV callers (Sniffles2, SVIM, and cuteSV) along with the SV merging tool combiSV, which refines their outputs into a consensus call set. combiSV combines the results from Sniffles2, SVIM, and cuteSV through a scoring-based system that leverages the strengths of each tool and mitigates their weaknesses, yielding a refined consensus set of SV calls. By default, SUMMER executes all four tools, but users can customize the process to run only selected tools.
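To make the idea of a consensus call set concrete, the sketch below merges calls from several callers when they share a chromosome and SV type and lie within a distance window, then keeps calls supported by a minimum number of callers. This is a deliberately simplified support-count vote, not combiSV's scoring system; the function name, the `max_dist` and `min_support` parameters, and the toy calls are all hypothetical.

```python
def consensus_calls(calls_by_caller, max_dist=500, min_support=2):
    """Cluster SV calls across callers and keep well-supported clusters.

    calls_by_caller maps a caller name to a list of
    (chrom, pos, svtype) tuples. Two calls join the same cluster when
    chromosome and type match and positions differ by <= max_dist.
    """
    clusters = []
    for caller, calls in calls_by_caller.items():
        for chrom, pos, svtype in calls:
            for cluster in clusters:
                same_kind = (cluster["chrom"], cluster["svtype"]) == (chrom, svtype)
                if same_kind and abs(cluster["pos"] - pos) <= max_dist:
                    cluster["callers"].add(caller)
                    break
            else:
                # No matching cluster was found: start a new one.
                clusters.append(
                    {"chrom": chrom, "pos": pos, "svtype": svtype, "callers": {caller}}
                )
    return [c for c in clusters if len(c["callers"]) >= min_support]


# Invented calls, for illustration only.
toy_calls = {
    "sniffles2": [("chr1", 10_000, "DEL"), ("chr2", 5_000, "INS")],
    "svim": [("chr1", 10_120, "DEL")],
    "cutesv": [("chr1", 9_950, "DEL"), ("chr3", 700, "DUP")],
}
consensus = consensus_calls(toy_calls)
```

Lowering `min_support` to 1 in this sketch returns the union of all calls, which trades precision for recall; a stricter threshold does the opposite.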

Using the high-confidence set of SVs from HG002 as the ground truth, we compared the SV calls from the three callers and combiSV. All tools identified the majority of true-positive SVs (Fig. 2A). In addition, 7,679 SVs identified concurrently by all four tools did not intersect with the truth set (Fig. 2A). These calls are not necessarily false positives; instead, they represent variants located outside the high-confidence regions defined by the HG002_SVs_Tier1_v0.6.bed file from GIAB (see “Methods”). These variants are expected to exhibit accuracy comparable to that of the 10,964 high-confidence SVs.

Fig. 2

Performance evaluation of different structural variant calling methods

A. UpSet plot illustrating the intersections of structural variants detected by SVIM, cuteSV, Sniffles2 and combiSV. “Truth” denotes the GIAB HG002 high-confidence callset. The largest intersection, 10,964 variants, is shared by all methods. Only 472 truth-set variants were missed by all four tools, giving a total false-negative rate of 3.7% and supporting the combined SV calling approach.

B-D. Recall (B), precision (C) and F1 score (D) of the four tools at various sequencing depths, evaluated against the HG002 GIAB high-confidence callset.

We further evaluated the precision, recall and F1 score of the tools against the HG002 high-confidence callset across varying sequencing depths. Sniffles2 maintained a precision above 0.9, the highest among the three callers, at all depths. It also achieved the highest recall at 30× and 40× depths (Fig. 2B, C) and the highest F1 scores across all depths, indicating balanced performance in both precision and recall (Fig. 2D). combiSV, which integrates the results of the three individual callers, improved recall at the expense of reduced precision across all sequencing depths (Fig. 2B, C). Because the goal in genetic diagnosis is to identify pathogenic variants, prioritizing high recall is critical to ensure that no true candidate variants are missed during the initial analysis stages. Therefore, we recommend using combiSV in its default mode to maximize recall across all depths. Alternatively, for datasets with sequencing depths exceeding 30×, Sniffles2 can be selected to achieve a more balanced performance between precision and recall.
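For reference, the three metrics used in this comparison follow from counts of true positives (TP), false positives (FP) and false negatives (FN) against the truth set. The sketch below uses the standard definitions; the function name and the counts are invented for illustration and are not values from Fig. 2.

```python
def sv_metrics(tp, fp, fn):
    """Return (precision, recall, f1) from benchmark counts.

    precision = TP / (TP + FP)
    recall    = TP / (TP + FN)
    f1        = harmonic mean of precision and recall
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented counts: a stricter call set versus a more permissive merged set.
strict = sv_metrics(tp=90, fp=10, fn=30)
permissive = sv_metrics(tp=110, fp=40, fn=10)
```

With these made-up counts the permissive set gains recall and loses precision relative to the strict set, the same kind of trade-off described above for a merged call set versus a single caller.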

SUMMER’s high performance on tandem repeat identification

Over one million short tandem repeat (STR) loci have been identified in the human genome, collectively accounting for approximately 3–5% of its total content[35, 36]. Expansions in at least 60 of these loci are known to cause severe genetic disorders such as fragile X syndrome (FXS), myotonic dystrophy, and spinocerebellar ataxias[37]. Consequently, precise identification of STRs is critical for the diagnosis of STR expansion-related diseases. SUMMER includes a curated list of 55 STRs associated with known disorders (Table S1), which serves as the default reference set. Additionally, a list provided by Straglr has been incorporated as an alternative option. For broader applications, a comprehensive genome-wide STR dataset, irrespective of known clinical associations, is also recommended (available at https://webstr.ucsd.edu/downloads). However, this comprehensive dataset is not enabled by default because of its long processing time.

Fig. 3

Detection of genomic tandem repeats using SUMMER

A. Genome-wide detection of STRs in the reference samples HG002 and HG005, representing a reference non-pathogenic STR profile across the genome.

B. Detection of CAG repeats in the HTT gene in targeted sequencing samples. Each sample has four bars: the two blue bars show the repeat numbers detected by SUMMER for each allele, and the two red bars show the repeat numbers reported in the original study for each allele.

C. Detection of GGC repeats in the FMR1 gene in targeted sequencing samples.

We assessed the performance of SUMMER in identifying tandem repeats using Nanopore whole-genome sequencing data from HG002 and HG005, provided by the Genome in a Bottle (GIAB) consortium, as well as targeted long-read sequencing (LRS) of STR loci in four additional samples[18]. The repeat numbers showed nonpathogenic distributions in HG002 and HG005 (Fig. 3A). The HTT CAG repeat numbers for samples NA06890, NA06893, NA13509 and NA13512 are shown in Fig. 3B. The FMR1 GGC repeat numbers for samples NA06890, NA06893, NA13509, NA13512 and NA05131 are shown in Fig. 3C. The observed repeat counts closely matched those reported in the original studies, with minor discrepancies of one or two repeats (Fig. 3B, C), which are considered typical and acceptable in STR genotyping. Samples NA13509 and NA13512 were diagnosed with Huntington’s disease, and SUMMER identified pathogenic expansion alleles of the HTT CAG repeat in both (Fig. 3B). Likewise, a pathogenic FMR1 repeat expansion was identified in sample NA05131, which was diagnosed with fragile X syndrome (Fig. 3C). This demonstrates that SUMMER can accurately characterize STRs and thus aid in clinical diagnosis.
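The concordance check described above, agreement within one or two repeat units per allele, can be sketched as follows. The helper name, the default tolerance of two repeats, and the allele values are assumptions made for illustration; the numbers are invented and are not the measurements shown in Fig. 3.

```python
def concordant(observed, reported, tolerance=2):
    """Check whether per-allele repeat counts agree within a tolerance.

    Alleles are compared after sorting, so the order in which the two
    alleles are listed does not matter.
    """
    obs, rep = sorted(observed), sorted(reported)
    if len(obs) != len(rep):
        return False
    return all(abs(o - r) <= tolerance for o, r in zip(obs, rep))


# Invented allele pairs (repeat units), for illustration only.
close_match = concordant(observed=(17, 45), reported=(44, 18))
mismatch = concordant(observed=(17, 45), reported=(17, 50))
```

Sorting before comparison is a simplification that works for two alleles of clearly different length; it is not a substitute for phased comparison when alleles are similar in size.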

SUMMER’s performance in SNP and indel calling

Single-nucleotide polymorphisms (SNPs) and short insertions/deletions (indels), collectively known as small variants, represent the most abundant and common genetic variations across all species. In humans, they are frequently implicated in the pathology of genetic diseases[38]. To address this, SUMMER integrates Clair3 for SNP and indel calling. Clair3 leverages machine learning, utilizing models specifically trained for germline small variant detection, and has become increasingly popular for its accuracy and reliability.

Fig. 4

Benchmarking results of SUMMER/Clair3 small-variant calling on HG002

A. Precision (upper), recall (middle) and F1 score (lower) for SNPs. The x-axis shows sequencing depth from 10× to 40×; the y-axis ranges from 0.0 to 1.0.

B. Precision (upper), recall (middle) and F1 score (lower) for indels. The x-axis shows sequencing depth from 10× to 40×; the y-axis ranges from 0.0 to 1.0.

We evaluated the SUMMER/Clair3 small-variant calls at various depths against the HG002 high-confidence callset (Fig. 4). Across sequencing depths from 10× to 40×, SNP precision (Fig. 4A) remained consistently high, between 0.9 and 1.0, regardless of depth. SNP recall reached the same range once the depth exceeded 30×, indicating that SNP calling is reliable at adequate depth. Indels showed lower reliability, with precision stabilizing around 0.6 and recall around 0.3 (Fig. 4B). F1 scores for both variant types closely mirror these trends, reflecting the balance between precision and recall. These findings highlight the strength of SUMMER/Clair3 in detecting SNPs accurately at adequate depths, while also underscoring the challenges of indel detection, particularly at lower sequencing depths.
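As a brief worked example of how F1 tracks the two underlying metrics, the approximate indel plateau values quoted above (precision about 0.6, recall about 0.3) give an F1 of about 0.4, since F1 is the harmonic mean and is pulled toward the lower of the two. The helper name is an assumption, and the SNP input of 0.95 is a hypothetical value chosen from within the reported 0.9 to 1.0 range, not a measured result.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Approximate indel plateau values quoted in the text.
indel_f1 = f1_score(0.6, 0.3)

# Hypothetical SNP values inside the reported 0.9-1.0 range.
snp_f1 = f1_score(0.95, 0.95)
```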

SUMMER’s integration of variant annotation

Beyond accurate variant identification, assessing the relevance of SVs to rare diseases is also crucial for genetic diagnosis. SUMMER runs SvAnna[33] by default to provide medical interpretation and disease assessment of SVs.

Fig. 5

Top 20 SVs ranked by SvAnna after integrating five pathogenic SVs into HG002. PSV: pathogenicity of structural variation

To evaluate SUMMER’s performance in annotating pathogenic SVs, we annotated the genome-wide SVs of HG002 after spiking in five known pathogenic SVs[16, 39,40,41]. For this analysis, the HPO phenotype parameter was set to HP:0001250 (seizure), as all five pathogenic SVs cause seizures in the affected individuals. Since HG002 is a non-epileptic sample, we anticipated the absence of seizure-related pathogenic SVs other than the five artificially introduced into the genome. SvAnna filtered out SVs with low coverage (indicating low quality) or high population frequency (indicating likely benign variants). The remaining SVs were then ranked by their Pathogenicity of Structural Variation (PSV) scores, which reflect deleterious potential. The top 20 ranked SVs included all five introduced pathogenic SVs (Fig. 5, Table S2). There is no definite cutoff for the PSV score; given that SvAnna reportedly ranks deleterious SVs in the top 10 with a probability of 87%[33], we recommend prioritizing potentially deleterious SVs by their PSV score ranking. Careful assessment of the match between variant-associated phenotypes and patient phenotypes is critical, as pathogenic SVs are often found among the top-ranked annotations. In clinical scenarios, a single monoallelic or biallelic pathogenic SV is typically observed in a sample, rather than five; the five pathogenic SVs were introduced here specifically to evaluate SUMMER’s annotation performance.
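The rank-based prioritization recommended above can be sketched as sorting annotated SVs by PSV score and checking where the spiked-in variants land. The record layout, identifiers, and scores below are hypothetical and do not reflect SvAnna's output format or the values in Table S2.

```python
def ranks_of(svs, ids_of_interest):
    """Rank SVs by descending PSV score and return ranks of selected IDs.

    svs is a list of dicts with "id" and "psv" keys; rank 1 is the
    highest-scoring variant.
    """
    ranked = sorted(svs, key=lambda sv: sv["psv"], reverse=True)
    return {
        sv["id"]: rank
        for rank, sv in enumerate(ranked, start=1)
        if sv["id"] in ids_of_interest
    }


# Invented records: two spiked-in SVs among background calls.
toy_svs = [
    {"id": "background_1", "psv": 1.2},
    {"id": "spiked_1", "psv": 35.0},
    {"id": "background_2", "psv": 8.4},
    {"id": "spiked_2", "psv": 6.1},
]
spiked_ranks = ranks_of(toy_svs, {"spiked_1", "spiked_2"})
all_in_top_3 = all(rank <= 3 for rank in spiked_ranks.values())
```

Because the ranking is relative to the other SVs in the same sample, a top-N review list of this kind is a triage aid and still requires phenotype matching for each candidate.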
