De novo and somatic structural variant discovery with SVision-pro

SVision-pro methodologyOverall workflow of SVision-pro

SVision-pro initiates by searching the case genome for candidate SV loci, after which a sequence-to-image module encodes genome-to-genome image to visually compare the case and control genomes. Then, the neural-network-based instance segmentation framework recognizes basic SV component types from the encoded image and determines the genomic differences between the case genome and the control genome. Note that, if several control genomes (N and N > 1) are specified, SVision-pro works in a 1-to-N mode and generates representation images for the case genome and each control genome. Consequently, the instance segmentation framework outputs the SV differences between the case genome and each control genome.

Candidate SV locus searching from case genome

SVision-pro identifies candidate SV loci by collecting and clustering abnormal read alignments in a model-free way that avoids searching for specific aberrant patterns of read alignments (Extended Data Fig. 1). Specifically, SVision-pro converts each read into a series of signature symbols, which can be extracted directly from a BAM file: M indicates directly mapping of alignment to the reference genome, V indicates reversed mapping and I indicates an additional sequence in read. Moreover, several properties are allocated to each signature symbol, including its span on the reference sequence, span on the read sequence, subsequence length and read name. Typically, symbols M and V are converted from split read alignments (primary and supplementary alignments) according to their reference span (reference start and end position) and mapping orientation. The symbol I is derived from both intraread alignments, by examining the CIGAR string, and inter-read alignments, by retrieving unmapped sequence between split alignments (Extended Data Fig. 1a). Note that for I, if the unmapped sequence is aligned to a distal location on the reference sequence, SVision-pro marks it as a mapped I by recoding the additional source reference span. Finally, each read is converted into a series of symbols arranged in their read order. For example, if a read does not span any SVs, there will be only one symbol M (Extended Data Fig. 1b). If a read spans a deletion, the read will be converted into symbol series MM, where there is a gap between the reference end position of the first M and the reference start position of the last M (Extended Data Fig. 1c). For complex events, such as a deletion associated with an inversion, the event-supporting read is converted into symbol series MVM (Extended Data Fig. 1d). By adopting this convention, we are able to cluster similar read symbol series iteratively and identify any abnormal ones (Extended Data Fig. 1e). A read with the converted symbol series M is considered a normal read, otherwise, it will be marked as an aberrant one. If the number of reads supporting the same aberrant symbol series surpasses the minimum requirement (default ten reads), the genomic region covered by the aberrant symbol series is considered a candidate SV locus.

Image representation at candidate SV loci

To generate representation images, SVision-pro takes two main steps: structure sketching (Extended Data Fig. 2) and content rendering (Extended Data Fig. 3).

(1)

Structure sketching: for a candidate SV locus, the structure sketching step directly converts the 1D read symbol series into a 2D similarity image (Extended Data Fig. 2a), which uses segments and gaps to visually measure the mapping similarity between reference sequence (x axis) against variant feature sequence (y axis). The reference axis ranges from the start reference position of the first symbol to the end reference position of the last symbol. The read axis ranges from 0 to the length of the read. Typically, segments are derived from symbols M, V and mapped I, whereas gaps are derived from the unmapped symbol I and reference gaps between M and V symbols. Segments and gaps, excluding those converted from M symbols, are marked with aberrant flags for subsequent content rendering step (Extended Data Fig. 2b). This type of similarity image makes it easy for humans and machines to visualize SV structures.

(2)

Content rendering: SVision-pro fills the sparse region in the similarity image with ACTs originated from both case and control genomes.

Generating ACTs

Inspired by the regular coverage track commonly used in Integrative Genomics Viewer (IGV)29, SVision-pro introduces the ACT. In brief, the regular coverage track is a 2D grayscale barplot, where the x axis indicates reference positions and y axis indicates the coverage values, which are computed by counting the number of mapped alignments at each reference position (Extended Data Fig. 3a). The ACT in SVision-pro utilizes an RGB (red, green and blue) stacked barplot to encode additional genomic information that reflects SV signatures. Before constructing the ACT (Extended Data Fig. 3a), we count the number of alignments along with their mapping conditions. The mapping conditions of alignments include forward mapping, reversed mapping, duplicated mapping and reverse-duplicated mapping. Forward and reversed mapping conditions are retrieved directly from the aligner’s outputs and duplicated mapping is determined by checking whether an alignment is encompassed by other alignments from the same read (Extended Data Fig. 3a).

Next, we convert the count table into a three-channel RGB image. We use the RGB color values (135, 206, 255) to plot the coverage value of forward-mapped alignments. For the coverage value of reversed alignments, we subtract 100 from the color value in the second channel (Supplementary Fig. 1a). Likewise, for the coverage value of duplicated alignments, we subtract 100 from the color value in the third channel (Supplementary Fig. 1b). In cases of reverse-duplicated alignments, both the second and third channels undergo a subtraction of 100 (Supplementary Fig. 1c). In brief, we use the second image channel to depict the reverse signatures and the third image channel to depict the duplication signatures. By leveraging this RGB stacked barplot in the ACT, SVision-pro provides a more comprehensive representation of the coverage information, incorporating distinct color variations to depict different types of alignments and their contribution to the SV signature.

Filling ACTs into similarity image

Genome-to-genome comparison requires comparative representation features to contrast the SV differences between the case genome and the control genome. Therefore, we utilize the sparse regions within the similarity image to fill the two ACTs originating from the case and control genomes (Extended Data Fig. 3b). To accomplish this, we first create two fixed-height and empty tracks along these sketched segments and gaps: one track (upper track) above and one track below (lower track). The upper track is used to fill the ACT of the control genome whereas the lower track is used to fill the ACT of the case genome. For a sketched similarity image i, we generate ACTs in both case and control genomes by fetching all read alignments from i.reference_start to i.reference_end. This ensures that the reference span of the sketched similarity image matches that of the ACTs. Next, we fill ACTs into upper/lower tracks that surround aberrant segments and gaps by aligning the reference coordinates. Contrasting ACTs in upper and lower tracks show apparent SV differences between the case and control genomes. Moreover, this kind of similarity image and ACTs maintains readability for both human and machines for further analysis.

Insertion-associated SV representation

Insertions and insertion-related SVs involve additional sequence present in the read sequence that is not in the reference sequence, leading to vertical gaps in the sketched similarity images (Supplementary Fig. 2a). Therefore, for insertions, we create two empty tracks located on the left (used to fill the ACT of the control genome) and right (used to fill the ACT of the case genome) sides of these insertion-induced vertical gaps (Supplementary Fig. 2b). Unlike deletions, inversion and duplications, where we count the alignment mapping conditions against the reference genome, for insertions, we count the alignments at read-level to calculate the number of reads that contain the inserted sequence (Supplementary Fig. 2c). Then, we generate vertical ACTs for both case genome and control genome and fill them into the right and left empty tracks, respectively. For insertion-associated CSVs, such as insertion-associated inversion, alignments are counted at both read-level and reference-level (Supplementary Fig. 2d).

One-to-N mode

The genome-to-genome representation module in SVision-pro allows for the comparison of one case genome with one control genome within a single image. However, in certain applications, such as de novo SV discovery, several control genomes are involved. To accommodate such scenarios, SVision-pro employs a One-to-N mode to generate images between case genome and each control genome. For example, de novo SV discovery in a trio comprises three genomes: child, father and mother. For a candidate SV locus, SVision-pro generates one image that compares the child genome with the father genome, and another that compares the child genome with the mother genome. This process results in two images that can be utilized by the subsequent instance segmentation framework for further analysis. By employing the One-to-N mode, SVision-pro enables direct comparison of the case genome with several control genomes. Moreover, SVision-pro can identify any genome-specific SVs among several genomes by taking one genome as the case genome and all others as control genomes.

Flexible properties of representation image

The image sizes, colors and track heights are flexible and can be customized to meet various application scenarios. Currently, SVision-pro offers three optional image sizes for different sensitivity requirements, including 256, 512 and 1,024, whose track height for rendering contents is 25, 50 and 100 pixels, respectively. Thereby, the minimum representable (1 pixel) and detectable AFs (one per track height) of the three image sizes are 0.04, 0.02 and 0.01, respectively. Note that AF 0.01 is not the lowest detection limit of SVision-pro, and that the track heights and images sizes can be customized to meet lower AF detection requirements.

SV detection and genotyping by instance segmentation

The encoded representation images are directly fed into a neural-network-based instance segmentation framework without any manual or knowledge-oriented preprocessing. Since CSVs typically comprise several internal subcomponents, the instance segmentation framework in SVision-pro is designed to recognize five basic subcomponent types, including insertion (INS), deletion (DEL), inversion (INV), duplication (DUP) and inverted duplication (invDUP). In cases where there is no SV present in the control genome, a recognition type reference (REF) is included to denote that the control genome is identical to the reference genome. Specifically, the instance segmentation framework recognizes these six instance types in the encoded image and generates a segmentation mask. The mask assigns each pixel in the image to either a predicted specific type or the background type, segmenting the image regions and providing quantitative information about the presence and location of various SV subcomponents (Extended Data Fig. 4a). The horizontal span of the masked regions represents the breakpoint span of the subcomponents, while the vertical span represents the allele frequency (Extended Data Fig. 4b). Finally, in respective panels, we obtain the final SV type of the candidate locus by directly jointing together these subcomponents in their read order. By contrasting the lower and upper panels in the segmentation mask image, SVision-pro can determine whether a SV subcomponent is (Extended Data Fig. 4b) Germline, indicating that the SV subcomponent is present in the control genome with same allele frequency; (2) New allele, indicating that the SV subcomponent is present in the control genome at a different allele frequency; (3) New component, indicating that the SV subcomponent is absent from the control genome or (4) New breakpoint, indicating that the SV subcomponent is present in the control genome with a different breakpoint span. If several control genomes are provided, such as the father and mother genome in the scenarios for de novo SV discovery, SVision-pro will output the differences between the case genome and each control genome (Extended Data Fig. 4c).

Performance benchmarking methodologySSV detection benchmark in HG002 groundtruth

The groundtruth SSVs (HG002_SVs_Tier1_v0.6.vcf.gz, highly confident insertions and deletions) of HG002 (Ashkenazim Trio, son), were applied to benchmark the SSV detection performance of callers. The detailed data generation steps were identical to those described in cuteSV3 paper. Briefly, both raw HiFi and ONT reads were aligned to human genome GRCh37 using Minimap2 (ref. 30) with parameter ‘-x pacbio/ont’. Seven state-of-the-art callers, including SVision-pro, SVision6, Sniffles2 (ref. 15), cuteSV3, debreak4, pbsv and SVDSS5, were applied to the aligned reads with the minimum SV supporting read number set to ten. Truvari31 was employed to calculate precision, recall and F1-score between the groundtruth and the callset. Please refer to Supplementary Note 6 for the specific versions and parameters of each caller.

CSV detection benchmark in simulated data

The CSV simulation set, which contains 3,000 CSVs crossing ten frequently reported types, was obtained directly from our previous SVision paper6. We followed the same procedure described in this paper to generate both HiFi and ONT reads and performed subsequent alignment to GRCh38 by NGMLR2. The five highest-performing callers on the HG002 groundtruth dataset (SVision-pro, SVision, Sniffles2, cuteSV and debreak) were employed for the subsequent Truvari region-based comparison. Type-based comparison was performed by examining the CSV subcomponent accuracy. To accomplish this (Supplementary Fig. 3a), we first extracted the matched SV record pairs between the groundtruth and callset from Truvari output files, namely TP-base.vcf and TP-call.vcf, which respectively enumerated the groundtruth record and matched callset record, respectively. Then, for each matched record pair, if any SV component from the groundtruth record was absent from the called record, this record pair was marked as inaccurate (Supplementary Fig. 3b). Note that, only SVision-pro and SVision reported SV component types. For the remaining callers, since they only reported SSVs and limited number of CSV types, we treated their output type directly as a component type.

Mendelian consistency analysis in six families

We collected 19 Mendelian samples from six previously published families, including the Ashkenazim Trio, Chinese Trio, YRI Trio, CHS Trio, PUR Trio and Chinese Quartet (Supplementary Table 1). All six families were sequenced using HiFi reads, with the Ashkenazim Trio, Chinese Trio and Chinese Quartet also sequenced with ONT reads. All reads were aligned to GRCh38 genome using Minimap2. We utilized five callers, including SVision-pro, SVision, Sniffles2, cuteSV and debreak, and two merging approaches, including Jasmine and SURVIVOR. For SVision-pro, we considered the child sample as the case genome and parent samples as control genomes. Sniffles2 was employed in multisample calling mode, following official instructions. For the remaining three callers that required merging approaches, we first applied them independently to generate callsets for each sample, including child(ren), father and mother. Then, we merged these callsets (for example, for ChineseQuartet, there were four callsets) together by Jasmine and SURVIVOR with the default or recommended parameters (Supplementary Note 2). To measure the Mendelian consistency within each family, we extracted the child and parent genotypes from each SV record in the VCF. If the genotypes of child, father and mother adhered to the Mendelian Law, we marked this record as a consistent one. Finally, we computed the Mendelian consistency rate by dividing the number of consistent records by the total number of records.

Twin discordancy analysis in Chinese Quartet

A common assumption is that the genomes of monozygotic twins are almost identical32. Therefore, the monozygotic twins (termed as child1 and child2) in the Chinese Quartet were used to calculate the twin discordancy. In brief, if one SV was present in the child1 genome while absent from the child2 genome, we would consider this SV as a discordant one between the twins. As such, for each SV record, we extracted the outputted genotypes of both child1 and child2 and examined whether they were identical. Finally, we computed the twin discordancy by dividing the number of discordant records by the total number of records.

De novo SV analysis in six families

For SVision-pro, de novo SVs were extracted by checking whether the comparison results of child-to-father and child-to-mother were both ‘New Component.’ For Sniffles2 and the merging approaches, de novo SV records were extracted by checking whether the SUPP_VEC equaled 100, indicating this SV record presented only in the child genome. Moreover, we compared the de novo SVs between SVision-pro and Sniffles2. De novo SV calls from Sniffles2 were overlapped with all SV calls from SVision-pro using the BEDtools33 intersect option with reciprocal overlap fraction set to 0.5. Since merging approaches resulted in many more redundant de novo SVs, we verified manually only the de novo SVs called by SVision-pro and Sniffles2 using IGV29 (Supplementary Files 4 and 5).

Somatic SV analysis in tumor-normal paired cell line HCC1395

A previous study27 utilized several sequence technologies and established a consensus somatic SV callset of 1,788 SVs on cell line HCC1395 and its normal pair HCC1395BL. We download the published HiFi, ONT and PacBio CLR long reads of the two cell lines and aligned them to human genome GRCh38 by Minimap2 with parameter ‘-x pacbio.’ Three callers that could detect somatic SVs were employed on this tumor-normal paired cell line, including SVision-pro, Sniffles2 and nanomonsv. SVision-pro took the tumor cell line as the case genome and normal cell line as the control genome. Sniffles2 was employed in its nongermline mode and nanomonsv was employed according to official instructions. For the three callers, the minimum number of supporting reads was set to 2 and the minimum detectable AF was set to 0.01.

High-confidence region filter

The raw high-confidence regions (HG002_SVs_Tier1_v0.6.bed) were hg19-based. Therefore, following the instruction of SVDSS paper5, we first used liftOver to convert these regions into hg38-based coordinates. Then we applied BEDtools intersect option with reciprocal overlap fraction set to 0.5 to filter out SV calls that were not located within high-confidence regions.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

留言 (0)

沒有登入
gif