Localized assembly for long reads enables genome-wide analysis of repetitive regions at single-base resolution in human genomes

Sequencing data

We sequenced the genomic DNA of NA18943 from a B cell line using a single platform, MinION (ONT). After combining yields from 26 flowcells, the sequencing data totaled 231.6 Gbp (77 × coverage) (Additional file 1: Table S3). The downloaded data of NA19240 was 255.8 Gbp (79 × median coverage) [19].

Reduction in error rate and accuracy validation of LoMA

To evaluate the accuracy of LoMA, we performed extensive validations. First, we randomly selected 108 regions and constructed CSs using LoMA (Additional file 1: Table S1). Thirteen of the 108 regions were classified as heterozygous, and two CSs were generated for each heterozygous region. In total, 121 CSs were obtained. We aligned them to GRCh38 and assessed the error rate. The total alignment length was approximately 7.3 Mbp (Additional file 1: Table S4). The estimated error rate was 8.7% (SD = 0.72) for raw reads and 0.76% (SD = 0.67) for CSs by LoMA (Fig. 3A, Additional file 1: Table S1). Similarly, we attempted to construct CSs of matched regions using lamassemble for a comparison but failed to assemble three of the 108 regions, leaving 105 obtained sequences (Additional file 1: Table S1). The alignment with GRCh38 estimated the mean error rate of the lamassemble sequences to be 2.3% (SD = 4.1) (Fig. 3A, Additional file 1: Table S1).

Fig. 3figure 3

Accuracy assessment and simulated error rate of LoMA. A. A comparison with raw reads (blue) and sequences by lamassemble (orange). Consensus sequences (CSs) by LoMA (green) showed a sharp drop in the error rate compared with raw reads and lamassemble. The standard deviations were 0.67, 4.09 and 0.72 for LoMA, lamassemble and raw reads, respectively, demonstrating that the dispersion of LoMA is smaller than of lamassemble and that the instability of sequences in raw data is reduced by LoMA. B. A comparison of CSs constructed by LoMA with sequences by Sanger sequencing. The sequences from 121 insertions of NA18943 showed high accuracy (at least 99.7%). Individually, 66 out of 121 (54.5%) regions constructed by LoMA were error-free compared with poly(A) compressed sequences. C. The average error rate of simulated data (n = 100). Simulated reads were generated for one hundred randomly selected genomic regions and analyzed by LoMA. The average error rates were 2.6% (10×), 0.29% (20×), 0.076% (30×), 0.041% (40×), and 0.034% (50×) for each mean coverages from 10 to 50

We next performed an experimental validation for the CSs generated by LoMA. We found 14 SV candidates from 13 heterozygous regions (Additional file 1: Tables S1, S5). The primer design and PCR amplification were successful for eight heterozygous candidates, and the product sizes of all candidates were concordant with the expectation of LoMA (Additional file 1: Table S5, Additional file 2: Fig. S2).

Further, we performed Sanger sequencing for 121 homozygous insertions randomly selected from NA18943 (Additional file 1: Table S6). The total compressed length amounted to 66,126 bp, and 65,932 bp were matched to CSs by LoMA (sequence identity 99.71%) (Fig. 3B, Additional file 1: Table S6). Sixty-six CSs from the 121 insertions (54%) were perfectly matched (identity 100%) to the corresponding Sanger sequences (Additional file 1: Table S6).

To evaluate the computation performance of LoMA, we performed a simulation using the ONT read simulator, NanoSim [23]. The results of the reconstruction of randomly selected regions (n = 100) showed an error rate of 2.6% for 10×, 0.29% for 20×, 0.076% for 30×, 0.041% for 40×, and 0.034% for 50× (Fig. 3C, Additional file 1: Table S2, Additional file 2: Fig. S3). The computation time and peak RSS of LoMA were also measured. The computation time was generally linear for coverage up to 50× and target size up to 100 kbp, with a processing time of a few minutes (Additional file 1: Table S7, Additional file 2: Figs S4, S5). The peak RSS was approximately 1 GB for coverage up to 50× and target size up to 100 kbp, and the memory increase was more suppressive than linear (Additional file 1: Table S8, Additional file 2: Figs S4, S5). These results suggest that LoMA had sufficient performance for the sequence analysis.

Genome-wide analysis of SV polymorphisms using LoMA

To detect SVs using LoMA, we identified unclear regions based on the number of reads with clipped sequences or indels (Fig. 2A and Methods). In NA18943, we detected 27,562 bins as candidates, and they were combined into 16,544 unclear regions. In NA19240, 56,158 candidate bins were detected and combined into 18,928 unclear regions. The larger number of unclear regions in NA19240 reflected the high genetic diversity of African populations [37]. Both NA18943 and NA19240 showed a similar pattern of the landscape of unclear regions: they were localized in the pericentromeric and HLA region in both samples (Fig. 2B).

We next constructed the sequences from the unclear regions. After the LoMA analysis, we obtained the CSs of 13,822 and 15,220 regions for NA18943 and NA19240, respectively, including 3065 and 7542 heterozygous regions (see Additional file 2).

Comparison with a standard SV set and threshold for SV detection

The benchmark with a standard SV set of NA19240 indicated that the precision decreased under the coverage of 20×, although it became stable after more than 20 reads (Fig. 2C). The mean precisions of insertions and deletions with 20 or more reads were 0.82 and 0.93, respectively. For less than 20 reads, the mean precision was 0.52 for each. However, the percentage of SVs with less than 20 reads in the entire SV set was 21.9% for insertions and deletions, respectively; thus, excluding these SVs did not strongly affect the conclusions of this study. Accordingly, we focused on SVs in regions with 20 or more reads to define a more conservative callset of indels. After this filtration, we finally obtained 5516 insertions and 2687 deletions in NA18943, and 6542 insertions and 3475 deletions in NA19240 (the LoMA SV set).

Overview of insertions and deletions

The length distributions of the indels were consistent with a previous study (Fig. 4A) [3], suggesting that the genome-wide analysis using LoMA provided a reliable result. Sequence analysis at single-base resolution is effective for understanding the biological mechanism of insertions. To understand the characteristics of the detected insertions, we first classified them (Fig. 4B, C and Methods). The numbers of TRs were 2841 (51.5%) and 2922 (44.7%) in NA18943 and NA19240, respectively, TEs totaled 1819 (33.0%) and 2506 (38.3%), and TDs amounted to 71 (1.3%) and 88 (1.3%). Approximately 1.5% (NA18943) and 1.2% (NA19240) of the insertions were attributed to variants in satellite sequences.

Fig. 4figure 4

Overview of the variant detection. A. The length distributions of insertions (top) and deletions (bottom). The maximum length was capped at 1600 bp. B. The flowchart shows the breakdown of the insertions. Dotted-line boxes represent the conditions of classification, and solid-line boxes are each class of insertions. C. The pie charts show the proportion of each class of insertions of NA18943 (left) and NA19240 (right). Insertions were decomposed into multiple classes: TRs (tandem repeats), TEs (transposable elements), TDs (tandem duplications), dispersed duplications, processed pseudogenes, NUMT, “deletions” in GRCh38, alternative sequences, satellites, and others. D. A processed pseudogene (RPLP0) detected in NA19240 is shown (in black). The picture was extracted from Genome Browser. The BLAT identity was 99.9% compared with the reference gene, and the alignment pattern of 5′ UTR showed a closer resemblance with the splicing variant shown at the bottom. E. An insertion detected in NA19240 matching the alternative sequence is shown (in black). The picture was extracted from Genome Browser. The breakpoint was 15 kbp upstream of HLA-DQB1. LoMA constructed part of the alternative sequence accurately (99.8% in BLAT identity)

We manually reviewed the remaining insertions and identified other types of insertions (Fig. 4B, C): three processed pseudogenes from each of NA18943 and NA19240, five and six dispersed duplications, respectively, and 61 (1.1%) and 79 (1.2%) alternative sequences, respectively. For example, a processed pseudogene found in NA19240, sized 1115 bp (the breakpoint was at chr11:60,274,156), was precisely mapped to RPLP0 on chromosome 12 with 99.9% in BLAT identity (Fig. 4D). This processed pseudogene represented a specific splice variant. Notably, we found 233 (4.2%) and 287 (4.4%) variants in NA18943 and NA19240 mapped to panTro6, although these variants did not exist in GRCh38 (“deletions” in GRCh38), suggesting that they had been caused by deletion events in the genomes composing the GRCh38 assembly [3]. We also found one NUMT insertion, sized 532 bp in NA19240.

After our manual review, 398 (7.2%) and 574 (8.8%) insertions remained to be assigned in NA18943 and NA19240, respectively (Fig. 4B, C). Of those insertions, 167 (3.0%) and 214 (3.3%) occurred in segmental duplications and self-chains in NA18943 and NA19240. The numbers of unallocated insertions were 231 (4.2%) and 360 (5.5%) (Fig. 4B).

The insertions mapped to alternative sequences tended to be large (median length = 4049 bp). In NA19240, 12 insertions (15%) mapped to the alternative sequences were not found in the standard SV set (Additional file 1: Table S9). For instance, an insertion sized 7332 bp, derived from an alternative sequence (chr6_GL000254v2_alt), was found in the LoMA SV set (the breakpoint was chr6:32,687,972). This breakpoint was located 15 kbp upstream from HLA-DQB1 in the HLA region and showed a high sequence identity (99.8%) to chr6_GL000254v2_alt (Fig. 4E). However, the standard SV set by a previous study reported a translocation, not an insertion, in this region, suggesting an error call [25]. In NA18943, we found the longest insertion, sized 14,330 bp, at chr12:127,153,629 derived from an alternative sequence (chr12_KZ559112v1_alt). This insertion also showed a high sequence similarity (99.8% in identity) to the alternative sequence (Additional file 2: Fig. S6).

Tandem repeats (TRs)

Most TR insertions were found in the annotated TRs of GRCh38. The analysis of expansion rates showed that most TRs had low expansion rates (Fig. 5A), but high expansion rates were observed mainly in short TRs (< 100 bp) (Fig. 5A). The analysis of repeat elements showed that expansions of triplet repeats were rarer than 4- and 5-bp unit repeats in NA18943 and NA19240 (Fig. 5B). The expansions of (AT)n were major in 2-bp unit repeats, (ACC)n and (AGG)n in 3-bp, (AGGG)n and (AAGG)n in 4-bp, and (AGGGG)n in 5-bp (Fig. 5C). In each sample, 42% (NA18943) and 39% (NA19240) of TRs were found in genic regions, which was consistent with the size of the genic region in the human genome (Additional file 2: Fig. S7). Only 7 and 12 genes contained exonic TR expansions in NA18943 and NA19240, respectively (Additional file 1: Table S10). However, they were associated with various human traits and disease, such as the telomere length (Additional file 1: Table S10).

Fig. 5figure 5

The variation of TR expansions. A. The expansion rate is shown up to 20. Left panel, NA18943; right panel, NA19240. Both panels are separated by the reference TR length (< 100 bp, < 500 bp, ≥ 500 bp). B. The number of variants of STRs (2–6 bp). C. The patterns of STR expansions are shown individually by unit length. Elements in both samples with a frequency of at least 10 are displayed. D. Sites of TR expansions surrounding TEs. Red, NA18943; blue, NA19240. Each class is separated into the upper stream (< 100 bp), lower stream (< 100 bp) and inside from the left. In SINE (Alu and MIR), TR expansions often occurred downstream of TE elements. SVAs contained relatively many expansions inside considering the fewer number than in the LINE family

We also analyzed the association between TRs and TEs. We found that TR expansions around TEs were more likely to occur downstream of SINEs (Alu and MIR) than upstream (binomial test: P = 2.2 × 10–23 (Alu) and P = 0.032 (MIR) in NA18943; and P = 2.9 × 10–22 (Alu) and P = 0.0054 (MIR) in NA19240) (Fig. 5D). However, we did not see the same tendency in LINEs (L1 and L2) (Fig. 5D). In SVA elements, many TR expansions were observed inside compared with other families, such as LINEs, possibly because of the unique structure of SVA elements containing a VNTR region (Fig. 5D).

Target site duplications (TSDs) in Alu elements

In NA18943 and NA19240, 1579 non-redundant Alu insertions were detected. The motif search of flanking sequences detected a strong motif at the first nicking site, TTAAAAA (E-value = 2.5 × 10–315) (Fig. 6A), as shown in previous studies [33, 34, 38]. On the other hand, no clear motif was identified around the second nicking site. We next analyzed the length distribution of TSDs flanking Alu elements, and single-peak distributions (median length of 15 bp) were obtained (Fig. 6B). The overall shape of the distributions was slightly skewed left, showing TSDs longer than 15 bp were fewer than TSDs shorter than 15 bp.

Fig. 6figure 6

Analysis of TSDs among Alu insertions. A. The motif search using 1579 Alu elements by MEME Suite [35] confirmed the motif previously studied at the first nicking site. B. The length distribution of TSDs among Alu insertions, capped at 30 bp, binned per 2 bp. The distribution had a peak around 15 bp and was symmetrical in both samples

Tandem duplications (TDs)

Lastly, we detected and analyzed TDs (n = 71 in NA18943 and n = 88 in NA19240). The length distribution showed that TDs were relatively short (the largest duplication was ~ 600 bp in length) and that short TDs were dominant (Fig. 7A). A manual review using UCSC Genome Browser [39] showed that 50 (NA18943) and 54 (NA19240) TDs were observed in TEs and significantly enriched in TEs in NA18943 and NA19240, respectively (binomial test: P = 3.0 × 10–5 in NA18943 and P = 3.5 × 10–3 in NA19240) (Fig. 7B). We analyzed the association of TDs and genes. TDs were observed in 30 (NA18943) and 34 (NA19240) genes, and together, 54 non-redundant TDs in genic regions were identified (Additional file 1: Table S11). The genes with TDs showed higher expression levels than other genes (Wilcoxon test: P = 4.4 × 10–12) (Fig. 7C), although TDs were not enriched in gene regions (Additional file 2: Fig. S8).

Fig. 7figure 7

The characteristics of short TDs. A. The length distributions of TDs in NA18943 and NA19240. The number of TDs decreased as the length of TDs became longer. B. The enrichment of TDs in TEs. TDs were enriched in transposonic regions (79% in NA18943 and 76% in NA19240 when flanking regions are included). The gray line is the expected value of TDs in TE. C. The association of gene expression levels between genes with and without TDs. Genes containing TDs were expressed more than other genes. TPM: Transcripts Per Million

留言 (0)

沒有登入
gif