We present a computational approach to resolve intra-repeat variation in VNTRs from short-read sequencing data, and, in particular, in the LPA KIV-2 VNTR, which is a particularly difficult target due to the presence of highly homologous domains flanking the VNTR region. We leveraged our previous LPA work [2, 12, 14, 15, 22, 23] to provide a general, easy-to-use computational workflow to call base variation in VNTR repeats and to specifically resolve variation within the complex LPA KIV-2 VNTR. Our workflow isolates the reads of the targeted VNTR and we applied it without adaptations to 5 other protein-coding VNTRs (NEB, DMBT1, FLG, SPDYE3, and UBC) in 2504 samples from the 1000 Genome Project (1kGP). Our experiments in the LPA gene revealed that differences in the read extraction strategies can strongly affect the variant calling performance, with more stringent coordinates leading to the most accurate results. Moreover, we exemplify the use of SNPs in the flanking non-repetitive region to identify KIV-2 VNTR subtypes on an individual basis and dynamically adapt the VNTR reads identification.
Computational workflow for variant calling in VNTRsThe input of our workflow (Fig. 3) are files where all WES/WGS reads are pre-aligned to the human reference genome in BAM format. All reads mapped to a user-definable VNTR region-of-interest (VNTR ROI) are extracted and remapped to a reference sequence consisting of one single repeat unit. Variants in single VNTR units are then present only in a subset of reads, resembling somatic mutations, and are thus called using our mutserve variant caller [2, 24] with optimized settings for low-level variant detection (Methods). For a simplified and scalable application to user-defined VNTRs, the pipeline has been implemented using the Nextflow workflow manager [25].
Fig. 3Workflow of the variant calling approach in VNTR regions from short-read sequencing data: the input file is provided in BAM format, with reads aligned to the reference genome (step 1); the region of interest (ROI) defined by the user (provided as, e.g., BED file) is selected (step 2); the VNTR reads are extracted from the provided ROI (step 3); the extracted reads are realigned to one reference repeat (step 4) for variant calling with mutserve (step 5), resulting in a VCF output file that can subsequently be annotated (see GitHub repository in the Availability of data and materials section)
Construction and benchmarking of the variant calling workflow using the LPA KIV-2 VNTRWe generated WES data for two datasets (dataset A, dataset B), with known KIV-2 variant patterns from our previously established KIV-2 variation dataset [2] (referred to as gold-standard. See Methods for a brief description of the gold-standard generation). We performed two independent sequencing experiments (experiment-1 and experiment-2) using different exon enrichment chemistries (AgilentSure Select v6 kit and v8 kit). Dataset A (n = 8) was sequenced in both experiments and was used for the pipeline setup. Dataset B (n = 16) was sequenced only in experiment-2 and was used for benchmarking purposes (Fig. 4). The KIV-2 target region in our experiments spans 100 bp upstream and downstream of each of the two KIV-2 exons. In our gold-standard dataset, the number of coding KIV-2 variants per sample ranged from 4 to 24 in dataset A and from 2 to 17 in dataset B (Additional file 1: Table S1). About 80% of Europeans contain KIV-2B units, which can be recognized based on the distinctive haplotype at the canonical positions 14, 41, and 86 of KIV-2 exon 1 (corresponding to positions 594, 621, and 666 in our reference sequence). Two samples of dataset A and ten samples of dataset B were classified as non-KIV-2B samples. Datasets A and B served as data for setup and benchmarking of our workflow. In particular, the enrichment for non-KIV-2B samples in dataset B allowed the investigation of strategies to account for the presence/absence of KIV-2B subtypes in the variant calling workflow for the KIV-2 VNTR.
Fig. 4From the previously generated KIV-2 variation dataset with known KIV-2 variants [2] (I), we selected 8 samples (dataset A) for the setup of our KIV-2 variant calling pipeline in exome data and 16 samples (dataset B) for benchmarking (II). Subsequently, we generated WES data (III) in two sequencing experiments (experiment-1 and experiment-2), using two different exon enrichment kits, namely Agilent SureSelect v6 kit (experiment-1) and v8 kit (experiment-2). Dataset A was sequenced in both experiments, and the obtained WES data are called dataset A experiment-1 and dataset A experiment-2. Samples from dataset B were processed only in experiment-2 (v8 kit)
First, we tested a naive KIV-2 variant calling approach by aligning all available WES reads to a single KIV-2 repeat. However, due to the extensive homologies of the KIV-2, also many other reads from other kringles and other homologous regions (e.g., PLG, LPAL2) aligned unspecifically and generated spurious variant calls originating from PSVs [8]. In experiment-1 of dataset A, over 200 false positive variants per sample were called (Additional file 1: Table S2). This suggested that it would be necessary to restrict read extraction to the KIV-2 VNTR. However, the KIV-2 VNTR resembles a degenerated VNTR array, with a rather conserved core VNTR region (the KIV-2 domains), flanked by highly similar domains (KIV-1 and KIV-3). Additional complexity is added by the facts, that the exons of the KIV-2B units (which are located within the KIV-2 VNTR array) are identical to the KIV-3 (which flanks the VNTR array) and that KIV-2B units are not present in all human individuals. This has led to varying definitions of the VNTR region across literature [2, 3, 8, 26]. We therefore systematically assessed the KIV-2 variant calling performance of 9 different regions-of-interest to extract KIV-2 reads (ROI-1 to ROI-9) using both experiments of dataset A (Fig. 5). The ROIs where selected either from our own previous research [2], from literature [3, 8], or represented various curated combinations thereof (Additional file 1: Table S3). The combinations where designed manually taking into consideration the various homologies described in the introduction and resulting in possible read mismappings (see below).
Fig. 5Region of Interest (ROI) for KIV-2 read extraction from aligned WES data. The LPA gene is represented from KIV-1 to KIV-3 in transcription direction and not at scale for clarity. The KIV-2 VNTR is represented with 6 repeats as in the reference genomes hg19 and hg38, with the third repeat in the transcription direction being a KIV-2B unit. The other KIV-2 units are KIV-2A (in each unit, the first exon is represented in yellow, the second exon in blue). The KIV-2B unit in the VNTR is highly homologous to the KIV-3, which flanks the KIV-2 VNTR, showing the same first exon (purple) and 96% homology in the second exon. Additionally, all second KIV-2 exons are identical to the second KIV-1 exon (blue, see also the supplementary Fig. 12 of ref [2] for a detailed per-base identity matrix). In the first 200 intronic bp flanking each exon, 70–100% per-base identity is observed between KIV-1, KIV-2, KIV-2B, and KIV-3. Thus, minor differences in the ROI used for read extraction can extract highly homologous reads that create spurious variants if aligned to only one repeat as done in this variant calling pipeline. ROI-1 extracts reads from the complete LPA region (from KIV-1 to the protease domain, represented by an arrow). ROI-2 to ROI-9 progressively restricts the extraction region to the KIV-2 VNTR only. Since the KIV-2B appears fixed in the human genome reference, ROI-2 (previously published in [3]) and ROI-5 do not extract reads mapped to the KIV-2.2 exon 2 and KIV-2.3 subtype B, while ROI-6 and ROI-9 do not extract reads mapped to the KIV-2.3 subtype B. In combination with the presence or absence of KIV-2B units in the sequenced sample, this creates different spurious variant calls in a sample-specific manner. The size of the regions enclosed by each ROI is provided as bp between the start and end coordinates on hg38. Precise coordinates are provided in Additional file 1: Table S3
Variations in the extraction region impacted the F1-score (F1), which describes the harmonic mean of precision (true positives/(true positives + false positives)) and sensitivity (true positives/(true positives + false negatives)) [27]. Thus, F1 reflects both false negative and the false positive calls and performs well in unbalanced datasets. A higher F1 score indicates better performance. Overall, the F1 ranged from 5 to 100% for experiment-1 and from 4.5 to 100% for experiment-2 (Additional file 1: Table S4). Strategy ROI-1, where reads were naively extracted from the complete LPA region, led to a very low mean F1 of ≈15% in both sequencing experiments, mainly driven by the high rate of false positive variants. The KIV-1 and KIV-3 domains share the highest homology with KIV-2 [2] and are therefore more likely to sequester KIV-2 reads in the initial alignment step. Thus, some of our tested strategies (ROI-2 to ROI-7) include various combinations of KIV-1 and KIV-3 regions. The strategies ROI-2 and ROI-3 were previously used by Ebbert et al. [3] and by Mukamel et al. [8]. Compared to ROI-1, using ROI-2 to ROI-7 already considerably improved the variant calling by excluding all other kringles and the mean F1 in dataset A ranged from 31.0 to 70.1% (Table 1 and Additional file 1: Table S4). When extracting only reads strictly mapped to the KIV-2 VNTR (ROI-8), without flanking regions, the mean F1 further improved to 81.9% for experiment-1 and to 84.4% for experiment-2. This suggested that the region of extraction has a major impact on variant calling and that more stringent coordinates, limited to the VNTRs start and end coordinates, reduce the false positive calls and improve the variant calling performance. Nevertheless, the results of ROI-8 still showed a large variance at individual level in both experiments, with standard deviation (SD) of 21.5–24%. Therefore, we further investigated potential stratifications within our datasets.
Table 1 F1-score (%) in dataset A experiment-1 and experiment-2 for each tested ROI strategy expressed as mean ± standard deviation (SD) and as median (minimum–maximum). ROI-8 shows the highest F1 mean in both experiments. The high standard deviation is explained by the high rate of false positive variants in the non-KIV-2B samples, which range from mean = 0 with ROI-9 (both experiments 1 and 2) to mean = 152 and mean = 164 with ROI-1 (experiments 1 and 2, respectively). See Additional file 1: Table S4 for performance metrics at the individual levelAccounting for repeat subtypes improved variant calling in the LPA KIV-2 VNTRIn the reference assemblies hg19 and hg38, the KIV-2 VNTR region is represented with 6 KIV-2 repeats, which are 5 KIV-2A units but only one KIV-2B. Thus, the KIV-2B repeat subtype erroneously appears fixed and in a single copy, while at the population level the intra-individual frequency of KIV-2B units changes widely, and even many “non-KIV-2B individuals” exist, which carry zero KIV-2B repeat units [2, 9, 26]. We noticed that ROI-8 performed best only in KIV-2B individuals with a F1 between 94.2 and 95.0% for both experiments. In non-KIV-2B individuals, the mean F1 dropped to 45.0% for experiment-1 and to 52.3% for experiment-2 and resulted in a high number of false positive variants (Table 2). Compared to ROI-8, ROI-9 does not extract reads mapped to the first exon of the KIV-2B repeat. Since no true KIV-2B reads exist in non-KIV-2B individuals, the reads that are aligned in these individuals to the KIV-2B repeat of the reference genome must originate from highly homologous domains, most likely from KIV-3, which is nearly identical to the KIV-2B. Thus, in non-KIV-2B individuals, ROI-9 skips the KIV-2B domain in the reference during read extraction and so reduces false positive calls. ROI-9 improved the F1 in non-KIV-2B individuals to 95.5% and to 100% in the two sequencing experiments in dataset A (Discussion). We thus investigated strategies to minimize the false positive rate by accounting for the presence/absence of KIV-2B subtypes in a dynamic way within the Nextflow analysis pipeline.
Table 2 Comparison between the F1-score in each sample of dataset A in experiment-1 and experiment-2 using ROI-8 and ROI-9. ROI-8 performs best in KIV-2B samples and ROI-9 performs best in non-KIV-2B samples. The sign “- “ is used for “no calculation possible”Our experiments showed that ROI-8 and ROI-9 are the optimal strategies for variant calling of KIV-2B and non-KIV-2B samples, respectively. Therefore, we searched for a unique sequence feature to discriminate KIV-2B and non-KIV-2B individuals agnostically before read extraction and assign the best ROI dynamically. Although the first exon of KIV-3 is reported in the literature as identical to the first exon of KIV-2B, we identified a SNP at position 86 in KIV-3 exon 1 (referred to as “signature position”) that separates KIV-2B from non-KIV-2B individuals (Additional file 2). To this end, we sequenced the KIV-3 region in 66 samples with known KIV-2 variant patterns [2] using a Sanger sequencing protocol designed to differentiate KIV-3 from KIV-2 and KIV-2B, as well as other KIV domains (50 KIV-2B samples and 16 non-KIV-2B) (Additional file 2; for considerations regarding the specificity of the KIV-3 PCR and Sanger sequencing Additional file 3). All KIV-2B samples carried at least one T-allele (which is also the reference base in the genome) at the signature position, while almost 70% of the non-KIV-2B individuals (11 of 16 samples) carried a C/C homozygous genotype at this position (Additional file 1: Table S5). This indicates a correlation between the presence of KIV-2B repeat units and a T-allele at position 86 of KIV-3 exon 1 (T-signature), respectively the absence of the KIV-2B units and a C-allele at this position (Fig. 6A). Accordingly, we observed a strong LD between the bases at KIV-2 position 594 (used as proxy for KIV-2B absence/presence) and position 86 in KIV-3 exon 1 (Lewontin’s D′ = 0.9995; indicating absence of recombination between the loci). A detailed explanation about LD investigation for KIV-2 SNPs is provided as Supplementary Note in Additional file 4.
Fig. 6A Linkage disequilibrium between the KIV-2B haplotype and the genotype at position 86 of KIV-3 exon 1. For each KIV domain represented, exon 1 and exon 2 are shown. The bases shown above each exon 1 corresponds to positions 14, 41, and 86 (KIV-2B canonical positions). We identified a C at position 86 in KIV-3 exon 1 (CC genotype) in non-KIV-2B individuals. KIV-2B alleles present a T at position 86 in KIV-3 (corresponds to the reference sequence) while non-KIV-2B alleles would carry the C, resulting in a CC genotype in 70% of the non-KIV-2B individuals. B Workflow of the signature-based approach for KIV-2 variant calling in WES data: starting from the input aligned BAM file (step 1), the entire LPA region is screened in FASTQ format for the KIV-2B signature sequence and (step 2) each sample is redirected to the optimal ROI strategy for (step 3) KIV-2 read extraction. (step 4) The extracted reads are realigned to one reference KIV-2 repeat for (step 5) variant calling with mutserve, resulting in a VCF output file, which can subsequently be annotated (see Github repository in the Availability of data and materials section)
We therefore defined a KIV-2B signature sequence (CCACTGTCACTGGAA) that encompassed the T-allele at position 86 of KIV-3 exon 1 (Additional files 2–4 and sequence alignments in Additional file 5). We hypothesized that the presence or absence of this signature sequence in the sequencing reads allows discriminating between sample subtypes (KIV-2B and non-KIV-2B) and assigning each sample to the best ROI for read extraction. We used the available WES data from the non-KIV-2B samples (n = 12) to calculate an empirical threshold of signature-sequence occurrences to distinguish between the two sample subtypes. First, we counted the number of signature sequence occurrences within each FASTQ input file. Subsequently, the mean value plus 2 standard deviations of all occurrences across all non-KIV-2B samples was calculated and used as a threshold (Additional file 1: Table S6) to dynamically assign each sample to either ROI-8 (signature coverage above the threshold) and ROI-9 (signature coverage below the threshold).
We integrated our findings in an LPA-specific variant calling pipeline (here referred to as signature-based approach) that improves variant calling in the KIV-2 VNTR at scale by integrating a screening step for the KIV-2B signature sequence (Fig. 6B). The dynamic ROI assignment minimized the number of erroneously called mutations especially in non-KIV-2B samples, where a string of KIV-2B specific positions originating from KIV-3 would otherwise be detected as false positive variants (Additional file 1: Table S7). The mean F1 of the dynamic ROI assignment was ≈95% in both experiments. Dataset A includes two non-KIV-2B samples and the mean F1 obtained with the dynamic ROI assignment increased by 16.8% for experiment-1 (from 81.9 to 95.65%) and by 12.7% for experiment-2 (from 84.4 to 95.14%), also showing less variability among samples (Additional file 1: Table S8). Importantly, the signature sequence-based approach did not call false positive variants in the non-KIV-2B samples, while up to 12 false positive variants were called with the static ROI-8 approach.
For benchmarking the signature-based approach, we applied it to a second and independent dataset B (n = 16), including 10 additional non-KIV-2B samples and assessed its performance. The signature-based approach correctly classified all 6 KIV-2B samples and 9 of 10 non-KIV-2B samples. The single misclassification was one non-KIV-2B individual that presented a T/T genotype at the signature position. This was the only sample in our Sanger-sequenced dataset (n = 66) that presented a T/T genotype at KIV-3 exon 1 position 86 despite the absence of KIV-2B repeats. As a result, this sample was processed with ROI-8 despite being a non-KIV-2B sample. For all other samples, the signature-based approach did not detect any false positive variants in the samples.
For dataset B, when using the dynamic signature-based approach the F1 increased from 63.2 to 90.8% (+ 43.6%; compared to ROI-8 only) and showed less variability among samples (Fig. 7). After excluding the misclassified sample, the F1 reached 94.7% (Additional file 1: Table S9).
Fig. 7Benchmarking the signature-based approach in dataset B. The signature-based approach improved the F1-score in 9 out of 10 non-KIV-2B samples and showed the same accuracy for sample number 13. All KIV-2B samples (samples 9, 10, 12, 14, 17, and 21) resulted in the same F1-score using the signature-based approach
Variant calling in VNTRs in the 1000 Genomes ProjectThe ROIs experiments for LPA KIV-2 VNTR showed that, when the information about the repeats subtypes is not considered, not available or not relevant, bona fide VNTR start and end coordinates (ROI-8) led to the highest mean F1 score. In 2021, Mukamel et al. [8] identified 118 protein-coding VNTRs in medically relevant genes of the human genome and provided bona fide VNTR start and end coordinates identified with various VNTR finder tools [8]. Thus, these coordinates can be used as ROI for read extraction to call variants in other VNTRs. Therefore, we used our pipeline (Fig. 3, i.e., without the signature-based step) to analyze and annotate the protein-coding VNTRs in five other medically relevant genes (NEB, SPDYE3, FLG, UBC, DMBT1) in 2504 unrelated samples of the 1kGP [16]. A single VNTR repeat unit as defined in [8] was used as a reference. Since the LPA gene was also included in the list from Mukamel and colleagues, we first analyzed the LPA KIV-2 VNTR in 1000G data using as ROI the start and end coordinates generated by Mukamel et al. [8].
The final call set for each analyzed VNTR using the 1kG samples is provided at the pipeline repository (https://github.com/genepi/vntr-calling-nf) and represents the fully resolved call set of variants within the six analyzed VNTRs. In all analyzed VNTRs, we identified and annotated potentially functional variants, which can serve as a starting point for gene-specific validations (Table 3). Potential missense mutations have been identified in all genes (ranging from 7 to 321), nonsense mutations in 5 genes (LPA, DMBT1, NEB, FLG, SPDYE3, UBC), splice-site mutations and slicing region mutations in 3 (LPA, DMBT1, NEB).
Table 3 Analysis of the VNTRs in 6 genes in 2504 WGS samples from 1kGP. The regions of interest (ROI) for VNTR read extraction were defined as in [8]. VNTR variants detected at least once in the 1kGP samples are divided by mutation type. The percentage of paralogous sequence variants (PSVs), i.e., sequence differences among the repeats, was calculated as described in the Methods section based on the reference sequence of the human genome and excluding degenerated flanking repeats for LPA, DMBT1, and UBC. For each mutation category, we report the number of unique identified variants, excluding PSV positions. As the VNTRs in FLG, SPDYE3, and UBC are intra-exonic, no splice-site mutations nor splicing region mutations were involvedFrom our observations based on the presence or absence of the KIV-2B subtype in the LPA KIV-2 VNTR, we assumed that the position of the ROI and the existence of different repeat subtypes within VNTRs may generally increase the risk of false positive calls caused by PSVs. To exclude them from downstream analyses, we thus compiled a list of existing PSVs between the repeats for each investigated VNTR (https://github.com/genepi/vntr-calling-nf) and estimated the expected conserved bases for each VNTR (Table 3). In genes like LPA, DMBT1, and UBC, the high homology between the VNTR and the flanking regions (“degenerated repeats”) complicates the identification of the bona fide VNTRs start and end coordinates and considerably increases the percentage of PSVs detected. In these genes, the exclusion of such degenerated repeats minimized the PSVs expected as false positive (Table 3). Overall, the repeat length ranged from 228 bp to 10,553 bp with a PSV count of 0.02% (DMBT1) to 26.69% (FLG).
Eight known KIV-2 variants were used to benchmark our results in LPA (Additional file 1: Table S10). Using the ROI defined in [8] as a single region of extraction, independent of the KIV-2B presence or absence, we successfully detected biologically functional variants like the Lp(a)-lowering splice-site mutation 4925G > A [12] and 4733G > A [15] (carrier frequency of 9.7% and 12%, respectively, consistently with previous findings [2]). We also successfully called three other functional variants recently identified [8] and the three synonymous variants at positions 14, 41, and 86 in KIV-2 exon 1 which define the KIV-2 subtypes (named KIV-2A, KIV-2B, and KIV-2C [11]). However, the three KIV-2B canonical variants showed a much higher carrier frequency (~ 100%, ~ 100%, and 77.6%, respectively) than reported [2], suggesting an increased rate of false positive calls at these positions. In line with this assumption, the signature-based approach with dynamic assignment of the experimentally defined ROI-8 and ROI-9 lowered the KIV-2B canonical variants carried frequencies considerably to ≈54% (Additional file 1: Table S10) and reduced the noise (Additional file 6: Fig. S1). These frequencies are much more consistent with findings in the literature [2], further supporting an improved discrimination of false positive calls at these positions.
As a technical note, we were able to run the pipeline on the 1000 Genomes Project data in 33 min (total CPU hours: 7.9) assigning a memory requirement of 1 GB to each process (Additional file 7).
LPA KIV-2 variant detection in UK BiobankWe analyzed the KIV-2 VNTR in 199,119 samples from the UK Biobank whole-exome sequencing (UKB 200K WES release) by applying our defined signature-based workflow (Additional file 1: Table S11). The KIV-2B signature sequence was detected in 75% of the samples (n = 149,184), with individuals of Black ancestry showing the lowest proportion of KIV-2B individuals, which is consistent with previous findings [2, 26, 28] (Additional file 1: Table S12).
When excluding the PSVs [8] that defines the KIV-2B subtype in exonic and intronic sequences [2], our approach detected a total number of 707 unique mutations across all samples, including 256 missense mutations, 37 nonsense, and 8 splice-site mutations. Ninety-five intronic mutations were located within 25 bp around the exons (referred to as “splicing region”) and could affect intronic core splicing elements [15] (Table 4 and Additional file 6: Fig. S2).
Table 4 Unique LPA KIV-2 mutations within the target regions (KIV-2 exons 1 and 2 ± 100 bp) in each population. KIV-2 mutations are divided by mutation type for UK Biobank individuals (n = 199,119). Since the signature-based approach was developed and benchmarked in individuals of European ancestry, summary data are here filtered for White (n = 186,607), South Asian (n = 3465), and Black (n = 3200) ancestry. The PSVs defining the KIV-2B were not considered as mutations. The location “splicing region” was defined as within the 25 bp upstream and downstream KIV-2 exonsOur signature-based approach reported similar variant levels for all three KIV-2B canonical variants in the UKB cohort (Additional file
留言 (0)