Viruses, Vol. 14, Pages 2714: ViralVar: A Web Tool for Multilevel Visualization of SARS-CoV-2 Genomes

2.4. Mutational Analysis

The “Mutational Analysis” module of ViralVar provides users with a suite of tools to visualize the genomic and structural context of SARS-CoV-2 mutations. The R package ggplot2 [34] is used to generate and annotate density plots. After data input, the data are displayed in tabular format in the “Data Overview” tab. Note that this module requires collection date, PANGO lineage, and amino acid (AA) substitution information.

The “Genome Distribution” tab depicts mutation density among uploaded sequences across the SARS-CoV-2 genome. Briefly, the number of distinct mutation events at each genomic position or protein residue is determined relative to a reference sequence (NCBI: NC_045512.2) [35] and reported over a sliding 100-nucleotide window. Position counts are calculated separately for insertions, deletions, and substitutions. This method does not consider virus counts (i.e., the number of uploaded genomes with a particular mutation), so each mutation event is counted only once. This avoids potential biases in reported mutational frequency due to unequal amplification or sequencing across the genome, as well as biased sampling [16].

In the “Protein Distribution” tab, the frequencies of genomes (virus counts) with mutations at specific protein residues are visualized using the R packages ggplot2 [34] and plotly (for interactive visualization). Separate plots can be generated for all SARS-CoV-2 proteins, both structural and nonstructural. Protein domain boundaries are indicated as described in the literature [16,17]. The IEDB server (Bepipred Linear Epitope Prediction 2.0 at http://www.iedb.org/) [36] (accessed on 31 October 2021) was used to predict B-cell epitopes, which are indicated above the protein schematic.

In the “3D Protein Structure” tab, the R package r3dmol is used to visualize mutations in the context of 3D protein structures.
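The distinct-event counting described for the “Genome Distribution” tab can be illustrated with a short Python sketch. This is not ViralVar's R implementation: the function name `windowed_event_counts`, the tuple layout of the input, and the `step` parameter are hypothetical, and the window step is an assumption because the text specifies only the 100-nucleotide window size. The genome length of 29,903 nt is that of the reference NC_045512.2.

```python
from collections import Counter

REFERENCE_LENGTH = 29903  # length of NC_045512.2 in nucleotides


def windowed_event_counts(mutations, window=100, step=100,
                          genome_length=REFERENCE_LENGTH):
    """Count distinct mutation events per window, separately for each type.

    mutations: iterable of (genome_id, kind, position, change) tuples, where
    kind is e.g. "substitution", "insertion", or "deletion" and position is
    1-based. This input layout is an assumption made for illustration.
    Returns {kind: [(window_start, distinct_event_count), ...]}.
    """
    # Drop the genome identifier: an event seen in many genomes counts once.
    distinct = {(kind, pos, change) for _genome, kind, pos, change in mutations}

    # Number of distinct events at each position, per mutation type.
    per_position = {}
    for kind, pos, _change in distinct:
        per_position.setdefault(kind, Counter())[pos] += 1

    result = {}
    for kind, counts in per_position.items():
        result[kind] = [
            (start,
             sum(counts[p]
                 for p in range(start, min(start + window, genome_length + 1))))
            for start in range(1, genome_length + 1, step)
        ]
    return result
```

With `step=100` the windows do not overlap; a smaller `step` gives overlapping windows. Because genome identifiers are discarded before counting, a substitution carried by thousands of genomes contributes the same single count as one seen in a single genome, which is the property the text relies on to limit sampling bias.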
The 3D coordinates were obtained from the Protein Data Bank (PDB), with PDB accession numbers provided for each structure [37]. For proteins with no available 3D structure, models predicted by AlphaFold were used when available [38]. Alternatively, the positions of transmembrane helices for proteins with no available 3D structures were identified using the TMHMM 2.0 algorithm [39]. Lists of the top mutations along with their frequencies for each protein can be downloaded as tab-delimited tables. The 3D protein illustrations can be downloaded as portable network graphics (PNG) files. Each of the above tabs includes a date slider to allow users to restrict data to a specific date range and a “Select VOC/VOI” option to limit output to a specified VOC or VOI.

The above mutational analysis tabs are further complemented by two tabs for statistical analysis and k-means clustering. In the “Statistical Analysis” tab, ViralVar uses the binomial test to identify individual proteins within the uploaded dataset that have significantly different mutation frequencies. The method has been previously applied to identify significantly under- and over-mutated SARS-CoV-2 proteins [16,17]. Briefly, the arguments for the binomial test are the observed number of distinct protein mutations in a certain protein (the “number of successes”), the total number of distinct protein mutations in all SARS-CoV-2 proteins (the “number of trials”), and the length of a given protein divided by the length of all SARS-CoV-2 proteins (the “expected probability of success”). An example of binomial calculations is provided below. For more details, please refer to [16].

To simplify the calculations in this method, we assume that each protein mutation is an independent event and that all SARS-CoV-2 proteins and all residues have the same probability of being mutated. Therefore, this method applies the binomial test to assess the null hypothesis that protein mutations are distributed randomly across all SARS-CoV-2 proteins:

$$P(M_P, M_T) = \binom{M_T}{M_P}\, P(p)^{M_P}\, \bigl(1 - P(p)\bigr)^{M_T - M_P}$$

$M_T$ = the total number of protein mutations observed for all proteins (for example, 325 mutations in user input data)

$M_P$ = the number of protein mutations in the target protein (for example, 66 mutations in Spike in user input data)

$P(p)$ = length of the protein / length of the proteome (e.g., length of Spike / total length = 1273/9930 ≈ 0.13)

$$P(M_P, M_T) = \binom{325}{66}\, 0.13^{66}\, (1 - 0.13)^{325 - 66} = 0.00046$$

Under the null hypothesis, we would expect only about 42 of the 325 mutations identified in SARS-CoV-2 proteins to fall in Spike (325 × 1273/9930 ≈ 42), given that the Spike protein is 1273 amino acids long and the entire SARS-CoV-2 proteome is 9930 amino acids long; 66 were observed. The binomial test p-value (0.00046) therefore supports rejecting the null hypothesis and indicates a significantly higher number of mutations in the Spike protein compared to the background (entire proteome). ViralVar conducts the above calculation for user input data; therefore, MP, MT, and P(p) will be different for each input dataset. An option to exclude clade signature mutations is provided to avoid bias in the binomial test across highly divergent clades. ViralVar also provides control options to customize binomial test parameters, including the option to adjust the p-values for multiple comparisons. As above, the tab includes a date slider to allow users to restrict data to a specific date range and a “Select VOC/VOI” option to limit output to a specified VOC or VOI. A results table of the analysis can be downloaded as a TSV file.
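The Spike example can be reproduced in outline with a few lines of standard-library Python. This is a minimal sketch, not ViralVar's code: the function names are hypothetical, and the two-sided form of the exact test is an assumption, since the text does not state which alternative hypothesis is used. For that reason, and because P(p) is left unrounded here, the p-value it prints need not match the figure quoted above.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_test_two_sided(k, n, p):
    """Exact two-sided p-value (assumed form of the test).

    Sums the probabilities of all outcomes that are no more likely than the
    observed count k under the null hypothesis.
    """
    observed = binom_pmf(k, n, p)
    tolerance = 1 + 1e-7  # guards against floating-point ties
    total = sum(q for q in (binom_pmf(i, n, p) for i in range(n + 1))
                if q <= observed * tolerance)
    return min(1.0, total)


# Values from the worked example in the text.
M_T = 325             # distinct protein mutations across all proteins
M_P = 66              # distinct protein mutations in Spike
P_p = 1273 / 9930     # Spike length / proteome length

expected = M_T * P_p  # mutations expected in Spike under the null hypothesis
p_value = binomial_test_two_sided(M_P, M_T, P_p)
print(f"expected in Spike: {expected:.1f}, observed: {M_P}, p-value: {p_value:.2e}")
```

The same function can be applied to each protein in turn by swapping in that protein's length and mutation count; any multiple-comparison adjustment would then be applied to the resulting set of p-values.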

In the “Genome Clustering” tab, ViralVar employs k-means clustering to facilitate rapid investigation of emerging clusters of genomes with specific protein mutations. As the selection of mutations in SARS-CoV-2 evolution has been shown to be largely impacted by positive selection, driven by changes in SARS-CoV-2 protein structures and functions [16,17], targeting protein mutations could cluster genomes relative to the phenotype. For instance, a common feature of SARS-CoV-2 genomes with the N501Y spike mutation (e.g., Alpha, Beta and Gamma strains) was enhanced infectivity and transmissibility over the previous variants [14].

The clustering of genomes based on pairwise distance-based methods is computationally intensive and might take days to run, depending on the computational resources. The runtime for the first step of these approaches (the calculation of distance matrices for all pairs of genomes) increases exponentially with the number of genomes (Figure S3). In contrast, k-means clustering of SARS-CoV-2 genomes has been proposed in the recent literature as a rapid method to investigate emerging variants and tackle the computational challenges in large-data analysis [40,41]. Because it is simple and computationally inexpensive, k-means clustering of genomes based on mutations in specific proteins can be quickly and repeatedly run on large-scale genomic datasets (such as ~11.1 M SARS-CoV-2 genomes).

ViralVar uses k-means to group genomes based on protein mutations. To avoid the effects of spurious mutations (e.g., due to sequencing or assembly errors), the clustering of the genomes is calculated only from protein mutations with a minimum mutation frequency (MMF) of >0.005 by default, although this cutoff is user-adjustable. To determine the optimal number of clusters, ViralVar repeats k-means clustering over a range of cluster numbers (determined based on the number of variables in the input file) and calculates the average silhouette width (ASW) index using the R package NbClust [42]. In the calculation of the ASW, ViralVar uses unique genomes (duplicated genomes with identical mutational patterns are removed) to make calculations less computationally expensive. However, the final clustering is applied to all of the genomes in the input data to produce counts of the genomes in each cluster. As with the previously described functions, VOC/VOI and date range are selectable. The protein selection option allows for targeting mutations along a protein of interest. Tables and customizable figures in PDF format are downloadable.
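The clustering workflow described above (MMF filtering, de-duplication, selection of the number of clusters by ASW, and final assignment of all genomes) can be sketched in plain Python. This is an illustration under stated assumptions rather than ViralVar's implementation, which uses R and NbClust: all function names, the `max_k` cap, the random initialization, and the nearest-centroid assignment of the full dataset are hypothetical choices, and the sketch needs at least three unique mutation profiles.

```python
import random
from collections import Counter


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm with random initial centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


def average_silhouette_width(points, labels):
    clusters = {}
    for i, l in enumerate(labels):
        clusters.setdefault(l, []).append(i)
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # a singleton's silhouette is 0 by convention
        a = sum(sq_dist(p, points[j]) ** 0.5 for j in own if j != i) / (len(own) - 1)
        b = min(sum(sq_dist(p, points[j]) ** 0.5 for j in other) / len(other)
                for l, other in clusters.items() if l != labels[i])
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def cluster_genomes(genome_mutations, mmf=0.005, max_k=6, seed=0):
    """genome_mutations: {genome_id: set of AA substitutions} (assumed layout)."""
    n = len(genome_mutations)
    freq = Counter(m for muts in genome_mutations.values() for m in muts)
    kept = sorted(m for m, c in freq.items() if c / n > mmf)  # MMF filter
    profile = {g: tuple(int(m in muts) for m in kept)
               for g, muts in genome_mutations.items()}
    unique = sorted(set(profile.values()))  # de-duplicated profiles for the ASW
    best = None
    for k in range(2, min(max_k, len(unique) - 1) + 1):
        centroids, labels = kmeans(unique, k, seed=seed)
        asw = average_silhouette_width(unique, labels)
        if best is None or asw > best[0]:
            best = (asw, k, centroids)
    asw, k, centroids = best
    # Final step: every genome, duplicates included, goes to its nearest centroid.
    assignment = {g: min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for g, v in profile.items()}
    return k, asw, assignment
```

Calling `Counter(assignment.values())` on the returned assignment gives the genome counts per cluster. Selecting k on unique profiles keeps the silhouette calculation, which is quadratic in the number of points, small even when the input contains many identical genomes.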
