$$P(M_P, M_T) = \binom{M_T}{M_P}\, P(p)^{M_P}\, \bigl(1 - P(p)\bigr)^{M_T - M_P}$$
MT ($M_T$) = the total number of protein mutations observed for all proteins (for example, 325 mutations in user input data)
MP ($M_P$) = the number of protein mutations in the target protein (for example, 66 mutations in Spike in user input data)
P(p) = length of the protein / length of the proteome (e.g., length of Spike / total length = 1273/9930 ≈ 0.13)
$$P(M_P, M_T) = \binom{325}{66}\, 0.13^{66}\, (1 - 0.13)^{325 - 66} = 0.00046$$
Under the null hypothesis that mutations fall on each protein in proportion to its length, we expect only about 42 of the 325 mutations identified in SARS-CoV-2 proteins (325 × 0.13) to be located in Spike, given that Spike is 1273 amino acids long and the entire SARS-CoV-2 proteome is 9930 amino acids long. In the example data, 66 mutations are observed in Spike. The binomial test p-value (0.00046) therefore suggests rejection of the null hypothesis and indicates a significantly higher number of mutations in the Spike protein compared to the background (entire proteome). ViralVar conducts the above calculation for user input data; therefore, MP, MT, and P(p) will be different for each input dataset. An option to exclude clade signature mutations is provided to avoid bias in the binomial test across highly divergent clades. ViralVar also provides control options to customize binomial test parameters, including the option to adjust the p-values for multiple comparisons. As above, the tab includes a date slider to allow users to restrict data to a specific date range and a “Select VOC/VOI” option to limit output to a specified VOC or VOI. A results table of the analysis can be downloaded as a TSV file.
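The worked example above can be reproduced in a few lines. The following is a minimal Python sketch, not ViralVar's own code: the function names are hypothetical, and because the text does not state whether the reported p-value is the point probability given by the formula or a one-sided tail sum, the sketch computes both.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def upper_tail_p_value(k: int, n: int, p: float) -> float:
    """One-sided p-value: probability of observing k or more successes."""
    return sum(binomial_pmf(i, n, p) for i in range(k, n + 1))


# Values taken from the worked example in the text.
total_mutations = 325  # MT
spike_mutations = 66   # MP
p_spike = 0.13         # P(p), rounded from 1273/9930

expected_in_spike = total_mutations * p_spike
point_probability = binomial_pmf(spike_mutations, total_mutations, p_spike)
tail_probability = upper_tail_p_value(spike_mutations, total_mutations, p_spike)

print(f"expected Spike mutations under the null: {expected_in_spike:.1f}")
print(f"P(exactly {spike_mutations} in Spike): {point_probability:.3g}")
print(f"P({spike_mutations} or more in Spike): {tail_probability:.3g}")
```

Using the unrounded ratio 1273/9930 instead of 0.13 changes the result slightly, so small differences from the figure quoted in the text are to be expected.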
In the “Genome Clustering” tab, ViralVar employs k-means clustering to facilitate rapid investigation of emerging clusters of genomes with specific protein mutations. As the selection of mutations in SARS-CoV-2 evolution has been shown to be largely impacted by positive selection, driven by changes in SARS-CoV-2 protein structures and functions [16,17], targeting protein mutations could cluster genomes relative to the phenotype. For instance, a common feature of SARS-CoV-2 genomes with the N501Y spike mutation (e.g., Alpha, Beta and Gamma strains) was enhanced infectivity and transmissibility over the previous variants [14].
The clustering of genomes based on pairwise distance-based methods is computationally intensive and might take days to run, depending on the computational resources. The runtime for the first step of these approaches (the calculation of distance matrices for all pairs of genomes) increases exponentially with the increase in the number of genomes (Figure S3). In contrast, k-means clustering of SARS-CoV-2 genomes has been proposed in the recent literature as a rapid method to investigate emerging variants and tackle the computational challenges in large-data analysis [40,41]. Because it is simple and computationally inexpensive, k-means clustering of genomes based on mutations in specific proteins can be run quickly and repeatedly on large-scale genomic datasets (such as ~11.1 M SARS-CoV-2 genomes).
ViralVar uses k-means to group genomes based on protein mutations. To avoid the effects of spurious mutations (e.g., due to sequencing or assembly errors), the clustering of the genomes is calculated only from protein mutations with a minimum mutation frequency (MMF) of >0.005 by default; this cutoff is user-adjustable.
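The MMF filter described above can be illustrated with a short Python sketch. The data layout (a dictionary mapping genome IDs to sets of mutation names), the function names, and the toy genomes are all assumptions made for illustration; ViralVar's actual input format and implementation are not shown in the text.

```python
from collections import Counter


def filter_by_mmf(genomes, mmf=0.005):
    """Return the sorted mutations whose frequency across genomes exceeds mmf."""
    counts = Counter(m for muts in genomes.values() for m in muts)
    n_genomes = len(genomes)
    return sorted(m for m, c in counts.items() if c / n_genomes > mmf)


def encode(genomes, kept):
    """Encode each genome as a binary presence/absence vector over kept mutations."""
    return {
        gid: tuple(1 if m in muts else 0 for m in kept)
        for gid, muts in genomes.items()
    }


# Toy data (hypothetical): 400 genomes, two common mutations and one singleton.
genomes = {}
for i in range(400):
    muts = set()
    if i < 300:
        muts.add("D614G")
    if i % 2 == 0:
        muts.add("N501Y")
    genomes[f"g{i}"] = muts
genomes["g7"].add("A222V")  # seen once: frequency 1/400 = 0.0025, below the cutoff

kept = filter_by_mmf(genomes)     # mutations that pass the default MMF
vectors = encode(genomes, kept)   # input vectors for k-means
print(kept)
print(vectors["g7"])
```

The singleton mutation is dropped before encoding, so it cannot influence the cluster assignment of the genome that carries it.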
To determine the optimal number of clusters, ViralVar repeats k-means clustering for a range of cluster numbers (determined by the number of variables in the input file) and calculates the average silhouette width (ASW) index for each using the R package NbClust [42]. To make the ASW calculation less computationally expensive, ViralVar uses only unique genomes (duplicated genomes with identical mutational patterns are removed). However, the final clustering is applied to all of the genomes in the input data to produce counts of the genomes in each cluster. As with the previously described functions, VOC/VOI and date range are selectable. The protein selection option allows for targeting mutations along a protein of interest. Tables and customizable figures (in PDF format) can be downloaded.
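The selection procedure above can be sketched as follows. This is a pure-Python stand-in for the NbClust-based workflow, not ViralVar's implementation: the function names, the candidate range of k, and the toy vectors are hypothetical, and assigning every genome to the nearest centroid of the best run is one assumed reading of "the final clustering is applied to all of the genomes".

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [tuple(float(x) for x in p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(tuple(sum(col) / len(members) for col in zip(*members))
                           if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels


def average_silhouette_width(points, labels):
    """Mean silhouette over all points; points in singleton clusters score 0."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue
        a = sum(sq_dist(p, points[j]) ** 0.5 for j in own if j != i) / (len(own) - 1)
        b = min(sum(sq_dist(p, points[j]) ** 0.5 for j in idxs) / len(idxs)
                for lab, idxs in clusters.items() if lab != labels[i])
        total += (b - a) / max(a, b)
    return total / len(points)


def choose_k_and_cluster(vectors, k_values):
    """Pick k by ASW on unique patterns, then assign every genome to a cluster."""
    unique = sorted(set(vectors.values()))  # drop duplicated mutational patterns
    best = None
    for k in k_values:
        if not 2 <= k < len(unique):
            continue
        centroids, labels = kmeans(unique, k)
        asw = average_silhouette_width(unique, labels)
        if best is None or asw > best[0]:
            best = (asw, k, centroids)
    asw, k, centroids = best
    assignment = {gid: min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for gid, v in vectors.items()}
    return k, asw, Counter(assignment.values())


# Toy data (hypothetical): six unique patterns, duplicated to give 17 genomes.
patterns = {(1, 1, 1, 0, 0, 0): 5, (1, 1, 0, 0, 0, 0): 2, (1, 0, 1, 0, 0, 0): 1,
            (0, 0, 0, 1, 1, 1): 4, (0, 0, 0, 1, 1, 0): 3, (0, 0, 0, 0, 1, 1): 2}
vectors = {f"g{i}": pat
           for i, pat in enumerate(p for p, n in patterns.items() for _ in range(n))}

best_k, best_asw, cluster_sizes = choose_k_and_cluster(vectors, range(2, 5))
print(best_k, round(best_asw, 3), dict(cluster_sizes))
```

Because the silhouette calculation needs all pairwise distances within the set it is given, restricting it to unique patterns is where most of the savings come from; the final nearest-centroid assignment is linear in the number of genomes.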