A predictive language model for SARS-CoV-2 evolution

Establishing a language model for the combinatorial mutations in SARS-CoV-2 Omicron sequences

Our model construction comprises three main steps: the derivation of a “grammatical framework” to capture the regularity of the data, the introduction of a “mutational profile” to incorporate random mutations, and a screening model to exclude sequences with minimal likelihood of emerging in the real world. Initially, we collected the sequences of the S1 peptide in SARS-CoV-2 Omicron variants from April 15th, 2022, to September 15th, 2022, and then performed multiple sequence alignments. Notably, September 2022 marks the end of the first Omicron waves, and April 2022 marks the peak of the BA.2 wave. The Omicron S1 sequence data from the 14th to the 685th residues1 for those five months were subsequently extracted as dataset-1 (Supplementary Fig. 1a). Subsequently, to separate the conserved and non-conserved parts of the S protein and further reduce the dimensionality of the data, we defined the stability of an amino acid site in S1 sequences by the Three Days’ Frequency (TDF), the percentage of each amino acid at a residue site within three successive days. The residue with the highest TDF was defined as the dominant residue of each residue site (Supplementary Fig. 2a). Compared with other definitions of conservation, such as those based on phylogenetic trees, TDF allows us to take latent temporal information into account while being much easier and faster to compute, since constructing a phylogenetic tree requires a validated model and considerable time. Sites exhibiting significant variation ( > 0.09) in the TDF of the dominant residue over time were identified as “hot spots” (Fig. 1a and Supplementary Fig. 2b, c). As a result, we defined 84 hot spots and 588 non-hot spots in the S1 sequences of SARS-CoV-2 Omicron (Supplementary Fig. 2d). To further reduce the dimensionality of the hot spot data and mimic natural human language, we employed a clustering approach. We initially grouped related hot spots into “word clusters,” then grouped the “word clusters” into “sentence clusters” and the “sentence clusters” into “paragraph clusters” (Fig. 1a and Supplementary Figs. 3–5). The “word clusters,” “sentence clusters,” and “paragraph clusters” comprised the so-called “grammatical framework” of dataset-1 (Fig. 1a and Supplementary Fig. 6). Notably, the “word,” “sentence,” and “paragraph” structures can capture long-range interactions between residues, as they are defined and clustered by shared latent patterns rather than by physical distance along the sequence. The framework served as a simplified and structured representation of the hot spot data, enabling us to analyze the regularity of the combinatorial mutations in the S1 sequences of the Omicron variant more effectively and intuitively. Thus, the model can capture the data’s latent patterns and the conservation information more efficiently from the variants’ evolution, reducing the model’s processing time.
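To make the TDF and hot-spot step concrete, the following is a minimal Python sketch, assuming the aligned S1 sequences are stored in a pandas DataFrame indexed by collection date with one column per residue site; the helper names and data layout are illustrative assumptions, not the authors’ code.

```python
# A minimal sketch of the TDF / hot-spot step. Assumes `seqs` is a pandas
# DataFrame of aligned S1 sequences: rows indexed by collection date
# (DatetimeIndex), one column per residue site (14-685). Everything here is
# illustrative; only the 3-day window and the 0.09 threshold come from the text.
import pandas as pd

def dominant_residue_tdf(seqs: pd.DataFrame, site: int) -> pd.Series:
    """Per-3-day-window frequency (TDF) of the site's dominant residue."""
    residues = seqs[site]                        # one column = one residue site
    dominant = residues.mode().iloc[0]           # residue with the highest overall frequency
    by_window = residues.groupby(pd.Grouper(freq="3D"))
    return by_window.apply(lambda s: (s == dominant).mean() if len(s) else float("nan"))

def find_hot_spots(seqs: pd.DataFrame, threshold: float = 0.09) -> list:
    """Sites whose dominant-residue TDF varies by more than `threshold`
    over time are flagged as hot spots (84 such sites in dataset-1)."""
    hot = []
    for site in seqs.columns:
        tdf = dominant_residue_tdf(seqs, site).dropna()
        if len(tdf) and tdf.max() - tdf.min() > threshold:
            hot.append(site)
    return hot
```

The flagged hot spots would then be grouped hierarchically (e.g., by co-mutation patterns) into “word,” “sentence,” and “paragraph” clusters to form the grammatical framework.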

Fig. 1

Modeling the combinatorial mutations in SARS-CoV-2 Omicron sequences. a The schematic diagram for determining the “grammatical framework” and modeling the sequences in dataset-1 based on the “grammatical framework”. The orange solid circles denote the hot spots. The black, green, and blue dashed circles denote the “word clusters,” “sentence clusters,” and “paragraph clusters,” respectively, and together they form the so-called “grammatical framework.” Monte Carlo simulation generates the residues at each hot spot under the constraint of amino acid collocation. For example, as the combination of amino acids I and G had never appeared together in one cluster, we would exclude the possibility of this collocation. b The schematic diagram for modeling future sequences. The mutational profile is introduced to change the occurrence frequency of mutation at hot spots (for example, changing the occurrence frequency of mutation from 0.3 to 0.5). Bi-LSTM language models are used to screen out generated sequences with low fitness based on the prior data. The input data are the amino acids within a “paragraph cluster”. The output layer exports a positive or negative score for the amino acid cluster, and only the sequences with positive scores are output

After deriving the “grammatical framework” from the input dataset-1, we constructed the regularity of the data based on this framework (Fig. 1a and Supplementary Fig. 6). To simulate the amino acids at each hot spot, we employed Monte Carlo (MC) simulation, which generates random outcomes based on occurrence frequencies (Fig. 1a). In dataset-1, we observed that each amino acid was frequently associated with a specific set of co-occurring amino acids within each cluster, forming collocations of amino acids (Fig. 1a). This collocation is seen at each level of “word cluster,” “sentence cluster,” and “paragraph cluster”. The simulation is thus based on the constructed framework (see Methods). To ensure that the generated amino acids align with the observed conditional frequencies and adhere to the co-occurrence patterns in dataset-1, we constrained the collocations of amino acids within each cluster (Fig. 1a). For example, as the combination of amino acids I and G had never appeared together in one cluster, we excluded the possibility of this collocation during the simulation. Considering the co-occurrence patterns of mutations within the same cluster enhanced the accuracy of the generated amino acids in each cluster. By incorporating these constraints and utilizing the MC simulation technique, we endowed our model with the regularity of combinatorial mutations (Fig. 1a).
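As an illustration of this constrained sampling, the sketch below draws one residue per hot spot from its observed frequencies and rejects collocations never seen in the training data; the data structures (`site_freqs`, `observed_combos`) are hypothetical stand-ins for the clustered hot-spot statistics, not the authors’ implementation.

```python
# A minimal sketch of the constrained Monte Carlo step for one cluster.
# `site_freqs` maps each hot spot in the cluster to its observed residue
# frequencies; `observed_combos` holds the residue combinations seen together
# in that cluster in dataset-1. Both structures are illustrative.
import random

def simulate_cluster(site_freqs: dict, observed_combos: set, max_tries: int = 1000):
    """Draw one residue per hot spot from its occurrence frequencies and
    reject draws whose collocation never appeared in the training data."""
    sites = sorted(site_freqs)
    for _ in range(max_tries):
        combo = tuple(
            random.choices(list(site_freqs[s].keys()),
                           weights=list(site_freqs[s].values()))[0]
            for s in sites
        )
        if combo in observed_combos:          # collocation constraint (e.g. I + G excluded)
            return dict(zip(sites, combo))
    return None                               # no admissible collocation found

# Example: two hot spots in one "word cluster"
site_freqs = {452: {"L": 0.3, "R": 0.7}, 486: {"F": 0.6, "V": 0.4}}
observed_combos = {("L", "F"), ("R", "V"), ("R", "F")}   # ("L", "V") never co-occurred
print(simulate_cluster(site_freqs, observed_combos))
```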

Despite the regularity in viral evolution, combinatorial mutations also include random events, and without incorporating this randomness, the model could only generate sequences that have already emerged. Identifying the regularity pattern of viral evolution alone is therefore not enough to build a prediction model. Thus, we aimed to simulate the randomness of the mutations by introducing a new variable, named the mutational profile, to generate combinatorial mutations (Fig. 1b). The mutational profile is defined as a variable that determines the frequency of mutations. It refers to the combined effect of all the mutational processes that resulted in the accumulation of the observed mutations, even though we may not know what those latent effects are or how they interact to yield the final mutation frequency. The mutational profile drives the occurrence of random mutations and thus generates a wide variety of possible viral sequences that may or may not have appeared before. Subsequently, we assessed the evolutionary fitness of the combinatorial mutations to dataset-1 with a language model to identify and remove the sequences with lower fitness, that is, those unlikely to be natural sequences.21 For example, a sequence must obey the latent “grammar” of biological rules to achieve high fitness. The Bi-directional Long Short-Term Memory (Bi-LSTM) language model, capable of processing information both forward and backward to capture more latent features than a regular LSTM,21 was trained on the input data, dataset-1 in this case (Fig. 1b). Given the input amino acids, the Bi-LSTM model generated an emotional score representing how well the sequence aligns with the training data (Fig. 1b). The “paragraph clusters” with a negative emotional score, indicating lower fitness to the training data, were defined as sequences with “syntax errors” and removed, as they were unlikely to represent the latent pattern of viral evolution. This screening process allowed us to retain only the sequences with higher fitness, ensuring that the generated sequences adhered to the “grammatical framework” and accurately represented the regularity and randomness of combinatorial mutations. Thus, our evolutionary language model represents a potential tool for understanding the regularity and randomness of mutations.
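A hedged PyTorch sketch of such a screening step is shown below: a Bi-LSTM reads the residues of a generated “paragraph cluster” and emits a single score, and only clusters with positive scores are retained. The architecture sizes, tokenization, and function names are illustrative assumptions rather than the authors’ exact model.

```python
# A hedged sketch of the Bi-LSTM screening model. A cluster of residues is
# embedded, read bidirectionally, and mapped to one scalar score; generated
# clusters scoring below zero are treated as "syntax errors" and discarded.
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY-"            # 20 residues + deletion token
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

class BiLSTMScreener(nn.Module):
    def __init__(self, embed_dim: int = 16, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(len(AMINO_ACIDS), embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True,
                            bidirectional=True)   # forward + backward context
        self.score = nn.Linear(2 * hidden, 1)     # single fitness ("emotional") score

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.embed(tokens)                    # (batch, length, embed_dim)
        out, _ = self.lstm(x)
        return self.score(out[:, -1, :]).squeeze(-1)

def keep_positive(model: nn.Module, clusters: list) -> list:
    """Retain only the generated clusters with a positive score."""
    kept = []
    with torch.no_grad():
        for residues in clusters:                 # e.g. "RFTNK-S"
            tokens = torch.tensor([[AA_INDEX[a] for a in residues]])
            if model(tokens).item() > 0:
                kept.append(residues)
    return kept
```

In practice such a screener would first be trained on the real “paragraph clusters” from the input dataset so that its score reflects fitness to the observed data.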

Validating the language model with existing SARS-CoV-2 variants and predicting potential future SARS-CoV-2 variants

After establishing this evolutionary language model and before making predictions, we conducted a comprehensive validation of our model using the available SARS-CoV-2 Omicron variants. The validation aims to determine whether our model can accurately represent the regularity of combinatorial mutations in the sequences of dataset-1 after clustering them into the “grammatical framework.” We wanted to check whether we could reproduce dataset-1 from the “grammatical framework” after dimension reduction, to prove that our model retained most of the essential information of the data. For this purpose, we performed 10,000 simulations at each hot spot using the model to generate sequences based on the framework built on dataset-1. During the validation process, the simulated sequences were compared to the actual variants observed in dataset-1 to evaluate how well the model-generated sequences aligned with the real-world viral sequences. The results showed that 67.7% of the simulated sequences matched the sequences in dataset-1, meaning that only 32.3% of them did not match any known lineage, indicating that the model successfully captured the underlying regularity of combinatorial mutations (Fig. 2a). Here, variants are not equivalent to lineages, as different variants may belong to the same lineage or sublineage. The identified variants, such as BA.5.2.1, BA.2.12.1, BA.2, BA.5.1.10, BA.4.1, and BA.2.3, were consistent with those observed in dataset-1, as they were also the dominant variants during the timeframe of dataset-1 (Fig. 2a, b and Supplementary Fig. 1a), further supporting the predictive capacity of this model. Moreover, the top three simulated variants, BA.2.12.1, BA.5.2.1, and BA.2, were also the top three variants in the real dataset-1 (Fig. 2b and Supplementary Fig. 7a). This successful alignment between the simulated and observed variants in dataset-1 demonstrated that the virus evolutionary language model could accurately represent the regularity of combinatorial amino acid mutations present in the given dataset.
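The sketch below illustrates how such a validation could be computed, matching each simulated hot-spot combination against the combinations observed in dataset-1 and reporting the fraction recovered; the `observed_variants` mapping is a hypothetical stand-in for the annotated real data.

```python
# A minimal sketch of the validation step: each simulated hot-spot combination
# is looked up against the combinations actually observed in dataset-1 and the
# overall match rate is reported. The data structures are illustrative.
from collections import Counter

def validate(simulated: list, observed_variants: dict) -> tuple:
    """Return the fraction of simulated combinations matching a known variant,
    plus a count of which variants were recovered."""
    matches = Counter()
    for combo in simulated:
        label = observed_variants.get(tuple(combo))   # e.g. ('T','K',...) -> 'BA.5.2.1'
        if label is not None:
            matches[label] += 1
    match_rate = sum(matches.values()) / len(simulated)
    return match_rate, matches

# With 10,000 simulations on dataset-1, roughly 67.7% of the combinations
# matched known variants such as BA.5.2.1, BA.2.12.1, and BA.2 (Fig. 2a, b).
```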

Fig. 2

The validation of the virus evolutionary language model and the prediction based on dataset-1. a, b The validation of the virus evolutionary language model without introducing the variable mutational profile. a 10,000 variants were generated by simulation based on the “grammatical framework,” and we checked whether the simulated variants could represent the variants in the real data. The larger the circle, the more often this variant was simulated; each circle denotes a unique Omicron variant, and different variants may belong to the same lineage. b The schematic pattern of the prevailing variants during the timeframe of dataset-1, representing the real distribution pattern of the variants in that era. Variants belonging to the same lineage/sublineage are given the same color in (a, b). c, d Using the evolutionary language model with the variable mutational profile to make predictions. c The variants were generated by 10,000 simulations using the model, but now the mutational profile incurs random mutations. d The frequencies of the top 10 variants predicted by the model during the timeframe of dataset-2 (September 16th, 2022, to May 10th, 2023), right after the time period of the input dataset-1. The models in (a–d) were developed based on dataset-1, and the prediction results are verified against the data in dataset-2

After validation, we conducted a retrospective study, using only information available before the prediction window to make predictions. We evaluated whether our model could represent the randomness of combinatorial mutations by introducing and controlling the variable mutational profile, which represents the occurrence frequency of mutation. That is, we examined the predictive capacity of our model. We again conducted 10,000 simulations at each hot spot, but now we changed the mutation frequency and used the Bi-LSTM language model for screening. This screening process proved to be the best choice for our model after benchmarking (Supplementary Fig. 8). After retrieving the simulated results from the model that only contained input data before September 16th, 2022 (dataset-1), we compared them with dataset-2, the actual S1 data of variants that emerged between September 16th, 2022, and May 10th, 2023 (Supplementary Fig. 1b). We collected the sequences of the S1 subunit in SARS-CoV-2 Omicron variants from that period and then performed multiple sequence alignments. After evaluating the predicted sequences against the actual variants in dataset-2, we found that 70% of the simulated sequences could be assigned to specific Omicron subvariants, including BA.4, BA.5.1.10, BA.4.6, BA.5, BA.5.5, and BA.2, which emerged soon after the prediction (Fig. 2c). The variants predicted most often, which our model considers most likely to emerge, account for about 50% of all variants that emerged during late 2022 (Fig. 2d). Additionally, we successfully predicted the unseen BF, BE, and BQ subvariants, which were not present in dataset-1 but emerged after the timeframe of the input data, a significant validation of the predictive capabilities of our model (Fig. 2c and Supplementary Fig. 7b). The prediction of known and previously unknown subvariants indicated that our model could elucidate viral evolution and predict future SARS-CoV-2 variants. By introducing the mutational profile into the model, we significantly extended the ability to forecast the regularity and randomness of mutations, thereby indicating future viral variants.
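As an illustration of how a mutational profile might rescale the sampling frequencies before the constrained Monte Carlo draw (e.g., raising a hot spot’s mutation frequency from 0.3 to 0.5, as in Fig. 1b), consider the following sketch; the rescaling rule shown is an assumption for illustration, since the exact functional form is not specified here.

```python
# A hedged sketch of the mutational-profile step: the total frequency of the
# non-dominant (mutant) residues at a hot spot is rescaled before sampling,
# preserving the relative ratios among mutant residues.
def apply_mutational_profile(freqs: dict, dominant: str, target_mut_freq: float) -> dict:
    """Rescale residue frequencies so that the summed mutant frequency at this
    hot spot equals `target_mut_freq`."""
    current_mut = sum(f for aa, f in freqs.items() if aa != dominant)
    adjusted = {}
    for aa, f in freqs.items():
        if aa == dominant:
            adjusted[aa] = 1.0 - target_mut_freq
        else:
            adjusted[aa] = f / current_mut * target_mut_freq if current_mut else 0.0
    return adjusted

# Example: dominant residue L, mutation frequency raised from 0.3 to 0.5
freqs = {"L": 0.7, "R": 0.2, "Q": 0.1}
print(apply_mutational_profile(freqs, dominant="L", target_mut_freq=0.5))
# -> {'L': 0.5, 'R': 0.333..., 'Q': 0.166...}
```

The adjusted frequencies would then feed the collocation-constrained sampling step, and the resulting sequences would be screened by the Bi-LSTM model as described above.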

Updating the SARS-CoV-2 variant data for predicting potential SARS-CoV-2 variants by the language model

Predicting future SARS-CoV-2 variants or influential residue mutations is significant, as it allows more time to develop proactive responses at earlier stages of viral spread, potentially mitigating the impact of new variants on public health.6 Building upon the previous success of our language model of SARS-CoV-2 evolution in shedding light on viral evolution and accurately predicting future variants, we next aimed to update our prediction of potential future variants based on the sequences that emerged between September 2022 and May 2023 (dataset-2) and between May 2023 and October 2023 (dataset-3). Our first prediction was based on dataset-1, which may be outdated for the present epidemic status. Repeating the prediction based on dataset-2 and dataset-3 would grant us insight into viral evolution in the near future, providing a practical application of our model and further validating its integrity. We employed the same modeling and validation approach used previously to accomplish the prediction.

To begin, we again conducted 10,000 simulations based on the “grammatical framework” constructed from dataset-2 without introducing the mutational profile, thereby focusing on validating the regularity of the mutations. Similarly, we compared our simulated results against the variants present in the raw data (dataset-2) to assess how well the model-generated sequences aligned with the actual variants. Approximately 40% of the simulated sequences matched the variants observed in dataset-2. Notably, the majority of these matched sequences were XBB.1.5, BQ.1.1, BA.5.2.1, BA.4.6, and BF.7 subvariants (Fig. 3a). In both the simulated results and dataset-2, XBB.1.5 emerged as the most frequent Omicron lineage, followed by the BQ and BF subvariants (Fig. 3b and Supplementary Fig. 7c). We thus successfully reproduced dataset-2 from the constructed “grammatical framework”. These results demonstrate that our language model can effectively represent the regularity of the amino acid mutations present in the sequences of dataset-2, retaining the essential information after dimension reduction.

Fig. 3

Updating the SARS-CoV-2 variant data for predicting potential future variants. a, b The validation of the virus evolutionary language model without introducing the variable mutational profile. a 10,000 variants were generated by simulation based on the “grammatical framework”. The larger the circle, the more often this variant was simulated, and each circle denotes a unique Omicron variant. b The schematic pattern of the prevailing variants during the timeframe of dataset-2, representing the real distribution pattern of the variants in that era. The same variants are given the same color in (a, b). c, d Using the evolutionary language model with the variable mutational profile to make the prediction. c The variants were generated by 10,000 simulations using the model, but with the mutational profile included to incur random mutations. d The frequencies of the top 5 variants predicted by the model from May 11th, 2023, to July 1st, 2023, right after the time period of the input dataset-2. e The frequencies of the amino acids at each hot spot generated by the model with the variable mutational profile. The models in (a–d) were developed based on dataset-2, and the prediction results are verified against the data after dataset-2. f The mutations in Omicron variants and the predicted results by the virus evolutionary language model based on dataset-3. In (e, f), del denotes an amino acid deletion

Likewise, after validation, we performed the prediction and tested the predicted results against the variants that emerged after the timeframe of dataset-2. For this purpose, we collected the sequences of the S1 peptide in SARS-CoV-2 Omicron variants from May 15th, 2023, to October 31st, 2023 (dataset-3), and then performed multiple sequence alignments (Supplementary Fig. 1c). We chose October 2023 as the end period of our dataset since it is just before the emergence of the current VOIs, JN.1 and BA.2.86. By comparing the sequences predicted from dataset-2 with the actual sequences in dataset-3, the dataset covering the period after dataset-2, we found that the XBB.1.5 Omicron subvariant was prevalent in both the simulated results and the real dataset-3 (Fig. 3c, d); XBB.1.5 was known as the dominant strain of the summer of 2023. The top 5 variants predicted by our model based on dataset-2 account for about 40% of all variants that emerged from May 2023 to October 2023 (Fig. 3d). Furthermore, the model successfully simulated the emergence of previously unseen Omicron subvariants in dataset-3, including XBB.1.16, XBB.2.3, GB.2, FL.2.3, and EG.5 (Fig. 3c). XBB.1.16 and EG.5 later became dominant strains and were once designated VOC (now VOI);2 we successfully predicted them before their emergence. The ability to predict Omicron subvariants that were not present in the training dataset (dataset-2) and later emerged in the real world demonstrates the capacity of the model to forecast potentially vital variants. To further investigate these predicted variants, we analyzed the amino acids at each hot spot for these previously unknown predicted variants (Fig. 3e). Notably, the frequency of some predicted mutations, such as E180V, V252G, and K478R, increased between May 16th, 2023, and July 1st, 2023 (Fig. 3d and Supplementary Fig. 9). These results validate the capacity of our model to effectively predict emerging mutations and variants.

As the COVID-19 pandemic enters a new stage of co-existence, the evolutionary trend of the virus has shifted from disease severity toward transmissibility. Several mutants with enhanced immune escape capacity have become of concern, including BA.2.86 and JN.1. After examining our model on two different datasets extracted from two different time points, we have already demonstrated and validated its efficacy and accuracy. However, an update based on the most recent data is required to provide a hint about the post-emergency stage. We thus ran our model for the third time, now on the more recent dataset-3, which includes the spike data before the emergence of either JN.1 or BA.2.86. As a result, we found that even though we did not predict the complete sequences of those two variants within the 10,000 output sequences, we did predict about half of their essential mutations, including D339H, G446S, L452W, and F486P (Fig. 3f). Consequently, before their emergence, we successfully predicted future concerning variants, including BF.7, BQ.1, XBB.1.16, and EG.5. For variants that carry an extraordinary number of mutations and previously unseen deletions or even insertions, which makes them hard to predict, our model still extracted most of their vital mutations. By introducing the mutational profile into the model, we simulate both the regularity and the randomness of the mutations.

Predicted variants show significantly enhanced infectivity

To further validate the robustness of our model, we conducted wet-lab experiments, a crucial step in testing its efficacy in real-world scenarios. We selected our first prediction, from dataset-1, for the viral infectivity and immune evasion studies. Based on the prediction results of the first timeframe, we handpicked the top 100 sequences that were most likely to emerge in the future for experimental evaluation and residue mutation analysis (Fig. 4a). We chose 100 sequences, each of which had been generated at least 5 times, a number we deemed sufficient for our model to identify them as potentially highly transmissible sequences. These 100 sequences, numbered 1 to 100, were then synthesized and cloned to generate S-expressing plasmids. After cloning and plasmid propagation in the E. coli system, 100 pseudoviruses, one for each predicted S1 protein, were constructed by plasmid transfection into HEK-293T cells (Fig. 4a). Liquid spectrometry measuring the binding affinity to the ACE2 receptor protein revealed that most of the constructed pseudoviruses with low luciferase luminescence (relative light units, RLU) also showed low binding affinity (Supplementary Fig. 10). Excluding the pseudoviruses with extremely low RLU and binding affinity, 83 out of 100 pseudoviruses were deemed suitable for our infectivity study.

Fig. 4

The experimental evaluation of the model-predicted SARS-CoV-2 spike variants. a The schematic diagram of the wet-lab experiments for plasmid synthesis, pseudovirus construction, infectivity measurement, and neutralization assay. b The percent infectivity of the predicted variants compared with that of the BA.5 variant, represented by TCID50 values. The TCID50 value of BA.5 is set as 100%. c, d The PNAb titers measured by the relative ratio of EC50 values of predicted variants/BA.5 for 12 blood samples (c) and 40 blood samples (d). The EC50 value of the BA.5 sequence is set as 1, and the measurements are geometric means

Before proceeding with the infectivity analysis, we implemented p24 quantification to precisely standardize the concentration of all pseudoviruses. This step was essential to ensure that the same amount of pseudovirus was cocultured with ACE2 receptors for measurement. The 50% tissue culture infectious dose (TCID50) values for the infection assay were obtained and compared according to the Reed-Muench method for each pseudovirus. The results revealed 10 out of 83 pseudoviruses, with sequence numbers #2, #36, #37, #41, #56, #65, #73, #76, #82, and #98, that exhibited significantly enhanced infectivity when compared with the BA.5 sequence (Fig. 4b). Among them, variant #98 demonstrated the highest relative infectivity, on average about 5-fold that of BA.5. These results underscore the potential impact of our research in predicting variants with greater infectivity, the ones that are more likely to pose a threat in the future. The amino acid sequences of those 100 variants are included in the supplementary material.
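For readers unfamiliar with the endpoint calculation, the following is a minimal sketch of the standard Reed-Muench arithmetic used to obtain a TCID50 from a serial-dilution infection assay; the dilution series and well counts are illustrative, not data from this study.

```python
# A minimal sketch of the Reed-Muench TCID50 calculation. Only the standard
# arithmetic is shown; the example numbers are illustrative.
def reed_muench_log10_tcid50(log10_dilutions, infected, uninfected):
    """log10 TCID50 from a serial dilution assay.
    Lists are ordered from the least dilute (e.g. -1) to the most dilute."""
    # Cumulative infected summed from the most dilute end upward,
    # cumulative uninfected summed from the least dilute end downward.
    cum_inf = [sum(infected[i:]) for i in range(len(infected))]
    cum_uninf = [sum(uninfected[:i + 1]) for i in range(len(uninfected))]
    pct = [ci / (ci + cu) * 100 for ci, cu in zip(cum_inf, cum_uninf)]

    # Interpolate between the dilutions bracketing 50% infection.
    for i in range(len(pct) - 1):
        if pct[i] >= 50 > pct[i + 1]:
            prop_dist = (pct[i] - 50) / (pct[i] - pct[i + 1])
            step = log10_dilutions[i + 1] - log10_dilutions[i]   # e.g. -1 per 10-fold
            return log10_dilutions[i] + prop_dist * step
    raise ValueError("50% endpoint not bracketed by the dilution series")

# Example: 10-fold dilutions from 10^-1 to 10^-6, 8 wells per dilution
log10_dil = [-1, -2, -3, -4, -5, -6]
infected = [8, 8, 6, 3, 1, 0]
uninfected = [0, 0, 2, 5, 7, 8]
print(f"TCID50 endpoint ~ 10^{reed_muench_log10_tcid50(log10_dil, infected, uninfected):.2f}")
```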

Predicted variants exhibit increased immune evasion

Besides the viral infectivity of the predicted variants, another characteristic of concern that deserves further evaluation and relates to viral fitness is the neutralization escape of the variants. In our study, we recruited 48 volunteers for blood donation, all of whom shared a similar immune background as recovered patients vaccinated with a WHO-approved Ad5-vectored SARS-CoV-2 vaccine. We then performed a serum antibody neutralization assay to examine the immune evasion of each predicted sequence by coculturing the serum, pseudoviruses, and ACE2-expressing HEK-293T cells together (Fig. 4a). The pseudoviruses that failed to be neutralized by the antisera and infected more 293T cells were the ones deserving more attention. The geometric mean of the ratios of each sequence to BA.5 for the EC50 titer was thus set as the baseline control for comparison. Consequently, compared with the BA.5 variant, 15 potential variants were discovered with significantly lower half-maximal effective concentration (EC50) values, including sequences #8, #21, #22, #50, #52, #55, #60, #65, #67, #82, #88, #91, #94, #96, and #98, although those results came from preliminary studies using only 12 blood samples (Fig. 4c). When including all 40 eligible human blood samples, we found that only 6 predicted variants, #50, #52, #55, #65, #67, and #82, showed a significantly increased capacity for immune evasion (Fig. 4d). Of note, the difference in immune evasion is below 2-fold, yet it is still significant. The immune escape capacities of variants #15 and #99 were also significantly higher in the latter case, yet their differences were below 1.3-fold.
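As background for how per-sample EC50 titers of this kind are typically obtained, the sketch below fits a four-parameter logistic curve to a serum-dilution neutralization series and forms the variant/BA.5 EC50 ratio; the curve model, fitting routine, and example numbers are illustrative assumptions rather than the study’s exact analysis pipeline.

```python
# A hedged sketch of deriving an EC50 from a neutralization curve and comparing
# it to BA.5. The four-parameter logistic model and the data are illustrative.
import numpy as np
from scipy.optimize import curve_fit

def four_pl(x, bottom, top, ec50, hill):
    """Four-parameter logistic dose-response curve."""
    return bottom + (top - bottom) / (1.0 + (x / ec50) ** hill)

def fit_ec50(dilutions, neutralization):
    """Fit the 4PL curve and return the EC50 estimate."""
    p0 = [min(neutralization), max(neutralization), np.median(dilutions), 1.0]
    params, _ = curve_fit(four_pl, dilutions, neutralization, p0=p0, maxfev=10000)
    return params[2]

# Illustrative serum dilution series and % neutralization for one sample
dilutions = np.array([20, 60, 180, 540, 1620, 4860], dtype=float)
variant_neut = np.array([95, 88, 70, 45, 20, 8], dtype=float)
ba5_neut = np.array([98, 94, 85, 62, 35, 12], dtype=float)

ratio = fit_ec50(dilutions, variant_neut) / fit_ec50(dilutions, ba5_neut)
print(ratio)
# Across all serum samples, the geometric mean of such per-sample ratios
# (e.g. via scipy.stats.gmean) gives the titer ratio reported relative to BA.5.
```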

Taking sequence matching into account, we found that among those 6 sequences with enhanced immune evasion capacity, sequences #55 and #67 matched the lineages BQ.1 and BQ.1.12, respectively. As BQ.1 and BQ.1.12 were dominant variants in late 2022, this finding further exemplifies the accuracy and reliability of our prediction model. Interestingly, the BQ.1 and BQ.1.12 variants were not included in our input dataset and only emerged after the construction of our model. These results verify that our model is able to predict not only enhanced infectivity but also immune evasion, providing a strong foundation for future research on and understanding of SARS-CoV-2 variants.

Predicted variants share vital mutations with circulating Omicron variants

The previous evaluations of viral infectivity and immune evasion revealed several vital mutations in the S1 sequence that might be related to viral fitness. As mentioned, we retrieved 10 predicted sequences with significantly enhanced infectivity and 6 sequences with increased immune evasion capacity. To further elucidate the evolutionary relationships and potential functional implications of SARS-CoV-2 variants, we constructed a comprehensive phylogenetic tree incorporating both prevalent and predicted variants (Fig. 5a). We found that sequences #65 and #82 were included in both the immune escape and infectivity groups, illustrating their potential to emerge in the future. By analyzing the residue mutations that occurred in those variants, we discovered several common mutations, including R346T, K444T, N460K, L452Q, the R685 deletion, and N658S. More precisely, we found that R346T, the R685 deletion, L452Q, and N658S appeared more frequently in infectivity-enhancing sequences, while K444T and N460K were detected more often in antibody-evasion-enhancing sequences. The structural analysis of the BA.5 S protein illustrated that most of these mutations had only a minimal effect on the protein structure (Supplementary Fig. 11).

Fig. 5

The residue mutation analysis of the concerning sequences output by the prediction model. a The multiple sequence alignment and phylogenetic tree of the 10 sequences with enhanced viral infectivity and the 6 sequences with increased immune evasion capacity, with the R346T, K444T, and N460K residue mutations labeled. b Average daily prevalence of SARS-CoV-2 variants, with residue mutation sites 346, 444, and 460 labeled

Among the residue mutations we detected above, some were consistent with previous findings, which exemplifies their importance to viral evolution and thus might require greater attention. For instance, previous studies have reported R346T, K444T, and N460K as convergent mutations with enormous growth advantages that were mutated in at least five independent Omicron lineages, the major COVID-19 lineages since 2022.27,28 Besides residue sites 346, 444, and 460, residue site 452 was also included in this list of important mutation sites,28 so the detection of L452Q also supports the reliability of our model. In our study, among the 6 predicted variants with enhanced immune escape, 83.3% (5/6) presented the K444T and N460K mutations. Similarly, among the 10 predicted variants with significantly enhanced infectivity, 50% (5/10) carried the R346T mutation. These findings illustrate that the N460K and K444T mutations are highly correlated with immune escape, and the R346T mutation is correlated with viral infectivity. Importantly, our model, built before September 2022, was able to accurately predict future mutations of interest in the Omicron S protein (Fig. 5b), demonstrating its potential for future surveillance efforts. Interestingly, sequence #65 contained all 3 mutations of greatest interest, which deserves more attention in the future surveillance of Omicron variants.

K444T is one of the major Spike mutations of interest for the SARS-CoV-2 Omicron variants BQ.1 and CH1.1. Owing to their tremendous impact on global health, WHO once escalated BQ.1 to a variant of interest (VOI) and CH1.1 to a variant under monitoring (VUM).2 As there are currently no SARS-CoV-2 variants meeting the VOC criteria, the VOIs and VUMs of the time are the ones that attract the most attention, underscoring the importance of our model detecting this mutation. Moreover, the residue mutations R346T and N460K are even more important, as they are major Spike mutations of interest not only for the BQ.1 lineage but also for the XBB lineage, including variants XBB.1.9.1, XBB.1.15, and XBB.1.16 (Fig. 5a).

Benchmarking shows the model’s unique advantages and comparable accuracy

Wet-lab verification alone might not be sufficient to validate the efficacy of our work. Without elaborate benchmarks and comparisons with other models, it is hard to exemplify the advantages and drawbacks of our model. We therefore selected three well-regarded and representative works also focusing on SARS-CoV-2 evolution, plus a random generator, for our comparison.

TEMPO

TEMPO, a transformer-based mutation prediction framework, is a powerful model that utilizes phylogenetic-tree-based sampling to capture temporal information and a transformer to predict the mutation probability of sites.29 There are similarities and differences in features between our model and TEMPO. As a common advantage, both models can predict mutation sites, and both take temporal information into account to increase accuracy. TEMPO has the unique advantage of including phylogenetic information in the model, which helps it understand what occurred throughout evolution,29 while our model defines the TDF to incorporate the temporal variable. Our model, in such cases, provides an alternative to TEMPO and could be used in the initial stage of a pandemic when no phylogenetic information is available.

The most essential benchmark comparison is the comparison of performance. As the dataset available to TEMPO and the training data for our first prediction cover a similar timeframe, we decided to compare the mutation sites generated by our first prediction with those of TEMPO. Our predictions covered 10 out of 16 high-probability (p ≥ 0.5) mutation sites reported by TEMPO. This result is comparable to TEMPO’s own prediction of 12 out of 16.29 Furthermore, our model also identified several mutation sites not included by TEMPO, demonstrating its unique capabilities.

Brian’s LSTM

Based on two major components, grammar (or syntax) and meaning (or semantics), Hie et al. constructed an unsupervised NLP model to predict the viral escape capacity of different mutations and used three different viruses to examine the model’s validity.21 As output, this model gives two scores for each mutation combination (semantic and grammar). The “semantic” component refers to the amplitude (or extent) of the change in the viral sequence, while the “grammar” component refers to viral fitness. Mutations and combinatorial mutations with high semantic scores represent a large shift from the original viral sequence, and a high grammar score represents high fitness resulting from the mutations. Hie et al. propose that the mutations scoring high on both components are likely to be the ones that lead to immune escape.

To compare our model with Brian’s model, we used Brian’s model to test the mutation combinations of the top 10,000 sequences predicted from dataset-1 (which we believe to be the top 10,000 variants likely to emerge after the time dataset-1 was collected). As a result, we found that 9566/10,000 (95.66%) of the sequences had higher semantic scores, 6388 (63.88%) had higher grammar scores, and 6034 (60.34%) had both higher semantic and grammar scores when compared with the dominant variants in the timeframe of the input data (dataset-1). This result validates that our model is consistent with Brian’s LSTM model, and that the essential variants predicted by our model are also flagged as concerning by Brian’s model.

EVEscape

EVEscape, with its unique combination of deep learning and fitness predictions, quantifies the viral escape potential of given mutations at scale.30 Its distinct advantage over other models is its independence from surveillance sequencing data, experimental scans, and three-dimensional antibody structures when making predictions.30 This advantage is particularly crucial in providing the early warning time critical for vaccine development. Likewise, our model, operating as an early alarm for potentially prevailing variants, could significantly impact vaccine development by making predictions solely from sequence data.

As EVEscape does not generate new sequences or mutation sites but instead predicts the immune escape potential and antibody affinity of a given mutation, we could not compare our model with EVEscape directly. Instead, we tested the top 100 sequences predicted from dataset-1 and dataset-2 using EVEscape; our model predicted those sequences to have high immune evasion capacity and low antibody affinity. As a result, 79% of the sequences predicted from dataset-1 and 81% of those predicted from dataset-2 surpassed the dominant variants of the corresponding timeframe in immune escape capacity (Fig. 6a and Supplementary Fig. 12). For further verification, we increased the sample to the top 10,000 sequences and obtained consistent results: about 65% of the sequences scored higher than the prevailing strains of that period. This alignment with EVEscape further validates the accuracy of our model, instilling confidence in its predictive capabilities.

Fig. 6

The comparison of the predicted results with other models. a EVEscape’s escape scores of the 100 sequences predicted by our model. A higher value represents a higher immune evasion capacity. b MLAEP’s escape scores of the 100 sequences predicted by our model. A lower value represents a higher immune evasion capacity. c MLAEP’s ACE2 binding scores of the 100 sequences predicted by our model. A higher value represents a higher ACE2 affinity. The 100 sequences predicted by our model in (a–c) are based on dataset-1, and the dominant variant BA.5 (marked red) during the era of dataset-1 is used for comparison

MLAEP

The Machine Learning-guided Antigenic Evolution Prediction (MLAEP) model is a novel approach that combines “structure modeling, multi-task learning, and genetic algorithms” to predict viral immune escape and antibody affinity.31 This model, similar in function to EVEscape, provides a score for immune evasion and, uniquely, a score for antibody affinity. By leveraging existing RBD-region sequence information and a deep mutational scanning (DMS) dataset, MLAEP enhances its accuracy, albeit at the cost of requiring additional information.

Similarly, we input the top 100 predicted sequences into MLAEP to examine our model’s performance. These sequences represent the variants most likely to emerge in the future. For the sequences predicted from dataset-1, 74% achieved a higher immune evasion score and 71% achieved a higher ACE2 binding score than BA.2, BA.4, and BA.5, the dominant strains in the timeframe of our input dataset (Fig. 6b, c). Moreover, for the sequences predicted from dataset-2, 82% achieved higher ACE2 affinity and immune escape capacity than XBB.1.5, the dominant strain of the dataset-2 timeframe. That is to say, our prediction is also consistent with MLAEP, so the sequences with warning potential predicted by our model are also acknowledged by other models, including Brian’s LSTM, MLAEP, and EVEscape.

Our model stands out with its unique advantage of extensive wet-lab validation, a testament to its reliability. Although both MLAEP and EVEscape included pseudovirus neutralization assays, those assays were not performed by the researchers themselves. Unlike the three models mentioned above, which utilized only limited wet-lab data from other studies, we took the initiative to synthesize the 100 generated sequences and construct pseudovirus assays (Supplementary Table 1). Each sequence was rigorously examined with 50 serum samples from recovered patients, ensuring a high level of reliability of our results and conclusions.

Our model also circumvents the requirement for sophisticated, hard-to-obtain data such as phylogenetic trees, deep mutational scanning data, and protein 3D structures (Supplementary Table 2). By leveraging the “grammatical framework” for dimension reduction, our model operates efficiently with a relatively small amount of sequence data ( ~ 3 months’ data) and minimal computational power (a personal desktop). A more intuitive description of the difference in computational burden between our model and the above models is that our model needs only a CPU for prediction, whereas all the other models likely require a GPU. This user-friendly approach empowers researchers with a fast-responding method for early-stage pandemic warning.

Random generator

One last interesting comparison is against a random generator. We generated 100 sequences using the random generator and compared them with the top 100 sequences predicted from dataset-1. As shown above, our 100 sequences contained several convergent mutations (R346T, K444T, and N460K) with significant growth advantages, observed in later-emerging variants such as BQ.1, CH1.1, XBB.1.15, and XBB.1.16. Among our 100 predicted sequences, 40 contained R346T, 8 contained K444T, and 6 contained N460K. In contrast, none of the 100 sequences generated by the random generator contained R346T or N460K, and only one contained the K444T mutation, likely by chance (Supplementary Fig. 13). Importantly, our generated sequences included 10 out of 16 high-probability (p ≥ 0.5) predicted mutation sites that appeared in mid-2022, while the random generator’s output included none of them. These comparisons underscore the accuracy of our model and validate that our prediction results are not coincidental.
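For reference, a random generator of the kind used in this comparison could be as simple as the sketch below, which draws residues at each hot spot uniformly, with no occurrence frequencies, collocation constraints, or Bi-LSTM screening; the hot-spot subset and mutation check are illustrative.

```python
# A minimal sketch of a random-generator baseline: residues at each hot spot
# are drawn uniformly at random, ignoring frequencies, collocations, and
# screening. The hot-spot list here is a small illustrative subset.
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def random_baseline(hot_spots: list, n_sequences: int = 100) -> list:
    """Generate `n_sequences` hot-spot residue assignments uniformly at random."""
    return [{site: random.choice(AMINO_ACIDS) for site in hot_spots}
            for _ in range(n_sequences)]

def count_with_mutation(sequences: list, site: int, residue: str) -> int:
    """How many generated sequences carry a given residue at a given site."""
    return sum(1 for seq in sequences if seq.get(site) == residue)

hot_spots = [346, 444, 460]                        # subset, for illustration
random_seqs = random_baseline(hot_spots)
print(count_with_mutation(random_seqs, 346, "T"))  # R346T hits expected to be rare
```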

In conclusion, our model stands out when compared with TEMPO, Brian’s LSTM, EVEscape, MLAEP, and the random generator. It demonstrates several unique advantages and consistent accuracy, exemplifying the efficacy of the model.
