Computational peptide discovery with a genetic programming approach

Data

In order to verify that the dataset contains unique data and that certain sequences are not over-represented, we performed a pairwise sequence similarity calculation on the entire dataset (Fig. 5). The resulting identities were averaged and binned in ten-percent intervals, showing that most sequence pairs (\(\sim \)80%) share less than 10% identity, demonstrating that the dataset is heterogeneous. Only a very small portion of the data (1.22%) has more than 50% identity, and no two sequences are completely identical.
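The identity computation can be sketched as follows (a minimal illustration assuming ungapped, position-wise comparison; the exact similarity metric and any alignment step used by the authors are not specified here, and the function names are illustrative):

```python
from itertools import combinations

def pairwise_identity(a, b):
    """Percent identity between two sequences, compared position by
    position with no alignment (hypothetical simplification)."""
    n = min(len(a), len(b))
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / n

def identity_histogram(sequences, bin_width=10):
    """Bin all pairwise identities into [i, i+bin_width) intervals,
    as in Fig. 5; 100% identity falls into the last bin."""
    bins = [0] * (100 // bin_width + 1)
    for a, b in combinations(sequences, 2):
        idx = min(int(pairwise_identity(a, b) // bin_width), len(bins) - 1)
        bins[idx] += 1
    return bins
```

A heterogeneous dataset then shows most pair counts concentrated in the first bin.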

Fig. 5figure 5

Average pairwise sequence identity in the dataset in percent, with [i–j] indicating values from i (included) to j (excluded)

We then conducted a more detailed analysis of the dataset. Initially, we examined the frequency of occurrence of each amino acid (AA) in both the training and test sets, as illustrated in Fig. 6a. Our observations indicate that lysine (K), threonine (T), arginine (R), and serine (S) are among the most commonly occurring AAs in both sets. These AAs are polar and carry hydroxyl, amine, or guanidinium (three-nitrogen) groups. In addition, K and R are positively charged, enabling them to accept protons and remain soluble in water. Tyrosine (Y) and phenylalanine (F) are the least frequent AAs in the dataset. These AAs are also relatively uncommon in natural proteins, accounting for only 2.92% and 3.86% of residues, respectively. Their hydrophobic and aromatic nature may explain their low occurrence in the dataset.

Fig. 6figure 6

a Frequency of occurrence of each AA in both training (blue) and test (orange) sets. Molecules are illustrated for the four most prevalent AAs in the training set, and hydroxyl or amine groups are highlighted. b Comparison of the frequency of each AA in our dataset (yellow) and in the UniProtKB/Swiss-Prot database (green). The different values represent the percentage of occurrence. c Potential CEST value associated with each AA by occurrence method. The green box represents positively charged AAs, and the red box represents negatively charged AAs. d Frequency of the 20 most observed motifs (size 2 to 6) in the training set with the associated CEST value

Upon comparing the frequency of AA occurrence in our dataset with UniProtKB/Swiss-Prot (release 2023_01) (Fig. 6b), we noted an over-representation of K, R, T, and tryptophan (W), which is consistent with our earlier results. Interestingly, while W is infrequently present in UniProtKB/Swiss-Prot proteins (at a frequency of 1.1%), it is present in our dataset at a frequency exceeding 5%, indicating that it could play a significant role. Previous studies have demonstrated that the indole ring NH protons of W contribute to CEST contrast at approximately 5.5 ppm [80]. However, the CEST values in our dataset were measured at 3.6 ppm, suggesting that the amide group in the backbone, which resonates at this frequency, may be responsible for generating a signal at 3.6 ppm. The AAs that are underrepresented in our dataset are alanine (A), phenylalanine (F), isoleucine (I), and leucine (L), which are non-polar and hydrophobic, lacking amine or hydroxyl groups in their side chains, as well as glutamic acid (E) and aspartic acid (D), which are negatively charged. Because a peptide with high CEST contrast is required to be soluble in water, it is not surprising to find fewer hydrophobic AAs in the dataset.

Next, we conducted an analysis of the impact each AA may have during the evolutionary process (Fig. 6c). Using the ‘occurrence’ method described in the Materials and Methods section, we calculated the potential CEST value associated with each AA. Our results indicate that the AAs with the highest associated CEST values are K, R, S, Q, I, and W, while T, F, Y, and the two negatively charged AAs, E and D, have relatively low CEST values. However, it should be noted that these values may vary depending on the context in which the AA appears, as CEST values are measured on the whole peptide sequence. For instance, while W has a potential CEST value of approximately 20, the ’KWR’ motif has a CEST value of 17.27, and the peptides containing this motif have CEST values of 18.46 and 16.08. This initial analysis allowed us to identify two groups of AAs: six AAs have a CEST value > 15, which could guide the evolutionary process towards the production of REs with significant weight, whereas the other AAs have a CEST value < 10.
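The exact definition of the ‘occurrence’ method is deferred to the Materials and Methods section; one plausible reading, sketched below, is an occurrence-weighted average of the CEST values of the peptides in which each AA appears (the function name and data layout are assumptions for illustration):

```python
def potential_cest_by_occurrence(dataset):
    """Hypothetical 'occurrence' aggregation: for each amino acid,
    average the CEST values of the peptides it occurs in, weighted
    by how many times it occurs in each peptide.

    dataset: list of (sequence, cest_value) pairs."""
    totals, counts = {}, {}
    for seq, cest in dataset:
        for aa in seq:
            totals[aa] = totals.get(aa, 0.0) + cest
            counts[aa] = counts.get(aa, 0) + 1
    return {aa: totals[aa] / counts[aa] for aa in totals}
```

Under this reading, an AA that appears mostly in high-contrast peptides inherits a high potential CEST value, which is exactly the context dependence noted above.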

Subsequently, we conducted a similar analysis on the 20 most prevalent motifs (ranging in size from 2 to 6) in the training set, as depicted in Fig. 6d. Since the focus of this study is on predicting peptides as short as 12 AAs, it is important to consider motifs that consist of only 2 AAs. As anticipated, motifs of size 2 and 3 dominate in the MDB. Notably, the most frequently occurring motifs contain K or T. A divide between motifs with a CEST value greater than 10 and those below 10 is present but less pronounced than for individual AAs. Many motifs with a high CEST value contain K, R, and S, whereas motifs with low CEST values comprise T, E, and D. These findings are consistent with our earlier analyses and provide valuable insights for scrutinizing the performance of the evolutionary algorithm.
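Counting contiguous motifs of size 2 to 6, as tallied for Fig. 6d, can be sketched with a sliding window (a straightforward illustration, not the authors' implementation; `motif_counts` is a hypothetical helper):

```python
from collections import Counter

def motif_counts(sequences, min_len=2, max_len=6):
    """Count every contiguous motif of length min_len..max_len
    across all sequences, including overlapping occurrences."""
    counts = Counter()
    for seq in sequences:
        for k in range(min_len, max_len + 1):
            for i in range(len(seq) - k + 1):
                counts[seq[i:i + k]] += 1
    return counts
```

The 20 most prevalent motifs are then `motif_counts(training_set).most_common(20)`.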

Assigning random weights to REs in POET\(_\)

To confirm the effectiveness of the training step (i.e., weight adjustment) during the evolutionary process, we conducted two independent experiments, each comprising 50 replicates, using identical parameters to those in Table 2. In the first experiment, POET\(_\) was employed with weight adjustments (training step active), whereas in the second, control experiment (called POET\(_\)), weights were randomly defined during the initialization step and randomly changed with a probability of p=0.1 for each rule (training step inactive). After selecting a rule for change, POET\(_\) replaces the rule’s weight with a new random value, uniformly sampled from the interval [-10, 10]. The remaining parts of the two algorithms operate identically. Choosing random weights, as opposed to incorporating a training step, favors random exploration of weights over direct convergence towards optimal weights. If a random change in weights results in an individual achieving higher fitness, tournament selection may choose this individual to contribute part of its genetic material to subsequent generations. The results of these experiments on the test set are presented in Fig. 7a.
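The random-weight mutation of the control variant can be sketched as follows (an illustrative reading of the description above: each rule's weight is independently resampled with probability p=0.1 from a uniform distribution on [-10, 10]; representing rules as (pattern, weight) pairs is an assumption):

```python
import random

def mutate_weights(rules, p=0.1, low=-10.0, high=10.0, rng=random):
    """Control-variant mutation: each rule keeps its pattern, and its
    weight is replaced, with probability p, by a fresh uniform sample
    from [low, high]. No gradient/training signal is used."""
    return [
        (pattern, rng.uniform(low, high) if rng.random() < p else weight)
        for pattern, weight in rules
    ]
```

Selection pressure alone then decides whether a lucky weight change survives into later generations, which is the behavior contrasted with the training step.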

Fig. 7figure 7

a Comparison of POET\(_\) (blue) and POET\(_\) (purple) models on the test set. b Performance of the best POET\(_\) model on the training set (orange) and the test set (green). The translucent bands around the regression line represent the confidence interval for the regression estimate

As expected, the results of the experiments with random weights (POET\(_\)) are lower than those of the experiments with the training step (POET\(_\)). A paired t-test confirmed that the difference is statistically significant (p-value=1.07e\(^\)). Indeed, the average fitness value obtained on the test set is 0.359 with random weights, compared to 0.443 with the training step, an improvement of about 23% (+0.084). Among the random-weight models, the best model (Fig. 7b) has a fitness value of 0.58 (p-value=5.04e\(^\)) on the test set, which is 0.13 (\(\sim \)22%) lower than the best model achieved with the training step. These results confirm the importance and efficiency of the training step during the execution of the algorithm.

Best POET\(_\) model obtained after the evolutionary process

Out of all the previous experiments, the best POET\(_\) model (Additional file 2) exhibited interesting results, with a strong correlation of 0.88 (p-value=1.2e\(^\)) on the training set and 0.71 (p-value=7.7e\(^\)) on the test set (Fig. 8a). A correlation exceeding 0.5 indicates a strong positive relationship between the predicted values of the model and the actual wet-lab measurements, and a p-value below 0.05 indicates that the results are statistically significant. As shown in Fig. 8b, the fitness values of the best individual and of the entire population continue to improve until around 100 generations and then tend to stabilize, meaning that the algorithm converges to a good solution. It is interesting to note that this model comprises 29 rules, consisting of a combination of REs (80%) and contiguous motifs (20%). For instance, the ’KL’ motif is one of the contiguous motifs, with a weight of 3.397. Finally, these results confirm that our GP algorithm is capable of evolving protein-function models adapted to the CEST problem. Consequently, the algorithm is effective in identifying motifs that can enhance the CEST signal.
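A model of weighted rules can score a peptide as sketched below (an illustration, not the authors' exact scoring function: it assumes the prediction sums each rule's weight times its number of matches, and treats contiguous motifs as plain regular expressions, which they are):

```python
import re

def predict_cest(sequence, rules):
    """Score a peptide with a rule set of (regex_or_motif, weight)
    pairs. A contiguous motif such as 'KL' is just a regex without
    metacharacters, so both rule types are handled uniformly.
    The lookahead wrapper makes overlapping matches count."""
    score = 0.0
    for pattern, weight in rules:
        score += weight * len(re.findall(f"(?={pattern})", sequence))
    return score
```

For example, with the ’KL’ motif at weight 3.397 plus a hypothetical RE rule `K[LR]` at weight 1.0, the peptide `KLKL` scores 2 matches of each rule.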

Fig. 8figure 8

a Performance of the best POET\(_\) model on the training set (orange) and on the test set (green). The strong correlation indicates that the algorithm has converged to a good solution. The translucent bands around the regression line represent the confidence interval for the regression estimate. b Evolution of the fitness value during the evolutionary process. The green curve represents the fitness value of the best individual, and the orange curve represents the fitness value of the entire population

Comparison between POET\(_\) and initial POET

In order to evaluate the efficiency of adding REs to build protein-function models, we conducted 100 experiments using the initial version of POET as a baseline for comparison. The initial version of POET has previously demonstrated effectiveness in predicting high-CEST-contrast peptides. In the initial algorithm, models consist of collections of evolved rules, each comprising a sequence of peptide or AA patterns and a numerical weight indicating its importance in producing high contrasts. While these models are nonlinear, they employ a linear method to represent the discovered patterns in each rule. Our hypothesis is that REs can enhance motif discovery in POET\(_\) and, in turn, increase the efficacy of the evolved models. For a fair comparison of the two programs, the same training set was used to train the POET and POET\(_\) models, and the same test set was used to evaluate them. The default parameters from [53] were employed throughout the experiments.

On average, POET exhibits a correlation of 0.292 and a p-value of 0.205. Some models drastically reduce the average because the evolutionary process was unable to find a good solution, or because the algorithm converged too fast and got stuck in a local maximum. Therefore, we focus only on the 9 best models. The average correlation of the top 9 POET models is 0.504 (average p-value of 4.68e\(^\)), which is very close to the performance obtained by POET\(_\). Fig. 9 displays the results of the top 9 POET models. Model 1 obtains the best performance, with a correlation coefficient of 0.59 and a statistically significant p-value of 4.4e\(^\). These results demonstrate the potential of the initial version of POET. However, the best POET\(_\) model performed better than POET, indicating that REs add flexibility that POET lacks and improve the learning and prediction potential. The power and accuracy of the REs allowed the best POET\(_\) model (among all replicates) to increase performance by 20% (+0.12).

Fig. 9figure 9

The 9 best POET models. Each dot represents a datapoint with a true CEST value associated with a predicted CEST value. The green line represents the regression line and the translucent bands around the regression line represent the confidence interval for the regression estimate

Peptide predictions with the best evolved model

After evolving the models and identifying the best one, we utilized the best model to predict peptides that could potentially outperform the gold-standard K12 peptide by exhibiting high CEST values. We employed a computational DE process in which the best POET\(_\) model (Additional file 2) and the standard encoding with 20 AAs were used to predict new peptides. In this context, higher prediction scores correlate with higher CEST values. We conducted 3 experiments with varying numbers of cycles (1000, 100, and 10) during an in silico DE process. This approach replaces the DE screening step by selecting peptides with a potentially high CEST value using the best POET\(_\) model, drastically reducing experimental time and costs. The results for peptides with the highest predicted score (top 1) and peptides with both a high predicted score and high hydrophilicity (best) for each experiment can be found in Table 3, while all predictions are available in Additional file 1: Table S3. It is important to highlight that the higher the number of cycles in the DE process, the more similar the generated peptides become, converging towards an identical solution. Conversely, a limited number of cycles results in less accurate predictions but allows for broader exploration and the generation of original peptides. Thus, determining the optimal number of cycles is a key point in the employed DE.
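The in silico DE loop can be sketched as follows (a minimal mutate-score-select illustration under assumed choices of population size, mutation operator, and selection scheme; the authors' actual DE procedure may differ):

```python
import random

AAS = "ACDEFGHIKLMNPQRSTVWY"  # standard 20-AA encoding

def in_silico_de(model_score, n_cycles, pop_size=50, length=12, rng=None):
    """Hypothetical in silico directed evolution: start from random
    12-mers, point-mutate each peptide once per cycle, and keep the
    pop_size peptides the model scores highest. More cycles means
    stronger convergence towards similar high-scoring peptides."""
    rng = rng or random.Random()
    pop = ["".join(rng.choice(AAS) for _ in range(length))
           for _ in range(pop_size)]
    for _ in range(n_cycles):
        mutants = []
        for pep in pop:
            i = rng.randrange(length)
            mutants.append(pep[:i] + rng.choice(AAS) + pep[i + 1:])
        # model-based screening replaces the wet-lab screening step
        pop = sorted(pop + mutants, key=model_score, reverse=True)[:pop_size]
    return max(pop, key=model_score)
```

With many cycles the population collapses onto near-identical top scorers; with few cycles it stays diverse, mirroring the trade-off described above.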

Table 3 Predicted peptides with highest predicted score (Top 1) and best predicted peptides with highest hydrophilicity and high score (Best), with 1000, 100 and 10 cycles during DE

Next, we analyzed the AA composition of the predicted peptides. The results are illustrated in Fig. 10a. As expected, peptides generated after 1000 cycles exhibit a homogeneous AA composition and achieve high predicted scores (>90). In contrast, peptides generated after 100 and 10 cycles display a more heterogeneous AA composition with lower scores (approximately 70–80 for 100 cycles and 40–50 for 10 cycles). The sequence logos in Fig. 10b, generated with the WebLogo 3 tool [81], highlight the probability of each AA at a given position. With an increasing number of cycles, the presence of Q, L, S, and K becomes more prominent, confirming the tendency to converge towards similar peptides with a homogeneous AA composition.
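The per-position probabilities that a sequence logo visualizes can be computed directly (a simple illustration for equal-length peptides; WebLogo additionally scales letters by information content, which is omitted here):

```python
def position_probabilities(peptides):
    """For equal-length peptides, return one dict per position mapping
    each observed AA to its frequency at that position (the raw
    probabilities underlying a sequence logo)."""
    length = len(peptides[0])
    probs = []
    for i in range(length):
        column = [p[i] for p in peptides]
        probs.append({aa: column.count(aa) / len(column)
                      for aa in set(column)})
    return probs
```

Positions dominated by a single residue (e.g. K after many cycles) then show a probability near 1.0.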

Fig. 10figure 10

a Number of AAs present in the predicted peptides in the 3 types of DE experiments: 1000 (blue), 100 (orange) and 10 (green) cycles. b Sequence logos highlighting the probability of each AA at a given position, for the 3 experiments. As the number of cycles increases, the predicted peptides are more similar with high rates of lysine and leucine. The polar AAs are in green, the neutral in purple, the positively charged in blue, the negatively charged in red and the hydrophobic in black

Also, we observed a significant presence of isoleucine in the peptides predicted in the experiments with 100 and 10 cycles (Additional file 1: Table S4). The abundance of lysine, glutamine, and serine in the predicted peptides is consistent with our initial analysis of the dataset. Lysine, a positively charged AA, plays a crucial role in detecting CEST signals. Glutamine and serine, non-charged polar AAs with amide and hydroxyl groups, respectively, facilitate proton exchange with water molecules. Hence, we expected to find these AAs in the predicted peptides. Conversely, we anticipated a high presence of arginine and tryptophan, given their abundance in the dataset. However, the peptides predicted for 10, 100, and 1000 cycles contained only 1.6%, 3.3%, and 0% arginine, respectively, and 4.5%, 2.5%, and 1.6% tryptophan. Interestingly, we observed a significant occurrence of leucine in the predicted peptides, with percentages of 5.83% for 10 cycles, 15.42% for 100 cycles, and 32.92% for 1000 cycles. This is notable because leucine is not very abundant in the dataset. The presence of leucine, a hydrophobic AA, contradicts the preference for hydrophilic and soluble peptides in CEST experiments. However, leucine plays a key role in protein structure folding and has a strong tendency to form alpha helices while maintaining their stability. Consequently, we used the ColabFold tool [82], based on the AlphaFold2 model [34], to perform 3D structure predictions of the leucine-rich predicted peptides. The results presented in Additional file 1: Figure S1 demonstrate that the predicted patterns tend to form alpha helices. Thus, the model can identify leucine-rich motifs that play a significant role in the formation of specific secondary structures, such as the alpha helix. In this manner, the GP algorithm has produced original results: despite our initial expectation of observing substantial amounts of arginine, threonine, and tryptophan, it found and favored glutamine, leucine, and isoleucine. This suggests that the algorithm was capable of discovering motifs that contribute to the function and/or structure of the predicted peptides.

We identified the main motifs present in the predicted peptides for the three types of experiments. As anticipated, these motifs primarily consist of the residues K, L, Q, S, and I. In the peptides predicted after 1000 cycles, the main motifs involve lysine and leucine, such as LK (45), KL (38), LLK (28), or LKLL (17). However, there are also motifs incorporating other AAs, such as LQS (10) or SLK (16). In the experiments with 100 and 10 cycles, motifs such as QS, GS, SI, and SL, as well as SLK, LKS, IKK, LQS, and QSL, were observed. These results confirm the ability of our algorithm to extract valuable information from the data and leverage it to generate peptides with potentially significant CEST values.

Experimental validation of predicted peptides

The best protein-function model evolved by POET\(_\) was used to generate novel peptides that have the potential to enhance CEST contrast. In order to validate the reliability of our approach, we selected the top 3 predicted peptides with higher hydrophilicity and high score from each DE experiment (10, 100, and 1000 cycles) and evaluated their performance in the wet lab.

The 9 peptides were synthesized, and the magnetization transfer ratio asymmetry (MTR\(_\)), a measure of CEST contrast, was obtained using NMR spectroscopy. The MTR\(_\) was normalized to the molar concentration of each peptide (Additional file 3: Table S1) and plotted as a function of the saturation frequency offset (Fig. 11). Since POET\(_\) was trained on the MTR\(_\) contrast at 3.6 ppm, the MRI results are presented in Table 4 for MTR\(_\) at 3.6 ppm. Data are normalized relative to the gold standard K12 peptide.
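At a single offset, the asymmetry metric is conventionally MTRasym(Δω) = Z(−Δω) − Z(+Δω), where Z = S/S0 is the normalized water signal; a minimal sketch (assuming the Z-spectrum is already normalized and sampled at symmetric offsets, and reporting the result in percent):

```python
def mtr_asym_spectrum(z):
    """Compute MTRasym(dw) = Z(-dw) - Z(+dw), in percent, for every
    positive offset dw (ppm) that has a symmetric counterpart.

    z: dict mapping saturation frequency offset (ppm) to the
    normalized water signal Z = S/S0."""
    return {
        dw: 100.0 * (z[-dw] - z[dw])
        for dw in sorted(z)
        if dw > 0 and -dw in z
    }
```

The single value reported per peptide in Table 4 would then correspond to the asymmetry at the 3.6 ppm offset.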

Fig. 11figure 11

MTR\(_\) plot of nine peptides and the gold standard peptide (K12) measured by NMR

It is interesting to note that neither the 1000-cycle nor the 10-cycle experiments demonstrate convincing results, with an average MTR\(_\) of 6.47 (1000 cycles) and 7.67 (10 cycles). This outcome is likely due to too many or too few cycles, leading to the generation of either a too homogeneous or a too diverse set of peptides. For instance, in the 1000-cycle experiments, 66% of the peptides consisted of QSLK or KLKK motifs, while no dominant motif was identified in the peptides from the 10-cycle experiments. These results highlight a limitation of our approach: the search space explored must be neither too constricted nor overly expansive, striking a balance between homogeneous and overly diverse peptides. Conversely, among the 3 predicted peptides in the 100-cycle experiment, peptide QDGSKKSLKSCK (QDGSK, brown line in Fig. 11) generated an MTR\(_\) at 3.6 ppm 58% larger (17.59) than that of the gold standard peptide K12 (10.51). This peptide not only demonstrates superior CEST sensitivity but also has a high predicted score and the highest hydrophilicity among the peptides considered (Table 3).

An interesting observation is that this peptide contains only 25% K residues, which is important for increasing the diversity of the AA composition of genetically encoded reporters [28]. QDGSKKSLKSCK is also unique compared to other peptides since it has a distinct peak at 3.6 ppm, resulting from the amide exchangeable protons, with little or no contribution from amine or guanidine exchangeable protons resonating between 1.8 and 2.0 ppm.

Table 4 Experimental results obtained in the wet lab for the peptides predicted by POET\(_\)

These findings confirm that, after training/evolving and employing the best POET\(_\) model, the search space was successfully narrowed down, allowing us to highlight a candidate peptide whose performance exceeds that of the gold standard peptide K12 by 58%. POET\(_\) has proven its ability to extract motifs with compelling properties, facilitating the generation of peptides tailored to specific problems.
