DrugReAlign: a multisource prompt framework for drug repurposing based on large language models

DrugReAlign framework

To optimize the utilization of existing drugs, our research introduces an innovative drug repositioning framework that harnesses LLMs and incorporates multi-source prompting techniques, illustrated in Fig. 1. This framework is designed to enable LLMs to perform comprehensive analyses of target sites, facilitating self-prompting, a strategy proven effective in prior studies [23, 24]. One critical challenge with LLMs is the risk of hallucinations and misinformation, particularly when prompts are derived from a single or narrow data source, which can lead to inaccurate predictions or fabricated relationships [25, 26]. To address this, we employ a multi-source prompting approach that integrates diverse and reliable data inputs. In Additional File 1: Figure S1 [26, 27], we provide the prompt template and a template for the drug repositioning predictions produced by the proposed framework. Additionally, to validate the usability of DrugReAlign on large-scale datasets, we tested its multithreading capabilities; the results, demonstrating the rapid screening ability of multithreaded DrugReAlign on large volumes of data, can be found in Additional File 1: Table S1.

Fig. 1

DrugReAlign framework and detailed flowchart. a Construction of multi-source prompts for targets. b Screening of potential drugs interacting with targets based on LLMs and multi-source prompt information. c Example of prompt construction for specific targets. d Interactive querying and decision support using LLMs

Our methodology combines information from Protein Data Bank (PDB) structure summaries and known spatial interactions between targets and small molecules. By incorporating structural and interaction data from multiple sources, we significantly reduce the likelihood of LLMs generating misleading or false predictions. Specifically, the detailed structural data from PDB provides a factual and constrained foundation, while the interaction information from known small molecule relationships further narrows the model's focus to relevant biological interactions. This ensures that LLMs are guided by concrete and experimentally validated data, rather than relying solely on their internal language-based knowledge, which can sometimes lead to erroneous or non-existent predictions.

In module (a), we retrieve the PDB structure descriptions of targets from the RCSB [28] database, including the PDB name, classification, gene ownership, and other summary fields. Additionally, we extract known spatial interaction information between targets and small molecules (such as spatial coordinates and interaction forces) from the PDB files to form example-based prompts. This multi-source prompt construction provides the LLMs with real-world, experimentally verified examples, ensuring that their outputs are grounded in factual data. Module (c) demonstrates the construction of example-based prompts and rationales for specific targets. In module (b), the summary of PDB structures and the target-small molecule spatial interactions are input into the LLMs. This combined data ensures that the LLM is working within a well-defined, reliable context, which helps to mitigate the risk of hallucinations. The LLM analyzes these prompts and generates suggestions for drug repositioning, focusing on interactions that are biologically plausible and supported by data.
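As a minimal sketch of module (a), assuming the RCSB Data API entry endpoint (https://data.rcsb.org/rest/v1/core/entry/{id}) for the structure summary, the snippet below assembles a multi-source prompt from a PDB summary and known interaction lines; the interaction example and prompt wording are illustrative rather than the exact template used by the framework (the real template is given in Additional File 1: Figure S1).

```python
import json
import urllib.request


def fetch_pdb_summary(pdb_id: str) -> dict:
    """Retrieve the structure summary of one PDB entry from the RCSB Data API."""
    url = f"https://data.rcsb.org/rest/v1/core/entry/{pdb_id}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def build_multisource_prompt(pdb_id: str, interactions: list) -> str:
    """Assemble a multi-source prompt: PDB summary plus known
    target-small-molecule spatial interactions as in-context examples."""
    entry = fetch_pdb_summary(pdb_id)
    title = entry.get("struct", {}).get("title", "N/A")
    keywords = entry.get("struct_keywords", {}).get("pdbx_keywords", "N/A")
    summary = f"PDB ID: {pdb_id}\nTitle: {title}\nClassification: {keywords}"
    examples = "\n".join(f"- {line}" for line in interactions) or "- none recorded"
    return (
        "You are a drug-repositioning assistant.\n\n"
        f"Target structure summary:\n{summary}\n\n"
        "Known small-molecule spatial interactions (ligand, contact residues, forces):\n"
        f"{examples}\n\n"
        "Based only on the information above, recommend approved drugs likely to "
        "interact with this target, ranked, each with a short rationale."
    )


# Illustrative call; the interaction line is a hypothetical example, not real PDB data.
print(build_multisource_prompt(
    "1EQZ",
    ["Ligand XYZ: hydrogen bond with ARG-45 (2.9 A); hydrophobic contact with LEU-82"],
))
```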

Subsequently, we extract the predicted drug structures and conduct molecular docking experiments with the respective targets to validate the proposed framework's effectiveness in drug repositioning tasks. In module (d), using LLMs and multi-source prompt techniques, the framework generates detailed analysis reports, including the rationale for drug rankings, molecular docking results, and interactive queries, enhancing the transparency and reliability of the process. This combination of multi-source prompts and experimental validation helps further mitigate hallucinations by providing data-driven outputs at every stage, ensuring that predictions remain within the bounds of known science. Thus, the framework not only provides LLM-based target analysis but also minimizes the risks of hallucinations and misinformation through its reliance on multi-source prompting, improving the reliability of the proposed drug repositioning relationships. This approach effectively alleviates the labor- and resource-intensive challenges of traditional drug repositioning methods.

Validation of DrugReAlign on DTI datasets

To preliminarily validate the reliability of DrugReAlign and LLMs in the task of drug repositioning, we selected NR [29] and GPCR [29] as DTI benchmark datasets for validation. Detailed information about these datasets can be found in Additional File 1: Table S2. The results of individual runs from the five repeated experiments are provided in Additional File 2. Initially, we identified the PDB structures corresponding to the relevant targets and then used DrugReAlign to search for market-available drugs that might interact with these targets. The specific experimental results are presented in Table 1.

Table 1 Performance of DrugReAlign in drug recommendations within the DTI datasets

Target Coverage Rate (TCR): the proportion of targets with at least one drug interaction recorded in the dataset. Top-1 Recommendation Success Rate (T1RSR): the proportion of targets for which the top-1 recommended drug is recorded in the dataset. Overall Interaction Rate (OIR): the proportion of all drug recommendations that are recorded in the dataset. Metrics are averaged over five runs, with standard deviations included to reflect variability; bold values indicate the best performance. NR contains 25 targets and GPCR contains 68 targets.

We first defined the TCR metric, which reflects the proportion of targets for which the recommended results include at least one known DTI relationship. This metric is significant for the drug repositioning task because DrugReAlign requires LLMs to analyze target information to complete the recommendation task; the presence of known DTIs indicates that the LLMs have, to some extent, understood the conditions for interacting with the target, thereby greatly enhancing the credibility of the recommendation results. Additionally, we defined the T1RSR metric, which represents the proportion of targets for which the top-ranked recommended drug is a known DTI. This metric similarly indicates the credibility of the recommendation results. OIR represents the proportion of known DTI relationships among all recommended results. An increase in this metric indicates an improvement in the credibility of DTI recommendations, while a decrease might indicate an increase in the diversity of recommendation results. Therefore, in different scenarios, we need to weigh both the credibility and the diversity of the recommendation results to achieve a balance between the two.
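For concreteness, the three metrics can be computed as in the following sketch, where the target and drug identifiers are hypothetical placeholders and each recommendation list is assumed to be ordered by rank.

```python
def evaluate_recommendations(recommendations: dict, known_dtis: dict) -> dict:
    """Compute TCR, T1RSR, and OIR from per-target ranked recommendation lists.

    recommendations: target -> ranked list of recommended drug names
    known_dtis:      target -> set of drugs with a recorded interaction in the dataset
    """
    n_targets = len(recommendations)
    covered = 0      # targets with at least one known DTI among their recommendations
    top1_hits = 0    # targets whose top-1 recommendation is a known DTI
    total_recs = 0   # all recommendations across targets
    known_recs = 0   # recommendations that are known DTIs

    for target, drugs in recommendations.items():
        known = known_dtis.get(target, set())
        hits = [d for d in drugs if d in known]
        covered += 1 if hits else 0
        top1_hits += 1 if drugs and drugs[0] in known else 0
        total_recs += len(drugs)
        known_recs += len(hits)

    return {
        "TCR": covered / n_targets,
        "T1RSR": top1_hits / n_targets,
        "OIR": known_recs / total_recs,
    }


# Toy example with two targets and hypothetical drug names.
print(evaluate_recommendations(
    {"targetA": ["drug1", "drug2"], "targetB": ["drug3"]},
    {"targetA": {"drug1"}, "targetB": {"drug9"}},
))
# -> {'TCR': 0.5, 'T1RSR': 0.5, 'OIR': 0.333...}
```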

In both datasets, GPT-4 achieved the best performance on the TCR and T1RSR metrics, while GPT-3.5 exhibited significant fluctuations in both datasets. This may be due to the varying proportions of different target types in GPT-3.5's training data. Additionally, New Bing showed relatively lower TCR and T1RSR values in both datasets, indicating a tendency to recommend unknown DTIs. However, New Bing's OIR in the GPCR dataset was comparable to that of the other models, suggesting that while it focused on known DTIs for certain targets, it explored unknown DTIs for others, offering a balance between discovery and reliability.

As for medllama3-v20, although its performance was lower on both the TCR and T1RSR metrics, the decline in OIR might suggest a greater focus on exploring novel DTI relationships, thereby enhancing the diversity of its recommendations. This could provide potential for discovering new interactions, despite lower immediate credibility in terms of known DTI coverage. Overall, all models performed relatively well in terms of TCR, with most recommendations including known DTIs. This strongly demonstrates the robustness of LLMs in the task of drug repositioning, with varying degrees of focus on known versus novel DTI discovery.

Quantitative analysis of drug repositioning results based on AutoDock Vina

Molecular docking technology is extensively utilized to evaluate and screen candidate drugs for targets. Molecular docking software predicts binding affinity by simulating the interaction between a drug and its target. The affinity value largely indicates a drug's binding capability and potential biological activity towards a specific target: the higher the affinity, the more tightly the drug binds to the target, potentially indicating stronger biological effects and better therapeutic potential [30]. To assess the prediction results of LLMs, we conducted a substantial number of molecular docking experiments using AutoDock Vina on the relevant targets and drugs. In molecular docking, a binding free energy lower than -5 kcal/mol generally indicates an appreciable molecular interaction, while values of -7 kcal/mol or lower are regarded as strong interactions.
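As a rough illustration of how these thresholds are applied when aggregating the Vina outputs, the following sketch classifies hypothetical affinity values using the cutoffs stated above.

```python
def interaction_strength(affinity_kcal_per_mol: float) -> str:
    """Classify an AutoDock Vina binding free energy using the thresholds above."""
    if affinity_kcal_per_mol <= -7.0:
        return "strong"
    if affinity_kcal_per_mol <= -5.0:
        return "appreciable"
    return "weak"


scores = [-8.2, -6.4, -4.1]  # hypothetical Vina affinities (kcal/mol)
print([interaction_strength(s) for s in scores])  # ['strong', 'appreciable', 'weak']
print(sum(scores) / len(scores))                  # mean binding free energy
```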

Figure 2(a) and Table 2 display the molecular docking scores for drugs and their corresponding targets as predicted by the GPT-3.5, GPT-4, New Bing, and medllama3-v20 models. We observed that the results predicted by all of the evaluated LLMs achieved satisfactory molecular docking scores, indicating their capability to predict potential therapeutic drugs for specific targets. The predictions by the GPT-4 and New Bing models scored higher in docking experiments, whereas GPT-3.5's predictions were significantly weaker, with an average binding free energy of -6.40 kcal/mol. Interestingly, despite having only 8B parameters, medllama3-v20's average docking score (-7.35 kcal/mol) was comparable to that of GPT-4, showing promising performance. This suggests medllama3-v20's potential to deliver competitive results in molecular docking, though its consistency remains an area for further exploration.

Fig. 2

Distribution of binding free energy and semantic relevance analysis for drugs recommended by LLMs. a Distribution of binding free energy between drugs predicted by LLMs and their corresponding targets. b Correlation between the rankings of drugs predicted by LLMs and binding free energy. c Exploration of the correlation between average binding free energy and semantic similarity for drugs targeting the same site

Table 2 Comparison of binding free energy (kcal/mol) for drugs recommended by LLMs based on target information. Each model was tested using 1,278 targets to evaluate their performance

These findings suggest a certain correlation between molecular docking scores and the overall capability of LLMs, potentially offering a quantitative method to assess the performance of LLMs in drug repositioning tasks. We further explored the correlation between the drug rankings predicted by LLMs and their molecular docking scores, as shown in Fig. 2(b). For the GPT-4 and New Bing models, the average trend lines were relatively flat, indicating no significant linear relationship between the rankings of predicted drugs and the distribution of docking scores. Notably, although the GPT-3.5 model lagged significantly behind the other LLMs in overall performance, the rankings of its predicted drugs showed an approximately linear relationship with the distribution of docking scores. For medllama3-v20, the behavior of drug rankings relative to docking scores was quite similar to that of GPT-4, which is surprising given its 8B parameter size. This suggests that medllama3-v20, despite having fewer parameters, can still achieve results comparable to larger models such as GPT-4 by focusing on similar aspects when parsing instructions. It also highlights that different LLMs may prioritize different elements of the task, with medllama3-v20 demonstrating capabilities on par with more advanced models in abstract tasks, while other models may perform better with more direct and concrete instructions.
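One way to quantify the rank-versus-affinity trend shown in Fig. 2(b) is a rank correlation; the sketch below uses SciPy with hypothetical ranks and affinities (the figure itself reports fitted trend lines rather than a specific coefficient).

```python
from scipy.stats import spearmanr

# Hypothetical example: LLM ranks (1 = top recommendation) and Vina affinities (kcal/mol)
# for the drugs recommended to one target.
ranks = [1, 2, 3, 4, 5]
affinities = [-8.1, -7.9, -8.3, -7.7, -8.0]

rho, p_value = spearmanr(ranks, affinities)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")  # a rho near zero matches a flat trend line
```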

Correlation between interpretability and docking performance of LLMs

LLMs can provide precise answers on specific topics through concise dialogue. In this study, the LLMs output predicted drug names, rankings, and reasons based on multi-source prompt information about the target. In this section, we focus on the reasons (or explanations) LLMs provide for their drug predictions and evaluate the quality of these explanations for specific targets. We used OpenAI's text embedding model, text-embedding-3-small, to convert all predicted drug explanations into vectors for quantitative assessment.

Specifically, we quantified the correlation between the explanations LLMs gave for predicted drugs and the molecular docking scores. Taking the drug repositioning information for each target as a unit, we calculated the average cosine similarity of the explanations for the predicted drugs within each unit, along with the average binding free energy of the predicted drugs against their respective targets. Across all units, we then calculated the correlation coefficient and p-value between the average cosine similarity of drug explanations and the corresponding average molecular docking scores to quantify the strength and significance of their relationship. As shown in Fig. 2(c), for the three LLMs (GPT-3.5, GPT-4, and New Bing), the calculated correlation coefficients were all less than 0. For the New Bing model, the correlation coefficient reached -0.3, indicating a slight-to-moderate negative correlation between explanation similarity and binding free energy; since lower binding free energy corresponds to stronger affinity, this amounts to a positive correlation between explanation similarity and docking affinity. In other words, the more similar the LLM's explanations for the drugs predicted for a specific target, the stronger the binding affinity of these predicted drugs with that target. Moreover, for all LLMs, the calculated p-values were well below 0.05, indicating that the negative correlation between the average cosine similarity of drug explanations and the corresponding average molecular docking scores is statistically significant. For medllama3-v20, the correlation was noticeably weaker than for the other LLMs. After reviewing its explanations, we found that medllama3-v20 provided simpler and more uniform responses across different drug predictions. This lack of detail may be due to its smaller parameter size (8B), limiting its ability to capture nuanced relationships between drugs and targets. Additionally, it might have been fine-tuned on instruction-based datasets that emphasize following general patterns rather than providing diverse and detailed explanations. As a result, both its simplicity and its potential focus on instructions likely contributed to the weaker correlation observed.
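A minimal sketch of this quantification is given below, assuming the OpenAI Python client (v1) interface to text-embedding-3-small; the per-target data structure is a simplified assumption, and each target is expected to have at least two explanations.

```python
import itertools

import numpy as np
from openai import OpenAI
from scipy.stats import pearsonr

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def mean_pairwise_cosine(explanations: list) -> float:
    """Average cosine similarity between embeddings of one target's drug explanations."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=explanations)
    vecs = [np.array(d.embedding) for d in resp.data]
    sims = [float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
            for a, b in itertools.combinations(vecs, 2)]
    return float(np.mean(sims))


def similarity_energy_correlation(per_target: list):
    """per_target: list of (explanations, mean binding free energy) pairs, one per unit."""
    sims = [mean_pairwise_cosine(expls) for expls, _ in per_target]
    energies = [energy for _, energy in per_target]
    return pearsonr(sims, energies)  # (correlation coefficient, p-value)
```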

Propensity of predicted drugs

In this section, we evaluated the tendencies of the three types of LLMs in drug repositioning tasks within the proposed framework. Figure 3(a) shows the intersection size of the drug sets predicted by the three models. The New Bing model predicted a total of 2,266 drugs, surpassing other LLMs and indicating its broader coverage in the drug database. It is also observed that 1,003 drugs were exclusively predicted by the New Bing model, suggesting its ability to provide a greater number of alternative drugs not covered by other models. The GPT-3.5 model predicted only 1,424 drugs, significantly fewer than the other LLMs, with just 499 drugs being unique predictions. We speculate that this is due to differences in the volume of training data sources among the LLMs. Therefore, when utilizing LLMs for drug repositioning tasks, it is crucial to consider the breadth and update frequency of their training data. This can offer more potential drugs for specific research targets, helping to expand existing treatment options and develop new ones.

Fig. 3

Relationships among recommended drug sets and distribution of Lipinski's rule properties of LLMs. a Relationships among drug recommendation sets of different LLMs under the complete dataset. b Relationships among drug recommendation sets of different LLMs under subsets of data. c Relationships among drug recommendation sets of different LLMs under subsets of data with spatial interaction information removed. d Distribution of Lipinski's rule properties of drugs recommended by different LLMs under the complete datasets. The bars represent the number of drugs in each intersection, with their height indicating the intersection size. The dots below show which drug sets are being intersected, and connected dots represent drugs shared across multiple sets. The bars on the left display the total drug set size for each model

We utilized Lipinski's Rule of Five [31] (molecular weight (MW), lipophilicity (LogP), number of hydrogen bond donors (HBD), number of hydrogen bond acceptors (HBA), and number of rotatable bonds (RotB)) to analyze and evaluate the predicted small-molecule drugs with a molecular weight below 500. As observed in Fig. 3(d), compared with the GPT-4 and New Bing models, the GPT-3.5 model exhibited a clear preference for drugs with smaller molecular weights. Furthermore, we analyzed and evaluated the large-molecule drugs predicted by the proposed framework with molecular weights greater than 500. In this study, the New Bing and GPT-4 models predicted 385 and 354 large-molecule drugs, respectively, while GPT-3.5 predicted only 256. This further emphasizes the GPT-3.5 model's preference for small-molecule drugs.

The GPT-3.5 model's preference also extends to the chemical properties of the selected drugs. In Fig. 3(d), the GPT-3.5 model more frequently outputs small-molecule drugs with relatively lower LogP and RotB values, while the number of HBDs is usually slightly higher. A higher number of HBDs enhances a molecule's hydrophilicity, which, together with the lower lipophilicity and higher rigidity, is consistent with the experimental outcomes. These findings reflect the GPT-3.5 model's tendency to select drugs with smaller molecular weights, stronger hydrophilicity, lower lipophilicity, and higher rigidity. Enhanced hydrophilicity aids the solubility and absorption of a drug, while a rigid molecular structure helps it bind precisely to the target; both are key factors to consider in drug development. In Additional File 1: Figure S2 [32], we provide the ADMET property analysis of the LLM-recommended drugs produced by the proposed framework, which enables a more comprehensive analysis of the differences among these drugs.
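The descriptor set analyzed in Fig. 3(d) can be reproduced with RDKit, as in the sketch below; aspirin's SMILES is used purely as an illustrative input.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski


def lipinski_profile(smiles: str) -> dict:
    """Compute the descriptors shown in Fig. 3(d) for one recommended drug."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Unparsable SMILES: {smiles}")
    return {
        "MW": Descriptors.MolWt(mol),
        "LogP": Descriptors.MolLogP(mol),
        "HBD": Lipinski.NumHDonors(mol),
        "HBA": Lipinski.NumHAcceptors(mol),
        "RotB": Descriptors.NumRotatableBonds(mol),
    }


# Example: aspirin (illustrative input only).
print(lipinski_profile("CC(=O)OC1=CC=CC=C1C(=O)O"))
```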

Ablation experiment

We conducted ablation studies to investigate whether spatial interaction information between drugs and targets can serve as an effective prompt to enhance the drug repurposing performance of LLMs, as shown in Table 3. Here, "w/o SF" indicates that the spatial interaction data between drugs and targets is excluded from the prompt and the corresponding analysis requests are removed. We randomly selected 218 targets and reapplied the LLMs for drug repurposing, using AutoDock Vina to calculate the binding free energy.
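A minimal sketch of how the "w/o SF" variant differs from the full prompt is shown below; the prompt-assembly helper and its wording are simplified assumptions, not the exact production prompts (see Additional File 1: Figure S1).

```python
def build_prompt(summary: str, interactions=None) -> str:
    """Assemble the target prompt; passing interactions=None reproduces the 'w/o SF'
    setting, i.e. the spatial-interaction block and its analysis request are dropped."""
    parts = [f"Target structure summary:\n{summary}"]
    if interactions:  # full DrugReAlign prompt
        parts.append(
            "Known target-small-molecule spatial interactions:\n"
            + "\n".join(f"- {x}" for x in interactions)
        )
        parts.append("Analyze how these interactions constrain candidate drugs.")
    parts.append("Recommend approved drugs likely to bind this target, "
                 "ranked and each with a short rationale.")
    return "\n\n".join(parts)
```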

Table 3 Comparison of binding free energy (kcal/mol) between LLMs before and after the removal of drug-target spatial interaction information and traditional deep learning models

An intriguing observation is that both the GPT-4 and New Bing models show a notable decline in molecular docking scores when information on drug-target spatial interactions is removed. In contrast, the docking scores for the GPT-3.5 model improve significantly. With the removal of drug-target spatial interaction cues, the molecular docking scores of all three models converge to the same baseline level. These results indicate that GPT-3.5's ability to comprehend drug-target spatial interaction information is limited; information that a model does not understand is effectively perceived as noise and has a noticeable negative impact on the outcomes of drug repositioning [33]. For simple target summary information, however, the comprehension abilities of the models are largely similar, showing no significant differences. These findings highlight the importance of tailoring prompts to the strengths of different LLMs to achieve optimal prediction outcomes. For medllama3-v20, the removal of spatial interaction information led to a relatively smaller decline in docking scores than for GPT-4 and New Bing. Its performance without spatial cues (-7.29 kcal/mol) remained close to its original score, indicating that medllama3-v20 might rely less on spatial interaction data than the other models. This could be attributed to its design or fine-tuning approach, which may focus more on general structural or sequence-based information. As a result, medllama3-v20 maintained more stable predictions, suggesting it may be more resilient in scenarios where spatial interaction data is unavailable or incomplete.

Figure 3(b) displays the distribution of predicted drug quantities by the three LLMs when retaining drug-target spatial interaction information. Figure 3(c) shows the distribution of predicted drug quantities by the three models after removing this information. When spatial cues are omitted, both the total quantity of drugs predicted by GPT-3.5 and the number of unique predictions significantly increase, further supporting that drug-target spatial interaction information may be perceived as noise within the GPT-3.5 model. In contrast, the total number of drugs predicted by the GPT-4 model remains relatively unchanged, but there is a noticeable decrease in the number of unique drug predictions, suggesting that the novelty of drug predictions by the GPT-4 model could be compromised without spatial cues. For the New Bing model, both the total and unique numbers of predicted drugs significantly increase after spatial cues are removed. However, according to Table 3, the average molecular docking scores for drugs predicted by the New Bing model and their corresponding targets significantly decrease. This may be because drug-target spatial interaction information acts as an effective constraint, enhancing the predictive performance of the New Bing model. With the removal of this spatial interaction information, constraints are relaxed, potentially leading the New Bing model to predict a greater number of less effective drugs.

Furthermore, to compare the performance of traditional deep learning models with that of LLMs in drug repurposing tasks, we selected two state-of-the-art (SOTA) models, DrugBan [34] and TransformerCPI2.0 [35], and conducted experiments on the previously selected 218 targets. Specifically, we used 14,642 compounds from the BindingDB dataset organized by Bai et al. as the candidate compound library. We predicted the interactions between the targets and all candidate compounds, selected the top 5 drugs with the highest scores as the drug repurposing result for each target, and then performed molecular docking and the corresponding analysis; the binding free energy results are shown in Table 3.

The experimental data indicate that traditional deep learning models significantly underperform LLMs in drug repurposing tasks, with the two SOTA models achieving an average binding free energy of only about -6 kcal/mol. Additionally, deep learning models also significantly lack novelty in drug repurposing; among the 14,642 candidate compounds, only 285 compounds are present in the DrugBan model's predictions, and 561 in TransformerCPI2.0's predictions, while a single LLM's recommendations can reach more than 700 compounds. We believe this is primarily due to the significant difference between the target domain of drug repurposing and the source domain of the training data, with relevant targets not being present in the model's training data, thus leading to poor model performance. The cross-domain issue is a common shortcoming of single-task models designed for specific tasks, mainly limited by the training data and the model's parameters. In contrast, LLMs effectively overcome these deficiencies. The vast amount of training data and model parameters enable LLMs to exhibit good generalization capabilities across various tasks.

Evaluating LLM-driven drug repositioning with deep learning models

Deep learning models are widely used in fields such as drug repurposing and drug-target interaction prediction, but they are constrained by the available training data and generally lack strong generalization capabilities for unseen data. LLMs, by contrast, with their extensive training data and large number of parameters, hold promise for overcoming these limitations. To clarify the constraints of deep learning models on related problems, we utilized TransformerCPI2.0 and DrugBan to predict DTIs for the drug repurposing results produced by LLMs, as shown in Fig. 4. Most data points in the graphs deviate from the diagonal line, indicating significant disagreement between the two models' predictions, and most data points were deemed non-interacting, which contradicts the results of the large-scale molecular docking experiments. These results suggest that the drug recommendations made by LLMs are novel and difficult to predict with traditional deep learning methods. Thus, LLMs can indeed break through the limitations that training data and parameter counts impose on traditional models, and they hold great significance for the field of drug discovery.

Fig. 4

Analysis of deep learning predictions for LLM-recommended drugs and dimensionality reduction visualization of pre-trained molecular representation models. a-c Drug-target interaction predictions for drugs recommended by GPT-4, GPT-3.5, and NewBing, respectively. d Visualization of dimensionality reduction using T-SNE on drug recommendations from LLMs

Besides, to further explore the reliability of LLMs in drug repurposing recommendations, we selected experimental protein targets with a pairwise sequence similarity greater than 90% (PDB IDs: 1EQZ: Entity 4 [36], 3AFA: Entity 1 [37], 1U35: Entity 2 [38], 3LEL: Entity 1 [39]) for our experiments. We collected drug repurposing results for these four protein targets from the different LLMs (GPT-4, GPT-3.5, New Bing). Then, using MolFormer [27], we embedded the recommended drugs and performed dimensionality reduction with the t-SNE algorithm.
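A minimal sketch of this embedding-and-projection step is shown below, assuming the publicly released MoLFormer checkpoint on Hugging Face (ibm/MoLFormer-XL-both-10pct, loaded with trust_remote_code) and scikit-learn's t-SNE implementation; the SMILES strings are placeholders, not the actual recommended drugs.

```python
import torch
from sklearn.manifold import TSNE
from transformers import AutoModel, AutoTokenizer

# Assumed Hugging Face checkpoint name for MolFormer; substitute the checkpoint you use.
CKPT = "ibm/MoLFormer-XL-both-10pct"
tokenizer = AutoTokenizer.from_pretrained(CKPT, trust_remote_code=True)
model = AutoModel.from_pretrained(CKPT, trust_remote_code=True).eval()


def embed_smiles(smiles_list):
    """Mean-pooled MolFormer embeddings for a list of SMILES strings."""
    batch = tokenizer(smiles_list, padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state    # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)     # (batch, seq, 1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()


# Placeholder SMILES standing in for the drugs recommended for the four similar targets.
drugs = ["CCO", "CC(=O)OC1=CC=CC=C1C(=O)O", "CCN(CC)CC"]
coords = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(embed_smiles(drugs))
print(coords)  # 2-D coordinates of the kind plotted in Fig. 4(d)
```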

As shown in Fig. 4(d), the majority of the drugs recommended by the LLMs cluster within a small region of the feature space, indicating that the drugs recommended for protein targets with similar sequences are highly similar in feature space and likely share similar pharmacological effects and chemical structures. This experimental result further supports the credibility of the drug recommendations made by LLMs.
