Expanding potential targets of herbal chemicals by node2vec based on herb–drug interactions

Data collection

Western drugs collection

To obtain reliable drug information related to CVD, we followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) [29]. Drugs were retrieved from the DrugBank database [30], with the DrugCentral database [31] serving as a supplement. Specifically, we filtered the databases by ATC code, taking all subgroups of “C CARDIOVASCULAR SYSTEM” and two subgroups of “B BLOOD AND BLOOD FORMING ORGANS,” namely “B01 ANTITHROMBOTIC AGENTS” and “B02 ANTIHEMORRHAGICS,” to acquire CVD-related drugs. For these drugs, we recorded four aspects of drug information (“Approval status,” “Known action,” “Organism,” and “CAS/SMILES”) to ensure that the chosen drugs are still marketed for human use and have defined structures. One aspect of the drugs’ target information, “Uniprot ID of drug targets,” was used for target standardization.

Herbal chemicals collection

The chemical information of DS and CX was collected from the literature and the following chemical databases: Traditional Chinese Medicine System Pharmacology Database (TCMSP; http://tcmspw.com/tcmsp.php) [32]; Traditional Chinese Medicines Integrated Database (TCMID; http://www.megabionet.org/tcmid/) [33]; Shanghai Institute of Organic Chemistry of CAS Chemistry Database (http://www.organchem.csdb.cn); and The Encyclopedia of Traditional Chinese Medicine 2.0 (ETCM; http://www.tcmip.cn/ETCM2/front/#) [34, 35]. PubChem (http://pubchem.ncbi.nlm.nih.gov) [36] was used to standardize chemicals and to supplement relevant chemical data, such as PubChem CID and SMILES information. Essential amino acids, monosaccharides, and disaccharides were excluded. To ensure the reliability of herbal chemical targets, we adopted the “bioassay results” from PubChem, which provide detailed and credible activity information; only results labeled as “active” in the activity column were chosen, and all targets were standardized by UniProt (https://www.uniprot.org/) [37].

Data processing and dataset preparation

Herb–drug interactions identification

Three kinds of interactions were involved: the chemical–target connection (CTC), the similarity of chemicals (chemical–chemical connection, CCC), and the interaction of targets (protein–protein interactions, PPI). Firstly, the direct CTC was acquired during data collection after a strict screening process. Secondly, CCC was constructed by structural similarity analysis. The ChemmineR [38] toolkit running in RStudio was used to perform a fingerprint-based chemical similarity search, and two chemicals were connected if their Tanimoto coefficient was ≥ 0.6 [39]. Thirdly, a PPI network was constructed with STRING [40], which evaluates how closely proteins are associated by assigning each pair a score from 0 to 1. Only protein pairs with a score of 0.9 or above were connected to ensure close target relationships.
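The two thresholding rules above can be illustrated with a minimal sketch. It is not the ChemmineR or STRING workflow itself: fingerprints are represented as plain sets of on-bit indices, and the names `tanimoto`, `ccc_edges`, `ppi_edges`, and all toy data are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def ccc_edges(fingerprints, threshold=0.6):
    """Connect every pair of chemicals whose Tanimoto coefficient is >= threshold."""
    names = sorted(fingerprints)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if tanimoto(fingerprints[a], fingerprints[b]) >= threshold
    ]


def ppi_edges(scored_pairs, threshold=0.9):
    """Keep only protein pairs whose interaction score is >= threshold."""
    return [(a, b) for a, b, score in scored_pairs if score >= threshold]


# Hypothetical toy data (not taken from the study)
fingerprints = {
    "chem_A": {1, 2, 3, 4, 5},
    "chem_B": {1, 2, 3, 4, 6},   # shares 4 of 6 distinct bits with chem_A
    "chem_C": {7, 8, 9},
}
scored_pairs = [("P1", "P2", 0.95), ("P1", "P3", 0.70)]

ccc = ccc_edges(fingerprints)   # only chem_A / chem_B pass the 0.6 cutoff
ppi = ppi_edges(scored_pairs)   # only P1 / P2 pass the 0.9 cutoff
```

In this toy case, chem_A and chem_B have a coefficient of 4/6 and are therefore linked, while chem_C stays unconnected.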

Construction of groups, datasets and networks

The data on herbs and approved drugs formed different data groups. The DS-drug and CX-drug groups were prepared directly. In addition, because of their synergistic therapeutic effects in clinical use, DS and CX are regarded as a herbal pair that acts together. Therefore, a DS-CX-drug group was also formed. These three groups were analyzed in the same way.

Among the three kinds of interactions, CTC is indispensable. In theory, structural similarity and target interaction information provide extra information and improve the accuracy of predictions, but this needs to be confirmed by comparison. Therefore, three types of datasets were constructed and compared within every group. The first type contained only direct chemical–target interactions (CTC). Following the rule that structurally similar molecules have similar biological activities [41], the second type added chemical structure similarity (CTC and CCC). The third type added correlations among proteins to the second type to supply more information (CTC, CCC, and PPI). In total, there are nine datasets, three for each group.
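As an illustration of how the three dataset types nest, the following sketch builds them for one group as unions of edge sets. The function name `build_datasets` and the toy edges are hypothetical; the study's actual file formats are not described here.

```python
def build_datasets(ctc, ccc, ppi):
    """Return the three nested edge sets that are compared within one group."""
    ctc, ccc, ppi = set(ctc), set(ccc), set(ppi)
    return {
        "CTC": ctc,                        # type 1: chemical-target links only
        "CTC+CCC": ctc | ccc,              # type 2: adds chemical similarity links
        "CTC+CCC+PPI": ctc | ccc | ppi,    # type 3: adds protein-protein links
    }


# Hypothetical toy edges for a single group (e.g. one herb plus approved drugs)
ctc = [("chem_1", "P1"), ("drug_1", "P2")]
ccc = [("chem_1", "drug_1")]
ppi = [("P1", "P2")]

datasets = build_datasets(ctc, ccc, ppi)
sizes = {name: len(edges) for name, edges in datasets.items()}
```

Repeating this for the DS-drug, CX-drug, and DS-CX-drug groups gives the nine datasets.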

Link prediction

Identifying potential targets of TM complex systems is a key problem. Prof. Shao Li proposed the concept of network targets [42], which provides a theoretical basis for solving this problem. Li's team subsequently published work on the mechanisms of action of TCM prescriptions [43, 44], biomolecular markers of TCM evidence [45], etc. Because GE identifies potential targets through networks, the concept of network targets provides theoretical support for its use in target prediction.

In this study, we transformed the problem of target identification into link prediction, which predicts whether a connection exists between two nodes. We selected one graph embedding (GE) representation algorithm, node2vec, and four traditional algorithms (Adamic-Adar, Jaccard similarity coefficient, preferential attachment, and spectral clustering) and evaluated how they perform on a chemical–target prediction task. We validated the results by checking the average precision (AP) and the area under the receiver operating characteristic (AUROC) scores with tenfold cross-validation. Each dataset was separated into a training set, a validation set, and a test set at a ratio of 6:3:1. A diagram of the methods used to explore potential targets is shown in Fig. 2, using the dataset with CTC, CCC, and PPI as an example to illustrate the process.
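A 6:3:1 split of known links can be sketched as follows. This is only an assumption about how such a split might be done: the function name `split_edges`, the fixed seed, and the toy edge list are hypothetical, and the sketch does not cover negative-edge sampling or the cross-validation loop.

```python
import random


def split_edges(edges, ratios=(0.6, 0.3, 0.1), seed=0):
    """Shuffle edges and cut them into train / validation / test parts."""
    edges = list(edges)
    random.Random(seed).shuffle(edges)          # reproducible shuffle
    n_train = round(len(edges) * ratios[0])
    n_val = round(len(edges) * ratios[1])
    train = edges[:n_train]
    val = edges[n_train:n_train + n_val]
    test = edges[n_train + n_val:]              # remainder goes to the test set
    return train, val, test


# Hypothetical toy edge list: 20 chemical-target links
edges = [(f"chem_{i}", f"P{i % 5}") for i in range(20)]
train, val, test = split_edges(edges)
```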

Fig. 2

Diagram of the methods for applying algorithms to link prediction. In processes A, B and D, chemicals of TM are colored in orange and approved drugs in blue. Circular nodes labeled “C” represent chemicals and hexagons labeled “T” represent targets. Black lines represent known connections between chemicals and targets. A shows the dataset from the curated database. In B, two individual networks are connected by integrating the links between chemicals and targets with the CCC and PPI, which are labeled as green lines. C mainly reflects the operating principle of node2vec. In D, newly predicted interactions are labeled as red solid and dashed edges. In this research, we focused on the interactions between chemicals of TM and targets of approved drugs, which are labeled as red solid lines.

Algorithms

Node2vec

The node2vec algorithm, introduced by Aditya Grover and Jure Leskovec in 2016 [25], maps the description of each node in a graph to a vector. Developed from DeepWalk [24], node2vec samples node information by biased random walks. The basic idea of the algorithm is to form a low-dimensional vector space by extracting features from a graph through a combination of breadth-first and depth-first search. Node2vec uses two parameters to implement its random walk strategy. The return parameter \(p\) controls the probability of the walk revisiting the node it has just left; a high value of \(p\) makes the walk less likely to return and more likely to reach nodes not yet visited. The in–out parameter \(q\) controls whether the search stays close to the base node (inward) or moves away from it (outward). After the data transformation step, a logistic regression algorithm is applied to the final classification task based on the vector representation of the graph.
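The role of the two parameters can be seen in a minimal sketch of a single biased walk, written here in plain Python. It follows the second-order weighting scheme of the node2vec paper (weight 1/p for stepping back, 1 for a neighbor shared with the previous node, 1/q otherwise); the function name `node2vec_walk`, the toy graph, and the parameter values are hypothetical and not the settings used in this study.

```python
import random


def node2vec_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """Generate one biased random walk on an undirected graph {node: set(neighbors)}."""
    rng = rng or random.Random(0)
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = sorted(adj[cur])
        if not nbrs:
            break
        if len(walk) == 1:
            walk.append(rng.choice(nbrs))    # first step is unbiased
            continue
        prev = walk[-2]
        weights = []
        for x in nbrs:
            if x == prev:
                weights.append(1.0 / p)      # return to the previous node
            elif x in adj[prev]:
                weights.append(1.0)          # stay at distance 1 from prev (BFS-like)
            else:
                weights.append(1.0 / q)      # move further away (DFS-like)
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk


# Hypothetical toy graph
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b", "e"},
    "e": {"d"},
}
walk = node2vec_walk(adj, "a", length=6, p=4.0, q=0.5)
```

In node2vec, many such walks are then fed to a skip-gram model to learn the node vectors; that step is omitted here.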

Adamic-Adar (AA)

The Adamic-Adar algorithm, a frequency-weighted common neighbors algorithm, was introduced by Eytan Adar and Lada Adamic in 2003 [46]. The logarithmic function assigns a weight to each neighbor shared by two nodes, so that shared neighbors with fewer connections contribute more. In short, two nodes with more shared neighbors are more likely to be linked.

It is defined as:

$$AA(u, v) = \sum_{w \in N(u) \cap N(v)} \frac{1}{\log \left| N(w) \right|}$$

where \(N(u)\) is the set of neighbors connected to \(u\).
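A minimal sketch of the Adamic-Adar score on an adjacency dictionary is given below. The function name `adamic_adar` and the toy graph are hypothetical, and the natural logarithm is assumed.

```python
import math


def adamic_adar(adj, u, v):
    """Sum 1 / log(degree) over the neighbors shared by u and v."""
    return sum(
        1.0 / math.log(len(adj[w]))
        for w in adj[u] & adj[v]
        if len(adj[w]) > 1                  # guard against log(1) = 0
    )


# Hypothetical toy graph: a and b share the neighbors c and d (each of degree 2)
adj = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b"},
    "e": {"b"},
}
aa_score = adamic_adar(adj, "a", "b")       # equals 2 / ln(2)
```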

Jaccard similarity coefficient (JS)

The Jaccard similarity coefficient algorithm was first introduced by Paul Jaccard and reformulated by Tanimoto TT [47]. This algorithm is commonly used to calculate the diversity or the similarity between two nodes.

The index is defined as:

$$JS(u, v) = \frac{\left| N(u) \cap N(v) \right|}{\left| N(u) \cup N(v) \right|}$$

where \(N(u)\) is the set of neighbors connected to \(u\).
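The Jaccard index of two nodes can be sketched in a few lines. The function name `jaccard` and the toy graph are hypothetical.

```python
def jaccard(adj, u, v):
    """Shared neighbors of u and v divided by all neighbors of u or v."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


# Hypothetical toy graph: a and b share 2 of 3 distinct neighbors
adj = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b"},
    "e": {"b"},
}
js_score = jaccard(adj, "a", "b")
```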

Preferential attachment (PA)

The Preferential attachment algorithm was introduced in 1925 by Udny Yule and popularly applied in the Barabási–Albert model by Albert-László Barabási and Réka Albert. This algorithm considers that a node with more connected neighbors is more likely to have a new link.

It is defined as:

$$PA(u, v) = \left| N(u) \right| \times \left| N(v) \right|$$

where \(N(u)\) is the set of neighbors connected to \(u\).
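Preferential attachment reduces to a product of node degrees, as the following sketch shows. The function name `preferential_attachment` and the toy graph are hypothetical.

```python
def preferential_attachment(adj, u, v):
    """Product of the neighbor counts (degrees) of u and v."""
    return len(adj[u]) * len(adj[v])


# Hypothetical toy graph: a has 2 neighbors, b has 3
adj = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b"},
    "e": {"b"},
}
pa_score = preferential_attachment(adj, "a", "b")
```

Unlike the two indices above, this score does not depend on shared neighbors, so it can be non-zero for nodes that have none.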

Spectral clustering (SC)

Spectral clustering, based on a normalized Laplacian matrix, belongs to the clustering algorithms family. It performs best when the original data is highly non-convex. Given an \(n \times n\) adjacency matrix \(A\) of the graph with \(n\) nodes, a Laplacian matrix can be defined as:

where \(D\) is the \(n \times n\) diagonal degree matrix of \(A\).

After the data transformation step, Euclidean distance or k-nearest neighbors (KNN) algorithm will be applied on the Laplacian matrix with features from eigenvectors.
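For illustration only, the sketch below computes one common normalized Laplacian, the symmetric form I − D^(−1/2) A D^(−1/2). The text names a normalized Laplacian without stating which variant was used, so this choice is an assumption, as are the function name `normalized_laplacian` and the toy graph. The eigendecomposition that yields the embedding features is omitted.

```python
import math


def normalized_laplacian(A):
    """Symmetric normalized Laplacian of an adjacency matrix given as nested lists.

    Assumed variant: L = I - D^(-1/2) A D^(-1/2), with D the diagonal degree matrix.
    """
    n = len(A)
    deg = [sum(row) for row in A]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in deg]
    return [
        [
            (1.0 if i == j else 0.0) - inv_sqrt[i] * A[i][j] * inv_sqrt[j]
            for j in range(n)
        ]
        for i in range(n)
    ]


# Hypothetical toy graph: a path of three nodes, 0 - 1 - 2 (no isolated nodes)
A = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
L = normalized_laplacian(A)
```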

Evaluation

Average precision (AP) score

The AP score is one of the most popular and useful indicators of the prediction performance of a classification model. The score summarizes the Precision value \(P\) as the Recall value \(R\) increases from 0 to 1 across decision thresholds. The Precision value and the Recall value are defined as:

$$P = \frac{TP}{TP + FP}$$

$$R = \frac{TP}{TP + FN}$$

Once the Precision value and Recall value are calculated, the AP score can be computed by the equation given below:

$$AP = \sum_{n} \left( R_{n} - R_{n-1} \right) P_{n}$$

where \(R_{n}\) and \(P_{n}\) are the Recall value and the Precision value at the nth threshold.
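The AP sum can be worked through on a small example. The sketch below treats each rank in the score-sorted list as one threshold and assumes distinct scores; the function name `average_precision` and the toy labels and scores are hypothetical.

```python
def average_precision(y_true, scores):
    """AP = sum over thresholds of (R_n - R_{n-1}) * P_n, one threshold per rank."""
    ranked = sorted(zip(scores, y_true), key=lambda t: -t[0])   # highest score first
    n_pos = sum(y_true)
    tp = 0
    ap = 0.0
    prev_recall = 0.0
    for k, (_, label) in enumerate(ranked, start=1):
        tp += label
        precision = tp / k          # TP / (TP + FP) among the top k predictions
        recall = tp / n_pos         # TP / (TP + FN)
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Hypothetical toy example: 1 = true link, 0 = no link
y_true = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
ap = average_precision(y_true, scores)   # 0.5 * 1 + 0.5 * (2/3) = 5/6
```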

Area under the receiver operating characteristic (AUROC) score

The AUROC score is the probability that a uniformly drawn random positive is ranked before a uniformly drawn random negative. It remains informative even if the dataset is imbalanced. For a useful classifier, the value ranges from 0.5 (no better than random ranking) to 1 (perfect ranking), so a higher value indicates better performance of the classification model.

Before calculating the AUROC score, a receiver operating characteristic (ROC) curve must be constructed. A ROC curve is defined by two quantities: the true positive rate (TPR) and the false positive rate (FPR). TPR is the same as the Recall value given above. FPR is defined as:

$$FPR = \frac{FP}{FP + TN}$$

The x-axis of a ROC curve is FPR, and the y-axis is TPR.
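The ranking interpretation of AUROC allows a short sketch that avoids drawing the curve: counting the fraction of positive–negative pairs in which the positive scores higher (ties counted as one half) gives the same value as the area under the ROC curve. The function name `auroc` and the toy data are hypothetical.

```python
def auroc(y_true, scores):
    """Fraction of (positive, negative) pairs in which the positive is ranked higher."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy example: 1 = true link, 0 = no link
y_true = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
auc = auroc(y_true, scores)   # 3 of the 4 positive-negative pairs are ordered correctly
```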

Molecular docking

To verify the results of the GE link prediction, virtual molecular docking was used. The crystal structures of the targets were downloaded from the RCSB PDB (https://www.rcsb.org/) [48], and only X-ray structures with a resolution better than 3 Å were selected and saved as pdb format files. The ligand and receptor were separated with Discovery Studio 4.5 [49]. AutoDock Tools was used to prepare pdbqt format files. The grid boxes were adjusted to cover the entire pocket. After preparing the protein files, we retrieved the structures of the TM chemicals and Western drugs from the PubChem database, saved them in sdf format, and converted them to pdbqt format with OpenBabel for docking. AutoDock Vina 1.1.2 [50] was used to simulate the potential interactions between the selected chemicals and the targets.

Experimental verification

Besides virtual molecular docking, a cellular thermal shift assay (CETSA) and measurement of mRNA expression after treatment with the predicted compounds were applied to verify the predicted results.

Chemicals and reagents

Ginsenoside Rb1, neocryptotanshinone, caffeic acid and ligustilide (the purities of all standards were higher than 98% by high-performance liquid chromatography analysis) were purchased from Chengdu Pufeide Biotech Co., Ltd. (Chengdu, China).

TRIzol™ Reagent, Fetal bovine serum (FBS), 0.25% Trypsin–EDTA (w/v), Dulbecco’s modified eagle's medium (DMEM), penicillin–streptomycin (10,000 U/mL, P/S), and phosphate-buffered saline (PBS) were purchased from Thermo Fisher Scientific (Waltham, MA, USA). Human MTNR1A polyclonal antibody and GGT1 polyclonal antibody were purchased from CLOUD-CLONE CORP. (CCC, USA). Anti-rabbit IgG, HRP-linked antibody was purchased from Cell Signaling Technology (Danvers, MA, USA). β-actin was purchased from COHESION BIOSCIENCES (SUZHOU, CHINA).

Cell culture

Human umbilical vein endothelial cells (HUVECs) were supplied by American Type Culture Collection (Manassas, Virginia, USA) and cultured in DMEM medium supplemented with 10% FBS and 1% P/S at 37 °C in an atmosphere of 95% humidity and 5% CO2. HUVECs were subjected to cell experiments when cultured to 90% confluence.

CETSA

HUVECs were subcultured in a 100 mm cell culture dish, lysed with RIPA lysis buffer containing PMSF and protease inhibitor cocktail on ice for 10 min, and then centrifuged (12,000 × g, 10 min) at 4 °C. Cell lysates were incubated with or without 20 μM compound (caffeic acid or ligustilide) under shaking at 4 °C overnight. The protein concentration was adjusted to 2 μg/μL using RIPA lysis buffer. Aliquots of 40 μL of cell lysate were transferred to new tubes, and each tube was heated for 2.5 min at a different temperature (53–72 °C) using a ThermoMixer C (Eppendorf, USA). After centrifugation (12,000 × g for 10 min), 30 μL of each supernatant was incubated with 10 μL of 5 × SDS-PAGE loading buffer at 95 °C for 10 min before the western blotting assay.

Quantitative real-time RT-PCR

Total RNA was extracted from HUVECs with TRIzol Reagent according to the manufacturer’s protocol. The total RNA content was measured with a NanoVue spectrophotometer (Biochrom, United Kingdom). RNA was reverse transcribed to cDNA using the PrimeScript™ RT Reagent Kit (TaKaRa Bio Inc., Kusatsu, Japan) according to the manufacturer’s instructions. Real-time PCR was performed on a ViiA 7 Real-Time PCR System (Thermo Fisher Scientific, MA, USA). The primers were synthesized by IGE BIOTECHNOLOGY LTD (Guangzhou, China), and the sequences were as follows: GGT1, forward TGACGTACCACCGCATCGTAGA and reverse CAGCGAAGAACTCGGAGGTCAT; MTNR1A, forward CTGGTCATCCTGTCGGTGTATC and reverse TCGACATCAGCACCAACGGGTA; β-actin, forward CACCATTGGCAATGAGCGGTTC and reverse AGGTCTTTGCGGATGTCCACGT.

The fold change of mRNA was determined relative to a blank control after normalizing to β-actin in each sample using the delta-delta Ct method.
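The delta-delta Ct calculation can be written out as a short worked example. The sketch assumes the standard form of the method, in which the amount of product doubles with every cycle; the function name `fold_change` and the Ct values are hypothetical and are not data from this study.

```python
def fold_change(ct_target_treated, ct_ref_treated, ct_target_control, ct_ref_control):
    """Relative expression by the delta-delta Ct method (assumes doubling per cycle)."""
    delta_treated = ct_target_treated - ct_ref_treated    # normalize to the reference gene
    delta_control = ct_target_control - ct_ref_control
    delta_delta = delta_treated - delta_control           # compare with the blank control
    return 2 ** (-delta_delta)


# Hypothetical Ct values: target gene and beta-actin, treated versus blank control
fc = fold_change(
    ct_target_treated=24.0, ct_ref_treated=18.0,
    ct_target_control=26.0, ct_ref_control=18.0,
)   # delta-delta Ct = (24 - 18) - (26 - 18) = -2, so the fold change is 2**2 = 4
```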
