Predicting protein–protein interactions between banana and Fusarium oxysporum f. sp. cubense race 4 integrating sequence and domain homologous alignment and neural network verification

Datasets

We first downloaded 45,856 banana protein sequences from https://banana-genome-hub.southgreen.fr and 14,459 Foc4 protein sequences from ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/350/365/GCA_000350365.1_Foc4_1.0. Secondly, we downloaded all PPIs of six model species (Arabidopsis thaliana, nematode, Drosophila, yeast, Escherichia coli, and human) from the database MINT at https://mint.bio.uniroma2.it/, the database DIP at https://dip.doe-mbi.ucla.edu/dip/main.cgi, the database TAIR at https://www.arabidopsis.org/, the database BioGRID at https://downloads.thebioged.org/biogerid/release-archive/biogerid-3.5.166/, and the database IntAct at https://www.ebi.ac.uk/intact/. Specifically, we obtained 118,921 PPIs from MINT, 76,881 PPIs from DIP, 2656 PPIs from TAIR, and 183,768 PPIs from IntAct. Finally, we downloaded 62,782 pathogen-host interspecific protein interactions from the database HPIDB at http://hpidb.igbb.msstate.edu/. All domain-domain interaction template PPIs were downloaded from the database 3DID [30] at https://3did.irbbarcelona.org/. The corresponding protein sequences of the above six species were downloaded from the database UniProt at https://www.uniprot.org/. Because different databases may use different IDs for the same protein, we used the software tool BioMart [31] to convert the different protein IDs into uniform IDs.

Methods

We first downloaded the experimentally verified intra-species and inter-species PPIs from the databases as interaction templates. Next, we applied the interolog method and the domain-domain method to predict the data sets of PPIs between banana and Foc4 and to find the PPIs predicted by both methods. Thirdly, we used the conjoint triad (CT) [32] and auto covariance (AC) [33] methods to encode protein sequence features and thereby capture the structure information of continuous and discontinuous protein sequence segments. Fourthly, we verified the predicted PPI data sets for banana and Foc4 by using an LSTM neural network with the five-fold cross-validation method and the independent test method. Finally, we computed the accuracy, sensitivity, specificity, receiver operating characteristic (ROC) curve, and area under the curve (AUC) of the predicted results. Figure 1 shows the process of predicting PPIs between banana and Foc4, in which iPPIs indicate interolog PPIs, dPPIs represent domain-domain PPIs, and DDI denotes domain-domain interactions.

Fig. 1 Process of predicting PPIs between banana and Foc4, where the solid arrow represents ‘control flow direction’ and the dashed arrow denotes ‘data flow direction’

Predicting PPIs between banana and Foc4

The interolog method is a means of predicting interactions by homology. Its main idea is that homologous proteins may have similar properties. If two proteins A and B have been experimentally verified to interact with each other, and two proteins A' and B' are homologs of A and B respectively, then according to the principle that homologous proteins have similar properties, proteins A' and B' may also interact with each other [23]. The idea of the domain-domain interaction prediction method is that if protein C contains a domain that can interact with a domain contained in protein D, then proteins C and D may interact with each other [24].

Based on the protein sequence data of banana and Foc4, we used the interolog method and domain-domain method to predict the interactions between banana and Foc4. We selected the transmembrane or secreted proteins in Foc4 as the protein infecting banana [26] and obtained the final PPIs data set between banana and Foc4.

For the interolog method, we used the local sequence alignment tool BLAST to find homologous proteins, where the E-value threshold was set to 0.00001, the sequence identity threshold was set to 30%, and the coverage threshold was set to 80% [26, 27]. Firstly, the protein sequences of the six model species were aligned with those of banana and Foc4 to find the orthologous proteins in banana and Foc4. Then, the host protein sequences in the database HPIDB were aligned with the banana protein sequences, and the pathogen protein sequences were aligned with the Foc4 protein sequences, to obtain interspecific homologous proteins.
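As a minimal sketch of the thresholding and transfer steps described above, the interolog rule can be written in Python as follows. The `Hit` record layout, the function names, the inclusive comparisons, and the reading of coverage as the percentage of the query covered by the alignment are illustrative assumptions; this is not the pipeline actually used in the study.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One BLAST hit (hypothetical layout, e.g. parsed from tabular output)."""
    query: str       # banana or Foc4 protein ID
    subject: str     # template protein ID
    evalue: float
    identity: float  # percent identity
    coverage: float  # percent coverage (assumed: of the query)


def passes_thresholds(hit, max_evalue=1e-5, min_identity=30.0, min_coverage=80.0):
    """Apply the E-value, identity, and coverage thresholds from the text."""
    return (hit.evalue <= max_evalue
            and hit.identity >= min_identity
            and hit.coverage >= min_coverage)


def homolog_map(hits):
    """Map each template protein to the set of query proteins homologous to it."""
    mapping = {}
    for hit in hits:
        if passes_thresholds(hit):
            mapping.setdefault(hit.subject, set()).add(hit.query)
    return mapping


def interolog_pairs(template_ppis, banana_hits, foc4_hits):
    """Transfer each template PPI (A, B) to all (banana A', Foc4 B') homolog pairs."""
    banana_of = homolog_map(banana_hits)
    foc4_of = homolog_map(foc4_hits)
    predicted = set()
    for a, b in template_ppis:
        # A template interaction is unordered, so try both orientations.
        for x, y in ((a, b), (b, a)):
            for banana_protein in banana_of.get(x, ()):
                for foc4_protein in foc4_of.get(y, ()):
                    predicted.add((banana_protein, foc4_protein))
    return predicted
```

In this sketch a template interaction contributes a predicted pair only when one partner has a banana homolog and the other has a Foc4 homolog that both pass all three thresholds.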

For the domain-domain method, we submitted the protein sequences of banana and Foc4 to the database 3DID to find the domains contained in each protein, where the E-value threshold was set to 0.00001 and the sequence identity threshold was set to 90% [26]. If a banana protein and a Foc4 protein contain a pair of domains that interact according to the database 3DID, this pair of proteins is considered likely to interact with each other [34].
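The domain-domain rule above can be sketched in a few lines of Python. The dictionaries mapping proteins to domain sets, the function name, and the toy domain names in the usage note are hypothetical; the domain assignment itself is assumed to have been done beforehand.

```python
from itertools import product


def domain_based_pairs(banana_domains, foc4_domains, ddi_templates):
    """Predict (banana, Foc4) pairs that share at least one interacting domain pair.

    banana_domains / foc4_domains: dict mapping protein ID -> set of domain IDs.
    ddi_templates: iterable of (domain, domain) pairs known to interact.
    """
    # Domain-domain interactions are unordered, so store them as frozensets.
    ddi = {frozenset(pair) for pair in ddi_templates}
    predicted = set()
    for (banana, b_doms), (foc4, f_doms) in product(
            banana_domains.items(), foc4_domains.items()):
        if any(frozenset((d1, d2)) in ddi for d1 in b_doms for d2 in f_doms):
            predicted.add((banana, foc4))
    return predicted
```

For example, with `banana_domains = {"Ma01": {"DomA"}}`, `foc4_domains = {"EMT1": {"DomB"}}`, and the template `("DomB", "DomA")`, the function returns the single pair `("Ma01", "EMT1")`.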

We applied the two software tools SignalP [35] and WoLF PSORT [36] with their default parameter values to find secretory proteins. If a protein is predicted by SignalP to contain a signal peptide and is localized as extracellular by WoLF PSORT, the protein is regarded as a secretory protein. In addition, we used the software TMHMM 2.0 [37] to predict transmembrane proteins among the Foc4 proteins. If the number of transmembrane helices predicted by TMHMM is greater than 1, the protein is considered to be a transmembrane protein [38].
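The two selection rules above amount to a simple filter over per-protein predictions. The following sketch assumes the tool outputs have already been parsed into tuples; the tuple layout, the function names, and the literal label "extracellular" are illustrative assumptions and do not reproduce the raw output format of any of the tools.

```python
def is_secretory(has_signal_peptide, localization):
    """Secretory: signal peptide predicted AND extracellular localization."""
    return has_signal_peptide and localization == "extracellular"


def is_transmembrane(n_helices):
    """Transmembrane: more than one predicted transmembrane helix."""
    return n_helices > 1


def select_foc4_candidates(records):
    """Keep Foc4 proteins that are secretory or transmembrane.

    records: iterable of (protein_id, has_signal_peptide, localization, n_helices),
    a hypothetical layout for the parsed predictions.
    """
    selected = set()
    for protein_id, has_sp, localization, n_helices in records:
        if is_secretory(has_sp, localization) or is_transmembrane(n_helices):
            selected.add(protein_id)
    return selected
```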

PPIs coding of sequence features

Proteins are biomolecules composed of amino acids, and protein sequences are represented by the 20 standard amino acids. Encoding the sequence features of a protein means extracting a feature vector from the protein sequence, that is, transforming the original sequence into a fixed-length numerical vector. In recent years, several methods have been proposed to predict PPIs using only protein sequence information, but these methods cannot fully capture interaction information from continuous and discontinuous amino acid fragments at the same time.

To solve the above problem, the conjoint triad (CT) method and the auto covariance (AC) method were used to encode sequence features. The AC method mainly considers the proximity effect and uses both the continuous and discontinuous sequence information in a protein sequence. In the CT method, the 20 amino acids are divided into seven classes according to the dipoles and volumes of their side chains. Every three consecutive amino acids are regarded as a basic unit, and the frequency of each class triad over all basic units in a protein is counted. The number of possible kinds of basic unit is 7 × 7 × 7 = 343, so each protein yields a 343-dimensional vector, and the final 686-dimensional feature vector contains the features of the two proteins interacting with each other. Min–max normalization was performed on the feature vectors to map the encoding of each protein into the interval [0,1], so as to remove the influence of protein length on frequency counting. Let \(f_i\) represent the i-th component of a protein feature vector; the i-th component of the normalized protein feature vector, \(d_i\), is computed as follows [32]:

$$d_{i} = \frac{f_{i} - \min \{ f_{1} ,f_{2} , \ldots ,f_{343} \} }{\max \{ f_{1} ,f_{2} , \ldots ,f_{343} \} },\quad i = 1,2,3, \ldots ,343$$

(1)
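A minimal Python sketch of the CT encoding and the normalization in Eq. (1) is given below. The seven amino acid classes follow the grouping commonly used with the conjoint triad method and are stated here as an assumption, as are the function name and the choice to skip non-standard residues; the normalization subtracts the minimum frequency and divides by the maximum frequency.

```python
# Seven amino acid classes (assumed grouping by side-chain dipole and volume).
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLASS = {aa: idx for idx, group in enumerate(CLASSES) for aa in group}


def conjoint_triad(sequence):
    """Return the normalized 343-dimensional conjoint triad vector of a protein."""
    # Assumption: residues outside the 20 standard amino acids are skipped.
    classes = [AA_TO_CLASS[aa] for aa in sequence if aa in AA_TO_CLASS]
    counts = [0] * 343
    # Slide a window of three consecutive residues over the sequence.
    for a, b, c in zip(classes, classes[1:], classes[2:]):
        counts[a * 49 + b * 7 + c] += 1
    lowest, highest = min(counts), max(counts)
    if highest == 0:  # sequence shorter than three residues
        return [0.0] * 343
    # d_i = (f_i - min f) / max f, i = 1, ..., 343
    return [(f - lowest) / highest for f in counts]


def encode_pair(seq_a, seq_b):
    """Concatenate the CT vectors of two proteins into one 686-dimensional vector."""
    return conjoint_triad(seq_a) + conjoint_triad(seq_b)
```

Because every component is divided by the largest triad count of the same protein, long and short proteins are mapped to the same [0,1] range.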

In the AC method, the interactions between amino acids are reflected by seven physicochemical properties of amino acids: hydrophobicity, hydrophilicity, net charge index, polarity, polarizability, solvent accessible surface area, and side chain volume. Each amino acid is represented by the normalized values of these seven descriptors, so a protein sequence of length n is transformed into an n × 7 matrix. The initial values of the seven physicochemical properties of the 20 amino acids can be found in [33]. The auto covariance \(AC_{lag,j}\) is computed as follows [33]:

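$$AC_{lag,j} = \frac{1}{n - lag}\sum\limits_{i = 1}^{n - lag} {\left( {X_{i,j} - \frac{1}{n}\sum\limits_{i = 1}^{n} {X_{i,j} } } \right)\left( {X_{(i + lag),j} - \frac{1}{n}\sum\limits_{i = 1}^{n} {X_{i,j} } } \right)}$$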

(2)

where lag represents the distance between the two amino acid residues, n is the length of the protein sequence X, and \(X_{i,j}\) represents the j-th descriptor at the i-th position of the protein sequence. In this paper, seven physicochemical properties are used and the value of lag is set to 30 [39], so after the AC transformation each protein sequence is represented by a 7 × 30 = 210-dimensional vector. Combined with the CT method, each protein pair is represented by a vector of (343 + 210) × 2 = 1106 dimensions.
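The AC transformation can be sketched in Python as follows. The function takes an already normalized descriptor matrix (one row per residue, one column per property) rather than the property table of [33], which is not reproduced here; the function name and the error raised for sequences not longer than the maximum lag are assumptions.

```python
def auto_covariance(descriptors, max_lag=30):
    """Compute AC features for lags 1..max_lag and every descriptor column.

    descriptors: list of rows, one per residue; each row holds the normalized
    physicochemical descriptor values of that residue.
    Returns a flat list of (number of descriptors) x max_lag values.
    """
    n = len(descriptors)
    if n <= max_lag:
        raise ValueError("sequence length must exceed max_lag")
    n_props = len(descriptors[0])
    # Mean of each descriptor over the whole sequence.
    means = [sum(row[j] for row in descriptors) / n for j in range(n_props)]
    features = []
    for j in range(n_props):
        for lag in range(1, max_lag + 1):
            total = sum(
                (descriptors[i][j] - means[j]) * (descriptors[i + lag][j] - means[j])
                for i in range(n - lag)
            )
            features.append(total / (n - lag))
    return features
```

With seven descriptor columns and `max_lag=30` the function returns 210 values per protein, which together with the 343 CT values gives (343 + 210) × 2 = 1106 values per protein pair.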

Verification

We applied the interolog method and the domain-domain method to the proteins of banana and Foc4 to obtain their PPIs, and treated these PPIs as the positive samples, with size 739. We verified the predicted results by the five-fold cross-validation method and the independent test method, respectively. The Long Short-Term Memory (LSTM) neural network [40] was used to predict PPIs between banana and Foc4.

Using the feature encoding of the PPIs between banana and Foc4, each original protein sequence was converted into a fixed-length numerical vector, which was used as the input of the LSTM neural network. The input layer of the LSTM neural network was a feature vector composed of the forward and backward hidden layer output vectors \(h_f\) and \(h_b\). The rectified linear unit (ReLU) was used as the activation function in the hidden layer, and the softmax function was used in the output layer. According to the results of the CT and AC coding schemes, the input sequence was \(X = \left( x_{1}, x_{2}, x_{3}, \ldots, x_{n} \right)\), and the prediction model output the corresponding result sequence \(Y = \left\{ y_{1}, y_{2}, y_{3}, \ldots, y_{n} \right\}\). In the prediction model, the learning rate was set to 0.001, the batch size was 128, and the fully connected layer had 128 neurons. In five-fold cross-validation, we randomly selected negative samples from the banana and Foc4 proteomes. The number of selected negative samples was the same as the number of predicted PPIs. Candidate negative samples with a sequence identity greater than 20% to the predicted PPIs between banana and Foc4 were filtered out. In the independent test, when the number of positive samples is m, the number of negative samples is 10 × m. We selected 2 × m/3 of the positive samples and 2 × m/3 of the negative samples to form the training set, and the remaining m/3 positive samples and 10 × m − 2 × m/3 = 28 × m/3 negative samples formed the test set.
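The sample split with m positive and 10 × m negative samples can be illustrated with the following Python sketch. The function name, the fixed random seed, and the use of integer division when m is not a multiple of 3 are assumptions made for illustration only.

```python
import random


def train_test_split_10x(positives, negatives, seed=0):
    """Split m positives and 10*m negatives as described in the text.

    Training set: 2m/3 positives and 2m/3 negatives.
    Test set: the remaining m/3 positives and 28m/3 negatives.
    Each returned item is a (sample, label) tuple with label 1 or 0.
    """
    m = len(positives)
    if len(negatives) != 10 * m:
        raise ValueError("expected ten times as many negatives as positives")
    rng = random.Random(seed)
    pos, neg = list(positives), list(negatives)
    rng.shuffle(pos)
    rng.shuffle(neg)
    k = (2 * m) // 3  # assumption: round down when m is not divisible by 3
    train = [(x, 1) for x in pos[:k]] + [(x, 0) for x in neg[:k]]
    test = [(x, 1) for x in pos[k:]] + [(x, 0) for x in neg[k:]]
    return train, test
```

For m = 9, for instance, the training set holds 6 positive and 6 negative samples, and the test set holds 3 positive and 84 = 28 × 9/3 negative samples, so the test set is far more imbalanced than the training set.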

In this paper, we used the accuracy ACC, sensitivity Sn, specificity Sp, receiver operating characteristic (ROC) curve, and area under the curve (AUC) to evaluate the prediction performance [23]:

$$ACC = \frac{TP + TN}{TP + TN + FP + FN}$$

(3)

$$Sn = \frac{TP}{TP + FN}$$

(4)

$$Sp = \frac{TN}{TN + FP}$$

(5)

where TN is the number of true negatives, TP represents the number of true positives, FN denotes the number of false negatives, and FP is the number of false positives.
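As a small worked example, the three scalar measures can be computed from the four confusion-matrix counts using their standard definitions: accuracy is the fraction of all samples classified correctly, sensitivity is the fraction of positives recovered, and specificity is the fraction of negatives recovered. The function name and the counts used in the usage note are illustrative and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (ACC, Sn, Sp) from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)  # correctly classified / all samples
    sn = tp / (tp + fn)                    # sensitivity: recovered positives
    sp = tn / (tn + fp)                    # specificity: recovered negatives
    return acc, sn, sp
```

With the made-up counts TP = 8, TN = 9, FP = 1, and FN = 2, the function gives ACC = 17/20, Sn = 8/10, and Sp = 9/10.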

Each protein is used as a node and the interaction between each pair of proteins is represented as an edge, so that a PPI network is formed by all the nodes and edges. We used the software Cytoscape 3.7 [41] to visualize the PPI network so that the characteristics of the network can be observed conveniently and intuitively. We used the ClusterViz plug-in in Cytoscape [41] to divide the interaction network into different functional modules, executing its FAG-EC algorithm [42] to partition the network into several subnetworks. The betweenness centrality \(V_i\) of node i in the network is calculated as follows:

$$V_{i} = \sum\limits_{s \ne i \ne t} {\frac{n_{st}^{i} }{g_{st} }}$$

(6)

where \(g_{st}\) denotes the number of the shortest paths from node s to node t, and \(n_{st}^{i}\) represents the number of the shortest paths from node s to node t via node i in the network.
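The centrality defined above, the sum over node pairs (s, t) of the fraction of shortest s to t paths that pass through node i, can be computed for an unweighted, undirected network with breadth-first search and Brandes-style accumulation. The sketch below is a generic implementation under the assumption that each unordered pair is counted once; it is not the routine used inside Cytoscape.

```python
from collections import deque


def betweenness(adjacency):
    """Betweenness centrality for an undirected, unweighted graph.

    adjacency: dict mapping each node to the set of its neighbours.
    Each unordered pair (s, t) is counted once (assumption).
    """
    score = {v: 0.0 for v in adjacency}
    for s in adjacency:
        # Breadth-first search from s, counting shortest paths (sigma).
        sigma = {v: 0 for v in adjacency}
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in adjacency}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate pair dependencies from the farthest nodes to s.
        delta = {v: 0.0 for v in adjacency}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every unordered pair was visited from both ends, so halve the totals.
    return {v: b / 2 for v, b in score.items()}
```

In a star-shaped sub-network where one Foc4 protein is linked to three banana proteins, the Foc4 hub lies on the only shortest path between each of the three banana pairs and therefore scores 3, while each banana protein scores 0.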

We applied the software TBTools [43] to carry out the GO (Gene Ontology) functional enrichment analysis of the PPIs. Following the specification for TBTools, we set the significance threshold to p < 0.05 and used the Bonferroni correction [44]. KEGG (Kyoto Encyclopedia of Genes and Genomes) enrichment analysis (p-value < 0.05) of the PPIs was performed by using KOBAS2.
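For reference, the Bonferroni correction multiplies each raw p-value by the number of tests performed and caps the result at 1. The sketch below shows the generic procedure with an assumed function name and made-up p-values; it does not reproduce the internal implementation of any enrichment tool.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (adjusted p-value, is_significant) for each raw p-value."""
    n_tests = len(p_values)
    results = []
    for p in p_values:
        adjusted = min(1.0, p * n_tests)  # multiply by the number of tests
        results.append((adjusted, adjusted < alpha))
    return results
```

With three tested terms and raw p-values 0.001, 0.02, and 0.5, only the first term remains significant at the 0.05 level, because 0.02 × 3 = 0.06 exceeds the threshold.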

Results

Experimental environment

The experiments were run on a computer with an Intel(R) Xeon(R) W-2133 CPU @ 3.6 GHz and 8 GB of memory, running the Windows 10 operating system. The prediction algorithm was implemented in Python 3.

Experimental results

We first predicted 26,910 PPIs and 376,755 PPIs between banana and Foc4 by using the interolog method and the domain-domain method, respectively. Table 1 shows the results of the predicted PPIs, where 739 interactions involving 515 banana proteins and 81 Foc4 proteins were predicted by both the interolog method and the domain-domain method. Method1 represents the interolog method, and Method2 denotes the domain-domain method. The detailed data sets of all predicted results are given in Supplementary table 1.

Table 1 Statistical information of predicted PPIs between banana and Foc4

It can be seen from the results in Table 1 that the number of PPIs predicted by the interolog method is less than that of PPIs predicted by the domain-domain method. This is because the interolog method adopts the homologous sequence-based alignment, which depends on the amount of data in the existing database, while the domain-domain method is based on the interactive domains contained in proteins, and a protein can contain two or more interactive domains [45].

We extracted the feature vector of proteins in banana-Foc4 PPIs, and analyzed the reliability of banana-Foc4 PPIs predicted by the LSTM neural network-based five-fold cross-validation method and independent test method. Table 2 shows the results of sensitivity Sn, specificity Sp, accuracy ACC, and receiver operating characteristic curve ROC of the predicted banana-Foc4 PPIs.

Table 2 Values of Sn, Sp, ACC, ROC, and running time of predicted banana-Foc4 PPIs

We can see from Table 2 that for the LSTM model, the results predicted by the five-fold cross-validation method were better than the ones predicted by the independent test method, and the results predicted by the LSTM model were better than the ones predicted by the SVM (Support Vector Machine) model, while the LSTM model required much longer computational time than the SVM model. On the other hand, the experimental results also show that the PPIs between banana and Foc4 predicted by five-fold cross-validation and independent test methods have high structural similarity. It illustrates that the PPIs between banana and Foc4 may interact in sequence structure characteristics.

Next, we analyzed the network structure characteristics of the predicted PPIs between banana and Foc4. By using Cytoscape, each protein in the interactions between banana and Foc4 was treated as a node, and each interaction between banana and Foc4 was treated as an edge. The resulting PPI network between banana and Foc4 is shown in Fig. 2, and the detailed information of the PPI network is given in Supplementary table 2.

Fig. 2 PPIs network between banana and Foc4, where the red node represents Foc4 protein, and the blue node denotes banana protein

In the PPI network, the connectivity of a protein is defined as the number of other proteins linked to this protein. Connectivity is an index for evaluating the importance of a protein in the network. From Fig. 2 we can see that the average connectivity of Foc4 proteins was 9.12 and the average connectivity of banana proteins was 1.43. This indicates that the connectivity of Foc4 proteins was higher than that of banana proteins in the PPI network for banana and Foc4, and that Foc4 proteins played a more active role, affecting a series of biological processes of banana infected by Foc4. It can also be seen from Fig. 2 that the PPI network for banana and Foc4 was divided into 51 sub-networks, in which the largest sub-network contains 86 nodes, the smallest sub-network has only two nodes, and there are 30 sub-networks with at least 6 nodes. Some complex sub-networks with more nodes contain multiple Foc4 proteins, while some sub-networks contain only one Foc4 protein. The smallest sub-network has only one banana protein interacting with one Foc4 protein. In addition, we found that three proteins of Foc4, namely EMT64532.1, EMT73264.1, and EMT73245.1, interact with 72, 58, and 29 proteins of banana, respectively. This illustrates that these three proteins of Foc4 play important roles in the interactions, and these results will provide a basis for future biological experiments.
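The two average connectivity values are consistent with the reported size of the network. Because every edge joins exactly one banana protein and one Foc4 protein, the average connectivity on each side equals the number of interactions divided by the number of proteins on that side: 739/81 ≈ 9.12 for Foc4 and 739/515 ≈ 1.43 for banana. The following Python sketch performs this calculation for an arbitrary edge list; the function name and the assumption that the edge list contains no duplicate pairs are illustrative.

```python
def average_connectivity(edges):
    """Average degree of Foc4 and banana proteins in a bipartite PPI network.

    edges: collection of unique (banana_protein, foc4_protein) pairs.
    Returns (average Foc4 connectivity, average banana connectivity).
    """
    banana = {b for b, _ in edges}
    foc4 = {f for _, f in edges}
    # Each edge contributes one link to exactly one protein on each side.
    return len(edges) / len(foc4), len(edges) / len(banana)


# Check against the counts reported in the text:
# 739 interactions, 81 Foc4 proteins, 515 banana proteins.
reported = (round(739 / 81, 2), round(739 / 515, 2))
```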

To annotate the GO functions of the PPIs between banana and Foc4, we first aligned the banana proteins with SwissProt proteins by using the software BLAST. Then, we aligned the Foc4 proteins with SwissProt proteins. Finally, we used TBTools to annotate the GO functions of the PPIs between banana and Foc4. The top 20 annotated results of proteins for Foc4 are shown in Table 3, and the annotated results of proteins for banana are shown in Table 4.

Table 3 Top 20 GO annotated results of proteins for Foc4

Table 4 GO Annotated results of proteins for banana

It can be seen from Table 3 that, in the GO annotation results of Foc4 proteins, the top three functions are membrane fusion, export from cell, and transport. In addition, Foc4 proteins are annotated with vesicle fusion, export across membrane, transmembrane transport, and membrane organization, which are all related to cell membrane function. This is consistent with the fact that a Foc4 protein must cross the cell membrane in order to enter banana and interact with banana proteins.

Table 4 shows that, in the GO annotation results of banana proteins, the top three functions are transport, translation, and catabolic process. Some banana R-proteins (resistance proteins) are annotated with tropism, cellular homeostasis, cell–cell signaling, and other functions, all of which are related to the response of cells to external stress. When Foc4 proteins enter the banana, the banana uses the specificity of intracellular resistance proteins to recognize the effectors and trigger an immune response [46].

It can be seen from Table 5 that, in the KEGG annotation results of Foc4 proteins, many proteins are annotated with membrane transport, ABC transporters, interactions in vesicular transport, and transporters, which are all related to the environmental information processing pathway. Similarly, in the KEGG annotation results of banana proteins in Table 6, many proteins are annotated with interactions in vesicular transport, membrane transport, and ABC transporters, which are also related to the environmental information processing pathway.

Table 5 KEGG Annotated results of proteins for Foc4

Table 6 KEGG Annotated results of proteins for banana

The GO annotation results of the predicted PPIs between banana and Foc4 show that Foc4 proteins were annotated with functions related to the cell membrane, such as vesicle fusion, export across membrane, transmembrane transport, and membrane organization, and that banana proteins were annotated with functions related to the response to external stress, such as transport, tropism, cellular homeostasis, and cell–cell signaling. The KEGG annotation results show that Foc4 proteins were annotated with membrane transport, ABC transporters, interactions in vesicular transport, and transporters, and that banana proteins were likewise annotated with functions related to the environmental information processing pathway. This illustrates that the PPIs between banana and Foc4 predicted by our method are reliable from the perspective of GO and KEGG functional annotation.
