An ensemble-based drug–target interaction prediction approach using multiple feature information with data balancing

In this section, we highlight the results of our DTI prediction model on the four feature sets. Each technique was implemented in Python (version 3.8) using the scikit-learn library and its ensemble package, the Keras library, the TensorFlow library, and the XGBoost package. The algorithms were run on Windows 10 with a 3.10 GHz Intel Core i9 processor and 64.0 GB RAM.


The drug and target datasets were collected from the DrugBank [5] database. DrugBank provides SMILES chemical structures and FASTA sequences for drug and protein entries in the approved, experimental, nutraceutical, biotech, and withdrawn groups. Our study uses the approved drugs, targets, and interactions from a recent release of DrugBank Online (version 5.1.8, released 2021-01-03). The datasets consist of 11,150 drugs and 5,260 protein targets, giving 58,649,000 potential interactions, of which just 19,866 are recorded as positive interactions. The remaining 58,629,134 interactions are unknown, so the known positive interactions are vastly outnumbered by the potentially negative ones and the datasets are heavily imbalanced. For this reason, we present a method for predicting negative samples to overcome the imbalance between the positive and negative interaction datasets. The DrugBank dataset statistics are presented in Table 2.
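The size of the imbalance follows directly from the counts quoted above. The short check below reproduces the arithmetic; the variable names are illustrative and not taken from the study's code.

```python
# Counts quoted in the text for the DrugBank 5.1.8 data used in the study.
n_drugs = 11_150
n_targets = 5_260
n_positive = 19_866

# Every drug-target pair is a potential interaction.
n_pairs = n_drugs * n_targets

# Pairs with no recorded interaction are unknown, not confirmed negatives.
n_unknown = n_pairs - n_positive

# Imbalance expressed as unknown pairs per known positive pair.
ratio = n_unknown / n_positive

print(f"{n_pairs:,} potential pairs, {n_unknown:,} unknown")
print(f"roughly {ratio:.0f} unknown pairs for each positive pair")
```

With fewer than one known positive per 2,900 pairs, training directly on all unknown pairs as negatives would swamp the positive class, which is the motivation for selecting a smaller set of likely negatives.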

Table 2 DrugBank dataset statistics

We applied these datasets to feature generation processes and extracted the features. These features combined the four feature sets of the interaction between the drug and protein. The different combinations of these feature sets are shown in Table 3.

Table 3 Four feature sets of the drug–target interaction

Together with the set that combines all features, we now have five feature sets, each with a different number of features.

The results for negative sample prediction

One-class SVM learning requires the selection of a kernel and a scalar parameter to define the boundary. An RBF kernel is usually chosen, even though there is no exact formula or algorithm for setting its bandwidth parameter. The second important parameter in one-class SVM learning is nu, also known as the margin of the one-class SVM, which corresponds to the probability of finding a new, but regular, observation outside the boundary. We set nu to 0.01.

First, the one-class SVM is trained on the positive samples to construct a hyperplane that encloses all positive samples (the positive hyperplane). The decision function of the trained model is then used to determine the signed distances between the unknown interactions and the positive hyperplane, and this is repeated for each of the four feature sets. Second, the unknown interactions with the largest negative distances are identified, since these are the strongest outliers from the positive hyperplane. The evaluation results are shown in Table 4.
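The procedure can be sketched with scikit-learn's `OneClassSVM`. This is a minimal illustration under stated assumptions, not the study's code: the helper name `select_negatives`, the synthetic data, the `gamma="scale"` bandwidth, and the number of negatives kept are all hypothetical choices. Only the RBF kernel and nu = 0.01 come from the text.

```python
import numpy as np
from sklearn.svm import OneClassSVM


def select_negatives(X_pos, X_unknown, n_negatives, nu=0.01, gamma="scale"):
    """Rank unknown pairs by how far they fall outside the positive boundary."""
    model = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma)
    model.fit(X_pos)  # train on positive samples only
    # Signed score: positive inside the boundary, negative outside it.
    scores = model.decision_function(X_unknown)
    # Ascending sort puts the most negative scores (strongest outliers) first.
    order = np.argsort(scores)
    return order[:n_negatives], scores


# Synthetic stand-in for feature vectors (assumption: 5 features per pair).
rng = np.random.default_rng(0)
X_pos = rng.normal(0.0, 1.0, size=(200, 5))
X_near = rng.normal(0.0, 1.0, size=(50, 5))   # unknown pairs resembling positives
X_far = rng.normal(8.0, 1.0, size=(50, 5))    # unknown pairs unlike positives
X_unknown = np.vstack([X_near, X_far])

neg_idx, scores = select_negatives(X_pos, X_unknown, n_negatives=50)
print("selected as negatives:", np.sort(neg_idx))
```

In this toy setup the rows selected as negatives come from the cluster placed far from the positives, which is the behaviour the method relies on when applied to real feature sets.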

Table 4 Evaluation results of negative sample prediction using one-class SVM

The prediction algorithm results

Table 5 records the precision, recall, F-score, and accuracy obtained by the different techniques. Using feature set [1], the highest accuracy of 0.9999 was achieved by AdaBoost ensemble learning, and Light Boost obtained the second best value of 0.9998.
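The scores reported in this section (precision, recall, F-score, accuracy, mean square error, and MCC) can all be derived from the counts of a binary confusion matrix. The sketch below shows the standard definitions. The function name and toy labels are illustrative, and computing the mean square error on hard 0/1 predictions is an assumption, since the text does not say whether labels or probabilities were used.

```python
import math


def classification_metrics(y_true, y_pred):
    """Compute common binary metrics from 0/1 labels and 0/1 predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    n = tp + tn + fp + fn

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    accuracy = (tp + tn) / n
    # Assumption: with hard 0/1 predictions each error contributes 1 to the
    # squared error, so MSE equals the misclassification rate.
    mse = (fp + fn) / n
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {"precision": precision, "recall": recall, "f_score": f_score,
            "accuracy": accuracy, "mse": mse, "mcc": mcc}


# Toy example: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

MCC uses all four confusion counts, so it stays informative when the classes are imbalanced, whereas accuracy alone can look high simply because one class dominates.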

Table 5 Evaluation results of feature sets of the drug–target interaction using machine and ensemble algorithms according to precision, recall, F-score, and accuracy

For feature set [2], the highest precision score value, best recall value, highest F-score value, and highest accuracy score value of 0.9998 were achieved by AdaBoost ensemble learning and Random Forest. Light Boost obtained the second highest value of 0.9996.

For feature set [3], the best precision score value, best recall value, best F-score value, and highest accuracy score value of 0.9993 were obtained by AdaBoost ensemble learning and Random Forest. XGBoost obtained the second highest value of 0.999.

For feature set [4], the best precision score value, best recall value, best F-score value, and highest accuracy score value of 0.999 were obtained by AdaBoost ensemble learning and Random Forest. SVM obtained the worst value for prediction.

For the feature set that combines all features, the best precision, recall, F-score, and accuracy values of 0.9993 were obtained by AdaBoost ensemble learning and Random Forest, and SVM obtained the worst prediction values.

These results show that feature sets [1] and [2] gave better results than the others because they represent drugs using Morgan fingerprints. This supports the view that the Morgan fingerprint is a better representation of drugs than the other features used. When all features were used, the results decreased, which means that some features do not describe drugs and proteins well. Among the drug features, the constitutional descriptors achieved the worst results in DTI prediction.

Table 6 records the area under the curve (AUC), mean square error, and MCC achieved by the different techniques. Using feature set [1], the highest AUC value of 0.9998 was obtained by AdaBoost ensemble learning, and Light Boost obtained the second best value of 0.9997. The best MCC value of 0.9996 was obtained by AdaBoost and Light Boost ensemble learning.

Table 6 Area under the curve (AUC), mean square error, and MCC achieved by different techniques

For feature set [2], the best AUC value and best MCC value of 0.9998 and 0.9997, respectively, were obtained by AdaBoost ensemble learning. Random Forest and Light Boost obtained the second highest value of 0.9996.

For feature set [3], the best AUC value and best MCC of 0.9993 and 0.9986, respectively, were obtained by AdaBoost ensemble learning and Random Forest. XGBoost obtained the second highest value of 0.999.

For feature set [4], the best AUC value and best MCC value of 0.999 and 0.998, respectively, were obtained by AdaBoost ensemble learning, Random Forest, and XGBoost. AdaBoost ensemble learning also obtained the least mean square error for prediction.

For the feature set that combines all features, the best AUC value and best MCC value of 0.9993 and 0.999, respectively, were obtained by AdaBoost ensemble learning. In addition, AdaBoost ensemble learning provided the lowest mean square error for prediction.

The AUC is computed from each model's ROC curve and summarizes the quality of its predictions, while the curve itself offers a clear visual comparison of the models for predicting DTIs.
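As a reminder of what the AUC measures, it equals the probability that a randomly chosen positive pair receives a higher score than a randomly chosen negative pair, with ties counted as one half. The sketch below computes it that way on toy scores. The function name and data are illustrative, and this pairwise form is meant for clarity rather than speed.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example: one of the four positive-negative pairs is ranked wrongly.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(auc)  # 0.75
```

An AUC close to 1, as reported for the ensemble methods here, means that almost every positive pair is scored above almost every negative pair.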

Figure 3 shows the ROC curves and AUC values for the learning techniques on feature sets [1] and [2]. Using feature set [1], the best AUC value of 0.9998 was obtained by AdaBoost ensemble learning. For feature set [2], the best AUC value of 0.9998 was also obtained by AdaBoost ensemble learning. Figure 4 shows the ROC curves and AUC values for the learning techniques on feature sets [3] and [4]. For feature set [3], the best AUC value of 0.9993 was obtained by AdaBoost ensemble learning and Random Forest. For feature set [4], the best AUC value of 0.999 was obtained by AdaBoost ensemble learning. Figure 5 shows the ROC curves and AUC values for the learning techniques on the feature set that combines all features, where the AdaBoost method achieved the maximum AUC of 0.9993.

Fig. 3

The results for the ROC curve and the value of AUC for the learning techniques show that the AdaBoost method predicts the max score in the AUC = 0.9998 for feature set [1] and set [2]

The best results were obtained with the AdaBoost classifier, even though one of its known weaknesses is its sensitivity to outlier samples. This indicates that a very large proportion of the outlier samples had been removed by our method of predicting negative samples with a one-class SVM classifier.

Fig. 4

The results of the ROC curve and the AUC value for the learning techniques. The AdaBoost and Random Forest methods achieved the max AUC of 0.9993 for feature set [3]. In feature set [4], the AdaBoost method predicted the max score in the AUC = 0.9992

Fig. 5

The results of the ROC curve and the value of the AUC for the learning techniques. The AdaBoost method predicted the max score in the AUC = 0.9993 for all feature sets

Feature analysis

Feature importance

In the study, we applied machine learning to discover the important features from different types of features that are used. The genetic algorithm [37] and XGBoost are the methods chosen because they obtain the highest performance compared to other methods.
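The text does not describe how the genetic algorithm was configured, so the following is only a generic sketch of genetic feature selection. Every detail is an assumption: the function `ga_select`, the tournament selection, one-point crossover, bit-flip mutation, elitism, and the toy fitness function. In practice the fitness would be a validation score of the downstream classifier, such as the number of correctly classified samples.

```python
import random


def ga_select(n_features, fitness, pop_size=20, generations=30,
              mutation_rate=0.05, seed=0):
    """Search for a 0/1 feature mask that maximises fitness(mask)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)

    def tournament():
        # Pick two distinct candidates and keep the fitter one.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = [best[:]]  # elitism: carry the best mask forward unchanged
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            cut = rng.randint(1, n_features - 1)      # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]                  # bit-flip mutation
            children.append(child)
        population = children
        best = max(population, key=fitness)
    return best, fitness(best)


# Toy fitness (assumption): features 0, 3 and 7 are informative, and every
# selected feature carries a small cost.
informative = {0, 3, 7}


def toy_fitness(mask):
    hits = sum(mask[i] for i in informative)
    return hits - 0.1 * sum(mask)


mask, score = ga_select(10, toy_fitness)
print(mask, score)
```

Because the best mask is copied into each new generation, the best fitness never decreases from one generation to the next.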

Figure 6 shows the number of correctly classified samples for the different learning techniques. Using Random Forest, the highest number of correctly classified samples was obtained with the genetic method in feature set [2] and feature set [3]. For AdaBoost, the highest number of correctly classified samples was obtained with XGBoost ensemble learning in feature set [1], feature set [3], and the feature set that combines all features.

Fig. 6

The results when applying the feature importance stage before the classifier showed that the XGBoost method obtained the highest score for feature set [2] in the Random Forest classifier whereas the genetic method obtained the highest score in feature set [1] in the AdaBoost classifier

Undersampling and oversampling methods

In our study, we compared the proposed model with undersampling and oversampling methods, using the random undersampling technique as the undersampling method [38] and the SMOTE technique as the oversampling method [38].
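For readers unfamiliar with the two baselines, the sketch below illustrates their core ideas in plain Python. Random undersampling discards majority samples at random. SMOTE creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The function names, toy points, and parameter values are illustrative assumptions and do not reproduce the implementation cited in [38].

```python
import math
import random


def random_undersample(majority, n_keep, rng):
    """Keep a random subset of the majority class, without replacement."""
    return rng.sample(majority, n_keep)


def smote(minority, n_synthetic, k, rng):
    """Interpolate between minority points and their k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest other minority points (Euclidean distance).
        others = sorted((j for j in range(len(minority)) if j != i),
                        key=lambda j: math.dist(base, minority[j]))[:k]
        neighbour = minority[rng.choice(others)]
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (n - b)
                               for b, n in zip(base, neighbour)))
    return synthetic


rng = random.Random(0)
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
majority = [(float(i), 5.0) for i in range(20)]

kept = random_undersample(majority, len(minority), rng)   # 20 -> 4 majority points
new_points = smote(minority, 16, k=2, rng=rng)            # 4 -> 4 + 16 minority points
print(len(kept), len(new_points))
```

Both baselines change only the class proportions. Undersampling still treats the retained unknown pairs as negatives, and SMOTE adds positives that were never observed, which differs from selecting likely negatives with a one-class SVM.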

Our approach outperformed the undersampling and oversampling methods because it relies on predicting negative samples by estimating a probability distribution function with the one-class SVM.

Figure 7 shows that our approach achieved the best performance across the different learning techniques, using Random Forest and AdaBoost in feature set [3]. Finally, we calculated the bias of the methods, and the average value was 0.249.

Fig. 7

The results when applying the feature analysis stage using the random undersampling and SMOTE oversampling methods in feature set [3]. Random Forest and AdaBoost obtained the highest performance in all feature analyses

