
3.1 Accuracy comparison

Table 3 reports the accuracy of the ensemble algorithms and their variants on all datasets considered in this study. A bold number in a cell indicates that the corresponding algorithm (column title) achieved the best accuracy on the given dataset (row title). Interestingly, all classification algorithms achieved 100% accuracy on datasets D7 and D12. The last row shows the number of times each algorithm performed best. Classical stacking performed best the most times (9), followed by multi-level stacking (8). Classical boosting and Logit Boost performed worst by this criterion, each achieving the best performance only three times.

Table 3 Accuracy (%) of ensemble classifiers and their different variants

Table 4 summarises the outcomes of Table 3 for the three basic ensemble approaches. Here, a basic technique is credited whenever any of its variants performs best; for example, bagging is counted as producing the best result for a dataset if any of its six variants achieves the best accuracy on that dataset. An "x" in a cell marks an ensemble technique that produced the best result for that dataset. For datasets D7, D11, D12, D14 and D15, all three approaches (or their variants) achieved the best accuracy. Again, stacking (14) was the best-performing approach, as shown in the last column.
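To make this tallying concrete, the following minimal sketch (our illustration, not the authors' code; all dataset names, variant names and scores are placeholders) counts per-variant wins, crediting ties to every tied variant, and rolls the wins up to the parent ensemble techniques:

```python
# Illustrative sketch of the best-count tallying behind Tables 3 and 4.
# The scores are dummy values; ties are credited to every tied variant,
# consistent with several techniques being marked best on one dataset.

ACCURACY = {  # dataset -> {variant: accuracy %}
    "D1": {"classical_bagging": 91.0, "random_subspace": 90.5,
           "classical_boosting": 92.0, "logit_boost": 91.2,
           "classical_stacking": 91.8, "multi_level_stacking": 91.5},
    "D2": {"classical_bagging": 88.0, "random_subspace": 88.0,
           "classical_boosting": 87.5, "logit_boost": 87.0,
           "classical_stacking": 89.0, "multi_level_stacking": 89.0},
}

FAMILY = {  # variant -> parent ensemble technique
    "classical_bagging": "bagging", "random_subspace": "bagging",
    "classical_boosting": "boosting", "logit_boost": "boosting",
    "classical_stacking": "stacking", "multi_level_stacking": "stacking",
}

variant_best = {v: 0 for v in FAMILY}
family_best = {f: 0 for f in set(FAMILY.values())}

for dataset, scores in ACCURACY.items():
    best = max(scores.values())
    winners = [v for v, s in scores.items() if s == best]
    for v in winners:
        variant_best[v] += 1                      # last row of Table 3
    for f in {FAMILY[v] for v in winners}:        # family credited once per dataset
        family_best[f] += 1                       # last column of Table 4

print(variant_best)
print(family_best)
```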

Table 4 Best accuracy frequency and accuracy score against different datasets

Apart from the datasets on which every ensemble technique achieved the best accuracy (i.e., D7, D11, D12, D14 and D15), bagging performed best only once (D13) and boosting three times (D1, D6 and D8). Stacking, on the other hand, performed best nine times (D2-D5, D8-D10, D13 and D16). From this perspective as well, stacking performed best for disease prediction.

3.2 Precision comparison

Table 5 reports the precision scores of the ensemble techniques and their variants across the disease datasets. All 15 ensemble classifiers considered in this study achieved a 100% precision score on datasets D7 and D12. Datasets D12, D13, D15 and D16 performed consistently, with a precision score above 90% for every classifier. Regarding the number of times a variant achieved the best precision (last row of Table 5), classical stacking ranked first (9), followed by two-level and multi-level stacking, each performing best eight times. As with accuracy, classical boosting and Logit Boost ranked lowest, each performing best only four times, well below classical stacking (9).

Table 5 Precision (%) of ensemble classifiers and their different variants

When the variants are rolled up to their parent ensemble approaches by the number of times they achieved the best precision, stacking again came out on top. The results are presented in Table 6. Stacking performed best on 14 of the 16 datasets, followed by boosting (9) and bagging (8). All variants achieved the best precision on datasets D7, D8, D12 and D14-D16. On the remaining ten datasets (D1-D6, D9-D11 and D13), stacking achieved the best precision eight times, followed by boosting (3) and bagging (2).

Table 6 Best precision frequency and precision score against different datasets

3.3 Recall comparison

For accuracy and precision, variants of the stacking technique showed the best and second-best performance. Recall is an exception in this regard: random subspace and classical stacking tied for the second-best recall score, each performing best seven times according to the last row of Table 7. Dataset D12 yielded 100% recall for all ensemble variants. Logit Boost performed best the fewest times (3) among all variants.

Table 7 Recall (%) of ensemble classifiers and their different variants

For the three parent ensemble approaches, Table 8 shows a three-way tie for the best recall on datasets D7, D12, D14 and D15. Stacking scored best 12 times, followed by boosting (9) and bagging (7). On datasets D3, D7 and D12-D15, stacking achieved a 100% recall score.

Table 8 Best recall frequency and recall score against different datasets

3.4 F1 score comparison

The F1 score followed a trend similar to that of accuracy and precision: stacking variants outperformed the other candidate variants, as detailed in Table 9. Multi-level stacking appeared nine times as the best performer, followed by classical stacking (8) and two-level stacking (8). Datasets D7 and D12 showed a 100% F1 score for all variants, and D16 showed the same F1 score (94%) for all variants. Classical boosting appeared the fewest times (3) as the best performer.

Table 9 F1 score (%) of ensemble classifiers and their different variants

At the meta-level (i.e., the basic ensemble approaches), stacking showed the best F1 score 13 times, followed by boosting (10) and bagging (7), according to Table 10. For datasets D7, D12 and D14-D16, all classifiers showed the same F1 score.

Table 10 Best F1 score frequency and F1 score against different datasets

3.5 AUC comparison

As with accuracy, precision and the F1 score, stacking variants outperformed the other candidate variants on AUC, as detailed in Table 11. Multi-level stacking appeared nine times as the best performer, followed by classical stacking (7) and two-level stacking (7). Dataset D15 showed a 100% AUC value for all variants, and D16 showed the same AUC score (89%) for all variants. Logit Boost appeared the fewest times (3) as the best performer.

Table 11 AUC score (%) of ensemble classifiers and their different variants

According to Table 12, at the meta-level (i.e., basic ensemble approaches), stacking showed the best AUC performance 13 times, followed by boosting (11) and bagging (7). For datasets D2, D7, D12 and D15-D16, all the classifiers showed the same AUC value.

Table 12 Best AUC score frequency and AUC score against different datasets

3.6 AUPRC comparison

According to Table 13, multi-level stacking and classical stacking tied for the number of appearances as the best performer (8 each). Decision tree, XGBoost and two-level stacking each appeared six times as the best performer. As with AUC, dataset D15 showed a 100% AUPRC score for all variants, and dataset D12 showed the same AUPRC score (98%) for all variants. Classical boosting and Logit Boost appeared the fewest times (3) as the best performers.

Table 13 AUPRC score (%) of ensemble classifiers and their different variants

According to Table 14, at the meta-level (i.e., basic ensemble approaches), stacking showed the best AUPRC performance 14 times, followed by boosting (9) and bagging (7). For datasets D7-D8, D12-D13 and D15-D16, all the classifiers showed the same AUPRC value.

Table 14 Best AUPRC score frequency and AUPRC score against different datasets

3.7 RPI score comparison

Using the results from Tables 3, 5, 7, 9, 11 and 13 for the 16 datasets, we calculated the RPI score of each variant for every performance measure. Table 15 presents the resulting RPI scores. Classical stacking showed the highest RPI score for the accuracy (11.31%), precision (16.81%) and recall (21.50%) measures. Multi-level stacking showed the highest RPI scores for AUC (9.56%) and AUPRC (12.69%). For the F1 score, classical boosting had the highest RPI score (7.06%).
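The RPI score is defined earlier in the paper; as an illustration only, the sketch below assumes RPI is a variant's mean percentage improvement over the per-dataset average score across all variants, which may differ from the authors' exact formula. All names and numbers are placeholders.

```python
# Hypothetical sketch only -- ASSUMES RPI is a variant's mean percentage
# improvement over the per-dataset average score; the paper's own
# methodology section gives the authoritative definition.

def rpi_scores(scores_by_dataset):
    """scores_by_dataset: {dataset: {variant: score}} -> {variant: RPI %}."""
    variants = next(iter(scores_by_dataset.values())).keys()
    totals = {v: 0.0 for v in variants}
    for scores in scores_by_dataset.values():
        avg = sum(scores.values()) / len(scores)    # dataset-wise average
        for v, s in scores.items():
            totals[v] += 100.0 * (s - avg) / avg    # % improvement over average
    n = len(scores_by_dataset)
    return {v: total / n for v, total in totals.items()}  # mean over datasets

# Toy usage with placeholder numbers:
print(rpi_scores({
    "D1": {"classical_stacking": 92.0, "logit_boost": 88.0},
    "D2": {"classical_stacking": 85.0, "logit_boost": 86.0},
}))
```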

Table 15 RPI score for ensemble classifiers and their variants

3.8 Best count statistics comparison

The last rows of Tables 3, 5, 7, 9, 11 and 13 show the number of times each variant performed best on accuracy, precision, recall, F1 score, AUC and AUPRC, respectively. Table 16 summarises these six rows, giving the number of times each variant performed best across all six measures. Multi-level stacking, a stacking variant, topped the list, appearing 50 times as the best-performing variant. This value is significantly higher than the other values in the list (p ≤ 0.02) according to the 'inverse normal distribution' test for a single value. The second-highest value belonged to another stacking variant, classical stacking (48), which is also significantly higher than the remaining values (p ≤ 0.04). Logit Boost appeared the fewest times (20) as the best performer in this table.
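To make the aggregation concrete, a minimal sketch follows; the per-measure counts shown are placeholders rather than the paper's exact rows.

```python
# Illustrative only: Table 16 sums the per-measure "best count" rows
# (the last rows of Tables 3, 5, 7, 9, 11 and 13). Counts are placeholders.

best_counts = {  # measure -> {variant: times best}
    "accuracy":  {"classical_stacking": 9, "multi_level_stacking": 8, "logit_boost": 3},
    "precision": {"classical_stacking": 9, "multi_level_stacking": 8, "logit_boost": 4},
    "recall":    {"classical_stacking": 7, "multi_level_stacking": 8, "logit_boost": 3},
    # ... the F1 score, AUC and AUPRC rows would follow in the same shape.
}

overall = {}
for counts in best_counts.values():
    for variant, n in counts.items():
        overall[variant] = overall.get(variant, 0) + n

print(overall)  # one total per variant, as in Table 16
```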

Table 16 Comparison of approaches considering the number of times showing the best performance
