Prediction models for the complication incidence and survival rate of dental implants—a systematic review and critical appraisal

The present systematic review summarized and appraised prediction models for the complications and survival rates of dental implants. First, predictors including implant length, implant position, aging, and history of periodontitis were consistently used across multiple prediction models. Second, there were limitations in the selection of predictors and inconsistencies in the diagnostic criteria of peri-implantitis. Third, the predictive abilities of machine learning (ML) and logistic regression (LR) models were similar. Fourth, the reporting transparency, methodological quality, and external validity of the prediction models were inadequate.

Several studies have reviewed clinical prediction models in dentistry [42,43,44]. Consistent with our findings, a review of deep learning models in oral implantology concluded that many studies exhibited a high risk of bias [43]. A high risk of bias may endanger the validity of findings and lead to incorrect conclusions; it can lead decision-makers to choose ineffective treatments, wasting resources and potentially harming patients. Rigorous methods are essential to minimize bias, and caution is needed when interpreting the results of clinical model research, especially when they guide clinical practice and policy. Another study reviewed prediction models in periodontology [42]. Despite a general risk of bias and low transparency, prediction models in periodontology exhibited better methodological quality than those in implantology, because more models in periodontology were tested for calibration and discrimination performance. This may be because predictive models were first applied in periodontics and only later extended to the implant field. A meta-analysis of external validation studies of caries risk assessment models found that the models' average predictive performance was acceptable [44]. Prediction models for dental caries assessment are more often externally validated than those in implantology, possibly owing to the higher incidence of caries and its simpler diagnostic criteria. It is essential to address these issues and conduct rigorous external validation to improve the reliability and generalizability of prediction models in dentistry.

In this systematic review, the most commonly used predictors were implant length and implant position, followed by age and history of periodontitis. A systematic review and meta-analysis by Abdel-Halim et al. examined the failure rates of short implants (<10 mm) and long implants (≥10 mm) and found that the risk of failure for short implants was 2.5 times higher than for long implants, suggesting that implant length is a factor affecting implant failure [45]. Another study, using a multivariate marginal Cox analysis, provided suggestive evidence that implant length (<10 mm) was associated with an increased risk of implant failure [46]. The implant position factors considered here include the site of placement within the jaw (anterior/posterior), the jaw itself (maxilla/mandible), and the three-dimensional position of the implant. Among the included articles, both anterior and posterior regions were identified as risk factors for implant survival or complications. The susceptibility of anterior implants to complications could be attributed to their typically attenuated cortical bone compared with posterior implants [47], whereas the posterior area's propensity for complications or implant loss may stem from its role in withstanding the primary occlusal force [48]. Furthermore, the anterior regions of both the maxilla and mandible typically possess a denser cortical layer than the posterior maxillary area. This dense bone is conducive to achieving primary implant stability, but the increased potential for thermal damage during the preparation of implant sites in dense bone can result in soft tissue encapsulation instead of the desired osseointegration [49]. These contrasting effects of dense bone might account for the divergent outcomes. Nevertheless, it is clear that malpositioning of dental implants is an important factor in the incidence of implant-related complications and in survival rates. Specifically, inadequate inter-implant or implant-tooth distances may precipitate marginal bone loss, adversely affecting osseointegration and the prognostic stability of the implant. Concurrently, even mild implant malpositioning may give rise to inconsistency between the implant axis and occlusal forces, leading to complications [48].

In addition, older adults are more susceptible to periodontal and peri-implant diseases, which can be attributed to poor oral hygiene resulting from decreased dexterity and vision [50]. According to a retrospective cohort study, partially edentulous patients with a history of severe periodontitis were more prone to developing peri-implantitis [51]. The periodontitis grade of peri-implantitis patients was correlated with the severity of peri-implantitis and the occurrence of implant failure [52]. Furthermore, radiographic features and implant position were the two most frequently used predictors among the excellent or very good models; these two predictors might improve the predictive ability of models.

Predictor selection is a critical aspect of model construction in prediction studies. First, it should be emphasized that causal factors clearly affect the outcomes, whereas predictors selected for prediction models do not necessarily need such causal effects [53]. Although using causal factors as predictors can enhance model portability across different populations and improve model reliability and acceptance, predictor selection should primarily focus on enhancing model discrimination and calibration [54]. Second, the statistical method for predictor selection is also important. In the included studies, variable selection depended on univariable analysis, which may lead to incorrect predictor selection because predictors are chosen based on their statistical significance as single predictors rather than in the context of other predictors, resulting in the omission of relevant variables from the model [12, 55]. Third, it is essential to make the models applicable to daily practice; ideal predictors should be inexpensive, easy to obtain, and accurate [12]. We noted that some models use a combination of clinical, demographic, and molecular measures to predict implant retention and complications. Although this seems to improve model performance, some predictors (e.g., microbial predictors and molecular measures) are cumbersome to measure, which often makes the model less useful in the clinic.
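
To make the univariable-screening pitfall concrete, the following is a minimal sketch on synthetic data (the variable names and effect sizes are illustrative assumptions, not data from any included study): a predictor that merely proxies a true risk factor appears significant when screened alone, but contributes little once modeled alongside the other predictors.

```python
# Minimal sketch with synthetic data: univariable screening vs. a
# multivariable model. All names and effect sizes are illustrative.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)                    # true risk factor
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)   # correlated proxy of x1
p = 1 / (1 + np.exp(-(-1.0 + 1.2 * x1)))   # outcome depends only on x1
y = rng.binomial(1, p)
X = pd.DataFrame({"x1": x1, "x2": x2})

# Univariable screening: each predictor tested alone.
for col in X.columns:
    fit = sm.Logit(y, sm.add_constant(X[[col]])).fit(disp=0)
    print(f"univariable {col}: p = {fit.pvalues[col]:.4f}")
# x2 looks "significant" alone because it proxies x1 ...

# Multivariable model: predictors assessed in context with each other.
full = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
print(full.summary2().tables[1])  # ... but adds little once x1 is included
```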

Peri-implantitis occurrence was the most common outcome, predicted in four of the 14 prediction modeling studies. Regarding peri-implantitis prediction, the studies used different definitions, leading to inconsistent models. The widely acknowledged diagnostic criteria for peri-implantitis, formulated at the 2017 consensus workshop (Table S12), require (1) the presence of bleeding and/or suppuration on gentle probing, (2) increased probing depth compared with previous examinations, and (3) the presence of bone loss beyond crestal bone level changes resulting from initial bone remodeling.

All of these studies were published after the consensus meeting. In three of them, the definition followed the consensus report of Workgroup 4 of the 2017 World Workshop [15, 30,31,32], in high agreement with the norms in this respect. The definition of peri-implantitis in the two models from Zhang's study followed the consensus [15, 30], including the presence of bleeding and/or suppuration on probing, increased probing depth compared with previous examinations, and bone loss beyond crestal bone level changes after initial bone remodeling. In Mameno's study with three models, peri-implantitis was defined as the presence of bleeding on probing and/or suppuration during the follow-up period together with >1 mm of bone resorption from the baseline measurement [31]. In Rekawek's study, peri-implantitis was defined as radiographic evidence of changes in the crestal bone level and clinical evidence of bleeding on probing, with or without suppuration [32]. However, the definition of peri-implantitis in the study by Zhang et al. [33] comprised bleeding/suppuration on probing, probing depth ≥5 mm, and radiographic marginal bone loss ≥2 mm, whereas probing depths of ≥6 mm were suggested in the consensus. The overly broad diagnostic criteria in that study could result in false positives. It is suggested that future studies strictly adhere to the diagnostic criteria of the 2017 World Workshop on the Classification of Periodontal and Peri-Implant Diseases and Conditions (Table S12) to increase comparability between studies and promote the generalizability of the models. It bears emphasis that the diagnostic criteria encompass two sets: one with and one without baseline data. While the absence of baseline data is a common clinical scenario, the literature suggests that its presence enhances diagnostic precision [56]. In its absence, the secondary diagnostic criteria may demonstrate diminished sensitivity, risking the omission of early or subtle peri-implantitis cases.

In terms of the proportion of models with excellent and very good AUROC, the models constructed with deep learning methods showed the best discrimination. Machine learning models (other than deep learning models) and logistic regression models performed similarly in terms of AUROC. Four studies used multiple methods to develop models, demonstrating varying discrimination across models. Zhang et al. showed that the support vector machine (SVM), artificial neural network (ANN), and logistic regression models all exhibited a high sensitivity of 91.67% [17]. In Mameno's study, the support vector machine model outperformed the logistic regression and random forest (RF) models [31]. Huang et al. [36] demonstrated that their integrated model, combining logistic regression and convolutional neural network models, showed better discrimination than either alone. In Rekawek's study, LR and SVM had similar discrimination, while RF performed best [32]. These findings are consistent with previous studies that have not shown a performance benefit of ML models over LR models for clinical prediction [57]. When the data meet the criteria for using logistic regression, it may be preferable, as it provides more interpretable and explanatory results than ML methods [58]. Therefore, clinicians should consider a method's suitability rather than inappropriately favor advanced and complex techniques. Although deep learning models outperform statistical models in terms of the number of models with excellent or very good AUROC, they are more susceptible to overfitting. Additionally, these studies lacked external validation, making it challenging to determine superiority. Moreover, the burgeoning field of applied machine learning in predictive modeling introduces layers of complexity, heightened by variable quality that can make applying these models challenging [59]. First, it brings algorithmic complexity: complex machine learning algorithms demand meticulous tuning and can overfit, performing well on training data but poorly on new instances. Second, AI models rely heavily on data, posing challenges in domains with scant data resources [60]. In addition, many AI models, particularly deep learning ones, have interpretability gaps, complicating the task of explaining their decisions [61].
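
As an illustration of how such head-to-head comparisons are typically performed, the minimal sketch below estimates cross-validated AUROC for LR, SVM, and RF classifiers on synthetic data; it mirrors the kind of evaluation described above, not any included study's actual pipeline, and the dataset parameters are arbitrary assumptions.

```python
# Minimal sketch (synthetic data): comparing LR, SVM, and RF by
# cross-validated AUROC, as in the discrimination comparisons above.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# Imbalanced binary outcome, loosely mimicking a complication endpoint.
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           weights=[0.8, 0.2], random_state=0)

models = {
    "LR":  make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "SVM": make_pipeline(StandardScaler(), SVC(probability=True)),
    "RF":  RandomForestClassifier(n_estimators=300, random_state=0),
}
for name, model in models.items():
    aucs = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: AUROC = {aucs.mean():.3f} ± {aucs.std():.3f}")
```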

The TRIPOD checklist revealed low transparency in the reporting of the prediction models. Over 40% of the studies did not report missing data handling (item 9), full model presentation (item 15a), how to use the model (item 15b), or supplementary information. The lack of a full model presentation might prompt other researchers to develop a new model instead of conducting external validation research, resulting in many prediction model development publications failing to be implemented in the clinic. Important elements such as how the sample size was determined (item 8) and the flow of participant selection (item 13a) were not reported in half of the studies. Selective reporting of models might cause a lack of authenticity and insufficient evidence, resulting in over-optimism about the performance of these models [20].

The poor quality of the models is mainly attributable to three aspects: the method of dealing with missing data, the lack of proper calibration assessment, and the lack of external validation. Of the eight studies that reported how missing data were handled, one used K-nearest-neighbors imputation, while the remaining seven reported excluding subjects with missing data. Excluding subjects with missing data can be reasonable when the number of missing cases is small; however, it leads to a significant reduction in sample size and potential bias when many variables have missing data. Of the only three studies that conducted a calibration assessment, two presented calibration plots, while the other reported only the Hosmer–Lemeshow test without a calibration plot or table, which provides insufficient information for evaluating the accuracy of the predicted risks by comparing predicted versus observed outcome frequencies. Moreover, only three of the 14 studies conducted external validation. External validation is crucial for evaluating a model's generalization ability and ensuring good performance on new data, thereby enhancing the accuracy and reliability of the prediction models [62]. A key factor hindering the clinical implementation of prediction models is that most models predict well in the development set but poorly in the external validation set. Therefore, it is necessary to conduct external validation in the clinical modeling process.
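
For illustration, the sketch below (synthetic data; not code from the included studies) shows the two practices discussed here: K-nearest-neighbors imputation as an alternative to excluding incomplete cases, and a calibration curve comparing predicted versus observed outcome frequencies.

```python
# Minimal sketch (synthetic data) of KNN imputation and a calibration check.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.calibration import calibration_curve

X, y = make_classification(n_samples=600, n_features=8, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan  # 5% of values missing at random

X = KNNImputer(n_neighbors=5).fit_transform(X)  # impute, don't delete cases

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# Observed event rate per bin of predicted risk; ideal is observed == predicted.
obs, pred = calibration_curve(y_te, probs, n_bins=5)
for o, p in zip(obs, pred):
    print(f"predicted {p:.2f} -> observed {o:.2f}")
```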

For statistical model studies, a larger absolute coefficient value typically indicates a more important feature, assuming all features are on the same scale. However, most of the statistical models use different types of predictors, which makes it difficult to determine relative feature importance. Among the articles that reported relative feature importance, the results were too heterogeneous to compare. Calculating feature importance plays a pivotal role in machine learning and statistical modeling: it aids in the strategic selection of features, enabling the identification of the most critical elements that can streamline the model and enhance its clarity and workability [63]. Additionally, it enriches a model's interpretability, offering a tangible way to elucidate the underpinnings of its predictive mechanisms [61].
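
The sketch below illustrates, on synthetic data, two common ways to quantify relative feature importance: standardized regression coefficients for a statistical model, and permutation importance, which is model-agnostic and therefore also applicable to ML models without coefficients. It is a generic illustration, not a reproduction of any included model.

```python
# Minimal sketch (synthetic data): two ways to rank feature importance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           random_state=0)
X = StandardScaler().fit_transform(X)  # same scale -> comparable coefficients
clf = LogisticRegression(max_iter=1000).fit(X, y)

# 1) Standardized coefficients: larger |coef| suggests greater importance.
print("|coef|:", np.abs(clf.coef_[0]).round(2))

# 2) Permutation importance: AUROC drop when a feature is shuffled;
#    works for any fitted model, with or without coefficients.
result = permutation_importance(clf, X, y, scoring="roc_auc",
                                n_repeats=20, random_state=0)
print("perm:  ", result.importances_mean.round(3))
```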

The included clinical prediction models necessitate evaluation against overfitting, typically through external validation (calculating RMSE and c-statistics) on a distinct dataset or by employing cross-validation techniques. Other methods mentioned in the included studies are the Hosmer–Lemeshow goodness-of-fit test, calibration plots, and on-the-fly data augmentation. However, about one third of the included studies failed to address overfitting, which might result in poor generalization, decreased predictive performance, and increased complexity, especially in AI models.
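
A minimal sketch of one such internal check, on synthetic data: comparing the apparent (training-set) c-statistic with a cross-validated estimate, where a large gap between the two signals optimism from overfitting.

```python
# Minimal sketch (synthetic data): apparent vs. cross-validated AUROC.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score

# Few samples, many features: a setting prone to overfitting.
X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

apparent = roc_auc_score(y, clf.predict_proba(X)[:, 1])
cv = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                     cv=5, scoring="roc_auc")
print(f"apparent AUROC: {apparent:.3f}")          # near 1.0 on training data
print(f"cross-validated AUROC: {cv.mean():.3f}")  # honest estimate; gap = optimism
```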

Three of the included studies had an events-per-variable (EPV) ratio of less than 10. A low EPV indicates that the model is at risk of overfitting or that more samples are needed to ensure the stability and generalization of the results. Several methods can be used to address this issue. Penalized regression (such as ridge regression [64], lasso regression [65], and the elastic net [66]) prevents overfitting by adding a penalty term to the model's loss function. Each of these methods has its own advantages and shortcomings. Ridge regression reduces coefficient sizes but keeps all variables, so it cannot prevent overfitting as effectively as methods that eliminate variables and may still include irrelevant ones [64]. Lasso regression shrinks coefficients to zero for variable selection, enhancing model simplicity and interpretability, but risks excluding crucial predictors [65]. The elastic net strikes a balance by blending the lasso and ridge penalties, yet it demands meticulous tuning of its parameters, posing a computational challenge [66].
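
The sketch below illustrates these three penalties in their classification form (logistic regression with L2, L1, and elastic-net penalties on synthetic data); the hyperparameter values are arbitrary assumptions for demonstration, and in practice they would be tuned by cross-validation.

```python
# Minimal sketch (synthetic data): ridge, lasso, and elastic-net penalties.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Few samples relative to features, i.e., a low-EPV-like setting.
X, y = make_classification(n_samples=150, n_features=25, n_informative=5,
                           random_state=0)
X = StandardScaler().fit_transform(X)

models = {
    "ridge (L2)":  LogisticRegression(penalty="l2", C=1.0, max_iter=5000),
    "lasso (L1)":  LogisticRegression(penalty="l1", solver="saga",
                                      C=1.0, max_iter=5000),
    "elastic net": LogisticRegression(penalty="elasticnet", solver="saga",
                                      l1_ratio=0.5, C=1.0, max_iter=5000),
}
for name, m in models.items():
    m.fit(X, y)
    nonzero = np.sum(np.abs(m.coef_) > 1e-6)
    print(f"{name}: {nonzero}/25 nonzero coefficients")
# Ridge keeps all 25 predictors; lasso and elastic net zero some out,
# which is the variable-elimination behavior discussed above.
```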

However, EPV is a guiding principle rather than a hard-and-fast rule. It is used as an auxiliary tool for sample size calculation to ensure model stability and avoid overfitting; researchers should also consider all other relevant factors, such as study design, effect size, and statistical power.
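
As a worked example with hypothetical numbers (not taken from any included study), the EPV calculation itself is straightforward:

```python
# Hypothetical EPV calculation: events per candidate predictor parameter.
events = 48               # e.g., implants that developed peri-implantitis
candidate_predictors = 9  # parameters considered during model building
epv = events / candidate_predictors
print(f"EPV = {epv:.1f}")  # 5.3 < 10 -> risk of overfitting; enlarge the
                           # sample, drop candidates, or penalize the fit
```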

The duration of follow-up across the reviewed studies ranged from a minimum of six months to a maximum exceeding five years. Four of the 14 studies had a follow-up duration of one year or less. While this timeframe is adequate for monitoring soft tissue alterations, such as BOP and SOP, it falls short for comprehensively assessing bone tissue remodeling after implant surgery. Because bone tissue remodeling requires a more extended period and significant bone loss can still occur after the implant has been in function for over a year [67], a one-year follow-up period may be relatively brief for studies aimed at evaluating bone tissue outcomes such as marginal bone loss. Sampaio-Fernandes et al.'s six-month follow-up was adequate for tracking mobility, pain, and peri-implant mucositis but insufficient for peri-implantitis assessment [38]. Likewise, the one-year follow-ups in the Nobre et al. and Zhang et al. trials were too short for comprehensive peri-implantitis evaluation [37, 47]. Ha et al.'s one-year follow-up, with implant survival and complications as outcome measures and no bone tissue assessment, was reasonable for that study's objectives [68].

It should be noted that the included articles' approaches to resolving inconsistencies in measurement or operation varied. A case in point is the involvement of multiple physicians, which is recognized as a potential factor affecting the outcome but was not uniformly addressed. In the two studies conducted by Papantonopoulos et al. [34, 69], the consistency of clinical and radiographic measurements was maintained by assigning a single operator to perform these assessments. Similarly, Ha [68], Lu [18], Wang [19], and Oh [40] controlled for variability by having the operations conducted by one operator following a standardized protocol. In contrast, Zhang et al. [15], who used three experienced operators for implant placements, did not report measures of inter-operator reliability. The other studies lacked details on data collection reliability and did not calculate stability or consistency measures, nor did they adjust for inter-rater reliability, potentially introducing bias when multiple physicians were involved.

Several recommendations can be made to improve the use and performance of clinical prediction models. First, improving a model's performance and simplicity should take priority over giving a causal interpretation of its predictors; an ideal prediction model should balance predictability and simplicity. Second, a consistent definition of peri-implantitis can enhance the comparability between studies and the generalizability of models; the diagnostic criteria in the consensus report of the 2017 World Workshop are highly recommended for further research. Third, when selecting the model development method, it is crucial to consider the data and the clinical application rather than rely inappropriately on advanced methods. Fourth, clinical prediction modeling studies should conform to the standards outlined in the TRIPOD statement to ensure reporting completeness and transparency. Fifth, proper missing data analysis, calibration, and external validation should be conducted to improve model quality; the PROBAST checklist can be utilized to mitigate the risk of bias.

This systematic review provides suggestions for future studies on models in implantology. The snowball strategy identified only one additional article from the reference lists, indicating the completeness of the search strategy. This study had some limitations. First, grey literature was not included because it was unavailable. Additionally, studies without exact values for metrics such as AUROC, sensitivity, specificity, PV+, PV−, and accuracy were excluded so that the models could be analyzed comprehensively; these limitations regarding literature sources should be borne in mind. Furthermore, although most of the included studies were retrospective, there was heterogeneity in study designs, populations, outcome definitions, development methods, and other features. This heterogeneity, together with the limited number of high-quality studies, precluded a meta-analysis. Lastly, predictive factors are ideally assessed with prospective longitudinal data; we included 12 retrospective studies given the small number of eligible studies. Future studies should build models on more prospective cohort data.
