Performance of artificial intelligence on cervical vertebral maturation assessment: a systematic review and meta-analysis

Study selection and characteristics

The database searches yielded 656 records, of which 31 were retrieved for full-text review following title and abstract screening. After full-text examination, 6 articles were excluded for the reasons detailed in Supplementary Table 1, leaving a total of 25 studies that satisfied the inclusion criteria. The number of included studies rose over the observed period, and the data types diversified over time (Fig. 1).

Fig. 1

The included studies utilized two main image modalities: cephalograms (n = 22) [13, 21, 24, 25, 27, 35] and cone beam computed tomography (CBCT) scans (n = 4) [42, 44, 45, 47], as summarized in Table 1. The majority of studies (n = 22) established ground truth labels via evaluation by clinical experts. Specifically, the reference standard was defined by one expert (n = 8) [13, 24, 32, 33, 34, 40, 41, 47], two experts (n = 8) [25, 27, 28, 31, 35, 36, 37, 46], or three or more experts (n = 5) [29, 38, 39, 42, 44]; one study [43] did not report the number of experts involved. Three studies employed a combination of clinical experts and software analysis to determine labels [35, 36, 46].

A total of 55 AI models were utilized, with DL being the most common approach (n = 19) [13, 21, 24, 25, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41], followed by ML (n = 7) [13, 25, 33, 35, 42, 43, 46], statistical modeling (n = 2) [44, 45], and rule-based AI (n = 1) [47]. Among DL technologies, CNNs were the predominant model (n = 14) [21, 24, 25, 27, 28, 29, 30, 31, 32, 37, 38, 39, 40, 41], including ResNet architectures (n = 6) [24, 25, 28, 37, 38, 39]. The most utilized ML techniques were logistic regression, applied in five studies [13, 35, 42, 44, 45], followed by Naïve Bayes [13, 33, 46] and support vector machines [13, 25, 35], each applied in three studies. Logistic regression models were the primary focus in statistical modeling (n = 2) [44, 45].

Augmentation techniques, such as rotation and translation, were implemented in seven DL studies [24, 28, 31, 37, 38, 39, 40] to increase the size of the training data. Feature extraction using landmark coordinate measurements was performed in 10 studies [13, 21, 33, 34, 35, 42, 43, 44, 46, 47]. Additionally, automated region of interest (ROI) detection was carried out in four studies [25, 29, 31, 41], with U-Net used in two of them [31, 41] to delineate crucial anatomical areas from the images.
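
For readers unfamiliar with the rotation and translation augmentation mentioned above, the following is a minimal illustrative sketch in plain Python. It is not taken from any of the included studies, whose actual pipelines are not described here. The function names `translate` and `rotate`, the nested-list image representation, and the nearest-neighbour resampling are all assumptions made for illustration.

```python
import math


def translate(image, dx, dy, fill=0):
    """Shift a 2D grid of pixel values by (dx, dy); vacated pixels get `fill`."""
    h, w = len(image), len(image[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sx, sy = x - dx, y - dy  # source pixel that lands on (x, y)
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = image[sy][sx]
    return out


def rotate(image, degrees, fill=0):
    """Rotate about the image centre using nearest-neighbour inverse mapping."""
    h, w = len(image), len(image[0])
    cx, cy = (w - 1) / 2, (h - 1) / 2
    theta = math.radians(degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # For each output pixel, look up the source location it came from.
            sx = cos_t * (x - cx) + sin_t * (y - cy) + cx
            sy = -sin_t * (x - cx) + cos_t * (y - cy) + cy
            ix, iy = round(sx), round(sy)
            if 0 <= ix < w and 0 <= iy < h:
                out[y][x] = image[iy][ix]
    return out


# Hypothetical toy "image"; each transformed copy would be added to the
# training set with the same label as the original.
toy = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
augmented = [translate(toy, 1, 0), rotate(toy, 180)]
```

In practice, small random angles and offsets are typically drawn for each training sample, and an imaging library would handle interpolation; the sketch only shows the geometric idea.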

The performance metrics most commonly reported in DL studies were accuracy, kappa coefficient, precision, recall, and F1 score. Additional measures such as mean absolute error, area under the receiver operating characteristic curve, sensitivity, and specificity were used occasionally. The statistical modeling studies reported a wider range of metrics, encompassing agreement percentage, R squared, and predictability, as well as some of the aforementioned measures. The ML studies focused on accuracy and area under the curve, while the single rule-based AI study used kappa coefficients and Goodman-Kruskal gamma correlation. A detailed description of each metric is presented in Supplementary Table 2.
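
As a concrete illustration of how the most commonly reported metrics relate to one another, the sketch below computes accuracy, unweighted Cohen's kappa, and macro-averaged precision, recall, and F1 score from a list of reference and predicted stages. It is a minimal example under stated assumptions: the function name `classification_metrics` and the toy labels are hypothetical, and individual studies may have used weighted kappa or other averaging schemes (see Supplementary Table 2 for the definitions used in the review).

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Accuracy, unweighted Cohen's kappa, and macro precision/recall/F1."""
    n = len(y_true)
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))

    accuracy = sum(t == p for t, p in pairs) / n

    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: probability both raters pick the same class at random.
    expected = sum(true_counts[c] * pred_counts[c] for c in labels) / (n * n)
    kappa = (accuracy - expected) / (1 - expected) if expected < 1 else 1.0

    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in pairs)
        precision = tp / pred_counts[c] if pred_counts[c] else 0.0
        recall = tp / true_counts[c] if true_counts[c] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    k = len(labels)
    return {
        "accuracy": accuracy,
        "kappa": kappa,
        "precision": sum(precisions) / k,
        "recall": sum(recalls) / k,
        "f1": sum(f1s) / k,
    }


# Hypothetical toy data: expert-assigned stages vs. model predictions.
reference = [1, 1, 2, 2]
predicted = [1, 1, 2, 1]
metrics = classification_metrics(reference, predicted)
# For this toy data, accuracy is 0.75 and kappa is 0.5.
```

The toy data shows why kappa is often reported alongside accuracy: three of four predictions match, but half of that agreement would be expected by chance given the label frequencies.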

Risk of bias and applicability

Quality assessment identified 8 studies [21, 27, 35, 36, 38, 39, 42, 44] as having a low risk of bias and low concerns regarding applicability across all domains. The greatest issues were found in the reference standard domain, with 13 studies [13, 24, 25,
