Comparison of deep learning models to detect crossbites on 2D intraoral photographs

Dataset

The dataset used in this study was obtained from the Section of Orthodontics, Aarhus University, Denmark. It includes patients who underwent an initial orthodontic consultation there between 01.07.2018 and 31.07.2023. The dataset contains randomly selected clinical photographs taken for orthodontic diagnosis and treatment planning, and therefore represents the whole patient cohort seen during this interval. For each patient, photographs displaying the occlusion from the anterior, left, and right sides were included. Exclusion criteria were crossbites in the deciduous dentition and orthodontic treatment in progress. To preserve patient anonymity, the intraoral image dataset was used without personal information such as name, age, or gender. The images were first labelled as non-crossbite or crossbite. In a second step, the crossbite photographs were further labelled as lateral or frontal crossbite. The analysis applied the German classification system “Kieferorthopädische Indikationsgruppen” (KIG), which determines health insurance coverage of treatment and distinguishes, among others, between frontal crossbites (M4) (Fig. 1) and unilateral crossbites (K4) (Fig. 2). If a patient exhibited both a lateral and a frontal crossbite, the images were labelled as frontal crossbite (M4 instead of K4), in line with the KIG classification (Fig. 3). The labelling was initially done by S.V. and independently repeated by M.S. without any disagreements.

Fig. 1 Frontal crossbite (KIG M4)

Fig. 2 Unilateral (lateral) crossbite (KIG K4)

Fig. 3 Combination of a frontal crossbite and a lateral crossbite. Note: according to the malocclusion category system (Kieferorthopädische Indikationsgruppe, KIG), this image would be classified as M4 (frontal crossbite) and not K4 (lateral crossbite)

Preprocessing of the dataset

All preprocessing was performed using the PyTorch 2.0.1 framework (The Linux Foundation, San Francisco, CA, USA) for Python 3.10.12. For training and testing of the models, 10% of the data was randomly split off and used only for testing, whereas the remaining 90% of the images were used for training and validation. All images were resized to 224 × 224 or 299 × 299 pixels to satisfy the respective model’s input requirements.
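The split and resizing can be sketched with PyTorch’s built-in utilities as follows; the directory path and the fixed random seed are illustrative assumptions, not details reported in the study.

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

# Resize to the input size the chosen model expects
# (224 x 224 for ResNet/MobileNet/DenseNet/EfficientNet-B0, 299 x 299 for Xception).
resize = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Hypothetical directory layout with one subfolder per label.
dataset = datasets.ImageFolder("data/intraoral_photos", transform=resize)

# Hold out 10% for testing; the remaining 90% is used for
# training and validation (10-fold cross-validation).
n_test = int(0.1 * len(dataset))
n_trainval = len(dataset) - n_test
generator = torch.Generator().manual_seed(42)  # illustrative seed for a reproducible split
trainval_set, test_set = random_split(dataset, [n_trainval, n_test], generator=generator)
```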

To enhance the performance of the deep learning models with a limited number of original samples and to avoid overfitting, data augmentation was applied dynamically during the training process: each time an image was loaded during training, it was randomly modified using specified transformations. These included random horizontal flips, rotations of up to 20°, and brightness adjustments of up to 20%. This dynamic augmentation ensured that the model encountered varied versions of the images throughout training and thus learned to generalize from the underlying patterns in the data, improving its generalization capability without increasing the actual number of images in the dataset. The process was repeated across all folds during the k-fold cross-validation.
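A minimal sketch of such an on-the-fly augmentation pipeline using torchvision is given below; the flip probability and the exact parameterization of the brightness jitter go beyond what is stated above and are assumptions.

```python
from torchvision import transforms

# Applied on-the-fly during training, so each epoch sees
# randomly perturbed versions of the same photographs.
train_transforms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(p=0.5),   # random horizontal flip (assumed p=0.5)
    transforms.RandomRotation(degrees=20),    # rotations of up to 20 degrees
    transforms.ColorJitter(brightness=0.2),   # brightness adjustments of up to 20%
    transforms.ToTensor(),
])
```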

Classification models

Neural networks are a set of algorithms designed to recognize patterns. They process input data such as images through layers of artificial neurons, inspired by the neurons of the human brain. Convolutional neural networks (CNNs), a type of neural network, are increasingly used in medical image diagnostics for tasks such as detection, segmentation, and classification of anatomical structures. To classify the occlusions, we used several CNN models that have previously been applied successfully in other image classification studies [13].

ResNet18 and ResNet50

The ResNet architecture, introduced by He et al. [14], is built from residual blocks stacked on top of each other. It incorporates skip connections, which enable the network to bypass certain layers, and integrates batch normalization between layers, making the training process more stable and faster. ResNet has several variants differing in the number of neural network layers; this study applies ResNet18 and ResNet50, with 18 and 50 layers, respectively.
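As an illustration of the residual idea (not the study’s implementation), a minimal basic block with an identity skip connection and batch normalization might look as follows in PyTorch:

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Simplified ResNet basic block: two 3x3 convolutions with batch
    normalization, plus a skip connection that adds the input to the output."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x                      # skip connection bypasses the two conv layers
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)  # residual addition
```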

MobileNet

Howard et al. developed MobileNet as an architecture for applications where computational resources and processing time are limited. Its key innovation is the use of depth-wise separable convolutions instead of the standard convolutions used in many other neural networks. This approach factorizes the standard convolution into two distinct steps: a depth-wise convolution, which filters each input channel separately, and a point-wise convolution, which combines the filtered results using a 1 × 1 filter [15].
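For illustration only, a minimal PyTorch sketch of a depth-wise separable convolution:

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depth-wise separable convolution: a per-channel (depth-wise) 3x3
    filter followed by a 1x1 point-wise convolution that mixes channels."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # groups=in_channels applies one filter per input channel (depth-wise step).
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   padding=1, groups=in_channels, bias=False)
        # 1x1 convolution combines the filtered channels (point-wise step).
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```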

Xception

The Xception model, a deep learning architecture proposed by Chollet, is inspired by the Inception architecture. Xception replaces the Inception modules with depth-wise separable convolutions to filter and combine information more efficiently. It combines these depth-wise separable convolutions with residual connections, which help the network learn by allowing information to skip certain layers [16].
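Combining the two ingredients above, an Xception-style block can be sketched as follows; this is an illustrative simplification, not the actual Xception architecture:

```python
import torch
import torch.nn as nn

class XceptionStyleBlock(nn.Module):
    """Sketch of the Xception idea: a stack of depth-wise separable
    convolutions wrapped in a residual (skip) connection."""

    def __init__(self, channels: int):
        super().__init__()
        def sep_conv():
            return nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=1,
                          groups=channels, bias=False),                    # depth-wise
                nn.Conv2d(channels, channels, kernel_size=1, bias=False),  # point-wise
                nn.BatchNorm2d(channels),
            )
        self.body = nn.Sequential(sep_conv(), nn.ReLU(inplace=True), sep_conv())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)  # residual connection lets information skip the block
```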

DenseNet

Developed by Huang et al., DenseNet is a deep convolutional neural architecture with dense connections among its units, where each layer connects directly to every subsequent layer in a feed-forward manner [17]. Instead of simply passing information from one layer to the next, each layer receives inputs from all preceding layers and passes its own output to all subsequent layers.
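For illustration, a simplified dense block with this connectivity pattern might be written as follows (the layer composition is simplified relative to the original architecture):

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Sketch of DenseNet connectivity: each layer receives the concatenated
    outputs of all previous layers and appends its own feature maps."""

    def __init__(self, in_channels: int, growth_rate: int, num_layers: int):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            # Layer i sees the original input plus the output of every earlier layer.
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(in_channels + i * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels + i * growth_rate, growth_rate,
                          kernel_size=3, padding=1, bias=False),
            ))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # inputs from all earlier layers
            features.append(out)                     # output passed to all later layers
        return torch.cat(features, dim=1)
```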

EfficientNet

Introduced by Tan et al. [18], EfficientNet implements a scaling method that uniformly adjusts all three dimensions of a neural network: depth, width, and resolution. In this context, ‘depth’ refers to the number of layers, ‘width’ to the number of channels in each layer, and ‘resolution’ to the input image size. This methodology replaces arbitrary adjustments with a systematic approach, ensuring consistent scaling across all dimensions. It starts from a smaller base model, which is expanded using scaling coefficients predetermined through a grid search. In our implementation, we adopted the EfficientNet-B0 variant [18].
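For reference, the compound scaling rule from the original paper fixes a user-chosen compound coefficient $\phi$ and scales the three dimensions jointly:

```latex
d = \alpha^{\phi}, \qquad w = \beta^{\phi}, \qquad r = \gamma^{\phi},
\qquad \text{subject to } \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2,
\quad \alpha, \beta, \gamma \geq 1
```

where $d$, $w$, and $r$ are the depth, width, and resolution multipliers, and the constants $\alpha$, $\beta$, $\gamma$ are determined by grid search on the base model [18].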

Model training

Model training was performed using PyTorch 2.0.1. In the first step, the models were trained to classify non-crossbite vs. crossbite. In the second step, the models were trained to classify non-crossbite vs. lateral crossbite vs. frontal crossbite. Given the limited sample size of the dataset, we used k-fold cross-validation with k = 10. To adapt the convolutional neural network models to our classification task, we used transfer learning: the network layers of each model were pre-trained on the ImageNet dataset [19], and the last layer (classifier) was replaced with a new output layer matching the number of classes in the respective task (two or three). For the output layer, SoftMax activation was used for all models. The initial learning rate was set to 0.001. The cross-entropy loss function (binary for the two-class and categorical for the three-class task) was used, and the AdamW optimizer was applied to reduce the risk of overfitting [20]. The batch size was 16, and the number of epochs was set to 20 with an early stopping criterion if the validation loss did not improve for three consecutive epochs.
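The described training setup can be sketched as follows. The ResNet18 backbone is shown as an example, and the train_loader/val_loader objects are assumed to come from the preprocessing and cross-validation steps above; note that PyTorch’s CrossEntropyLoss applies the softmax internally, so no explicit SoftMax layer is needed before the loss.

```python
import torch
import torch.nn as nn
from torchvision import models


def build_model(num_classes: int) -> nn.Module:
    """Transfer learning: ImageNet-pre-trained backbone with a new classifier head.
    ResNet18 shown as an example; the other architectures are set up analogously."""
    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, num_classes)  # 2 or 3 classes
    return model


def train_model(model, train_loader, val_loader, device, max_epochs=20, patience=3):
    """Training loop with the stated hyperparameters and early stopping."""
    criterion = nn.CrossEntropyLoss()  # applies softmax internally
    optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)

    best_val_loss, stale = float("inf"), 0
    for epoch in range(max_epochs):
        model.train()
        for images, labels in train_loader:  # DataLoader with batch_size=16
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()

        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for images, labels in val_loader:  # validation fold of the current split
                images, labels = images.to(device), labels.to(device)
                val_loss += criterion(model(images), labels).item()

        if val_loss < best_val_loss:
            best_val_loss, stale = val_loss, 0
        else:
            stale += 1
            if stale >= patience:  # no improvement for three consecutive epochs
                break
    return model
```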

Model evaluation

The models were tested on the remaining 10% of the data that was not included in training. Accuracy, precision, recall (sensitivity), specificity, F1 score, and Cohen’s kappa were calculated, and confusion matrices were determined for each model and task. Additionally, we plotted the Receiver Operating Characteristic (ROC) curve and calculated the corresponding Area Under the Curve (AUC) value.
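The study does not specify its exact evaluation tooling; one common way to compute these metrics is with scikit-learn, as in the following illustrative sketch with toy arrays (binary task shown):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, cohen_kappa_score, confusion_matrix,
                             roc_auc_score)

# Toy data for illustration: y_true = ground-truth labels, y_pred = predicted
# classes, y_score = softmax probability of the positive class.
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])
y_score = np.array([0.1, 0.9, 0.4, 0.2, 0.8])

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)          # sensitivity
f1 = f1_score(y_true, y_pred)
kappa = cohen_kappa_score(y_true, y_pred)

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
specificity = tn / (tn + fp)                   # derived from the confusion matrix

auc = roc_auc_score(y_true, y_score)           # area under the ROC curve
```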
