Enhancing oral squamous cell carcinoma detection: a novel approach using improved EfficientNet architecture

This section addresses the details of the dataset and proposed methodology.

Dataset

The dataset comprised 1224 images from 230 patients, divided into two sets acquired at different magnifications. The first set consisted of 439 OSCC images at 100x magnification and 89 histopathological images of normal oral cavity epithelium. The second set consisted of 495 histopathological images of OSCC tissue at 400x magnification and 201 images of normal oral cavity epithelium. In total, 934 malignant (OSCC) images and 290 normal (benign) oral cavity epithelium images were obtained. Medical professionals collected, processed, and cataloged the H&E-stained tissue slides, and the images were captured using a Leica ICC50 HD microscope [33]. Histopathological images of oral squamous cell carcinoma samples are presented in Fig. 2.

Fig. 2 Sample of oral squamous cell histopathological images: (a) benign, (b) malignant

Proposed methodology

This research proposes the detection of OSCC from histopathological images. The methodology comprises three phases. In the first phase, 17 pretrained CNN models were evaluated for OSCC detection. Each CNN model was executed 30 times individually to assess its reliability, and the results of each execution were recorded using seven parametric measures.

In the second phase, the statistical analysis was carried out in two steps. First, Duncan's multiple range test was performed, and the best-performing model was chosen. Second, the Wilcoxon signed-rank test was conducted, with the high-performing model selected by the Duncan test serving as the reference. The seven parametric measures of the reference model were then compared with those of the other 16 CNN models to determine the superior model (a minimal sketch of this comparison is given below). In this analysis, the best model was EfficientNetB0, but its accuracy remained below 90%, which is not satisfactory. Hence, we were motivated to improve EfficientNetB0 by modifying its original structure, as illustrated in Fig. 3.
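The following minimal sketch illustrates this pairwise comparison, assuming the 30 per-run accuracies of each model are available as arrays (the values shown are placeholders, not the recorded results):

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
# Hypothetical per-run accuracies over 30 matched runs; replace with recorded results.
ref_acc = rng.normal(0.89, 0.01, 30)    # Duncan-selected reference model
other_acc = rng.normal(0.86, 0.01, 30)  # one of the remaining 16 CNN models

# Paired, non-parametric comparison of the two models over the 30 runs.
stat, p_value = wilcoxon(ref_acc, other_acc)
print(f"Wilcoxon statistic = {stat:.1f}, p = {p_value:.4g}")
```

The same comparison is repeated for each of the seven parametric measures against each competing model.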

Google published EfficientNet in 2019. The baseline network is obtained through a neural architecture search, and compound model scaling is applied to derive a family of models. EfficientNetB0 comprises a convolutional layer, MBconvolution1 and MBconvolution6 layers, a pooling layer, a fully connected layer, and a classification layer.

EfficientNetB0 is a convolutional neural network (CNN) architecture that has gained prominence owing to its efficiency and effectiveness in various computer vision tasks. Below, we outline some of the key strengths of EfficientNetB0 in comparison with other deep learning models.

Scalability: One of the primary strengths of EfficientNetB0 is its scalable architecture, which is achieved through a compound scaling method. This method optimizes the network depth, width, and resolution simultaneously, resulting in models that are both efficient and accurate across a wide range of computational resources.
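For reference, the compound scaling rule introduced with EfficientNet jointly scales the three dimensions with a single coefficient φ:

$$\text{depth: } d = \alpha^{\phi}, \quad \text{width: } w = \beta^{\phi}, \quad \text{resolution: } r = \gamma^{\phi}$$

$$\text{s.t. } \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2, \quad \alpha \geq 1,\ \beta \geq 1,\ \gamma \geq 1$$

where α, β, and γ are constants determined by a small grid search on the B0 baseline, and φ controls how much additional computation is spent.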

Parameter Efficiency: Compared with other deep learning architectures, EfficientNetB0 achieves superior performance while maintaining a relatively small number of parameters. This efficiency is crucial for applications with limited computational resources, making EfficientNetB0 suitable for deployment on various mobile and edge devices.

Transfer Learning Capability: Owing to its effectiveness in learning rich feature representations from images, EfficientNetB0 demonstrates strong transfer learning capabilities. Pretrained versions of EfficientNetB0 on large-scale image datasets, such as ImageNet, can be fine-tuned on smaller datasets with specific tasks, leading to improved performance and faster convergence.
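A minimal fine-tuning sketch is shown below, assuming a torchvision implementation; the two-class head mirrors the benign/malignant task but is illustrative rather than the authors' exact training pipeline.

```python
import torch.nn as nn
from torchvision import models

# Load ImageNet-pretrained EfficientNetB0 and adapt the head to two classes.
weights = models.EfficientNet_B0_Weights.IMAGENET1K_V1
model = models.efficientnet_b0(weights=weights)
in_features = model.classifier[1].in_features    # 1280 for EfficientNetB0
model.classifier[1] = nn.Linear(in_features, 2)  # benign vs. malignant head
```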

State-of-the-art Performance: EfficientNetB0 consistently achieved state-of-the-art performance across benchmark datasets and computer vision tasks, including image classification, object detection, and segmentation. Its superior performance is attributed to its optimized architecture, which balances model complexity and computational efficiency.

Generalization Ability: EfficientNetB0 demonstrates robust generalization ability, meaning that it can effectively learn from limited training data and generalize well to unseen data. This is particularly beneficial for medical imaging tasks in which annotated datasets may be limited or expensive to acquire.

In our study, we employed EfficientNetB0 as the backbone architecture for our deep learning model due to these strengths, aiming to leverage its efficiency and performance for classifying oral epithelial lesions.

Fig. 3 Improved EfficientNet: (a) basic architecture of the improved EfficientNet, (b) details of each block in (a), (c) architecture of MB convolution, (d) architecture of the PAM, (e) architecture of the CAM

The modification of the main architecture of EfficientNetB0 is illustrated in Fig. 3(a), and the layers of each block are illustrated in Fig. 3(b). A dual attention network (DAN) is introduced before the fully connected layer: the features extracted from block 7 are passed through the DAN before pooling. The blocks are MBConvolution blocks, i.e., MBconvolution1 and MBconvolution6. MBconvolution1 is illustrated in Fig. 3(c); MB convolution refers to an inverted mobile bottleneck [34], and MBconvolution6 uses an expansion factor of six (versus one for MBconvolution1). The input histopathological OSCC image size is 300 × 300. The final classification result is obtained by passing the input through the convolution layer, the MBconvolution blocks, the DAN, the pooling layer, the fully connected layer, and the classification layer. A minimal sketch of this arrangement is given below.
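The sketch assumes a PyTorch/torchvision backbone and is an illustration of where the DAN sits, not the authors' exact implementation; the placeholder module is replaced by the PAM + CAM branches sketched after Eqs. (1)–(4).

```python
import torch
import torch.nn as nn
from torchvision import models

class ImprovedEfficientNetB0(nn.Module):
    """EfficientNetB0 backbone with a dual attention network (DAN) inserted
    between the block-7 features and the pooling / fully connected layers."""
    def __init__(self, num_classes=2):
        super().__init__()
        base = models.efficientnet_b0(weights=None)
        self.backbone = base.features      # conv stem + MBconvolution blocks
        # Placeholder: replace nn.Identity() with the DAN (PAM + CAM)
        # sketched after Eqs. (1)-(4) below.
        self.dan = nn.Identity()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(1280, num_classes)

    def forward(self, x):                  # x: (batch, 3, 300, 300) histology image
        f = self.backbone(x)               # block-7 feature maps
        f = self.dan(f)                    # dual attention before pooling
        f = self.pool(f).flatten(1)
        return self.fc(f)

# Quick shape check with a dummy 300 x 300 input.
out = ImprovedEfficientNetB0()(torch.randn(1, 3, 300, 300))
```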

In the DAN, the PAM and CAM run in parallel. The attention mechanism filters out irrelevant information and prioritizes useful information, and the DAN achieves high accuracy by adjusting the relationship between local and global features [35]. Figure 3(d) and (e) depict the PAM and CAM, respectively. The position attention module encodes broader contextual information into local features, improving their representational capability. We now describe the process of adaptively aggregating spatial contexts. As shown in Fig. 3(d), we first feed a local feature A ∈ R^{C×H×W} into a convolution layer to build two new feature maps B and C, where B, C ∈ R^{C×H×W}. They are then reshaped to R^{C×N}, where N = H × W is the number of pixels. Next, we perform matrix multiplication on the transpose of C and B and apply a softmax layer to compute the spatial attention map S ∈ R^{N×N}:

$$s_{ji}=\frac{\exp (B_{i} \cdot C_{j})}{\sum\limits_{i=1}^{N} {\exp (B_{i} \cdot C_{j})}}$$

(1)

where s_{ji} measures the impact of the ith position on the jth position; the higher the correlation between two positions, the more similar their feature representations are.

Meanwhile, we feed feature A into a convolution layer to create a new feature map D ∈ R^{C×H×W}, which is reshaped to R^{C×N}. We then perform matrix multiplication of D and the transpose of S and reshape the result to R^{C×H×W}. Finally, we multiply it by a scale parameter α and perform an elementwise sum with the features A to obtain the final result E ∈ R^{C×H×W}, as shown in Eq. (2):

$$E_{j}=\alpha \sum\limits_{i=1}^{N} {(s_{ji} D_{i})} + A_{j}$$

(2)

where α is initialized to zero and gradually learns to assign greater weight [36]. Equation (2) shows that the resulting feature E at each position is a weighted sum of the features across all positions and the original features. As a result, it has a global contextual view and selectively aggregates contexts according to the spatial attention map. Similar semantic features achieve mutual gains, improving intraclass compactness and semantic consistency.
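Assuming a PyTorch implementation, Eqs. (1) and (2) can be sketched as follows; the 1×1 convolutions and the reduction of B and C to C/8 channels are illustrative choices rather than values stated above.

```python
import torch
import torch.nn as nn

class PositionAttention(nn.Module):
    """Position attention module (PAM) following Eqs. (1)-(2)."""
    def __init__(self, in_channels):
        super().__init__()
        self.query = nn.Conv2d(in_channels, in_channels // 8, kernel_size=1)  # B
        self.key = nn.Conv2d(in_channels, in_channels // 8, kernel_size=1)    # C
        self.value = nn.Conv2d(in_channels, in_channels, kernel_size=1)       # D
        self.alpha = nn.Parameter(torch.zeros(1))   # scale, initialised to 0
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, a):                            # a: (batch, C, H, W)
        bsz, c, h, w = a.shape
        n = h * w
        b = self.query(a).view(bsz, -1, n).permute(0, 2, 1)   # (batch, N, C/8)
        c_feat = self.key(a).view(bsz, -1, n)                  # (batch, C/8, N)
        s = self.softmax(torch.bmm(b, c_feat))                 # (batch, N, N), Eq. (1)
        d = self.value(a).view(bsz, -1, n)                      # (batch, C, N)
        out = torch.bmm(d, s.permute(0, 2, 1)).view(bsz, c, h, w)
        return self.alpha * out + a                             # Eq. (2)
```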

We now place emphasis on interdependent feature maps to improve the feature representation of specific semantics. Accordingly, we build a channel attention module to explicitly model channel interdependence. The topology of the channel attention module is depicted in Fig. 3(e). Unlike in the position attention module, the channel attention map X ∈ R^{C×C} is calculated directly from the original features A ∈ R^{C×H×W}. Specifically, we reshape A to R^{C×N} and then perform matrix multiplication on A and its transpose. Finally, a softmax layer is applied to obtain the channel attention map X ∈ R^{C×C}:

$$x_{ji}=\frac{\exp (A_{i} \cdot A_{j})}{\sum\limits_{i=1}^{C} {\exp (A_{i} \cdot A_{j})}}$$

(3)

where x_{ji} is the impact of the ith channel on the jth channel. Furthermore, we perform matrix multiplication on the transpose of X and A and reshape the output to R^{C×H×W}. The result is then multiplied by the scale parameter β, and an elementwise sum operation with A is performed to generate the final output E ∈ R^{C×H×W}, as shown in Eq. (4):

$$E_{j}=\beta \sum\limits_{i=1}^{C} {(x_{ji} A_{i})} + A_{j}$$

(4)

where β is initialized to zero and gradually learns a weight. Equation (4) shows that the final feature of each channel is a weighted sum of the features of all channels and the original features, which models long-range semantic dependencies across feature maps and improves feature discriminability [37].
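Eqs. (3) and (4) admit an analogous sketch (PyTorch assumed):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention module (CAM) following Eqs. (3)-(4)."""
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(1))    # scale, initialised to 0
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, a):                            # a: (batch, C, H, W)
        bsz, c, h, w = a.shape
        a_flat = a.view(bsz, c, -1)                  # (batch, C, N)
        energy = torch.bmm(a_flat, a_flat.permute(0, 2, 1))   # (batch, C, C)
        x = self.softmax(energy)                     # Eq. (3)
        out = torch.bmm(x.permute(0, 2, 1), a_flat).view(bsz, c, h, w)
        return self.beta * out + a                   # Eq. (4)
```

Consistent with the parallel arrangement described earlier, a straightforward fusion for the DAN is an elementwise sum of the PositionAttention and ChannelAttention outputs before the pooling layer; this fused module is what the placeholder in the architecture sketch above would be replaced with.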

We applied 17 pretrained deep learning CNN models—AlexNet, DarkNet19, DarkNet53, DenseNet201, GoogLeNet, InceptionResNetV2, InceptionV3, MobileNetV2, NASNetLarge, NASNetMobile, Xception, ResNet18, ResNet50, ResNet101, VGG16, VGG19, and EfficientNetB0—for OSCC detection. These models were used to categorize benign and malignant cases from oral lesion histopathology images because they have achieved excellent success in various computer vision and medical image analysis challenges. The best model was then chosen and used for further comparison.

In summary, the proposed model was executed as follows.

Step1:

Oral squamous cell carcinoma (OSCC) images were collected from clinical databases or medical institutions.

Step2:

Seventeen pretrained deep learning models were used for the classification of benign and malignant lesions in OSCC images.

Step3:

The performance of each model was evaluated using various metrics, including accuracy, sensitivity, specificity, false positive rate (FPR), precision, F1 score, Matthews correlation coefficient (MCC), kappa, and computational time (a brief sketch of these metric computations appears after this step list).

Step4:

Statistical analysis, specifically Duncan’s multiple range test, was used to determine the best-performing model among the 17 pretrained models.

Step5:

Further validation of the selected model was performed through additional statistical analysis, such as the Wilcoxon signed-rank test, to confirm its superiority.

Step6:

Both statistical tests confirmed that EfficientNetB0 outperformed the other models in terms of classification accuracy and the other evaluation metrics.

Step7:

Enhancements to the EfficientNetB0 model, including the incorporation of a dual attention network (DAN) alongside the inverted mobile bottleneck convolution layers (MBConvolution), were implemented to improve the performance.

Step8:

Sequential execution of the enhanced EfficientNetB0 model on the OSCC image dataset was performed to evaluate its classification performance.

Step9:

The performance of the improved model was assessed using the same set of evaluation metrics to measure any enhancements achieved through the introduction of the dual attention network and MB convolution layers.
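As a brief illustration of the metric computation referenced in Step 3, the following sketch (scikit-learn assumed; the labels are dummy values, with 1 = malignant) derives the listed measures from a confusion matrix:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, matthews_corrcoef, cohen_kappa_score,
                             confusion_matrix)

# Placeholder ground-truth and predicted labels; replace with model outputs.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
metrics = {
    "accuracy":    accuracy_score(y_true, y_pred),
    "sensitivity": recall_score(y_true, y_pred),   # TP / (TP + FN)
    "specificity": tn / (tn + fp),
    "FPR":         fp / (fp + tn),
    "precision":   precision_score(y_true, y_pred),
    "F1":          f1_score(y_true, y_pred),
    "MCC":         matthews_corrcoef(y_true, y_pred),
    "kappa":       cohen_kappa_score(y_true, y_pred),
}
print(metrics)
```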
