Swimtrans Net: a multimodal robotic system for swimming action recognition driven via Swin-Transformer

1 Introduction

Swim motion recognition, as an important research field in motion pattern analysis, holds both academic research value and practical application demand. Swimming is a widely popular sport worldwide (Valdastri et al., 2011). However, in practical training and competitions, capturing and evaluating the technical details of swim motions accurately can be challenging (Colgate and Lynch, 2004). Therefore, utilizing advanced motion recognition techniques for swim motion analysis can not only help athletes optimize training effectiveness and improve performance but also provide scientific evidence in sports medicine to effectively prevent sports injuries. Additionally, swim motion recognition technology can assist referees in making fair and accurate judgments during competitions (Chowdhury and Panda, 2015). Thus, research and development in swim motion recognition not only contribute to the advancement of sports science but also bring new opportunities and challenges to the sports industry.

The initial methods primarily involved swim motion recognition through the use of symbolic AI and knowledge representation. Expert systems, which encode domain experts' knowledge and rules for reasoning and decision-making, are widely used symbolic AI approaches. For example, Feijen et al. (2020) developed an algorithm for online monitoring of swimming training that accurately detects swimming strokes, turns, and different swimming styles. Nakashima et al. (2010) developed a swim motion display system using wrist-worn accelerometer and gyroscope sensors for athlete training. Simulation-based approaches are also effective, as they involve building physical or mathematical models to simulate swim motions for analysis and prediction. Xu (2020) utilized computer simulation techniques, employing ARMA models and Lagrangian dynamics models, to analyze the kinematics of limb movements in swimming and establish a feature model for swim motion analysis. Jie (2016) created a motion model for competitive swim techniques using virtual reality technology and motion sensing devices, enabling swim motion simulation and the development of new swimming modes. Another approach is logistic regression, a statistical method used to analyze the relationship between feature variables and outcomes of swim motions by constructing regression models. Hamidi Rad et al. (2021) employed a single IMU device and logistic regression to estimate performance-related target metrics in various swimming stages, achieving high R2 values and low relative root mean square errors. While these techniques have the benefits of being methodical and easily understandable, they also come with the limitations of needing extensive background knowledge and complex computational requirements.

To address the drawbacks of requiring substantial prior knowledge and high computational complexity in the initial algorithms, data-driven and machine learning-based approaches in swim motion recognition primarily rely on training models with large amounts of data to identify and classify swim motions. These methods offer advantages such as higher generalization capability and automated processing. Decision tree-based methods perform motion recognition by constructing hierarchical decision rules. For example, Fani et al. (2018) achieved a 67% accuracy in classifying freestyle stroke postures using a decision tree classifier. Random forest-based methods enhance recognition accuracy by ensembling multiple decision trees. For instance, Fang et al. (2021) achieved high-precision motion state recognition with an accuracy of 97.26% using a random forest model optimized with Bayesian optimization. Multi-layer perceptron (MLP), as a type of feedforward neural network, performs complex pattern recognition through multiple layers of nonlinear transformations. Na et al. (2011) combined a multi-layer perceptron with a gyroscope sensor to achieve swim motion recognition for target tracking in robotic fish. Nevertheless, these approaches are constrained by their reliance on extensive annotated data, extended model training periods, and possible computational inefficiencies when handling real-time data.

To address the drawbacks of high prior knowledge requirements and computational complexity in statistical and machine learning-based algorithms, deep learning-based algorithms in swim motion recognition primarily utilize techniques such as Convolutional Neural Networks (CNN), reinforcement learning, and Transformers to automatically extract and process complex data features. This approach offers higher accuracy and automation levels. CNN extracts spatial features through deep convolutional layers. For example, Guo and Fan (2022) achieved a classification accuracy of up to 97.48% in swim posture recognition using a hybrid neural network algorithm. Reinforcement learning identifies swim motions by learning effective propulsion strategies. For instance, Gazzola et al. (2014) combined reinforcement learning algorithms with numerical methods to achieve efficient motion control for self-propelled swimmers. Rodwell and Tallapragada (2023) demonstrated the practicality of reinforcement learning in controlling fish-like swimming robots by training speed and path control strategies using physics-informed reinforcement learning. Transformers, with their powerful sequential modeling capability, can effectively process and recognize complex time series data. Alternative approaches have also been explored to overcome the limitations of deep learning models. For example, hybrid models that integrate classical machine learning techniques with deep learning frameworks have been proposed. Athavale et al. (2021) introduced a hybrid system combining Support Vector Machines (SVM) with CNNs to leverage the strengths of both methods, achieving higher robustness in varying swimming conditions. Additionally, edge computing and federated learning have been investigated to address the high computational resource demands, enabling more efficient real-time processing and preserving data privacy (Arikumar et al., 2022). Nevertheless, these techniques come with certain drawbacks such as their heavy reliance on extensive annotated datasets, demanding computational resources, and possible delays in response time for real-time tasks.

To address the issues of high dependency on large labeled datasets, high computational resource requirements, and insufficient response speed in real-time applications, we propose our method: Swimtrans Net - a multimodal robotic system for swimming action recognition driven by Swin-Transformer. By leveraging the powerful visual data feature extraction capabilities of Swin-Transformer, Swimtrans Net effectively extracts swimming image information. Additionally, to meet the requirements of multimodal tasks, we integrate the CLIP model into the system. Swin-Transformer serves as the image encoder for CLIP, and through fine-tuning the CLIP model, it becomes capable of understanding and interpreting swimming action data, learning relevant features and patterns associated with swimming. Finally, we introduce transfer learning for pre-training to reduce training time and lower computational resources, thereby providing real-time feedback to swimmers.

Contributions of this paper:

• Swimtrans Net innovatively integrates Swin-Transformer and CLIP model, offering advanced feature extraction and multimodal data interpretation capabilities for swimming action recognition.

• The approach excels in multi-scenario adaptability, high efficiency, and broad applicability by combining visual data encoding with multimodal learning and transfer learning techniques.

• Experimental results demonstrate that Swimtrans Net significantly improves accuracy and responsiveness in real-time swimming action recognition, providing reliable and immediate feedback to swimmers.

2 Related work

2.1 Action recognition

In modern sports, accurately analyzing and recognizing various postures and actions have become essential for enhancing athlete performance and training efficiency. Deep learning and machine learning models play a crucial role in this process (Hu et al., 2016). Specifically, in swimming, these technologies have made significant advancements. They effectively identify and classify different swimming styles such as freestyle, breaststroke, and backstroke, as well as specific movements like leg kicks and arm strokes. This detailed classification and recognition capability provide valuable training data and feedback for coaches and athletes (Dong et al., 2024). Studying feature extraction and pattern recognition methods for postures and actions is key to improving the accuracy and effectiveness of swimming motion analysis and prediction. Deep learning models can capture subtle motion changes and features by analyzing extensive swimming video data, enabling them to identify different swimming techniques. This helps coaches develop more scientific training plans and provides athletes with real-time feedback and correction suggestions (Wang et al., 2024). Moreover, advancements in wearable devices and sensor technology have made obtaining high-quality motion data easier. These devices can record specific actions and postures, providing rich training data for deep learning models. For instance, high-precision accelerometers and gyroscopes can record athletes' movements in real time, which are then analyzed by deep learning models.

2.2 Transformer models

Transformer models have revolutionized artificial intelligence, demonstrating exceptional performance and versatility across various domains. In natural language processing (NLP), they significantly enhance machine translation, text summarization, question answering, sentiment analysis, and language generation, leading to more accurate and context-aware systems (Hu et al., 2021). In computer vision, Vision Transformers (ViTs) excel in image recognition, object detection, image generation, and image segmentation, achieving state-of-the-art results and advancing fields like medical imaging and autonomous driving. For audio processing, transformers improve speech recognition, music generation, and speech synthesis, contributing to better virtual assistants and transcription services (Lu et al., 2024). In healthcare, transformers assist in medical image analysis, drug discovery, and clinical data analysis, offering precise disease detection and personalized medicine insights. The finance sector benefits from transformers through algorithmic trading, fraud detection, and risk management, enhancing security and decision-making. In gaming and entertainment, transformers generate storylines, dialogues, and level designs, enriching video games and virtual reality experiences. Lastly, in robotics, transformers enable autonomous navigation and human-robot interaction, advancing technologies in autonomous vehicles and drone navigation. Overall, the versatility and power of transformer models drive innovation and efficiency across a multitude of applications, making them indispensable in modern technology (Li et al., 2014).

2.3 Multimodal data fusion

Multimodal Data Fusion focuses on enhancing the analysis and prediction of swimming motions by utilizing data from various sources, such as images, videos, and sensor data (Hu et al., 2018). By integrating data from different modalities, researchers can obtain a more comprehensive and accurate understanding of swimming motions. For instance, combining images with sensor data allows for the simultaneous capture of a swimmer's posture and motion trajectory, leading to more thorough analysis and evaluation (Zheng et al., 2022). This approach can provide detailed insights into the efficiency and technique of the swimmer, which are crucial for performance improvement and injury prevention. Moreover, multimodal data fusion can significantly broaden the scope and capabilities of swimming motion analysis and prediction. It enables the development of advanced models that can interpret complex motion patterns and provide real-time feedback to swimmers and coaches. This, in turn, facilitates the creation of personalized training programs tailored to the individual needs of each swimmer, enhancing their overall performance. Research in this area continues to push the boundaries of what is possible in sports science, promising more sophisticated tools for analyzing and optimizing athletic performance (Nguyen et al., 2016). Overall, the integration of multimodal data represents a significant advancement in the field, offering a richer, more nuanced understanding of swimming motions and contributing to the advancement of sports technology and training methodologies.

3 Methodology

3.1 Overview of our network

This study proposes a deep learning-based method, Swimtrans Net: a multimodal robotic system for swimming action recognition driven via Swin-Transformer, for analyzing and predicting swimming motions. This method combines the Swin-Transformer and CLIP models, leveraging their advantages in image segmentation, feature extraction, and semantic understanding to provide a more comprehensive and accurate analysis and prediction of swimming motions. Specifically, the Swin-Transformer is used to extract and represent features from swimming motion data, capturing the spatial characteristics of the actions. Then, the CLIP model is introduced to understand and interpret the visual information in the swimming motion data, extracting the semantic features and techniques of the actions. Finally, transfer learning is used to apply the pre-trained Swin-Transformer and CLIP models to the swimming motion data, and model parameters are fine-tuned to adapt them to the specific tasks and data of swimming motions.

The workflow proceeds as follows. First, datasets containing swimming motions in the form of videos, sensor data, and similar sources are collected and preprocessed by removing noise, cropping, and annotating action boundaries, preparing them for model training and testing. Next, the Swin-Transformer model extracts and represents features from the swimming motion data, decomposing each frame into small patches and capturing relational information through a self-attention mechanism to effectively extract spatial features. The CLIP model is then introduced: by learning the correspondence between images and text, it performs semantic understanding and reasoning over the image data, helping the system grasp the action features and techniques present in swimming motions. The pre-trained Swin-Transformer and CLIP models are adapted to the swimming motion data through transfer learning and fine-tuning, improving performance in analysis and prediction. Finally, the trained model is evaluated against actual swimming motions and applied to real swimmers and coaches, providing accurate technique evaluations and improvement suggestions.
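To make the pipeline concrete, the following is a minimal, hedged sketch of the dual-encoder architecture described above: a pre-trained Swin-Transformer acts as the image encoder and a small Transformer stands in for the CLIP text branch, with both modalities projected into a shared embedding space. The class name SwimClip, the embedding dimension, and the toy text encoder are illustrative assumptions, not the authors' exact implementation.

```python
# Hedged sketch (not the authors' exact code): a CLIP-style dual encoder with a
# Swin-Transformer image branch. SwimClip, embed_dim, and the toy text encoder
# are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import swin_t, Swin_T_Weights

class SwimClip(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=256):
        super().__init__()
        # Pre-trained Swin-Transformer as the image encoder (transfer learning).
        self.image_encoder = swin_t(weights=Swin_T_Weights.IMAGENET1K_V1)
        self.image_encoder.head = nn.Identity()          # expose 768-d features
        # Minimal text encoder standing in for CLIP's text branch.
        self.token_emb = nn.Embedding(vocab_size, 256)
        layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Projection heads mapping both modalities into a shared embedding space.
        self.img_proj = nn.Linear(768, embed_dim)
        self.txt_proj = nn.Linear(256, embed_dim)

    def forward(self, images, token_ids):
        v = self.img_proj(self.image_encoder(images))              # image embeddings
        t = self.txt_proj(self.text_encoder(self.token_emb(token_ids)).mean(dim=1))
        # L2-normalise so that dot products equal cosine similarities.
        return F.normalize(v, dim=-1), F.normalize(t, dim=-1)

model = SwimClip()
frames = torch.randn(4, 3, 224, 224)         # a mini-batch of swimming frames
tokens = torch.randint(0, 10000, (4, 16))    # toy tokenised stroke descriptions
v, t = model(frames, tokens)
print((v @ t.T).shape)                        # 4x4 image-text similarity matrix
```

During fine-tuning, the image-text similarity matrix is pushed toward a diagonal pattern so that each swimming frame aligns with its own textual description of the stroke.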

The term “robotic system” was chosen to emphasize the integration of advanced machine learning models with automated hardware components, creating a cohesive system capable of autonomous analysis and prediction of swimming motion data. Our system leverages both the Swin-transformer and CLIP models to process and interpret the data, which is then used by the robotic components to provide real-time feedback and analysis to swimmers. By referring to it as a “robotic system,” we aim to highlight the seamless collaboration between software algorithms and physical devices (such as cameras, sensors, and possibly robotic feedback mechanisms) that together perform complex tasks with minimal human intervention. This terminology helps to convey the sophisticated and automated nature of the system, distinguishing it from purely software-based solutions.

3.2 Swin-Transformer model

The Swin-Transformer (Shifted Window Transformer) is an image segmentation and feature extraction model based on self-attention mechanisms, and it plays a crucial role in the swimming motion analysis and prediction method (Tsai et al., 2023). Figure 1 is a schematic diagram of the principle of the Swin-Transformer model.


Figure 1. The swimming action image is input, segmented into small blocks by Swin-Transformer, and the self-attention mechanism is applied to extract features, which are then used for action understanding, semantic extraction and prediction. (A) Architecture. (B) Two successive Swin-Transformer blocks.

The Swin-Transformer leverages self-attention mechanisms to capture the relational information between different regions of an image, enabling image segmentation and feature extraction. Unlike traditional convolutional neural networks (CNNs), which rely on fixed-size convolution kernels, the Swin-Transformer divides the image into a series of small patches and establishes self-attention connections between these patches. Its core idea is to build global perception through a hierarchical, multi-level attention mechanism: window-based self-attention captures the relational information among patches within each local window, while shifted-window self-attention connects neighboring windows so that information propagates across the whole image. This multi-level attention allows the Swin-Transformer to understand the semantics and structure of images at multiple scales. In the context of swimming motion analysis and prediction, the Swin-Transformer model plays a crucial role in extracting and representing features from swimming motion data. By decomposing the swimming motion data into small patches and applying the self-attention mechanism, it captures the relational information between different parts of the swimming motion and extracts spatial features of the motion. These features are then used for subsequent tasks such as motion understanding, semantic extraction, and prediction, enabling accurate analysis and prediction of swimming motions (Figure 2).

Patch Embeddings: X = Reshape(Conv2D(I))    (1)

Figure 2. Schematic diagram of the calculation process of Formulas 1-7.

The patch embeddings operation takes an input image I and applies a convolutional operation to extract local features. The resulting feature map is then reshaped to obtain a sequence of patch embeddings X.
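As a concrete illustration of Equation 1, the snippet below builds patch embeddings with a strided convolution followed by a reshape; the 4x4 patch size and 96 channels are assumed Swin-T defaults rather than values stated in the paper.

```python
# Minimal sketch of Equation 1: patch embedding via a strided convolution plus a
# reshape. The 4x4 patch size and 96 channels are assumed Swin-T defaults.
import torch
import torch.nn as nn

patch_embed = nn.Conv2d(3, 96, kernel_size=4, stride=4)     # Conv2D(I)
I = torch.randn(1, 3, 224, 224)                              # one swimming frame
X = patch_embed(I).flatten(2).transpose(1, 2)                # Reshape -> patch sequence
print(X.shape)                                                # torch.Size([1, 3136, 96])
```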

Absolute Position Embeddings:P=PositionEmbeddings(X)    (2)

The absolute position embeddings operation generates a set of learnable position embeddings P that encode the absolute position information of each patch in the sequence.

Transformer Encoder Layers: Y = SwinBlock(X, P)    (3)

The Swin-Transformer encoder layers, implemented as SwinBlocks, take the patch embeddings X and the absolute position embeddings P as inputs. These layers apply self-attention and feed-forward neural networks to enhance the local and global interactions between patches, resulting in the transformed feature representations Y.

Patch Merging: Z = PatchMerging(Y)    (4)

The patch merging operation combines neighboring patches in the transformed feature map Y to obtain a lower-resolution feature map Z. This helps capture long-range dependencies and reduces computational complexity.

Transformer Encoder Layers (on merged patches): O = SwinBlock(Z, P)    (5)

The Swin-Transformer encoder layers are applied again, this time on the merged patch embeddings Z using the same absolute position embeddings P. This allows further refinement of the feature representations, considering the interactions between the merged patches.

Reverse Patch Merging:U=ReversePatchMerging(O)    (6)

The reverse patch merging operation restores the feature map resolution by reversing the patch merging process, resulting in the refined high-resolution feature map U.

Output Classification:C=Classify(U)    (7)

Finally, the high-resolution feature map U is fed into a classification layer to obtain the output classification probabilities C.
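The full chain of Equations 1-7 can be summarized in a compact forward pass. The sketch below is illustrative only: SwinBlock is approximated by a standard Transformer encoder layer (window partitioning and shifting are omitted), patch merging groups neighboring tokens rather than 2x2 spatial windows, and the four output classes are a placeholder for stroke categories.

```python
# Illustrative skeleton of Equations 1-7 (a sketch of the description above, not
# the authors' code). SwinBlock is approximated by a plain Transformer encoder
# layer; windowed attention and spatial 2x2 merging are simplified away.
import torch
import torch.nn as nn

class SwinLikeClassifier(nn.Module):
    def __init__(self, dim=96, num_classes=4):                  # e.g. 4 stroke types
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=4, stride=4)              # Eq. 1
        self.pos_embed = nn.Parameter(torch.zeros(1, 56 * 56, dim))                # Eq. 2
        self.block1 = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)   # Eq. 3
        self.merge = nn.Linear(4 * dim, 2 * dim)                                   # Eq. 4
        self.block2 = nn.TransformerEncoderLayer(2 * dim, nhead=4, batch_first=True)  # Eq. 5
        self.unmerge = nn.Linear(2 * dim, 4 * dim)                                 # Eq. 6
        self.head = nn.Linear(dim, num_classes)                                    # Eq. 7

    def forward(self, img):
        x = self.patch_embed(img).flatten(2).transpose(1, 2)   # X: (B, 3136, 96)
        y = self.block1(x + self.pos_embed)                    # Y
        B, L, C = y.shape
        z = self.merge(y.reshape(B, L // 4, 4 * C))            # Z: merged, lower resolution
        o = self.block2(z)                                      # O
        u = self.unmerge(o).reshape(B, L, C)                    # U: restored resolution
        return self.head(u.mean(dim=1))                         # C: class logits

logits = SwinLikeClassifier()(torch.randn(2, 3, 224, 224))
print(logits.shape)                                             # torch.Size([2, 4])
```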

By introducing the Swin-Transformer model, the swimming motion analysis method can better utilize the spatial information of image data, extracting richer and more accurate feature representations. This helps to improve the performance of swimming motion analysis and prediction, providing swimmers and coaches with more accurate technical evaluations and improvement guidance.

3.3 CLIP

CLIP (Contrastive Language-Image Pretraining) (Kim et al., 2024a) is a model designed for image and text understanding based on contrastive learning, playing a critical role in swimming motion analysis and prediction methods (shown in Figure 3). The model achieves cross-modal semantic understanding and reasoning by learning the correspondence between images and text through a unified embedding space. This capability allows the model to effectively interpret and predict swimming motions by leveraging both visual and textual information, enhancing the accuracy and robustness of the analysis.


Figure 3. The image is encoded into a vector through Swin Transformer, and the text is converted into a vector through the text encoder. After being fused through the ITC, ITM, and LM modules, the alignment and generation of the image and text are achieved.

This space allows for measuring the similarity between images and text, enabling a combined representation of visual and semantic information. The image encoder utilizes a Swin Transformer to convert input images into vector representations, extracting features through several layers of self-attention and feed-forward operations, and mapping these features into vector representations in the embedding space. The text encoder processes input text into vector representations using self-attention mechanisms and feed-forward networks to model semantic relationships within the text. The Image-Text Contrastive (ITC) module aligns the image and text representations within the embedding space, ensuring that corresponding image-text pairs are closely positioned while non-matching pairs are far apart. The Image-Text Matching (ITM) module fine-tunes this alignment by incorporating cross-attention mechanisms, enhancing the model's ability to match images with their corresponding textual descriptions. The Language Modeling (LM) module uses image-grounded text encoding and decoding mechanisms, leveraging cross-attention and causal self-attention to generate text based on the given image, thereby enhancing the model's language generation capabilities with visual context. In the swimming motion analysis and prediction method, the model interprets visual information from swimming motion data by converting these visual features into vector representations within the embedding space. Textual descriptions of swimming techniques are similarly processed by the text encoder. This unified representation of visual and semantic information facilitates the analysis and prediction of swimming motions. By comparing the vector representation of a swimmer's actions with those of standard techniques or known movements, the model can assess the swimmer's technical level and provide suggestions for improvement. This is achieved by measuring the similarity between image and text vectors in the embedding space, enabling semantic understanding and reasoning of swimming actions.

The CLIP component comprises three modules:

• ITC (Image-Text Contrastive Learning): the ITC module performs contrastive learning between images and text. By comparing the output features of the image encoder and the text encoder, it aligns images and text in the embedding space, thereby achieving cross-modal contrastive learning.

• ITM (Image-Text Matching): the ITM module handles image-text matching tasks. It fuses image and text features through bi-directional self-attention (Bi Self-Att) and cross-attention mechanisms to determine whether an image and a text match, enhancing the model's cross-modal understanding ability.

• LM (Language Modeling): the LM module handles language modeling tasks. It generates text descriptions conditioned on the contextual information provided by the image encoder through a causal self-attention (Causal Self-Att) mechanism, enhancing the model's text generation ability.

Each module in the diagram consists of self-attention and feed-forward (Feed Forward) networks and implements its specific function through a different attention mechanism (such as cross-attention or bi-directional self-attention). These modules work together to complete the joint modeling of images and texts, improving the performance of the model in swimming motion analysis and prediction tasks.

Image Encoder: v = Encoder_image(I)    (8)

The image encoder operation takes an input image I and applies an encoder function Encoder_image to obtain the corresponding image embedding vector v.

Text Encoder: t = Encoder_text(text)    (9)

The text encoder operation takes an input text and applies an encoder function Encoder_text to obtain the corresponding text embedding vector t.

Similarity Score:score=CosineSimilarity(v, t)    (10)

The similarity score operation calculates the cosine similarity between the image embedding vector v and the text embedding vector t. This score represents the similarity or compatibility between the image and the text.

Optimization Objective:L=-log(score)    (11)

The optimization objective is defined as the negative logarithm of the similarity score. The goal is to maximize the similarity score, which corresponds to minimizing the loss L.
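A small numerical sketch of Equations 8-11 follows, using randomly initialized embeddings in place of real encoder outputs. Because the cosine similarity in Equation 10 can be negative, the score is rescaled to (0, 1] before the logarithm so that the objective in Equation 11 remains defined; in practice CLIP optimizes a symmetric cross-entropy over a batch of image-text pairs.

```python
# Sketch of Equations 8-11 with random stand-ins for the encoder outputs.
# The rescaling of the cosine score is an assumption made for numerical validity.
import torch
import torch.nn.functional as F

v = F.normalize(torch.randn(8, 256), dim=-1)      # image embeddings, Eq. 8
t = F.normalize(torch.randn(8, 256), dim=-1)      # text embeddings, Eq. 9

score = F.cosine_similarity(v, t, dim=-1)         # Eq. 10, one score per pair
loss = -torch.log((score + 1) / 2 + 1e-8).mean()  # Eq. 11, scores mapped to (0, 1]
print(score.shape, loss.item())
```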

CLIP leverages this framework to enable cross-modal understanding and reasoning between images and text, making it a powerful tool for tasks such as image-text retrieval, image classification based on textual descriptions, and more. By incorporating the CLIP model, the swimming motion analysis method can better utilize the semantic relationships between image and text data, extracting richer and more accurate action features. This helps to improve the performance of swimming motion analysis and prediction, providing swimmers and coaches with more accurate technical evaluations and improvement guidance.

3.4 Transfer learning

Transfer learning (Manjunatha et al., 2022) is a machine learning method that involves applying a model trained on a large-scale dataset to a new task or domain. The fundamental principle of transfer learning is to utilize the knowledge already learned by a model (Zhu et al., 2021), transferring the experience gained from training on one task to another related task. This accelerates the learning process and improves performance on the new task.

Figure 4 is a schematic diagram of the principle of Transfer Learning.


Figure 4. A schematic diagram of the principle of Transfer Learning.

In traditional machine learning, training a model requires a large amount of labeled data and computational resources. However, obtaining large-scale labeled data and training a complex model is often very expensive and time-consuming. This is why transfer learning has become highly attractive. By using a pre-trained model, we can leverage the parameters learned from existing data and computational resources, thereby quickly building and optimizing models for new tasks with relatively less labeled data and computational resources. The method illustrated in the image applies transfer learning to provide initial model parameters or assist in training the new task by transferring already learned feature representations and knowledge. There are several ways this can be done: using a pre-trained model as a feature extractor, where the initial layers learn general feature representations and the later layers are fine-tuned; fine-tuning the entire pre-trained model to optimize it on the new task's dataset; and domain adaptation, which adjusts the model's feature representation to better fit the new task's data distribution. The diagram demonstrates the use of a Swin-Transformer in conjunction with two models, highlighting the flow of data and the stages where transfer learning is applied. The Swin-Transformer acts as a central component, facilitating the transfer of learned features and knowledge between the pre-trained and trainable components of the models, ultimately optimizing performance for new tasks.
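The first two transfer modes mentioned above, feature extraction with frozen layers and full fine-tuning, can be expressed in a few lines. The torchvision Swin-T checkpoint and the four-class head below are assumptions for illustration; the paper does not name a specific pre-trained checkpoint.

```python
# Hedged sketch of the two main transfer-learning modes described above.
# The torchvision Swin-T checkpoint and the 4-class head are illustrative.
import torch.nn as nn
from torchvision.models import swin_t, Swin_T_Weights

backbone = swin_t(weights=Swin_T_Weights.IMAGENET1K_V1)

# (a) Feature extractor: freeze the pre-trained layers, train only a new head.
for p in backbone.parameters():
    p.requires_grad = False
backbone.head = nn.Linear(768, 4)        # new head for, e.g., four stroke classes

# (b) Full fine-tuning: unfreeze everything and train with a small learning rate.
for p in backbone.parameters():
    p.requires_grad = True
```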

θ′ = argmin_{θ′} L(θ′, D_target)    (12)

In this formula, θ′ represents the model parameters of the new task, L represents the loss function, and D_target represents the dataset of the new task.

θ′ = argmin_{θ′} [λ L_source(θ′, D_source) + (1 - λ) L_target(θ′, D_target)]    (13)

This is the transfer learning formula when training with the source domain dataset D_source and the target domain dataset D_target. λ is a hyperparameter that weighs the loss of the source domain against that of the target domain, and L_source and L_target represent the loss functions of the source domain and the target domain, respectively.

In Equation 11, the optimization objective is defined as the negative logarithm of the similarity score; maximizing the similarity score corresponds to minimizing the loss L, so L there is a general loss used to align images and text. In Equation 13, L_source and L_target denote the loss functions on the source data and the target data, used to optimize the model on the source domain and the target domain, respectively. L therefore appears in both places to describe loss functions in different contexts: one is the general similarity-score loss, and the other is the domain-specific loss applied to the source and target data.

θ′ = argmin_{θ′} [λ L_pretrain(θ′, D_pretrain) + (1 - λ) L_target(θ′, D_target)]    (14)

This is the transfer learning formula when initializing from pre-trained model parameters and training with the pre-training dataset D_pretrain and the target domain dataset D_target. L_pretrain represents the loss function associated with the pre-trained model.

In these formulas, argmin represents the model parameter θ′ that minimizes the loss function. By minimizing the loss function, we can optimize the model parameters of the new task to better fit the data distribution of the target domain.
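As a sketch of how Equations 13 and 14 translate into a training step, the function below combines a source-domain (or pre-training) loss and a target-domain loss with the weighting factor λ. The λ value of 0.3 and the cross-entropy criterion are illustrative assumptions.

```python
# Sketch of one optimisation step for the weighted objective in Eq. 13/14.
# The lambda value and the cross-entropy criterion are illustrative assumptions.
import torch.nn as nn

def transfer_step(model, optimizer, source_batch, target_batch, lam=0.3):
    criterion = nn.CrossEntropyLoss()
    xs, ys = source_batch
    xt, yt = target_batch
    loss_source = criterion(model(xs), ys)               # L_source(theta', D_source)
    loss_target = criterion(model(xt), yt)               # L_target(theta', D_target)
    loss = lam * loss_source + (1 - lam) * loss_target   # Equation 13
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```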

4 Experiment

4.1 Datasets

This article uses four datasets (Table 1): the PKU-MMD, Sports-1M, UCF101, and Finegym datasets.

• PKU-MMD Dataset (Liu et al., 2017): PKU-MMD is a large-scale dataset for continuous multi-modality 3D human action understanding. It contains over 1,000 action sequences and covers a wide range of actions performed by different subjects. Usage: pre-training models on a variety of human motions, providing a robust foundation for understanding and recognizing complex swimming actions.

• Sports-1M Dataset (Li et al., 2021): Sports-1M is a large-scale video dataset with over one million YouTube sports videos categorized into 487 sports labels, providing a diverse set of sports-related video clips. Usage: initial training of video recognition models, leveraging the vast diversity of sports actions to enhance the model's generalization capabilities for swimming motion analysis.

• UCF101 Dataset (Safaei et al., 2020): UCF101 is an action recognition dataset of realistic action videos collected from YouTube, containing 101 action categories, and is widely used for action recognition tasks. Usage: fine-tuning models on action recognition tasks, specifically targeting the accurate recognition and classification of swimming strokes and techniques.

• Finegym Dataset (Shao et al., 2020): Finegym is a fine-grained action recognition dataset focused on high-quality annotated videos of gymnastic routines. Usage: further fine-tuning models to recognize and differentiate subtle differences in motion techniques, which is critical for detailed swimming motion analysis.


Table 1. Description and usage of datasets.

4.2 Experimental details

This experiment utilizes 8 NVIDIA A100 GPUs for training, together with an Intel Xeon Platinum 8268 CPU and 1 TB of RAM, and is implemented in the PyTorch framework with CUDA acceleration. The objective is to compare the performance of various models on metrics such as Training Time, Inference Time, Parameters, FLOPs, Accuracy, AUC, Recall, and F1 Score, and to conduct ablation experiments exploring the impact of different factors on model performance.

The PKU-MMD, Sports-1M, UCF101, and Finegym datasets are selected for the experiment. Several classical and recent models are chosen for comparison, and all models are trained and evaluated on the same tasks. During training, each model uses a batch size of 32, an initial learning rate of 0.001, the Adam optimizer, and 100 training epochs.

In the comparative experiments, the training time of each model is recorded. The trained models then perform inference on the dataset, with the inference time for each sample recorded and the average inference time calculated. The number of parameters of each model is counted and its floating-point operations (FLOPs) are estimated. Each model's performance on the test set is evaluated using Accuracy, AUC, Recall, and F1 Score.

In the ablation experiments, the impact of different factors on performance is explored. First, the impact of different model architectures is assessed by using different architectures or components for the same task and comparing their performance. Second, the impact of data augmentation is assessed by training a model with and without augmentation. Third, different learning rate settings are compared and the performance changes recorded. Finally, the impact of regularization is examined by training a model with and without regularization terms. Based on the experimental results, the performance differences of the models across metrics are compared and the ablation results are analyzed, providing insight into the strengths and weaknesses of each model and the key factors influencing performance.
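A minimal training-loop sketch using the stated hyperparameters (batch size 32, Adam, initial learning rate 0.001, 100 epochs) is given below; the dataset object and the model are placeholders, and only the hyperparameters are taken from the text.

```python
# Sketch of the stated training configuration. Dataset loading and the model
# class are placeholders; only the hyperparameters come from the experiment text.
import torch
from torch.utils.data import DataLoader

def train(model, train_set, device="cuda"):
    loader = DataLoader(train_set, batch_size=32, shuffle=True, num_workers=8)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = torch.nn.CrossEntropyLoss()
    model.to(device).train()
    for epoch in range(100):
        for frames, labels in loader:
            frames, labels = frames.to(device), labels.to(device)
            loss = criterion(model(frames), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```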

To enhance the robustness of our system in handling noise and outlier data, we utilized Bayesian Neural Networks (BNNs), which introduce probability distributions over model parameters to better deal with uncertainty and noise. We employed Bayesian inference methods such as Variational Inference and Markov Chain Monte Carlo (MCMC) to approximate the posterior distribution. These methods enable our model to effectively learn and update parameter distributions, thus adapting better to noise and outlier data in practical applications. Furthermore, through Bayesian learning, we can quantify uncertainty in predictions, helping us identify high-uncertainty predictions and dynamically adjust the model during training to mitigate the impact of noisy and outlier data.
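As a lightweight illustration of uncertainty quantification, the sketch below uses Monte Carlo dropout, repeated stochastic forward passes whose spread approximates predictive uncertainty. This is a stand-in chosen for brevity; it is not the variational inference or MCMC procedure described above.

```python
# Illustrative only: Monte Carlo dropout as a cheap approximation of Bayesian
# predictive uncertainty (not the paper's variational/MCMC procedure).
import torch

def mc_predict(model, x, passes=20):
    model.train()   # keep dropout layers stochastic at inference time
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(passes)])
    return probs.mean(0), probs.var(0)   # predictive mean and per-class uncertainty
```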
