Estimating individual minimum calibration for deep-learning with predictive performance recovery: An example case of gait surface classification from wearable sensor gait data

Deep learning has proven useful in numerous areas. Nevertheless, practical issues arise in clinical settings because applying machine learning tools to these fields is still relatively new (Miotto et al., 2018; Tobore et al., 2019; Wang et al., 2022; Zemouri et al., 2019). Handling repeated trials from participants is one such issue.

Generally, a dataset is split into training and testing sets: the former is used to train a model, and the latter to assess its generalizability to unseen data points. As such, the testing data must be kept isolated from the training set to avoid contaminating the model with information that should not be available (“data leakage”).

When multiple data points or trials are associated with a single participant, however, how to split the dataset is less clear. For example, in a gait study, participants may be asked to perform walking tasks numerous times. Should different trials from the same participant be allowed to appear in both the training and testing datasets, known as a “random-wise”/“record-wise” (intra-subject) split? Or should all trials from a single participant appear in only one of the two subsets, respecting the principle of unseen data points, known as a “subject-wise” (inter-subject) split?
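To make the distinction concrete, the sketch below contrasts the two splitting strategies using scikit-learn; the arrays, participant identifiers, and group sizes are hypothetical placeholders rather than the dataset analysed in this work.

```python
# Minimal sketch contrasting record-wise and subject-wise splits.
# X, y, and subject_ids below are hypothetical placeholders.
import numpy as np
from sklearn.model_selection import train_test_split, GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))               # 300 trials x 10 features
y = rng.integers(0, 9, size=300)             # e.g., 9 surface classes
subject_ids = np.repeat(np.arange(30), 10)   # 30 participants, 10 trials each

# Record-wise (intra-subject) split: trials from the same participant
# can appear in both the training and the testing sets.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Subject-wise (inter-subject) split: all trials from a participant
# fall entirely in either the training or the testing set.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=subject_ids))
X_tr_sw, X_te_sw = X[train_idx], X[test_idx]
y_tr_sw, y_te_sw = y[train_idx], y[test_idx]
```

With the subject-wise split, a model evaluated on `X_te_sw` has never seen any trial from the test participants, matching the principle of unseen data points described above.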

This issue remains under ongoing discussion (Saeb et al., 2017) and debate (Cao, 2022; Little et al., 2017). As shown by Shah et al. (2022), in the context of surface identification from gait data across multiple participants and trials, evaluation using a random-wise approach led to an over-estimation of the predictive power of models compared to a subject-wise split (F1-scores of 0.96 vs 0.78, respectively). Thus, models trained with some of a participant’s data outperformed those completely naive to that participant. The high evaluation performance of random-wise trained models may lead to overly optimistic assumptions that the model could achieve the same performance when deployed on new, unseen participants.

Training a model on a small sample of data from a new participant, commonly called calibration or transfer learning, may bridge this gap, i.e., achieve acceptable model performance without overfitting to individual participants. Calibration is generally understood as training on a primary dataset and then re-training on a more specific or different dataset to maximize performance. These techniques have been applied to a variety of data in the biomedical literature. Khazem et al. (2021) and Vidaurre et al. (2011) applied transfer learning to brain-computer interface decoding models. Furthermore, Khazem et al. (2021) successfully reduced the calibration training set size by selecting the minimal dataset needed for training a priori; however, the number of calibration trials required appears to depend on the data type. Cano et al. (2022) obtained a 30% increase in F1-score using one trial of cardiovascular data, while Lehmler et al. (2021) achieved a 35% increase in F1-score using five gait cycles of electromyography data.
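As a rough illustration of this calibration workflow (not the specific procedure used in any of the studies cited above), the sketch below fine-tunes a generic Keras classifier, assumed to be already trained on a primary dataset, on a handful of calibration trials from a new participant; `base_model`, `X_cal`, and `y_cal` are hypothetical placeholders.

```python
# Minimal calibration sketch: re-train a primary-trained classifier on a
# few trials from a new participant. base_model, X_cal, and y_cal are
# hypothetical placeholders.
import tensorflow as tf

def calibrate(base_model: tf.keras.Model,
              X_cal, y_cal,
              freeze_feature_layers: bool = True,
              epochs: int = 20) -> tf.keras.Model:
    """Fine-tune a primary-trained model on a small calibration set."""
    # Copy the model so the primary-trained weights remain untouched.
    model = tf.keras.models.clone_model(base_model)
    model.set_weights(base_model.get_weights())

    if freeze_feature_layers:
        # Keep the early (feature-extraction) layers fixed and only update
        # the final classification layers with the calibration trials.
        for layer in model.layers[:-2]:
            layer.trainable = False

    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(X_cal, y_cal, epochs=epochs, verbose=0)
    return model
```

Freezing the feature-extraction layers and using a small learning rate are common choices intended to limit overfitting when only a few calibration trials are available; the exact layers to freeze would depend on the architecture used.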

However, the actual behaviour of calibration and the impact of the number of calibration training trials remain unknown. Thus, this paper aims to investigate the relationship between the number of calibration trials and the prediction accuracy of the corresponding calibrated deep learning model in a classification study based on biomechanical data. We hypothesized that a model’s performance would increase with the number of calibration trials, eventually matching that of a model trained using a random-wise splitting approach. This research could inform the choice of calibration training set size, and the expected calibration behaviour, for models based on data with multiple trials per participant.
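One way to probe this hypothesized relationship is to calibrate with an increasing number of a new participant’s trials and record test performance at each size. The sketch below outlines such a sweep, reusing the hypothetical `calibrate` helper from the previous sketch; `X_new` and `y_new` stand in for a held-out participant’s trials and labels.

```python
# Sketch of the hypothesized experiment: calibrate with an increasing
# number of a new participant's trials and track test accuracy.
# base_model, calibrate, X_new, and y_new are the hypothetical
# placeholders introduced in the previous sketch.
def calibration_curve(base_model, X_new, y_new,
                      max_trials: int = 10, test_fraction: float = 0.5):
    n_test = int(len(X_new) * test_fraction)
    X_test, y_test = X_new[:n_test], y_new[:n_test]    # held-out trials
    X_pool, y_pool = X_new[n_test:], y_new[n_test:]    # calibration pool

    accuracies = []
    for n in range(1, min(max_trials, len(X_pool)) + 1):
        model = calibrate(base_model, X_pool[:n], y_pool[:n])
        _, acc = model.evaluate(X_test, y_test, verbose=0)
        accuracies.append((n, acc))
    return accuracies
```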
