Objective. Brain–computer interfaces (BCIs) face a significant challenge due to variability in electroencephalography (EEG) signals across individuals. While recent approaches have focused on standardizing input signal distributions, we propose that aligning distributions in the deep learning model's feature space is more effective for classification. Approach. We introduce the Latent Alignment method, which won the Benchmarks for EEG Transfer Learning competition. This method can be formulated as a deep set architecture applied to trials from a given subject, introducing deep sets to EEG decoding for the first time. We compare Latent Alignment to recent statistical domain adaptation techniques, carefully considering class-discriminative artifacts and the impact of class distributions on classification performance. Main results. Our experiments across motor imagery, sleep stage classification, and P300 event-related potential tasks validate Latent Alignment's effectiveness. We identify a trade-off between improved classification accuracy when alignment is performed at later modeling stages and increased susceptibility to class imbalance in the trial set used for statistical computation. Significance. Latent Alignment offers consistent improvements to subject-independent deep learning models for EEG decoding when relevant practical considerations are addressed. This work advances our understanding of statistical alignment techniques in EEG decoding and provides insights for their effective implementation in real-world BCI applications, potentially facilitating broader use of BCIs in healthcare, assistive technologies, and beyond. The model code is available at https://github.com/StylianosBakas/LatentAlignment
Subject-independent electroencephalography (EEG) decoding models have long been a sought-after goal for enabling the translation of brain–computer interfaces (BCIs) from research into practice. Previous systems relied on extensive calibration sessions for each new subject, which is a tedious process and presents a significant hurdle for more widespread use [1, 2].
The need for new developments in EEG subject adaptation spurred the Benchmarks for EEG Transfer Learning (BEETL) competition at the Conference on Neural Information Processing Systems (NeurIPS) in 2021, which aimed to identify novel approaches for addressing subject-independence and training across heterogeneous datasets [3]. This paper presents an extensive analysis of the Latent Alignment method of feature standardization in deep learning models that we developed for the winning competition entry. The proposed method is compared to other statistical alignment techniques with no requirement for labeled calibration examples and of comparable implementation simplicity. Since other techniques, such as fine-tuning, are only applicable in different contexts, they will not be considered here.
Performing feature standardization represents a common class of methods for unifying covariate shifts between subjects that rely on statistical distribution estimates computed on a context set of trials. These are most often applied on the input signal [4–6], but the personalized standardization of classification features has also been performed [7, 8]. We expect that aligning within the latent space will improve accuracy, because the features therein will be more relevant to the classification task.
Considering the practical application of such approaches, we assume that an unbalanced class distribution would reduce their effectiveness, as it will impact the statistical estimates computed on the trial context set. This effect is expected to be more pronounced if alignment is performed in later stages of the model, where the distributions of different classes are more linearly separable. Aiming to mitigate this problem of class imbalance in the context set, we hypothesize that training a model on randomized class distributions, while applying statistical alignment on the latent model stages, reduces the susceptibility to class imbalance encountered during inference.
All top three teams in the BEETL competition relied on deep learning approaches, which promise better transferability across subjects due to their increased capacity. A recent review by Craik et al found that a majority of studies using deep learning on EEG did not mention or chose not to perform explicit artifact removal [9]. This is likely due to the assumption that such models are more robust to noisy signals. We caution however that omitting the artifact removal step can lead deep learning models to associate class-discriminative artifacts, such as saccadic eye movements following the cue presentation, with the classification label [10]. Besides ocular activity, visual evoked potentials elicited by the cue presentation have also been shown to spuriously increase classification performance [11]. Appropriate steps need to be taken to ensure uncontaminated results.
Interestingly, understanding Latent Alignment as a permutation equivariant function applied on the context set of trials from a given subject allows us to reformulate the resulting deep learning model as a deep set architecture [12]. To the best of our knowledge this is the first time the concept of deep sets is encountered in the literature on EEG decoders. The perspective of utilizing the additional information contained in the context set of trials, rather than processing each trial independently, holds promise for improved EEG decoding performance.
In summary, the contributions of this paper are as follows:
- We introduce the Latent Alignment approach for subject-independent models and its formulation as a deep set.
- We show the impact of class-discriminative artifacts when training subject-independent deep learning models on EEG data.
- We study the impact of class imbalance in relation to the latent space that statistical adaptation techniques are executed within.

A number of previous works have suggested the use of deep learning models trained across subjects on relatively large EEG datasets in order to overcome the problem of inter-subject variability [13–16]. While they show promising results, the limited availability of EEG data and the large distribution shifts between subjects result in further performance benefits that can be obtained by personalising a subject-independent model with statistical domain adaptation techniques.
Among the state-of-the-art methods for subject adaptation are Riemannian techniques, which perform standardization of subject-wise distributions on the semi-positive definite (SPD) manifold of EEG covariance matrices [5, 17–19]. Performing classification on the Riemannian manifold requires specialized models however, and cannot straightforwardly be adapted for deep learning techniques, although there are recent efforts such as SPDNet [20], also applied to EEG decoding [21]. We focus here on statistical alignment techniques that are broadly compatible with existing deep learning models.
Euclidean alignment aims to adapt Riemannian alignment techniques for standard deep learning models, such as convolutional neural networks [6]. This approach performs spatial whitening using the Euclidean mean of spatial covariance matrices across multiple trials of a given subject. The aligned signals are then provided as input to the deep learning model.
Alignment on the SPD manifold can be seen as aligning within the classification space for some conventional methods, since these would directly be applied on the resulting manifold, which includes the signal powers. Deep learning methods, however, perform additional steps of feature extraction as part of the model architecture, which results in a discrepancy between the alignment space and the final classification space.
Adaptive Batch Normalization (Adaptive BatchNorm) is a deep learning domain transfer technique developed in the image domain [22], which was introduced to EEG decoding by [8, 23]. This approach simply replaces the statistics from the source dataset applied in every batch normalization layer with statistics obtained from the target dataset. Compared to Euclidean alignment it has the advantage of applying its adaptation step in the latent classification space of a deep learning model instead of the input. During the training stage however, this approach just performs standard batch normalization without considering the inter-subject variability within the training set.
Deep sets are a deep learning framework developed for the classification of sets, rather than fixed dimensional vectors [12]. For a function to operate on a set it must be permutation invariant (or equivariant) with respect to an undefined number of input samples. An invariant deep set produces a single output from a set of input samples, while an equivariant deep set results in the same number of outputs as there are input samples, each transformed by the invariant set function. The original authors propose to use sum or maxpooling as a commutative function to aggregate the information in the set.
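The invariant variant described above can be sketched as a per-element network followed by sum pooling and a readout. This is a minimal illustration of the deep set framework, not the model used in this paper; the names `phi` and `rho` follow the original formulation, and all layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

class InvariantDeepSet(nn.Module):
    """Minimal invariant deep set: phi is applied per set element,
    sum pooling aggregates (a commutative function), rho maps the
    pooled vector to the output."""
    def __init__(self, d_in, d_hidden, d_out):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.rho = nn.Linear(d_hidden, d_out)

    def forward(self, x):                  # x: (set_size, d_in)
        pooled = self.phi(x).sum(dim=0)    # permutation-invariant pooling
        return self.rho(pooled)
```

Because the pooling is a sum, shuffling the set elements leaves the output unchanged, which is the defining property of an invariant set function.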
3.1. Alignment techniques

Figure 1 provides an overview of the input and latent space processing during training and inference for Latent Alignment compared to other statistical domain adaptation techniques. In order to employ the alignment techniques studied here, we carefully compose each batch of training data to include a fixed number of trials from each included subject session. Depending on the analysis, these batches include either balanced class distributions or randomly sampled trials for each subject. Note that the choice of sampling strategy for the batch composition does not affect the overall class balance seen during training. Alignment is then straightforwardly performed using the subject-wise statistics obtainable from each batch. During inference, all unlabeled trials of individual subjects are decoded simultaneously, using the full statistics. The class distributions then reflect the distribution in the respective dataset.
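The batch composition described above can be sketched as follows. The helper `compose_batches` is hypothetical (not from the released code), and the random-sampling variant is shown; the class-balanced variant would stratify the per-subject sampling by label.

```python
import random

def compose_batches(trial_ids_by_subject, subjects_per_batch=4,
                    trials_per_subject=12, seed=0):
    """Compose training batches holding a fixed number of trials from
    each included subject, so that subject-wise statistics can be
    computed within every batch."""
    rng = random.Random(seed)
    subjects = list(trial_ids_by_subject)
    rng.shuffle(subjects)
    batches = []
    for i in range(0, len(subjects) - subjects_per_batch + 1, subjects_per_batch):
        batch = []
        for s in subjects[i:i + subjects_per_batch]:
            # randomly sampled trials; class distribution is not controlled
            batch.extend(rng.sample(trial_ids_by_subject[s], trials_per_subject))
        batches.append(batch)
    return batches
```

With 4 subjects and 12 trials each, this reproduces the total batch size of 48 used for the motor and ERP experiments.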
Figure 1. Simplified visualization of Latent Alignment and other statistical domain adaptation techniques, highlighting differences with regard to input and latent space processing, as well as differences between model training and inference. The input signal x of N trials from multiple subjects is aligned, followed by feature extraction θ, resulting in a latent space z. After aligning in the latent space, a classifier σ is applied. Only a single layer of feature extraction and the resulting latent space is shown for simplicity. The standard BatchNorm layer, applied on the input and latent space, uses the total dataset statistics across trials and subjects during training and treats them as constants during inference. Euclidean Alignment is applied only on the input, and performs subject-wise alignment both during training and inference. The Adaptive BatchNorm method behaves like a standard BatchNorm layer during training, and uses subject-wise statistics on the input and latent space during inference. The proposed Latent Alignment method uses subject-wise statistics for alignment on the input and latent space, applied both during training and inference. Training and inference splits are subject-independent for all experiments in this paper.
All three alignment methods obtain the relevant statistics by averaging across trials as well as across the time dimension of each trial, as is custom with batch normalization. In the following notation, we omit the time dimension for the sake of brevity.
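The trial- and time-wise averaging can be expressed compactly; this sketch assumes the common trials × channels × time layout and a hypothetical helper name `subject_stats`.

```python
import torch

def subject_stats(x, eps=1e-5):
    """Per-channel mean and standard deviation over the trial and time
    dimensions, as used by all three alignment methods.
    x: (n_trials, n_channels, n_times)."""
    mu = x.mean(dim=(0, 2), keepdim=True)
    sigma = x.std(dim=(0, 2), keepdim=True, unbiased=False) + eps
    return mu, sigma
```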
3.1.1. Latent alignment

A preliminary version of the Latent Alignment technique was used in our winning entry to the BEETL NeurIPS competition [3, 24]. We will first introduce its formulation as distribution standardization, followed by the deep set formulation.
The alignment context set is constructed as a batch of $n$ trials from the current subject. A model forward pass is then performed on the batch. At each alignment layer, the mean $\mu \in \mathbb{R}^d$ and standard deviation $\sigma \in \mathbb{R}^d$ across trials with latent features $z_i \in \mathbb{R}^d$, $i \in \{1, \ldots, n\}$, of dimensionality $d$ are computed as

$$\mu = \frac{1}{n} \sum_{i=1}^{n} z_i, \qquad \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(z_i - \mu\right)^2},$$

treating each feature dimension independently. The resulting statistics are then applied to standardize the latent distribution across subject trials in the batch, such that

$$\hat{z}_i = \frac{z_i - \mu}{\sigma},$$

resulting in a batch of trials with standardized latent features $\hat{z}_i$. Trainable scale $\gamma \in \mathbb{R}^d$ and bias $\beta \in \mathbb{R}^d$ parameters, shared across all subjects, are applied on the feature distributions, as is commonly performed in batch normalization [25]. The aligned features are then propagated to the next trainable weight layer of the deep learning model, after which the alignment procedure is repeated. This approach is applied equally during training and validation.
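A single alignment layer following this formulation can be sketched in PyTorch as below. This is a simplified reading of the method (the released repository is authoritative); the epsilon term is an assumed numerical-stability convention borrowed from batch normalization.

```python
import torch
import torch.nn as nn

class LatentAlignment(nn.Module):
    """Sketch of one Latent Alignment layer: standardize each feature
    across the batch of trials from one subject, then apply a trainable
    scale (gamma) and bias (beta) shared across subjects. Unlike
    BatchNorm, the same subject-wise statistics are used during both
    training and inference."""
    def __init__(self, d, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d))
        self.beta = nn.Parameter(torch.zeros(d))
        self.eps = eps

    def forward(self, z):                       # z: (n_trials, d), one subject
        mu = z.mean(dim=0)
        sigma = z.std(dim=0, unbiased=False)
        z_hat = (z - mu) / (sigma + self.eps)   # subject-wise standardization
        return self.gamma * z_hat + self.beta
```

With the default parameters, the output features of the subject's trial set have approximately zero mean and unit variance, regardless of the subject's original latent distribution.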
The Latent Alignment approach can alternatively be expressed as a deep set [12]. The function $f(z_1, \ldots, z_n) = (\hat{z}_1, \ldots, \hat{z}_n)$ with $\hat{z}_i = (z_i - \mu)/\sigma$, which is permutation equivariant with respect to the set of $n$ input trials, is applied on the latent features coming from the previous layer. The resulting standardized features are transformed by a trainable weight matrix $W \in \mathbb{R}^{d \times d'}$, mapping from $d$ input to $d'$ output features, and bias $b \in \mathbb{R}^{d'}$, such that

$$y_i = \rho\left(W^{\top} \hat{z}_i + b\right),$$

where $\rho$ is a non-linear activation function. It is interesting to note that in this view, any deep learning model that applies batch normalization can be seen as a deep set. In that case however, the set readout will be approximately constant across all trials and subjects, representing the latent statistics of the training dataset, which makes it a trivial case.
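The permutation equivariance of the standardization function is easy to verify numerically: permuting the trials permutes the outputs identically, since mean and standard deviation are computed over the whole set. A minimal check (the function name `align` is ours):

```python
import torch

def align(z, eps=1e-5):
    """The equivariant set function: standardization across the trial set."""
    return (z - z.mean(dim=0)) / (z.std(dim=0, unbiased=False) + eps)

# For any permutation P of the trials: align(P z) == P align(z).
```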
When applying Latent Alignment during inference, we include each test trial in the context set, which is possible, because the Latent Alignment method does not require class labels. When performing inference on a single new trial, the previously obtained context set can be concatenated to form a growing batch of trials, on which the new statistics can be computed.
Latent Alignment repeatedly standardizes feature distributions following successive layers of feature extraction in a deep learning model, up to and including the final classification space. Following our hypothesis that alignment is most effective when performed within the classification space rather than only on the input, this approach should be optimal. Furthermore, applying alignment both during training and inference proactively addresses inter-subject variability in the training set and eliminates the difference between training and inference behaviour.
There are no hyperparameters to be set for the proposed Latent Alignment technique, making its practical use more straightforward. A PyTorch implementation is available on GitHub, alongside implementations for the baselines and deep learning models.
3.1.2. Euclidean alignment

Euclidean Alignment [6] standardizes the distribution of EEG input signals by performing spatial whitening with the average spatial covariance matrix

$$\bar{R} = \frac{1}{n} \sum_{i=1}^{n} X_i X_i^{\top}.$$

Alignment of the trials $X_i$ is performed with

$$\tilde{X}_i = \bar{R}^{-1/2} X_i,$$

where we compute the square root of the covariance matrix using the Cholesky decomposition, followed by the matrix inverse. The aligned signals are then passed into the deep learning model. For numerical stability we recenter signals per electrode and rescale them using the total trial standard deviation across electrodes (average global field power) before computing the covariance matrix.
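A sketch of the whitening step follows. Note one deliberate substitution: for simplicity we compute the symmetric inverse square root via an eigendecomposition, whereas the paper describes a Cholesky-based square root followed by a matrix inverse; the per-electrode recentering and rescaling are omitted.

```python
import numpy as np

def euclidean_align(trials, eps=1e-8):
    """Whiten each trial with the inverse square root of the subject's
    mean spatial covariance. trials: (n_trials, n_channels, n_times)."""
    covs = np.stack([x @ x.T / x.shape[1] for x in trials])
    r_bar = covs.mean(axis=0)                       # Euclidean mean covariance
    vals, vecs = np.linalg.eigh(r_bar)              # symmetric eigendecomposition
    r_inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return np.stack([r_inv_sqrt @ x for x in trials])
```

After alignment, the mean spatial covariance of the subject's trials equals the identity matrix, so subjects share a common reference frame at the model input.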
3.1.3. Adaptive batch normalization

The Adaptive BatchNorm approach applies simple batch normalization layers in the training stage and then replaces the statistics of every normalization layer with statistics obtained on a set of trials from the target subject. We implement this by running simple batch normalization layers during training and then utilizing the subject-specific trials in each composed batch to obtain personalized statistics during inference.
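One way to realize this statistics replacement in PyTorch is sketched below, assuming a trained model and a tensor of unlabeled target-subject trials; the helper name `adapt_batchnorm` is ours, not from the released code.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def adapt_batchnorm(model, target_trials):
    """Re-estimate every BatchNorm layer's running statistics from
    target-subject trials: reset the stats, set momentum=None so a
    forward pass in train() mode accumulates a cumulative average,
    then switch back to eval()."""
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.reset_running_stats()
            m.momentum = None        # cumulative moving average in PyTorch
    model.train()
    model(target_trials)             # running stats now reflect the target
    model.eval()
    return model
```

The trainable scale and bias of each layer are left untouched; only the normalization statistics are personalized, matching the description above.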
3.2. Models and datasets

Three datasets representative of the variety of classification tasks met in EEG BCI applications were selected to test our Latent Alignment approach. On each dataset, a well-established deep learning model developed specifically for the respective task is trained with varying alignment methods. We added a batch normalization layer on the input for each model, which applies batch normalization per electrode without trainable weight and bias; this layer is then used for Adaptive BatchNorm or Latent Alignment depending on the experiment.
All models are trained to minimize the cross-entropy loss, using the Adam optimizer [26] with learning rate 1 , weight decay 1
and momentum β1 = 0.9, β2 = 0.999. Dropout regularization with a rate of 0.25 is applied [27]. In all tasks we perform subject-independent cross-validation with 10 folds, such that the subjects in the respective validation sets were not seen during training of the model.
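The subject-independent cross-validation can be sketched as a partition of subjects (not trials) into folds; the helper `subject_folds` is hypothetical and assumes an integer subject ID per trial.

```python
import numpy as np

def subject_folds(subject_ids, n_folds=10, seed=0):
    """Yield (train_idx, val_idx) pairs where folds are formed over
    subjects, so no validation subject appears in the training set."""
    rng = np.random.default_rng(seed)
    subjects = rng.permutation(np.unique(subject_ids))
    for val_subjects in np.array_split(subjects, n_folds):
        val_mask = np.isin(subject_ids, val_subjects)
        yield np.where(~val_mask)[0], np.where(val_mask)[0]
```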
3.2.1. PhysioNet ME/MI

The first dataset is PhysioNet ME/MI, which contains recordings of 109 subjects performing motor execution (ME) and motor imagery (MI) of hands and feet [28, 29]. We drop six subjects due to inconsistent numbers of trials and sampling rates. The dataset contains EEG recordings from 64 electrodes at a sampling rate of 160 Hz, cut into trials of length 4.1 seconds. We apply a third-order zero-phase Butterworth bandpass filter in the range 4–40 Hz and a notch filter at 60 Hz, as well as common average re-referencing.
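The described preprocessing chain can be sketched with SciPy; the notch quality factor `Q` is an assumption (the paper does not state it), and `filtfilt` provides the zero-phase property.

```python
from scipy.signal import butter, filtfilt, iirnotch

def preprocess(x, fs=160.0):
    """Third-order zero-phase Butterworth bandpass (4-40 Hz), 60 Hz
    notch, and common average re-referencing. x: (channels, times)."""
    b, a = butter(3, [4, 40], btype="bandpass", fs=fs)
    x = filtfilt(b, a, x, axis=-1)            # forward-backward -> zero phase
    bn, an = iirnotch(60.0, Q=30.0, fs=fs)    # Q=30 is an assumed value
    x = filtfilt(bn, an, x, axis=-1)
    return x - x.mean(axis=0, keepdims=True)  # common average reference
```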
Three classes are chosen, including left fist, right fist and both feet. We employ EEGNet, one of the most established deep learning models to date on the motor decoding task [13]. It is parameterized by 8 temporal and 2 spatial filters for a total of 16 channels, and we adapt the kernel sizes of the convolutional layers to 81 and 21 to match the sampling rate of 160 Hz. The pooling layers are parameterized with kernel sizes 4 and 8, respectively. As implemented in the original paper, the spatial filters and the linear classifier layer are constrained by max-norm with limits of 1 and 0.25, respectively. The non-linearities are set to the ELU function. On each task, the model is trained for 100 epochs with each training batch consisting of 4 subjects with 12 trials per subject for a total batch size of 48.
3.2.2. PhysioNet sleep

The PhysioNet Sleep Cassette dataset includes 78 subjects with EEG recordings from 2 bipolar electrodes (Fpz-Cz and Pz-Oz) [28, 30]. We discard any data from the awake condition except for the 30 min before and after sleep. Due to the very small number of trials in deep sleep stages 3 and 4 (figure 2), we combine these classes into one. The dataset is cut into trials of length 30 seconds. The sampling rate of the dataset is 100 Hz, and we apply a third-order zero-phase Butterworth bandpass filter in the range 0.1–45 Hz.
Figure 2. Class distributions for the datasets used in this study. Across the three different tasks, the datasets vary in the number of classes and the degree of imbalance in the class distributions, offering a comprehensive test environment.
For the sleep stage classification task we use the well established DeepSleep model [31]. Batch normalization layers are added after each of the three convolutional layers. We collapse the latent channels of 8 spatial and 2 temporal filters into the same dimension for a total of 16 channels to allow for computation to occur across these channels. The model consists of a linear spatial filter, followed by two convolution layers with kernel size 51, corresponding to half of the sampling rate. Each convolution is followed by a pooling layer with kernel size 16. The non-linearities are set to the ReLU function. The model is trained for 5 epochs with 64 trials per subject session in each batch for a total batch size of 256. To counteract the imbalance of classes, the loss of each trial is weighted with the inverse of the class distribution in the training dataset.
3.2.3. OpenBMI event-related potential (ERP)

The last task involves the classification of P300 ERP responses on the OpenBMI dataset [32]. It includes EEG recordings with 62 electrodes from 54 subjects, each including 2 sessions with 2 phases. We drop non-standard electrodes, leaving 48 electrodes. The dataset is cut into trials of 1 second as proposed by the authors of the dataset, resampled to 100 Hz, and a third-order zero-phase Butterworth filter is applied in the range 0.5–45 Hz. This is followed by common average re-referencing.
In this task the EEGInception model is employed with 8 temporal and 2 spatial filters, which taken together with the 3 parallel inception depths amounts to 48 channels [33]. To match the sampling rate of 100 Hz, we adapt the kernel sizes of the inception layers to and respectively, and 7 and 3 for the two remaining convolutions. A pooling layer with kernel size 4, 2, 2, 2 is applied after each of the respective four convolutional layers. The non-linearities are set to the ELU function. The model is trained for 10 epochs with 12 trials per subject session in each batch for a total batch size of 48. All experiments on this dataset are performed with batches containing the fixed class ratio of 5 non-target trials for every target trial, representing the dataset average. To counteract the imbalance of classes, the loss of each trial is weighted with the inverse of the class distribution in the training dataset.
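The inverse-frequency class weighting used for the sleep and ERP tasks can be sketched as below; the normalization convention (dividing by the number of classes) is one common choice and an assumption on our part.

```python
import torch

def inverse_frequency_weights(labels, n_classes):
    """Per-class loss weights proportional to the inverse of each
    class's frequency in the training set."""
    counts = torch.bincount(labels, minlength=n_classes).float()
    return counts.sum() / (n_classes * counts)

# Usage sketch: criterion = torch.nn.CrossEntropyLoss(
#     weight=inverse_frequency_weights(train_labels, n_classes=2))
```

With the 5:1 non-target-to-target ratio described above, the rare target class receives five times the weight of the non-target class.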
4. Results

We first examine trained models for contamination with class-discriminative artifacts by investigating their spatial filters. This is followed by a comparison of the different alignment methods and, finally, an analysis of the impact of class distributions on subject alignment, including comparisons between training on class-balanced and unbalanced data.
4.1. Spatial filter analysis

Since deep learning models trained on EEG signals without artifact removal could utilize class-discriminative artifacts for classification, we investigated the linear spatial weights of baseline EEGNet models trained on different parts of the trial (figure 3). It can be observed that models trained on the first second place a focus on F7, F8 and Fpz electrodes. This suggests that the decoder has learned to exploit eye movement artifacts, which inherently carry class-relevant information due to the adopted cue-based experimental design. Models trained on the remaining three seconds of the trial focus on C3, CP3, CPz, C4, CP4 for the ME task, and C3, CP3, FCz, CPz and CP4 for MI, which are in the expected locations over the motor cortex.
Figure 3. Topographic representation of the respective trained spatial filter layer for the motor and event-related potential (ERP) classification tasks, with models trained on different parts of the trial for the motor task. Models trained on the first second exhibit a clear focus on F7, F8 and Fpz electrodes, indicating overfitting on ocular artifacts related to the cue-presentation, while models trained on the rest of the trial exhibit a focus on motor-related central electrodes. Each representation shows mean absolute values of the spatial filter weights averaged across filter channels and cross-validation folds to display the grand average electrode relevance. Three electrodes were omitted for visual clarity (T9, T10, Iz).
Plotting averaged EEG signals from PhysioNet motor trials indeed reveals systematic eye movements, presumably following the task presentation on the screen (figure 4). The polarity of the eye movement artifact in F7 and F8 is reversed between left and right hand trials, supporting this interpretation. While details on the dataset are not available, we assume that an instruction stimulus for moving the left hand was presented on the left side of the screen at trial time zero, which the subject tended to follow with a saccadic eye movement. For the right hand and feet classes the cue would have been on the right side or lower part of the screen, respectively. The deep learning model can then classify the task based on the systematic eye movement artifacts in the EEG signal. The amplitude of these artifacts is expected to be proportional to the angle of the saccadic eye movement following the visual stimulus.
Figure 4. Grand average EEG amplitude across subjects for select electrodes containing eye activity on Physionet executed movement trials. Bandpass filtered in 1–8 Hz. Clear differences in eye activity between classes are apparent during the first second of the trial, after the cue appears on the screen. (a) Left hand, (b) right hand, (c) feet.
The strong contribution of these artifacts towards the classification accuracy can be seen in table 1, especially on the MI paradigm. When classifying the signal using only the first second of the trial, which contains the eye movement artifact, the model achieves 58.2% accuracy, while it reaches only 45.0% accuracy when classifying the other three seconds excluding the first.
Table 1. Classification results for motor tasks when training the EEGNet baseline model on different parts of the trial window. Models trained on the first second after cue presentation have much higher classification accuracy than models trained on the rest of the trial, both for motor execution (ME) and motor imagery (MI). This indicates the presence of discriminative artifacts following cue presentation. Reported as balanced accuracy, mean (std).
        0–1 s          0–4 s          1–4 s
ME      0.597 (0.056)  0.606 (0.060)  0.521 (0.060)
MI      0.582 (0.048)  0.567 (0.057)  0.450 (0.035)

For any further analysis on the motor tasks, we exclude the first second of the trial during model training and testing to avoid the contribution of systematic eye movements to the classification.
Spatial filters of baseline models trained on the P300 speller ERP classification task exhibit a clear focus on Oz (figure 3), with some contribution from neighbouring electrodes up to P7, P8 and Pz. This topography is surprising, since visual speller paradigms target the P300 response, which is located in the central-parietal region [34]. In the OpenBMI dataset used here, the dataset authors aimed to enhance the ERP response by drawing a human face stimulus over the letter when flashing it, which would contribute to differences in brain activity associated with visual processing between target and non-target stimuli. This is also evidenced by the topographical validation the dataset authors performed, which in addition to the expected P300 around Cz shows strong differences in the occipital region [32].
4.2. Latent distributions

A visualization of latent distributions at the various stages throughout the EEGNet model trained on the ME task can be seen in figure 5. When compared to the baseline model, which exhibits strong inter-subject variability, we observe that every alignment method tested in this study noticeably reduces the inter-subject variability during inference, both on the model input (L0) as well as in the latent stages (L1–L3). Euclidean Alignment is only applied on the input signal, which explains the increasing divergence between subject distributions toward the later stages of the model. In contrast, both Adaptive BatchNorm and Latent Alignment are applied repeatedly throughout the model, and subject distributions therefore remain well aligned. This observation agrees with the increased classification accuracy of Adaptive BatchNorm and Latent Alignment when compared to Euclidean Alignment.
Figure 5. Latent distributions of different subjects at various stages in the EEGNet model trained on the motor execution task, comparing different alignment methods during inference. Shown are the first two components after multidimensional scaling (MDS) dimensionality reduction following each of the four batch normalization or alignment layers. An ellipse is drawn around the 95% confidence interval of the latent distribution of all trials for each of eleven subjects taken from the first fold validation set. The classes are therefore balanced within each distribution. While all alignment methods result in much less inter-subject variability in the latent space of the model, the benefit is more limited to the earlier layers for Euclidean Alignment, with distributions diverging towards the deeper layers of the model. Adaptive BatchNorm and Latent Alignment show benefits even in the final classification space.
4.3. Classification performance

Classification results are presented in table 2 (see supplementary table 1 for additional benchmarks). Given that our cross-validation folds were created with a controlled random seed and are therefore identical across experiments, we performed paired t-tests when comparing classification accuracy. All alignment methods had the highest impact on the motor classification tasks, and the least impact on sleep stage classification. Paired t-tests across validation folds between the respective baseline models and the alignment techniques reveal that while all alignment methods lead to statistically significant increases on the motor tasks, only Latent Alignment led to a statistically significant increase in performance on the sleep stage task (p < 0.05). Regarding P300 ERP classification, both Adaptive BatchNorm (p < 0.05) and Latent Alignment (p < 0.01) showed significant increases in balanced accuracy.
Table 2. Comparison of the baseline, using standard BatchNorm, and alignment techniques on different EEG classification tasks with their respective deep learning models. The motor tasks are trained either on class-balanced or -unbalanced batches, but always tested on balanced distributions. Latent Alignment consistently leads to higher performance. Results are reported as balanced accuracy, mean (std) across folds. Bold indicates the best performing technique for each task. Paired two-sided t-test (n = 10) over cross-validation folds of alignment methods compared to the baseline: † p < 0.10, * p < 0.05, ** p < 0.01, *** p < 0.001.
For the classification of ME and MI trials, we distinguish between models trained on class-balanced and randomly unbalanced batches. We observe that all alignment methods lead to a statistically significant increase in the classification of ME trials (p < 0.001) of about 9%–12%. For MI, all methods lead to an increase of about 3%–8%, albeit with reduced significance levels for Euclidean Alignment (p < 0.05 with class-balanced and p < 0.01 with class-unbalanced training; p < 0.001 for Adaptive BatchNorm and Latent Alignment).
We expected that aligning within the latent classification space would lead to increased classification accuracy. The results confirm that the proposed Latent Alignment technique consistently produces the highest performance on all tested tasks, albeit at times with a small margin of difference to the other methods. Consistently second is Adaptive BatchNorm, followed by Euclidean Alignment.
The subject-wise performance impact of applying Latent Alignment for ME classification is shown in figure 6. Of the 103 subjects, all but 17 show a clear performance benefit. There is no apparent trend with regard to which subjects benefit the most from alignment. Rather than bringing low-performing subjects up to an average level, alignment benefits subjects at all performance levels.
Figure 6. Subject-wise change in classification accuracy compared to the baseline model when applying Latent Alignment for motor execution classification with EEGNet. Subjects are sorted by their baseline accuracy, taken from the respective fold in which they were excluded from the training set. The large majority of subjects show an increase in performance, while a few subjects show a decrease. There is no clear trend regarding the change in performance with respect to the baseline level.