Decoding working memory-related information from repeated psychophysiological EEG experiments using convolutional and contrastive neural networks

Neurofeedback is a therapy aimed at improving cognitive abilities through self-regulation of brain activity in a direction considered desirable by the therapist (Hammond 2011). The current and desired levels of electroencephalogram (EEG) activity (typically power in a chosen frequency band on selected electrodes) are presented to patients in the form of a simple game in which the patient's task is to reach the target state using mental manipulations. Despite many years of development, the training protocols used to treat different disorders and deficits are still based on arbitrarily selected frequencies and electrodes. This lack of individual diagnosis and the arbitrary selection of signal features seriously limit the method's effectiveness. Moreover, the complex nature and inter- and intra-subject variability of the EEG signal hinder improvements using traditional computational techniques.

Machine learning (ML) methods are promising for overcoming these limitations because they can incorporate a wide range of signal features and simultaneously generalize the acquired knowledge. Deep neural networks (DNNs) have the potential to learn effective features end-to-end and to classify raw input data. Given their effectiveness in other fields, DNNs are expected to lead to better features and classifiers and thus to a much more robust EEG classification (Lotte et al 2018). Schirrmeister et al (2017) rigorously and convincingly demonstrated that their Shallow ConvNet—a simple convolutional neural network (CNN)—could outperform a classical filter-bank common-spatial-pattern-based classifier.

However, the practical application of modern ML methods such as deep and convolutional neural networks for EEG classification is hindered by two main problems. First, small datasets carry the risk of overfitting. Second, black-box approaches risk using artifacts as features (Comstock et al 1992, Nathan and Contreras-Vidal 2016). While this might not be crucial for brain-computer interfaces, in the case of neurofeedback or diagnostic applications it may lead to false diagnoses and adverse therapeutic effects. The challenges of small data size and classification interpretability are not new in ML. Mitigation techniques have been developed and applied in different fields, including EEG. For small data sizes, a commonly used technique is transfer learning. It was applied as early as 1998 (Thrun and Pratt 1998) and today is commonly used in EEG classification (for a review, see Wan et al 2021). Newer solutions include techniques such as self-supervised contrastive learning (Hyvarinen and Morioka 2016), which has already yielded promising results in EEG investigations (Mohsenvand et al 2020, Banville et al 2021). The techniques used to overcome the interpretability problem include sensitivity analysis and backpropagation approaches. Sensitivity analysis, the more popular of the two, is based on the local evaluation of the output gradient with respect to input features; its results are presented as heat maps highlighting the input features with the most significant impact on the output. These developments bring classification results to a level comparable with that of specialists (Schweizer et al 2017) without the burden of tedious work. The application of these methods may have practical assistive applications in diagnostics and therapeutics.

In this work, we study the application of different classifiers trained to detect desired EEG states as potential EEG-neurofeedback protocols. The classifiers were trained on data collected from individual participants during a cognitive task conducted prior to the actual EEG-neurofeedback training. The feasibility of CNNs for detecting different mental states has already been confirmed by several studies (Bird et al 2018, Chakladar et al 2020, Han et al 2020).

We chose memory as the target for the neurofeedback procedure, as memory is one of the basic cognitive functions. Using datasets collected during three diagnostic sessions of each participant, we trained four neural network models to classify trials based on the EEG signal. The application of four different models to the same dataset allowed us to compare the effects of different architectures, training strategies, and input representations on classification results and features' importance. Furthermore, we performed sanity checks to verify that the extracted features important for classification were plausibly related to the memory task. This study contributes to the field by demonstrating that:

(a) perturbation analysis (section 2.5.1) allows for identifying physiologically relevant features for neurofeedback training,

(b) the application of different training strategies and models to the same problem may result in classification based on different sets of features; still, it allows the identification of a robust subset of features common to all the models,

(c) the relationship between classification results and behavioral performance in cognitive tests (often significant in diagnostic applications) is moderate and varies between the investigated models.

2.1. Data

In the current study we used two different sets of data. The first one comes from the delayed match-to-sample experiment, which we refer to as the experimental data. The experiment was designed to closely resemble a neurofeedback session. The second set was a large dataset of resting-state clinical recordings, which we used to train the model using a transfer learning approach.

2.1.1. Experimental procedure and data

2.1.1.1. Participants

We examined 87 healthy adults (including 42 women; age range: 23–44 years). Participants were recruited through announcements at local universities and work agencies. A neurological screening and questionnaires were administered to all potential participants; exclusion criteria included neurological disorders, brain injury, current use of analgesic medication, substance abuse or dependence, and mental disorders. All participants were right-handed and had normal or corrected-to-normal vision. The experimental procedures were approved by the Local Bioethics Committee at Nicolaus Copernicus University in Torun. All participants gave their written informed consent to participate in the experiment in accordance with the WMA Declaration of Helsinki—Ethical Principles for Medical Research Involving Human Subjects. All experiments were performed under all relevant guidelines and regulations. The neurofeedback set-up included virtual reality (VR) and standard monitor environments. Participants were randomly assigned to two experimental groups with different environments (2D or VR). Finally, the 2D group included 33 participants (including 12 women) and the VR group 54 participants (including 30 women).

2.1.1.2. Procedure

The current experimental procedure was designed for an EEG-neurofeedback experiment aimed at memory improvement in VR and standard monitor (2D) environments. The first three diagnostic sessions were aimed at decoding the participants' EEG state during tasks requiring the use of working memory. During these sessions, participants used a keyboard to play a computer game implementing a working memory task while their EEG was recorded. The data collected from the three sessions were then used for decoding EEG activity. The game was identical to the one used in the proper neurofeedback training except for the participants' response, which, during the proper neurofeedback training, was decoded from the modulations of their EEG. In this paper, we concentrate on identifying the working-memory-related EEG features from the three diagnostic sessions.

The game was based on a delayed match-to-sample (DMTS) task, a task which tests both attention and working memory. The original version of the DMTS has three phases which are sample, delay, and choice. During the sample phase, a participant is presented with a sample stimulus, which has to be maintained in memory during the delay period following the sample phase. When the delay phase is over, the choice phase follows. During this phase, another stimulus is presented to the participant. At this point, the participant has to decide whether or not the stimulus matches the sample.

We modified this paradigm by adding control trials that did not require retention of information (see figure 1 for the trial design). In control trials, the sample and delay phases were the same as in the retention trials. However, in the choice phase, participants were instead asked to indicate the orientation (left or right) of a simple shape. The type of trial was indicated by different background colors. Therefore, participants were always aware whether or not a given trial required attention and retention of information.

Figure 1. The scheme of the experimental procedure. Time course of (a) the retention and (b) control trials. In each case, the type of trial was signaled to the participant by the color of the background elements, purple for retention and green for control type. The 5 s delay phase indicated in the upper plot was used in the analysis described below.


Each trial lasted 10.5 s and began with a 'wait' phase lasting 1.5 s. In the 'wait' phase, no action from the participants was required. Next, in the sample phase, the silhouette of a spaceship was presented to the participants for 2 s. This was followed by a 5 s 'delay' phase (retention trial) used in further analyses for memory load detection. Finally, the 'choice' phase lasted for 2 s. Participants' responses were executed by pressing assigned buttons on the computer keyboard. They earned one point for each correct response and lost a point for each incorrect response. The three sessions of the DMTS game combined with EEG recording were separated by 2–3 d. Each EEG session lasted up to 25 min and consisted of 50 control and 50 retention trials, shown in randomized order.

2.1.2. Clinical data

The clinical data used for transfer learning in one of the approaches explored in the present paper came from the University of Warsaw's database. The database contains over 100 000 healthy and abnormal anonymized clinical EEG recordings of patients of all ages, collected from hospitals across Poland. The subset used in the experiment included 12 000 randomly selected recordings. The use of this database for research purposes was approved by the Local Bioethics Committee at Nicolaus Copernicus University in Torun.

2.2. Data pre-processing

2.2.1. Experimental data preprocessing

The EEG signal was recorded using Digitrack software (Elmiko Ltd) with 19 electrodes arranged in the 10–20 system, referenced to electrodes located at the earlobes. The sampling rate of the EEG signals was 500 Hz and the impedance of the electrodes was kept below 10 kΩ. Data were high-pass filtered with a 0.5 Hz cutoff frequency. The experimental data were subjected to a typical EEG cleaning procedure to remove known artifacts that could bias classification results and consequently the EEG-neurofeedback therapy. To ensure the same sampling rate across experimental recordings (the original sampling rate slightly differed between recordings), the data were downsampled to 400 Hz. Next, the signal was epoched into 10.5 s windows (covering a whole single trial). For automatic detection of bad (heavily contaminated) trials, the ERPLAB function (Lopez-Calderon and Luck 2014) was used with adaptable threshold parameters, leading to removal of no more than 25% of threshold-exceeding trials per participant. Eye movement, electrocardiogram (ECG), and muscle artifact components were removed using independent component analysis (the EEGLAB (Delorme and Makeig 2004) plugin ICLabel with classification threshold set to 0.65 for eyes and ECG and 0.85 for muscles). The delay phase (cf figure 1) was extracted from correct trials for further analysis. Since the VR and 2D groups did not differ in the behavioral results (see section 3 for details), we pooled the two groups to increase the size of the training and testing sets. The final dataset consisted of 6207 trials in session 1, 6509 trials in session 2, and 6587 trials in session 3. There were 10 937 correct control and 8366 correct retention trials in total.
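The resampling and epoching steps above can be sketched as follows. This is a minimal NumPy/SciPy illustration with hypothetical array shapes and trial onsets; the actual pipeline used EEGLAB/ERPLAB in Matlab.

```python
import numpy as np
from scipy.signal import resample_poly

FS_IN, FS_OUT = 500, 400          # original and target sampling rates
TRIAL_SEC = 10.5                  # whole-trial window length

def preprocess(raw, trial_onsets_sec):
    """raw: (n_channels, n_samples) at FS_IN; returns (n_trials, n_channels, T)."""
    # 500 -> 400 Hz is a rational ratio, so polyphase resampling (up=4, down=5) works
    data = resample_poly(raw, up=4, down=5, axis=1)
    win = int(TRIAL_SEC * FS_OUT)  # 4200 samples per 10.5 s trial
    epochs = [data[:, int(t * FS_OUT): int(t * FS_OUT) + win]
              for t in trial_onsets_sec]
    return np.stack(epochs)

eeg = np.random.randn(19, FS_IN * 60)          # one minute of mock 19-channel EEG
epochs = preprocess(eeg, trial_onsets_sec=[0.0, 10.5, 21.0])
```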

2.2.2. Clinical data preprocessing

The signals were recorded using a 10–20 electrode setup, and sampled at 256 Hz. Data were bandpass filtered using 1 Hz highpass and 40 Hz lowpass filters. Next, after cropping the continuous recording into non-overlapping 5 s long samples, the ones that exceeded 500 µV were discarded. Finally, we obtained $3.5\times10^6$ 5 s long samples.
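The segmentation of the clinical recordings can be sketched in a few lines: crop the continuous signal into non-overlapping 5 s windows and drop any window exceeding 500 µV (array shapes are illustrative).

```python
import numpy as np

FS = 256
WIN = 5 * FS  # 1280 samples per 5 s window

def segment(recording, amp_limit_uv=500.0):
    """recording: (n_channels, n_samples) in µV; returns (n_kept, n_channels, WIN)."""
    n = recording.shape[1] // WIN
    chunks = recording[:, :n * WIN].reshape(recording.shape[0], n, WIN)
    chunks = np.moveaxis(chunks, 1, 0)                 # -> (n, channels, WIN)
    keep = np.abs(chunks).max(axis=(1, 2)) <= amp_limit_uv
    return chunks[keep]

rec = np.random.randn(19, FS * 60) * 20.0              # ~20 µV noise, 1 min
rec[0, 3000] = 1000.0                                  # artifact in window index 2
samples = segment(rec)                                 # 11 of 12 windows survive
```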

2.3. Model architectures

In the current study, the effect of different architectures and training methods on classification results and set of important features was evaluated using the following four models:

(a) Shallow ConvNet—a reference model developed originally by Schirrmeister et al (2017).

(b) Parallel ConvNet and Hybrid models—both use a channel-frequency-time input representation and share the same shallow convolutional part of the architecture. Additionally, the Hybrid model is tuned to individual participants. Both models were developed specifically for the current neurofeedback project.

(c) Contrastive model with gated multilayer perceptron (gMLP-MoCo)—also developed for the current neurofeedback project, aimed at assessing transfer learning with a contrastive pretraining phase.

All models were trained to perform a binary classification of the input data as coming from the retention or control trials (classes) (cf figure 1). The details of the models are described below. The models were implemented using Pytorch 1.8.

2.3.1. Shallow ConvNet

The rationale behind choosing Shallow ConvNet (Schirrmeister et al 2017) as a reference model was twofold. First, the original Shallow ConvNet model was developed for classification of an EEG dataset similar in size to ours (High Gamma Dataset: 14 participants, 12 320 trials of 4 s in total; our dataset: 89 participants, three sessions per participant, 19 303 trials of 5 s in total). Second, Shallow ConvNet is a widely used reference model (over 1200 citations in Google Scholar). We implemented a variant of the original model adjusted for binary classification. In short, the first two layers of the network perform, respectively, a temporal convolution and a spatial filtering analogous to Filter Bank Common Spatial Patterns (Ang et al 2008). The convolutional layers are followed by a squaring non-linearity, a mean pooling layer and a logarithmic activation function. The last layer is a fully connected layer with sigmoid non-linearity. In our implementation the size of the temporal filters was set to 40 to obtain the same length (in seconds) as in the original work (Schirrmeister et al 2017). The network's input size was $E \times T$, where E is the number of electrodes, and T is the number of time steps. The scheme of the model is presented in figure 2.
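The layer sequence described above (temporal convolution, spatial filtering, square, mean pool, log, sigmoid head) can be sketched in PyTorch. The kernel counts and pooling sizes here are illustrative assumptions, not the exact hyperparameters of our variant; only the temporal filter length of 40 follows the text.

```python
import torch
import torch.nn as nn

class ShallowConvNet(nn.Module):
    def __init__(self, n_electrodes=19, n_filters=40):
        super().__init__()
        self.temporal = nn.Conv2d(1, n_filters, kernel_size=(1, 40))      # conv 1: temporal filters
        self.spatial = nn.Conv2d(n_filters, n_filters,
                                 kernel_size=(n_electrodes, 1))           # conv 2: spatial filter
        self.pool = nn.AvgPool2d(kernel_size=(1, 75), stride=(1, 15))     # mean pooling
        self.fc = nn.LazyLinear(1)                                        # binary head

    def forward(self, x):                 # x: (batch, 1, E, T)
        x = self.spatial(self.temporal(x))
        x = torch.log(torch.clamp(self.pool(x ** 2), min=1e-6))           # square -> pool -> log
        return torch.sigmoid(self.fc(x.flatten(1)))                       # retention probability

net = ShallowConvNet()
prob = net(torch.randn(2, 1, 19, 2000))   # two 5 s trials at 400 Hz
```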

Figure 2. Outline of the Shallow ConvNet. C is the number of kernels; conv 1: temporal filters, conv 2: spatial filter mixing all electrodes, and all previous filters, FC: the fully connected layer.

2.3.2. Parallel ConvNet and Hybrid models

Parallel ConvNet and Hybrid models were developed explicitly for the current study. For these models, we considered explicit time-frequency parametrization to study the impact of the input representation. After the preprocessing described in section 2.2.1, the signal was converted into a time-frequency representation using the Morlet transform (wave number = 7) for central wavelet frequencies 3, 5, 8, 11, 15, 20, 25, 30, and 35 Hz. The chosen frequencies overlap with canonical EEG bands (i.e. theta 4–7 Hz, alpha 8–12 Hz, beta 1 13–20 Hz, beta 2 21–30 Hz and gamma 31–45 Hz). We used only the real part of the complex representation for further computations and decimated it by 5 in the time domain to reduce the input size. This procedure yielded a $9\times400$ frequency-time map for each channel. Finally, these maps were stacked, forming a tensor of rank 3 with dimensions $19\times9\times400$ (channel × frequency × time). Consequently, the input of our network (see figure 3) had a dimension of E × F × T, where E is the number of electrodes, F the number of frequency bands, and T the number of time steps. The number of electrodes (E) corresponds to the channel dimension of the input tensor, i.e. to its 'depth'. Within each block, the output of the convolution layer is followed by a rectified linear activation, $\mathrm{ReLU}(x) = \max(0, x)$, and then by a batch normalization layer (Ioffe and Szegedy 2015). Both models contain two convolution layers with three parallel paths of different kernel sizes (time steps) (see figure 3). The concept of parallel paths was inspired by the Inception architecture (Szegedy et al 2015), which introduced parallel computational pathways with different convolutional filters in the context of computer vision. We adapted this idea to time-series multi-channel EEG data.
The precise variant employed was fixed using cross-validation (CV) on a relatively small subset of the data as giving reasonable performance. We did not, however, attempt to perform a full-fledged architecture scan, as the main point of the present work was to use the neural network as a tool for neurofeedback training using interpretable EEG features. The kernel sizes in the Parallel ConvNet and Hybrid models are (p, q), where p corresponds to the number of adjacent frequencies, and q to the length of the time window. Thus, the model parses the signal at three different 'speeds' reflected in the kernel sizes: $(5, 10)\rightarrow(5,5)$, $(5,5)\rightarrow(5,10)$, and $(5,1)\rightarrow(5,10$ with $2\times$ dilation$)$. The convolution-activation-normalization blocks are followed by an average pooling layer which aggregates features over the whole duration of the trial. The outputs of the three paths are then concatenated, forming a 24-dimensional representation of the input.
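The construction of the channel-frequency-time input can be sketched as follows: the real part of a Morlet transform (wave number 7) at the nine listed center frequencies, decimated by 5 in time. The wavelet normalization and boundary handling are simplifying assumptions.

```python
import numpy as np

FS = 400
FREQS = [3, 5, 8, 11, 15, 20, 25, 30, 35]  # central wavelet frequencies, Hz
W = 7                                       # wave number

def morlet_maps(signal):
    """signal: (n_channels, n_samples); returns (n_channels, 9, n_samples // 5)."""
    maps = []
    for f in FREQS:
        sigma = W / (2 * np.pi * f)                        # Gaussian width in seconds
        t = np.arange(-4 * sigma, 4 * sigma, 1 / FS)
        wavelet = np.exp(2j * np.pi * f * t) * np.exp(-t**2 / (2 * sigma**2))
        conv = np.stack([np.convolve(ch, wavelet, mode='same') for ch in signal])
        maps.append(conv.real[:, ::5])                     # real part, decimate by 5
    return np.stack(maps, axis=1)                          # channel x frequency x time

x = np.random.randn(19, 2000)   # one 5 s delay-phase trial at 400 Hz
tensor = morlet_maps(x)         # (19, 9, 400)
```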

Figure 3. Parallel ConvNet architecture. Operations indicated in the blocks: conv: convolution with a kernel, ReLU: rectified linear activation function, BN: batch normalization, FC: fully connected layer, C: number of channels, i.e. kernels applied at the same location. The network returns the probability that the input trial class is retention.


In the Parallel ConvNet this representation is passed to two hidden fully connected layers. All fully connected layers receive an additional scalar input marking the type of experimental environment: 2D or VR. The model output value is computed in the third fully connected layer with a sigmoid function $\sigma(x) = 1/(1+\exp(-x))$, $\sigma(x)\in(0,1)$. This output can be interpreted as the probability that the trial belongs to the retention class.

In the Hybrid model we took advantage of multisession characteristics of our dataset to construct individual models to minimize adverse effect of large inter-subject EEG variability on classification results (Lotte et al 2018). To this end, the 24-dimensional representation (output of the 'concatenate' block) of the trained Parallel ConvNet was used as a fixed feature extractor input to individual logistic regression models trained on each participant's data.
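The Hybrid step can be illustrated with a short sketch: a frozen 24-dimensional representation feeds a per-participant logistic regression. Here the frozen Parallel ConvNet extractor is mocked with a fixed random projection, and all shapes and names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
proj = rng.standard_normal((24, 19 * 9 * 400))       # stands in for the frozen CNN

def extract(trials):                                  # trials: (n, 19, 9, 400)
    # flatten each trial and project to the 24-dimensional representation
    flat = trials.reshape(len(trials), -1)
    return flat @ proj.T / np.sqrt(proj.shape[1])     # -> (n, 24) features

# one participant's trials and labels (0 = control, 1 = retention)
trials = rng.standard_normal((60, 19, 9, 400))
labels = rng.integers(0, 2, size=60)

clf = LogisticRegression(max_iter=1000).fit(extract(trials), labels)
probs = clf.predict_proba(extract(trials))[:, 1]      # per-trial retention probability
```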

2.3.3. Contrastive model

The motivation for this specific choice was twofold. First, we sought to evaluate the effectiveness of the transfer learning method using unlabeled raw clinical EEG data. Experimental EEG datasets are usually small in size, which reduces training efficiency and carries the risk of overfitting. On the other hand, clinical EEG data are recorded according to standardized procedures, and the dataset sizes are much larger (e.g. a TUH dataset of 3000 recordings). Therefore, transfer learning from clinical EEG data may provide a solution to the problem of small experimental datasets, enabling more accurate classification without the risk of overfitting. Second, the application of the gMLP with a self-attention mechanism, trained on the target dataset using features extracted from raw clinical data, could reveal the importance of features other than those highlighted by standard analytical methods and increase the fine-tuning capabilities of the model.

The structure of the contrastive model (gMLP-MoCo) used in this study is shown in figure 4. The input is the channel-time representation of EEG signal lasting 5 s (matrix of 1280 time steps of raw EEG data in 19 channels, at 256 Hz sampling rate). This input is fed to a one-dimensional convolutional layer which translates the signal into a sequence of 40 embeddings (corresponding to tokens), each representing the subsequent fragment of signal as a lower dimensional (here, 128) vector. The embeddings are then processed by the gMLP module. The gMLP (Liu et al 2021) is built of 30 identical blocks. In each block, embeddings are first normalized and then projected by a shared linear layer into a higher dimension. These expanded tokens enter the Spatial Gating Unit module and are split into two equally sized chunks.

Figure 4. The gMLP network architecture. Left—overview of the network, center—structure of a gMLP block, right—zoom into the Spatial Gating Unit. See text for details. Adapted with permission from Liu et al (2021).


The first chunk, after a normalization layer, enters a trainable one-dimensional convolutional network and then, after being combined with the Self-Attention layer, is pointwise multiplied by the second, unprocessed chunk, assuring interaction between tokens. The Self-Attention layer uses the non-expanded tokens that enter the gMLP block. Processing finishes with a normalization layer across embeddings and average pooling. The original gMLP model was adjusted to EEG signal processing: instead of a two-dimensional convolution, we used a one-dimensional one along the time axis, with electrodes acting as channels. The kernel size was doubled in linear size from 16 to 32, with appropriate changes to the strides.
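The gating mechanism described above can be sketched in PyTorch: the expanded tokens are split into two equal chunks, one chunk is normalized and mixed across the token axis, and the result gates the other chunk by pointwise multiplication. The self-attention branch and exact dimensions are omitted or assumed for brevity.

```python
import torch
import torch.nn as nn

class SpatialGatingUnit(nn.Module):
    def __init__(self, d_expanded, n_tokens):
        super().__init__()
        self.norm = nn.LayerNorm(d_expanded // 2)
        # token mixing: a convolution acting across the token (sequence) axis
        self.proj = nn.Conv1d(n_tokens, n_tokens, kernel_size=1)

    def forward(self, x):                 # x: (batch, n_tokens, d_expanded)
        u, v = x.chunk(2, dim=-1)         # split into two equally sized chunks
        v = self.proj(self.norm(v))       # normalized chunk, mixed between tokens
        return u * v                      # gating by pointwise multiplication

sgu = SpatialGatingUnit(d_expanded=512, n_tokens=40)
out = sgu(torch.randn(2, 40, 512))        # -> (2, 40, 256)
```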

2.4. Training and evaluation of the models

The low signal-to-noise ratio of EEG results in high instability of predictions in successive training epochs (i.e. iterations through the whole training dataset). To reduce the adverse effect of high variability, we averaged the predictions from the last three epochs.

2.4.1. Shallow, Parallel and Hybrid models

The Shallow ConvNet, Parallel ConvNet and the pretrained part of the Hybrid model were trained using the AdamW optimizer. Training was performed in batches of N = 64 with the standard binary cross-entropy loss. The number of training epochs was chosen experimentally to mitigate overfitting, resulting in 20 epochs for the Shallow ConvNet and 10 epochs for the Parallel ConvNet and Hybrid models.

2.4.2. Contrastive model

The gMLP-MoCo model was trained in two steps:

In both steps we used the AdamW optimizer, with a learning rate of $1\times 10^ $ and a batch size of 64.

In our implementation of the MoCo framework, outlined in figure 5, each original training example z (from the clinical dataset) was augmented using T1 and T2 transformations constructed by the automated augmentation policy RandAugment from the list of basic transformations (table 1). Next, each transformed view was fed to one of two subnetworks: momentum or backprop. The subnetworks learned representations by minimizing the InfoNCE contrastive loss on samples organized into similar and dissimilar pairs. The parameters of the gMLP and Head modules of both sub-networks were updated with the back-propagation algorithm. However, in the case of the momentum sub-network (gMLPm and Headm), we used the momentum mechanism (Chen et al 2020), i.e. the parameter vector θm of the momentum network was updated using past values of the parameter vector θb of the non-momentum network according to $\theta_m = \alpha \cdot\theta_m+(1-\alpha)\cdot\theta_b$. The smoothly evolving momentum network enabled us to reuse old batches during the calculation of the contrastive loss, which inherently requires a large effective batch size to work properly. We used 100 epochs for gMLP pre-training.
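The momentum update above is a simple exponential moving average of parameters; with alpha close to 1 the momentum network evolves slowly. A minimal sketch on plain arrays (the alpha value is illustrative):

```python
import numpy as np

def momentum_update(theta_m, theta_b, alpha=0.999):
    # theta_m = alpha * theta_m + (1 - alpha) * theta_b
    return alpha * theta_m + (1.0 - alpha) * theta_b

theta_m = np.zeros(4)      # momentum-network parameters
theta_b = np.ones(4)       # backprop-network parameters
theta_m = momentum_update(theta_m, theta_b)   # each entry becomes 0.001
```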

Figure 5. Training of the gMLP Momentum Contrastive (gMLP-MoCo) network occurs in two stages. In the self-supervised pretraining stage (left), z is a training example from the clinical database that is distorted with the T1 and T2 transformations, transformed with the network, and compared using a contrastive loss function as a distance. In the supervised stage (right), an example x from the current experiment is passed through the network.


Table 1. The list of the basic transformations used in contrastive training.

Identity: identity transformation.
Noise: add noise generated by a normal distribution.
Signal cutout: zero the signal across all electrodes on a randomly selected continuous section.
Mean shift: change the mean of an electrode by adding a randomly sampled number.
Sensor dropout: zero the signal on a randomly chosen set of electrodes.
Sensor flipping: flip the signal upside down on a randomly chosen set of electrodes.
Bandstop filtering: bandstop a randomly selected frequency range.
Constant scaling: multiply the signal by a number randomly chosen for each electrode.
Irregular scaling: multiply the signal by a cubic spline.
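Two of the transformations from table 1 (signal cutout and sensor dropout) can be sketched in NumPy; the section length and number of dropped electrodes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def signal_cutout(x, section_len=128):
    """Zero a randomly selected contiguous section across all electrodes."""
    x = x.copy()
    start = rng.integers(0, x.shape[1] - section_len)
    x[:, start:start + section_len] = 0.0
    return x

def sensor_dropout(x, n_drop=3):
    """Zero the signal on a randomly chosen set of electrodes."""
    x = x.copy()
    x[rng.choice(x.shape[0], size=n_drop, replace=False)] = 0.0
    return x

sample = np.random.randn(19, 1280)            # one 5 s clinical sample
aug = sensor_dropout(signal_cutout(sample))   # compose two augmentations
```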

In the second step, the pre-trained gMLPb module was coupled with a linear layer and tuned to the data examples x from the current experiment using cross-entropy loss. To prevent overfitting, the augment policy from the previous step was reused with the same set of basic transformations. The experimental data were downsampled to 256 Hz to match the sampling rate of clinical recordings. We used eight epochs for training.

2.4.3. Evaluation of models' performance

For estimation of the models' performance we applied 3-fold CV on the experimental data from the three individual experimental sessions. To further increase the number of estimates of the investigated measures, we repeated the CV for five random neural network initializations. Thus, we finally obtained results for $3\times 5 = 15$ instantiations of each model. All models were evaluated using accuracy (ACC) and the Matthews correlation coefficient (MCC). The latter was chosen for its insensitivity to class imbalance.
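The MCC used for evaluation follows the standard definition from the binary confusion matrix; a direct NumPy implementation:

```python
import numpy as np

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels in {0, 1}."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

y_true = np.array([1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1])
score = mcc(y_true, y_pred)   # (2*2 - 1*1) / sqrt(3*3*3*3) = 1/3
```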

2.5. Feature importance

2.5.1. Evaluation of feature importance

In order to determine the EEG features related to information retention from the neural network perspective, we isolated the features relevant for classifying a trial as retention versus control. Indeed, the only difference between the control and retention trials is the retention of information in memory during the analyzed segment of the trial.

To determine the features' importance for the classification results, we applied perturbation analysis using automatic gradient evaluation in Pytorch. We focused on the power of EEG signal in the canonical frequency bands across electrodes in order to facilitate comparison with classical EEG methods of analysis.

As part of the analysis, the input to the trained model was perturbed by multiplying the amplitude at a given electrode e and a given frequency band f by a factor $c_{e,f}$. Next, the derivative of the class probability p with respect to the perturbation parameter at $c_{e,f} = 1$ was averaged over all versions of a given model (trained on 3 folds and for 5 random initializations) and evaluated on the relevant test set for the given fold, to obtain the average feature importance index $\mathrm{FI}_{e,f}$ defined as:

$$\mathrm{FI}_{e,f} = \left\langle \frac{\partial p}{\partial c_{e,f}} \bigg\rvert_{c_{e,f}=1} \right\rangle \qquad (1)$$

A positive value of $\mathrm{FI}_{e,f}$ means that an increase of power at electrode e in the frequency band around f increases the probability of classifying a given input trial as a retention one.

In the case of the Parallel ConvNet and Hybrid models, the gradients were directly evaluated for the Morlet coefficients. For the models using the raw signal as input (Shallow ConvNet and gMLP-MoCo), the data were bandpass filtered in frequencies corresponding to the central wavelet frequencies of the Morlet transforms, and then the signal was reconstructed by summing the bands with the weights c = 1, with gradient computation turned on for the weights c (cf figure 6). In this way we preserved information about the direction of class probability change with the change of the given feature.
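The raw-signal variant of the perturbation analysis can be sketched as follows: the band-filtered components are recombined with weights c (all ones, with gradients enabled), and the gradient of the output probability with respect to c yields a band-by-electrode importance map. The band filtering and the classifier are mocked here; only the autograd mechanics follow the text.

```python
import torch

n_electrodes, n_bands, n_samples = 19, 9, 1280
bands = torch.randn(n_bands, n_electrodes, n_samples)       # mock band-filtered signal
c = torch.ones(n_bands, n_electrodes, 1, requires_grad=True)

recombined = (c * bands).sum(dim=0)                         # weighted sum over bands

# stand-in classifier (a single linear layer); any trained model fits here
model = torch.nn.Sequential(torch.nn.Flatten(0),
                            torch.nn.Linear(n_electrodes * n_samples, 1))
p = torch.sigmoid(model(recombined))                        # class probability
p.sum().backward()                                          # d p / d c at c = 1
fi = c.grad.squeeze(-1)                                     # (n_bands, n_electrodes) map
```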

Figure 6. Methodology of perturbation analysis for time-channel signal representation.

2.5.2. Testing of feature importance

2.5.2.1. Statistical tests

To test if $\mathrm{FI}_{e,f}$ was significantly different from zero, we used a one-sample t-test on the set of perturbation results of all versions of a given model (i.e. the individual $\mathrm{FI}_{e,f}$ values). To account for the multiple comparisons, we applied a false discovery rate (FDR) correction (Benjamini and Yekutieli 2001).

2.5.2.2. Sanity checks

Comparison of the perturbation analysis results obtained for different subgroups, i.e. participants with the best and worst classification scores, provides an opportunity to further validate the obtained feature importance with a sanity check. To this end, we tested the significance of differences in feature importance values between the 30 best and 30 worst classified participants using the Mann–Whitney test. To account for the multiple comparisons, we applied FDR correction. Another simple sanity test was correlating the classification results with task performance.

2.6. Classical EEG analyses

As a reference for the perturbation analyses, we used classical spectral EEG analyses. The power in each of the frequency bands was estimated by summing periodograms in ranges corresponding to the frequency bands of the Morlet wavelets used in our models, namely: 1–5, 3–7, 7–9, 10–12, 13–17, 15–25, 20–30, 25–30, and 30–40 Hz. We performed a series of permutation tests, shuffling the 'retention' and 'control' trial labels for each combination of (electrode, frequency band), with FDR correction for multiple comparisons and a 0.05 p-value threshold (we used the standard implementation from the EEGLAB Matlab toolbox).
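The band-power estimation step can be sketched with SciPy's periodogram, summing spectral density over the listed band edges (the actual analysis used the EEGLAB Matlab toolbox; this Python version is an illustration only).

```python
import numpy as np
from scipy.signal import periodogram

FS = 400
BANDS = [(1, 5), (3, 7), (7, 9), (10, 12), (13, 17),
         (15, 25), (20, 30), (25, 30), (30, 40)]   # Hz, from the text

def band_powers(trial):
    """trial: (n_channels, n_samples); returns (n_channels, n_bands)."""
    freqs, psd = periodogram(trial, fs=FS, axis=1)
    return np.stack([psd[:, (freqs >= lo) & (freqs <= hi)].sum(axis=1)
                     for lo, hi in BANDS], axis=1)

powers = band_powers(np.random.randn(19, 2000))    # one delay-phase trial
```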

3.1. Behavioral results

The average number of points scored by a participant across all three diagnostic sessions equaled M = 35.7, SD = 5.17. The VR and 2D group results did not differ significantly for all three sessions together (two-tailed Wilcoxon test, p = .61), nor for individual sessions (two-tailed Wilcoxon test, $p_1 = .49$, $p_2 = .97$ and $p_3 = .69$ for sessions 1 through 3, respectively).

3.2. Model performance

ACC and MCC scores of the tested models are shown in table 2. For the sake of reference, using just the 'most frequent class' as a naive classifier yielded an ACC of 56% and 0 MCC. All the evaluated models performed better.

Table 2. ACC and MCC obtained in 3-fold CV and five random initializations training of the models.

Model | ACC | MCC | # Trainable parameters
Shallow ConvNet | $61.50 \pm 2.33$ | $0.216 \pm 0.030$ | $3.5\times 10^4$
Parallel ConvNet |
