Neural Decoders Using Reinforcement Learning in Brain Machine Interfaces: A Technical Review

Introduction

Reinforcement learning (RL) has been actively considered in robotics (Kober et al., 2013) to accomplish industrial automation (Meyes et al., 2017; Stricker et al., 2018) and humanoid robot behaviors (Peters et al., 2003; Navarro-Guerrero et al., 2012) and in business management to guide decision making (Huang et al., 2011; García et al., 2012), pricing strategies (Kim et al., 2016; Krasheninnikova et al., 2019), and stock price prediction (Jae Won, 2001; Wu et al., 2020). The unique mechanism of RL tries to mimic the human learning process that acquires knowledge based on experience in a trial-and-error manner. That is, in RL, the learning system not only observes but also interacts with the environment to collect information to accomplish the goal of a task. This unique mechanism provides a general framework for a system to adapt to novel environments.

Due to its advantages, flexibility for adaptation, and successful performances in difficult domains such as those mentioned above (robotics and business management), RL has been incorporated in a wide variety of domains, including autonomous driving (Zhao et al., 2020), natural language processing (Sharma and Kaushik, 2017), and search engines (Hu et al., 2018). In addition, RL has started to get more attention in medical applications (Gottesman et al., 2019; Coronato et al., 2020), including clinical decision support (Liu et al., 2020) and brain machine interfaces (BMIs).

Research in BMIs is a multidisciplinary effort that involves fields such as neurophysiology and engineering. Developments in this area have a wide range of applications, especially for people with neuromuscular disabilities, for whom BMIs may become a significant aid. Decoding of neural signals is one of the main tasks that need to be executed by the BMI.

In a neural decoder, various signal-processing and machine-learning techniques that find a map from neural signals to control commands for external devices have been explored (Kao et al., 2014; Xu et al., 2019). Conventional signal-processing techniques, including the Kalman filter (Kim et al., 2008), Kalman filter variations (Li et al., 2009; Gilja et al., 2012; Pandarinath et al., 2017), and the Wiener filter (Salinas and Abbott, 1994; Carmena et al., 2003; Hochberg et al., 2006), have shown successful performance in neural decoding. An impressive example describing closed-loop BMI cursor control experiments on humans with tetraplegia can be found in Kim et al. (2008), where an average error rate of 13.8% was reported for one subject using a Kalman filter, called the velocity Kalman filter, to decode the subject’s intracortical neural signals into two-dimensional velocity vectors of the cursor, (vx, vy). In addition, a variant of the Kalman filter, called the recalibrated feedback intention-trained Kalman filter, has been integrated with a hidden Markov model-based state classifier to control a computer cursor that types on a virtual keyboard. This closed-loop experiment was conducted by decoding intracortical neural signals from subjects with amyotrophic lateral sclerosis and spinal cord injury, and the neural decoder showed competitive performance on typing tasks (an average typing rate of 28.1 correct characters per minute and a bitrate of 2.4 bits per second) (Pandarinath et al., 2017).

Moreover, supervised learning algorithms, such as support vector machines (Hortal et al., 2015; Toderean and Chiuchisan, 2017; Skomrock et al., 2018) and artificial neural networks, particularly recurrent neural networks (Oliver and Gedeon, 2010; Sussillo et al., 2012), have been actively considered in BMIs for neural decoding. It has been shown that a recurrent neural network can outperform the velocity Kalman filter in a closed-loop intracortical BMI (Sussillo et al., 2012). In addition, the closed-loop decoder adaptation strategy allows synergistic online adaptation for both user and neural decoder providing better interaction of the user with the environment through the BMIs and improved performance (Orsborn et al., 2011, 2012; Gilja et al., 2012; Shanechi et al., 2016; Brandman et al., 2018). Furthermore, following recent advances in deep-learning techniques, researchers have started investigating various deep-learning algorithms in BMIs (Mahmood et al., 2019; Mansoor et al., 2020).

Although these learning approaches have been applied to neural decoding in real-time control of BMIs, they are probably not the most appropriate methodology for paraplegic users because of the absence of ground truth. The basic mechanism of the above-mentioned signal processing and machine learning approaches is as follows: given a training set of neural signals and synchronized movements, the problem is posed as finding a mapping between these two signals, which can be solved by applying supervised learning techniques. That is, the kinematic variables of an external device are set as desired signals, and the system is trained to obtain the regression model. Unfortunately, the desired signal is determined by the experimenter, not by the user. In practice, since the user cannot move, the information about the desired signal needed at each time instant to update the external device’s movement is missing. In addition, even if the desired signal is available, functionality remains limited across task types or changing environments, since frequent calibration (retraining) becomes necessary.

RL is one of the representative learning schemes, which provides a general framework for adapting a system to a novel environment inspired by how biological organisms interact with the environment and learn from experience. RL allows learning using only information from the environment, and thus there is no need for an explicit desired signal. Although RL does require a reward signal to guide the learning process, it is important to note that the reward can be obtained based on the user’s neural activity (Schultz et al., 1998; Marsh et al., 2015; An et al., 2019). These characteristics are well suited for the neural decoding task in BMI applications since BMIs need to have direct communication between the central nervous system and the computer that controls external devices such as a prosthetic arm for disabled individuals. Moreover, BMIs should be able to continuously adapt and adjust to subtle neural variations.

In this article, we focus on various RL methods that have been used in BMIs for neural decoding. Although preprocessing of the acquired neural data is an important step in BMIs, in this study, we do not place emphasis on the data preprocessing steps. In addition, interactive RL, which uses human guidance to optimize learning procedures, has been highlighted in BMIs (Cruz and Igarashi, 2020; Poole and Lee, 2022). The human feedback has been largely related to modeling rewards in RL. Modeling reward is another important step in RL, and there have been various attempts to model reward based on neural signals (Iturrate et al., 2010; Marsh et al., 2015; An et al., 2018; Shen et al., 2019). However, in this article, we focus on RL models used as a neural decoder in BMIs. Thus, studies solely based on modeling the rewards are out of the scope of this review.

To the best of our knowledge, this work is the first attempt to provide an exhaustive review of neural decoding algorithms applied to RLBMIs. In this article, we describe various RL methods that have been used in BMIs to adjust the parameters of the neural decoders and provide a summary of their advantages and limitations. It is expected that this review will serve not only as a reference guide for researchers already working on RL-based BMIs but also as an introductory tool for those who may be considering incorporating RL algorithms into their BMI work. The contributions of the authors include listing update rules and diagrams from different RL neural decoders with unified notation across studies and providing a taxonomy for the various neural decoders by categorizing their RL base model and type of function approximation algorithm. Experimental setups and details are also summarized, along with the reported neural decoders’ performances. This article is organized as follows: Section “Search Methodology” shows the methodology for the literature review process. Section “Background on Reinforcement Learning” provides the taxonomy and problem formulations in RL. Section “Reinforcement Learning Brain Machine Interfaces: Basic Mechanism” provides an overview of RLBMIs. Section “Reinforcement Learning in Brain Machine Interfaces: Neural Decoding Algorithms” reviews various types of neural decoders applied in RLBMIs. Section “Discussion” discusses future directions for research in RLBMIs.

Search Methodology

We chose to search for relevant literature through the following databases: PubMed, JSTOR, Academic Search Complete, and Google Scholar. The phrases we employed were “Reinforcement Learning Brain Machine Interfaces” and “Error Related Potentials and Brain Machine Interfaces.” Once all seemingly relevant papers were gathered across the different databases based on their abstracts, replicates were removed, i.e., the same paper from different databases. From there, articles were removed after full-text analysis revealed they were not appropriate for our review, in the sense that the phrases used above were only superficially related to the paper (Figure 1).

Figure 1. A review flow chart following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (Page et al., 2021).

In addition, Table 1 displays an itemized summary of the reviewed neural decoders in RLBMIs. The first column shows the main author and the publication year of the reported study. The neural decoder type is divided into three subcategories: RL base model, function approximator, and learning algorithm. Neural signal and subject types are listed in the subsequent columns, along with the number of subjects considered in the RLBMI experiments. The “Subject” column provides gender and specific species, if available, when an animal study was conducted. The eighth column shows the type of task the subject conducted while the neural signal was acquired. “External device” shows the type of device that the subject was controlling. The tenth column shows the type of BMI experiment: when the subject was manually controlling the external device and pre-recorded neural signals were used with the neural decoder, it is listed as “Open,” and when the subject’s neural signals were directly controlling the external device regardless of their behavior, it is marked as “Closed.” The highlighted performance is summarized under “Key reported performance.” The best reported performance is summarized in terms of success rate, for fair comparison across all reported studies, and the amount of data used for evaluation is listed to provide an understanding of the learning speed. It should be noted that all information provided in the published studies has been summarized; however, some fields are missing information that was not available in the corresponding published studies.

Table 1. A summary of reviewed neural decoders in RLBMIs.

Background on Reinforcement Learning

In RL, a controller, called an agent, interacts with a system, called the environment, over time and modifies its behavior to improve performance. This performance is assessed in terms of cumulative rewards, which are assigned based on the task goal. The agent tries to adjust its behavior by taking actions that will increase the cumulative reward in the long run; these actions are directed toward the accomplishment of the task goal.

An RL framework can be formalized with the following components: a set of states 𝒳, a set of actions 𝒜, a reward function ℛ, and a transition probability 𝒫. The basic RL mechanism is as follows: at an arbitrary time t, the agent observes a state xt ∈ 𝒳 from the environment and outputs an action at ∈ 𝒜. This action changes the environment, and a new state xt+1 is observed. Upon transitioning to this new state, a reward rt+1 is presented from the environment to the agent. The process repeats either indefinitely or until a terminal state is reached. In RL, the agent may receive reward information from the environment delayed by unspecified amounts of time.
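As an illustration of this interaction loop, the following minimal Python sketch runs one episode; the `env` and `agent` objects and their methods are hypothetical placeholders used only to make the observe–act–reward cycle explicit, not an interface from any reviewed study.

```python
# Minimal sketch of the RL interaction loop (hypothetical env/agent interfaces).
def run_episode(env, agent, max_steps=1000):
    """Run one episode of the observe-act-reward cycle."""
    x_t = env.reset()                            # initial state from the environment
    total_reward = 0.0
    for _ in range(max_steps):
        a_t = agent.select_action(x_t)           # action chosen by the agent's policy
        x_next, r_next, done = env.step(a_t)     # environment transitions, emits reward
        agent.update(x_t, a_t, r_next, x_next)   # agent adjusts its behavior from experience
        total_reward += r_next
        x_t = x_next
        if done:                                 # stop at a terminal state
            break
    return total_reward
```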

Policy and Value Functions

Two important concepts associated with the agent are the policy and value functions. The policy π is a function that maps a state xt to an action at, π : 𝒳 → 𝒜. That is, the action taken by the agent is selected based on the agent’s policy. Moreover, the value function is a measure of the long-term performance of an agent following a policy π starting from a state xt. There are two types of value functions: a state-value function and an action-value function. The state-value function is defined as the expected value of the cumulative reward Rt that an agent receives when it starts in a particular state xt at time t and follows a policy π:

V^{\pi}(x_t) = \mathbb{E}_{\pi}\left[ R_t \mid x_t \right]. \quad (1)

This state-value function indicates the expected cumulative reward that an agent can collect from a state xt. In addition, an action-value function considers the expected cumulative reward obtained by performing an action at while the agent is in the state xt and following the policy π thereafter:

Q^{\pi}(x_t, a_t) = \mathbb{E}_{\pi}\left[ R_t \mid x_t, a_t \right]. \quad (2)

A discounted infinite-horizon model is popularly chosen for the cumulative reward Rt:

R_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}, \quad 0 < \gamma < 1, \quad (3)

where the discount factor γ places more emphasis on rewards acquired sooner and prevents the sum from growing unbounded as k → ∞.
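As a small numerical illustration of Equation (3), the snippet below computes the discounted return of a finite reward sequence (the infinite sum is truncated at the trial length); it is a generic sketch rather than code from any reviewed study.

```python
import numpy as np

def discounted_return(rewards, gamma=0.9):
    """Compute sum_k gamma^k * r_{t+k+1}, truncated at the end of the reward sequence."""
    rewards = np.asarray(rewards, dtype=float)        # r_{t+1}, r_{t+2}, ...
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

# Example: a reward of 1 arrives only after the third step.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.81
```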

The objective of RL is to find a good policy that maximizes the expected reward of all future actions given the current knowledge. By maximizing the rewards made available to an agent, the goal behavior can be realized. This duality is present by design and is commonly referred to as the Reward Hypothesis. Since the value function represents the expected cumulative reward given a policy, the optimal policy π* can be obtained based on the value functions; a policy π is better than another policy π′ when π gives a greater expected return than π′. In other words, π ≥ π′ when Vπ(xt) ≥ Vπ′(xt) or Qπ(xt, at) ≥ Qπ′(xt, at) for all xt ∈ 𝒳 and at ∈ 𝒜. Therefore, the optimal state-value function Vπ*(xt) is defined by,

V^{\pi^*}(x_t) = \max_{\pi} V^{\pi}(x_t), \quad (4)

and the optimal action-value function Qπ* (xt, at) can be obtained by,

Q^{\pi^*}(x_t, a_t) = \max_{\pi} Q^{\pi}(x_t, a_t). \quad (5)

The following Bellman optimality equations are obtained by evaluating the Bellman equation for the optimal value function,

V^{\pi^*}(x_t) = \max_{a_t \in \mathcal{A}(x_t)} \sum_{x_{t+1}} \mathcal{P}_{xx'}^{a} \left[ \mathcal{R}_{xx'}^{a} + \gamma V^{\pi^*}(x_{t+1}) \right], \quad (6)

Q^{\pi^*}(x_t, a_t) = \sum_{x_{t+1}} \mathcal{P}_{xx'}^{a} \left[ \mathcal{R}_{xx'}^{a} + \gamma \max_{a_{t+1}} Q^{\pi^*}(x_{t+1}, a_{t+1}) \right], \quad (7)

where 𝒫^a_{xx′} = P(x_{t+1} = x′ | x_t = x, a_t = a) and ℛ^a_{xx′} = 𝔼[r_{t+1} | x_t = x, a_t = a, x_{t+1} = x′]. The solution to these Bellman optimality equations can be obtained using dynamic programming (DP) methods. However, this procedure is infeasible when the number of variables increases due to the exponential growth of the state space, the curse of dimensionality. In addition, solving this equation requires explicit knowledge of the environment, including the state transition probability 𝒫^a_{xx′} and reward distribution ℛ^a_{xx′} (Sutton and Barto, 1998).
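For concreteness, the sketch below solves the Bellman optimality equation for V by value iteration on a toy problem where the transition probabilities and expected rewards are known exactly; it is a generic dynamic programming illustration (not an RLBMI decoder), and it makes explicit the model knowledge that RL methods avoid.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Iterate V(x) <- max_a sum_x' P[a, x, x'] * (R[a, x, x'] + gamma * V(x')).

    P: transition probabilities, shape (n_actions, n_states, n_states)
    R: expected rewards,         shape (n_actions, n_states, n_states)
    Both must be known explicitly, which is what limits DP in practice.
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[a, x] = sum_x' P[a, x, x'] * (R[a, x, x'] + gamma * V[x'])
        Q = np.einsum('axy,axy->ax', P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)   # optimal values and a greedy policy
        V = V_new
```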

Functional Approximation of the Value Function and Policy

It is noteworthy that all published works on neural decoding within RLBMI use some form of functional approximation for either the value function or the policy. Therefore, in this section, we provide further details on how the functional approximation can be considered in RL. Moreover, this is another reason why, in Table 1, we present the RL base model and the function approximation strategies in separate columns, along with the learning algorithms.

Various methods can approximately solve the Bellman optimality equations for each of the value functions. The approximate solutions often require far less time to resolve, with the added advantage of requiring less memory. The estimated value functions will allow comparisons between policies and thus guide the optimal policy search:

\tilde{V}^{\pi}(x_t) = f_v(x_t; \theta_{f_v}), \quad (8)

\tilde{Q}^{\pi}(x_t, a_t) = f_q(x_t, a_t; \theta_{f_q}), \quad (9)

where f_v and f_q represent arbitrary functions, and θ_{f_v} and θ_{f_q} are their corresponding parameters that define each function. Furthermore, following the same functional approximation strategy, the approximated policy can also be represented as follows:

\pi : a_t \approx f_{\pi}(x_t; \theta_{f_{\pi}}), \quad (10)

where f_π and θ_{f_π} are an arbitrary function and its corresponding parameters, respectively. Therefore, to avoid high computational complexity and the need for explicit knowledge of the environment, including 𝒫^a_{xx′} and ℛ^a_{xx′}, this functional approximation strategy has been mainly considered in RLBMIs to model neural decoders.

While various functional approximation methods exist, mainly two have been considered in RLBMIs to approximate the value functions or the policy. One is kernel basis expansion, and the other is artificial neural networks, specifically feedforward networks and convolutional neural networks (CNNs).

Kernel Expansions

The basic idea of kernel methods is to nonlinearly map the input data to a high-dimensional feature space of vectors. Let 𝒳 be a nonempty set. For a positive definite function κ : 𝒳 × 𝒳 → ℝ (Scholkopf and Smola, 2001; Liu et al., 2010), there exists a Hilbert space ℋ and a mapping ϕ : 𝒳 → ℋ, such that κ(x1, x2) = ⟨ϕ(x1), ϕ(x2)⟩. The inner product in the high-dimensional feature space can be calculated by evaluating the kernel function in the input space. Here, ℋ is called a reproducing kernel Hilbert space (RKHS) because it satisfies the following property,

f(x) = \langle f, \phi(x) \rangle = \langle f, \kappa(x, \cdot) \rangle, \quad \forall f \in \mathcal{H}. \quad (11)

This property enables the transformation of conventional linear algorithms in the feature space into nonlinear systems without explicitly computing the inner product in the high-dimensional space. The function f can take the role of fv, fq, or fπ in RL as follows:

f(x) = \sum_{i=1}^{n} \alpha_i \, \kappa(x_i, x), \quad (12)

where n corresponds to the number of available units to compute and αi is the weighting factor for the unit centered at xi. In many cases, the number of available units corresponds to the number of data points that have been seen during training. We can think about kernel expansions as function approximators where the number of parameters can grow as more data become available.
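A minimal sketch of a kernel-expansion approximator in the form of Equation (12), using a Gaussian kernel; the centers, weights, and kernel width are illustrative placeholders rather than values from any reviewed decoder.

```python
import numpy as np

def gaussian_kernel(x1, x2, h=1.0):
    """Translation-invariant Gaussian kernel kappa(x1, x2)."""
    return np.exp(-np.sum((x1 - x2) ** 2) / (2.0 * h ** 2))

def kernel_expansion(x, centers, alphas, h=1.0):
    """Evaluate f(x) = sum_i alpha_i * kappa(x_i, x), as in Equation (12)."""
    return sum(a * gaussian_kernel(c, x, h) for a, c in zip(alphas, centers))

# The number of centers typically grows with the data seen during training.
centers = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
alphas = [0.5, -0.2]
print(kernel_expansion(np.array([0.5, 0.5]), centers, alphas, h=1.0))
```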

Feedforward Neural Networks and Convolutional Neural Networks

An artificial feedforward neural network is composed of input, hidden (possibly multiple), and output layers, and each layer contains a certain number of units, which is a design parameter that depends on the problem setup. Let x^{(ℓ)} denote the activation vector at layer ℓ, so that for a network with L layers, the input to the network is denoted as x^{(0)} and the output of the network as x^{(L)}. The output of each unit in layer ℓ can be computed as follows:

x_j^{(\ell)} = g_j^{(\ell)}\!\left( \sum_{i=1}^{d_{\ell-1}} w_{ij}^{(\ell)} x_i^{(\ell-1)} + b_j \right), \quad (13)

where g_j^{(ℓ)} represents an activation function, w_{ij}^{(ℓ)} are the weights connecting each layer’s units, b_j is the bias term, and d_ℓ represents the number of units in layer ℓ. The indexes i and j refer to input and output units, respectively. In addition, x_i^{(ℓ−1)} is the ith input to unit j, and x_j^{(ℓ)} is the unit’s output. Note that when L = 1 and g is the identity function, this neural network corresponds to a linear function approximator.
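The layer computation in Equation (13) can be sketched in a few lines of NumPy; the layer sizes, random weights, and tanh/identity activations below are arbitrary placeholders, not a decoder configuration from the reviewed studies.

```python
import numpy as np

def layer_forward(x_prev, W, b, g=np.tanh):
    """Compute x_j = g(sum_i w_ij * x_i + b_j) for all units j of one layer."""
    return g(W.T @ x_prev + b)

rng = np.random.default_rng(0)
x0 = rng.normal(size=8)                            # e.g., an 8-dimensional neural feature vector
W1, b1 = rng.normal(size=(8, 16)), np.zeros(16)    # input -> hidden
W2, b2 = rng.normal(size=(16, 4)), np.zeros(4)     # hidden -> output (e.g., 4 action values)

hidden = layer_forward(x0, W1, b1, g=np.tanh)
output = layer_forward(hidden, W2, b2, g=lambda z: z)  # identity output activation
print(output.shape)  # (4,)
```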

A convolutional neural network is one type of artificial neural network in which additional structure in the units can be used to group and restrict the weighted sum above to a convolution. For instance, an electroencephalogram (EEG) signal over a short time window has channel and time structure and can be seen as a single input array; similarly, an image can be seen as an input array with spatial structure and possibly also channel structure, an RGB image being an example.
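The sketch below illustrates how the convolutional structure restricts the weighted sum of Equation (13) to local patches with shared weights, sliding a small filter over a channels × time array; the array shape and filter size are made up for illustration and do not correspond to a reviewed architecture.

```python
import numpy as np

def conv2d_valid(x, kernel):
    """2D cross-correlation (the 'convolution' used in deep learning), no padding."""
    kh, kw = kernel.shape
    H, W = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Each output unit is a weighted sum over a local patch with shared weights.
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

eeg_window = np.random.randn(8, 50)   # 8 channels x 50 time samples (illustrative)
filt = np.random.randn(3, 5)          # small spatio-temporal filter
print(conv2d_valid(eeg_window, filt).shape)  # (6, 46)
```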

Along with these different function approximation strategies, various learning methods have been implemented in RLBMI. They are summarized in Table 1 and details are provided in the following sections, specifically section “Reinforcement Learning in Brain Machine Interfaces: Neural Decoding Algorithms.”

Reinforcement Learning Brain Machine Interfaces: Basic Mechanism

What makes RL most viable for BMIs is the ability of the agent to respond with continuous adaptations to a dynamic environment. In RLBMIs, the environment includes the subject, external device, and task-related information (Figure 2). RLBMIs consider the state of the environment xt to be the neural signals of the subject. The action at generated by the agent is treated as a representation to control an external device, such as a direction, position, or velocity. Moreover, the agent finds a mapping from the subject’s neural signal to the action, so the agent takes the role of the neural decoder.

Figure 2. RLBMI architecture with labeled RL components. This figure is modified based on Figure 1 in DiGiovanna et al. (2009).

In the RLBMI architecture, there are two intelligent systems: the BMI decoder in the agent and the user in the environment (DiGiovanna et al., 2009). The two intelligent systems learn co-adaptively based on closed-loop feedback. The agent updates the state of the environment, namely, the location of a cursor on a screen or a robotic arm’s position, based on the user’s neural activity and the received rewards. At the same time, the subject produces the corresponding brain activity. Through iterations, both systems learn how to earn rewards based on their joint behavior. The BMI decoder learns a control strategy based on the user’s neural state and performs actions in goal-directed tasks that update the action of the external device in the environment. In addition, the user learns the task based on the state of the external device. Notice that both systems act symbiotically by sharing the external device to complete their tasks, and this co-adaptation allows for continuous synergistic adaptation between the BMI decoder and the user even in changing environments.

Environment in Reinforcement Learning Brain Machine Interface

Various experimental setups, including different types of subjects, external devices, and tasks, have been investigated to define the environment in RLBMIs, and Table 1 summarizes how each study is unique.

The reviewed studies cover a variety of subjects, such as the Sprague-Dawley rat, Bonnet Macaque, Rhesus Macaque, Marmoset monkey, and human. The neural signal type used in RLBMI research also varies. However, our literature survey identified only two types of data acquisition technologies that have been used with RLBMIs, namely, intracortical neural signals and EEG. Although these two types of signals differ in many ways, good RLBMI performance has been achieved with both neural signal modalities. In addition, it was also found that in some cases the neural data were artificially generated. Simulated neural activity may fail to capture all variations present in real-world scenarios but provides a viable way to showcase theoretical properties or characteristics of an algorithm. Moreover, various dimensions of neural signals have been considered; the values listed in parentheses under “Neural signal types” in Table 1 give the signal dimensions.

Different types of external devices have been employed in RLBMI experiments. A cursor on a 2D screen, a robotic arm, and a lever are the three types of devices reported. Moreover, numerous tasks have been investigated. A multi-target center-out reaching task and its variations, such as a multi-target reaching task and a multi-target reaching and grasping task, have been the most commonly considered in RLBMIs, but the go/no-go task, lever-pressing task, and obstacle-avoidance task have also been applied.

Agent in Reinforcement Learning Brain Machine Interface

The agent in RLBMI can be considered a neural decoder since it provides a mapping from states to actions. Various RL algorithms have been considered in RLBMIs. We categorize the neural decoding algorithms based on the fundamental RL approach each study considered. Q-learning, Watkins’ Q(λ), Attention-Gated Reinforcement Learning, and Actor-Critic are the four main RL algorithms considered in RLBMIs. The following section explains in further detail how each neural decoder works differently and points out each algorithm’s uniqueness.

In addition, each neural decoder’s reported performance is also summarized. We categorize its performance based on task type and open- or closed-loop experimental setups. It is notable that even though most of the studies implement RLBMI in open-loop setups, similar types of neural decoders have been implemented in both open- and closed-loop experiments. The open-loop experiments allow more resource intensive investigations, yet the closed-loop experiments provide the most applicable setup for real-world deployments.

Reinforcement Learning in Brain Machine Interfaces: Neural Decoding Algorithms

Table 1 provides an itemized summary of the reviewed neural decoders integrated in RLBMIs. This section provides further details of each neural decoder, along with Table 1. We first categorize each neural decoder based on its RL base model in sections “Approximation of the Action-Value Function, Q” and “Actor-Critic.” We then list the learning algorithms for each model under the corresponding subsections. The specific neural signal type is identified, and the type of task that the external device needed to complete is summarized. In addition, key reported performances are listed in terms of success rates.

To best compare the reviewed neural decoders in RLBMIs, we chose success rates as the evaluation metric. Since function approximation algorithms are typically applied to approximate the value functions in RLBMIs, it is common to show how the value function is estimated when evaluating the neural decoder’s performance. However, the estimated value is not always directly associated with how an actual movement is selected. Furthermore, confusion matrices and precision-recall curves are commonly considered evaluation metrics in typical classification tasks, but not all reviewed studies report them. Note that these metrics are only suitable when a single-step reaching task is considered because, in multi-step tasks, an action, a choice of direction that can match a class label, happens at each step. In addition, we only report the best performances in each study. Generalization of the reported performance is still limited due to neural and measurement variability. Each study reports the neural decoder’s performance on each subject and session separately. Since each study has a different number of subjects and recording sessions, we describe the best reported performance.

Approximation of the Action-Value Function, Q

A recently published study shows how a linear approximation of the action-value function Q can be used to detect Chinese symbols under the P300 brain–computer interface paradigm (Huang et al., 2022). The P300 paradigm uses a unique setting that requires stimulations to produce synchronization of EEG patterns. This study uses different visual stimulations to represent each row and column that can be associated with a symbol location in a 6 × 6 (row × column) display. A linear relationship is used to approximate the action-value function, Q = θ^T x, where θ is a coefficient vector and x is constructed from a d-dimensional feature vector based on the EEG epoch. The θ values are optimized by minimizing the difference between the expected and the actual Q values, Q* − Q̃. For action selection, an upper confidence bound (UCB) is used; the resulting decoder is referred to as the P300 linear upper confidence bound (PLUCB). The study also provides a transferred P300 linear upper confidence bound (TPLUCB), which transfers θ information from other subjects to a new subject. PLUCB and TPLUCB showed improved performance over a conventional algorithm called stepwise linear discriminant analysis (SWLDA); their reported overall symbol accuracies are 80.4 ± 12.8% and 79.6 ± 14%, respectively.
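To illustrate the combination of a linear action-value model with UCB-based action selection described above, the sketch below follows the standard LinUCB formulation; it is a generic textbook construction, not the PLUCB/TPLUCB algorithm of Huang et al. (2022), and the symbols A, b, and α are assumptions of this sketch.

```python
import numpy as np

class LinearUCB:
    """Generic linear UCB sketch: score(x, a) = theta_a^T x + alpha * sqrt(x^T A_a^{-1} x)."""

    def __init__(self, n_actions, dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_actions)]    # per-action design matrices
        self.b = [np.zeros(dim) for _ in range(n_actions)]  # per-action reward statistics

    def select_action(self, x):
        scores = []
        for A_a, b_a in zip(self.A, self.b):
            A_inv = np.linalg.inv(A_a)
            theta = A_inv @ b_a                              # ridge-regression estimate of theta_a
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)      # optimism under uncertainty
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))

    def update(self, a, x, reward):
        self.A[a] += np.outer(x, x)
        self.b[a] += reward * x
```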

Q-learning and Its Variations

Temporal difference (TD) learning is an incremental learning method specialized for multi-step prediction problems. It provides an efficient learning procedure that can be applied to RL. TD learning allows learning directly from new experiences without having a model of the environment. In addition, it employs the temporal difference error, in combination with previous estimations, to update the current predictor (Sutton, 1988).

Q-learning is an off-policy TD algorithm based on the following incremental TD update rule for the action-value function:

Q(x_t, a_t) \leftarrow Q(x_t, a_t) + \eta \left[ r_{t+1} + \gamma \max_{a} Q(x_{t+1}, a) - Q(x_t, a_t) \right], \quad (14)

where η and γ are the step-size and discount factors, respectively, with η, γ ∈ [0,1]. The current action a_t is selected based on a policy derived from the current Q(x_t, a_t), and ϵ-greedy is a commonly considered choice. Regardless of the policy being followed, the update rule evaluates the next state with the action that yields the greatest Q value, which makes Q-learning off-policy. Q-learning does not require a model of the environment to converge upon an optimal policy and is, therefore, invaluable in stochastic and dynamic learning situations.
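A minimal tabular sketch of the update in Equation (14) together with an ϵ-greedy behavior policy is given below; it assumes discrete state indices, which is rarely the case for neural signals and is precisely why RLBMI decoders rely on the function approximators reviewed here.

```python
import numpy as np

def epsilon_greedy(Q, x, epsilon=0.1, rng=None):
    """Pick a random action with probability epsilon, otherwise the greedy action."""
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[x]))

def q_learning_step(Q, x_t, a_t, r_next, x_next, eta=0.1, gamma=0.9):
    """Apply the off-policy TD update of Equation (14) in place on a Q table."""
    td_error = r_next + gamma * np.max(Q[x_next]) - Q[x_t, a_t]
    Q[x_t, a_t] += eta * td_error
    return Q
```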

The Q(λ) algorithm is an extension of Q-learning that adds the eligibility trace λ, which allows learning based on a sequence of selected actions. Although there are two different Q(λ) algorithms, Watkins’ Q(λ) (Watkins, 1989) and Peng’s Q(λ) (Peng and Williams, 1996), the RLBMI studies focused specifically on Watkins’ Q(λ). Watkins’ Q(λ) uses the following cost function Jt:

J_t = \frac{1}{2} \left( \mathrm{TDerror}_t^{\lambda} \right)^2, \quad (15)

\mathrm{TDerror}_t^{\lambda} = \mathrm{TDerror}_t + \sum_{n=1}^{T-1} (\gamma \lambda)^{n} \, \mathrm{TDerror}_{t+n}, \quad (16)

\mathrm{TDerror}_t = r_{t+1} + \gamma Q(x_{t+1}, a_{t+1}) - Q(x_t, a_t), \quad (17)

where T is the length of a trial. Its update rule is derived from ∂J_t/∂Q(x_t, a_t) = 0.
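The composite error of Equations (16) and (17) can be computed directly from a recorded trial as sketched below; this is a plain restatement of the formulas under the assumption of a zero terminal value, and it omits the trace truncation after exploratory actions that distinguishes Watkins’ Q(λ).

```python
import numpy as np

def td_errors(rewards, q_values, gamma=0.9):
    """One-step TD errors of Equation (17) over a trial.

    rewards:  the reward following each step of the trial
    q_values: Q(x_t, a_t) for each step; the value after the last step is assumed to be 0
    """
    q = np.asarray(q_values, dtype=float)
    r = np.asarray(rewards, dtype=float)
    return r + gamma * np.append(q[1:], 0.0) - q

def lambda_td_error(td, t, gamma=0.9, lam=0.8):
    """Composite error TDerror_t^lambda of Equation (16), truncated at the end of the trial."""
    future = np.asarray(td, dtype=float)[t:]
    weights = (gamma * lam) ** np.arange(len(future))
    return float(np.sum(weights * future))
```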

Attention-Gated Reinforcement Learning was introduced as a biologically realistic learning scheme by integrating feedback connections, called attention effects, and synaptic plasticity (Roelfsema and van Ooyen, 2005). Attention-Gated Reinforcement Learning is a policy-based learning method with an instantaneous reward. Two unique components of Attention-Gated Reinforcement Learning are a global error signal δ, which reflects changes in reward expectancy, and an attention signal, which feeds back from the output layer to the previous layers. The global error signal δ is defined in such a way that it increases learning when unexpected actions are taken. Another key characteristic of Attention-Gated Reinforcement Learning is the form of the policy π, under which the units in the output layer engage in a competition. That is, in each iteration, one output unit is selected based on the stochastic Softmax rule, and only the winning unit is updated (Roelfsema and van Ooyen, 2005).

It is notable that Attention-Gated Reinforcement Learning considers the same state and action relations as the Q(λ) algorithms; that is, a neural signal is treated as an input state, xt, and the output is represented as the action, at, to control an external device. Moreover, the Attention-Gated Reinforcement Learning network is set to estimate the action-value function, Q. The distinguishing feature of the Attention-Gated Reinforcement Learning network is that a new form of policy is applied to select one corresponding action.

Q-Learning via Kernel Temporal Difference(λ)

The value functions can be estimated adaptively using the TD(λ) algorithm, which approximates the value functions using a linear function approximator. However, this linearity may be a limitation in practice. A nonlinear variant of the TD algorithm, called Kernel Temporal Difference(λ), was introduced by integrating kernel methods (Bae et al., 2011, 2015).

Bae et al. (2011) showed how the action-value function Q can be approximated using Kernel Temporal Difference(λ) in Q-learning, Q̃^π(x_t, a_t) = f(x_t, a_t; θ_f). The function f can be optimized using the following update rule:

f \leftarrow f + \eta \sum_{i=1}^{m} \Delta \tilde{f}_i, \quad (18)

\Delta \tilde{f}_i = \left( r_{i+1} + \left\langle f, \gamma \phi(x_{i+1}) - \phi(x_i) \right\rangle \right) \sum_{k=1}^{i} \lambda^{i-k} \phi(x_k). \quad (19)

Here, η is the stepsize, and m is the length of a trial. We should note that, differently from Q(λ), this algorithm uses the eligibility trace λ as in TD(λ) (Sutton, 1988). That is, the λ value is not set to zero depending on the chosen greedy policy but instead serves as a memory that traces more recent trials. Figure 3 shows how this algorithm can be considered in the basic RL structure.

Figure 3. The decoding structure of RLBMI using Q-learning via Kernel Temporal Difference(λ). This figure is modified based on Figure 1 in Bae et al. (2015).
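One way to realize the per-trial update of Equations (18) and (19) is sketched below, representing f through its kernel-expansion coefficients over the visited states; the Gaussian kernel and the parameter values are illustrative, and action selection with per-action indicators is omitted for brevity, so this should be read as a schematic rather than the implementation of Bae et al.

```python
import numpy as np

def gaussian_kernel(x1, x2, h=1.0):
    return np.exp(-np.sum((np.asarray(x1) - np.asarray(x2)) ** 2) / (2.0 * h ** 2))

def f_value(x, centers, alphas, h=1.0):
    """Evaluate f(x) as a kernel expansion over its centers."""
    return sum(a * gaussian_kernel(c, x, h) for a, c in zip(alphas, centers))

def ktd_lambda_trial(states, rewards, eta=0.5, gamma=0.9, lam=0.5, h=1.0,
                     centers=(), alphas=()):
    """Accumulate the per-trial update of Equations (18)-(19).

    states: x_1, ..., x_{m+1}; rewards: the reward following each of the m transitions.
    TD errors use the pre-trial f, and the trial's contributions are merged at the end,
    matching the batch form f <- f + eta * sum_i delta_f_i.
    """
    centers, alphas = list(centers), list(alphas)
    new_centers, new_coeffs = [], []
    for i, r_next in enumerate(rewards):
        x_i, x_next = states[i], states[i + 1]
        td = r_next + gamma * f_value(x_next, centers, alphas, h) \
                    - f_value(x_i, centers, alphas, h)
        new_centers.append(x_i)
        new_coeffs.append(0.0)
        # Eligibility-trace weighting lambda^(i-k) over the states visited so far.
        for k in range(len(new_centers)):
            new_coeffs[k] += eta * td * lam ** (len(new_centers) - 1 - k)
    return centers + new_centers, alphas + new_coeffs
```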

Bae et al. (2011) showed that, using intracortical recordings from a female Bonnet Macaque, this algorithm properly finds matching directions on a 2-target center-out reaching task after 2 epochs of training. The application of Kernel Temporal Difference(λ) was extended, and a convergence property was explained, in Bae et al. (2015). This study investigated the algorithm’s performance on various setups in open-loop experiments and presented results from closed-loop RLBMI experiments using a monkey’s intracortical signals. Considering that most of the reviewed studies implemented RL-based neural decoding algorithms on a single-step task, which allows one step from the initial location to the target, a distinctive feature of this study is that it also investigated multi-step reaching tasks. In addition, the best performance on the closed-loop 2-target reaching task to control a robotic arm showed 90% accuracy.

Q-Learning via Correntropy Kernel Temporal Difference

A new cost function, called Correntropy, has been integrated into Kernel Temporal Difference to address possible issues in noise-corrupted environments (Bae et al., 2014). Highly noise-corrupted environments lead to difficulties in learning, and this may result in failure to obtain the desired behavior of the agent. The generalized correlation function, Correntropy, was first introduced by Liu et al. (2007). Correntropy is defined in terms of inner products of vectors in the kernel feature space,

\mathrm{Correntropy}(X_1, X_2) = \mathbb{E}\left[ \kappa(X_1 - X_2) \right], \quad (20)

where X_1 and X_2 represent two random variables, and κ is a translation-invariant kernel. When Correntropy is set as the cost function in Kernel Temporal Difference(λ), Q-learning via Correntropy Kernel Temporal Difference approximates the action-value function Q for an action k in the following way:

\tilde{Q}(x_t, a_t = k) = \eta \sum_{i=1}^{t-1} \exp\!\left( -\frac{\mathrm{TDerror}_i^{2}}{2 h_c^{2}} \right) \mathrm{TDerror}_i \, I_{ik} \, \kappa(x_t, x_i), \quad (21)

where η is the stepsize, h_c is the Correntropy kernel size, and TDerror_i denotes a temporal difference error defined as TDerror_i = r_{i+1} + γ max_a Q(x_{i+1}, a) − Q(x_i, a_i = k). Recall that the reward r_{i+1} corresponds to the action selected by the current policy with input x_i because it is assumed that this action causes the next input state x_{i+1}. Here, I_{ik} is an indicator vector with the same size as the number of outputs; only the kth entry of the vector is set to 1, and the rest of the entries are 0. The selection of the action unit k at time i can be based on an ϵ-greedy method. Therefore, only the parameter vector corresponding to the winning action gets updated. Correntropy Kernel Temporal Difference showed a slightly faster learning speed than Kernel Temporal Difference(λ = 0) when intracortical recordings from a female Bonnet Macaque were decoded to control a cursor on a screen in a 4-target center-out reaching task. In addition,

gif