Heterogeneous recurrent spiking neural network for spatio-temporal classification

1. Introduction

Acclaimed as the third generation of neural networks, spiking neural networks (SNNs) have attracted considerable attention. In general, SNNs promise lower operating power when mapped to hardware. In addition, recent developments of SNNs with leaky integrate-and-fire (LIF) neurons have shown classification performance similar to deep neural networks (DNNs). However, most of these works use supervised statistical training algorithms such as backpropagation-through-time (BPTT) (Jin et al., 2018; Shrestha and Orchard, 2018; Wu et al., 2018). These backpropagated models are extremely data-dependent and show poor trainability and generalization when less training data is available (Tavanaei et al., 2019; Lobo et al., 2020). In addition, BPTT-trained models need highly complex architectures with a large number of neurons for good performance. Though unsupervised learning methods like STDP have been introduced, they underperform their backpropagated counterparts. This is attributed to the high training complexity of STDP dynamics (Lazar et al., 2006). Therefore, there is a need to explore SNN architectures and algorithms that can improve the performance of unsupervised learned SNNs.

This paper introduces a Heterogeneous Recurrent Spiking Neural Network (HRSNN) with heterogeneity in both the LIF neuron parameters and the STDP dynamics between the neurons. Recent works have shown that heterogeneity in neuron time constants improves the model's performance in classification tasks (Perez-Nieves et al., 2021; She et al., 2021b; Yin et al., 2021; Zeldenrust et al., 2021). However, these papers lack a theoretical understanding of why heterogeneity improves the classification properties of the network. The current literature primarily examines how heterogeneity in neuronal timescales improves model performance; it does not study how heterogeneity can be leveraged to engineer sparse neural networks. In addition, previous papers do not study the effect of heterogeneity on the amount of training data needed by the model. In this paper, we study how heterogeneity in both the neuronal and synaptic parameters can help us engineer models that perform well with less training data and fewer synaptic connections.

Our work also uses a novel BO method to optimize the hyperparameter search process, making it highly scalable for larger heterogeneous networks that can be used for more complex tasks like action recognition, which was not possible earlier. First, we analytically show that heterogeneity improves the linear separation property of unsupervised SNN models. We also empirically verify that heterogeneity in the LIF parameters and the STDP dynamics significantly improves the classification performance using fewer neurons, sparser connections, and less training data. We use a Bayesian Optimization (BO)-based method with a modified Matern kernel on the Wasserstein metric space to search for optimal parameters of the HRSNN model and evaluate the performance on RGB (KTH, UCF11, and UCF101) and event-based datasets (DVS-Gesture). The HRSNN model achieves an accuracy of 94.32% on KTH, 79.58% on UCF11, 77.33% on UCF101, and 96.54% on DVS-Gesture using 2,000 LIF neurons.

2. Related works 2.1. Recurrent spiking neural network 2.1.1. Supervised learning

Recurrent networks of spiking neurons can be effectively trained to achieve competitive performance compared to standard recurrent neural networks. Demin and Nekhaev (2018) showed that recurrence could reduce the number of layers in SNN models and potentially form various functional network structures. Zhang and Li (2019) proposed a spike-train level recurrent SNN backpropagation method to train deep RSNNs, which achieves excellent performance in image and speech classification tasks. On the other hand, Wang et al. (2021) used a recurrent LIF neuron model with dynamic presynaptic currents, trained by backpropagation based on a surrogate gradient. Some recent works introduce heterogeneity in the LIF parameters using trainable time constants (Fang et al., 2021). However, these methods are supervised learning models and do not scale to a larger number of hyperparameters.

2.1.2. Unsupervised learning

Unsupervised learning models like STDP have shown good generalization and trainability properties (Chakraborty and Mukhopadhyay, 2021). Previous works have used STDP for training recurrent spiking networks (Gilson et al., 2010). Nobukawa et al. (2019) used a hybrid of STDP and dopamine-modulated STDP to train a recurrent spiking network and showed its performance in classifying patterns. Several other works have used a reservoir-based computing strategy, as described above. Liquid State Machines equipped with unsupervised learning models like STDP and BCM (Ivanov and Michmizos, 2021) have shown promising results.

2.1.3. Heterogeneity

Despite the previous works on recurrent spiking neural networks, all these models use a uniform parameter distribution for spiking neuron parameters and their learning dynamics. There has been little research leveraging heterogeneity in the model parameters and their effect on performance and generalization. Recently, Perez-Nieves et al. (2021) introduced heterogeneity in the neuron time constants and showed this improves the model's performance in the classification task and makes the model robust to hyperparameter tuning. She et al. (2021b) also used a similar heterogeneity in the model parameters of a feedforward spiking network and showed it could classify temporal sequences. Again, modeling heterogeneity in the brain cortical networks, Zeldenrust et al. (2021) derived a class of RSNNs that tracks a continuously varying input online.

2.2. Action detection using SNNs

Since SNNs can operate directly on event data instead of aggregating it, recent works use the concept of time-surfaces (Lagorce et al., 2016; Maro et al., 2020). Escobar et al. (2009) proposed a feed-forward SNN for action recognition using the mean firing rate of every neuron and the synchrony between neuronal firing. Yang et al. (2018) used a two-layer spiking neural network to learn human body movement with a gradient descent-based learning mechanism, encoding the trajectories of the joints as spike trains. Wang W. et al. (2019) proposed a novel Temporal Spiking Recurrent Neural Network (TSRNN) to perform robust action recognition from video. Using a temporal pooling mechanism, the SNN model provides reliable and sparse frames to the recurrent units, and a continuous message passed from the spiking signals to the RNN helps the recurrent unit retain its long-term memory. Another idea explored in the literature is to capture the temporal features of the input with a reservoir network of spiking neurons, whose output is trained to produce certain desired activities based on some learning rule. Recent research learned video activities with limited examples using this idea of reservoir computing (Panda and Srinivasa, 2018; George et al., 2020; Zhou et al., 2020). We observed that driven/autonomous models are good for temporal dependency modeling of a single-dimensional, pre-known time series, but they cannot learn the joint spatio-temporal features needed for action recognition. Soures and Kudithipudi (2019) used a deep architecture of a reservoir connected to an unsupervised Winner-Take-All (WTA) layer, which captures the input in a higher-dimensional space and encodes it into a low-dimensional representation via the WTA layer. All the information from the layers in the deep network is selectively processed using an attention-based neural mechanism. They used ANN-based spatial feature extraction with ResNet, which is compute-intensive. Some recent works also study the effect of heterogeneity in the neuronal parameters (Perez-Nieves et al., 2021; She et al., 2021a). Fang et al. (2021) introduced learnable leak factors and membrane time constants to introduce heterogeneity in the neurons.

3. Methods 3.1. Recurrent spiking neural network

An SNN consists of spiking neurons connected by synapses. The LIF spiking neuron is defined by the following equations:

$$\tau_m \frac{dv}{dt} = a + R_m I - v; \qquad v = v_{\text{reset}} \ \text{if} \ v > v_{\text{threshold}} \quad (1)$$

where R_m is the membrane resistance, τ_m = R_m C_m is the membrane time constant, C_m is the membrane capacitance, and a is the resting potential. I is the sum of the currents from all input synapses connected to the neuron. A spike is generated when the membrane potential v crosses the threshold, after which the neuron enters a refractory period r, during which its membrane potential is held at v_reset. We construct the HRSNN from a baseline recurrent spiking network (RSNN) consisting of three layers: (1) an input encoding layer (I), (2) a recurrent spiking layer (R), and (3) an output decoding layer (O). The recurrent layer consists of excitatory and inhibitory neurons distributed in a ratio of NE:NI = 4:1. The PSPs of post-synaptic neurons produced by the excitatory neurons are positive, while those produced by the inhibitory neurons are negative. We use a biologically plausible LIF neuron model and train the model using STDP rules.
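As a minimal illustration, the following Python sketch advances the LIF dynamics of Equation 1 by one forward-Euler step. The parameter names, refractory bookkeeping, and time-stepping scheme are our own assumptions, not the exact implementation used in the paper.

```python
import numpy as np

def lif_step(v, I, dt, tau_m, R_m, a_rest, v_th, v_reset, refractory, t_ref):
    """One forward-Euler step of the LIF dynamics in Equation 1.

    v, I, tau_m, refractory are arrays with one entry per neuron.
    Returns updated potentials, a boolean spike mask, and refractory counters.
    """
    active = refractory <= 0.0
    dv = (a_rest + R_m * I - v) / tau_m            # Equation 1
    v = np.where(active, v + dt * dv, v_reset)     # hold at reset while refractory
    spikes = np.logical_and(active, v > v_th)      # threshold crossing
    v = np.where(spikes, v_reset, v)               # reset after a spike
    refractory = np.where(spikes, t_ref, np.maximum(refractory - dt, 0.0))
    return v, spikes, refractory
```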

From here on, we refer to connections between I and R neurons as SIR connections, inter-recurrent layer connections as SRR, and R to O as SRO. We created SRR connections using probabilities based on Euclidean distance, D(i,j), between any two neurons i,j:

$$P(i,j) = C \cdot \exp\!\left(-\left(\frac{D(i,j)}{\lambda}\right)^{2}\right) \quad (2)$$

with closer neurons having a higher connection probability. Parameters C and λ set the amplitude and horizontal shift, respectively, of the probability distribution. I contains excitatory encoding neurons, which convert input data into spike trains. SIR randomly chooses only 30% of the excitatory and inhibitory neurons in R as post-synaptic neurons. The connection probability between the encoding neurons and the neurons in R is defined by a uniform probability PIR, which, together with λ, encodes the architecture of the HRSNN and is optimized using BO. In this work, each neuron receives projections from some randomly selected neurons in R.
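The sketch below shows one way the distance-dependent wiring of Equation 2 could be instantiated; the neuron coordinates and the values of C and λ are placeholder assumptions.

```python
import numpy as np

def build_recurrent_mask(coords, C=0.3, lam=2.0, rng=None):
    """Return a boolean recurrent connectivity matrix sampled from Equation 2.

    coords: array of shape (n_neurons, n_dims) with neuron positions.
    """
    rng = rng or np.random.default_rng(0)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)            # D(i, j)
    prob = C * np.exp(-(dist / lam) ** 2)           # Equation 2
    mask = rng.random(prob.shape) < prob
    np.fill_diagonal(mask, False)                   # no self-connections
    return mask
```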

We apply unsupervised, local learning to the recurrent spiking model by letting STDP change each SRR and SIR connection, modeled as:

$$\frac{dW}{dt} = A_{+} T_{\text{pre}} \sum_{o} \delta(t - t_{\text{post}}^{o}) - A_{-} T_{\text{post}} \sum_{i} \delta(t - t_{\text{pre}}^{i}) \quad (3)$$

where A+, A− are the potentiation/depression learning rates and Tpre/Tpost are the pre/post-synaptic trace variables, modeled as,

$$\tau_{+}^{*} \frac{dT_{\text{pre}}}{dt} = -T_{\text{pre}} + a_{+} \sum_{i} \delta(t - t_{\text{pre}}^{i}) \quad (4)$$
$$\tau_{-}^{*} \frac{dT_{\text{post}}}{dt} = -T_{\text{post}} + a_{-} \sum_{o} \delta(t - t_{\text{post}}^{o}) \quad (5)$$

where a_+, a_− are the discrete contributions of each spike to the trace variable, τ_+^*, τ_−^* are the decay time constants, and t_pre^i and t_post^o are the times of the pre-synaptic and post-synaptic spikes, respectively.
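The trace-based updates of Equations 3-5 can be implemented with a simple Euler scheme, for example as in the sketch below. The per-synapse parameter arrays (which also accommodate the heterogeneous STDP introduced in Section 3.1.2), the shapes, and the update order are illustrative assumptions.

```python
import numpy as np

def stdp_step(W, T_pre, T_post, pre_spikes, post_spikes, dt,
              A_plus, A_minus, a_plus, a_minus, tau_plus, tau_minus):
    """One Euler step of trace-based STDP for a weight matrix W[post, pre].

    pre_spikes / post_spikes are 0/1 float vectors for the current time step.
    A_plus, A_minus may be scalars or per-synapse arrays (heterogeneous STDP).
    """
    # Decay the traces and add the contribution of new spikes (Eqs. 4-5).
    T_pre += dt * (-T_pre / tau_plus) + a_plus * pre_spikes
    T_post += dt * (-T_post / tau_minus) + a_minus * post_spikes
    # Potentiate at post-synaptic spikes, depress at pre-synaptic spikes (Eq. 3).
    dW = (A_plus * np.outer(post_spikes, T_pre)
          - A_minus * np.outer(T_post, pre_spikes))
    W += dW
    return W, T_pre, T_post
```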

3.1.1. Heterogeneous LIF neurons

The use of multiple timescales in spiking neural networks has several underlying benefits, like increasing the memory capacity of the network. In this paper, we propose the usage of heterogeneous LIF neurons with different membrane time constants and threshold voltages, thereby giving rise to multiple timescales. Due to the differential effects of excitatory and inhibitory heterogeneity on the gain and asynchronous state of sparse cortical networks (Carvalho and Buonomano, 2009; Hofer et al., 2011), we use different gamma distributions for the excitatory and inhibitory LIF neurons. This is also inspired by biological observations in the brain, where the time constants of excitatory neurons are larger than those of inhibitory neurons. Thus, we incorporate heterogeneity in our recurrent spiking neural network by using a different membrane time constant τ for each LIF neuron in R, giving rise to a distribution of time constants for the LIF neurons in R.
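As an illustration, heterogeneous neuron parameters could be drawn as in the following sketch; the gamma shape and scale values and the threshold distribution are placeholders, not the distributions found by our Bayesian Optimization.

```python
import numpy as np

rng = np.random.default_rng(0)
n_exc, n_inh = 1600, 400                                     # 4:1 E:I ratio

# Separate gamma distributions for excitatory and inhibitory time constants (ms).
tau_m_exc = rng.gamma(shape=3.0, scale=10.0, size=n_exc)     # slower excitatory neurons
tau_m_inh = rng.gamma(shape=3.0, scale=5.0, size=n_inh)      # faster inhibitory neurons
tau_m = np.concatenate([tau_m_exc, tau_m_inh])

# Heterogeneous firing thresholds (placeholder distribution).
v_th = rng.normal(loc=1.0, scale=0.1, size=n_exc + n_inh)
```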

3.1.2. Heterogeneous STDP

Experiments on different brain regions and diverse neuronal types have revealed a wide variety of STDP forms that vary in plasticity direction, temporal dependence, and the involvement of signaling pathways (Sjostrom et al., 2008; Feldman, 2012; Korte and Schmitz, 2016). As described by Pool and Mato (2011), one of the most striking aspects of this plasticity mechanism in synaptic efficacy is that the STDP windows display a great variety of forms in different parts of the nervous system. However, most STDP models used in Spiking Neural Networks are homogeneous with uniform timescale distribution. Thus, we explore the advantages of using heterogeneities in several hyperparameters discussed above. This paper considers heterogeneity in the scaling function constants (A+, A−) and the decay time constants (τ+, τ−).

3.2. Classification property of HRSNN

We theoretically compare the performance of the heterogeneous spiking recurrent model with its homogeneous counterpart using a binary classification problem. The ability of HRSNN to distinguish between many inputs is studied through the lens of the edge-of-chaos dynamics of the spiking recurrent neural network, similar to the case of spiking reservoirs shown by Legenstein and Maass (2007). Also, R possesses a fading memory due to its short-term synaptic plasticity and recurrent connectivity. For each stimulus, the final state of R, i.e., the state at the end of the stimulus, carries the most information. Figure 1 shows the heterogeneous recurrent spiking neural network model with heterogeneous LIF neurons and heterogeneous STDP synapses used for the classification of spatio-temporal data sequences. Legenstein and Maass (2007) showed that the rank of the final state matrix F reflects the separation property of a kernel: F = [S(1) S(2) ⋯ S(N)]^T, where S(i) is the final state vector of R for stimulus i. Each element of F represents one neuron's response to all the N stimuli. A higher rank of F indicates better kernel separation if all N inputs are from N distinct classes.


Figure 1. An illustrative example showing the heterogeneous recurrent spiking neural network structure. First, we show the temporal encoding method based on the sensory receptors receiving the difference between two time-adjacent data. Next, the input sequences are encoded by the encoding neurons, which inject the spike trains into 30% of the neurons in R. R contains a 4:1 ratio of excitatory (green nodes) and inhibitory (orange nodes) neurons, whose parameters are heterogeneous. The synapses are trained using the heterogeneous STDP method.

The effective rank is calculated using a Singular Value Decomposition (SVD) of F and then taking the rank as the number of singular values that contain 99% of the total sum of singular values, i.e., F = UΣV^T, where U and V are unitary matrices and Σ = diag(λ_1, λ_2, λ_3, …, λ_N) is a diagonal matrix containing the non-negative singular values ordered such that λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_N.
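For concreteness, the effective rank described above can be computed as in this short sketch of the 99%-energy criterion on the singular values of F.

```python
import numpy as np

def effective_rank(F, energy=0.99):
    """Smallest number of singular values of F capturing `energy` of their sum."""
    s = np.linalg.svd(F, compute_uv=False)      # singular values, descending
    cumulative = np.cumsum(s) / np.sum(s)
    return int(np.searchsorted(cumulative, energy) + 1)
```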

Definition: The linear separation property of a neuronal circuit C for m different inputs u_1(t), …, u_m(t) is defined as the rank of the n × m matrix M whose columns are the final circuit states x_{u_i}(t_0) obtained at time t_0 for the preceding input stream u_i.

Following from the definition introduced by Legenstein and Maass (2007), if the rank of the matrix M = m, then for the inputs ui, any given assignment of target outputs yi ∈ ℝ at time t0 can be implemented by C.

We use the rank of the matrix as a measure for the linear separation of a circuit C for distinct inputs. This leverages the complexity and diversity of nonlinear operations carried out by C on its input to boost the classification performance of a subsequent linear decision-hyperplane.

Theorem 1: Assuming S_u is finite and contains s inputs, let r_Hom and r_Het be the ranks of the n × s matrices consisting of the s vectors x_u(t_0) for all inputs u in S_u for the homogeneous and heterogeneous RSNNs, respectively. Then r_Hom ≤ r_Het.

Short Proof: Let us fix some inputs u1, …, ur in Su so that the resulting r circuit states xui(t0) are linearly independent. Using the Eckart-Young-Mirsky theorem for low-rank approximation, we show that the number of linearly independent vectors for HeNHeS is greater than or equal to the number of linearly independent vectors for HoNHoS. The detailed proof is given in the Supplementary material.

Definition: Given that K_ρ is the modified Bessel function of the second kind, and σ², κ, ρ are the variance, length-scale, and smoothness parameters, respectively, we define the modified Matern kernel on the Wasserstein metric space W between two distributions X, X′ as

$$k(X, X') = \sigma^{2} \, \frac{2^{1-\rho}}{\Gamma(\rho)} \left( \sqrt{2\rho}\, \frac{W(X, X')}{\kappa} \right)^{\rho} K_{\rho}\!\left( \sqrt{2\rho}\, \frac{W(X, X')}{\kappa} \right) \quad (6)$$

where Γ(·) is the Gamma function and K_ρ(·) is the modified Bessel function of the second kind.
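A minimal sketch of evaluating this kernel on one-dimensional empirical samples is given below; the hyperparameter values are placeholders, and for higher-dimensional parameter distributions the Sinkhorn approximation mentioned in Section 3.3 would replace the exact Wasserstein distance used here.

```python
import numpy as np
from scipy.special import gamma as gamma_fn, kv
from scipy.stats import wasserstein_distance

def matern_wasserstein(x_samples, y_samples, sigma2=1.0, kappa=1.0, rho=2.5):
    """Modified Matern kernel (Equation 6) on the 1-D Wasserstein distance."""
    w = wasserstein_distance(x_samples, y_samples)      # W(X, X')
    if w == 0.0:
        return sigma2                                     # kernel value at zero distance
    z = np.sqrt(2.0 * rho) * w / kappa
    return sigma2 * (2.0 ** (1.0 - rho) / gamma_fn(rho)) * (z ** rho) * kv(rho, z)
```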

Theorem 2: The modified Matern function on the Wasserstein metric space W is a valid kernel function.

Short Proof: To show that the above function is a kernel function, we need to prove that Mercer's theorem holds. i.e., (i) the function is symmetric and (ii) in finite input space, the Gram matrix of the kernel function is positive semi-definite. The detailed proof is given in the Supplementary material.

3.3. Optimal hyperparameter selection using Bayesian Optimization

While BO is used in various settings, successful applications are often limited to low-dimensional problems, with fewer than twenty dimensions (Frazier, 2018). Thus, using BO for high-dimensional problems remains a significant challenge. In our case of optimizing the HRSNN model parameters for 2,000 neurons, we need to optimize a huge number of parameters, which is extremely difficult for standard BO. As discussed by Eriksson and Jankowiak (2021), suitable function priors are especially important for good performance. Thus, we used a biologically inspired initialization of the hyperparameters derived from the human brain (see Supplementary material).

This paper uses a modified BO to estimate parameter distributions for the LIF neurons and the STDP dynamics. To learn the probability distribution of the data, we modify the surrogate model and the acquisition function of the BO to treat the parameter distributions instead of individual variables. This makes our modified BO highly scalable over all the variables (dimensions) used. The loss for the surrogate model's update is calculated using the Wasserstein distance between the parameter distributions.

BO uses a Gaussian process to model the distribution of an objective function and an acquisition function to decide points to evaluate. For data points in a target dataset x ∈ X and the corresponding label y ∈ Y, an SNN with network structure V and neuron parameters W acts as a function fV,W(x) that maps input data x to predicted label ỹ. The optimization problem in this work is defined as

$$\min_{V, W} \sum_{x \in X,\; y \in Y} \mathcal{L}\big(y, f_{V,W}(x)\big) \quad (7)$$

where V is the set of hyperparameters of the neurons in R (Details of hyperparameters given in the Supplementary material) and W is the multi-variate distribution constituting the distributions of (i) the membrane time constants τm−E, τm−I of the LIF neurons, (ii) the scaling function constants (A+, A−) and (iii) the decay time constants τ+, τ− for the STDP learning rule in SRR.

Again, BO needs a prior distribution of the objective function f(x⃗) on the given data D_{1:k} = {x⃗_{1:k}, f(x⃗_{1:k})}. In GP-based BO, it is assumed that the prior distribution of f(x⃗_{1:k}) follows a multivariate Gaussian distribution, i.e., a Gaussian Process with mean μ⃗_{D_{1:k}} and covariance Σ⃗_{D_{1:k}}. We estimate Σ⃗_{D_{1:k}} using the modified Matern kernel function given in Equation 6, where the distance between two points is the Wasserstein distance between the multivariate distributions of the different parameters. It is to be noted that for higher-dimensional metric spaces, we use the Sinkhorn distance, a regularized version of the Wasserstein distance, to approximate the Wasserstein distance (Feydy et al., 2019).

D_{1:k} are the points that have been evaluated by the objective function, and the GP estimates the mean μ⃗_{D_{k:n}} and variance σ⃗_{D_{k:n}} for the remaining unevaluated data D_{k:n}. The acquisition function used in this work is the expected improvement (EI) of the prediction fitness:

$$EI(\vec{x}_{k:n}) = \big(\vec{\mu}_{D_{k:n}} - f(x_{\text{best}})\big)\, \Phi(\vec{Z}) + \vec{\sigma}_{D_{k:n}}\, \phi(\vec{Z}) \quad (8)$$

where Φ(·) and ϕ(·) denote the cumulative distribution function and the probability density function of the prior distribution, respectively, f(x_best) = max f(x⃗_{1:k}) is the maximum value evaluated by the original function f over all evaluated data D_{1:k}, and Z⃗ = (μ⃗_{D_{k:n}} − f(x_best)) / σ⃗_{D_{k:n}}. BO chooses the point x_j = argmax EI(x⃗_{k:n}) as the next point to be evaluated using the original objective function.
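The following sketch evaluates Equation 8 over a batch of candidate points; the GP posterior means and standard deviations are assumed to be given by the surrogate model.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, eps=1e-12):
    """Expected improvement (Equation 8) for arrays of GP means and std devs."""
    z = (mu - f_best) / np.maximum(sigma, eps)
    ei = (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)
    return np.where(sigma > eps, ei, 0.0)

# The next candidate to evaluate with the true objective is the argmax of EI:
# x_next = candidates[np.argmax(expected_improvement(mu, sigma, f_best))]
```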

4. Experiments 4.1. Training and inference

We use a network of leaky integrate-and-fire (LIF) neurons and train the synapses using a Hebbian plasticity rule called spike-timing-dependent plasticity (STDP). The complete network is shown in Figure 5. First, to pre-process the spatio-temporal data and remove the background noise that arises due to camera movement and jitter, we use the scan-based filtering technique proposed by Panda and Srinivasa (2018), where we create a bounding box and the center of gravity of the spiking activity for each frame and scan across five directions, as shown in Figure 2. The output of this scan-based filter is fed into the encoding layer, which encodes this information into an array of spike trains. In this paper, we use a temporal coding method. Following Zhou et al. (2020), we use a square cosine encoding method that employs several cosine encoding neurons to convert real-valued variables into spike times. The encoding neurons convert each real value to several spike times within a limited encoding period. Each real value is first normalized into [0, π] and then converted into spike times as t_s = T · cos(d + i · π/n), d ∈ [0, π], i = 1, 2, …, n, where t_s is the spiking time, T is the maximum encoding time of each spike, d denotes the normalized data, i is the sequence number of the encoding neuron, and n is the number of encoding neurons.
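A minimal sketch of the square cosine encoding formula above is given below; the number of encoding neurons, the normalization range, and the maximum encoding time T are illustrative choices.

```python
import numpy as np

def cosine_encode(value, v_min, v_max, n_neurons=8, T=100.0):
    """Map a real value to n spike times via the square cosine encoding."""
    d = np.pi * (value - v_min) / (v_max - v_min)      # normalize to [0, pi]
    i = np.arange(1, n_neurons + 1)                    # encoding neuron indices
    return T * np.cos(d + i * np.pi / n_neurons)       # spike times t_s
```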


Figure 2. Figure showing a flowchart for the input processing and model training. The figure shows selected frames from a video of the UCF101 dataset (Soomro et al., 2012).

The sensory receptors used for the spatial-temporal data are designed to receive the difference between time-adjacent data in a sequence. The data in each sequence is processed as follows:

$$M_s = \left\| \left[ \Delta(D_1, D_2), \ldots, \Delta(D_{N-1}, D_N) \right] \right\| \quad (9)$$
$$\Delta(D_{n-1}, D_n) = \begin{cases} 1 & \text{if } \Delta(D_{n-1}, D_n) \ge \text{threshold} \cdot \max\!\big(M_s(\cdot)\big) \\ 0 & \text{otherwise} \end{cases} \quad (10)$$

where M_s represents a sequence and D_n represents an individual data element in that sequence. If the difference exceeds the threshold, the encoding neuron fires at that moment. We use a max-pooling operation before transferring the spike trains to post-synaptic neurons, where each pixel in the output max-pooled frame represents an encoding neuron. This helps reduce the dimensions of the spike train.
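The following sketch illustrates Equations 9-10 together with the max-pooling step for a sequence of frames; the threshold and pooling factor are placeholder values.

```python
import numpy as np

def encode_sequence(frames, threshold=0.5, pool=4):
    """frames: array of shape (N, H, W). Returns pooled binary spike frames."""
    diffs = np.abs(np.diff(frames.astype(float), axis=0))       # Δ(D_{n-1}, D_n)
    spikes = (diffs >= threshold * diffs.max()).astype(float)   # Equation 10
    n, h, w = spikes.shape
    trimmed = spikes[:, :h - h % pool, :w - w % pool]            # crop to pool multiple
    pooled = trimmed.reshape(n, h // pool, pool, w // pool, pool).max(axis=(2, 4))
    return pooled                                                 # one encoding neuron per pixel
```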

The recurrent spiking layer extracts the features of the spatio-temporal data and converts them into linearly separable states in a high-dimensional space. O abstracts the state from R for classification. The state of R is defined as the membrane potential of the output neurons at the end of each spike train converted from the injected spatio-temporal data. After the state is extracted, the membrane potential of the output neurons is reset to its initial value. After injecting all sequences into the network, the states for each data sequence are obtained. A linear classifier is employed in this work to evaluate pattern recognition performance. Further details regarding the training and inference procedures are given in the Supplementary material.
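As an illustrative sketch of the readout stage, the per-sequence states can be collected and fed to an off-the-shelf linear classifier; run_hrsnn is a hypothetical stand-in for simulating the trained network on one input sequence and returning the output-layer membrane potentials at the final step.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def read_states(sequences, run_hrsnn):
    """Collect the final output-layer membrane potentials for each sequence."""
    states = []
    for seq in sequences:
        v_out = run_hrsnn(seq)       # membrane potentials of O at the end of the spike train
        states.append(v_out.copy())  # potentials are reset before the next sequence
    return np.stack(states)

# Example usage (hypothetical data):
# X_train = read_states(train_sequences, run_hrsnn)
# clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
```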

4.2. Baseline ablation models

We use the following baselines for the comparative study:

• Recurrent Spiking Neural Network with STDP:

• Homogeneous LIF Neurons and Homogeneous STDP Learning (HoNHoS)

• Heterogeneity in LIF Neuron Parameters and Homogeneous STDP Learning (HeNHoS)

• Homogeneous LIF Neuron Parameters and Heterogeneity in LTP/LTD dynamics of STDP (HoNHeS)

• Heterogeneity in both LIF and STDP parameters (HeNHeS)

• Recurrent Spiking Neural Network with Backpropagation:

• Homogeneous LIF Neurons trained with Backpropagation (HoNB)

• Heterogeneous LIF Neurons trained with Backpropagation (HeNB)

5. Results 5.1. Ablation studies

We compare the performance of the HRSNN model with heterogeneity in the LIF and STDP dynamics (HeNHeS) to the ablation baseline recurrent spiking neural network models described above. We run five iterations for all the baseline cases and show the mean and standard deviation of the prediction accuracy of the network using 2,000 neurons. The results are shown in Table 1. We see that heterogeneity in the LIF neurons and in the LTP/LTD dynamics significantly improves the model's accuracy and reduces its error.


Table 1. Table comparing the performance of RSNN with homogeneous and heterogeneous LIF neurons using different learning methods with 2,000 neurons.

5.2. Number of neurons

In deep learning, it is important to design models with fewer neurons without degrading performance. We empirically show that heterogeneity plays a critical role in designing spiking neuron models of smaller sizes. We compare the models' performance and convergence rates with fewer neurons in R.

5.2.1. Performance analysis

We analyze the network performance and error when the number of neurons is decreased from 2,000 to just 100. We report the results obtained using the HoNHoS and HeNHeS models for the KTH and DVS-Gesture datasets. The experiments are repeated five times, and the observed mean and standard deviation of the accuracies are shown in Figure 3. The graphs show that as the number of neurons decreases, the difference in accuracy scores between the homogeneous and the heterogeneous networks increases rapidly.


Figure 3. Comparison of performance of HRSNN models for the (A) KTH dataset and (B) DVS128 dataset for varying numbers of neurons. The bar graph (left Y-axis) shows the difference in accuracy between the HeNHeS and HoNHoS models. The line graphs (right Y-axis) show the accuracies (%) for the four ablation networks (HoNHoS, HeNHoS, HoNHeS, and HeNHeS).

5.2.2. Convergence analysis with lesser neurons

Since the complexity of BO increases exponentially with the size of the search space, optimizing the HRSNN becomes increasingly difficult as the number of neurons increases. Thus, we compare the convergence behavior of the HoNHoS and HeNHeS models with 100 and 2,000 neurons each. The results are plotted in Figures 4A, B. Despite the huge number of additional parameters, the convergence behavior of HeNHeS is similar to that of HoNHoS. Also, it must be noted that once converged, the standard deviation of the accuracies for HeNHeS is much smaller than that of HoNHoS.
