Adaptive learning rate in dynamical binary environments: the signature of adaptive information processing

Appendix A: Variational Bayesian method

The variational Bayesian method converts intractable Bayesian inference into a variational optimization problem. In the variational Bayesian framework, the exact posterior probability densities are approximated by optimal variational posterior densities, which are obtained by maximizing a negative free energy bound on the log-model evidence. The negative free energy bound takes its name from a mathematically analogous bound in statistical mechanics (Feynman, 1998).

Given a Bayesian model \(p(\mathbb{Z}, x \mid \mathbb{M})\) with the set of all hidden states \(\mathbb{Z}\), the set of all fixed parameters \(\mathbb{M}\) and observed data \(x\) (the model is denoted by \(m\)), the logarithm of the model evidence \(\ln p(x \mid m)\) has a lower bound given by the negative free energy \(\mathcal{F}(q(\mathbb{Z}), x)\)

$$\begin{aligned} \ln p(x \mid m) &= \ln \int q(\mathbb{Z}) \frac{p(x, \mathbb{Z} \mid \mathbb{M})}{q(\mathbb{Z})} \,\mathrm{d}\mathbb{Z} \\ &\ge \int q(\mathbb{Z}) \ln \frac{p(x, \mathbb{Z} \mid \mathbb{M})}{q(\mathbb{Z})} \,\mathrm{d}\mathbb{Z} \\ &= \int q(\mathbb{Z}) \ln p(x, \mathbb{Z} \mid \mathbb{M}) \,\mathrm{d}\mathbb{Z} - \int q(\mathbb{Z}) \ln q(\mathbb{Z}) \,\mathrm{d}\mathbb{Z} \\ &= \ln p(x \mid m) - D_{\mathrm{KL}}\!\left[ q(\mathbb{Z}) \,\Vert\, p(\mathbb{Z} \mid x, \mathbb{M}) \right] \\ &= \mathcal{F}(q(\mathbb{Z}), x) \end{aligned}$$

(A1)

where the lower bound of the log-model evidence is the negative free energy \(\mathcal{F}(q(\mathbb{Z}), x)\), and \(q(\mathbb{Z})\) is an approximation to the true posterior \(p(\mathbb{Z} \mid x, \mathbb{M})\). The Kullback-Leibler divergence \(D_{\mathrm{KL}}[q(\mathbb{Z}) \,\Vert\, p(\mathbb{Z} \mid x, \mathbb{M})] \ge 0\) measures the discrepancy between the approximation and the true posterior: the better the approximation \(q(\mathbb{Z})\), the smaller the divergence, and the minimum value of 0 is attained when \(q(\mathbb{Z})\) equals \(p(\mathbb{Z} \mid x, \mathbb{M})\). The agent can therefore obtain the optimal approximate posterior \(q(\mathbb{Z})\) by maximizing the negative free energy \(\mathcal{F}(q(\mathbb{Z}), x)\).
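As a concrete, purely illustrative check of the identity \(\mathcal{F}(q(\mathbb{Z}), x) = \ln p(x \mid m) - D_{\mathrm{KL}}[q(\mathbb{Z}) \,\Vert\, p(\mathbb{Z} \mid x, \mathbb{M})]\), the following sketch evaluates both sides for a toy model with a single binary hidden state; the joint table and the variational distribution are arbitrary choices made only for this check.

```python
import numpy as np

# Toy model: one binary hidden state Z and one fixed observation x.
# p_joint[z] = p(Z = z, x | m), arbitrary positive numbers for illustration.
p_joint = np.array([0.10, 0.30])
p_x = p_joint.sum()                # model evidence p(x | m)
p_post = p_joint / p_x             # exact posterior p(Z | x, m)

q = np.array([0.40, 0.60])         # an arbitrary variational posterior q(Z)

# Negative free energy F(q, x) = E_q[ln p(Z, x | m)] - E_q[ln q(Z)]
F = np.sum(q * np.log(p_joint)) - np.sum(q * np.log(q))
# Kullback-Leibler divergence D_KL[q || p(Z | x, m)]
kl = np.sum(q * np.log(q / p_post))

# Both quantities coincide, confirming F = ln p(x | m) - KL.
assert np.isclose(F, np.log(p_x) - kl)
print(F, np.log(p_x) - kl)
```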

$$\begin{aligned} q^*(\mathbb{Z}) = \arg\max_{q(\mathbb{Z})} \mathcal{F}(q(\mathbb{Z}), x) \end{aligned}$$

(A2)

We can use the Lagrange method to solve this problem. The Lagrangian functional is defined by

$$\begin{aligned} \mathcal{F}_{\mathrm{L}}(q(\mathbb{Z}), x) = \mathcal{F}(q(\mathbb{Z}), x) + \upsilon \left[ \int q(\mathbb{Z}) \,\mathrm{d}\mathbb{Z} - 1 \right] \end{aligned}$$

(A3)

where \(\upsilon\) is a Lagrange multiplier enforcing normalization of \(q(\mathbb{Z})\). The solution of the optimization problem (Eq. A2) is also a solution of the variational equation

$$\begin{aligned} \frac{\delta \mathcal{F}_{\mathrm{L}}(q(\mathbb{Z}), x)}{\delta q(\mathbb{Z})} = 0 \end{aligned}$$

(A4)

Here we make the key mean-field assumption that the joint variational posterior factorizes over the individual hidden states (Eq. A5)

$$\begin{aligned} q(\mathbb{Z}) = \prod_{\chi \in \mathbb{Z}} q(\chi) \end{aligned}$$

(A5)

Under this factorization, the Lagrangian functional \(\mathcal{F}_{\mathrm{L}}(q(\mathbb{Z}), x)\) can be written as a functional \(\bar{\mathcal{F}}_{\mathrm{L}}\) of a single factor \(q(\chi)\) and the remaining factors

$$\begin{aligned} \mathcal{F}_{\mathrm{L}}(q(\mathbb{Z}), x) ={}& \bar{\mathcal{F}}_{\mathrm{L}}\!\left( q(\mathbb{Z} \backslash \{\chi\}), q(\chi) \right) \\ \triangleq{}& \int q(\mathbb{Z} \backslash \{\chi\})\, q(\chi) \ln p(\mathbb{Z}, x \mid \mathbb{M}) \,\mathrm{d}\mathbb{Z} \backslash \{\chi\} \,\mathrm{d}\chi \\ &- \int q(\mathbb{Z} \backslash \{\chi\})\, q(\chi) \ln \left[ q(\mathbb{Z} \backslash \{\chi\})\, q(\chi) \right] \mathrm{d}\mathbb{Z} \backslash \{\chi\} \,\mathrm{d}\chi \\ &+ \upsilon_{\chi} \left( \int q(\chi) \,\mathrm{d}\chi - 1 \right) + \sum_{\zeta \in \mathbb{Z} \backslash \{\chi\}} \upsilon_{\zeta} \left( \int q(\zeta) \,\mathrm{d}\zeta - 1 \right) \end{aligned}$$

(A6)

where \(\chi \in \mathbb{Z}\) is an arbitrary hidden state, \(\upsilon_{\chi}\) is a Lagrange multiplier, and \(\mathbb{Z} \backslash \{\chi\}\) denotes the set difference \(\mathbb{Z} - \{\chi\}\). The variation of Eq. A6 with respect to \(q(\chi)\) is

$$\begin{aligned} \frac{\delta \bar{\mathcal{F}}_{\mathrm{L}}\!\left( q(\mathbb{Z} \backslash \{\chi\}), q(\chi) \right)}{\delta q(\chi)} ={}& \int q(\mathbb{Z} \backslash \{\chi\}) \ln p(\mathbb{Z}, x \mid \mathbb{M}) \,\mathrm{d}\mathbb{Z} \backslash \{\chi\} \\ &- \int q(\mathbb{Z} \backslash \{\chi\}) \ln q(\mathbb{Z} \backslash \{\chi\}) \,\mathrm{d}\mathbb{Z} \backslash \{\chi\} \\ &- \ln q(\chi) - 1 + \upsilon_{\chi} \\ ={}& 0 \end{aligned}$$

(A7)

Solving Eq. A7, the optimal variational posterior \(q^*(\chi)\) takes the form of a Boltzmann distribution

$$\begin{aligned} q^*(\chi) &= \frac{1}{Z_{\chi}} \exp\!\left( V(\chi) \right) \\ V(\chi) &= \int q^*(\mathbb{Z} \backslash \{\chi\}) \ln p(\mathbb{Z}, x \mid \mathbb{M}) \,\mathrm{d}\mathbb{Z} \backslash \{\chi\} \\ H(\mathbb{Z} \backslash \{\chi\}) &= -\int q^*(\mathbb{Z} \backslash \{\chi\}) \ln q^*(\mathbb{Z} \backslash \{\chi\}) \,\mathrm{d}\mathbb{Z} \backslash \{\chi\} \\ Z_{\chi} &= \exp\!\left( -H(\mathbb{Z} \backslash \{\chi\}) + 1 - \upsilon_{\chi} \right) \end{aligned}$$

(A8)

where \(V(\chi)\) is often called the variational energy.
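A minimal sketch of how Eq. A8 is used in practice (my own illustration, not part of the original derivation): for a model with two binary hidden states and a fixed observation, each factor is repeatedly set to \(q^*(\chi) \propto \exp(V(\chi))\), which is the usual coordinate-ascent scheme.

```python
import numpy as np

# Joint p(chi1, chi2, x | M) for one fixed observation x, as a 2x2 table
# over two binary hidden states (arbitrary positive values for illustration).
log_joint = np.log(np.array([[0.20, 0.05],
                             [0.10, 0.40]]))   # rows: chi1, columns: chi2

q1 = np.array([0.5, 0.5])                      # initial factor q(chi1)
q2 = np.array([0.5, 0.5])                      # initial factor q(chi2)

for _ in range(50):
    # q*(chi1) ∝ exp( E_{q(chi2)}[ln p(chi1, chi2, x)] ), Eq. (A8)
    V1 = log_joint @ q2
    q1 = np.exp(V1 - V1.max()); q1 /= q1.sum()
    # q*(chi2) ∝ exp( E_{q(chi1)}[ln p(chi1, chi2, x)] )
    V2 = q1 @ log_joint
    q2 = np.exp(V2 - V2.max()); q2 /= q2.sum()

print(q1, q2)   # mean-field factors approximating the exact posterior
```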

Appendix B: Derivations of variational updates

We first focus on the variational energies in the Gaussian random walk model with a Bernoulli observation level and specialize the general variational energy \(V(\chi_t)\) of Eq. A8 to trial \(t\), where \(\mathbb{Z}_t = \{z_t, y_t\}\) denotes the hidden states at trial \(t\) and \(\mathbb{M} = \{\theta, \beta\}\) the fixed parameters

$$\begin{aligned} V(\chi_t) &= \int q^*(\mathbb{Z}_t \backslash \{\chi_t\}) \ln p(\mathbb{Z}_t, x_t \mid \mathbb{M}) \,\mathrm{d}\mathbb{Z}_t \backslash \{\chi_t\} \\ &= E_{q^*(\mathbb{Z}_t \backslash \{\chi_t\})}\!\left[ \ln p(\mathbb{Z}_t, x_t \mid \mathbb{M}) \right] \\ &= E_{q^*(\mathbb{Z}_t \backslash \{\chi_t\})}\!\left[ \ln p(z_t \mid \theta) + \ln p(y_t \mid z_t) + \ln p(x_t \mid y_t, \beta) \right] \end{aligned}$$

(B9)

To compute the variational energy of \(y_t\), set \(\chi_t = y_t\) in Eq. B9; then

$$\begin{aligned} V(y_t) &= E_{q^*(z_t)}\!\left[ \ln p(z_t \mid \theta) + \ln p(y_t \mid z_t) + \ln p(x_t \mid y_t, \beta) \right] \\ &= \ln p(x_t \mid y_t, \beta) + E_{q^*(z_t)}\!\left[ \ln p(z_t \mid \theta) \right] + E_{q^*(z_t)}\!\left[ \ln p(y_t \mid z_t) \right] \\ &\triangleq \ln p(x_t \mid y_t, \beta) + E_{q^*(z_t)}\!\left[ \ln p(y_t \mid z_t) \right] + \mathrm{const} \end{aligned}$$

(B10)

To evaluate the second term \(E_{q^*(z_t)}[\ln p(y_t \mid z_t)]\), we proceed in two steps. First, the sufficient statistics of the posterior density of \(z_t\) are required but not yet known, so they are approximated by the sufficient statistics of the previous posterior \(q^*(z_{t-1}) = \mathcal{N}(z_{t-1}; \mu_{t-1}, \sigma_{t-1})\). Second, we expand \(\ln s(z_t)\) to second order around the expectation \(\mu_{t-1}\) of the random variable \(z_t\)

$$\begin{aligned} \ln s(z_t) \approx{}& \ln s(\mu_{t-1}) + \left( 1 - s(\mu_{t-1}) \right) \left( z_t - \mu_{t-1} \right) \\ &- \frac{1}{2} s(\mu_{t-1}) \left( 1 - s(\mu_{t-1}) \right) \left( z_t - \mu_{t-1} \right)^2 \end{aligned}$$

(B11)
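As a quick numerical sanity check of Eq. B11 (my addition; the expansion point and evaluation point are arbitrary), note that \(\frac{\mathrm{d}}{\mathrm{d}z} \ln s(z) = 1 - s(z)\) and \(\frac{\mathrm{d}^2}{\mathrm{d}z^2} \ln s(z) = -s(z)(1 - s(z))\):

```python
import numpy as np

def s(z):                      # logistic sigmoid
    return 1.0 / (1.0 + np.exp(-z))

mu_prev = 0.3                  # expansion point mu_{t-1} (arbitrary)
z = 0.8                        # evaluation point z_t (arbitrary)

exact = np.log(s(z))
approx = (np.log(s(mu_prev))
          + (1 - s(mu_prev)) * (z - mu_prev)
          - 0.5 * s(mu_prev) * (1 - s(mu_prev)) * (z - mu_prev) ** 2)

print(exact, approx)           # close whenever z is near mu_{t-1}
```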

Now the variational energy \(V(y_t)\) of the bottom-level state \(y_t\) is

$$\begin{aligned} V(y_t) ={}& y_t \left( \ln s(\mu_{t-1}) - \frac{\sigma_{t-1}\, s(\mu_{t-1}) \left( 1 - s(\mu_{t-1}) \right)}{2} \right) \\ &+ (1 - y_t) \left( \ln \left( 1 - s(\mu_{t-1}) \right) - \frac{\sigma_{t-1}\, s(\mu_{t-1}) \left( 1 - s(\mu_{t-1}) \right)}{2} \right) \\ &+ \ln p(x_t \mid y_t, \beta) + \mathrm{const} \end{aligned}$$

(B12)

With Eq. 16 and Eq. 17, and noting that the second-order correction terms in Eq. B12 are common to both branches of \(y_t\) and therefore cancel upon normalization, the update rule for \(p_t\) is

$$\begin{aligned} p_t = q^*(y_t = 1) = \frac{p(x_t \mid y_t = 1, \beta)\, s(\mu_{t-1})}{p(x_t \mid y_t = 1, \beta)\, s(\mu_{t-1}) + p(x_t \mid y_t = 0, \beta) \left( 1 - s(\mu_{t-1}) \right)} \end{aligned}$$

(B13)
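A sketch of the \(p_t\) update (my own illustration): the exact form of the observation likelihood \(p(x_t \mid y_t, \beta)\) is given by Eqs. 16 and 17 of the main text, which are not reproduced here, so the function below simply takes the two likelihood values as inputs.

```python
import numpy as np

def s(z):                      # logistic sigmoid
    return 1.0 / (1.0 + np.exp(-z))

def update_p(mu_prev, lik1, lik0):
    """Posterior q*(y_t = 1) as in Eq. (B13).

    lik1, lik0 are the observation likelihoods p(x_t | y_t = 1, beta) and
    p(x_t | y_t = 0, beta); their exact form follows Eqs. 16-17 of the
    main text and is treated as given here.
    """
    prior1 = s(mu_prev)        # predictive probability that y_t = 1
    num = lik1 * prior1
    return num / (num + lik0 * (1.0 - prior1))

# Example: an observation that favours y_t = 1 twice as strongly as y_t = 0.
print(update_p(mu_prev=0.0, lik1=0.8, lik0=0.4))   # > 0.5
```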

To compute the variational energy of \(z_t\), set \(\chi_t = z_t\) in Eq. B9; then

$$\begin{aligned} V(z_t) &= E_{q^*(y_t)}\!\left[ \ln p(z_t \mid \theta) + \ln p(y_t \mid z_t) + \ln p(x_t \mid y_t, \beta) \right] \\ &= \ln p(z_t \mid \theta) + E_{q^*(y_t)}\!\left[ \ln p(y_t \mid z_t) \right] + E_{q^*(y_t)}\!\left[ \ln p(x_t \mid y_t, \beta) \right] \\ &\triangleq \ln p(z_t \mid \theta) + E_{q^*(y_t)}\!\left[ \ln p(y_t \mid z_t) \right] + \mathrm{const} \end{aligned}$$

(B14)

where the first term \(\ln p(z_t \mid \theta)\) can be calculated in the following two steps:

$$\begin{aligned} p(z_t \mid \theta) &= \int p(z_t \mid z_{t-1}, \theta)\, p(z_{t-1} \mid x_{1:t-1}) \,\mathrm{d}z_{t-1} \\ &\approx \int p(z_t \mid z_{t-1}, \theta)\, q^*(z_{t-1}) \,\mathrm{d}z_{t-1} \\ &= \int \mathcal{N}(z_t; z_{t-1}, \theta)\, \mathcal{N}(z_{t-1}; \mu_{t-1}, \sigma_{t-1}) \,\mathrm{d}z_{t-1} \\ &= \mathcal{N}(z_t; \mu_{t-1}, \sigma_{t-1} + \theta) \\ \ln p(z_t \mid \theta) &= \ln \mathcal{N}(z_t; \mu_{t-1}, \sigma_{t-1} + \theta) \\ &= \ln \frac{1}{\sqrt{2\pi (\sigma_{t-1} + \theta)}} - \frac{(z_t - \mu_{t-1})^2}{2 (\sigma_{t-1} + \theta)}. \end{aligned}$$

(B15)

and the second term \(E_{q^*(y_t)}[\ln p(y_t \mid z_t)]\) is

$$\begin{aligned} E_{q^*(y_t)}\!\left[ \ln p(y_t \mid z_t) \right] &= E_{q^*(y_t)}\!\left[ y_t \ln s(z_t) + (1 - y_t) \ln \left( 1 - s(z_t) \right) \right] \\ &= p_t \ln s(z_t) + (1 - p_t) \ln \left( 1 - s(z_t) \right) \end{aligned}$$

(B16)

Substituting Eq. B15 and Eq. B16 into Eq. B14 gives

$$\begin{aligned} V(z_t) &= \ln p(z_t \mid \theta) + E_{q^*(y_t)}\!\left[ \ln p(y_t \mid z_t) \right] + \mathrm{const} \\ &= -\frac{(z_t - \mu_{t-1})^2}{2 (\sigma_{t-1} + \theta)} + p_t \ln s(z_t) + (1 - p_t) \ln \left( 1 - s(z_t) \right) + \mathrm{const} \end{aligned}$$

(B17)

We would like \(V(z_t)\) to be a quadratic form, so we use a one-step Newton method to make a Laplace approximation to \(q^*(z_t) = \frac{1}{Z_{z_t}} \exp(V(z_t))\), expanding \(V(z_t)\) to second order around the previous expectation \(\mu_{t-1}\).

$$\begin{aligned} \bar{V}(z_t) ={}& V(\mu_{t-1}) + \frac{\mathrm{d} V(z_t)}{\mathrm{d} z_t}\bigg|_{z_t = \mu_{t-1}} (z_t - \mu_{t-1}) \\ &+ \frac{1}{2} \frac{\mathrm{d}^2 V(z_t)}{\mathrm{d} z_t^2}\bigg|_{z_t = \mu_{t-1}} (z_t - \mu_{t-1})^2 \\ \approx{}& V(z_t) \end{aligned}$$

(B18)

where the first derivative \(V'(z_t) = \frac{\mathrm{d} V(z_t)}{\mathrm{d} z_t}\) and the second derivative \(V''(z_t) = \frac{\mathrm{d}^2 V(z_t)}{\mathrm{d} z_t^2}\) are

$$\begin{aligned} V'(z_t) &= -\frac{z_t - \mu_{t-1}}{\sigma_{t-1} + \theta} + p_t - s(z_t) \\ V''(z_t) &= -\frac{1}{\sigma_{t-1} + \theta} - s(z_t) \left( 1 - s(z_t) \right) \end{aligned}$$

(B19)
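The analytic derivatives of Eq. B19 can be verified against finite differences of Eq. B17; the snippet below is such a check (my addition, with arbitrary parameter values).

```python
import numpy as np

def s(z):
    return 1.0 / (1.0 + np.exp(-z))

mu_prev, sigma_prev, theta, p_t = 0.2, 0.5, 0.1, 0.7   # arbitrary values

def V(z):        # Eq. (B17), up to the additive constant
    return (-(z - mu_prev) ** 2 / (2 * (sigma_prev + theta))
            + p_t * np.log(s(z)) + (1 - p_t) * np.log(1 - s(z)))

def V1(z):       # first derivative, Eq. (B19)
    return -(z - mu_prev) / (sigma_prev + theta) + p_t - s(z)

def V2(z):       # second derivative, Eq. (B19)
    return -1.0 / (sigma_prev + theta) - s(z) * (1 - s(z))

z, h = 0.4, 1e-4
print(V1(z), (V(z + h) - V(z - h)) / (2 * h))               # agree
print(V2(z), (V(z + h) - 2 * V(z) + V(z - h)) / h ** 2)     # agree
```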

The quadratic function \(\bar{V}(z_t)\) has a unique maximum where its derivative vanishes.

$$\begin{aligned} \frac{\mathrm{d} \bar{V}(z_t)}{\mathrm{d} z_t} = V'(\mu_{t-1}) + V''(\mu_{t-1}) (z_t - \mu_{t-1}) = 0 \end{aligned}$$

(B20)

The approximate posterior variance \(\sigma_t\) satisfies

$$\begin{aligned} -\frac{1}{\sigma_t} = \frac{\mathrm{d}^2 \bar{V}(z_t)}{\mathrm{d} z_t^2} = V''(\mu_{t-1}) \end{aligned}$$

(B21)

The solution \(\mu_t\) of Eq. B20 is

$$\begin{aligned} \mu_t = \mu_{t-1} - \frac{V'(\mu_{t-1})}{V''(\mu_{t-1})} \end{aligned}$$

(B22)

Therefore, we obtain the pair of update equations

$$\begin{aligned} \frac{1}{\sigma_t} &= \frac{1}{\sigma_{t-1} + \theta} + s(\mu_{t-1}) \left( 1 - s(\mu_{t-1}) \right) \\ \mu_t &= \mu_{t-1} + \sigma_t \left( p_t - s(\mu_{t-1}) \right) \end{aligned}$$

(B23)
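Putting the pieces together, one filtering step of Eq. B23 can be sketched as follows (my own illustration; `p_t` is assumed to have already been computed via Eq. B13, and the initial values and inputs are arbitrary). Note that \(\sigma_t\) plays the role of an adaptive learning rate on the prediction error \(p_t - s(\mu_{t-1})\).

```python
import numpy as np

def s(z):
    return 1.0 / (1.0 + np.exp(-z))

def filter_step(mu_prev, sigma_prev, theta, p_t):
    """One variational update of (mu_t, sigma_t), Eq. (B23)."""
    precision = 1.0 / (sigma_prev + theta) + s(mu_prev) * (1.0 - s(mu_prev))
    sigma_t = 1.0 / precision
    # sigma_t scales the prediction error, acting as the learning rate
    mu_t = mu_prev + sigma_t * (p_t - s(mu_prev))
    return mu_t, sigma_t

mu, sigma, theta = 0.0, 1.0, 0.1
for p_t in [0.9, 0.8, 0.95, 0.2, 0.1]:          # arbitrary example inputs
    mu, sigma = filter_step(mu, sigma, theta, p_t)
    print(f"mu = {mu:+.3f}   sigma = {sigma:.3f}")
```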

Appendix C: Evaluating negative free energy

For a Bayesian agent (model) \(m\) with parameters \(\boldsymbol{\vartheta}\), the posterior \(p(\boldsymbol{\vartheta} \mid x_{1:K}, m)\) over the parameters given the full observation sequence \(x_{1:K}\) is approximated by a multivariate Gaussian distribution \(q(\boldsymbol{\vartheta})\) under the Laplace approximation

$$\begin{aligned} p(\boldsymbol{\vartheta} \mid x_{1:K}, m) \approx q(\boldsymbol{\vartheta}) = \mathcal{N}(\boldsymbol{\vartheta}; \boldsymbol{\mu}_{\vartheta}, \boldsymbol{\Sigma}_{\vartheta}), \end{aligned}$$

where \(\boldsymbol{\Sigma}_{\vartheta}\) is the covariance matrix. The mean \(\boldsymbol{\mu}_{\vartheta}\) is determined by maximizing the posterior \(p(\boldsymbol{\vartheta} \mid x_{1:K}, m)\)

$$\begin{aligned} \boldsymbol{\mu}^*_{\vartheta} &= \arg\max_{\boldsymbol{\vartheta}} p(\boldsymbol{\vartheta} \mid x_{1:K}, m) \\ &= \arg\max_{\boldsymbol{\vartheta}} \frac{p(\boldsymbol{\vartheta}, x_{1:K} \mid m)}{p(x_{1:K} \mid m)} \\ &= \arg\max_{\boldsymbol{\vartheta}} p(\boldsymbol{\vartheta}, x_{1:K} \mid m). \end{aligned}$$

(C24)

The optimal \(q^*(\boldsymbol{\vartheta})\) is determined by maximizing the negative free energy \(\mathcal{F}(q(\boldsymbol{\vartheta}))\)

$$\begin{aligned} \max_{\boldsymbol{\vartheta}} \ln p(x_{1:K} \mid \boldsymbol{\vartheta}, m) &\ge \max_{q(\boldsymbol{\vartheta})} \mathcal{F}(q(\boldsymbol{\vartheta})) \\ &= \max_{q(\boldsymbol{\vartheta})} \int q(\boldsymbol{\vartheta}) \ln p(x_{1:K}, \boldsymbol{\vartheta} \mid m) - q(\boldsymbol{\vartheta}) \ln q(\boldsymbol{\vartheta}) \,\mathrm{d}\boldsymbol{\vartheta} \end{aligned}$$

(C25)

We use the notation \(\mathcal{L}(\boldsymbol{\vartheta})\) to denote the log-joint \(\ln p(x_{1:K}, \boldsymbol{\vartheta} \mid m)\) and use Taylor's theorem to expand \(\mathcal{L}(\boldsymbol{\vartheta})\) around the point \(\boldsymbol{\mu}^*_{\vartheta}\)

$$\begin{aligned} \mathcal{L}(\boldsymbol{\vartheta}) \approx{}& \mathcal{L}(\boldsymbol{\mu}^*_{\vartheta}) + \frac{\partial \mathcal{L}(\boldsymbol{\mu}^*_{\vartheta})}{\partial \boldsymbol{\vartheta}} (\boldsymbol{\vartheta} - \boldsymbol{\mu}^*_{\vartheta}) \\ &+ \frac{1}{2} (\boldsymbol{\vartheta} - \boldsymbol{\mu}^*_{\vartheta})^T \frac{\partial^2 \mathcal{L}(\boldsymbol{\mu}^*_{\vartheta})}{\partial \boldsymbol{\vartheta}\, \partial \boldsymbol{\vartheta}^T} (\boldsymbol{\vartheta} - \boldsymbol{\mu}^*_{\vartheta}). \end{aligned}$$

(C26)

The first term \(\int q(\boldsymbol{\vartheta})\, \mathcal{L}(\boldsymbol{\vartheta}) \,\mathrm{d}\boldsymbol{\vartheta}\) of the negative free energy \(\mathcal{F}(q(\boldsymbol{\vartheta}))\) is evaluated as

$$\begin{aligned} \int q(\boldsymbol{\vartheta})\, \mathcal{L}(\boldsymbol{\vartheta}) \,\mathrm{d}\boldsymbol{\vartheta} \approx{}& \mathcal{L}(\boldsymbol{\mu}^*_{\vartheta}) + \frac{\partial \mathcal{L}(\boldsymbol{\mu}^*_{\vartheta})}{\partial \boldsymbol{\vartheta}} E_{\mathcal{N}(\boldsymbol{\vartheta}; \boldsymbol{\mu}^*_{\vartheta}, \boldsymbol{\Sigma}_{\vartheta})}\!\left[ \boldsymbol{\vartheta} - \boldsymbol{\mu}^*_{\vartheta} \right] \\ &+ \frac{1}{2} E_{\mathcal{N}(\boldsymbol{\vartheta}; \boldsymbol{\mu}^*_{\vartheta}, \boldsymbol{\Sigma}_{\vartheta})}\!\left[ (\boldsymbol{\vartheta} - \boldsymbol{\mu}^*_{\vartheta})^T \frac{\partial^2 \mathcal{L}(\boldsymbol{\mu}^*_{\vartheta})}{\partial \boldsymbol{\vartheta}\, \partial \boldsymbol{\vartheta}^T} (\boldsymbol{\vartheta} - \boldsymbol{\mu}^*_{\vartheta}) \right] \\ ={}& \mathcal{L}(\boldsymbol{\mu}^*_{\vartheta}) + \frac{1}{2} \mathrm{tr}\!\left( \boldsymbol{\Sigma}_{\vartheta} \frac{\partial^2 \mathcal{L}(\boldsymbol{\mu}^*_{\vartheta})}{\partial \boldsymbol{\vartheta}\, \partial \boldsymbol{\vartheta}^T} \right) \end{aligned}$$

(C27)

The last term, the Gaussian entropy \(H_e(\boldsymbol{\vartheta}) = -\int q(\boldsymbol{\vartheta}) \ln q(\boldsymbol{\vartheta}) \,\mathrm{d}\boldsymbol{\vartheta}\), is given by

$$\begin{aligned} H_e(\boldsymbol{\vartheta}) &= -\int q(\boldsymbol{\vartheta}) \ln q(\boldsymbol{\vartheta}) \,\mathrm{d}\boldsymbol{\vartheta} \\ &= -E_{\mathcal{N}(\boldsymbol{\vartheta}; \boldsymbol{\mu}^*_{\vartheta}, \boldsymbol{\Sigma}_{\vartheta})}\!\left[ \ln \mathcal{N}(\boldsymbol{\vartheta}; \boldsymbol{\mu}^*_{\vartheta}, \boldsymbol{\Sigma}_{\vartheta}) \right] \\ &= -E_{\mathcal{N}(\boldsymbol{\vartheta}; \boldsymbol{\mu}^*_{\vartheta}, \boldsymbol{\Sigma}_{\vartheta})}\!\left[ -\frac{d_{\vartheta}}{2} \ln 2\pi - \frac{1}{2} \ln \det(\boldsymbol{\Sigma}_{\vartheta}) - \frac{1}{2} (\boldsymbol{\vartheta} - \boldsymbol{\mu}^*_{\vartheta})^T \boldsymbol{\Sigma}_{\vartheta}^{-1} (\boldsymbol{\vartheta} - \boldsymbol{\mu}^*_{\vartheta}) \right] \\ &= \frac{d_{\vartheta}}{2} \ln 2\pi + \frac{1}{2} \ln \det(\boldsymbol{\Sigma}_{\vartheta}) + \frac{1}{2} \mathrm{tr}(I_{d_{\vartheta}}) \\ &= \frac{d_{\vartheta}}{2} \ln 2\pi e + \frac{1}{2} \ln \det(\boldsymbol{\Sigma}_{\vartheta}) \end{aligned}$$

(C28)

where \(d_{\vartheta}\) is the number of free parameters (the dimension of \(\boldsymbol{\vartheta}\)). Therefore, the negative free energy \(\mathcal{F}_m = \mathcal{F}(q(\boldsymbol{\vartheta}))\) under the Laplace approximation is

$$\begin{aligned} \mathcal{F}_m &= E_{q(\boldsymbol{\vartheta})}\!\left[ \mathcal{L}(\boldsymbol{\vartheta}) \right] + H_e(\boldsymbol{\vartheta}) \\ &= \mathcal{L}(\boldsymbol{\mu}^*_{\vartheta}) + \frac{1}{2} \mathrm{tr}\!\left( \boldsymbol{\Sigma}_{\vartheta} \frac{\partial^2 \mathcal{L}(\boldsymbol{\mu}^*_{\vartheta})}{\partial \boldsymbol{\vartheta}\, \partial \boldsymbol{\vartheta}^T} \right) + \frac{d_{\vartheta}}{2} \ln 2\pi e + \frac{1}{2} \ln \det(\boldsymbol{\Sigma}_{\vartheta}) \end{aligned}$$

(C29)

\(\mathcal{F}_m\) is a scalar function of the covariance \(\boldsymbol{\Sigma}_{\vartheta}\). The optimal (stationary) point \(\boldsymbol{\Sigma}^*_{\vartheta}\), at which \(\mathcal{F}_m\) reaches its maximum, is found by setting the partial derivative \(\frac{\partial \mathcal{F}_m}{\partial \boldsymbol{\Sigma}_{\vartheta}}\) to the zero matrix \(\boldsymbol{0}\).

$$\begin{aligned} \frac{\partial \mathcal{F}_m}{\partial \boldsymbol{\Sigma}_{\vartheta}} = \frac{1}{2} \frac{\partial^2 \mathcal{L}(\boldsymbol{\mu}^*_{\vartheta})}{\partial \boldsymbol{\vartheta}\, \partial \boldsymbol{\vartheta}^T} + \frac{1}{2} \boldsymbol{\Sigma}_{\vartheta}^{-1} = \boldsymbol{0} \;\Longrightarrow\; \boldsymbol{\Sigma}^*_{\vartheta} = -\left( \frac{\partial^2 \mathcal{L}(\boldsymbol{\mu}^*_{\vartheta})}{\partial \boldsymbol{\vartheta}\, \partial \boldsymbol{\vartheta}^T} \right)^{-1} \end{aligned}$$

(C30)

At the optimal point \(\boldsymbol{\Sigma}^*_{\vartheta}\), the trace term of Eq. C29 reduces to \(\frac{1}{2} \mathrm{tr}(-I_{d_{\vartheta}}) = -\frac{d_{\vartheta}}{2}\), and the maximal value of \(\mathcal{F}_m\) is

$$\begin{aligned} \mathcal{F}^*_m &= \mathcal{L}(\boldsymbol{\mu}^*_{\vartheta}) - \frac{d_{\vartheta}}{2} + \frac{d_{\vartheta}}{2} \ln 2\pi e + \frac{1}{2} \ln \det(\boldsymbol{\Sigma}^*_{\vartheta}) \\ &= \mathcal{L}(\boldsymbol{\mu}^*_{\vartheta}) + \frac{d_{\vartheta}}{2} \ln 2\pi + \frac{1}{2} \ln \det(\boldsymbol{\Sigma}^*_{\vartheta}) \\ &\approx \max_{\boldsymbol{\vartheta}} \ln p(x_{1:K} \mid \boldsymbol{\vartheta}, m) = \ln p(x_{1:K} \mid \boldsymbol{\mu}^*_{\vartheta}, m). \end{aligned}$$

(C31)
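For a toy model in which the Laplace approximation is exact, the recipe of Eqs. C24, C30 and C31 can be checked against the analytically available evidence; the model below (a Gaussian prior with Gaussian observations) is my own example, not the model of the main text.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.5, size=8)          # K = 8 synthetic observations

# Toy model m: theta ~ N(0, 1), x_k | theta ~ N(theta, 1).
def log_joint(theta):                    # L(theta) = ln p(x_{1:K}, theta | m)
    return (np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (x - theta) ** 2)
            - 0.5 * np.log(2 * np.pi) - 0.5 * theta ** 2)

K = len(x)
mu_star = x.sum() / (K + 1)              # MAP estimate, Eq. (C24)
hessian = -(K + 1)                       # d^2 L / d theta^2 (constant here)
sigma_star = -1.0 / hessian              # Sigma* = -H^{-1}, Eq. (C30)

# Eq. (C31): F* = L(mu*) + (d/2) ln 2*pi + (1/2) ln det Sigma*, with d = 1
F_star = log_joint(mu_star) + 0.5 * np.log(2 * np.pi) + 0.5 * np.log(sigma_star)

# Exact log evidence: integrating theta out gives x ~ N(0, I + 11^T).
cov = np.eye(K) + np.ones((K, K))
exact = (-0.5 * K * np.log(2 * np.pi) - 0.5 * np.linalg.slogdet(cov)[1]
         - 0.5 * x @ np.linalg.solve(cov, x))

print(F_star, exact)                     # identical for this Gaussian toy model
```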

Appendix D: Bayesian model selection

Grounded in probability theory, Bayesian model selection evaluates different models against the observed data, favoring the model with the best tradeoff between accuracy and complexity. Given a series of observations \(x_{1:K}\), Bayesian model selection picks the optimal model \(m^*\) that best explains the observations

$$\begin{aligned} m^* = \arg\max_m p(m \mid x_{1:K}). \end{aligned}$$

(D32)

Considering two different models \(m_1\) and \(m_2\), we can define the Bayes factor as

$$\begin{aligned} p(m_2 \mid x_{1:K}) &= \frac{p(x_{1:K} \mid m_2)\, p(m_2)}{p(x_{1:K})} \\ p(m_1 \mid x_{1:K}) &= \frac{p(x_{1:K} \mid m_1)\, p(m_1)}{p(x_{1:K})} \\ \frac{p(m_1 \mid x_{1:K})}{p(m_2 \mid x_{1:K})} &= BF\, \frac{p(m_1)}{p(m_2)} \\ BF &= \frac{p(x_{1:K} \mid m_1)}{p(x_{1:K} \mid m_2)}, \end{aligned}$$

(D33)

where \(p(m_i)\) is the prior probability of model \(m_i\). Here we make the general assumption that the model priors are non-informative; under this assumption the priors are uniform, so \(\frac{p(m_1)}{p(m_2)} = 1\) and the ratio of the posterior probabilities \(\frac{p(m_1 \mid x_{1:K})}{p(m_2 \mid x_{1:K})}\) is simply given by the Bayes factor.

The Bayesian model selection problem thus reduces to selecting the model with the larger model evidence \(p(x_{1:K} \mid m_i)\). In the Bayesian learning framework, the log-model evidence \(\ln p(x_{1:K} \mid m_i)\) can be approximated using the Bayesian Information Criterion (BIC) (Schwarz, 1978):

$$\begin{aligned} \ln p(x_{1:K} \mid m_i) &\approx \ln p(x_{1:K} \mid \boldsymbol{\mu}^*_{\vartheta_i}, m_i) - \frac{d_{\vartheta_i}}{2} \ln K \\ \Rightarrow \ln p(x_{1:K} \mid m_i) &\approx \mathcal{F}^*_{m_i} - \frac{d_{\vartheta_i}}{2} \ln K \end{aligned}$$

(D34)

where \(K\) is the number of observations in \(x_{1:K}\) and \(d_{\vartheta_i}\) is the number of free parameters estimated by model \(m_i\). By computing the negative free energies \(\mathcal{F}^*_{m_1}\) and \(\mathcal{F}^*_{m_2}\) of the two models, the Bayes factor is given by

$$\begin{aligned} BF(m_1, m_2) &= \frac{p(x_{1:K} \mid m_1)}{p(x_{1:K} \mid m_2)} \\ &= \exp\!\left( \ln p(x_{1:K} \mid m_1) - \ln p(x_{1:K} \mid m_2) \right) \\ &\approx \exp\!\left( \mathcal{F}^*_{m_1} - \mathcal{F}^*_{m_2} - \frac{d_{\vartheta_1} - d_{\vartheta_2}}{2} \ln K \right). \end{aligned}$$

(D35)
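Given the optimal negative free energies of two fitted models, the Bayes factor of Eq. D35 is straightforward to compute; the sketch below uses hypothetical numbers purely for illustration.

```python
import numpy as np

def log_bayes_factor(F1_star, F2_star, d1, d2, K):
    """ln BF(m1, m2) as in Eq. (D35), with BIC-style complexity penalties.

    F1_star, F2_star : optimal negative free energies of models m1 and m2
    d1, d2           : numbers of free parameters of m1 and m2
    K                : number of observations
    """
    return (F1_star - F2_star) - 0.5 * (d1 - d2) * np.log(K)

# Hypothetical values, for illustration only.
lnBF = log_bayes_factor(F1_star=-102.3, F2_star=-106.9, d1=3, d2=2, K=200)
print(np.exp(lnBF))   # BF > 1 favours m1; BF < 1 favours m2
```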

Table 4 Bayes factors and interpretations

For ease of interpretation, Harold Jeffreys proposed a scale for interpreting the Bayes factor (Table 4) (Jeffreys, 1998). If \(BF > 1\), model \(m_1\) is more strongly supported by the observed data; conversely, if \(0< BF <1\), model \(m_2\) is more strongly supported.
