Population coding of strategic variables during foraging in freely moving macaques

Monkeys (n = 2) were exposed to two concurrent reward sources on a variable interval (VI) schedule10. We made it costly for the animal to switch between reward sources by placing them 120 cm apart (Fig. 1a, left). Animals freely interacted with the task equipment, and we did not impose a trial structure or a narrow response window (Methods). A multi-electrode Utah array was chronically implanted in the dlPFC (Extended Data Fig. 1), and measured spiking activity was collected using a lightweight, energy-efficient wireless device (Fig. 1a, right and Fig. 1b)26. The experimental setup was designed for the effective transmission of a low-power electromagnetic signal (Methods)27,28.

Fig. 1: Foraging in freely moving monkeys while population activity in the prefrontal cortex is recorded wirelessly.

a, Left: schematic of experimental setup with two reward boxes, two buttons and an overhead camera. Right: the location of the Utah array in the dlPFC (area 46) and wireless transmitter. b, Press-averaged firing rates of 80 single and multi-units recorded simultaneously. c, An illustration of task dynamics with eight hypothetical presses (vertical lines) in the concurrent variable-interval foraging task. In this illustration, the monkey responds six times on box 1, then switches to box 2 and responds twice. Therefore, press 6 is considered a press with a switch choice. The first two rows show the independent telegraph processes determining the reward (rew.) availability at boxes 1 and 2. In the example shown, press numbers 2, 5, 7 and 8 were rewarded (third row, red). The time dependence of the probability of reward availability is shown in the fourth row (see d for a different representation). d, An alternative illustration to clarify the relationship between the probability (prob.) of reward availability (avail.) and the waiting time. The shaded area shows when a reward is available after each of the 20 presses on box 1 (some of them shown in c). The black trace associated with the y axis on the right shows the probability of reward availability (Methods and Extended Data Fig. 2a), which at each time is in fact the proportion of pink bars, out of 20 trials with pink bars. e, The spike train of one example neuron on the timescale of four consecutive presses showing a variety of events (top row). Event-locked average firing rates of the same neuron are shown in the bottom row for conditions with reward/no reward and stay/switch choices. For ease of visualization, this example used a neuron with a relatively low firing rate compared to others in the population (compared with b).

Rewards on both sides (box 1 and box 2) became available at exponentially distributed random times after the animal obtained a previous reward. The reward availability was hidden from the monkey. Once a reward became available, it remained available until the animal pressed a button, at which time the reward was delivered (Fig. 1c). The distribution of waiting times before a reward became available could have different mean times or ‘schedules’ for each side (that is, constant hazard rates; Fig. 1c). Schedules were chosen from 10, 15, 25 or 30 s and were constant for a block of rewards. Multiple schedules allowed us to diversify the response dynamics of the animals10. Each experimental session contained two to four blocks with 34 or 66 rewards in each block. Given the constant hazard rate and the fact that rewards never disappeared once available, the probability of reward availability increased exponentially toward 1 with the time elapsed since the last press (the waiting time), with a time constant given by the reward schedule (Methods, Fig. 1d and Extended Data Fig. 2). Since the monkey chose when to respond, its decisions influenced the probability of reward availability (Fig. 1c). An ideal observer that did not know the schedule or availability should track the time and reward histories, so we hypothesized that animals attempt to maximize their reward by tracking these quantities, referred to as the reward predictors, to determine when and where to respond.
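To make these reward dynamics concrete, here is a minimal simulation sketch. The function names, parameters and seed are illustrative, not from the study's code; the only assumptions taken from the text are that availability times are exponential with the schedule's mean (constant hazard) and that an armed reward persists until collected.

```python
import math
import random

def p_reward_available(wait_s, schedule_s):
    """Probability that a reward is available after waiting `wait_s`
    seconds since the last reward, for an exponential schedule with
    mean `schedule_s`. Because rewards persist once armed, this grows
    toward 1 as 1 - exp(-t/tau)."""
    return 1.0 - math.exp(-wait_s / schedule_s)

def simulate_presses(schedule_s, waits, seed=0):
    """Simulate one box: the animal presses after each waiting time in
    `waits`; a press collects the reward only if it has already armed.
    Unrewarded presses do not reset the availability clock."""
    rng = random.Random(seed)
    t_since_reward = 0.0
    t_arm = rng.expovariate(1.0 / schedule_s)  # time until reward arms
    outcomes = []
    for w in waits:
        t_since_reward += w
        if t_since_reward >= t_arm:      # reward was armed: deliver it
            outcomes.append(True)
            t_since_reward = 0.0
            t_arm = rng.expovariate(1.0 / schedule_s)
        else:
            outcomes.append(False)
    return outcomes
```

With very long waits every press is rewarded and the availability probability saturates near 1, consistent with the exponential approach shown in Fig. 1d.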

We examined whether the firing rates of neurons in dlPFC represent the reward predictors as they are continuously evolving in time (monkey G: 11 sessions and monkey T: 19 sessions; n = 1,323 single and multi-units). Additionally, we extracted the neurons’ press-locked events, that is, firing rates a few seconds before and after each press (Fig. 1b). The continuous-time neural activity allowed us to understand how continuous representations of task variables in dlPFC lead to the animal’s choice to press. The press-locked neural activity explained how the state of these representations before a press, combined with the new information carried by the reward outcome, predicts where and when the animals press next. The continuous spike raster and the press-locked firing rate of a sample neuron (Fig. 1e) are shown for four consecutive box presses with different reward/choice outcomes: an unrewarded press followed by a switch to the other box, an unrewarded press when the animal stayed at the same box and two rewarded presses when the animal stayed at the same box. The fourth outcome, switching to the other box after a rewarded press, accounted for only 2% of the presses, so we do not show it in this example.

Here we explain how we identified reward predictors, variables that the animal can either observe or control and that it could potentially use to estimate the chances of reward. We then determined whether these variables empirically predicted the next reward outcome in our experiment. As the stochastic rewards do not always match the prediction, we examined the consequences of prediction violations on animals’ choice of box and time of press. Next, we used canonical correlation analysis (CCA) to identify the neural representation of these variables in the population of recorded neurons in dlPFC. Finally, we tested whether these representations predicted the animal’s choices in advance.

Predictors of the next reward

According to the marginal value theorem of foraging theory1, an animal could optimize its reward while minimizing travel costs by estimating the box schedules, tracking the temporal evolution of the probability of reward availability and using them to choose when and where to search for reward. Although the probability of the reward availability is the best predictor of the randomly generated reward, it was completely unobservable to the animals in our experiment. However, other predictive variables were observable or controllable by the animals, such as the waiting time between the presses or the reward ratio, defined as the proportion of the current option’s recently delivered reward compared to all recently delivered rewards from either box. The recent history was defined by applying a causal half-Gaussian filter to the binary sequence of delivered (1) or denied (0) rewards7,8. The waiting time, together with the scheduled reward rate, determines the probability of reward availability (Methods and Extended Data Fig. 2a). The reward ratio, when tracked on a timescale relevant to the volatility of the environment7, is a proxy for the scheduled reward ratio, defined as the ratio of the scheduled reward rate on the current box and the sum of the scheduled reward rates of the two boxes.
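The causal half-Gaussian filtering of the reward history can be sketched as follows. The function names and the default kernel width are hypothetical placeholders; in the study, the width was tuned against the scheduled reward ratio (Extended Data Fig. 2b).

```python
import math

def half_gaussian_weights(n, sigma):
    """Causal half-Gaussian kernel over the last `n` rewards: the most
    recent outcome (k = 0) gets the largest weight, older outcomes
    decay with width `sigma` (in presses). Weights sum to 1."""
    w = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(n)]
    s = sum(w)
    return [x / s for x in w]

def reward_ratio(box1_rewards, box2_rewards, sigma=5.0):
    """Filtered proportion of recent rewards that came from box 1.
    Inputs are 0/1 sequences, oldest first. Returns 0.5 when no
    recent reward was delivered on either box."""
    n = len(box1_rewards)
    w = half_gaussian_weights(n, sigma)
    r1 = sum(wi * r for wi, r in zip(w, reversed(box1_rewards)))
    r2 = sum(wi * r for wi, r in zip(w, reversed(box2_rewards)))
    return r1 / (r1 + r2) if (r1 + r2) > 0 else 0.5
```

Because the kernel is causal, the estimate depends only on past outcomes, and a smaller `sigma` tracks a more volatile environment.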

As the scheduled reward ratio changes without warning from block to block, we maximized the correlation of the scheduled reward ratio with the animal’s observed reward ratio by tuning the width of the causal half-Gaussian filter mentioned above (Extended Data Fig. 2b). We assessed how well each variable predicted the reward by correlating the rewarded fraction of presses with that variable before each press. Specifically, we pooled 8,862 behavioral presses from 30 sessions of two monkeys, binned them according to each hidden or observable/controllable variable so that there were 50 presses in each bin, calculated the fraction of rewarded presses within each bin (Fig. 2a), and computed the Pearson correlation between the binned variable and rewarded fraction of presses. Naturally, the probability of reward availability was highly correlated (r = 0.93; Fig. 2a) with the rewarded fraction of presses. The scheduled reward rate was correlated with the fraction of rewarded presses as well (r = 0.43; Fig. 2a). This correlation is weaker than the correlation of the waiting time with the fraction of rewarded presses (r = 0.92; Fig. 2a) because the probability of reward availability is determined by both waiting time and the scheduled reward rate, and the animals choose a wide range of the waiting times, diluting the prediction of the scheduled reward rate alone.
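The binning analysis described above can be sketched in a few lines. This is a stdlib-only illustration of the stated procedure (sort presses by a variable, group into 50-press bins, correlate bin means with the rewarded fraction), not the authors' code.

```python
def binned_reward_correlation(variable, rewarded, bin_size=50):
    """Sort presses by `variable`, group into bins of `bin_size`
    presses, and return the Pearson correlation between the bin-mean
    variable and the rewarded fraction of presses in each bin."""
    pairs = sorted(zip(variable, rewarded))
    xs, ys = [], []
    for i in range(0, len(pairs) - bin_size + 1, bin_size):
        chunk = pairs[i:i + bin_size]
        xs.append(sum(v for v, _ in chunk) / bin_size)
        ys.append(sum(r for _, r in chunk) / bin_size)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)  # assumes both bin series vary
```

A variable that monotonically predicts reward (like the waiting time in Fig. 2a) yields a correlation near 1 under this scheme.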

Fig. 2: Reward predictors, together with the reward outcome, determine the choices and the next waiting time.

a, The predictability of the next reward from experimental and behavioral variables: 8,862 presses from 30 sessions were pooled together and binned into 50 press bins according to each experimental variable. The rewarded fraction of presses was calculated in each bin, then the Pearson correlation coefficient was calculated across bins between the average of the experimental variable and the rewarded fraction of presses. b, A correlation matrix of the task variables in a. c, Histograms of the next waiting time for rewarded and unrewarded presses that were made after a short, medium or long wait, determined by equal intervals in the percentile of the presses. Inset: an increase in the probability of switching when not rewarded after a short, medium or long wait. The probability of switching after being rewarded was less than 2% and therefore excluded from this analysis. d, The same as c, but for reward ratio instead of waiting time.

Although the waiting time was highly predictive of the next reward, the reward ratio potentially plays an important role in the animal’s subjective reward expectation8. The reward ratio was not correlated with the fraction of rewarded presses (Fig. 2a, r = −0.012). However, it was positively correlated with the scheduled reward rate on the side that the animal pressed (r = 0.32), meaning that it might be used by the animals as an observable estimation of the hidden reward rates. Moreover, it was only weakly correlated with the log of waiting time (r = −0.14), meaning that it may be considered by the animals as a source of information, independent from the waiting time. We refer to the waiting time and the reward ratio as the reward predictors because they may be used by the animals to predict the reward, and therefore may play a role in determining the animals’ reward expectation (for the analysis of other observable reward predictors, see Extended Data Fig. 2c–e).

Do reward predictors determine ‘when’ and ‘where’ to press?

Although the subjective reward expectation is not directly measurable, we might infer changes in the animals’ reward expectation from the animals’ next choice, after a reward is delivered or denied. For example, an animal may realize that waiting longer increases its chances of receiving a reward, so we expect that an unrewarded press after a long wait might lead it to wait even longer between presses at the current box. Alternatively, the animal may realize that the waiting time for getting a reward at the current box is too long. Therefore, it may switch to the other box anticipating a better reward rate. We thus hypothesized that the animals’ decision on where and when to press depends upon the reward predictors, as the basis of animals’ reward expectation. We evaluated this hypothesis by analyzing the effect of such reward predictors on the probability distribution of the next waiting time and the probability of switching. These events were grouped depending on whether presses were rewarded and occurred after a short (3–5 s), medium (5–8 s) or long (8–60 s) wait (Fig. 2c and Extended Data Fig. 3, separated for monkeys). An unrewarded press increased the next wait by 10% (area under the receiver operating characteristic curve (AUC) of 0.53 ± 0.03) after a short wait, by 28% (AUC of 0.59 ± 0.02) after a medium wait and by 42% (AUC of 0.59 ± 0.02) after a long wait, each compared with the corresponding average waiting times for rewarded presses. Moreover, the probability of switching to the other box increased with the duration of unrewarded waits (9.5%, 10.2% and 16.5% more switches after a short, medium and long waiting time; Fig. 2c, insets). 
These choice differences (to continue pressing the button for the same box or switch to the other box) and the next waiting time when choosing to press on the same box, demonstrate that animals base their expectation of reward on their waiting time and adjust their behavior by waiting longer before the next press or switching to the other box when this expectation is not met. While previous studies point to melioration, that is, following the current flow of reward delivery9, we provide evidence of more temporally structured computations: the animals predict the chance of the next reward as they choose how long to wait before making the next press and adjust the waiting time when their expectation is not met. A key to this finding was a trial-free task, allowing animals to experience a wide range of waiting times and spontaneously discover that longer intervals yielded a higher chance of receiving a reward.
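The AUC values quoted above measure how well the next waiting time discriminates unrewarded from rewarded presses. A minimal rank-based sketch (equivalent to the normalized Mann–Whitney U statistic):

```python
def auc(pos, neg):
    """Area under the ROC curve: the probability that a random value
    drawn from `pos` exceeds a random value drawn from `neg`, with
    ties counted as 0.5. An AUC of 0.5 means the two distributions
    are indistinguishable."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In the analysis above, the next waits following unrewarded presses and those following rewarded presses would be compared in this way, one AUC per short/medium/long grouping.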

The animals might also develop expectations about the quality of the current box from the reward ratio. Again, we can infer these expectations indirectly through changes in the next waiting time and choices. After unrewarded presses, animals waited longer and switched more, with the smallest changes for the largest reward ratios (Fig. 2d; 23%, 18% and 15% longer unrewarded waits and 19%, 12% and 3% switches after a low, medium and high reward ratio, respectively). This suggests that animals require stronger evidence to override a better reward history.

Altogether, this provides evidence that an animal’s policy on when and where to press depends on whether the box delivers a reward, as expected after a long waiting time or a high reward ratio. We inferred that animals update their expectation when those expectations are violated by the lack of an expected reward. This policy is a case of ‘learning a guess from a guess’29, which is useful in the absence of sensory evidence directly cueing the probability or availability of reward. To provide further evidence that the waiting time and reward ratio underlie animals’ reward expectation, we examined their encoding in the recorded neural population.

Task-relevant activity in dlPFC

Before a motor action, the activity of neurons in the dlPFC is correlated with the value of a visually cued expected reward20 or the probability of reward, estimated by the recent history of reward delivery13. Therefore, we hypothesized that the activity of dlPFC neurons, before each press, encodes the reward expectation for that press, for the range of the reward predictor variables observed or generated by each animal. For example, the neuron in Fig. 3a, left, activates more before a press following a long wait (top 20% of waiting times in that session) compared with a short waiting time (bottom 20%; Wilcoxon rank-sum test, P ≪ 10−3). Similarly, the neuron in Fig. 3a, right, activates more when the reward ratio before a press is in the bottom 20% compared with when it was in the top 20% (Wilcoxon rank-sum test, P ≪ 10−3).

Fig. 3: Neuronal populations encode variables of the reward dynamics.

a, Sample neurons for which the pre-press firing rate covaries with either waiting time (left) or reward ratio (right). The firing rate was calculated for a 200 ms sliding window starting 2 s before and ending 1 s after presses. Firing rates were averaged across presses with low (<20th percentile, gray) and high (>80th percentile, colored) values of either the waiting time or the reward ratio. Data are presented as mean values ± s.e.m. b, Decoded and measured waiting time (left) and reward ratio (right) for two sample sessions. Forty-five neurons in the session on the left and 60 neurons in the session on the right were used. The shown value of Pearson correlation coefficient is the average value across the cross-validated sets in each session. c, Decoding the waiting time (top) or the reward ratio (bottom) for 30 sessions as a function of the number of neurons used as predictors. The predictor neurons were chosen randomly from the population. Data are presented as mean values ± s.e.m. across 20 randomly selected subsets of units. Sessions of monkey G are shown with a darker color, and sessions of monkey T are shown with a lighter color.

As task-irrelevant variables such as locomotion, limb and eye movement and pupil size before or after presses may influence dlPFC activity30, we performed control experiments to quantify the correlation between task-irrelevant variables and neural activity. First, our control experiments in which animals moved to receive reward from the same boxes as in Fig. 1a revealed that eye movements have only a minor influence on neuronal activity while animals interacted with the box, although they have a stronger influence during locomotion (r = 0.16, t-test, P ≪ 10−3, for eye velocity, and r = 0.13, t-test, P ≪ 10−3, for fixation rate27). We thus decorrelated the neural activity from the locomotion by projecting neural activity onto the subspace orthogonal to locomotion (Methods) such that the remaining neural activity was uncorrelated with locomotion (Extended Data Fig. 4). Second, one animal performed the same task as presented here, while its arm movements, pupil diameter and eye velocity were monitored using the same eye tracking method as in ref. 27. We found ≤9% of neurons in dlPFC with significant (P < 0.01) correlation with the arm movement (Extended Data Fig. 5) in 1 s time intervals starting 2 s before and ending 2 s after presses. Pupil diameter was correlated with ≤10% of neurons. However, after we decorrelated the neural activity from the locomotion, the percentage of neurons with a significant correlation with the pupil diameter dropped to ≤7%. Similarly, the percentage of neurons with a significant correlation with the eye velocity dropped from ≤9% to ≤4%. As decorrelating the neural activity from locomotion also decreases the correlation between the neural activity and other task-irrelevant variables, we focused our analysis for the rest of this study on the neural activity that was decorrelated from the locomotion.
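The decorrelation step, projecting neural activity onto the subspace orthogonal to locomotion, can be sketched as a least-squares regression followed by subtraction of the fitted component. The toy data and the single speed regressor below are hypothetical; the paper's exact locomotion regressors are described in its Methods.

```python
import numpy as np

def remove_locomotion(neural, locomotion):
    """Regress each unit's activity (time x neurons) on the locomotion
    regressor(s) (time,) or (time, k) plus an intercept, and return the
    residual, which has zero correlation with locomotion at zero lag."""
    X = np.column_stack([np.ones(len(locomotion)), locomotion])
    beta, *_ = np.linalg.lstsq(X, neural, rcond=None)
    return neural - X @ beta

# Toy demonstration with hypothetical data: units "ride on" locomotion
rng = np.random.default_rng(1)
loc = rng.normal(size=200)                               # locomotion speed
units = rng.normal(size=(200, 8)) + 0.5 * loc[:, None]   # contaminated units
resid = remove_locomotion(units, loc)
```

Because the residual is orthogonal to both the intercept and the locomotion regressor, it is zero-mean and uncorrelated with locomotion, matching the property stated in the text (Extended Data Fig. 4).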

Decoding reward predictors from the neural population

Since the waiting time influences both future behavior and the reward probability when the button is pressed, we examined how the neural activity encodes waiting time just before a button press. We measured the spike counts in a 1 s interval (that is, a ‘pre-press’ interval from −1.1 to −0.1 s) for each neuron (n = 1,323 single and multi-units). This time interval was selected since the arm movement starts approximately 0.5 s before the press is recorded, and the modulation of neural activity typically starts around 0.5 s before that movement31. The pre-press firing rate of the neuron in Fig. 3a, left, was correlated with the waiting time (Spearman correlation coefficient of 0.24; t-test, P ≪ 10−3; Fig. 3b, left). For the entire population of cells, around 35% of neurons exhibited a significant Spearman correlation (t-test, P < 0.01; 31% positively correlated and 4% negatively correlated; monkey G: 27%, and monkey T: 37%).

To further examine how information about the waiting time is distributed across neurons, we decoded the waiting time from population activity before each press using the spike counts of randomly subsampled sets of neurons (for a description of the regression-based decoder analysis, see Methods). Our decoder analysis revealed that even random neural subpopulations encode the waiting time (Fig. 3c; Wilcoxon rank-sum test with false discovery rate, with multiple comparison correction (WRFDR), P ≤ 0.01).
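A regression-based decoder over random subsets of units can be sketched as below, using a simple train/test split and ordinary least squares. The synthetic data and parameter choices are purely illustrative; the paper's decoder details are in its Methods.

```python
import numpy as np

def decode_from_subset(spikes, target, n_units, n_repeats=20, seed=0):
    """Linear decode of a behavioral variable from random subsets of
    units: fit on the first half of presses, score on the second half.
    Returns the mean held-out Pearson r across `n_repeats` subsets."""
    rng = np.random.default_rng(seed)
    n_press, n_all = spikes.shape
    half = n_press // 2
    rs = []
    for _ in range(n_repeats):
        units = rng.choice(n_all, size=n_units, replace=False)
        X = np.column_stack([np.ones(n_press), spikes[:, units]])
        beta, *_ = np.linalg.lstsq(X[:half], target[:half], rcond=None)
        pred = X[half:] @ beta
        rs.append(np.corrcoef(pred, target[half:])[0, 1])
    return float(np.mean(rs))

# Hypothetical data: waiting time linearly encoded across 30 units
rng = np.random.default_rng(2)
spikes = rng.normal(size=(400, 30))
wait = spikes @ rng.normal(size=30) + 0.1 * rng.normal(size=400)
r_small = decode_from_subset(spikes, wait, 5)
r_large = decode_from_subset(spikes, wait, 25)
```

As in Fig. 3c, decoding performance should grow with the number of randomly chosen predictor units when the variable is distributed across the population.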

Furthermore, consistent with previous reports8,13,21,32, we found that dlPFC neurons encode the reward ratio. Over the entire population, there was a significant correlation between the pre-press firing rate and reward ratio (t-test, P < 0.01) for 23% of the neurons (9% positively correlated and 14% negatively correlated; monkey G: 12%, and monkey T: 26%). Decoder performance for the reward ratio was higher than chance (WRFDR, P < 0.01) when we used a subpopulation of one or more neurons as the predictors. Taken together, these results indicate that both reward predictors are encoded in the pre-press neural activity at the individual neuron and population levels. This finding provides further evidence that the animals’ reward expectation is founded on the chosen reward predictors.

Identifying continuous task variables in a latent space

Unlike the reward ratio, which jumps discretely at press times, the waiting time evolves continuously, and we aimed to gauge its explanatory power for the continuously evolving neural activity. We attempted to fit the variability in the neural population using a weighted sum of task-related variables and basis functions33,34,35. Some of these variables were event based (presses, reward delivery and choice to stay or switch location), while others evolved continuously (waiting time, reward ratio and location within the cage). For event-based task variables, each event raster was filtered with a 200 ms boxcar and then shifted to a variety of offsets36 (Fig. 3e). For continuously evolving task variables, we used monomial basis functions with powers of 0.5, 1, 2, 3 and 5 (Fig. 4a). Neural activity was smoothed by a 1 s sliding window. To concentrate our analysis on times when animals were engaged in the task, we excluded time bins preceding or following any presses by more than 5 s.
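The design matrix described here, event rasters passed through a boxcar filter and temporal shifts, plus monomial basis functions for continuous variables, can be sketched as follows. The sizes, the shift range and the wrap-around behavior of `np.roll` are simplifications for illustration only.

```python
import numpy as np

def boxcar_shifted(events, width, shifts):
    """Binary event raster -> boxcar-smoothed copies at several
    temporal offsets (one predictor column per offset)."""
    kernel = np.ones(width) / width
    smooth = np.convolve(events, kernel, mode="same")
    # np.roll wraps around the edges; a real analysis would zero-pad
    return np.column_stack([np.roll(smooth, s) for s in shifts])

def monomial_basis(x, powers=(0.5, 1, 2, 3, 5)):
    """Continuous variable -> instantaneous power functions (x >= 0)."""
    return np.column_stack([np.power(x, p) for p in powers])

# Hypothetical sizes: 1,000 time bins, sparse presses, a sawtooth wait
T = 1000
press = (np.random.default_rng(0).random(T) < 0.01).astype(float)
wait = np.arange(T) % 100 / 10.0
design = np.hstack([boxcar_shifted(press, 20, range(-5, 6)),
                    monomial_basis(wait)])
```

Stacking several event variables and all continuous variables this way yields a wide predictor matrix like the 51-dimensional task space described in Fig. 4a.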

Fig. 4: Canonical components of the neural population represent task variables in continuous time.

a, An illustration of the CCA for finding a reduced-dimensional space in the task space and the corresponding subspace in the neural activity space. The canonical components define the maximally correlated subspaces between the task variables and the neural (neur) activity. The 51-dimensional task space was made from 6 task variables by passing each variable through a set of basis functions (Methods). In brief, the basis functions were pulse-shaped temporal delay filters for press, reward and choice events. For continuously evolving task variables (the waiting time, the reward ratio and two-dimensional location), the basis functions were a set of instantaneous power functions. Overall, 51 predictors were made using these six task variables. The neural space was made using all simultaneously recorded neurons. b, Left: the weight of the contribution of each task variable in the first ten canonical components, sorted in the descending order of the correlation between the projection of each component in the task and neural spaces. The indices of the components representing waiting time, reward ratio, reward and choice are color coded for easier association. Right: neural representation of four task variables: reward, choice, waiting time and reward ratio for the same sample session on the left. The component that was associated with each of these four task variables was identified as the component for which the absolute value of the weights was highest, compared with the weights for the other task variables. c, Cross-validated Pearson correlation coefficient between the reward predictors, waiting time (left) and reward ratio (right; *P < 0.005), with either individual neurons or clusters of five or more neurons (Extended Data Fig. 6) that are maximally correlated with each reward predictor, compared with the correlation coefficient between the reward predictors and the canonical components (WRFDR; left: P ≪ 10−3 for clusters and 0.003 for neurons; right: P > 0.1). 
Each data point is associated with one session (n = 30).

To identify the latent representation of task variables in the neural space, we used CCA to find components that are shared between the task and the neural spaces. CCA finds these canonical components by applying singular value decomposition to the cross-correlation matrix between two spaces37. To favor interpretable latent components such that each component is associated with a small subset of the task variables, we imposed a sparsification penalty (least absolute shrinkage and selection operator with fullness constant of 0.3 on the weights of the task variables37). This regularization helps reduce overfitting the model. We calculated ten components for each training set (Fig. 4b), and then identified neural components making the greatest contributions to rewards, choices, waiting time and reward ratio. Interestingly, the waiting time neural component ramps up between the consecutive presses (Fig. 4b, third row), suggesting that the latent representation of the waiting time might be used by the brain to generate the next press, in a similar fashion to the evidence accumulation models proposed in decision making38. The reward ratio component followed the difference between the reward ratio of the boxes (Fig. 4b, fourth row). The reward and choice components showed sharp post-press elevated activity (Fig. 4b, first and second rows).
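The core of the (unsparsified) CCA step, singular value decomposition applied to the cross-correlation of the two whitened spaces, can be sketched as below. This omits the lasso sparsification and cross-validation described in the text, and the toy data sharing a single latent dimension are hypothetical.

```python
import numpy as np

def cca_components(X, Y, n_comp=10):
    """Canonical correlation analysis via SVD. X: time x task
    predictors, Y: time x neurons. Returns the canonical variates in
    each space (time x n_comp, unit norm) and the canonical
    correlations, in descending order."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    Ux = np.linalg.svd(X, full_matrices=False)[0]   # whitened task basis
    Uy = np.linalg.svd(Y, full_matrices=False)[0]   # whitened neural basis
    U, S, Vt = np.linalg.svd(Ux.T @ Uy)             # cross-correlation SVD
    return Ux @ U[:, :n_comp], Uy @ Vt.T[:, :n_comp], S[:n_comp]

# Toy data: task space X and neural space Y share one latent dimension
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))
Y = rng.normal(size=(300, 6))
Y[:, 0] = X[:, 1] + 0.05 * rng.normal(size=300)
Zx, Zy, corrs = cca_components(X, Y, n_comp=3)
```

The leading canonical correlation recovers the shared dimension; sorting components by this correlation mirrors the ordering used in Fig. 4b.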

We asked whether fitting a model to reconstruct the activity of individually recorded neurons34 or sites33, then clustering the neurons based on the similarity between the reconstructed activity (Extended Data Fig. 6) yields a better representation than the latent variables that we found using the CCA. We calculated the Pearson correlation coefficient between the reward predictors and their associated canonical components and compared them with the correlation between the reward predictors and the neuronal clusters or individual neurons in each session that were maximally correlated with each reward predictor. The average correlation coefficient between the waiting time and the neural components was higher than that with the individual neurons or neural clusters of >5 neurons (WRFDR, P < 0.005; Fig. 4c, left). The correlation between the reward ratio and the neural components was the same as that with the individual neurons or clusters (WRFDR, P > 0.1; Fig. 4c, right). This indicates that the latent neural components provide better correlates of reward predictors relative to individual neurons or the average activity of groups of neurons that were clustered together based on their task-relevant activity. Furthermore, the latent neural components were uncontaminated by movement-related confounds (Extended Data Fig. 7).

Predicting reward, choice and the next waiting time

Since the animal cannot know the true hidden reward dynamics, its choices can only be driven by its subjective beliefs about these variables, rather than the objective truth from the experiment. For instance, if the monkey overestimates reward probability (perhaps due to misjudging waiting time or scheduled reward rate), it is more likely to switch boxes after an unrewarded press. We predicted switching based on neuronal components corresponding to task variables, interpreting them as current estimates of the animal’s subjective beliefs. We decoded the pre-press neural activity by projecting the population activity onto the subspace formed by the first ten canonical components for the reward predictors. This projection accounts for latent representation of reward predictors that could potentially influence the choices or the next waiting time or predict the eventual reward outcome.

We attempted to predict rewards, choices and the next waiting times from three distinct types of predictors: (1) the pre-press reward predictors (canonical components in the reward predictors’ space), (2) neural representations of the reward predictors (canonical components in the neural space) and (3) the entire simultaneously recorded neural population (Fig. 5a). For a fair comparison between the components and the entire neural population, we did not sparsify the weights of task variables in canonical components. To predict the reward, we trained binomial logistic regressions on the same data used to find the canonical components, then tested on the held-out data. To assess the prediction performance, we calculated the AUC showing the discriminability of the predictors’ output for the rewarded presses from the unrewarded presses. The same method was used for the choice to stay or switch. To predict the next waiting time, we used generalized linear models instead of logistic regression and evaluated the performance by calculating the Pearson correlation coefficient between the real and the predicted values. All predictors were trained and tested for each 200 ms time bin, starting 3 s before each press and ending 1 s after.
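The prediction pipeline, a binomial logistic regression trained per time bin and scored by held-out AUC, can be sketched as below. The gradient-descent fitter and the synthetic data are illustrative stand-ins, not the study's fitting routines.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, n_iter=500):
    """Minimal binomial logistic regression fit by gradient ascent on
    the log-likelihood (a stand-in for a standard fitting routine)."""
    Xb = np.column_stack([np.ones(len(X)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w += lr * Xb.T @ (y - p) / len(y)
    return w

def predict_auc(Xtr, ytr, Xte, yte):
    """Train on one split, score held-out discriminability as the AUC
    of the decision scores for class 1 versus class 0."""
    w = fit_logistic(Xtr, ytr)
    s = np.column_stack([np.ones(len(Xte)), Xte]) @ w
    pos, neg = s[yte == 1], s[yte == 0]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

# Hypothetical pre-press features predicting a binary stay/switch choice
rng = np.random.default_rng(4)
feats = rng.normal(size=(300, 3))
stay = (feats[:, 0] - 0.5 * feats[:, 1] + 0.3 * rng.normal(size=300) > 0).astype(float)
auc_held = predict_auc(feats[:150], stay[:150], feats[150:], stay[150:])
```

Repeating this fit in each 200 ms bin, with task components, neural components or the full population as `feats`, yields prediction-versus-time curves like those in Fig. 5b–d.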

Fig. 5: Neural population analysis.

a, Prediction of rewards, choice or next waiting time from task components, neural components and the entire simultaneously recorded population. The task components are defined as the projection of the canonical component in the ten-dimensional space of waiting time and reward rate (each passed through five basis functions). The neural components are defined as the projection of the canonical component into the neural population space. The prediction was trained and tested for each 200 ms time bin, starting 3 s before and ending 1 s after the presses. b–d, The prediction results for rewards (b), choice (c) or next waiting time (d) for example sessions (left) or summarized across all sessions for the peak within the 2 s time interval before the presses (right). Left: the prediction performances using the post-press components or population activity are shown for comparison. Data are presented as mean values ± s.e.m. Right: the peak was calculated as the average of five time bins with the highest prediction performances. The prediction performance for the pre-press time bins provide evidence that the post-press choices or the rewards were in fact predictable, even before the presses were made.

In the example session shown in Fig. 5b, left, the prediction of the reward outcomes using the task components improved as the analysis windows approached the time of the press. The reward outcomes are determined by the actual experimental task variables, and indeed we confirmed that the true pre-press task components (the projection of the canonical component in the space of reward predictors) predict actual rewards better than either their neural representations (the projection of the canonical components on the neural population space) or the entire neural population (Fig. 5b, right, and for monkey-separated results, see Extended Data Fig. 8).

In contrast, the choices and the next waiting time should follow the animal’s subjective estimation of the reward dynamic variables. Fascinatingly, the neural activity before a press predicted the subsequent choice (Fig. 5c) and waiting time (Fig. 5d). As the animal’s movement to switch to the other box or press the button again occurred after the current press (Extended Data Fig. 9), the prediction of either of these actions by the neural components precedes the execution of the predicted actions. Moreover, the head, arm and eye movements within the pre-press time window were not significantly different between presses after which the animal stayed and the presses after which the animal switched to the other box (Extended Data Fig. 5c; P > 0.12 for all the comparisons, Wilcoxon rank-sum test). Therefore, we provide further evidence that the animals construct an expectation of reward before a press, based on their subjective understanding of the temporal structure of the task. Subsequently, animals decide when and where to press next based on the expected reward and the actually observed reward. Interestingly, the ten-dimensional neural representation of the pre-press task components predicted the choice and the next waiting time as well as the entire neural population (Fig. 5c,d), indicating that these few canonical neural components successfully capture the relevant signals within the larger neural population space.

It might seem obvious that neural features should be better predictors of when and where to press than experimental variables; after all, the animal’s brain, not the experimental equipment, makes the choice. However, it is not evident a priori that the relevant neural representations would be found within our recorded dlPFC population, nor whether we recorded enough neurons to capture enough of the animal’s choice-relevant information. Furthermore, even if the dlPFC does contain the choice-relevant signals, it is not obvious that the neural components for our specific hypothesized reward predictors would be the right ones to predict the choices. It is thus noteworthy that these neurally decoded reward predictors predict choices significantly better than the task variables from which they are derived, and equally well as the full neural population. Evidently, our analysis identifies a neural subspace containing correlates of latent variables that are relevant for subsequent choices. This subspace also tends to avoid neural dimensions that contain choice-irrelevant variability, since if present, these variations could contribute to overfitting and would only hinder our ability to predict choice. We conclude that we are capturing neural correlates of the animals’ subjective beliefs about the latent reward dynamics that inform their choices.
