Natural language instructions induce compositional generalization in networks of neurons

Model architecture

Sensorimotor-RNN

The base model architecture and task structure used in this paper follow ref. 18. All networks of sensorimotor units, denoted sensorimotor-RNN, are gated recurrent units (GRU)49 using rectified linear unit (ReLU) nonlinearities with 256 hidden units each. Inputs to the networks consist of (1) sensory inputs, Xt, and (2) task-identifying information, It. We initialize hidden activity in the GRU as \(h_{0}\in {\mathbb{R}}^{256}\) with values set to 0.1. All networks of sensorimotor units use the same hidden state initialization, so we omit h0 in network equations. At each time step, a readout layer Linearout decodes motor activity, \(\hat{y}_{t}\), from the activity of recurrent hidden units, ht, according to:

$$\begin{array}{ll}h_{t}={\rm{GRU}}\big(X_{t},I_{t};h_{t-1}\big)\qquad & h_{t}\in {\mathbb{R}}^{256}\\ \hat{y}_{t}=\sigma \big({\rm{Linear}}_{{\rm{out}}}(h_{t})\big)\qquad & \hat{y}_{t}\in {\mathbb{R}}^{33}\end{array}$$

where σ denotes the sigmoid function. Sensory inputs Xt are made up of three channels: two sensory modalities, \(x_{{\rm{mod}}1,t}\) and \(x_{{\rm{mod}}2,t}\), and a fixation channel, \(x_{{\rm{fix}},t}\). Both \(x_{{\rm{mod}}1,t},x_{{\rm{mod}}2,t}\in {\mathbb{R}}^{32}\), and stimuli in these modalities are represented as hills of activity with peaks determined by units’ preferred directions around a one-dimensional circular variable. For an input at direction θ, the activity of a given input unit ui with preferred direction θi is

$$u_{i}=str\times 0.8\exp \left[-0.5\times {\left(\frac{8| \theta -\theta _{i}| }{2\pi }\right)}^{2}\right]$$

where str is the coefficient describing stimulus strength. The fixation channel \(x_{{\rm{fix}},t}\in {\mathbb{R}}^{1}\) is a single unit simulating a fixation cue for the network. In all, sensory input \(X_{t}=(x_{{\rm{mod}}1,t},x_{{\rm{mod}}2,t},x_{{\rm{fix}},t})\in {\mathbb{R}}^{65}\). Motor output, \(\hat{y}_{t}\), consists of both a 32-dimensional ring representing directional responses to the input stimulus and a single unit representing model fixation, so that \(\hat{y}_{t}\in {\mathbb{R}}^{33}\).
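The input encoding above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors' code; the tuning-curve width (the 8/2π factor) follows the reconstructed formula above, and the function names are ours.

```python
import numpy as np

N_UNITS = 32  # units per sensory ring

# Preferred directions are evenly spaced around the circle.
PREF_DIRS = np.linspace(0, 2 * np.pi, N_UNITS, endpoint=False)

def hill_of_activity(theta, strength):
    """Population activity for a stimulus at direction theta.

    Implements u_i = str * 0.8 * exp(-0.5 * (8|theta - theta_i| / 2pi)^2),
    using circular distance so the hill wraps around 0/2pi.
    """
    diff = np.abs(theta - PREF_DIRS)
    diff = np.minimum(diff, 2 * np.pi - diff)  # circular distance
    return strength * 0.8 * np.exp(-0.5 * (8 * diff / (2 * np.pi)) ** 2)

def sensory_input(theta_mod1=None, theta_mod2=None, str1=1.0, str2=1.0, fix=1.0):
    """Concatenate the two 32-unit sensory rings and the fixation unit (65 dims)."""
    mod1 = hill_of_activity(theta_mod1, str1) if theta_mod1 is not None else np.zeros(N_UNITS)
    mod2 = hill_of_activity(theta_mod2, str2) if theta_mod2 is not None else np.zeros(N_UNITS)
    return np.concatenate([mod1, mod2, [fix]])
```

A stimulus at direction θ produces a bump of activity peaking at the unit whose preferred direction is closest to θ, scaled by the stimulus strength.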

For all models, task-identifying information \(I_{t}\in {\mathbb{R}}^{64}\). Task-identifying information is presented throughout the duration of a trial and remains constant such that \(I_{t}=I_{t^{\prime} }\;\forall \,t,t^{\prime}\). For all models, task-identifying information It and sensory input Xt are concatenated as inputs to the sensorimotor-RNN.

Nonlinguistic models

For SIMPLENET, we generate a set of 64-dimensional orthogonal task rules by constructing an orthogonal matrix using the Python package scipy.stats.ortho_group, and assign rows of this matrix to each task type. For STRUCTURENET, we generate a set of ten orthogonal, 64-dimensional vectors in the same manner, and each of these represents a dimension of the task set (that is, respond weakest versus strongest direction, respond in the same versus opposite direction, pay attention only to stimuli in the first modality, and so on). Rule vectors for tasks are then simple combinations of each of these ten basis vectors. For a full description of structure rule vectors, see Supplementary Note 3.
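The rule-vector construction can be sketched as follows. This is a minimal sketch, not the authors' code: the choice of which rows serve as rules, and the "anti"/"mod1" labels on the combined example, are illustrative assumptions.

```python
import numpy as np
from scipy.stats import ortho_group

# 64x64 orthogonal matrix; each row can serve as a 64-dimensional rule vector.
Q = ortho_group.rvs(64, random_state=0)

simple_rules = Q          # SIMPLENET: one orthogonal row per task
structure_basis = Q[:10]  # STRUCTURENET: ten orthogonal basis vectors

# A structured rule is a simple combination of basis vectors, e.g. a task that
# combines two (hypothetically labelled) dimensions such as "anti" and "mod1":
rule = structure_basis[0] + structure_basis[2]
```

Because the rows of an orthogonal matrix are mutually orthogonal unit vectors, SIMPLENET rules carry no shared structure, whereas STRUCTURENET rules share components exactly when tasks share a task-set dimension.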

We also test SIMPLENETPLUS and STRUCTURENETPLUS, which use additional hidden layers with 128 units and ReLU nonlinearities to process orthogonal task rules It into a vector \(\bar{I}_{t}\), which is used by the sensorimotor-RNN as task-identifying information.

$$\begin{array}{ll}I_{t}^{(1)}={\rm{ReLU}}\big({\rm{Linear}}_{1}(I_{t})\big)\qquad & I_{t}^{(1)}\in {\mathbb{R}}^{128}\\ I_{t}^{(2)}={\rm{ReLU}}\big({\rm{Linear}}_{2}(I_{t}^{(1)})\big)\qquad & I_{t}^{(2)}\in {\mathbb{R}}^{128}\\ \bar{I}_{t}={\rm{ReLU}}\big({\rm{Linear}}_{3}(I_{t}^{(2)})\big)\qquad & \bar{I}_{t}\in {\mathbb{R}}^{64}\end{array}$$

Full results for these models are included in Supplementary Fig. 4.

Pretrained transformers

The main language models we test use pretrained transformer architectures to produce I. Importantly, transformers differ in the type of pretraining objective used to tune the model parameters. GPT is trained to predict the next word given a context of words9. GPT (XL) follows the same objective but trains for longer on a larger dataset50. Both models are fully autoregressive. BERT, by contrast, takes bidirectional language inputs and is tasked with predicting masked words that appear in the middle of input phrases. Additionally, BERT is trained on a simple sentence prediction task where the model must determine whether input sentence 1 is followed by input sentence 2 in the training corpus. Extending this principle, SBERT is explicitly trained to produce fixed-length embeddings of whole sentences21. It takes pretrained BERT networks and uses them in a siamese architecture51, which allows the weights of the model to be tuned in a supervised fashion according to the Stanford Natural Language Inference dataset22. Natural language inference is a three-way categorization task where the network must infer the logical relationship between sentences: whether a premise sentence implies, contradicts or is unrelated to a hypothesis sentence. Finally, CLIP is trained to jointly embed images and language23. It uses data from captioned images and is asked to properly categorize which text and image pairs are matched or mismatched in the dataset via a contrastive loss.

Importantly, the natural output of a transformer is a matrix of size \(d_{{\rm{trans}}}\times T_{{\rm{seq}}}\), the inherent dimensionality of the transformer by the length of the input sequence. To create an embedding space for sentences, it is standard practice to apply a pooling method to the transformer output, which produces a fixed-length representation for each instruction.
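The three pooling variants discussed below (mean, 'cls' and 'eos') can be sketched generically. This is an illustrative stand-in, not the authors' implementation: the token-matrix layout (special tokens in the first and last columns) and the random Linear_embed weights are assumptions.

```python
import numpy as np

def pool_transformer_output(H, method="mean"):
    """Pool a d_trans x (T_seq + 2) transformer output into one fixed-length vector.

    Column 0 is assumed to hold the 'cls' position and column -1 the 'eos'
    position (the special tokens added around the instruction).
    """
    if method == "mean":
        return H.mean(axis=1)  # average over the token axis
    if method == "cls":
        return H[:, 0]
    if method == "eos":
        return H[:, -1]
    raise ValueError(method)

rng = np.random.default_rng(0)
d_trans, n_tokens = 768, 12
H = rng.standard_normal((d_trans, n_tokens))
W_embed = rng.standard_normal((d_trans, 64)) / np.sqrt(d_trans)  # Linear_embed stand-in
I = pool_transformer_output(H, "mean") @ W_embed  # 64-dimensional task vector
```

Whichever pooling is chosen, the result is a single d_trans-dimensional vector that Linear_embed then projects to the 64-dimensional task space.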

For GPT, GPT (XL), BERT and SBERT, we use an average pooling method. Suppose we have an input instruction \(w_{1}\ldots w_{T_{{\rm{seq}}}}\). Following standard practice with pretrained language models, the input to our transformers is tokenized with special ‘cls’ and ‘eos’ tokens at the beginning and end of the input sequence. We then compute I as follows:

$$\begin{array}{ll}H^{{\rm{trans}}}={\rm{Transformer}}\big({\rm{cls}},w_{1}\ldots w_{T_{{\rm{seq}}}},{\rm{eos}}\big)\qquad & H^{{\rm{trans}}}\in {\mathbb{R}}^{d_{{\rm{trans}}}\times (T_{{\rm{seq}}}+2)}\\ h^{{\rm{avg}}}={\rm{mean}}(H^{{\rm{trans}}})\qquad & h^{{\rm{avg}}}\in {\mathbb{R}}^{d_{{\rm{trans}}}}\\ I={\rm{Linear}}_{{\rm{embed}}}(h^{{\rm{avg}}})\qquad & I\in {\mathbb{R}}^{64}\end{array}$$

We chose this average pooling method primarily because a previous study21 found that this resulted in the highest-performing SBERT embeddings. Another alternative would be to simply use the final hidden representation of the ‘cls’ token as a summary of the information in the entire sequence (given that BERT architectures are bidirectional, this token will have access to the whole sequence).

$$\begin{array}{ll}H^{{\rm{trans}}}={\rm{Transformer}}\big({\rm{cls}},w_{1}\ldots w_{T_{{\rm{seq}}}},{\rm{eos}}\big)\qquad & H^{{\rm{trans}}}\in {\mathbb{R}}^{d_{{\rm{trans}}}\times (T_{{\rm{seq}}}+2)}\\ h^{{\rm{cls}}}=H_{{\rm{cls}}}^{{\rm{trans}}}\qquad & h^{{\rm{cls}}}\in {\mathbb{R}}^{d_{{\rm{trans}}}}\end{array}$$

where \(H_{{\rm{cls}}}^{{\rm{trans}}}\) denotes the last hidden representation for the ‘cls’ token. Ref. 21 found that this pooling method performed worse than average pooling, so we do not include this alternative in our results. For GPT and GPT (XL), we also tested a pooling method where the fixed-length representation for a sequence was taken from the transformer output of the ‘eos’ token. In this case:

$$\begin{array}{ll}H^{{\rm{trans}}}={\rm{Transformer}}\big({\rm{cls}},w_{1}\ldots w_{T_{{\rm{seq}}}},{\rm{eos}}\big)\qquad & H^{{\rm{trans}}}\in {\mathbb{R}}^{d_{{\rm{trans}}}\times (T_{{\rm{seq}}}+2)}\\ h^{{\rm{eos}}}=H_{{\rm{eos}}}^{{\rm{trans}}}\qquad & h^{{\rm{eos}}}\in {\mathbb{R}}^{d_{{\rm{trans}}}}\\ I={\rm{Linear}}_{{\rm{embed}}}(h^{{\rm{eos}}})\qquad & I\in {\mathbb{R}}^{64}\end{array}$$

We found that GPT failed to achieve even a relaxed performance criterion of 85% across tasks using this pooling method, and GPT (XL) performed worse than with average pooling, so we omitted these models from the main results (Supplementary Fig. 11). For CLIP models, we use the same pooling method as in the original multimodal training procedure, which takes the outputs of the ‘cls’ token as described above.

For all the above models, we also tested a version where the information from the pretrained transformers is passed through a multilayer perceptron with a single hidden layer of 256 hidden units and ReLU nonlinearities. We found that this manipulation reduced performance across all models, verifying that a simple linear embedding is beneficial to generalization performance.

For GPT, BERT and SBERT, \(d_{{\rm{trans}}}=768\) and each model uses a total of ~100 million parameters; for SBERT (L), \(d_{{\rm{trans}}}=1{,}024\) and the model uses ~300 million parameters; for GPT (XL), \(d_{{\rm{trans}}}=1{,}600\) and the model uses ~1.5 billion parameters; for CLIP, \(d_{{\rm{trans}}}=512\) and the model uses ~60 million parameters. Full PyTorch implementations, including all pretrained weights and model hyperparameters, can be accessed at the Huggingface library (https://huggingface.co/docs/transformers/)52.

BoW model

For our BoW model, instructions are represented as a vector of binary activations the size of the instruction vocabulary, where each unit indicates the inclusion or exclusion of the associated word in the current instruction. For our instruction set, ∣vocab∣ = 181. This vector is then projected through a linear layer into 64-dimensional space.

$$\begin{array}{ll}v_{i}^{{\rm{BoW}}}=\left\{\begin{array}{ll}1\quad & {\rm{if}}\;w_{i}\in (w_{1}\ldots w_{T_{{\rm{seq}}}})\\ 0\quad & {\rm{otherwise}}\end{array}\right.\qquad & v^{{\rm{BoW}}}\in {\mathbb{R}}^{| {\rm{vocab}}| }\\ I={\rm{Linear}}_{{\rm{embed}}}(v^{{\rm{BoW}}})\qquad & I\in {\mathbb{R}}^{64}\end{array}$$
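The BoW encoding is simple enough to sketch directly. The toy vocabulary and the random projection weights below are illustrative stand-ins (the real instruction set has |vocab| = 181 and Linear_embed is learned):

```python
import numpy as np

def bow_vector(instruction, vocab):
    """Binary BoW encoding: 1 if the vocabulary word occurs in the instruction."""
    words = set(instruction.lower().split())
    return np.array([1.0 if w in words else 0.0 for w in vocab])

# Toy vocabulary; the real instruction set has |vocab| = 181.
vocab = ["go", "in", "the", "direction", "of", "stimulus", "opposite"]
v = bow_vector("go in the direction of the stimulus", vocab)

# I = Linear_embed(v): project the |vocab|-dim binary vector to 64 dimensions.
rng = np.random.default_rng(0)
W_embed = rng.standard_normal((len(vocab), 64)) / np.sqrt(len(vocab))
I = v @ W_embed
```

Note that the encoding discards word order entirely; two instructions with the same words map to the same vector.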

Blank slate language models

Given that tuning the last layers of language models resulted in improved performance (Fig. 2e), we tested two additional models to determine whether a blank slate language model trained exclusively on the loss from sensorimotor tasks would perform comparably. These models consist of passing BoW representations through a multilayer perceptron and passing pretrained BERT word embeddings through one layer of a randomly initialized BERT encoder. Both models performed poorly compared to pretrained models (Supplementary Fig. 4.5), confirming that language pretraining is essential to generalization.

Task sets

Tasks were divided into five interrelated subgroups: ‘go’, ‘decision-making’, ‘matching’, ‘comparison’ and ‘duration’. Depending on the task, multiple stimuli may appear during the stimulus epoch. Also depending on the task, models may be required to respond in a particular direction or repress response altogether. Unless otherwise specified, zero-mean Gaussian noise is added independently at each time step and to each input unit, and the variance of this noise is drawn randomly from \({\mathcal{U}}[0.1,0.15]\). The timing of stimuli differs among task types. However, for all tasks, trials can be divided into preparatory, stimulus and response epochs. The stimulus epoch can be subdivided into three parts (stim1, delay and stim2), although these distinct parts are not used by all tasks. A trial lasts for a total of T = 150 time steps. Let durepoch denote the duration in simulated time steps of a given epoch. Then

$$\begin{array}{l}dur_{{\rm{prep}}}\sim {\mathcal{U}}\{\ldots \}\\ dur_{{\rm{stim}}1},\;dur_{{\rm{stim}}2}\sim {\mathcal{U}}\{\ldots \}\\ dur_{{\rm{delay}}}\sim {\mathcal{U}}\{\ldots \}\\ dur_{{\rm{resp}}}=150-\big(dur_{{\rm{prep}}}+dur_{{\rm{stim}}1}+dur_{{\rm{stim}}2}+dur_{{\rm{delay}}}\big)\end{array}$$

where the first three durations are each drawn from fixed discrete uniform supports.

For tasks that do not utilize a delay structure, the stim1, stim2 and delay epochs are grouped together in a single stimulus epoch where \(dur_{{\rm{stim}}}=dur_{{\rm{stim}}1}+dur_{{\rm{stim}}2}+dur_{{\rm{delay}}}\). Unless otherwise specified, a fixation cue with a constant strength strfix = 1 is activated throughout the preparatory and stimulus epochs. For example trials of each task, see Supplementary Fig. 13.
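The epoch bookkeeping above can be sketched as follows. The integer ranges here are illustrative placeholders, since the exact discrete supports are part of the original task implementation; only the constraint that the response epoch fills the remainder of T = 150 steps is taken from the text.

```python
import numpy as np

T = 150  # total trial length in time steps

def sample_epoch_durations(rng):
    """Sample epoch durations; the response epoch takes whatever remains of T.

    The integer ranges below are illustrative placeholders, not the exact
    discrete supports used in the original task definitions.
    """
    dur_prep = int(rng.integers(20, 40))
    dur_stim1 = int(rng.integers(20, 35))
    dur_stim2 = int(rng.integers(20, 35))
    dur_delay = int(rng.integers(10, 25))
    dur_resp = T - (dur_prep + dur_stim1 + dur_stim2 + dur_delay)
    return {"prep": dur_prep, "stim1": dur_stim1, "stim2": dur_stim2,
            "delay": dur_delay, "resp": dur_resp}
```

Because the four sampled epochs are bounded, the residual response epoch is always a positive number of steps.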

‘Go’ tasks

The ‘Go’ family of tasks includes ‘Go’, ‘RTGo’, ‘AntiGo’, ‘AntiRTGo’ and modality-specific versions of each task denoted with either ‘Mod1’ or ‘Mod2’. In both the ‘Go’ and ‘AntiGo’ tasks, a single stimulus is presented at the beginning of the stimulus epoch. The direction of the presented stimulus is generated by drawing from a uniform distribution between 0 and 2π, that is, \(\theta _{{\rm{stim}}}\sim {\mathcal{U}}[0,2\pi ]\). The stimulus appears in either modality 1 or modality 2 with equal probability. The strength of the stimulus is given by \(str_{{\rm{stim}}}\sim {\mathcal{U}}[1.0,1.2]\). In the ‘Go’ task, the target response is in the same direction as the presented stimulus, that is, \(\theta _{{\rm{target}}}=\theta _{{\rm{stim}}}\), while in the ‘AntiGo’ task, the response should be in the direction opposite the stimulus, \(\theta _{{\rm{target}}}=\theta _{{\rm{stim}}}+\pi\). For modality-specific versions of each task, a stimulus direction is drawn in each modality, \(\theta _{{\rm{stim}},{\rm{mod}}1}\sim {\mathcal{U}}[0,2\pi ]\) and \(\theta _{{\rm{stim}},{\rm{mod}}2}\sim {\mathcal{U}}[0,2\pi ]\), and for modality-specific Go-type tasks

$$\theta _{{\rm{target}}}=\left\{\begin{array}{ll}\theta _{{\rm{stim}},{\rm{mod}}1}\quad & {\rm{for}}\;{\rm{`Mod1'}}\;{\rm{tasks}}\\ \theta _{{\rm{stim}},{\rm{mod}}2}\quad & {\rm{for}}\;{\rm{`Mod2'}}\;{\rm{tasks}}\end{array}\right.$$

while for modality-specific AntiGo-type tasks

$$\theta _{{\rm{target}}}=\left\{\begin{array}{ll}\theta _{{\rm{stim}},{\rm{mod}}1}+\pi \quad & {\rm{for}}\;{\rm{`Mod1'}}\;{\rm{tasks}}\\ \theta _{{\rm{stim}},{\rm{mod}}2}+\pi \quad & {\rm{for}}\;{\rm{`Mod2'}}\;{\rm{tasks}}\end{array}\right.$$

For ‘RT’ versions of the ‘Go’ tasks, stimuli are only presented during the response epoch and the fixation cue is never extinguished. Thus, the presence of the stimulus itself serves as the response cue and the model must respond as quickly as possible. Otherwise, stimuli persist through the duration of the stimulus epoch.
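The target-direction rules for the whole 'Go' family reduce to two choices (which modality supplies the direction, and whether π is added). A minimal sketch, with the task-name parsing as our own convention:

```python
import numpy as np

def go_target(task, theta_mod1, theta_mod2=None):
    """Target direction for the 'Go' family of tasks (sketch of the rules above).

    For plain 'Go'/'RTGo' tasks the single stimulus direction is passed as
    theta_mod1; a 'Mod2' suffix selects modality 2, an 'Anti' prefix adds pi.
    """
    anti = task.startswith("Anti")
    theta = theta_mod2 if task.endswith("Mod2") else theta_mod1
    return (theta + np.pi) % (2 * np.pi) if anti else theta
```

The 'RT' variants change only the trial timing (stimulus onset doubles as the response cue), not the target direction, so they need no extra branch here.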

‘Decision-making’ tasks

The ‘decision-making’ family of tasks includes ‘DM’ (decision-making), ‘AntiDM’, ‘MultiDM’ (multisensory decision-making), ‘AntiMultiDM’, modality-specific versions of each of these tasks and, finally, confidence-based versions of ‘DM’ and ‘AntiDM’. For all tasks in this group, two stimuli are presented simultaneously and persist throughout the duration of the stimulus epoch. They are drawn according to \(\theta _{{\rm{stim}}1}\sim {\mathcal{U}}[0,2\pi ]\) and \(\theta _{{\rm{stim}}2}\sim {\mathcal{U}}[(\theta _{{\rm{stim}}1}-0.6\pi ,\theta _{{\rm{stim}}1}-0.2\pi )\cup (\theta _{{\rm{stim}}1}+0.2\pi ,\theta _{{\rm{stim}}1}+0.6\pi )]\). A base strength applied to both stimuli is drawn such that \(str_{{\rm{base}}}\sim {\mathcal{U}}[1.0,1.2]\). A contrast c is drawn from a fixed discrete distribution, so the stimulus strengths associated with the two directions in a trial are given by \(str_{{\rm{stim}}1}=str_{{\rm{base}}}+c\) and \(str_{{\rm{stim}}2}=str_{{\rm{base}}}-c\).

For the ‘DM’ task,

$$\theta _{{\rm{target}}}=\left\{\begin{array}{ll}\theta _{{\rm{stim}}1}\quad & {\rm{if}}\;str_{{\rm{stim}}1} > str_{{\rm{stim}}2}\\ \theta _{{\rm{stim}}2}\quad & {\rm{otherwise}}\end{array}\right.$$

and for the ‘AntiDM’ task,

$$\theta _{{\rm{target}}}=\left\{\begin{array}{ll}\theta _{{\rm{stim}}1}\quad & {\rm{if}}\;str_{{\rm{stim}}1} < str_{{\rm{stim}}2}\\ \theta _{{\rm{stim}}2}\quad & {\rm{otherwise}}\end{array}\right.$$

For these versions of the tasks, the stimuli are presented in either modality 1 or modality 2 with equal probability. For the multisensory versions of each task, stimulus directions are drawn in the same manner and presented across both modalities so that \(\theta _{{\rm{stim}}1,{\rm{mod}}1}=\theta _{{\rm{stim}}1,{\rm{mod}}2}\) and \(\theta _{{\rm{stim}}2,{\rm{mod}}1}=\theta _{{\rm{stim}}2,{\rm{mod}}2}\). Base strengths are drawn independently for each modality. Contrasts for the two modalities, \(c_{{\rm{mod}}1}\) and \(c_{{\rm{mod}}2}\), are drawn from a fixed discrete distribution. If \(| c_{{\rm{mod}}1}| -| c_{{\rm{mod}}2}| =0\), contrasts are redrawn to avoid zero-contrast trials during training. If \(c_{{\rm{mod}}1}\) and \(c_{{\rm{mod}}2}\) have the same sign, contrasts are redrawn to ensure that the trial requires integrating over both modalities, as opposed to simply performing a ‘DM’ task in a single modality. Criteria for target responses are measured as the strength of a given direction summed over both modalities. So, for ‘MultiDM’

$$\theta _{{\rm{target}}}=\left\{\begin{array}{ll}\theta _{{\rm{stim}}1}\quad & {\rm{if}}\;str_{{\rm{stim}}1,{\rm{mod}}1}+str_{{\rm{stim}}1,{\rm{mod}}2} > str_{{\rm{stim}}2,{\rm{mod}}1}+str_{{\rm{stim}}2,{\rm{mod}}2}\\ \theta _{{\rm{stim}}2}\quad & {\rm{otherwise}}\end{array}\right.$$

and for ‘AntiMultiDM’

$$\theta _{{\rm{target}}}=\left\{\begin{array}{ll}\theta _{{\rm{stim}}1}\quad & {\rm{if}}\;str_{{\rm{stim}}1,{\rm{mod}}1}+str_{{\rm{stim}}1,{\rm{mod}}2} < str_{{\rm{stim}}2,{\rm{mod}}1}+str_{{\rm{stim}}2,{\rm{mod}}2}\\ \theta _{{\rm{stim}}2}\quad & {\rm{otherwise}}\end{array}\right.$$

Stimuli for modality-specific versions of each task are generated in the same way as for the multisensory versions. Criteria for target responses are the same as in the standard versions of the ‘DM’ and ‘AntiDM’ tasks, applied only to stimuli in the relevant modality.
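The integration rule for the multisensory variants can be sketched directly from the case equations above (function names and argument packing are our own):

```python
def multi_dm_target(theta1, theta2, str1_mods, str2_mods, anti=False):
    """'MultiDM'/'AntiMultiDM' rule: compare each direction's strength summed
    over the two modalities (str*_mods are (mod1, mod2) strength pairs)."""
    total1, total2 = sum(str1_mods), sum(str2_mods)
    if anti:
        return theta1 if total1 < total2 else theta2
    return theta1 if total1 > total2 else theta2
```

The contrast-redrawing constraints described above guarantee that trials like the one in the test below, where one direction wins in a single modality but loses on the integrated strength, actually occur during training.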

In confidence-based decision-making tasks (‘ConDM’ and ‘ConAntiDM’), the stimulus directions are drawn in the same way as above. Stimuli are shown in either modality 1 or modality 2 with equal probability. In each trial, \(str_{{\rm{base}}}=1\). The contrast and noise for each trial are based on the thresholded performance of a SIMPLENET model trained on all tasks except ‘ConDM’ and ‘ConAntiDM’. Once this model has been trained, we establish a threshold across levels of noise and contrast for which the model can perform a ‘DM’ or an ‘AntiDM’ task at 95% correct. We then draw contrasts and noise levels for trials from above and below this threshold with equal probability during training. In trials where the noise and contrast levels fall below the 95% correct threshold, the model must repress response; otherwise, it performs the decision-making task (either ‘DM’ or ‘AntiDM’).

‘Comparison’ tasks

Our comparison task group includes ‘COMP1’, ‘COMP2’, ‘MultiCOMP1’, ‘MultiCOMP2’, ‘Anti’ versions of each of these tasks, as well as modality-specific versions of the ‘COMP1’ and ‘COMP2’ tasks. This group of tasks is designed to extend the basic decision-making framework into a setting with more complex control demands. These tasks utilize the delay structure in the stimulus epoch, so that stim1 appears only during the stim1 epoch, followed by a delay, and finally stim2. This imposes a temporal ordering on the stimuli. In ‘COMP1’, the model must respond to the first stimulus only if it has greater strength than the second and otherwise repress response; that is,

$$\theta _{{\rm{target}}}=\left\{\begin{array}{ll}\theta _{{\rm{stim}}1}\quad & {\rm{if}}\;str_{{\rm{stim}}1} > str_{{\rm{stim}}2}\\ {\rm{repress}}\quad & {\rm{otherwise}}\end{array}\right.$$

Likewise, in ‘COMP2’, the model must respond to the second stimulus if it is presented with greater strength than the first and otherwise repress response; that is,

$$\theta _{{\rm{target}}}=\left\{\begin{array}{ll}\theta _{{\rm{stim}}2}\quad & {\rm{if}}\;str_{{\rm{stim}}2} > str_{{\rm{stim}}1}\\ {\rm{repress}}\quad & {\rm{otherwise}}\end{array}\right.$$

In ‘Anti’ versions of these tasks, the ordering criterion is the same but applies to the stimulus with the least strength; that is, for ‘AntiCOMP1’

$$\theta _{{\rm{target}}}=\left\{\begin{array}{ll}\theta _{{\rm{stim}}1}\quad & {\rm{if}}\;str_{{\rm{stim}}1} < str_{{\rm{stim}}2}\\ {\rm{repress}}\quad & {\rm{otherwise}}\end{array}\right.$$

and for ‘AntiCOMP2’

$$\theta _{{\rm{target}}}=\left\{\begin{array}{ll}\theta _{{\rm{stim}}2}\quad & {\rm{if}}\;str_{{\rm{stim}}2} < str_{{\rm{stim}}1}\\ {\rm{repress}}\quad & {\rm{otherwise}}\end{array}\right.$$

In multisensory settings, the criteria for target direction are analogous to the multisensory decision-making tasks where strength is integrated across modalities. Likewise, for modality-specific versions, the criteria are only applied to stimuli in the relevant modality. Stimuli directions and strength for each of these tasks are drawn from the same distributions as the analogous task in the ‘decision-making’ family. However, during training, we make sure to balance trials where responses are required and trials where models must repress response.
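The four single-modality comparison rules can be sketched in one function, with None standing in for the 'repress' target (an encoding convention of ours, not the authors'):

```python
def comp_target(theta1, theta2, str1, str2, task):
    """'Comparison' target rules; None encodes 'repress response'."""
    if task == "COMP1":
        return theta1 if str1 > str2 else None
    if task == "COMP2":
        return theta2 if str2 > str1 else None
    if task == "AntiCOMP1":
        return theta1 if str1 < str2 else None
    if task == "AntiCOMP2":
        return theta2 if str2 < str1 else None
    raise ValueError(task)
```

Unlike the 'DM' tasks, every comparison trial has a repress branch, which is why training balances respond and repress trials.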

‘Duration’ tasks

The ‘duration’ family of tasks includes ‘Dur1’, ‘Dur2’, ‘MultiDur1’, ‘MultiDur2’, ‘Anti’ versions of each of these tasks and modality-specific versions of the ‘Dur1’ and ‘Dur2’ tasks. These tasks require models to perform a time estimation task with the added demand that stimulus ordering determines relevance for response. As in ‘comparison’ tasks, stim1 is presented, followed by a delay and then stim2. For ‘Dur1’ trials

$$\theta _{{\rm{target}}}=\left\{\begin{array}{ll}\theta _{{\rm{stim}}1}\quad & {\rm{if}}\;dur_{{\rm{stim}}1} > dur_{{\rm{stim}}2}\\ {\rm{repress}}\quad & {\rm{otherwise}}\end{array}\right.$$

Likewise, for ‘Dur2’

$$\theta _{{\rm{target}}}=\left\{\begin{array}{ll}\theta _{{\rm{stim}}2}\quad & {\rm{if}}\;dur_{{\rm{stim}}2} > dur_{{\rm{stim}}1}\\ {\rm{repress}}\quad & {\rm{otherwise}}\end{array}\right.$$

In ‘Anti’ versions of these tasks, the correct response is in the direction of the stimulus with the shorter duration, given that the ordering criterion is met. Hence, for ‘AntiDur1’

$$\theta _{{\rm{target}}}=\left\{\begin{array}{ll}\theta _{{\rm{stim}}1}\quad & {\rm{if}}\;dur_{{\rm{stim}}1} < dur_{{\rm{stim}}2}\\ {\rm{repress}}\quad & {\rm{otherwise}}\end{array}\right.$$

and for ‘AntiDur2’

$$\theta _{{\rm{target}}}=\left\{\begin{array}{ll}\theta _{{\rm{stim}}2}\quad & {\rm{if}}\;dur_{{\rm{stim}}2} < dur_{{\rm{stim}}1}\\ {\rm{repress}}\quad & {\rm{otherwise}}\end{array}\right.$$

Across these tasks, directions are drawn according to \(\theta _{{\rm{stim}}1}\sim {\mathcal{U}}[0,2\pi ]\) and \(\theta _{{\rm{stim}}2}\sim {\mathcal{U}}[(\theta _{{\rm{stim}}1}-0.6\pi ,\theta _{{\rm{stim}}1}-0.2\pi )\cup (\theta _{{\rm{stim}}1}+0.2\pi ,\theta _{{\rm{stim}}1}+0.6\pi )]\). Stimulus strengths are drawn according to \(str_{{\rm{stim}}1},str_{{\rm{stim}}2}\sim {\mathcal{U}}[0.8,1.2]\). To set the duration of each stimulus, we first draw \(dur_{{\rm{long}}}\) from a fixed discrete set of durations and then draw \(dur_{{\rm{short}}}\) from a discrete set of durations bounded above by \(dur_{{\rm{long}}}-8\) time steps. During training, we determine which trials for a given task should and should not require a response, in order to evenly balance repress and respond trials. We then assign durlong and durshort to either stim1 or stim2 so that the trial requires the appropriate response given the particular task type.

Again, criteria for correct responses in the multisensory and modality-specific versions of each task follow the analogous tasks in the ‘decision-making’ and ‘comparison’ groups: multisensory versions require integrating total duration over each modality, and modality-specific versions require considering only durations in the given task modality. For multisensory tasks, we draw a duration value \(dur_{{\rm{long}}}\) from a fixed discrete set and then split this value into \(dur_{{\rm{long}}0}=dur_{{\rm{long}}}\times 0.55\) and \(dur_{{\rm{long}}1}=dur_{{\rm{long}}}\times 0.45\). We also draw a value \(dur_{{\rm{short}}}=dur_{{\rm{long}}}-\Delta dur\), where \(\Delta dur\) is drawn from a fixed discrete set. This value is then subdivided further into \(dur_{{\rm{short}}0}=dur_{{\rm{long}}1}+\Delta dur_{{\rm{short}}}\), where \(\Delta dur_{{\rm{short}}}\) is drawn from a fixed discrete set, and \(dur_{{\rm{short}}1}=dur_{{\rm{short}}}-dur_{{\rm{short}}0}\). Short and long durations can then be allocated to the ordered stimuli according to task type. Drawing durations in this manner ensures that, as in the ‘decision-making’ and ‘comparison’ groups, correct answers truly require models to integrate durations over both modalities, rather than simply performing the task in a given modality.
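The splitting scheme can be sketched as follows; Δdur and Δdur_short are drawn from fixed discrete sets in the original implementation, so here they are simply passed in.

```python
def split_multisensory_durations(dur_long, delta_dur, delta_short):
    """Split the long and short durations across two modalities as described
    above; delta_dur and delta_short come from fixed discrete sets in the
    original implementation and are supplied by the caller here."""
    dur_long0 = dur_long * 0.55
    dur_long1 = dur_long * 0.45
    dur_short = dur_long - delta_dur          # integrated short < integrated long
    dur_short0 = dur_long1 + delta_short      # short stim can "win" in modality 0
    dur_short1 = dur_short - dur_short0
    return (dur_long0, dur_long1), (dur_short0, dur_short1)
```

The point of the 0.55/0.45 split plus the Δdur_short offset is that the stimulus with the longer *integrated* duration can still have the shorter duration within a single modality, forcing cross-modal integration.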

‘Matching’ tasks

The ‘matching’ family of tasks consists of the ‘DMS’ (delay match to stimulus), ‘DNMS’ (delay non-match to stimulus), ‘DMC’ (delay match to category) and ‘DNMC’ (delay non-match to category) tasks. For all tasks, stim1 is presented at the beginning of the stimulus epoch, followed by a delay and the presentation of stim2. Stimulus strengths are drawn according to \(str_{{\rm{stim}}1},str_{{\rm{stim}}2}\sim {\mathcal{U}}[0.8,1.2]\). The input modality for any given trial is chosen at random with equal probability. In both ‘DMS’ and ‘DNMS’ tasks, trials are constructed as ‘matching stim’ trials or ‘mismatching stim’ trials with equal probability. In ‘matching stim’ trials, \(\theta _{{\rm{stim}}1}\sim {\mathcal{U}}[0,2\pi ]\) and \(\theta _{{\rm{stim}}2}=\theta _{{\rm{stim}}1}\). In ‘mismatching stim’ trials, \(\theta _{{\rm{stim}}1}\sim {\mathcal{U}}[0,2\pi ]\) and

$$\theta _{{\rm{stim}}2}\sim {\mathcal{U}}[(\theta _{{\rm{stim}}1}-0.6\pi ,\theta _{{\rm{stim}}1}-0.2\pi )\cup (\theta _{{\rm{stim}}1}+0.2\pi ,\theta _{{\rm{stim}}1}+0.6\pi )].$$

For ‘DMS’, models must respond in the displayed direction if the stimuli match, otherwise repress response,

$$\theta _{{\rm{target}}}=\left\{\begin{array}{ll}\theta _{{\rm{stim}}1}\quad & {\rm{if}}\;\theta _{{\rm{stim}}1}=\theta _{{\rm{stim}}2}\\ {\rm{repress}}\quad & {\rm{otherwise}}\end{array}\right.$$

and for ‘DNMS’, models must respond to the second direction if both directions are mismatched,

$$\theta _{{\rm{target}}}=\left\{\begin{array}{ll}\theta _{{\rm{stim}}2}\quad & {\rm{if}}\;\theta _{{\rm{stim}}1}\ne \theta _{{\rm{stim}}2}\\ {\rm{repress}}\quad & {\rm{otherwise}}\end{array}\right.$$

‘DMC’ and ‘DNMC’ tasks are organized in a similar manner. The stimulus input space is divided evenly into two categories such that cat1 = [0, π) and cat2 = [π, 2π). For ‘DMC’ and ‘DNMC’ tasks, trials are constructed as ‘matching cat.’ trials or ‘mismatching cat.’ trials with equal probability. In ‘matching cat.’ trials, \(\theta _{{\rm{stim}}1}\sim {\mathcal{U}}[0,2\pi ]\) and \(\theta _{{\rm{stim}}2}\sim {\mathcal{U}}({\rm{cat}}_{{\rm{stim}}1})\), where \({\mathcal{U}}({\rm{cat}}_{{\rm{stim}}1})\) is a uniform draw from the category of stim1. In ‘mismatching cat.’ trials, \(\theta _{{\rm{stim}}1}\sim {\mathcal{U}}[0,2\pi ]\) and \(\theta _{{\rm{stim}}2}\sim {\mathcal{U}}(\neg {\rm{cat}}_{{\rm{stim}}1})\), where \(\neg {\rm{cat}}_{{\rm{stim}}1}\) is the opposite category from stim1. For ‘DMC’, the model must respond in the first direction if both stimuli are presented in the same category and otherwise repress response,

$$\theta _{{\rm{target}}}=\left\{\begin{array}{ll}\theta _{{\rm{stim}}1}\quad & {\rm{if}}\;{\rm{cat}}_{{\rm{stim}}1}={\rm{cat}}_{{\rm{stim}}2}\\ {\rm{repress}}\quad & {\rm{otherwise}}\end{array}\right.$$

and for ‘DNMC’, the model should respond to the second direction if the two stimuli are presented in opposite categories and otherwise repress response,

$$\theta _{{\rm{target}}}=\left\{\begin{array}{ll}\theta _{{\rm{stim}}2}\quad & {\rm{if}}\;{\rm{cat}}_{{\rm{stim}}1}\ne {\rm{cat}}_{{\rm{stim}}2}\\ {\rm{repress}}\quad & {\rm{otherwise}}\end{array}\right.$$
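The four matching rules can be sketched together; the even category split at π and the use of None for 'repress' are our own conventions for illustration.

```python
import numpy as np

def category(theta):
    """Two even categories on the circle (cat1 = [0, pi), cat2 = [pi, 2pi))."""
    return 0 if (theta % (2 * np.pi)) < np.pi else 1

def matching_target(theta1, theta2, task):
    """'Matching' family response rules; None encodes 'repress response'."""
    if task == "DMS":
        return theta1 if np.isclose(theta1, theta2) else None
    if task == "DNMS":
        return theta2 if not np.isclose(theta1, theta2) else None
    if task == "DMC":
        return theta1 if category(theta1) == category(theta2) else None
    if task == "DNMC":
        return theta2 if category(theta1) != category(theta2) else None
    raise ValueError(task)
```

Note the asymmetry inherited from the equations: match-type tasks respond in the first direction, non-match-type tasks in the second.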

Target output and correct criteria

The target output \(y\in {\mathbb{R}}^{33}\) for a trial entails maintaining fixation in y1 = yfix during the stimulus epoch and then either responding in the correct direction or repressing activity in the remaining target response units y2…33 in the response epoch. Because the model should maintain fixation until the response epoch, the fixation target is set at yfix = 0.85 during the preparatory and stimulus epochs and yfix = 0.05 in the response epoch. When a response is not required, as in the preparatory and stimulus epochs and with repressed activity in the response epoch, unit i takes on a target activity of yi = 0.05. Alternatively, when there is a target direction for response,

$$y_{i}=0.8\exp \left[-0.5\times {\left(\frac{8| \theta _{{\rm{target}}}-\theta _{i}| }{2\pi }\right)}^{2}\right]+0.05$$

where θi is the preferred direction for unit i. As with the sensory stimuli, preferred directions for target units are evenly spaced values in [0, 2π] allocated to the 32 response units.

For a model response to count as correct, the model must maintain fixation, that is, \(\hat{y}_{{\rm{fix}}} > 0.5\), during the preparatory and stimulus epochs. When no response is required, all response units must remain at \(\hat{y}_{i} < 0.15\). When a response is required, response activity is decoded using a population vector method, and the decoded direction \(\theta _{{\rm{resp}}}\) must fall within a fixed window centred on \(\theta _{{\rm{target}}}\). If the model fails to meet any of these criteria, the trial response is incorrect.
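The target construction and the population-vector decode can be sketched together. This is an illustrative reconstruction: the tuning width follows the equation above, and summing unit activity against complex phases is one standard way to implement a population vector.

```python
import numpy as np

N_RESP = 32
PREF = np.linspace(0, 2 * np.pi, N_RESP, endpoint=False)

def target_activity(theta_target):
    """Target for the 32 response units: a hill around theta_target on a 0.05 baseline."""
    diff = np.abs(theta_target - PREF)
    diff = np.minimum(diff, 2 * np.pi - diff)  # circular distance
    return 0.8 * np.exp(-0.5 * (8 * diff / (2 * np.pi)) ** 2) + 0.05

def population_vector_decode(activity):
    """Decode the response direction as the angle of the population vector."""
    vec = np.sum(activity * np.exp(1j * PREF))
    return float(np.angle(vec) % (2 * np.pi))
```

Because the 0.05 baseline is uniform over a full ring of preferred directions, it contributes no net vector, so decoding a clean target trace recovers the target direction almost exactly.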

Model training

Again following ref. 18, model parameters are updated in a supervised fashion according to a masked mean squared error loss (mMSE) computed between the model motor response, \(\hat{y}_{1\ldots T}=\hat{y}\), and the target, \(y_{1\ldots T}=y\), for each trial.

$$L={\rm{mMSE}}(y,\hat{y})={\big\langle {\rm{mask}}_{t}\times {(y_{t}-\hat{y}_{t})}^{2}\big\rangle }_{t}$$

Here, the multiplication sign denotes element-wise multiplication. Masks weigh the importance of different trial epochs. During preparatory and stimulus epochs, mask weights are set to 1; during the first five time steps of the response epoch, the mask value is set to 0; and during the remainder of the response epoch, the mask weight is set to 5. The mask value for the fixation is twice that of other values at all time steps.
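The mask schedule described above can be sketched as follows (the response-epoch start index and unit layout, with fixation at index 0, are our own conventions):

```python
import numpy as np

def build_mask(T, n_units, resp_start):
    """Mask weights as described: 1 during prep/stim epochs, 0 for the first
    five response steps, 5 for the rest of the response epoch; the fixation
    unit (index 0) is weighted twice as much at all time steps."""
    mask = np.ones((T, n_units))
    mask[resp_start:resp_start + 5, :] = 0.0
    mask[resp_start + 5:, :] = 5.0
    mask[:, 0] *= 2.0
    return mask

def masked_mse(y, y_hat, mask):
    """L = <mask * (y - y_hat)^2>, averaged over time and units."""
    return float(np.mean(mask * (y - y_hat) ** 2))
```

The five zero-weighted steps give the network a grace period to switch from fixation to response before errors start counting.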

For all models, we update Θ = {sensorimotor-RNN, Linearout} during training on our task set. For instructed models, we additionally update Linearembed in the process of normal training. We train models using standard PyTorch machinery and an Adam optimizer. An epoch consists of 2,400 mini-batches, with each mini-batch consisting of 64 trials. For all models, we use the same initial learning rate as in ref. 18, lr = 0.001. We found that in the later phases of training, model performance oscillated depending on the latest task presented during training, so we decayed the learning rate each epoch by a factor of γ = 0.95, which allowed performance to converge smoothly. Following ref. 18, models train until they reach a threshold performance of 95% across all tasks (and train for a minimum of 35 epochs). We found that training for GPTNET tended to asymptote below the performance threshold for multisensory versions of comparison tasks. This held true over a variety of training hyperparameters and learning rate scheduler regimes. Hence, we relax the performance threshold of GPTNET to 85%. For each model type, we train five models starting from five different random initializations. Where applicable, results are averaged over these initializations.

Language model fine-tuning

When fine-tuning models, we allow the gradient from the motor loss experienced during sensorimotor training to fine-tune the weights in the final layers of the transformer language models. During normal training, we checkpoint a copy of our instructed models after training for 30 epochs. We then add the last three transformer layers to the set of trainable parameters and reset the learning rates to lr = 1 × 10−4 for Θ = {sensorimotor-RNN, Linearout, Linearembed} and lrlang = 3 × 10−4 for Θlang = {transformer−3,−2,−1}, where transformer−3,−2,−1 denotes the parameters of the last three layers of the relevant transformer architecture. We used these reduced learning rates to avoid completely erasing preexisting linguistic knowledge. Similarly, for the RNN parameters, we found that the above learning rate avoided catastrophic forgetting of sensorimotor knowledge while also allowing the RNN to adapt to updated language embeddings across all models. Autoregressive models were much more sensitive to this procedure, often collapsing at the beginning of fine-tuning. Hence, for GPTNETXL and GPTNET, we used lrlang = 5 × 10−5, which resulted in robust learning. Models train until they reach a threshold performance of 95% across training tasks, or 85% correct for GPTNET.

Hold-out testing

During hold-out testing, we present models with 100 batches of one of the tasks that had been held out of training. For instructed models, the only weights allowed to update during this phase are Θ = {sensorimotor-RNN, Linearout, Linearembed}. All weights of SIMPLENET and STRUCTURENET are trainable in this context. In this hold-out setting, we found that for the more difficult tasks, the standard hyperparameters we used during training resulted in unstable learning curves for some of our more poorly performing models. To stabilize performance and thereby create fair comparisons across models, we used an increased batch size of 256. We then began with the standard learning rate of 0.001 and decreased it by increments of 0.0005 until all models showed robust learning curves. This resulted in a learning rate of 8 × 10−4. All additional results shown in Supplementary Information section 4 follow this procedure.

CCGP calculation

To calculate CCGP, we trained a linear decoder on a pair of tasks and then tested that decoder on alternative pairs of tasks that have an analogous relationship. We grouped tasks into eight dichotomies: ‘Go’ versus ‘Anti’, ‘Standard’ versus ‘RT’, ‘Weakest’ versus ‘Strongest’, ‘Longest’ versus ‘Shortest’, ‘First Stim.’ versus ‘Second Stim.’, ‘Stim Match’ versus ‘Category Match’, ‘Matching’ versus ‘Non-Matching’ and ‘Mod1’ versus ‘Mod2’. As an example, the ‘Go’ versus ‘Anti’ dichotomy includes the (‘Go’, ‘AntiGo’), (‘GoMod1’, ‘AntiGoMod1’), (‘GoMod2’, ‘AntiGoMod2’), (‘RTGo’, ‘AntiRTGo’), (‘RTGoMod1’, ‘AntiRTGoMod1’) and (‘RTGoMod2’, ‘AntiRTGoMod2’) task pairs. For RNN task representations, we extracted activity at the time of stimulus onset for 250 example trials. For language representations, we input the instruction sets for the relevant tasks to our language models and directly analyze activity in the ‘embedding’ layer, or take the sequence-averaged activity in each transformer layer. For nonlinguistic models, we simply analyze the space of rule vectors. Train and test conditions for decoders were determined by dichotomies identified across the task set (Supplementary Note 1). To train and test decoders, we used the sklearn.svm.LinearSVC Python package. The CCGP score for a given task is the average decoding score achieved across all dichotomies where the task in question was part of either the train set or the test set. For model scores reported in the main text, we only calculate CCGP scores for models where the task in question has been held out of training. In Supplementary Fig. 9, we report scores on tasks where models have been trained on all tasks, and for models where instructions have been switched for the hold-out task.
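The decoding procedure can be illustrated on simulated representations. Everything here is a toy stand-in: the 50-dimensional activity, the cluster means and the shared "anti" axis are fabricated purely to show the train-on-one-pair, test-on-the-analogous-pair logic.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Toy "RNN representations": tasks share a common "anti" direction in state
# space, offset by task-specific means (an abstract, generalizable geometry).
anti_axis = rng.standard_normal(50)
go_mean = 0.5 * rng.standard_normal(50)
rtgo_mean = 0.5 * rng.standard_normal(50)

def trials(center, n=100):
    """Simulated single-trial activity clustered around a task mean."""
    return center + 0.1 * rng.standard_normal((n, 50))

# Train a decoder on the ('Go', 'AntiGo') pair of the 'Go' vs 'Anti' dichotomy...
X_train = np.vstack([trials(go_mean), trials(go_mean + anti_axis)])
y_train = np.array([0] * 100 + [1] * 100)
clf = LinearSVC(max_iter=5000).fit(X_train, y_train)

# ...and test it on the analogous ('RTGo', 'AntiRTGo') pair.
X_test = np.vstack([trials(rtgo_mean), trials(rtgo_mean + anti_axis)])
y_test = np.array([0] * 100 + [1] * 100)
ccgp_score = clf.score(X_test, y_test)
```

When the dichotomy is encoded along a consistent direction across task pairs, as constructed here, the decoder transfers and the score approaches 1; entangled representations would instead score near chance (0.5).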

For Fig. 3e, we calculated Pearson’s r correlation coefficient between performance on held-out tasks and CCGP scores per task, as well as a P value testing against the null hypothesis that these metrics are uncorrelated and normally distributed (using the scipy.stats.pearsonr function). Full statistical tests for CCGP scores of both RNN and embedding layers from Fig. 3f can be found in Supplementary Fig. 9. Note that transformer language models use the same set of pretrained weights across random initializations of sensorimotor-RNNs; thus, for language model layers, the Fig. 3f plots show the absolute scores of those language models.

Conditional clause/deduction task analysis

We first split our task set into two groups (listed below): tasks whose instructions include conditional clauses and simple deductive reasoning components (30 tasks) and tasks whose instructions include simple imperatives (20 tasks). We computed the difference between the mean generalization performance of each group, across random initializations, for each model (Fig. 2f). We compared these differences to a null distribution constructed by performing a set of 50 random shuffles of the task set into groups of 30 and 20 tasks and computing differences in the same way, again using two-sided unequal-variance t-tests. Because STRUCTURENET is a nonlinguistic model, we then compared the performance of STRUCTURENET to our instructed models to dissociate the effects of performing tasks with a deductive reasoning component versus processing instructions with more complicated conditional clause structure. Results of all statistical tests are reported in Supplementary Fig. 6.

Simple imperative tasks include: ‘Go’, ‘AntiGo’, ‘RTGo’, ‘AntiRTGo’, ‘GoMod1’, ‘GoMod2’, ‘AntiGoMod1’, ‘AntiGoMod2’, ‘RTGoMod1’, ‘AntiRTGoMod1’, ‘RTGoMod2’, ‘AntiRTGoMod2’, ‘DM’, ‘AntiDM’, ‘MultiDM’, ‘AntiMultiDM’, ‘DMMod1’, ‘DMMod2’, ‘AntiDMMod1’ and ‘AntiDMMod2’.

Conditional clause/deduction tasks include: ‘ConDM’, ‘ConAntiDM’, ‘Dur1’, ‘Dur2’, ‘MultiDur1’, ‘MultiDur2’, ‘AntiDur1’, ‘AntiDur2’, ‘AntiMultiDur1’, ‘AntiMultiDur2’, ‘Dur1Mod1’, ‘Dur1Mod2’, ‘Dur2Mod1’, ‘Dur2Mod2’, ‘COMP1’, ‘COMP2’, ‘MultiCOMP1’, ‘MultiCOMP2’, ‘AntiCOMP1’, ‘AntiCOMP2’, ‘AntiMultiCOMP1’, ‘AntiMultiCOMP2’, ‘COMP1Mod1’, ‘COMP1Mod2’, ‘COMP2Mod1’, ‘COMP2Mod2’, ‘DMS’, ‘DNMS’, ‘DMC’ and ‘DMNC’.

Language production training

Self-supervised language production network training

Our language production framework is inspired by classic sequence-to-sequence modeling using RNNs53. Our Production-RNN is a GRU with 256 hidden units using ReLU nonlinearities. At each step in the sequence, a set of decoder weights, Linearwords, attempts to decode the next token, wτ+1, from the hidden state of the recurrent units. The hidden state of the Production-RNN is initialized by concatenating the time-averaged and maximum sensorimotor activity of an SBERTNET (L) and passing that through weights Linearsm. The linguistic instruction used to drive the initializing sensorimotor activity is in turn used as the target set of tokens for the Production-RNN outputs. The first input to the Production-RNN is always a special start-of-sentence token, and the decoder runs until an end-of-sentence token is decoded or until the output reaches a length of 30 tokens. Suppose \(w_{0,k}\ldots w_{T,k}\) is the sequence of tokens in instruction k, where k is in the instruction set for task i and \(X^{i}\) is the sensory input for a trial of task i. For brevity, we denote the process by which language models embed instructions as Embed() (see ‘Pretrained transformers’). The decoded token at the τth position, \(\hat{w}_{\tau }\), is then given by

$$\begin{array}{ll}h_{t}^{sm}={\mathrm{SensorimotorRNN}}\left(X^{i},{Embed}\left(w_{0,k}\ldots w_{T,k}\right)\right)&h_{t}^{sm}\in {\mathbb{R}}^{256}\\ sm\_out=\left(\mathop{\mathrm{mean}}\limits_{t}\left(h_{t}^{sm}\right),\mathop{\mathrm{max}}\limits_{t}\left(h_{t}^{sm}\right)\right)&sm\_out\in {\mathbb{R}}^{512}\\ \overline{h}_{0}^{\mathrm{decoder}}={\mathrm{ReLU}}\left({\mathrm{Linear}}_{sm}(sm\_out)\right)&\overline{h}_{0}^{\mathrm{decoder}}\in {\mathbb{R}}^{256}\\ h_{0}^{\mathrm{decoder}}={\mathrm{Dropout}}\left(\overline{h}_{0}^{\mathrm{decoder}}\right)&h_{0}^{\mathrm{decoder}}\in {\mathbb{R}}^{256}\\ h_{\tau }^{\mathrm{decoder}}={\mathrm{GRU}}\left(\hat{w}_{0}\ldots \hat{w}_{\tau };h_{0}^{\mathrm{decoder}}\right)&h_{\tau }^{\mathrm{decoder}}\in {\mathbb{R}}^{256}\\ {p}_{\hat{w}_{\tau }}={\mathrm{softmax}}\left({\mathrm{Linear}}_{words}\left(h_{\tau }^{\mathrm{decoder}}\right)\right)&{p}_{\hat{w}_{\tau }}\in {\mathbb{R}}^{| {\mathcal{V}}| }\\ \hat{w}_{\tau }={\mathrm{argmax}}\left({p}_{\hat{w}_{\tau }}\right)&\end{array}$$
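The decoder initialization step, which pools sensorimotor activity over time and projects it through Linearsm, can be sketched in PyTorch; the dimensions follow the text, but the class name and interface are illustrative:

```python
import torch
import torch.nn as nn

class ProductionInit(nn.Module):
    """Map sensorimotor activity of shape (T, 256) to the Production-RNN's
    initial hidden state: concatenate its time-average and time-max (512 dims)
    and project back to 256 through Linear_sm with a ReLU."""
    def __init__(self, hidden=256):
        super().__init__()
        self.linear_sm = nn.Linear(2 * hidden, hidden)

    def forward(self, sm_activity):                          # (T, hidden)
        sm_out = torch.cat([sm_activity.mean(dim=0),
                            sm_activity.max(dim=0).values])  # (2 * hidden,)
        return torch.relu(self.linear_sm(sm_out))            # (hidden,)

# Example: 120 time steps of 256-dimensional sensorimotor activity.
h0 = ProductionInit()(torch.randn(120, 256))
```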

The model parameters \({\Theta }_{\mathrm{production}}=\{{\mathrm{Linear}}_{sm},{\mathrm{GRU}},{\mathrm{Linear}}_{words}\}\) are trained using cross-entropy loss between \({p}_{\hat{w}_{\tau }}\) and the instruction token wτ,k provided to the sensorimotor-RNN as input. We train for 80 epochs of 2,400 batches with 64 trials per batch and with task type randomly interleaved. We found that using an initial learning rate of 0.001 sometimes caused models to diverge in early phases of training, so we opted for a learning rate of 1 × 10−4, which led to stable early training. To alleviate similar oscillation problems detected in sensorimotor training, we also decayed the learning rate by γ = 0.99 per epoch. Additionally, the use of a dropout layer with a dropout rate of 0.05 improved performance. We also used a teacher forcing curriculum, in which for some ratio of training batches we input the ground-truth instruction token wτ,k at each time step instead of the model’s decoded word \(\hat{w}_{\tau }\). At each epoch, the teacher forcing ratio is \(0.5\times \frac{80-\mathrm{epoch}}{80}\).
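One plausible reading of the teacher forcing schedule is a linear decay from 0.5 toward 0 over the 80 training epochs; a sketch under that assumption:

```python
def teacher_forcing_ratio(epoch, total_epochs=80):
    """Probability of feeding the ground-truth token (rather than the model's
    own prediction) at each decoding step; decays linearly across epochs."""
    return 0.5 * (total_epochs - epoch) / total_epochs

# Starts at 0.5, halves by mid-training, reaches 0 at the final epoch.
start, mid, end = (teacher_forcing_ratio(e) for e in (0, 40, 80))
```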

Obtaining embedding layer activity using motor feedback

For a task, i, we seek to optimize a set of embedding activity vectors, \(E^{i}\in {\mathbb{R}}^{64}\), such that when they are input as task-identifying information, the model will perform the task in question. Crucially, we freeze all model weights, Θ (the weights of the sensorimotor-RNN and Linearout), and only update Ei according to the standard supervised loss on the motor output. For notational clarity, GRU dependence on the previous hidden state ht−1 has been made implicit in the following equations.

$$\begin{array}{rcl}{\hat{y}}^{i}&=&\sigma \Big({\mathrm{Linear}}_{out}\left({\mathrm{GRU}}({X}^{i},{E}^{i})\right)\Big)\\ L&=&{\mathrm{Loss}}(\;y,\hat{y})\end{array}$$

We optimized a set of 25 embedding vectors for each task, again using an Adam optimizer. Here the optimization space has many suboptimal local minima corresponding to embeddings for related tasks. Hence, we used a high initial learning rate of lr = 0.05, which we decayed by γ = 0.8 for each epoch; this resulted in more robust learning than lower learning rates. An epoch lasts for 800 batches with a batch length of 64, and we train for a minimum of 1 epoch, or until we reach a threshold performance of 90% (85% on the ‘DMC’ and ‘DNMC’ tasks).
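A minimal PyTorch sketch of this optimization, using hypothetical sensory and motor dimensions: all network weights are frozen, and only the 25 embedding vectors receive gradient updates through the motor loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Stand-in network: GRU input is sensory input (65) + embedding (64) = 129.
rnn = nn.GRU(input_size=129, hidden_size=256)
readout = nn.Linear(256, 33)
for p in list(rnn.parameters()) + list(readout.parameters()):
    p.requires_grad = False                       # all weights frozen

E = torch.zeros(25, 64, requires_grad=True)       # 25 embedding vectors
opt = torch.optim.Adam([E], lr=0.05)
sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.8)

x = torch.randn(100, 25, 65)                      # (time, batch, sensory)
target = torch.rand(100, 25, 33)                  # stand-in motor targets
for _ in range(3):                                # a few illustrative steps
    inp = torch.cat([x, E.expand(100, 25, 64)], dim=-1)
    h, _ = rnn(inp)
    y_hat = torch.sigmoid(readout(h))
    loss = nn.functional.mse_loss(y_hat, target)  # stand-in motor loss
    opt.zero_grad()
    loss.backward()
    opt.step()                                    # updates only E
sched.step()                                      # lr decays by gamma per epoch
```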

Producing task instructions

To produce task instructions, we simply use the set Ei as task-identifying information in the input of the sensorimotor-RNN and use the Production-RNN to output instructions based on the sensorimotor activity driven by Ei. For each task, we use the set of embedding vectors to produce 50 instructions. We repeat this process for each of the 5 initializations of the sensorimotor-RNN, resulting in 5 distinct language production networks and 5 distinct sets of learned embedding vectors. Reported results for each task are averaged over these 5 networks. For the confusion matrix (Fig. 5d), we report the average percentage of decoded instructions that appear in the training instruction set for a given task, or that are novel instructions. Partner model performance (Fig. 5e) for each network initialization is computed by testing each of the 4 possible partner networks and averaging over the results.
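The greedy decoding loop described above (run until an end-of-sentence token is produced or 30 tokens are reached) can be sketched with a toy step function standing in for the actual Production-RNN:

```python
import numpy as np

def greedy_decode(step_fn, sos_id, eos_id, max_len=30):
    """Greedy decoding: feed the previous token back in until an
    end-of-sentence token is produced or max_len tokens are emitted.
    step_fn(token, state) -> (logits, state) is a hypothetical stand-in
    for one step of the Production-RNN."""
    tokens, state, tok = [], None, sos_id
    while len(tokens) < max_len:
        logits, state = step_fn(tok, state)
        tok = int(np.argmax(logits))          # argmax over p_w_tau
        if tok == eos_id:
            break
        tokens.append(tok)
    return tokens

# Toy step function that emits tokens 5, 6, 7 and then EOS (id 1).
def toy_step(tok, state):
    state = 0 if state is None else state + 1
    seq = [5, 6, 7, 1]
    logits = np.zeros(10)
    logits[seq[min(state, 3)]] = 1.0
    return logits, state

out = greedy_decode(toy_step, sos_id=0, eos_id=1)
```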

Sample sizes/randomization

No statistical methods were used to predetermine sample sizes, but following ref. 18, we used five different random weight initializations per language model tested. Randomization of weights was carried out automatically by the Python and PyTorch software packages. Given this automated randomization of weights, we did not use any blinding procedures in our study. No data were excluded from the analyses.

Software

All simulations and data analyses were performed in Python 3.7.11. PyTorch 1.10 was used to implement and train models (including the Adam optimizer implementation). Transformers 4.16.2 was used to implement language models, and all pretrained weights for language models were taken from the Hugging Face repository (https://huggingface.co/docs/transformers/). We also used scikit-learn 0.24.1 and scipy 1.7.3 to perform analyses.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
