Sparse attentional subsetting of item features and list-composition effects on recognition memory

Many mathematical memory models treat an item in a memory task as a list of features comprising a vector (Fig. 1a). Features are deliberately kept abstract, for mathematical convenience and to emphasize the distributed nature of representations (Murdock, 1995b), but also to express the generality of the functioning of models across a hypothetical range of features. There are some interesting exceptions to this, where modellers have incorporated assumptions about how various features might function differently in a model. One important example is the Feature Model (Nairne, 1990), which distinguishes features of an item that are present every time an item is presented, from features that are specific to the modality or form in which the item was presented (Cyr et al., 2021, Saint-Aubin et al., 2021). Another example is MINERVA 2, where modellers have incorporated assumptions about ranges of features being specific to particular conditions such as particular ways participants process word stimuli (Hintzman, 1988, Jamieson et al., 2016). Retrieving Effectively from Memory (REM) can have a range of features dedicated to associative features between two items (Cox and Shiffrin, 2017, Criss and Shiffrin, 2005).

Caplan et al. (2022) introduced the idea that different features of an item’s representation might be activated when the item is accompanied by one particular item versus a different particular item (Tversky, 1977; see examples below). I take this idea further. I assume that participants do not attend to the vast majority of features of an item; attending to all of them would be unwieldy and implausible. Rather, they attend to a small subset of features (cf. Glanzer et al., 1993), and only those subsetted features can be encoded into an episodic memory (Fig. 1b,c). What drives attention could be quite specific, and one can often specify something about what determines the set of attended features. Such factors can include task set, due to explicit instructions or participants’ prior experience, where the model (subject) has some idea of what to expect in the experiment and which stimulus features are relevant versus irrelevant (e.g., Medin et al., 1993, Osgood, 1949), as well as context (e.g., Gagné & Spalding, 2007) and recent experience. My main focus here is how the set of attended features may depend on how the participant processes an item.

What enables a model with sparsely encoded items to excel at episodic recognition is the assumption (shared with Caplan et al., 2022) that at test, roughly the same feature-subset is attended in target probe items (Fig. 5), either because the participant re-processes the item as during the study phase, or because the participant’s assumption about which features are relevant carries over to the test phase. For example, if the participant forms a visual image of a word, turtle, during study, the features of that image and its main details will likely be similar if the image is recreated on the fly at test. This achieves several things. It results in relatively sparse representations stored in an episodic memory. It does so while not eliminating the cumulative knowledge (entire vector) associated with an item, which I assume to be stored elsewhere, in a “lexicon” or “semantic memory”, as models of episodic memory commonly do (e.g., Humphreys et al., 1989, Murdock, 1982).2 It can then potentially explain differences across conditions: when the attended feature-subset at test mismatches that at study, performance will be hurt. The model is sandwiched amongst numerous prior models in the same spirit. The main novel ideas I introduce are the twin assumptions that the attended subset of features is quite small, and that nearly the same subset attended to an item during study is re-attended at test.

As a testbed for these ideas I use the matched filter model (Anderson, 1970), old/new recognition (judging whether each test item was on the just-studied list or not) and list-composition effects. In list-composition experiments, each list is composed of items subject to a single experimental condition (pure lists) or items in both conditions (mixed lists), elaborated next. The matched filter model is an extremely impoverished model; the goal is not to support or test the matched filter model, itself. Rather, the simplicity of the model and its central dependence on similarity across item vectors distills the effects of attentional subsetting. Lessons learned can then be extended to more complicated models.

A major driver of research on recognition memory has been to explain the highly replicated null list-strength effect finding (Ratcliff et al., 1990), that a strength manipulation produces nearly the same difference in recognition memory when items were studied in pure lists of a single strength versus mixed lists containing items of both strengths. “Strength” refers to any pair of conditions wherein one condition (the stronger condition, which we label “D” to be reminiscent of deeper levels of processing) produces higher recognition accuracy than the other (the weaker condition, “S”, reminiscent of shallow levels of processing), most commonly, repeated presentations of an item or longer versus shorter presentation times. Less frequently, levels of processing is treated as a manipulation of strength (Ensor et al., 2021, Ratcliff et al., 1990). Ratcliff et al. (1990) quantified list-strength effects with a ratio-of-ratios index,

RoR = [d′(D, mixed) / d′(S, mixed)] / [d′(D, pure) / d′(S, pure)],

which is typically close to 1, a null list-strength effect. A RoR of 1 means that a strong item is recognized just the same whether it is embedded in a list of other strong items or a list with some strong and some weak items. This result was surprising because in models existing in 1990, including the matched filter model, strong items should benefit more in mixed lists, where half their competition comes from weaker items, than in pure lists. Weak items would be disadvantaged in mixed lists, competing against strong items. The near-null list-strength effect implied that recognition judgements are not susceptible to competition from other items within a list. This had a profound influence on the development of mathematical models of recognition, especially because a model had to still be able to explain why performance decreases with list length, the list-length effect.
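For concreteness, the index can be computed directly from the four condition-wise d′ values; a minimal sketch (the d′ values below are illustrative, not taken from Ratcliff et al., 1990):

```python
def ratio_of_ratios(dp_D_mixed, dp_S_mixed, dp_D_pure, dp_S_pure):
    """Ratio-of-ratios list-strength index (Ratcliff et al., 1990).

    Values near 1 indicate a null list-strength effect; values above 1
    indicate a positive list-strength effect."""
    return (dp_D_mixed / dp_S_mixed) / (dp_D_pure / dp_S_pure)

# Illustrative d' values: the strength advantage is identical in mixed
# and pure lists, so the index equals 1 (a null list-strength effect).
print(ratio_of_ratios(2.0, 1.0, 2.0, 1.0))  # -> 1.0
```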
Murdock and Kahana (1993) proposed that competition is present, but saturates over the course of prior lists, so the composition of the current list, per se, contributes very little to recognition. Other modellers constructed item representations to be orthogonal to one another (e.g., Chappell & Humphreys, 1994) but this compromises the list-length effect. Still others designed local-trace models that prevented item-traces from directly competing with one another, starting with Shiffrin and Steyvers (1997) and McClelland and Chappell (1998).

However, it may be overstating the data to talk about a null list-strength effect. RoRs are often around 1.1, albeit not statistically distinguishable from 1, and sometimes even below 1 (Ratcliff et al., 1990). Those small deviations may simply be measurement noise about a true value of 1. However, below we will see that there are good reasons to expect some true variability in list-composition effects. Rather than asking why the list-strength effect is null, a better question is why list-strength effects are often rather small, and what determines their magnitude and direction.

Moreover, some interesting exceptions are known. One example is the production effect, where participants either read words aloud or silently. This manipulation can produce a large positive list-strength effect (e.g., MacLeod et al., 2010). Articles since 2010 that have reported (near-)null list-strength effects have generally not cited the production effect as a contrasting finding. MacLeod et al. (2010), indeed, explicitly distinguished the production effect from manipulations that exhibit null list-strength effects, suggesting that production influences distinctiveness (dissimilarity between items in memory) rather than strength. But strength, itself, is a slippery term, and as already noted, has been operationalized several different ways. Models that were designed to explain null list-strength effects face a challenge in explaining results like the large production-effect list-strength effect. To foreshadow, with the attentional subsetting mechanism, near-null list-strength effects and large positive list-strength effects can be produced by the same model, operating the same way, differing only in terms of the size of the attentional subset relative to the full size of the feature-space that attention is operating within.

Writing vectors in boldface, let fi represent the full, n-dimensional vector representation of a particular item, i, such as a word (illustrated in Fig. 1a). As is common in episodic memory models, feature values, indexed in parentheses, fi(k), where k=1..n, are assumed to be independent (except when incorporating feature similarity between items), identically distributed (i.i.d.) with a mean of zero and a variance of 1/n so that they will be approximately normalized, |fi|≃1 and approximately (but not strictly) mean-centred (zero-mean). Features could be viewed as fine-grained as firing rates of individual neurons in the brain, but it is usually more helpful to think of them as reflecting activity of a population of neurons. To appreciate the paradox of dimensionality, consider that n, the total number of known features of an item may be quite large, say 100,000. This may seem like a large number, but consider that for word stimuli, a typical person’s vocabulary is in the tens of thousands. This existence proof, that people can distinguish such a large set of words, implies on the order of 100,000 or more dimensions of knowledge of words to avoid linear dependence. However, this considers a set of items that are all words. Words, compared to other conceivable items (faces, real-world objects, odours, colours, etc.) presumably have a large number of features in common (deviating from the independence assumption). They are composed of letters, they are readable and pronounceable, they can be combined to express complex concepts, etc. The dimensionality of the vector representations of words must be even larger to incorporate these common features.
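These representational assumptions are easy to sketch numerically. Here I draw features from a Gaussian, one common choice consistent with the stated mean and variance (the specific distribution, and the seed, are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_item(n):
    """Item vector f with i.i.d. zero-mean features of variance 1/n,
    so that |f| is approximately (not exactly) 1."""
    return rng.normal(0.0, np.sqrt(1.0 / n), size=n)

f = make_item(100_000)
print(np.linalg.norm(f))  # close to, but not exactly, 1
print(f.mean())           # close to, but not exactly, 0
```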

The matched filter model (described in more detail below) simply summates item vectors to store them in memory. In a standard old/new recognition task, the participant/model is presented with a probe item and asked to judge whether the item was on the target list (old, a “target”) or not (new, a “lure”). Old/new decisions are driven by the calculation of matching strength, the dot product (measuring similarity) of the probe vector with that episodic memory. This model thus very quickly achieves arbitrarily high performance,

d′ = (μtarget − μlure) / √[0.5(σ²target + σ²lure)],

as n increases (Fig. 2a), excelling as soon as the dimensionality of the vector representation comfortably exceeds L, the list length (n≫L). The intuition behind this is that with more dimensions, the angles between any randomly constructed pair of vectors will tend to be quite large; random vectors are quite dissimilar in high-dimensional space. This makes it easy for the model to discriminate targets, which have very small angles relative to the memory, from lures, which have large angles relative to the memory.
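A minimal Monte Carlo sketch of this behaviour, again assuming Gaussian features (the function name, seed, and trial count are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def dprime_matched_filter(n, L, n_trials=1000):
    """Monte Carlo estimate of d' for the matched filter model:
    memory = sum of the L study vectors; matching strength = probe . memory."""
    targets, lures = [], []
    for _ in range(n_trials):
        items = rng.normal(0.0, np.sqrt(1.0 / n), size=(L, n))
        memory = items.sum(axis=0)
        lure = rng.normal(0.0, np.sqrt(1.0 / n), size=n)
        targets.append(items[0] @ memory)  # probe with a studied item
        lures.append(lure @ memory)        # probe with an unstudied item
    t, l = np.array(targets), np.array(lures)
    return (t.mean() - l.mean()) / np.sqrt(0.5 * (t.var() + l.var()))

# d' grows with n once n comfortably exceeds the list length L
print(dprime_matched_filter(n=64, L=10))
print(dprime_matched_filter(n=1024, L=10))
```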

Thus far, the matched filter model appears too good to be a plausible model of behavioural data. However, features that all words have in common are, by definition, not diagnostic of one word versus another. If we partition the n dimensions into p dimensions that are common to all words and q dimensions that could distinguish words from one another, where p + q = n, it is clear that the similarity between pairs of word vectors is quite high. Because the item vectors are approximately normalized, fi ⋅ fj ≥ p/n for j ≠ i. This high amount of similarity makes items hard to distinguish from one another. For example, if 99% of features are common across the stimuli, including targets and lures, d′ drops drastically (to near chance for the range of n values plotted). This is because the numerator, μtarget − μlure, reflects only the non-shared features, while both lure and target variance increase because all items match memory essentially 99% like targets (Fig. 2b). Performance is quite similar to a hypothetical case of adding 99% noise to the original model. Even if, in principle, a highly similar pair of items i and j can be distinguished because they are not strictly linearly dependent, a more realistic model would also include some level of noise. The presence of noise reduces performance even more, and the more computational operations contribute to the calculation of matching strengths, the more noise enters the calculation. If the common features are included in the memory-judgement process, this will not only demand more computation but will introduce more noise into similarity judgements. The advantage due to the high dimensionality of item representations is undermined by this overwhelming similarity. On the other hand, if all p common dimensions could be ignored, the items could be acted upon with far less confusion.
If the task is to remember a 10-word list, then to distinguish the 10 words from one another, one needs at least 10 dimensions, but perhaps not many more than that. The fact that short lists can be mastered to a level of perfect accuracy suggests that episodic memory can function as though item representations are, in fact, of very low dimensionality, avoiding being swamped by the theoretically massive number of common features or the massive cumulative amount of noise at the feature level. But then, if the dimensionality of representations is too small, items will become confusable for the opposite reason: the representational subspace cannot support enough distinguishable vectors.
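The drop in d′ with overwhelmingly shared features can be checked numerically. In this sketch, the p common features are realized as a component resampled for each simulated list but identical across all items within it, targets and lures alike; this is one simple way to implement the p/q partition, and the specific construction is my assumption:

```python
import numpy as np

rng = np.random.default_rng(2)

def dprime_with_shared_features(n, L, frac_common, n_trials=1000):
    """d' for the matched filter when a fraction of features is
    identical across all items on a list (targets and lures alike)."""
    p = int(frac_common * n)
    targets, lures = [], []
    for _ in range(n_trials):
        common = rng.normal(0.0, np.sqrt(1.0 / n), size=p)
        items = rng.normal(0.0, np.sqrt(1.0 / n), size=(L, n))
        items[:, :p] = common              # overwrite the p common features
        lure = rng.normal(0.0, np.sqrt(1.0 / n), size=n)
        lure[:p] = common                  # the lure shares them too
        memory = items.sum(axis=0)
        targets.append(items[0] @ memory)
        lures.append(lure @ memory)
    t, l = np.array(targets), np.array(lures)
    return (t.mean() - l.mean()) / np.sqrt(0.5 * (t.var() + l.var()))

# With 99% shared features, d' collapses toward chance: the numerator
# shrinks to the q diagnostic dimensions while the variances inflate.
print(dprime_with_shared_features(n=1024, L=10, frac_common=0.0))
print(dprime_with_shared_features(n=1024, L=10, frac_common=0.99))
```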

I propose what I think is a non-controversial idea: that research participants do, in fact, adapt the functional dimensionality of their working representations of stimuli to trade off knowledge versus discriminability demanded by a particular task, in a rational, if often not optimal, way. We will follow these effects by introducing the idea of attentional masks applied to features, with notation following Caplan et al. (2022). Masks are written as vectors, w, of the same dimensionality, n, as the complete item vectors. Subscripts and superscripts will denote task-specificity. The values of w(k) could in principle be real-valued (positive, negative, or zero), but for tractability, values will be restricted to 1 or 0. At any given time within a task, a mask is applied via elementwise multiplication. Thus, an item fi masked by w can be written fĩ, where fĩ(k) = w(k)fi(k), depicted in Fig. 1b,c.
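The masking operation itself is a one-line elementwise product; a minimal sketch (the mask density, here 50 of 1000 features, is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(3)

n = 1000
f = rng.normal(0.0, np.sqrt(1.0 / n), size=n)  # full item vector

# Binary attentional mask w: attend to a small random subset of features.
subset_size = 50
w = np.zeros(n)
w[rng.choice(n, size=subset_size, replace=False)] = 1.0

f_masked = w * f  # elementwise multiplication: f~(k) = w(k) f(k)
print(int((f_masked != 0).sum()))  # -> 50 nonzero (attended) features
```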

In general, the mask could vary quite a lot from the study to the test phase of a memory paradigm, as well as for other reasons, including the participant’s expectations about the task and their recent experience, such as other stimuli presented recently or even simultaneously with an item of interest. This offers a very large number of degrees of freedom, which may well be psychologically plausible; but for this reason, I distinguish between the general framework and any particular instantiation of a model that incorporates the ideas within the framework. For a particular application, the sets of relevant/irrelevant features and their dynamics must be sufficiently constrained to produce a testable model. In many concrete examples, these constraints may be straightforward to identify (such as with visual stimuli comprised of a handful of features, e.g., Osth et al., 2023).

Consider an experiment involving lists of nouns. All features that designate the stimulus as a noun may be safely disregarded for any judgement such as recognition, the main focus in this manuscript. However, in a recall task, the noun-ness cannot be completely ignored; if it were, the participant might be tempted to “recall” a dance move (by performing it) or a tangible object (by handing an object to the experimenter). The participant’s broader knowledge is thus much higher-dimensional than the working dimensionality required for the main challenge of the task. However, the broader knowledge is important for constraining the participant’s behaviour. Now assume the lists were exclusively composed of names of birds. An optimal mask would now exclude features common to birds. However, a participant (or model) that does not identify this constraint within the stimuli would miss out on the opportunity to optimize their mask in this way, and would presumably be more susceptible to confusing birds with one another, and to producing stimuli other than birds as responses. Therefore, the subset of attended features will be far smaller than the full dimensionality of vector representations of items. That attention-driven subset, during the study phase of a task, gates which features can even be encoded. Then, the attention-driven subset during the test phase (which can be the same or different from that at study) determines which features can be used as retrieval cues or, as in the case of recognition that we will focus on here, compared to the memory to drive judgements.

In general, the more features are stored, the stronger and more specific the memory will be. Next, I derive the effects of putative manipulations that act in this way. Exploring the similarity structure of those stored vector representations, relative to one another and to potential probe stimuli, I stop after computing d′ for a hypothetical yes/no item-recognition task. I consider mixed- versus pure-list effects on d′ to understand list-strength effects. The next sections develop various instantiations of attentional subsetting as follows.
