The effect of topic familiarity and volatility of auditory scene on selective auditory attention

Selective auditory attention has been suggested to underlie the "cocktail party effect," which describes listening to a sound of interest in the presence of competing sounds (Cherry, 1953; Shamma et al., 2011). A growing body of research has observed that information from attended speech is more strongly represented in the brain than information from ignored speech (Ding & Simon, 2012a, 2012b; Mesgarani & Chang, 2012; Zion Golumbic et al., 2013). This observation indicates that selective attention modulates the neural representation of speech, reflecting the cognitive process of selecting the target auditory stream in a cocktail party environment (Obleser & Kayser, 2019). Additionally, attempts to explore how selective attention operates in acoustically more challenging environments, for example, where the speech of interest is masked by noise or degraded by reverberation (Ding & Simon, 2013; Fuglsang et al., 2017), allow us to understand how the modulatory effect of selective attention depends on the amount of accessible acoustic information about the target sound.

In contrast to these findings on bottom-up acoustic features, top-down influences on the neural representation of selectively attended speech are not well documented, even though it has long been argued that high-level cognition is also considerably involved in selective auditory attention (Deutsch & Deutsch, 1963). In fact, various top-down factors have been shown to affect the analysis of a complex auditory scene. For instance, voice familiarity can promote the segregation of speech in a competing-speaker environment, as demonstrated by behavioral and functional neuroimaging studies (Holmes & Johnsrude, 2021; Johnsrude et al., 2013; Newman & Evers, 2007), and musical knowledge, proposed to be processed by a brain network partially shared with speech (Patel, 2011), can aid the perception and cortical tracking of speech when noise or competing sounds are present in an auditory scene (Du & Zatorre, 2017; Puschmann et al., 2018). Among language-related factors, semantic information (or priming) has been shown to improve the intelligibility of speech in noise (Bhandari et al., 2021; Chan & Alain, 2021; Warzybok et al., 2021; Zekveld et al., 2011, 2013), and its neural substrates have also been relatively well documented (Obleser, 2014; Obleser & Kotz, 2010, 2011; Rysop et al., 2021). However, the advantage of semantic knowledge has not been consistently observed in the neural representation of selectively attended speech in competing-speaker environments, whether the target speech was primed (Wang et al., 2019) or experience with the language of the speech was varied (Reetzke et al., 2021; Zou et al., 2019).

This inconsistency across previous studies probably arises because linguistic factors at different levels were confounded, from lower segmental features (e.g., consonants, vowels, and phonemes) to higher structural knowledge (e.g., semantics and syntax), which precludes a clear understanding of these top-down influences. It is therefore essential to isolate each top-down factor and examine its influence. However, it is hard to dissociate intermingled top-down features, particularly in naturalistic speech, because linguistic information at different hierarchical levels is recursively combined to form higher-level structures and, therefore, covaries (e.g., Brodbeck & Simon, 2020). One feature worth considering is topic familiarity. A topic, a higher-level information structure in a text, arguably facilitates the processing of sentences by delimiting the context and, in turn, priming upcoming words (Brothers et al., 2015; Foss, 1982; Jordan & Thomas, 2002). However, this facilitatory effect has seldom been addressed in speech processing. We therefore examine the top-down influence of topic familiarity on the selective attention process by manipulating the topic familiarity of naturalistic speech.

The discussion so far has assumed that selective attention operates in stationary auditory scenes. However, the listening environments we encounter in everyday life are non-stationary, changing dynamically and irregularly. They comprise sound sources that vary in location (for a review, see van der Heijden et al., 2019), speaker (e.g., Shamma et al., 2011), and semantic content (e.g., Gregg & Samuel, 2009; Gregg & Snyder, 2012). Here, we define listening environments in which auditory scenes change dynamically and irregularly as volatile listening environments. The volatility of listening environments is important for extending our understanding of selective auditory attention because it can affect how attention is engaged to identify and follow an auditory object in a scene (Alain & Arnott, 2000; Best et al., 2008; Shinn-Cunningham, 2008). In contrast to static listening conditions, where selective attention can be enhanced over time by focusing on a continuous auditory object (Best et al., 2008), in volatile conditions selective attention can be weakened by the additional perceptual load of re-analyzing an altered auditory scene and identifying a new object (Lim et al., 2019, 2021; Shinn-Cunningham, 2008). Note, however, that volatile listening environments should be distinguished from regularly changing listening conditions because, in regularly changing environments, listeners can gain a perceptual benefit from predicting that the auditory scene is about to change (Choi & Perrachione, 2019; Winkler et al., 2009; Zhao et al., 2019).

Despite the potential implications for the selective attention process and its underlying neural mechanisms, the effects of volatile listening conditions have not been well documented. Several studies have shown that talker discontinuity can modulate evoked potentials in response to the target speech (Getzmann, 2020; Lim et al., 2021; Mehraei et al., 2018). However, the time windows in which this modulation was observed varied across these studies, and only Mehraei and colleagues (2018) found modulation of the N1 response, which is known to reflect early auditory processing and to be reduced in speech-in-noise perception (Koerner & Zhang, 2015). In addition, all of these studies used controlled speech stimuli, for instance, syllable and digit trains or word-and-digit pairs, which makes it hard to generalize their findings to naturalistic speech processing. Indeed, Teoh and Lalor (2019) found no evidence for an effect of talker discontinuity on the selective attention process using narrative naturalistic speech. Talker discontinuity may be one element that contributes to the volatility of listening environments. However, to introduce talker discontinuity, the previous studies manipulated only the location of the sound sources while keeping other features constant (Getzmann, 2020; Mehraei et al., 2018; Teoh & Lalor, 2019), which does not seem sufficient to make auditory scenes dynamic and irregular. We therefore altered multiple features, such as the source locations of the target sounds, the speakers, and the contents, to create volatile listening conditions, and examined how this listening volatility influences the neural representation of attended speech, as a first step toward characterizing the effect of listening volatility on the selective auditory attention process.

In this study, we investigated how topic familiarity and the volatility of the listening environment influence selective auditory attention and its underlying neural processes. To this end, we employed a dichotic listening paradigm with naturalistic narrative speech and recorded the listener's neural activity with electroencephalography (EEG). We manipulated topic familiarity by presenting stories that the listener had never known or heard before, and we created a volatile listening environment by randomly redirecting the listener's spatial attention across different contents and speakers. Since each condition was driven by different (perhaps even orthogonal) factors, we hypothesized that the neural mechanisms engaged for selective auditory attention would be distinct, although both conditions might strongly involve top-down processing. We investigated the cortical representations of the attended speech using a neural decoding approach and temporal response function (TRF) analysis. We tested the effect of each manipulation by comparing the neural decoding results and the TRFs with those in the control condition, in which the listener attended to one of two stories with high topic familiarity and without changes in spatial attention, contents, or voices (i.e., with low volatility).
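As background on the TRF analysis mentioned above: a forward TRF is commonly estimated as a ridge-regularized linear mapping from time-lagged copies of the stimulus (e.g., the speech envelope) to the neural response. The sketch below is a minimal illustration on synthetic data, not the pipeline used in this study; the function names (`lagged_matrix`, `fit_trf`) and the toy signal are ours.

```python
import numpy as np

def lagged_matrix(x, lags):
    """Design matrix whose columns are time-lagged copies of the 1-D signal x."""
    n = len(x)
    X = np.zeros((n, len(lags)))
    for j, lag in enumerate(lags):
        X[lag:, j] = x[:n - lag]  # positive lag: stimulus precedes response
    return X

def fit_trf(stim, resp, lags, lam=1e-3):
    """Ridge-regularized forward TRF: resp(t) ~ sum over tau of w[tau] * stim(t - tau)."""
    X = lagged_matrix(stim, lags)
    return np.linalg.solve(X.T @ X + lam * np.eye(len(lags)), X.T @ resp)

# Toy check: the "response" is a delayed, scaled copy of the stimulus,
# so the estimated TRF should peak at that delay with the scaling weight.
rng = np.random.default_rng(0)
stim = rng.standard_normal(2000)
resp = 0.8 * np.roll(stim, 5)
resp[:5] = 0.0  # zero out wrap-around samples from np.roll
w = fit_trf(stim, resp, lags=list(range(10)))
```

Here `w` recovers a single peak of roughly 0.8 at lag 5. In attention studies, the same model is typically fit per EEG channel (or inverted into a backward decoder that reconstructs the envelope from all channels), and attended versus ignored speech is compared via the resulting weights or reconstruction accuracies.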

Furthermore, we used another approach to assess changes in attentional state across listening conditions. Previous studies have observed asymmetric brain activation patterns for different directions of spatial attention (Das et al., 2016; Ding & Simon, 2012a; Power et al., 2012), implying a neural process specialized for attention to each ear. Based on these findings, we compared the asymmetric patterns of the TRFs across the listening conditions. This approach informed us of the listener's attentional state, that is, the extent to which the listener was engaged with the target speech during dichotic listening. For example, if the listener was fully engaged with the target speech, we would expect clear directional asymmetry of the TRFs: with no disruption while the listener attended to the sounds in either the left or the right ear, the neural activation pattern specific to each direction of spatial attention would be well captured. Otherwise, the directional asymmetry of the TRFs would be weakened by inadequate attention. Accordingly, we hypothesized that the latter would occur in volatile conditions because the increased perceptual load (through frequent re-engagement with a new auditory object; Best et al., 2008; Lim et al., 2019, 2021; Shinn-Cunningham, 2008) could hamper proper attention to the target speech.
