Large-scale benchmark yields no evidence that language model surprisal explains syntactic disambiguation difficulty

Language comprehension proceeds quickly and efficiently. A central factor invoked to explain this fact is prediction: by anticipating upcoming words, readers can rapidly integrate them into their interpretation of the sentence (Kutas, DeLong, & Smith, 2011). This explanation fits with the growing evidence that such next-word prediction is a fundamental principle of linguistic cognition (Dell et al., 2021, Pickering and Garrod, 2013) and has a key role to play in language acquisition (Chang et al., 2006, Elman, 1990). In parallel, much recent work has shown that language models – computational systems trained to predict the next word in a sentence – serve as a powerful foundation for language understanding by computers (Brown et al., 2020, Peters et al., 2018). The conjunction of these two trends has given rise to the hypothesis that there is a close correspondence between the predictive mechanisms used by language models and humans (Goldstein et al., 2022, Schrimpf et al., 2021). In this paper we ask, using predictability estimates derived from language models, to what extent human language comprehension at the sentence level can be explained by next-word prediction.

The hypothesis that prediction plays a central role in human language comprehension is supported by comprehenders’ pervasive sensitivity to word-level predictability, which is reflected in measures such as word-by-word processing difficulty (Ehrlich and Rayner, 1981, Staub, 2015) and the N400 electrophysiological response (Kutas et al., 2011). Traditionally, word predictability was estimated using the cloze task, in which participants were asked to provide the next word of a sentence given its preceding words (Taylor, 1953). As the quality of computational language models has improved, these models have been increasingly used as a proxy for human predictability (Goldstein et al., 2022, Goodkind and Bicknell, 2018, Smith and Levy, 2013). There is growing evidence that the component of a word’s processing difficulty that can be attributed to its predictability, as estimated by a language model, is proportional to the word’s surprisal (Hale, 2001, Levy, 2008), that is, the negative log probability the language model assigns to that word in context (Shain et al., 2022, Smith and Levy, 2013, Wilcox et al., 2020, Wilcox et al., 2023; though see Brothers and Kuperberg, 2021, Hoover et al., 2023); in this work, we adopt this linking function between predictability and reading times.
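Concretely, surprisal can be read off directly from any autoregressive language model. The sketch below illustrates one way to compute per-token surprisal; the specific model (GPT-2), the Hugging Face interface, and the conversion to bits are illustrative assumptions rather than the models or code used in this paper:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Illustrative model choice; any autoregressive language model could be substituted.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def token_surprisals(sentence: str):
    """Return (token, surprisal in bits) pairs for every token after the first."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(ids).logits                          # (1, seq_len, vocab)
    # Log probability of each token given its preceding context.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    token_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    surprisal_bits = -token_logp / torch.log(torch.tensor(2.0))  # nats -> bits
    tokens = tokenizer.convert_ids_to_tokens(ids[0, 1:])
    return list(zip(tokens, surprisal_bits[0].tolist()))

# Surprisal is typically high on words that are improbable in context, such as
# the disambiguating verb "conducted" in this garden path sentence.
for token, s in token_surprisals(
        "The experienced soldiers warned about the dangers conducted the midnight raid."):
    print(f"{token:>12}  {s:6.2f} bits")
```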

While there is compelling evidence that word predictability affects human language comprehension, just how much of language comprehension difficulty can be explained by word predictability remains an open question. On perhaps the strongest view of this matter, word surprisal is a “causal bottleneck” that explains most, if not all, of word-level processing difficulty (Levy, 2008, Smith and Levy, 2013). This strong view is appealing on parsimony grounds: Since prediction is independently necessary to explain findings from language comprehension and other cognitive domains (Bar, 2007), it is worthwhile to explore the extent to which it can account for findings that have traditionally been explained using other factors. Following this approach, surprisal has been invoked to qualitatively explain a number of phenomena in sentence processing. These phenomena, most of which are described in more detail below, include antilocality effects (Konieczny, 2000, Levy, 2008), garden path effects (Bever, 1970, Hale, 2001, Levy, 2013), the relative difficulty of object-extracted compared to subject-extracted relative clauses (Gibson, 1998, King and Just, 1991, Vani et al., 2021), and the so-called “ambiguity advantage effect” (Traxler, Pickering, & Clifton, 1998).

These qualitative accounts of processing difficulty associated with specific syntactic phenomena join quantitative studies based on measurements made while participants read natural texts, such as newspaper articles. These studies have found that up to 80% of the explainable variance in word reading times, and nearly 100% of the explainable variance in neural responses to sentences, can be predicted from the internal vector representations of next-word-prediction language models (Schrimpf et al., 2021); such findings have been taken to further suggest that prediction can explain much of sentence comprehension (though for a note of caution about the interpretation of such studies, see Section “Surprisal-based vs. embedding-based linking functions” and Antonello & Huth, 2023).

There are limits to the conclusions we can draw from studies that use materials from naturalistic sources, however. Such materials may contain predominantly simple, unchallenging structures, and at most a small number of low-frequency syntactic constructions (Futrell et al., 2021). Crucially, the predictions of cognitive theories often diverge most sharply in less frequent constructions (Levy, 2008, Levy et al., 2012). Even if the corpus does occasionally contain such examples, they are likely to be vastly outnumbered by syntactically simple sentences, and as such will have a negligible impact on the model’s fit to reading times (for a similar argument in the case of language model evaluation, see Marvin & Linzen, 2018).

Adopting a more targeted approach to the quantitative assessment of predictability as an explanatory account of syntactic processing difficulty, van Schijndel and Linzen (2021) tested the predictions made by surprisal for three types of garden path sentences. Such sentences contain a temporary syntactic ambiguity that is ultimately disambiguated towards a less preferred, and typically less likely, structure. They are referred to as garden path sentences because they are said to “lead the reader down the garden path” (that is, give the reader misleading signals). For example, in (1a) below, the word conducted signals that the most probable analysis of the preceding material (i.e., that the soldiers warned someone about the dangers) is incorrect; the correct analysis is the low-probability reduced relative clause parse (i.e., the soldiers were the ones warned about the dangers). Compare this sentence to (1b), a minimally different sentence that does not contain such an ambiguity.

(1a)

The experienced soldiers warned about the dangers conducted the midnight raid.

(1b)

The experienced soldiers who were warned about the dangers conducted the midnight raid.

Following prior work, we use the term garden path effect to refer to the amount of excess reading time triggered by the disambiguating material in (1a) relative to the baseline condition (1b), in which the syntax of the sentence is instead disambiguated before the critical word. Under the strongest version of the surprisal hypothesis, the excess processing difficulty on the disambiguating words in (1a) (conducted the midnight raid) can be fully explained by the fact that these words constitute a highly improbable continuation compared to the same words in (1b). In other words, for surprisal to truly link neural language models to the garden path effect, it needs to predict not only the existence of garden path effects, but also their full magnitude.
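To make the quantitative prediction explicit: under this strong view, the predicted garden path effect is the surprisal difference at the disambiguating words, scaled by a slope relating surprisal to reading time (in milliseconds per bit) estimated from independent data. The sketch below, with made-up surprisal values and an assumed slope, is intended only to spell out this arithmetic, not to reproduce any analysis in this paper:

```python
def predicted_garden_path_effect(surprisal_ambiguous: float,
                                 surprisal_unambiguous: float,
                                 ms_per_bit: float) -> float:
    """Predicted excess reading time (ms) at the disambiguating region:
    the surprisal difference times the fitted reading-time slope."""
    return ms_per_bit * (surprisal_ambiguous - surprisal_unambiguous)

# Illustrative (made-up) numbers: if "conducted" carries 14 bits of surprisal
# in (1a) but only 6 bits in (1b), and the fitted slope is 3 ms per bit, the
# account predicts a 24 ms garden path effect, to be compared against the
# empirically observed effect.
print(predicted_garden_path_effect(14.0, 6.0, ms_per_bit=3.0))  # 24.0
```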

van Schijndel and Linzen tested this hypothesis using surprisal estimates derived from long short-term memory (LSTM) recurrent neural network language models. While in their study surprisal correctly predicted that reading times on the disambiguating words in (1a) are longer than reading times on the same words in (1b), it predicted a much smaller excess processing cost for (1a) than was empirically observed (for similar results for other linguistic constructions, obtained using the maze task, see Wilcox, Vani, & Levy, 2021). They interpreted this substantial underestimation of processing difficulty by surprisal as indicating that processes other than prediction, such as syntactic reanalysis (Fodor and Ferreira, 1998, Paape and Vasishth, 2022), are recruited during the comprehension of syntactically complex sentences.

While van Schijndel and Linzen (2021) provide a blueprint for testing whether processing difficulty in complex sentences can be reduced to surprisal, the empirical scope of their work is limited in a number of ways. First, they examined only three garden path constructions out of the range of syntactically complex English constructions documented in the psycholinguistic literature. Second, they were unable to determine conclusively whether surprisal predicts the relative processing difficulty across different constructions: The two evaluation sets used by van Schijndel and Linzen, collected from 73 and 224 participants respectively, did not provide enough statistical power to draw conclusions about the relative difficulty of the three garden path constructions. Third, again due to limited power, they reported results only at the construction level, and did not examine whether surprisal can explain item-wise variability; this is despite the fact that, as we show below, language models’ predictability estimates vary not only from construction to construction, but also from item to item within the same construction (Frank and Hoeks, 2019, Garnsey et al., 1997). Finally, their ability to compare processing difficulty across constructions was limited by the fact that each of the constructions was read by a different set of participants, precluding within-subjects comparisons.

This is a typical situation in psycholinguistics: Datasets from existing experiments with classic factorial designs, which enable researchers to carefully control irrelevant factors and isolate the comparisons of interest, typically involve a relatively small number of participants. Such datasets sometimes do not even afford enough power to test coarse, directional predictions at the construction level (Vasishth, Mertzen, Jäger, & Gelman, 2018), let alone the precise quantitative predictions at the construction and item level that can be derived from language models. For all these reasons, a thorough empirical test of the surprisal hypothesis requires a new data collection effort.

Motivated by these issues, we present the Syntactic Ambiguity Processing (SAP) Benchmark, a large-scale dataset that consists of self-paced reading times for a range of constructions that have motivated psycholinguistic theories. This benchmark seeks to strike a balance between classic factorial designs and broad-coverage model evaluation that prioritizes explaining item-level variability. Our goal is to create a dataset that will yield effect size estimates precise enough to evaluate the predictions of language models at the level not only of constructions but also of individual items. Unlike most prior work, we have the same participants read all of the types of constructions included in the experiment; this makes it possible to carry out within-participant comparisons of the magnitude of effects across constructions. Overall, by including various syntactic phenomena in the same study, and analyzing reading times in the same way across constructions, we can address more directly the question of whether prediction can serve as a unified account of language comprehension. Beyond the specific theoretical question we set out to address concerning the scope of the explanatory power of predictability, we see this dataset as a standard yardstick against which any quantitative theory of human sentence processing can be evaluated.

In summary, we aim to address four central questions regarding prediction in language comprehension, using surprisal estimates from neural language models to operationalize next-word prediction (Hale, 2001, Levy, 2008; for alternative ways to operationalize prediction, see Brothers and Kuperberg, 2021, Hoover et al., 2023 and Section “Implications for theories of sentence processing”).

First, we ask to what degree processing difficulty can be explained by surprisal in some key constructions that have driven psycholinguistic theorizing. Our dataset includes the three garden-path constructions examined by van Schijndel and Linzen (2021); this subset of the SAP Benchmark can be seen as a high-power replication of their work, with materials that are more tightly matched across constructions (see Section “Materials”). In addition to these three constructions, we also evaluate whether surprisal can explain the difficulty of object-extracted relative clauses compared to subject-extracted ones, the ambiguity advantage in relative clause attachment, and the ungrammaticality penalty in subject-verb agreement dependencies.

Second, we ask whether language model surprisal can correctly predict the relative difficulty among the three garden path constructions. While in van Schijndel and Linzen’s study language models made predictions that appeared not to match the rank order of human processing difficulty across constructions, their analyses had limited statistical power to detect differences between constructions. This issue is addressed in the current large-scale study, which has 8000 observations per condition.

Third, while van Schijndel and Linzen used only LSTM language models, we also evaluate a more powerful language model based on the Transformer architecture. This makes it possible to examine whether our conclusions with regard to surprisal theory are sensitive to the technical aspects of the model used to derive surprisal estimates (see Section “Computing language model surprisal”).

Finally, we ask how well language model surprisal can explain item-wise variation in processing difficulty within the same syntactic construction. Existing work evaluating the item-level predictions of surprisal on targeted linguistic contrasts has been limited to small sample sizes (Frank & Hoeks, 2019). In this study, we collect between 220 and 440 observations per item. As we show below, this yields effect size estimates for individual items that are far more precise than has previously been possible, and enables robust item-wise analyses.
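As an illustration of the logic of such item-wise analyses (not the analysis code used in this study), the following sketch compares, for each item, the empirical garden path effect at the critical region with the effect predicted from the language model’s surprisal difference; the column names and the milliseconds-per-bit slope are assumptions made for the example:

```python
import pandas as pd

def itemwise_fit(trials: pd.DataFrame, ms_per_bit: float) -> pd.DataFrame:
    """trials has one row per trial, with assumed columns: item, condition
    ('ambiguous' or 'unambiguous'), rt_critical (ms), surprisal_critical (bits)."""
    by_item = (trials
               .groupby(["item", "condition"])[["rt_critical", "surprisal_critical"]]
               .mean()
               .unstack("condition"))
    # Empirical garden path effect per item: ambiguous minus unambiguous RT.
    observed = (by_item[("rt_critical", "ambiguous")]
                - by_item[("rt_critical", "unambiguous")])
    # Surprisal-predicted effect per item, scaled by the assumed reading-time slope.
    predicted = ms_per_bit * (by_item[("surprisal_critical", "ambiguous")]
                              - by_item[("surprisal_critical", "unambiguous")])
    out = pd.DataFrame({"observed_ms": observed, "predicted_ms": predicted})
    out["residual_ms"] = out["observed_ms"] - out["predicted_ms"]
    return out

# Correlating observed_ms with predicted_ms across items (e.g., with DataFrame.corr())
# indicates how well surprisal tracks item-to-item variation; large positive residuals
# flag items whose difficulty surprisal underestimates.
```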
