How reliable are standard reading time analyses? Hierarchical bootstrap reveals substantial power over-optimism and scale-dependent Type I error inflation

Reading time data are widely used across the cognitive and neurosciences, medical contexts, and research on education. They have informed theories of language understanding, play a key role in testing theories of reading, and lend insight into pedagogical questions. However, many of the most common methods used to analyze reading times (RTs)—such as t-tests, analysis of variance (ANOVA), or linear mixed models (LMM)—assume that residual RTs are normally distributed with variance that is independent of the mean (homogeneity of variance). Both the normality and the homogeneity assumptions are likely to be false. Reading is a complex task that spans perceptual, cognitive, and motor processes. To the extent that the component processes are not perfectly informationally encapsulated, their contributions are not expected to be purely additive in raw RTs, making raw RTs unlikely to follow a normal distribution (Stephen & Mirman, 2010). RTs are also known to exhibit a soft lower bound and a positive skew, both of which are unexpected under the assumption of normality (similar properties are found for related psychometric data, such as reaction times in two-alternative forced choice tasks, Ratcliff & Smith, 2004; multi-choice tasks, Usher & McClelland, 2001; picture or color naming times, Heathcote et al., 1991, Snodgrass and Yuditsky, 1996).
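
As a concrete illustration of these two assumptions, the following is a minimal sketch (in R, using lme4) that fits an LMM to raw per-word RTs and plots the standard residual diagnostics. The data frame `spr` and its column names are hypothetical placeholders, not any particular data set analyzed here.

```r
# Minimal sketch: checking normality and homogeneity of variance of the
# residuals of an LMM over raw RTs. `spr` and its columns (rt, condition,
# subject, item) are hypothetical placeholders.
library(lme4)

m_raw <- lmer(rt ~ condition + (1 | subject) + (1 | item), data = spr)

# Normality of residuals: the positive skew of RTs shows up as a curved
# upper tail in the Q-Q plot.
qqnorm(resid(m_raw)); qqline(resid(m_raw))

# Homogeneity of variance: a fan-shaped spread of residuals against fitted
# values indicates that the residual variance grows with the mean.
plot(fitted(m_raw), resid(m_raw))
```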

These and related considerations have motivated a number of alternative approaches to RT analyses, ranging from relatively simple to increasingly complex. On the more advanced end, researchers have proposed time series and related models that correct for the lack of independence between temporally adjacent observations, capturing that each RT can reflect processing of not just the current input, but also preceding input (“spillover” analyses, Ehrlich and Rayner, 1983, Mitchell, 1984; generalized additive mixed models with auto-correlations, Baayen et al., 2016; or continuous-time deconvolutional regression, Shain & Schuler, 2021). Others have explored approaches that analyze RTs as a sum of several independent processes (e.g., log-shift, Ex-Gaussian, or other mixture models; see Nicenboim and Vasishth, 2018, Rouder, 2005, Staub and Benatar, 2013), or developed process models of reading that can be fit against RTs or eye-movements during natural reading (e.g., SWIFT, Engbert et al., 2005; ACT-R, Lewis et al., 2013, Lewis and Vasishth, 2005; EZ-READER, Reichle et al., 2003). Each of these more advanced approaches offers researchers unique opportunities to better understand their data and to increase the reliability of their analyses. They do, however, also come with challenges, such as additional computational complexity and the need for additional statistical training.

While the future of reading analyses likely lies in these more advanced approaches (for an excellent review, see Shain & Schuler, 2021), it does not appear that this future is imminent: the majority of reading research continues to employ analysis approaches that are computationally less demanding and require less expertise to interpret. The most common of these simpler approaches employ variants of the linear model—such as t-tests, ANOVAs, or LMMs—over raw, inverse-, or log-transformed RTs. Recent informal surveys of the field suggest that linear models over raw or power-transformed RTs account for as much as 98% of RT analyses (Nicklin & Plonsky, 2020; see also Liceralde & Gordon, 2022). The same reviews suggest that analyses over raw, untransformed RTs remain the most common approach—despite the obvious problems with the assumptions they make—followed by analyses over log-transformed RTs. Given the continued prevalence of these approaches, the present work aims to shed light on how this choice affects the reliability—i.e., the Type I error rate and statistical power—of RT analyses. After all, ease of interpretability—sometimes invoked as an argument in favor of simpler approaches (e.g., Osborne, 2002)—is only of value if the interpretations drawn from the data are valid (see also Lo & Andrews, 2015). We thus assess whether one of the two approaches—analyses over raw or log-transformed RTs—is to be consistently preferred over the other in terms of Type I error rates and/or power. Inflated Type I error rates of either approach would call into question theories that are built on those analyses and findings. And inflated power estimates would lead researchers to be over-confident in their results, a matter that has only gained in relevance with an increasing focus on replicability. To further contextualize our results, we compare both approaches to a simple alternative described in more detail below, the log-shift transform.
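
To make the comparison concrete, the following is a minimal sketch (in R, using lme4) of the three analysis variants just introduced: an LMM over raw RTs, over log-transformed RTs, and over log-shift-transformed RTs. The data frame `spr`, its column names, and the shift value are hypothetical placeholders rather than our actual analysis code.

```r
library(lme4)

# LMM over raw RTs (identity transform)
m_raw <- lmer(rt ~ condition + (1 | subject) + (1 | item), data = spr)

# LMM over log-transformed RTs
m_log <- lmer(log(rt) ~ condition + (1 | subject) + (1 | item), data = spr)

# LMM over log-shift-transformed RTs. The shift psi is treated as known here
# purely for illustration (it must lie below the smallest RT); in practice it
# would be estimated from the data, as described below.
psi <- 150
m_shift <- lmer(log(rt - psi) ~ condition + (1 | subject) + (1 | item), data = spr)
```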

The findings we present below differ in important ways from previous work, and we present evidence that this is likely due to common (unvalidated) parametric assumptions made in previous work about the distribution of RTs. We find that neither LMMs over raw nor LMMs over log-transformed RTs offer a one-size-fits-all analysis solution. Both can lead to substantial Type I error inflation and wasted power, depending on the experiment’s design. Indeed, there are reasons to believe that even more advanced methods like the ones described above are unlikely to completely resolve these issues. Our results do, however, suggest a pattern as to when each of the two analysis approaches is to be preferred.

Based on our results, we estimate that a large proportion of RT studies might be underpowered, and that analyses with interactions additionally suffer from inflated Type I errors. Reading researchers should either employ simulation studies like ours to show that the issues we identify do not apply to their data, or demonstrate that their results are robust to at least the most common transformations. If the latter is not the case, researchers should clearly motivate how their theoretical assumptions justify the interpretation of their results, and/or carefully discuss why different analysis approaches yield different results for their data (see also Baayen et al., 2016, Staub, 2021). Beyond improving the reliability of RT analyses, this will also facilitate comparison of results across studies, which suffers when different studies (even by the same authors—our own work included) employ different types of analyses.

We present four statistical simulation studies (complemented by extensive auxiliary studies in the supplementary information, SI). We focus on RTs in self-paced reading paradigms (SPR), though future studies could employ the hierarchical bootstrap approach we present to validate analysis approaches for eye-tracking during reading. SPR continues to be frequently used in reading research. It is inexpensive, easy to implement, and yields a single RT measure.

Study 1 begins by characterizing the distribution of RTs in three different SPR data sets. While these questions have received attention for simpler psychometric tasks (Baayen and Milin, 2010, Brysbaert and Stevens, 2018, Kliegl et al., 2010, Lachaud and Renaud, 2011, Lo and Andrews, 2015, Rouder, 2005, Rouder et al., 2005, Schramm and Rouder, 2019, Wagenmakers and Brown, 2007), it is by no means clear that an ability as complex as reading yields distributions that resemble those of simpler two-alternative forced-choice tasks (see also Wagenmakers et al., 2005). Together, the three RT data sets span three common types of SPR experiments that psycholinguistic research draws on: factorial experiments conducted in the lab, factorial experiments conducted over the web via crowdsourcing, and studies conducted over reading corpora.

We first ascertain that RTs in all three data sets indeed violate the assumptions of normality and homogeneity. We then introduce the non-parametric hierarchical bootstrap approach employed in the remainder of the article. We use this approach to compare the distribution of bootstrapped natural RTs against the distribution of RTs that are parametrically generated under common power transformations. We find that none of the common transformations yields distributions that match those of natural RTs, though the log-transform provides a better fit than other common power transformations. This motivates the question we address in the remainder of the article: do the distributional properties of RTs have detrimental consequences for the reliability of the most common approaches to RT analyses? Not all unmet analysis assumptions have practical consequences for statistical power and Type I error rates. For example, while linear mixed models can be relatively robust to violations of normality (Knief & Forstmeier, 2018), heterogeneous variances are known to inflate Type I error rates for analyses of categorical data (e.g., if ANOVAs or LMMs are used to analyze binomially distributed responses, Dixon, 2008, Jaeger, 2008).
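
For readers who want a sense of what the non-parametric hierarchical bootstrap involves, the following is a simplified sketch in R: subjects are resampled with replacement, and each resampled subject's trials are then resampled with replacement within condition. Column names are hypothetical placeholders, and this sketch is a stand-in for, not a copy of, the implementation shared via OSF.

```r
# Simplified sketch of a hierarchical bootstrap over an SPR data set `spr`
# with columns subject, condition, and rt (all names hypothetical).
hierarchical_bootstrap <- function(d) {
  # Step 1: resample subjects with replacement.
  subjects <- sample(unique(d$subject), replace = TRUE)
  do.call(rbind, lapply(seq_along(subjects), function(i) {
    d_s <- d[d$subject == subjects[i], ]
    # Step 2: within each condition, resample that subject's trials.
    resampled <- do.call(rbind, lapply(split(d_s, d_s$condition), function(d_sc) {
      d_sc[sample(nrow(d_sc), replace = TRUE), ]
    }))
    # Relabel so that repeatedly drawn subjects count as distinct subjects.
    resampled$subject <- i
    resampled
  }))
}

boot_sample <- hierarchical_bootstrap(spr)
```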

Studies 2–4, as well as five auxiliary studies in the SI, address this question for the two power transformations (Box & Cox, 1964) that are most commonly employed in RT analyses: the identity transform (raw RTs) and the log-transform. Study 2 begins to investigate the consequences of naturally distributed RTs for power and Type I error rates in a simple by-2 design (a single two-level manipulation, e.g., comparing treatment against control). Previous work has addressed this question for simpler psychometric paradigms (Brysbaert and Stevens, 2018, Lachaud and Renaud, 2011, Liceralde and Gordon, 2022, Ratcliff, 1993, Schramm and Rouder, 2019). These studies exclusively employed parametrically generated data. Liceralde & Gordon (2022), for example, fit LMMs to reaction time data and then generate new reaction times from this parametric model (while adding the assumption that trial-level residuals follow a Gamma distribution). Type I and power analyses are conducted over these parametrically generated reaction times. This raises questions about the extent to which the results from these studies generalize to actual (rather than parametrically generated) reaction time data, as collected in experiments. Study 2 begins to address this question for reading times by means of a non-parametric hierarchical bootstrap. This approach avoids specific parametric assumptions about the distribution of RTs.
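
The parametric data-generation strategy used in previous work can be sketched as follows (in R): an LMM is fit to the observed RTs, and new RTs are then simulated from the fitted model with trial-level residuals drawn from a Gamma distribution. Variable names and the moment-matched Gamma parameterization are our illustrative assumptions, not the cited authors' exact procedure.

```r
# Sketch of parametric RT generation: simulate new RTs from a fitted LMM,
# assuming Gamma-distributed trial-level noise (illustrative only).
library(lme4)

m <- lmer(rt ~ condition + (1 | subject) + (1 | item), data = spr)

mu       <- predict(m)   # fitted means, including subject and item effects
sd_resid <- sigma(m)     # residual standard deviation of the fitted LMM

# Moment-matched Gamma parameters, so each simulated RT has mean mu and
# standard deviation sd_resid (assumes all fitted means are positive).
shape <- mu^2 / sd_resid^2
rate  <- mu / sd_resid^2

spr$rt_parametric <- rgamma(nrow(spr), shape = shape, rate = rate)
```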

Study 3 confirms that the results of Study 2 would indeed differ if we had used parametrically generated data instead of bootstrapped, naturally distributed RTs. This offers an explanation as to why our results support different conclusions than previous work, and it highlights the need to carefully consider the approach to data generation when evaluating statistical power and Type I errors. Study 3 also has far-reaching consequences for reading research: we find that reading research might have routinely and substantially over-estimated statistical power (the issues we identify hold in addition to other widespread issues with power estimates, such as the “significance filter”, Vasishth et al., 2018).

Finally, Study 4 moves beyond the simple by-2 design and tests how statistical tests of interactions are affected by assumptions about the distribution of RTs. Interactions, or a lack thereof, are often used to argue for or against theories. But interactions are also known to be particularly vulnerable to inadequate assumptions about the data. For example, a frequent finding is that conditions with overall slower RTs are more strongly affected by a manipulation than conditions with overall faster RTs. While such differences are routinely interpreted as meaningful, they can be a consequence of the soft lower bound of RTs: in conditions in which reading is already fast, it is difficult to detect further increases in reading speed. As we discuss as part of Study 4, reasoning about interactions is further complicated by their “scale-dependence” (e.g., Loftus, 1978)—an issue that is conceptually independent of, but might in practice interact with, assumptions about the distribution of RTs (e.g., Lo and Andrews, 2015, Staub, 2021, Sternberg, 1969b), to which we return in the general discussion. These issues are likely not specific to linear model analyses, but are expected to extend to most advanced analyses (including all of the approaches mentioned above). They are thus likely to persist even if the field eventually embraces those more advanced approaches.
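
The following toy simulation (in R, with made-up numbers) illustrates the scale-dependence point: a manipulation that slows reading by the same proportion in a fast and a slow baseline condition produces an apparent interaction on the raw millisecond scale but none on the log scale.

```r
# Toy illustration of scale-dependent interactions; all numbers are made up.
set.seed(1)
n <- 1000
fast_base <- 300; slow_base <- 600; effect <- 1.10  # same 10% slowdown in both

rt <- c(rlnorm(n, log(fast_base), 0.3), rlnorm(n, log(fast_base * effect), 0.3),
        rlnorm(n, log(slow_base), 0.3), rlnorm(n, log(slow_base * effect), 0.3))
baseline     <- rep(c("fast", "fast", "slow", "slow"), each = n)
manipulation <- rep(c("control", "treatment"), each = n, times = 2)

# Raw scale: the ms effect is roughly twice as large in the slow condition
# (an apparent interaction); log scale: the effect is ~log(1.10) in both.
tapply(rt,      list(baseline, manipulation), mean)
tapply(log(rt), list(baseline, manipulation), mean)
```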

The source data, simulation summaries, and the R code for our simulation studies are shared via OSF (https://osf.io/uymfp/). The three source data sets from which we bootstrapped contain a total of almost 1 million per-word RTs from over 400 subjects and 600 sentence items. The R code includes general code for both the hierarchical bootstrap and parametric generation of RTs, allowing researchers to conduct analyses similar to ours for their own data. The code affords flexible parallelization over multiple (local or remote) cores via R’s future package (Bengtsson, 2019). We hope that this will help the field to jointly build a stronger understanding of how the distributional properties of RTs affect statistical power and Type I errors, and the extent to which these effects depend on the experimental paradigm.
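
As an illustration of how such simulations can be parallelized with the future framework, consider the following sketch; the function hierarchical_bootstrap() and the model formula are the hypothetical placeholders from the sketches above, not the interface of the shared code.

```r
# Sketch: run bootstrap replicates in parallel with the future framework.
library(future)
library(future.apply)
library(lme4)

plan(multisession)  # local cores; other plans distribute over remote workers

results <- future_lapply(1:1000, function(i) {
  b <- hierarchical_bootstrap(spr)                       # placeholder from above
  fixef(lmer(log(rt) ~ condition + (1 | subject) + (1 | item), data = b))
}, future.seed = TRUE)                                   # reproducible parallel RNG
```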
