Power and optimal study design in iPSC-based brain disease modelling

Differences between commonly used iPSC-based study designs

Different study designs are currently used for iPSC-based disease modelling. Figure 1 outlines the most common designs for different research questions. Design 1 compares multiple iPSC lines derived from two donor populations (typically patients and controls). Designs 2A and 2B are isogenic comparisons that use gene editing, either to introduce genetic variants into standard control lines or to repair them in, e.g., patient-derived iPSCs. Design 2 is also applied in experiments that investigate treatments (e.g., testing a compound) in iPSC lines. Design 2A involves a single gene edit or treatment in one iPSC line. Design 2B expands this to a series of gene edits or treatments within one iPSC line. Lastly, Design 3 combines these features, comprising a series of isogenic pairs derived from different (genetically heterogeneous) individuals. Notably, these designs could be extended by including multiple clonal iPSC lines. This could serve to control for unwanted clone-specific aberrations that may occur during reprogramming, gene editing or iPSC passaging. However, using multiple clones inflates the false positive rate, and statistical power benefits are limited [21, 22]. Therefore, the current study focuses on the use of a single clone per individual.

Fig. 1: iPSC study designs address different research questions.

Design 1 (case–control) is characterized by including iPSC lines from multiple independent donors per condition. For design 2A (single isogenic pair), an iPSC line derived from a single donor is gene-edited (or treated), to create two iPSC lines that share the same genetic background and only differ at the gene locus of interest (or treatment). Design 2B is an extension of 2A, in which multiple variants (or treatments) are edited into the same genetic background. In Design 3, multiple isogenic pairs are created. The table represents the ability of each design to answer the research questions described on the left. Green: suitable; Orange: possible, but not optimal; Red: not possible.

These designs differ in (1) which types of disorders they can model, (2) to what extent conclusions can be generalized, (3) to what extent they can be applied for personalized medicine, and (4) which data structures they generate and statistical approaches they dictate. First, gene-edited isogenic designs (Designs 2A, 2B, and 3) can model monogenic disorders, but modelling polygenic disorders is challenging, and disorders for which the genetic component is not yet fully elucidated (idiopathic/sporadic disorders) cannot be modelled. Design 1 defines ‘cases’ based on diagnostic status and is therefore suitable for mono- and polygenic disorders as well as idiopathic disorders. Second, Designs 1, 2B, and 3 allow generalization of conclusions to the gene of interest, genetic background, and/or the patient population. Design 2B is uniquely suited to study effects of different genetic variants in one specific gene, but conclusions cannot be drawn beyond the specific genetic background used. Conversely, Design 3 is optimal to investigate the effect of a specific genetic variant in different genetic backgrounds. Designs 1 and 3 can model population-level genetic heterogeneity. Therefore, conclusions can be generalized to the overall patient population. Third, iPSC-based studies are also suited for personalized medicine: Designs 2A and 2B in particular are suited to test a single (Design 2A) or multiple (Design 2B) therapeutic options in a specific individual patient. Fourth, the designs differ in the way the data are collected. Consequently, data structures differ and this has statistical ramifications (see Supplementary textbox 1 and 2). Taken together, study designs differ in many dimensions, depending on research question and application. The choice between different designs has drastic consequences for data structure and statistical requirements. To assess how these choices affect statistical power and required sample sizes, a systematic analysis of variation sources and power is indispensable.

Estimation of the variance contributed by iPSC line and culture batch

To obtain representative estimates of data variation, we quantified variance in real experimental data obtained from three assays commonly used in iPSC-based studies: mass spectrometry proteomics (Fig. S1), morphological analyses using immunocytochemistry (Fig. S2), and synapse physiology using patch-clamp (Fig. S3). Measurements were taken from iPSC-derived neurons from five different individuals (from here on referred to as ‘lines’) and multiple culture batches. In a separate experiment, patch-clamp recordings were made from iPSC-derived neurons differentiated by NGN2 expression driven from a ‘Safe Harbour’ locus, an approach that allows for controlled dosage of NGN2 expression between neurons (Fig. S3, green boxplots).

Neurons were studied 6 weeks after differentiation. At this time point, neurons had a mean dendrite length of 1107–1311 µm and a mean synapse density of 0.0585–0.1146 synapses per µm (Fig. S2). Synapses were functional, showing spontaneous and evoked responses as well as short-term plasticity (Fig. S3). Mass spectrometry proteomics showed similar protein detection in both culture conditions (with and without a glia feeder layer), with 4079 proteins, i.e., 97% (4079 out of 4208) of the neuron-only proteins, detected in both conditions (Fig. S1C). After filtering for synaptic proteins using SynGO [23], 658 out of 674 (i.e., 98%) of synaptic proteins from neurons were detected in both conditions (Fig. S1D).

Together, these datasets serve as pilot experiments to estimate variance and subsequently perform power analyses to inform future studies. The total variation per parameter was quantified by the coefficient of variation (CoV, Supplementary Table 1). A statistical comparison of the CoVs for all measured parameters in neurons induced by lentiviral NGN2 expression or by expression from a ‘Safe Harbour’ locus revealed no significant difference in total variation between these iPSC lines (Fig. S4A). Figure S4 shows the variance measured in the present study together with previously published studies with similar culture methods and experimental readouts for both mouse primary and iPSC-derived neuron datasets (Fig. S4B–F).

We set out to quantify two known sources of variation in human iNeuron studies: batch and line. The variation contributed by multiple “culture batches” (Fig. 2A, B) was estimated as the proportion of variance (R²) explained by culture batch [17]. For morphological and synapse physiology parameters, the variance contributed by culture batch varied substantially between parameters, ranging from 0.2 to 13% (Fig. 2B; Supplementary Table 2). For these datasets, glia feeder layers were included, which were previously shown to promote neuronal maturation [24,25,26]. However, glia feeder layers may also increase the variance introduced by culture batch. To assess this, neurons cultured with and without glia were compared using proteomics. Indeed, PCA showed that 51% of the variance between the proteomics samples was explained by culture conditions. Moreover, glia feeders increased the variation contributed by culture batch effects: with glia, culture batch R² was 15%, as opposed to 5.5% without (Fig. 2C). Taken together, the different datasets show that including multiple culture batches adds variation to the data (as previously demonstrated: for review, see Volpato and Webber, 2020 [9]) and that this source of variation should be considered in statistical analyses (Supplementary textbox 1 and 2) by including culture batch as a covariate. In the context of a priori power analyses, culture batch variation can be controlled for by mean-centering the data per batch before using the estimated variance to define the expected effect size (Fig. 3, Step 2&3), and by including culture batch as a covariate when estimating dependency in the data (Fig. 3, Step 4).
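The batch correction described above can be illustrated with a minimal Python sketch. It is not the mixed-model R² of Nakagawa [17] used in this study, but a simpler sum-of-squares ratio for a single grouping factor, and the function name `batch_variance_share` and the example values are hypothetical.

```python
from collections import defaultdict
from statistics import fmean

def batch_variance_share(values, batches):
    """Return (share, centred) for one measured parameter.

    share   : between-batch sum of squares / total sum of squares
              (a simple fixed-effect R^2, NOT the mixed-model R^2 of [17])
    centred : values with their batch mean removed and the grand mean added back
    """
    grand = fmean(values)
    by_batch = defaultdict(list)
    for v, b in zip(values, batches):
        by_batch[b].append(v)
    batch_mean = {b: fmean(vs) for b, vs in by_batch.items()}

    ss_total = sum((v - grand) ** 2 for v in values)
    ss_batch = sum(len(vs) * (batch_mean[b] - grand) ** 2
                   for b, vs in by_batch.items())
    share = ss_batch / ss_total if ss_total > 0 else 0.0

    # Mean-centering per batch removes the batch offsets before the
    # remaining variance is used to define an expected effect size.
    centred = [v - batch_mean[b] + grand for v, b in zip(values, batches)]
    return share, centred

# Hypothetical synapse-density values from two culture batches
values = [0.08, 0.09, 0.10, 0.11, 0.12, 0.13]
batches = ["b1", "b1", "b1", "b2", "b2", "b2"]
share, centred = batch_variance_share(values, batches)
print(f"variance share explained by batch: {share:.2f}")
```

After centering, the same function returns a batch share of essentially zero for `centred`, which is the property relied on when batch-corrected standard deviations are used for effect sizes.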

Fig. 2: Culture batch and inter-individual variation contribute to the total variance.

A Schematic overview of the study, indicating sources of variation: inter-individual variation (i.e., variation introduced by including different individuals; orange box) and culture batch variation (i.e., variation introduced by acquiring data from different culture batches; purple box). Note that for this study, five iPSC-derived lines were used for morphological characterization, four lines for electrophysiology, and three lines for proteomic analyses. For the proteomics analysis, neurons were either cultured on a glia feeder layer (‘co-culture’) or on coating without glia (‘neuron-only’). B The contribution of different sources of variance is assessed by plotting the proportion of variance explained by culture batch (purple) and line (orange) for each parameter (as calculated following Nakagawa 2017 [17]; for details, see Supplementary Methods). Values per parameter are included in Supplementary Table 2. Together, the variance contributed by line and culture batch accounted for a median of 10% (IQR: 6.0–21%) of total variance. C The explained variance calculated for all proteins in the proteomics dataset from neuron-glia co-cultures (top violin plot) and neuron-only cultures (bottom violin plot). For co-cultures, the median variance explained by culture batch is 14.8% and by line 9.4%. For neuron-only cultures, the median variance explained by culture batch is 5.5% and by line 34.3%.

Fig. 3: Flow chart for a priori power analysis.

Flow chart describing the steps for performing power analysis per study design. Based on pilot data, variance estimates can be obtained and used to calculate effect sizes. For Designs 1 and 3, intra-cluster correlation (ICC) values should be taken into account. Power analyses for a wide range of study design scenarios can be performed using the freely available online application.

Designs 1 and 3 include data from multiple independent donors as another source of variation. As multiple neurons are derived from the same donor, inclusion of multiple donors introduces dependency in the data, i.e., dependency between data points derived from the same donor (Supplementary textbox 1 and 2). The degree of dependency is expressed as the intra-class or intra-cluster correlation (ICC; Supplementary textbox 1). In our data (Figs. S1–3), the ICC, i.e., the contribution of ‘iPSC line’ to the total explained variance, differed substantially between data types and specific parameters, ranging from 0.0 to 0.35 (Fig. 2B). Depending on the number of observations taken from each iPSC line, even limited dependency can result in inflated false positive rates or lower power [5, 6]. Hence, including the estimated ICC in power analyses is crucial to accurately determine the number of independent iPSC lines to be included to achieve sufficient power (Fig. 3, Step 4&5).
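To make the consequence of dependency concrete, the standard design-effect formula for equally sized clusters, 1 + (n - 1) × ICC, can be used to convert a clustered sample into an approximate number of independent observations. The Python sketch below is illustrative only; the function names are hypothetical and it is not part of the R scripts accompanying this study.

```python
def design_effect(n_per_line, icc):
    """Variance inflation for n_per_line observations per iPSC line (equal cluster sizes)."""
    return 1.0 + (n_per_line - 1) * icc

def effective_n(n_lines, n_per_line, icc):
    """Approximate number of independent observations carried by a clustered sample."""
    return n_lines * n_per_line / design_effect(n_per_line, icc)

# 4 lines x 50 neurons at the lowest and highest ICC used in the simulations
for icc in (0.01, 0.35):
    print(icc, round(effective_n(4, 50, icc), 1))
```

With an ICC of 0.35, 200 observations from 4 lines carry roughly the information of 11 independent observations, and this number can never exceed the number of lines divided by the ICC, however many neurons are measured per line.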

The variance estimates provided here for morphological, electrophysiological, and proteomic parameters can be used as a first guidance for power predictions of future iPSC-based studies. However, the contribution of independent iPSC lines and culture batches to the total variance varies considerably between parameters, assay types, and culture conditions, and may additionally vary as a function of, e.g., source material, donor characteristics, and reprogramming methodology. Additionally, for certain assay types, correction for multiple testing will affect the alpha level and thus alter the attainable statistical power. We therefore created a web tool delivering power curves for a wide range of parameter settings. This tool enables iPSC researchers to perform a priori power analyses using pilot data-derived parameter settings (Fig. 3, Step 6) for each of these designs, while accounting for clustering of the data points introduced by using multiple iPSC lines. Additionally, the R scripts used to perform our power simulations can be downloaded from the web tool, allowing researchers to tweak and add parameters to fit specific experimental circumstances and then perform customized power simulations for all scenarios. In the next sections, examples of power calculations are provided to illustrate the main determinants of statistical power for the four different designs.

Power analysis for case–control designs

To predict the statistical power for Design 1, we performed power simulations for a typical (hypothetical) case–control study: two experimental groups with N independent iPSC lines each and n observations per independent iPSC line (Fig. 4A). In a multilevel design, the power to detect mean differences on the dependent variable between cases and controls depends not only on N, n, and the effect size, but also on the ICC, i.e., the similarity of observations taken from the same iPSC line [5]. To select representative ICC values, the observed (batch-corrected) ICCs from Fig. 2B were sorted in ascending order (Fig. 4B) and three ICCs (low, medium, high) were chosen that cover the observed range (0.01, 0.15, and 0.35). Effect sizes of the mean group difference on the dependent variable were expressed as Cohen’s d, which divides the difference between group means by the pooled (batch-corrected) standard deviation. Notably, since Cohen’s d depends not only on the mean difference between the two groups but also on the variation in the data, d can vary considerably between experimental set-ups and parameters. For our power simulations, we selected effect sizes based on our (batch-corrected) morphology data, specifically the parameter ‘Synapse Density’, corresponding to selected mean group differences of 15%, 50%, and 70%. All power simulations were subsequently performed using a simplified statistical model that did not include any covariates, i.e., as one would do for data that have been corrected for possible covariates like batch effects.
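The conversion from a percentage mean difference to Cohen’s d can be sketched in a few lines of Python. The function names are hypothetical, and the baseline mean of 0.0836 synapses per µm in the example is an assumption back-calculated from the reported d of 1.1 for a 50% difference at SD = 0.038; it is not a value reported in this study.

```python
from math import sqrt
from statistics import fmean, variance

def cohens_d(group_a, group_b):
    """Cohen's d: difference in group means divided by the pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (fmean(group_a) - fmean(group_b)) / sqrt(pooled_var)

def d_from_percent_change(baseline_mean, sd, percent):
    """Cohen's d implied by a given percentage change of the baseline mean."""
    return (percent / 100.0) * baseline_mean / sd

# Assumed baseline mean (see text above); SD as reported for 'Synapse Density'
for pct in (15, 50, 70):
    print(pct, round(d_from_percent_change(0.0836, 0.038, pct), 2))
```

Because d scales inversely with the standard deviation, the same percentage difference translates into a different d for every parameter and every line, which is why effect sizes should be derived from pilot data for the readout of interest.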

Fig. 4: Power simulations to calculate statistical power of Design 1-type studies.

A Schematic overview of the study design for the power analysis. In this hypothetical scenario, two conditions (control and case) are compared. Within each condition, multiple individuals are sampled. The statistical power is calculated by a simulation experiment (1000 simulations per scenario) varying the number of iPSC lines (either 2, 4, 6, 10, 20 or 50 per group, i.e., 4, 8, 12, 20, 40 or 100 lines in total) and the number of observations per individual. To model a series of scenarios, the simulations were performed for three representative ICC values. B ICC values (as shown in Fig. 2B) per parameter sorted in ascending order. Three representative ICC values (reflecting low, medium and high clustering) were selected for the power simulation. C–K Simulated power curves, showing the relationship between statistical power and the total number of observations (number of iPSC lines times the number of observations per iPSC line). For each plot, the grey dotted line represents the cut-off value of 80% power. To assess statistical power for a range of effect sizes, three scenarios were compared, in which the two groups showed a mean difference of 15% (small), 50% (medium) or 70% (large). Corresponding Cohen’s d values were calculated using these mean differences and measured variance of the morphology parameter ‘Synapse Density’ (SD: 0.038) from the data example: 15% mean difference: d = 0.32; 50% mean difference: d = 1.1; 70% mean difference: d = 1.54.

For each simulation scenario, the power to detect a mean difference on the dependent variable between cases and controls was estimated for an increasing number of total observations, where the total number of observations is a function of the number of independent iPSC lines N (either 2, 4, 6, 10, 20 or 50 lines per experimental group; Fig. 4C–K), and the number of observations per line n. In each graph, the 80% power criterion is indicated, as this is conventionally considered acceptable power.

As expected, simulations generally showed that the lower the ICC and the larger the effect size, the higher the maximum power with the same number of independent iPSC lines N and observations n. In none of the scenarios was sufficient power reached to detect a mean difference of 15% based on our data (Cohen’s d of 0.32; Fig. 4C, F, I). Additionally, these simulations indicate that studies with only 2 independent iPSC lines per condition (dark red lines in Fig. 4C–K) are bound to fail to detect real effects, except when dealing with (very) large effects and parameters with low ICC values (ICC = 0.01; Fig. 4J, K). For medium to high ICC values, in many instances the power has an asymptote below 100% and thus reaches a point where adding more observations n per line does not yield more power. Including more independent iPSC lines (N) does, however, increase the maximum attainable power, and sufficient power can be reached to draw generalizable conclusions for disease modelling studies involving genetically heterogeneous iPSC lines. The effect of increasing the number of lines N is most noticeable for small effect sizes, but is observed for all effect sizes included: sufficient power to detect a medium-sized effect in the context of a high ICC can only be achieved by including a minimum of 10 independent lines per condition, whereas with fewer lines, the power plateaus below 80% (Fig. 4D). Overall, and in line with previous studies (e.g., [5]), across all effect sizes and ICC values, the inclusion of more independent iPSC lines N increases power more than inclusion of more observations per line n. For the same total number of observations (N*n), studies involving more lines consistently have a higher power in all scenarios.
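The origin of the power plateau can be illustrated with a closed-form approximation. The Python sketch below is not the multilevel simulation used for Fig. 4: it assumes that Cohen’s d is expressed relative to the total (between-line plus within-line) standard deviation, tests the difference between group means of line averages, and uses a normal rather than a t reference distribution. The latter overstates power when few lines are included, so the absolute values will not match the simulated curves. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def approx_power_case_control(n_lines, n_obs, d, icc, alpha=0.05):
    """Approximate power with n_lines per group and n_obs observations per line."""
    z = NormalDist()
    # Variance of one line average, in units of the total variance:
    # the between-line part (icc) does not shrink with more observations per line.
    var_line_mean = icc + (1.0 - icc) / n_obs
    se_diff = sqrt(2.0 * var_line_mean / n_lines)
    return z.cdf(d / se_diff - z.inv_cdf(1.0 - alpha / 2.0))

def plateau_power(n_lines, d, icc, alpha=0.05):
    """Limit of the approximation for infinitely many observations per line (icc > 0)."""
    z = NormalDist()
    return z.cdf(d * sqrt(n_lines / (2.0 * icc)) - z.inv_cdf(1.0 - alpha / 2.0))

# Same total number of observations (100 per group), split differently
print(approx_power_case_control(10, 10, d=1.1, icc=0.15))
print(approx_power_case_control(2, 50, d=1.1, icc=0.15))
```

In this approximation the ceiling depends only on d, the ICC and the number of lines, which mirrors the observation above that adding lines raises the attainable power whereas adding observations per line eventually does not.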

To assess the number of independent iPSC lines generally included in iPSC-neuron case–control studies, we performed a PubMed literature search in high-impact journals (Fig. S5). The number of independent iPSC lines per condition ranged from 1 to 14, with a median of 3 independent iPSC lines per condition (Fig. S5); 75% of the reviewed studies included 4 or fewer independent iPSC lines per condition. Our power simulations show that the majority of high-impact published iPSC case–control studies have included fewer iPSC lines than necessary to meet conventional power requirements.

Power analysis for isogenic designs

Next, power simulations were performed for Designs 2A and 2B. As for Design 1, (batch-corrected) synapse density data were used to apply realistic parameter settings for our simulations. Since in these designs only a single founder iPSC line is included, simulations were performed based on data both from the line showing the highest data variation (i.e., high variable line; C1 in Figs. S1–3) and the line showing the lowest variation (i.e., low variable line; C2 in Figs. S1–3). The variance was kept equal between experimental groups, based on the assumption that the variance did not change due to experimental manipulations like the gene editing process. For Design 2A, power was estimated for an increasing number of observations between the two conditions showing a simulated mean difference of 15% (Cohen’s d of 0.29 for high and 0.43 for low variable line), 30% (Cohen’s d of 0.58 for high and 0.86 for low variable line) and 50% (Cohen’s d of 0.96 for high and 1.43 for low variable line) (Fig. 5A–C). As expected, the total number of observations required to reach 80% power decreased as effect sizes increased. For the low variable line, 175 observations were required to achieve 80% power to detect a mean difference of 15% (Cohen’s d of 0.43), and 35 observations for a 50% mean difference (Cohen’s d of 1.43). The high variable line required higher numbers of observations to reach the 80% power cut-off, especially with smaller effect sizes: 400 observations for a 15% mean difference (Cohen’s d = 0.29) and 100 observations for a 30% mean difference (Cohen’s d = 0.58). Thus, the required number of observations in a single isogenic pair design scales with the anticipated mean difference and with the intrinsic variability of the founder iPSC line.
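For a single isogenic pair without clustering, the simulated sample sizes can be cross-checked against the textbook normal-approximation formula for a two-sided two-sample comparison, n per group = 2 (z₁₋α/₂ + z_power)² / d². The sketch below is such a cross-check in Python, with a hypothetical function name; it is not the simulation procedure used for Fig. 5 and it slightly underestimates the requirement when groups are small.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, power=0.80, alpha=0.05):
    """Observations per condition needed to detect Cohen's d (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)
    z_power = z.inv_cdf(power)
    return ceil(2.0 * (z_alpha + z_power) ** 2 / d ** 2)

# Cohen's d values reported above for a 15% mean difference
for label, d in (("low variable line", 0.43), ("high variable line", 0.29)):
    n = n_per_group(d)
    print(label, n, "per group,", 2 * n, "in total")
```

If the observation counts reported above are totals across both conditions, the formula gives 170 and 374 observations for d = 0.43 and d = 0.29, of the same order as the 175 and 400 read from the simulated power curves.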

Fig. 5: Power simulations to calculate statistical power of Design 2-type studies.

A–C Design 2A describes a comparison within a single isogenic pair. Simulated power curves are shown for three mean difference scenarios. The corresponding Cohen’s d values were calculated for an iPSC line showing low variability (SD of line C2 = 0.031; thick line) or high variability (SD of line C1 = 0.044; dashed). Corresponding Cohen’s d values: 15% mean difference: d = 0.29 (high) and 0.43 (low); 30% mean difference: d = 0.58 (high) and 0.86 (low); 50% mean difference: d = 0.96 (high) and 1.43 (low). D–G For Design 2B, a hypothetical study was simulated with three experimental groups for the high- and low-variability iPSC lines, as for Design 2A. Four scenarios were tested, comparing the impact of having a small (15%) or medium (30%) mean difference (D and E), and the impact of including two groups with the same (F) or different (G) effect sizes.

To illustrate the most important power considerations for Design 2B, we performed simulations for a hypothetical design of three conditions (e.g., control and two experimental conditions) showing different combinations of mean effects. For a scenario in which only one experimental condition had a 15% mean difference to control, a total of 240 observations were required for the low variable line (Fig. 5D). This was reduced to 60 observations for a 30% mean difference (Fig. 5E). When both experimental conditions showed a 30% mean difference, the total number of observations remained unchanged at 60 (Fig. 5F). However, if one experimental group had a 15% mean difference, in combination with a 30% mean difference for the other group, the number of required observations increased to 80 (Fig. 5G), which was higher than in the scenarios described in Fig. 5E and F. Thus, with three conditions, inclusion of multiple groups that are expected to show the same experimental effect size does not change power, whereas including groups that are expected to show differential effect sizes negatively impacts power to detect an overall effect. As expected, a substantially higher number of observations was required for the high variable line in all scenarios. Thus, the number of observations required to reach 80% power for Design 2B depended not only on the size of the anticipated mean differences and intrinsic iPSC line variability, as in Design 2A, but also on the pattern of mean differences between the experimental groups.
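Why the pattern of means matters can be seen from Cohen’s f, the effect size for an overall (omnibus) test across groups: the standard deviation of the group means divided by the within-group standard deviation. The Python sketch below assumes equal group sizes and expresses the group means in within-group SD units, using the d values reported for the low variable line; the function name is hypothetical.

```python
from math import sqrt
from statistics import fmean

def cohens_f(group_means, sd):
    """Cohen's f for equally sized groups: SD of the group means over the within-group SD."""
    grand = fmean(group_means)
    between_var = fmean([(m - grand) ** 2 for m in group_means])
    return sqrt(between_var) / sd

# Group means in SD units (control = 0); 0.43 and 0.86 are the d values
# for 15% and 30% mean differences in the low variable line.
f_one_30 = cohens_f([0.0, 0.0, 0.86], sd=1.0)     # one condition differs by 30%
f_both_30 = cohens_f([0.0, 0.86, 0.86], sd=1.0)   # both conditions differ by 30%
f_mixed = cohens_f([0.0, 0.43, 0.86], sd=1.0)     # 15% and 30%
print(f_one_30, f_both_30, f_mixed)
```

The first two patterns give the same f, whereas the mixed pattern gives a smaller f. Since the required number of observations scales approximately with 1/f², the mixed pattern needs about 4/3 as many observations, in line with the increase from 60 to 80 reported above.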

Power analysis for multiple isogenic pairs

As shown above, experiments using isogenic lines (Designs 2A and 2B) are superior in terms of attainable power to case–control designs (Design 1). However, because these designs feature only one line, the results are limited in terms of generalizability of findings compared to experiments using multiple genetically heterogeneous case and control iPSC lines. For example, the effect of a genetic mutation can differ considerably between individuals as a function of genetic background. As a paired design for multiple isogenic lines, Design 3 combines the benefits of both approaches (Fig. 6A). We performed simulations to assess the power for this study design. Simulation parameters were selected that were previously used for the power simulations for Design 1: a high ICC value (0.35), and either a medium (50% mean difference based on our dataset; Cohen’s d = 1.1) or small (15% mean difference in our dataset; Cohen’s d = 0.32) effect size. Besides estimation of the main effect of the introduction or repair of the genetic mutation on the outcome variable, Design 3 makes it possible to assess whether the effect of the genetic mutation is the same in all lines, or varies as a function of genetic background. Consequently, an additional variance parameter can be estimated in this type of design: the ‘slope variance’ (see Supplementary textbox 2), i.e., variance in the effect of ‘Condition’ between different isogenic pairs. To illustrate the effect of this variance parameter, four values were included: a very small (almost negligible) variance (0.001); a small (0.05) and a large (0.15) value, as previously used by Aarts et al. [6] based on guidelines of Raudenbush and Liu, 2000 [27]; and an extreme value (0.5). Simulations showed that, with a limited number of iPSC lines, Design 3 achieves much higher power than Design 1. For instance, to detect a Cohen’s d of 1.1, 80% power can be achieved with 4 isogenic pairs even if the slope variance is considerable (Fig. 6B–D), whereas power reaches a plateau under the same input conditions in a Design 1 situation even when a total of 12 lines (6 vs 6; Fig. 4D) is used. Thus, using multiple isogenic pairs considerably improves statistical power compared to a case–control design, limiting the number of lines required to detect true differences. However, it should be noted that this design does not improve power endlessly: to detect small differences (e.g., a mean difference of 15%; Fig. 6F–I), power still reaches a plateau, or a large number of observations is required to reach sufficient power. Moreover, extreme values of slope variance compromise power substantially. However, such extreme slope variation would likely compromise the interpretability of the experiment as a whole, as it implies extreme differences in the effects of the mutation in different genetic backgrounds. Our web application includes a wider range of scenarios, paralleling the ICC values and effect sizes also assessed for Design 1.
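The role of the slope variance can be sketched with a closed-form approximation. The Python code below rests on several assumptions that are not taken from the simulations in this study: the outcome is scaled to unit total variance (so the within-line residual variance is 1 - ICC), `n_obs` is the number of observations per line per condition, the slope variance is on the same standardized scale, and a normal reference distribution is used, which overstates power for few pairs. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def approx_power_isogenic_pairs(n_pairs, n_obs, d, icc, slope_var, alpha=0.05):
    """Approximate power for the mean edit effect across n_pairs isogenic pairs."""
    z = NormalDist()
    # The line-specific intercept is shared within a pair and cancels in the
    # within-pair difference; what remains is the slope variance plus the
    # sampling noise of the two condition means.
    var_pair_effect = slope_var + 2.0 * (1.0 - icc) / n_obs
    se = sqrt(var_pair_effect / n_pairs)
    return z.cdf(d / se - z.inv_cdf(1.0 - alpha / 2.0))

# Four pairs, 50 observations per line per condition, ICC = 0.35
for slope_var in (0.001, 0.05, 0.15, 0.5):
    print(slope_var, round(approx_power_isogenic_pairs(4, 50, 1.1, 0.35, slope_var), 3))
```

In this approximation the between-line variance no longer limits power, but the slope variance does: with many observations per line, the standard error approaches the square root of the slope variance divided by the number of pairs, so small effects still plateau unless more pairs are added.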

Fig. 6: Power simulations to calculate statistical power of Design 3-type studies.

A Schematic overview of the study design. Several independent iPSC lines are sampled and for each line, an isogenic ‘control’ is generated. Thus, two conditions (‘cases’ and ‘controls’) are compared, while accounting for clustering in the data that is due to the use of multiple individual iPSC lines. B–I The statistical power is calculated by a simulation experiment (1000 simulations per scenario) as for Design 1 (Fig. 4), for the highest ICC value from Fig. 4 (0.35), the high (SD of line C1 = 0.044) and low (SD of line C2 = 0.031) variable lines as for Design 2A, and two mean difference scenarios (50%: B–E; 15%: F–I). In addition, four different slope variance values are tested: 0.001 (negligible); 0.05 (‘medium’; Aarts et al. 2015); 0.15 (‘high’; Aarts et al. 2015); 0.5 (‘extreme’).

Statistical power considerations

Together, the power simulations in Figs. 4–6 illustrate several general conclusions on the impact of different factors on statistical power in iPSC-based disease modelling. First, the bigger the effect size, the lower the number of independent iPSC lines and total observations needed to reach sufficient statistical power. Second, including multiple iPSC lines leads to dependency in the data and results in a “power plateau”: a point in the power curve where adding more observations from the same lines does not increase power (Fig. 4C–K). Instead, increasing the number of independent iPSC lines does increase the maximum attainable power. Third, lower levels of dependency in the data (low ICC) support a higher maximum attainable power with the same number of iPSC lines N and observations n (Fig. 4C–K). Importantly, including more lines increases not only the statistical power to detect the experimental effect, but also the generalizability of the results (Fig. 1). In contrast, in isogenic designs only one founder iPSC line is included. In this design, increasing the number of total observations will increase statistical power but generalization of findings is limited (Figs. 5A–C; 1). Fourth, in isogenic designs, within-line variation should be taken into account because higher within-line variability results in lower statistical power for the same mean difference and sample size (Fig. 5A–G). Fifth, for isogenic series, the statistical power is affected by the pattern of means: inclusion of multiple groups that are expected to show the same experimental effect size does not affect power, whereas including groups with varying effect sizes negatively impacts power to detect an overall effect (Fig. 5D–G). Sixth, when using multiple isogenic pairs, the variance between pairs in the effect of the experimental manipulation (such as gene editing) affects the statistical power. The lower this random slope variance and the higher the effect size, the higher the maximum power with the same number of independent iPSC lines N and observations n (Fig. 6B–I). Lastly, for the same ICC, number of observations, and effect size, using multiple isogenic pairs results in substantially higher statistical power than using a case–control design (Figs. 4, 6). Although the power simulations presented here only cover a limited number of scenarios, these concepts are true for all possible combinations of settings. Our online tool can be used to explore and visualize these concepts for a wide range of scenarios.
