In this section, following the workflow chart in Fig. 1 from top to bottom, we provide specific references and worked examples of sample size calculations. Each subsection corresponds to one node in Fig. 1; and one row in Table 3 shows key formulae or references for sample size and power calculations for the specific hypothesis being tested.
Table 3 Key formulae or references for sample size and power calculations for the specific hypothesis being tested.Comparing microbial community structure between groups versus within groups using beta-diversity or distance metrics (Fig. 1 , node 2)The most commonly used analytic approach when working with the full spectrum of microbiome counts is to use beta-diversity, or measures of distance or dissimilarity between samples. To estimate sample size or power, one must choose first a distance metric, then find or generate plausible distances that are relevant for the proposed study, that is, obtain likely distances between pairs of samples. Sample size calculations are then based on the distributions of these distances, by comparing distances for pairs from the same group to pairs from different groups. The first row of Table 3 (Comparing microbial community structure between groups versus within groups using betadiversity or distance metrics (Fig. 1, node 2)) describes how to setup sample size calculations for this situation, with links to recommended methods and formulae.
Simple calculations that assume well-behaved distances (e.g., normally distributed distances for pairs in the same group) can be performed using information that may be easy to extract from published papers. If the means and standard deviations of beta-diversity distances are reported, then sample size calculations based on Equation A (Table 2) can be used.
An analysis of variance (ANOVA) compares variability within a group to variability between groups, which is exactly the concept desired for beta-diversity analyses. Furthermore, when comparing only two groups, there is an algebraic equivalence between sample size calculations using the t-test (Equation A, Table 2), derived from an ANOVA F-test, or based on correlation (Equation G, Table 2), i.e., the square root of the model-captured R2 17; F-statistics or R2 values are often reported in publications, for example see Table 5 in Sugino et al.18. Box 1 describes how the squared correlation, R2, is related to the effect size Δ from Equation A. It is worth noting that one should only use the R2 reported in a regression model for a microbiota beta-diversity sample size calculation when the publication has used the same distance measure as planned for one’s own study.
Madan et al. compared beta diversity between infants who were born by vaginal delivery versus cesarean section19, and we use these data to illustrate calculations with Equation A. Means and standard deviations can be extracted by eye from Fig. 1b in their paper, and are shown in Table 4. Madan et al. did not provide estimates of R2 19.
Table 4 Means and standard deviations of Unifrac beta diversity among 6-week-old infants by mode of delivery, extracted from Fig. 1b of Madan et al.19.Using these values, we can estimate the sample size required to compare vaginal birth and cesarean section with 80% power. To perform conservative calculations, we assume a common standard deviation equal to the larger within-group value: 0.0046. Then, we can calculate the effect size from this study as \(}}}}}}}} \,=\, \frac}} \,=\, 0.0026/0.0046 \,=\, 0.565\). Thus, according to Equation A (Table 2) for a desired 5% type I error for two-sided testing, and 20% type II error, the required sample size, per group, to detect differences between vaginal birth and C-section is:
$$n_1 \,=\, 2\frac \right)^2}}} \,=\, 49.11 \,\approx\, 50,$$
so, the total sample size needed for the two groups would be estimated as about 100.
We can also estimate \(f \,=\, }}}}}}}}/2 \,=\, 0.2825\), and hence \(R^2 \,=\, \frac} )}} \,=\, 0.0739\). Therefore, the correlation can be estimated as \(\tilde \rho _ \,=\, \sqrt \,=\, 0.272\). Then according to Equation G (Table 2), the required sample size for testing \(\rho _ \,=\, 0\) (assuming h = 0) is:
$$n \,=\, 3 \,+\, \frac \right)^2}}_e\left( \right)/\left( \right)} \right)/2} \right)^2}} \,=\, 103.71 \,\approx\, 104$$
This is slightly larger than the results based on Equation A provided above. These two estimates agree very well for larger sample sizes, but Equation G tends to be more conservative for small sample sizes20.
We could also estimate sample size using estimates of SSB and SSW calculated from the same information in Table 4. Following the principles in Box 1, the sum of squares for the vaginal birth group can be calculated as 70 × 0.00262 = 4.732 × 10−4, and the sum of squares for the cesarean group is 32 × 0.00462 = 6.771 × 10−4. Therefore, SSW is their sum, i.e., 0.001150. To calculate SSB, the overall mean is first obtained by a weighted average, as
$$\left( \right)/\left( \right) \,=\, 0.56048.$$
Therefore, SSB is \(70\,\ast\, \left( \right)^2 \,+\, 32\,\ast\, \left( \right)^2 \,=\, 0.000148\).
With SSW and SSB in hand, then:
$$f^2 \,=\, \frac}} \,=\, 0.000148/0.001150 \,=\, 0.1287,}}}}}}} \,=\, \sqrt \,=\, 0.3587.$$
Hence \(R^2 \,=\, \frac} \right)}} \,=\, 0.1140\).
Therefore, the correlation can be estimated as \(\widetilde _ \,=\, \sqrt \,=\, 0.3376\), and according to Equation G (Table 2), the required sample size for testing \(\rho _ \,=\, 0\) is:
$$n \,=\, 3 \, +\, \frac \right)^2}}_e((1 \,+\, 0.3374691)/(1 \,-\, 0.3374691))/2)^2}} \,\\ =\, 66.55 \,\approx\, 67$$
This sample size estimate of 67 is based on the two standard deviations shown in Table 4, whereas the earlier calculation based on Equation A used the larger of the two, which explains the discrepancy between the two sample size estimates.
These simple beta-diversity sample size calculation are based on normality of the within-group pair distances. A richer approach that relaxes this assumption can be built on the full distribution of pairwise distances, then analyzing the data using concepts from multivariate statistics. Since distance distributions tend to be strongly skewed, such alternative methods are commonly used for analysis after collecting study data. However, performing sample size calculations for these nonparametric analyses is challenging, since information about the distribution of the distances is needed. In a previous paper by the IMPACTT consortium we described approaches for estimating sample size and power when distances are available21. Furthermore, since these distances are often difficult to obtain, we also demonstrated how to generate distances by simulation21.
Box 1 Relationship between effect size, Δ, sums of squares from ANOVA, and correlation for studies comparing two groupsWithin-group and between-group sums of squares (\(SSW,SSB,\) respectively) are basic elements of ANOVA; the ANOVA F-statistic is calculated as \(\frac}}\) where \(G\) is the number of groups and \(n\) is the sample size. Cohen17 defined a statistic, \(f^2 \,=\, \frac}}\), where this formula can also be seen as the ratio of the variance between-group means to the variance of the data within the groups. If there are only two groups, then the effect size for a two-group t-test, \(\), can be written \( \,=\, 2f\) where \(\) is the effect size from Equation A (Table 2). Furthermore, the \(R^2\) measure from a regression model or an ANOVA, can be written as:
$$R^2 \,=\, \frac} \right)}}$$
when there are two groups. The square root of \(R^2\), for the two-group comparison, is the \(\rho _\) of Equation G. For groups of unequal size, for more than two groups, or unequal standard deviations, adjustments to the simple formulas have been developed17.
Using an entire vector of abundances to describe the microbiome of a sample (Fig. 1 , node 3)Multivariate methods can be used to compare microbial community structures through examination of distributions of the counts of taxa abundances. These distributions tend to have a large and heavily skewed dynamic range, with some very large counts and many near zero. No simple distributions match the shape and variability well, and hence specific methods for sample size calculation are needed. Row Using an entire vector of abundances to describe the microbiome of a sample (Fig. 1, node 3) in Table 3 shows the usual setup and method. Although resampling-based comparisons could be considered (e.g., permutation tests), they rest on assumptions which may not hold, such as that the within-group variability is consistent across groups. Therefore, La Rosa et al.22 proposed tests for comparing community structures based on the Dirichlet-Multinomial distribution. Since their approach is based on parametric distributions, it contains parameters that can be interpreted as measures of how different the community structures are. Thus, their method can be expected to be more powerful than any nonparametric procedure. The combination of the Dirichlet distribution with the multinomial allows capture of the inter-sample variability needed for microbiome data; in statistics this feature is referred to as ‘over-dispersion’. There are two key parameters: \(\pi \,=\, \left( \right)\) represents the expected taxa frequencies averaged across the groups being compared, and θ represents the over-dispersion.
In La Rosa et al.22 three tests are introduced and demonstrated: (a) comparing one community structure to an expectation, i.e., \(H_0:\pi \,=\, \pi _0,\)where \(\pi _0\) is known, (b) comparing two groups, \(H_0:\pi _1 \,=\, \pi _2,\) and (c) comparing multiple groups, \(H_0:\pi _1 \,=\, \pi _2 \,=\, \pi _3 \,=\, \ldots\). All these tests, and corresponding power calculations are built into their software package: HMP23.
To give an example, here we describe their calculations comparing community structures between two groups. Their data were taken from three oral sites (subgingival, supragingival, and saliva) in 24 subjects of both genders from the USA. Power calculations are based on a modified version of Cramer’s φ⬚ criterion, φm, which is based on a contingency table chi-squared test statistic (χ2):
$$\varphi _m \,=\, \root \of }^2}}}}$$
The value of this normalized chi-squared statistic is determined by the two key parameters, \(\pi \,and\,\theta\). When the authors compared the distributions for subgingival plaque and supragingival plaque in their subjects, their value of the modified Cramer’s φm was 0.16. They then calculated power for different numbers of subjects per group, and for different numbers of reads, at significance thresholds of 1 and 5%. For 1000 reads per group, power increased from 29.46% with 10 subjects per group to 89.76% with 25 subjects per group, using a significance threshold of 1%. It is worth noting that the authors recommend aggregating very rare taxa with abundance <1% into a single category.
Testing association between total microbial alpha-diversity, or taxon-specific alpha-diversity and an exposure or grouping variables (Fig. 1 , node 4)In community ecology, alpha-diversity refers to the number of species present in an ecosystem (richness)15 as well as the frequency of occurrence of each type of organism (evenness). This ecological metric is found to be reduced in several disease states24, making it a relevant factor to consider when proposing microbiome research hypotheses. The most commonly used metrics/indices are Shannon, Inverse Simpson, Simpson and Chao indices25. These indices do not consider the phylogeny of the taxa identified in sequencing. One measure of phylogenetic diversity (Faith’s PD) is based on phylogeny and can be calculated when a microbial phylogenetic tree is available24.
When alpha-diversity is normally distributed or can be log-transformed, basic equations can be used for sample size calculations—see row Testing association between total microbial alpha-diversity, or taxon-specific alphadiversity and an exposure or grouping variables (Fig. 1, node 4), node 4 in Table 3.
For an example of how to calculate sample size with an alpha-diversity metric, we will consider a study presented by Casals-Pascual et al.7 which used Faith’s phylogenetic diversity (Faith’s PD). This study aimed to compare the diversity of gut microbial communities in two phenotypically distinct groups of patients with Crohn’s disease (CD). The null hypothesis was that gut microbiota phylogenetic diversity did not differ by CD phenotype. To test this hypothesis, CD patients with the B1 phenotype would be compared with those with either a B2 or B3 phenotype. In this case, the CD phenotype was the independent variable and gut microbial diversity (Faith’s PD), the dependent variable.
To determine the number of patients required to find a statistically significant difference in Faith’s PD between CD phenotypes, researchers searched for summary statistics on Faith’s PD. They found a gut microbiota study of 100 patients with the B1 CD phenotype that reported a standard deviation of 3.45 for Faith’s PD, a mean of 13.5, and the distribution seemed to be approximately normal. To determine a clinically meaningful effect size, and because a similar previous study did not exist, researchers considered an analogous study where patients treated with antibiotics were compared to healthy controls. In the analogy, an effect size of 1.5 units was observed with Faith’s PD metric with a significance level of 0.0001. Using Equation A in Table 2 and a standard deviation of 3.45, selecting a conventional level of statistical significance of 5% and a statistical power of 80%, a total sample size of 110 patients (55 per group) was recommended to detect differences in Faith’s PD of ≥2 units7.
The median value can also be used to calculate sample size and is particularly appropriate for skewed richness and diversity values; the formulae needed for converting medians and interquartile ranges to means and standard deviations are shown and referenced in Footnote 1 Table 3.
When the exposure variable is continuous, methods based on correlations can be used. For example, soil bacteria metagenome alpha-diversity has been associated with mean annual precipitation gradients26. Suppose a researcher wants to test the null hypothesis that alpha-diversity is unrelated (\(r \,=\, 0\)) to mean annual precipitation with \(\alpha \,=\, 0.05\) and power of \(0.95\). The researcher assumes that the alternative correlation coefficient (h) is approximately \(- 0.5\). Therefore, following Equation G in Table 2:
$$\tilde \rho _^ \ast \, =\, \frac\ln \left( }}}}} \right) \,=\, \frac\ln \left( \right)}} \right)}}} \right) \,\\ =\, - 0.549$$
$$z_} \,=\, 1.96\left( \right);\,z_ \,=\, 1.64\left( \right)$$
For a null hypothesis (\(H_0\)) of no correlation, the required sample size is approximately:
$$\begin} } \right)} \,+\, z_} \right)^2/\left( ^ \ast \,-\, h ^\ast } \right)^2} \right)} \\ \right)^2/\left( \right)^2} \right)} \\ \end$$
Testing association between taxon abundances and an exposure or grouping variable (Fig. 1 , node 5)Researchers may also hypothesize about abundances of a specific microbial taxon of interest. These hypotheses can be expressed either using mean abundances, or by examining the proportion of samples with abundances over a chosen threshold. Sample size calculations can be based on either choice of metric, see row Testing association between taxon abundances and an exposure or grouping variable (Fig. 1, node 5) corresponding to node 5 in Table 3.
An important consideration here, as well as for nodes 6 and 7 in Fig. 1, is the choice of type 1 error. If only one taxon is of interest, then the type 1 error threshold, α, should not need adjustments for multiple testing. However, if the study plans to test association at all available taxons, then power calculations should be performed using a value α* which controls family-wise error rate. For example, use of the Bonferroni correction would suggest \(\alpha^\,=\, \alpha /M\) where M is the planned number of tests.
For example, in the observational study of Koleva et al.27 looking at the gut microbiome of mother-infant pairs (total population size of 1,021) from the Canadian Healthy Infant Longitudinal Development (CHILD) Study, the authors hypothesized the genus Lactobacillus was reduced in gut microbiota of male infants born to an asthmatic mother27. The abundance of 16S data from fecal samples collected at 3–4 months after birth was compared between infants born to mothers who received asthma treatment during pregnancy (i.e., infants high risk for allergic diseases) and those who were not. Results supported the Lactobacillus hypothesis in a sex-dependent and ethnicity-dependent manner27. In male Caucasian infants, the reduction of Lactobacillus was independent of other study covariates known to also influence the infant gut microbiome, such as pre-pregnancy overweight, atopy status, breastfeeding and intrapartum antibiotic treatment, strengthening these conclusions. Infant fecal Lactobacillus abundance was transformed into a binary variable using the cut-off value for the highest tertile (Table 5).
Table 5 Percent distribution of highest Lactobacillus abundance (highest tertile cut-off) between asthmatic mothers and control group (adapted from Table 1 of Koleva et al.27.The R script in Box 1 of the Supplement can be used to implement the sample size and power calculations using equations D and E of Table 2. However, we illustrate calculations assuming the sample sizes are equal in each group (\(r \,=\, 1\), Equation C of Table 2). Assuming a 2-sided test with an α of 0.05, 87 samples in each group can provide 80% power to detect a difference of this size in the proportion of infants with Lactobacilli above the highest tertile.
In the same study, abundances of bacterial taxa other than Lactobacillus were made using the Benjamini–Hochberg method28 to adjust for multiple testing (which is built into the multi-test procedure in SAS). Tests for interactions between infant sex and maternal prenatal asthma on Lactobacillus abundance were performed using an adjusted rank transform (ART) nonparametric test.
The taxon abundance comparisons presented in this paper are also useful to calculate sample size for other research questions. Due to the non-normal distribution of taxon abundance data, we provide an example that first requires converting median abundance into mean abundance for use in the sample size equation (Table 6). This conversion may be more appropriate at the phylum or other higher classification level, even family level, in which abundance data may be least skewed. Nevertheless, for illustration we used the median abundance of fecal Bacteroidetes in female infants of mothers with and without asthma to calculate sample size (Table 6). Medians and IQR (Q1 and Q3) were provided in the paper, and we transform these to estimate the mean and standard deviation. In this case, we use mean = median and SD = IQR/1.35:
$$}}}}}}} \,=\, }}}}}}}\quad \quad SD \,=\, \frac}} \,=\, \frac}}$$
Table 6 Median relative abundance (interquartile range) of Bacteroidetes in female infants of mothers with and without asthma (Adapted from Table 3 of Koleva et al.27.Therefore, in female infants of mothers with asthma, the mean and SD are estimated to be:
$$Mean \,=\, median \,=\, 72.8; \,n_1 \,=\, 17; \,SD_1 \,=\, \frac}} \,=\, \frac}} \,=\, 43.26$$
In female infants of mothers without asthma, the mean and SD will be:
$$Mean \,=\, median \,=\, 31;n_1 \,=\, 145;SD_1 \,=\, \frac}} \,=\, \frac}} \,=\, 45.78$$
$$\beginSD_ \,=\, \sqrt \right)SD_1^2 \,+\, \left( \right)SD_2^2}}}} \\ \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \,=\, \,\sqrt \right)\ast 43.26^2 \,+\, \left( \right)\ast 45.78^2}}} \,=\, 45.53} \end$$
The effect size, Δ, is therefore:
$$ \,=\, \frac}}} \,=\, \frac}} \,=\, 0.918.$$
Based on this effect size, we calculate sample size using Equation A in Table 2 for testing differences in means:
$$n \,=\, \frac} \,+\, z_} \right)^2}}^2}},$$
where, n is the sample size in each group (assuming sizes of the two groups are equal), \(z_} \,=\) 1.96 (\(\alpha \,=\, 0.05\)), and \(z_\) = 0.84 (\(\beta \,=\, 0.20\)). Therefore:
$$n \,=\, \frac} \,+\, z_} \right)^2}}^2}} \,=\, \frac}} \,=\, 18.60 \,\approx\, 19$$
Assuming a 2-sided \(\alpha\) of 0.05, 19 samples in each group can achieve 80% power to compare Bacteroidetes abundance between asthmatic and non-asthmatic mothers of female infants.
Testing higher or lower rates of cluster membership between groups (Fig. 1 , node 6)Microbiota community types or clusters are increasingly being used to characterize whole microbial community composition. The format of these variables is categorical, and therefore sample size calculations are straightforward, see row Testing higher or lower rates of cluster membership between groups (Fig. 1, node 6) for node 6 in Table 3.
We provide an example here based on a study by Tun et al. (2021)29 that identified 4 longitudinal gut microbiota clusters during infancy (Table 7). A sample size calculation is performed to determine the association between Asian ethnicity vs. any other, and the presence or absence of the C1–C1 cluster in the infant. A binary variable was created for the presence of the C1–C1 cluster vs. three other clusters.
Table 7 Percent distribution of the C1–C1 cluster vs. the other three clusters between Asian ethnicity and others (Table 1 of Tun et al. (2021))29.R code and results for estimating sample size and power from these data, following Equations C, E in Table 2, are shown in the Supplementary Material, Box 2. There, we consider the sample sizes are equal in each group \((r \,=\, 1)\). Assuming a 2-sided test with an α of 0.05, 138 samples in each group can achieve 80% power to find an association between Asian ethnicity (vs. other) and the C1–C1 gut microbiota cluster (vs. others).
Testing higher or lower rates of taxon membership between groups (i.e., colonization with a microbe) (Fig. 1 , node 7)To determine sample sizes for colonization with specific microbiota (presence/absence or yes/no), one can use the equations for proportions or odds ratios (Table 3, row Testing higher or lower rates of taxon membership between groups (i.e., colonization with a microbe) (Fig. 1, node 7) for node 7).
To illustrate sample sizes for colonization with specific microbiota (presence/absence or yes/no), we draw information from Drall et al.30 to test the question whether infant C. difficile colonization differs by exclusivity of breastfeeding (Table 8). In 853 exclusively breastfed infants (EBF), the C. difficile colonization rate was 22.63%, i.e., 193 exclusively breastfed infants were colonized with C. difficile. In 431 partially breastfed infants (PBF), the C. difficile colonization rate was 35.96%; hence 155 partially breastfed infants were colonized, and in 270 exclusively formula-fed infants (EFF), the C. difficile colonization rate was 49.63% implying 134 colonized infants (Table 8).
Table 8 C. difficile colonization rate between exclusively breastfed infants vs. partial breastfed or formula-fed infants.The pooled proportion of C. difficile colonization from the PBF and EFF groups is obtained as:
$$}}}}} \,=\, \frac}} \,=\, \frac}} \,=\, 0.4123$$
R code and results for estimating sample size and power from these data, assuming an equal sample size in each of the two groups (\(r \,=\, 1\)), can be found in Supplementary Material, Box 3. Calculations follow Equation C in Table 2. Assuming a 2-sided test with an α of 0.05, 95 samples in each group are needed to achieve 80% power to find differences in C. difficile colonization rates between exclusively breastfed infants and other infants.
This sample size calculation example will apply to any comparison of populations with respect to presence or absence of a microbe of interest. For example, this approach would be appropriate for a study comparing the presence of shared microbial species between animals and humans. The sample size calculation might also be used to compare samples where the entry or exit of microbial species into/from an ecosystem is expected. For instance, a study where a single species (probiotic) or an entire community is introduced, such as fecal microbiota transplantation.
留言 (0)