The impact of co-housing on murine aging studies

A murine lifespan database

Murine lifespan studies are a cornerstone of research in the basic biology of aging. Our objective was to determine whether the survival of co-housed individuals is independent and to explore the implications of intra-class correlation (ICC). We obtained data from four published and two unpublished murine lifespan cohorts for which housing identifiers were available, totaling 22,385 mice (Table 1). These studies include mice of both sexes representing 42 inbred strains and two outbred populations. The studies were carried out in different facilities, at different times, and with a variety of housing configurations. The mice included in the database were genetically diverse, including 3763 inbred mice (JAX32\(+\)JAXCC studies) and 18,622 outbred mice (JAXDO\(+\)DRiDO\(+\)UTITP\(+\)JAXITP). Among the outbred mice, we obtained data for 17,062 UM-HET3 mice (UTITP\(+\)JAXITP) and 1560 DO mice (JAXDO\(+\)DRiDO). Both sexes are well represented (50.8% female). With these data, we were able to evaluate the independence of murine lifespan across more than 20,000 mice and nearly 6000 housing units. Survival outcome varied by study, site, randomization group, strain, and sex (Fig. 2).

Housing densities varied by study. Mice were most commonly housed four per cage (45.4% of mice), then three per cage (36.0% of mice). All mice in the JAXDO study were housed 5 per cage, and all mice in the DRiDO study were housed 8 per cage. Housing densities varied within other studies—the JAXITP study housing density was the most variable with a standard deviation of 1.25 mice (versus SD=0.933 (JAX32), SD=0.781 (JAXCC), and SD=0.487 (UTITP)). Housing density affects several factors associated with lifespan [20,21,22,23,24]. Mice are sometimes reassigned to new housing units during a study, often to mitigate aggressive behavior [25]. We have analyzed these data “as randomized” based on the original housing assignments when possible.

In the absence of confounders, we might expect that survival outcome would be largely explained by study factors, especially among inbred mice where genetic heritability of lifespan is accounted for. Yet after adjusting for study-specific covariates, we found consistently large dispersion among survival times within cage (median intra-cage IQR of residual survival: JAX32 (3.9 months) < JAXCC (4.6 months) < UTITP (5.5 months) < JAXITP (5.9 months) < JAXDO (8.8 months) < DRiDO (9.8 months)). As an illustration, see the per-housing-unit adjusted lifespan distribution for JAXDO (Supplementary eFigure 1). In the figure, each boxplot represents one housing unit. Boxplots of lifespan by housing identifier were sorted within study by median adjusted lifespan. Taller boxplots within study indicate cages with larger intra-cage variability, and the non-zero slope of the medians indicates inter-cage variability within study. While all studies will exhibit some inter-cage variability, those with greater intra-cage correlation will exhibit a more pronounced slope of median residual lifespan. Thus, in a large and highly generalizable dataset of murine aging studies, simple descriptive statistics reveal that residual variability in lifespan, after adjusting for study-specific factors, is patterned by cage ID.

Lifespans of co-housed animals are not independent

The intra-class correlation (ICC; [26]) provides a quantified measure of the degree of clustering in lifespans by housing units. We considered two approaches to estimating ICC, linear mixed models (LMM) and generalized estimating equations (GEE). LMM assumes that co-housed mice share a random contribution to lifespan that is common to all mice within a housing unit and that varies between housing units. GEE explicitly models the covariance matrix based on group structure. Consequently, GEE is an unbiased estimator of ICC in that it can estimate positive, null, or negative ICC values while LMM is a biased ICC estimator in that it can estimate only positive or null ICC values, a distinction with important downstream effects on analytic results [27]. In the presence of negative ICC, LMM will constrain random effect variance to zero, thereby forcing the statistical model to assume data are independent.
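To make the distinction between an unconstrained and a zero-bounded ICC estimate concrete, the following is a minimal sketch using the classical one-way ANOVA estimator. This is a swapped-in technique chosen because it needs only the Python standard library; it is neither the LMM nor the GEE fit used in this study, and the function names `anova_icc` and `truncated_icc` are hypothetical. The truncated version only mimics what happens when a variance component is constrained to be non-negative.

```python
from statistics import mean


def anova_icc(cages):
    """One-way ANOVA estimate of the intra-class correlation.

    `cages` is a list of lists; each inner list holds the (residual)
    lifespans of the mice that shared one housing unit.
    The estimate can be positive, zero, or negative.
    """
    g = len(cages)                      # number of housing units
    sizes = [len(c) for c in cages]
    n_total = sum(sizes)
    grand = sum(sum(c) for c in cages) / n_total
    cage_means = [mean(c) for c in cages]

    # Between-cage and within-cage mean squares.
    msb = sum(n * (m - grand) ** 2 for n, m in zip(sizes, cage_means)) / (g - 1)
    msw = sum((x - m) ** 2
              for c, m in zip(cages, cage_means) for x in c) / (n_total - g)

    # Effective cage size; equals the common size when cages are balanced.
    n0 = (n_total - sum(n * n for n in sizes) / n_total) / (g - 1)
    return (msb - msw) / (msb + (n0 - 1) * msw)


def truncated_icc(cages):
    """Mimic an estimator whose cage variance cannot drop below zero."""
    return max(0.0, anova_icc(cages))


if __name__ == "__main__":
    alike = [[1, 1], [3, 3]]    # cage-mates identical, cages differ
    unlike = [[1, 3], [1, 3]]   # cage-mates differ, cages identical
    for label, data in (("alike", alike), ("unlike", unlike)):
        print(label, anova_icc(data), truncated_icc(data))
```

In the second toy dataset the unconstrained estimate is negative while the truncated one reports zero, which is the situation in which a zero-bounded estimator would treat the data as independent.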

Table 2 Conditional (LMM) and marginal (GEE) estimation of intra-class correlations in lifespan outcome

We applied LMM and GEE methods to each of the studies to estimate ICC in lifespan outcome (Table 2, Fig. 3). GEE estimated a small negative intra-cage correlation in DRiDO data, which was missed by the LMM estimator (which failed to converge using default settings). LMM results indicated positive intra-cage correlation in JAXCC data, while GEE results indicated these data were consistent with positive, null, or negative intra-cage correlation. LMM results indicated a large positive intra-cage correlation in both JAXITP and UTITP relative to other studies; qualitatively similar results were obtained via GEE estimation. LMM results in JAX32 indicated a relatively large positive intra-cage correlation, and qualitatively similar results were obtained via GEE estimation. LMM results in JAXDO indicated these data were consistent with positive or (nearly) null intra-cage correlation, while GEE results indicated these data were consistent with positive, null, or small negative intra-cage correlation. LMM estimation of ICC for the pooled database after adjustment for study indicated that overall lifespan data were weakly positively correlated within cage (ICC (95% CI), 0.049 (0.038, 0.059)). Similarly specified GEE modeling also found that ICC hovered around 0.05 (ICC (95% CI), 0.045 (0.033, 0.057)). The two models approached equal precision (difference in ICC CI width consistently less than 0.1). Thus, two statistical methods for multilevel data, LMM and GEE, quantified the extent to which the lifespans of co-housed mice are not independent. We demonstrate that while LMM can estimate only positive or null correlations among lifespans, GEE can estimate both positive and negative correlations. Both methods generally indicated weak positive correlations for lifespan outcomes within cages.

Fig. 3

Intra-cage correlations by study and estimation method. LMM specifications: For the DRiDO study, diet fixed effect and HID random effect (did not converge; estimate not shown); survival in months adjusted for generation batch effect prior to modeling. For the JAX32 study, sex fixed effect; strain and housing ID random effects. For JAXCC, sex fixed effect; strain and housing ID random effects; survival in months adjusted for generation and cohort prior to modeling. For JAXITP and UTITP studies, sex and binary treatment/control fixed effects and housing ID random effect; survival in months adjusted for cohort prior to modeling. GEE specifications: Same as above for DRiDO, JAXDO, JAXITP, and UTITP. In GEE analyses, JAX32 and JAXCC data were stratified by strain, and the strain random effect term was dropped from the model to isolate the clustering effect of housing ID.

Fig. 4

Intra-cage correlations by physiologic trait domain. ICC was estimated via LMM methods applied to longitudinally collected phenotypic trajectories in the DRiDO study. LMM-estimated ICC with 95% confidence interval is shown. The dotted vertical line indicates the median ICC for the phenotypic domain. LMM specified as diet fixed effect and HID random effect. DRiDO phenotypic trajectories included biannual frailty and body temperature (Frailty; a), annual whole blood analysis (CBC; c), annual metabolic cage (MetCage; b), annual wheel running (Wheel; d), annual bladder function (Void; e), annual body composition (PIXI; f), weekly body weights (BW; g), annual acoustic startle (AS; h), annual immune cell profiling (FACS; i), annual echocardiogram (Echo; j), annual rotarod (Rotarod; k), annual fasting and non-fasting glucose (Glu; l), and annual grip strength (Grip; m). Color indicates co-housing effect values >0.05. All quantitative traits excluding body weights were corrected for batch effects as described in [13]. LMM-estimated ICC and 95% confidence intervals are reported in Supplementary eTable 1.

We used permutation testing to quantify the significance of the observed ICC statistic. As recommended by R. A. Fisher [28], we permuted the experimental units that were assigned to treatments through randomization. In the experiments in the database, these units correspond to housing identifiers, because cages, not individual mice, are directly assigned to specific treatments, sexes, or strains. The observed ICC was not significantly different from the permuted (null) ICC distribution for DRiDO (p=0.887), JAXCC (p=0.401), and JAXDO (p=0.161), but was significantly different from the permuted ICC distribution (\(p<0.001\)) for JAX32, JAXITP, and UTITP (Supplementary eFigure 2). These results qualitatively correspond to GEE estimates of ICC significance.

Longitudinally collected phenotypic trajectories of co-housed animals are not independent

Biomarkers are important indicators of normal and pathologic aging biology and/or anti-aging therapeutic response. The DRiDO study, a comprehensive investigation of dietary restriction interventions in Diversity Outbred mice [13], offered an exceptional opportunity to explore clustering effects in aging biomarkers. Longitudinal phenotyping was conducted across multiple physiological domains, including weekly body weights, assessments of frailty index, grip strength, and body temperature every 6 months, and yearly evaluations such as metabolic cage analysis, body composition, echocardiogram, wheel running, rotarod performance, acoustic startle response, bladder function, fasting glucose levels, immune cell profiling, and whole blood analysis.

We applied LMM methods to all longitudinally collected phenotypic trajectories in DRiDO to quantify ICC across hundreds of physiologic measurements of aging (Supplementary eTable 1). Examining the maximum estimated ICC by physiological domain indicated positive intra-cage correlation could exceed 0.05 in several physiological domains, including frailty, metabolic cage, whole blood analysis, and wheel running. Domains with the top five median ICC included frailty, bladder function, metabolic cage, glucose, and wheel running (ICC=0.0674, 0.0356, 0.0302, 0.0227, and 0.0142, respectively), indicating these domains were sensitive to co-housing relative to other physiological domains (Fig. 4). Alopecia and whisker trimming may be induced by barbering among co-housed mice, providing a potential explanation for why these frailty indicators exhibited more pronounced effects than other traits. In sum, co-housed mice exhibited similarities in aging biomarkers such as frailty and metabolic health that were not explainable by known study factors in a large longitudinal study on murine aging. The correlation of outcomes by cage appears to extend beyond lifespan.

Impact of co-housing on statistical tests for effects in murine aging studies is modest in practice

Analytic approaches varied for the different studies as appropriate given varying study design complexity, yet none of the primary publications accounted for non-independence by cage. (Lifespan data for two studies have not previously been reported (JAXCC and JAXDO).) Underutilization of analytic methods that account for clustering is not unique to preclinical research (e.g., cancer trials [29]; nutrition and obesity [30]). After incorporating various study design features via pre-analysis batch adjustment or model specification, we found that tests for the main effect in each study with and without incorporating a random effects term for housing identifier to capture intra-cage correlation usually showed no practical difference (Supplementary eTable 2).

Repeated measure designs may wrongly be thought to circumvent statistical artifacts of co-housing by comparing mice to themselves over time. In repeated measures applications, the individual becomes the first level of aggregation and the cage or other clustering unit becomes a second level of aggregation. To arrive at an unbiased estimate of outcome trajectories (how individual units on average change over time) from serial data, the second level of aggregation must be accounted for. For longitudinally collected phenotypic trajectories in the DRiDO study, we found that tests for the main effect of diet with and without incorporating a random effects term for housing identifier to capture intra-cage correlation often showed no practical difference (Supplementary eTable 3). Where tests for the main effect of diet conflicted, as was the case for several frailty items, the p-value for the main effect of diet was underestimated when co-housing was ignored, likely due to misattribution of variance to diet instead of housing identifier, which resulted in underestimated standard errors. Ultimately, while primary publications often did not account for non-independence by cage, our analysis showed that accounting for co-housing usually made little practical difference on statistical tests due to the relatively small magnitudes of intra-cage correlation observed.

Fig. 5

Power curves for simulated trial data with null treatment effects by ICC by model type. Simulations demonstrate that applying rules of thumb for murine research of longevity that ignore intra-cage clustering may underpower studies and reduce replicability. We simulated two-arm randomized lifespan studies with null effects and different ICC values. We applied conventional tests for uncensored data (LM, LMM, GEE) and calculated empirical power. The figure shows how power varies with ICC, model type, and sample size. When generating data from a null fixed effects model, any \(p<\)0.05 indicates a false positive error. The results showed that tests assuming independence (LM) overestimated power in the presence of positive ICC values. Sample sizes computed by the LM models ignoring ICC would be too small for positive ICC, leading to wasted resources and low replicability

Sample size guidelines in murine aging research lack broad applicability

Murine lifespan exhibits heterogeneity by strain, sex, and site, as highlighted by previous studies [16, 31, 32]. Despite this complexity, researchers may rely on sample size recommendations based on survival data from a single study, inbred strain, or sex ([33,34,35,36,37,38] cite [39]). Similarly, non-lifespan murine outcomes demonstrate considerable heterogeneity [40,41,42], yet existing sample size guidelines draw primarily from data sources that are strain-specific (e.g., [43, 44]). Failure to consider important sources of heterogeneity can limit the generalizability of recommendations and may lead to underpowered or inefficient study designs [45, 46]. Comprehensive databases providing sample size recommendations across strains, sexes, and experimental conditions would benefit the field [47], yet even in strain- and sex-matched samples, investigators risk non-negligible effects on statistical validity when implementing existing lookup tables for a simplified version of the study design and extrapolating to the more complex case planned [48]. Simulations demonstrate that applying rules of thumb for murine research of aging that ignore intra-cage clustering may underpower studies and reduce replicability. We simulated two-arm randomized lifespan studies with null effects and different ICC values. We applied conventional tests for uncensored (LM, LMM, GEE) and censored (COX, COXME) data and calculated empirical power. Figure 5 (LM, LMM, GEE) and Supplementary eFigure 3 (COX, COXME) show how power varies with ICC, model type, and sample size. When generating data from a null fixed effects model, any \(p<\)0.05 indicates a false positive error. Results showed that tests assuming independence (COX, LM) overestimated power in the presence of positive ICC values. Sample sizes computed by COX and LM models ignoring ICC would be too small for positive ICC, leading to wasted resources and low replicability.
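The kind of simulation described above can be sketched in a few lines. This is an illustrative stand-in, not the authors' simulation code: it assumes normally distributed lifespans with unit total variance, a shared cage effect carrying a fraction `icc` of that variance (so only ICC ≥ 0 is covered), cages assigned whole to one of two arms, and a large-sample z-test that ignores cages in place of the LM. All names and default values (`false_positive_rate`, 25 cages per arm, 4 mice per cage, 1000 trials) are hypothetical choices.

```python
import random
from statistics import NormalDist, mean, variance


def simulate_null_trial(rng, cages_per_arm, cage_size, icc):
    """Return the naive two-sided p-value for one null two-arm trial."""
    arms = []
    for _ in range(2):
        lifespans = []
        for _ in range(cages_per_arm):
            # Shared cage effect induces correlation `icc` among cage-mates.
            cage_effect = rng.gauss(0.0, icc ** 0.5)
            for _ in range(cage_size):
                noise = rng.gauss(0.0, (1.0 - icc) ** 0.5)
                lifespans.append(cage_effect + noise)
        arms.append(lifespans)
    a, b = arms
    # Standard error that treats every mouse as independent.
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def false_positive_rate(icc, n_trials=1000, cages_per_arm=25,
                        cage_size=4, seed=1):
    """Share of null trials with p < 0.05 when clustering is ignored."""
    rng = random.Random(seed)
    hits = sum(
        simulate_null_trial(rng, cages_per_arm, cage_size, icc) < 0.05
        for _ in range(n_trials)
    )
    return hits / n_trials


if __name__ == "__main__":
    for icc in (0.0, 0.05, 0.1, 0.3):
        print(icc, false_positive_rate(icc))
```

Because there is no treatment effect in the generated data, every rejection counted here is a false positive; the rate should sit near 0.05 at ICC=0 and rise as the ICC grows.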

Outcome dependence results in biased p-value distributions

Why does this inflation of power occur for LM and COX models when ICC is non-null? The behavior of the p-value distribution, which is assumed to be uniform from zero to one under the null hypothesis, is central to the hypothesis testing paradigm. Via simulation, we show that violated assumptions can cause the p-value distribution to be skewed under the null. For example, if a test assumes ICC=0, as in LM or COX, but ICC=−0.1, then the p-value distribution will be skewed towards 1 (Supplementary eFigure 4). This will result in a loss of power and an increase in type II errors (failing to reject the null hypothesis when it is false). Conversely, if a test again assumes ICC=0 but ICC=\(+\)0.1, the p-value distribution will be skewed towards zero (Supplementary eFigure 5). This will result in an overestimate of power and an increase in type I errors (rejecting the null when it is true). Therefore, in murine aging studies, it is important to check the assumptions of the test and use appropriate methods to account for any violations. Statistical biases resulting from intra-cage correlations of similar magnitude to those observed in these case studies have important implications for power and reproducibility.
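The direction of the skew can be checked with a back-of-the-envelope calculation. The sketch below rests on assumptions not stated in the text: equal cage sizes, whole cages assigned to arms, and a large-sample z-test that ignores cages. Under those assumptions the naive test statistic has variance close to 1 + (cage size − 1) × ICC rather than 1, so its null rejection rate can be written in closed form. The function name `naive_rejection_rate` and the cage size of 4 are illustrative.

```python
from statistics import NormalDist


def naive_rejection_rate(icc, cage_size, alpha=0.05):
    """Approximate null rejection rate of a test that assumes ICC = 0.

    Large-sample approximation: the naive z-statistic is treated as
    normal with variance equal to the design effect.
    """
    design_effect = 1.0 + (cage_size - 1) * icc
    if design_effect <= 0:
        raise ValueError("ICC is below the minimum allowed for this cage size")
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    # The naive statistic is too spread out (ICC > 0) or too tight (ICC < 0).
    return 2.0 * (1.0 - std_normal.cdf(z_crit / design_effect ** 0.5))


if __name__ == "__main__":
    for icc in (-0.1, 0.0, 0.1):
        print(icc, round(naive_rejection_rate(icc, cage_size=4), 3))
```

With four mice per cage, the rate falls below the nominal 0.05 for ICC=−0.1 (p-values pushed towards 1) and rises above it for ICC=+0.1 (p-values pushed towards zero).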

Table 3 Sample size to detect fractional change in mean outcome under CRT design

While lifespan outcome is characterized by a single quantitative variable, quantification of health effects or molecular changes in response to aging interventions is often a multi-outcome endeavor (e.g., [13, 49]). With the advent of digital cages capable of 24/7 automated data capture, multi-outcome preclinical aging research is likely to become only more highly dimensional in the near future [50]. It is recommended practice to employ false discovery rate (FDR) adjustment to account for multiple tests. However, when conducting many tests and applying an FDR adjustment, skew in the null p-value distribution can affect the estimation of the proportion of true null hypotheses and the calculation of adjusted p-values [51, 52]. Biased p-value distribution can thus result in either too many or too few false discoveries when applying an FDR adjustment to large-scale simultaneous hypothesis tests. Consequently, the importance of considering biased p-value distribution due to non-independence by housing unit under the null hypothesis may be even greater for biomarkers of aging than for lifespan, despite similar observed ICC magnitude.

Sample size guidelines used in murine aging research may be derived from formulas that ignore potential correlation induced by co-housing (e.g., [39]). Quantified ICC in lifespan data pooled across multiple sites, sexes, diets, and in hundreds of traits in a large sample of genetically diverse mice allows for improved estimates. For census observed data, assuming the same mean and variance reported in [39] (mean days=912, variance=143\(^2\)) and assuming n per cluster is 4 (mean in database=4.07), we estimated n per group as follows: n=39 for ICC=0 (as in [39]; rounded to the nearest 10), n=40 for ICC=0.01, n=45 for ICC=0.05, and n=51 for ICC=0.1. We also provide a small grid of required number of mice per group according to the closed form solution [53] for a 10% effect size by ICC and cluster size for inbred and outbred mice (Table 3, Fig. 6). Our expectation is that these recommendations will be continuously updated by statisticians in the preclinical aging research community as new data become available. These results imply that strain and sex show large effects on n per group relative to plausible ICC and cluster size for murine aging studies.
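The arithmetic behind these per-group estimates can be sketched as follows. The sketch assumes a two-sided comparison of means at a 0.05 significance level with 80% power and a 10% change in mean lifespan; the paragraph above does not state these settings for this example, so they are borrowed from the description of the Table 3 grid and should be read as assumptions. The independent-data sample size is inflated by the design effect 1 + (cage size − 1) × ICC and rounded up. The function name `n_per_group` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(mean_days, sd_days, icc, cage_size,
                effect_fraction=0.10, power=0.80, alpha=0.05):
    """Mice per group for a two-arm comparison of mean lifespan.

    Normal-approximation formula for independent data, multiplied by
    the design effect for cages of equal size.
    """
    std_normal = NormalDist()
    delta = effect_fraction * mean_days          # difference to detect
    z_sum = std_normal.inv_cdf(1.0 - alpha / 2.0) + std_normal.inv_cdf(power)
    n_independent = 2.0 * z_sum ** 2 * sd_days ** 2 / delta ** 2
    design_effect = 1.0 + (cage_size - 1) * icc
    return ceil(n_independent * design_effect)


if __name__ == "__main__":
    for icc in (0.0, 0.01, 0.05, 0.10):
        print(icc, n_per_group(912, 143, icc, cage_size=4))
```

Under these assumed settings, a mean of 912 days, a standard deviation of 143 days, and four mice per cage, the function returns 39, 40, 45, and 51 mice per group for ICC values of 0, 0.01, 0.05, and 0.1, in line with the estimates quoted above.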

Fig. 6

Sample size to detect fractional change in mean outcome under CRT design. The impact of the ICC on the planned trial size is often discussed as being dependent on its magnitude and on the number of subjects recruited per cluster, n, through the so-called design effect (DE), \(\mathrm{DE} = 1 + (n-1)\alpha\) (equation 1) [53], where \(\alpha\) denotes the ICC. For a simple comparison of means in a two-arm trial with equal allocation per group, the DE is used as an inflation factor multiplied by the total sample size for independent data. The table shows a grid of estimates for the required number of mice per group according to the closed form solution in equation (1) for 10% ES (0.1\(\times \)strain-specific-mean lifespan in database) with census follow-up, and anticipated power 1-\(\beta \) of 0.8. Values are presented separately by sex, ICC, and cluster size. ICC varied \(\in \), and cluster size varied \(\in \). ICC, intra-class correlation coefficient; ES, effect size; k, cluster size. Results imply that strain and sex show large effects on n per group relative to plausible ICC and k for murine aging studies. Note that source data for mean and variance in C57BL/6 were more sparse than for outbred mice and so are less reliable. Our expectation is that these recommendations will be continuously updated by statisticians in the preclinical aging research community as new data become available.
