Controlling taxa abundance improves metatranscriptomics differential analysis

IBDMDB data processing

RNA and DNA abundances of microbial metabolic pathways for each species, taxonomic abundances, and sample metadata of HMP2:IBDMDB dataset were downloaded from the IBDMDB website (https://ibdmdb.org/). The downloaded dataset contains the metagenomic and metatranscriptomic information of longitudinal samples collected from patients with CD and UC. The datasets were already processed by bioBakery Version 3.0 [9]. Unmapped reads or unknown taxa were removed from the data. Downloaded RNA and DNA abundances are in cpm (counts per million), and they were further log2 transformed after adding a pseudocount of 1. Taxa abundance values were centered log-ratio (CLR) transformed. Taxon with zero abundances were removed prior to CLR.

For each feature (gene pathway in a species), only samples that have all positive values in RNA, DNA, and taxonomic abundances are kept. Such filtering addresses the excessive amounts of zeroes in RNA abundances [9], and similar approach has been shown to have superior performance [10].

Generating data for the first simulation study

Simulation datasets were created using the processed IBDMDB data (log2 transformed CPM values for DNA and RNA abundances, and CLR transformed taxonomic abundances). Figure 3A shows the schematic of generating the simulated data. All samples were randomly assigned to two groups with equal chances to create a null situation where no differential feature is expected between the two groups. Then, artificial differential signals were added to selected features with absolute values of partial correlations (either between RNA and DNA abundances controlling taxonomic abundances or between RNA and taxonomic abundances controlling DNA abundances) larger than 0.3. Specifically, all positive RNA abundance values across all samples and features are sorted and assigned to 10 bins, so that the minimum value in the kth bin is no smaller than the maximum value in the \((k-1)\)th bin. The 10 bins represent different levels of differential signals. The first bin has the weakest differential signals and the last bin has the strongest differential signals. To add differential signals from kth bin to one feature, one of the two sample groups is randomly chosen to receive the differential signals, and RNA abundance values of all samples in the chosen sample group are added by values randomly selected from the kth bin. Differential features are then identified by each DE model and compared to the gold standard differential features. The simulation study is repeated for each choice of k (\(k=1,2,...,10\)). The injection of differential signals only causes negligible changes in total library size (Supplementary Fig. 3) so the library size was not recalibrated for the simulated datasets.

Generating data for the second simulation study

The simulation data were generated using the code in a Github repository (https://github.com/biobakery/MTX_synthetic) that implements the same procedures described in a previous study [10]. In true-exp mode, the data were generated with parameters –spike-exp 0.1 –spike-exp-strength j. In true-combo-dep-exp mode, the data were generated with parameters –spike-dep –spike-exp 0.1 –spike-exp-strength j. In true-combo-bug-exp mode, the data were generated with parameters –spike-bug 0.5 –spike-exp 0.1 –spike-exp-strength j. In group-true-exp mode, the data were generated with parameters –spike-groups –spike-exp 0.1 –spike-exp-strength j. Here j controls the signal strengths and takes values 0.1, 0.2, ..., 1. The weakest signal strength of 1 corresponds to j being 0.1, and the strongest signal strength of 10 corresponds to j being 1. The values of all other parameters were the same as the values used in the original Github repository.

Statistical models

The HMP2:IBDMDB study provides longitudinal measurements for each subject. After data filtering described in the “IBDMDB data processing” section, we observed in some features that most subjects only have one longitudinal observation, while in other features many subjects have multiple longitudinal observations. Thus we first used a model selection procedure to determine if a linear model with only fixed effects or a linear-mixed effect model is needed for each feature. The two models differ in that the linear-mixed effect model includes a random effect indicating which subject each longitudinal observation was collected. For each feature we fitted both models and compared their model fittings using a likelihood ratio test. Throughout the study, the linear-mixed effect model was fitted using lmerTest::lmer() function in R and the ordinary linear model was fitted using stats::lm() function in R. The p-values were obtained with asymptotic chi-squared tests (lmerTest::anova() function in R) and adjusted by BH procedure [16] (stats::p.adjust() function in R) to obtain FDRs. For all features with FDR < 0.05, linear-mixed effect models were used. For all features with FDR > 0.05, linear models with only fixed effects were used.

For simulation studies based on HMP2:IBDMDB data, the linear mixed-effect models are \(RNA \sim DNA+Taxa+group+(1 \vert subject)\) for the DNA+Taxa model, \(RNA \sim DNA+group+(1 \vert subject)\) for the DNA model, and \(RNA \sim Taxa+group+(1 \vert subject)\) for the Taxa model. For model selection described above, their corresponding ordinary linear regression models are \(RNA \sim DNA+Taxa+group\), \(RNA \sim DNA+group\), and \(RNA \sim Taxa+group\). Here RNA, DNA, and Taxa represent the processed RNA, DNA, and taxonomic abundances. group indicates the two sample groups being compared with, and subject indicates the subjects from which samples are longitudinally collected. The p-values of the group differences (group) are of primary interest and extracted from the model fitting (lmerTest::summary() function in R).

For real data analysis of HMP2:IBDMDB data, the linear mixed-effect models are \(RNA \sim DNA+Taxa+active+age+antibiotics+(1 \vert subject)\) for the DNA+Taxa model, \(RNA \sim DNA+active+age+antibiotics+(1 \vert subject)\) for the DNA model, and \(RNA \sim Taxa+active+age+antibiotics+(1 \vert subject)\) for the Taxa model. For model selection described above, their corresponding ordinary linear regression models are \(RNA \sim DNA+Taxa+active+age+antibiotics\), \(RNA \sim DNA+active+age+antibiotics\), and \(RNA \sim Taxa+active+age+antibiotics\). Here active indicates the dysbiotic and non-dysbiotic sample groups being compared with, and age and antibiotics are the age and antibiotics status for each subject, similar to a previous study [9]. The p-values of the differences between dysbiotic and non-dysbiotic sample groups (active) are of primary interest and extracted from the model fitting (lmerTest::summary() function in R).

For simulation studies based on a previous study [10], we used ordinary linear regression since there is no longitudinal design. The DNA+Taxa model is: \(RNA \sim DNA+Taxa+group\). The DNA model is: \(RNA \sim DNA+group\). The Taxa model is: \(RNA \sim Taxa+group\). The p-values of the group differences (group) are of primary interest and extracted from the model fitting (stats::summary() function in R).

DE analysis was performed only in features with at least 10 observations in both sample groups after data filtering. To test for the fixed effect of interest, the Satterthwaite method [17] was used to estimate degrees of freedoms and p-values were obtained with t-tests. These procedures are already implemented in lmerTest::summary() function in R. FDRs were obtained using BH procedure [16] (stats::p.adjust() function in R).

PubMed support search for real data analysis

For each feature, we manually identified a keyword of its associated metabolic gene pathway. We searched PubMed with the combination of “crohn’s disease” and the keyword. We additionally require that both keywords need to appear in either main text or title/abstract of a paper. For example, the PubMed search is “(“crohn disease”[Text Word] OR “crohn disease”[Title/Abstract]) AND (“adenosine”[Text Word] OR “adenosine”[Title/Abstract])” if the keyword is “adenosine”. We then recorded the number of publications found by PubMed. For each keyword with at least one Pubmed record, we performed a manual validation to identify one example publication that mentions both “crohn disease" and the keyword of the target feature. The name of the feature, the search keyword, the Pubmed search term, the number of Pubmed records, Pubmed ID of the manually validated publication, and the date the search was performed are all listed in Supplementary Table 1.

Software used in this study

All data processing and statistical analysis were performed using R version 4.1.1. All figures were generated using ggplot2 [18] version 3.3.6. Linear mixed-effect models were performed using lmerTest version 3.1-3.

Code availability

All code, scripts, and processed HMP2:IBDMDB data to reproduce the analysis in this study have been made freely available on Github: https://github.com/zji90/microbiome_taxacontrol. A static version of the code can be found on Zenodo: https://doi.org/10.5281/zenodo.7222142.

留言 (0)

沒有登入
gif