Structure-adaptive canonical correlation analysis for microbiome multi-omics data

1 Introduction

The human microbiome is the collection of microorganisms and their genetic makeup associated with the human body. It plays a critical role in human health and disease ranging from gastrointestinal diseases to various cancers (Sepich-Poore et al., 2021). To gain more mechanistic insights, multi-omics approaches have been increasingly employed in microbiome studies to elucidate the intricate interplay between the environment, the human microbiome and the host at different molecular levels (Hasin et al., 2017; Lloyd-Price et al., 2019). Although many multi-omics datasets have been generated in the past few years, it is unclear how to integrate them efficiently. One useful tool for multi-omics data integration is to perform canonical correlation analysis (CCA). CCA, due to Hotelling (1936), connects two sets of variables by finding a linear combination of variables that maximally correlate. However, the standard CCA fails when the sample size is strictly less than the number of variables as one can find meaningless solutions with correlations equal to one. Also, it does not perform variable selection and hence lacks interpretability. To circumvent these problems, sparse CCA (sCCA) has been proposed, aiming to find pairs of sparse canonical directions by imposing sparsity penalty. The first sCCA algorithm was presented by Parkhomenko et al. (2007), which, however, lacks exact criterion and biconvexity. Witten et al. (2009) applied the penalized matrix decomposition to cross-product matrix and yielded a straightforward formulation for sCCA. Some closely related methods include Parkhomenko et al., 2009; Lê Cao et al., 2009. Hardoon and Shawe-Taylor (2011) expressed the sCCA model as a primal-dual Rayleigh quotient, which takes the primal representation and kernel representation as the first view and second view, respectively. Chu et al. (2013) reformed CCA into a trace maximization problem and computed the sparse solution by the linearized Bregman method. To exploit the potential structural information among features, various forms of structure-adaptive sCCA have been proposed (Lin et al., 2013; Chen et al., 2012; Mohammadi-Nejad et al., 2017). In particular, Chen et al. (2013) proposed the structure-constrained sCCA (ssCCA) to exploit the phylogenetic structure in microbiome data.

Advances in next-generation sequencing technologies have enabled the direct sequencing of microbial DNA to determine microbiome composition, using either targeted or shotgun approaches (Wensel et al., 2022). The resulting microbiome data is typically in the form of a count table that records the frequencies of detected taxa in specific samples. However, due to the complexities inherent in the sequencing process, the total count for a sample reflects the sequencing effort rather than the actual microbial load at the sampling site. Consequently, microbiome data are inherently compositional, meaning that we only have information about the relative abundances of taxa. This compositionality presents significant challenges in the statistical analysis of microbiome data. A change in the (absolute) abundance of one taxon can lead to apparent changes in the relative abundances of all other taxa, complicating the identification of the actual causal taxa (Yang and Chen, 2022). The compositional nature also renders many standard multivariate statistical models inappropriate or inapplicable (Aitchison, 1982). Many efforts have been made to address the compositionality in different contexts of microbiome data analysis. For example, Friedman and Alm (2012) developed an iterative procedure named SparCC that allows inference of correlations for compositional data by assuming that the number of taxa is large and the true correlation network is sparse. Lin et al. (2014) dealt with the variable selection in regression with compositional covariates. Jiang et al. (2019) addressed zero inflation and detected pairs of associated compositional and non-compositional covariates using a Bayesian zero-inflated negative binomial regression model. However, existing CCA methods including ssCCA could not address the compositional effects, potentially reducing its precision in recovering relevant taxa.

We propose a new sCCA framework for integrating microbiome data with other high-dimensional omics data. The framework specifically addresses the compositional nature of the microbiome data. It also allows integrating prior structure information by imposing a “soft” constraint on the coefficients through varying penalization strength. As a result, the method provides significant improvement when the structure is informative while maintaining robustness against a misspecified structure. The developed tool aims to be an important resource for investigators to understand the interplay between the microbiome and host, decipher the molecular mechanisms underlying microbiome-disease association, and identify potential microbial targets for intervention.

This paper is organized as follows. Section 2 introduces the new sCCA framework for integrating microbiome compositional data with (non-)compositional high-dimensional data. Section 3 extends the new framework to incorporate additional prior structural information. In Section 4, we conduct numerical simulations to demonstrate the effectiveness of our proposed methods. Section 5 applies the proposed methods in a real microbiome study to investigate the association between gut bacteria and its metabolic output. We conclude with a discussion in Section 6.

2 Compositional sCCA2.1 Formulation

Let us consider two random vectors X=(X1,…,Xp)⊤ and Y=(Y1,…,Yq)⊤, where X contains the composition of p taxa and Y is a q-dimensional vector of non-compositional covariates. The nature of the composition makes X lie in a (p−1)-dimensional positive simplex. To address the compositionality, Aitchison and Bacon-Shone (1984) proposed applying the log-ratio transformation to compositional covariates resulting in Z/p=(log(X1/Xp),…,log(Xp−1/Xp)), where Xp is chosen as the reference component. CCA for compositional data can be formulated to find canonical coefficients a=(a1,…,aq)⊤ and b−p=(b1,…,bp−1)⊤ so that the correlation between a⊤Y and b−p⊤Z/p is maximized. Note that b−p⊤Z/p=∑j=1p−1bj⁡log(Xj/Xp)=∑j=1pbj⁡log(Xj) with bp=−∑j=1p−1bj. To avoid the choice of a reference component, we can write the term b−p⊤Z/p in a symmetric form by noticing that b−p⊤Z/p=b⊤Z, where Z=(log(X1),…,log(Xp)) and b=(b1,…,bp) with

b∈Bp≔c=c1,…,cp⊤:∑j=1pcj=0.

Therefore, the compositional CCA aims to find ã and b̃ such that

ã,b̃=argmaxa∈Rq,b∈BpCorra⊤Y,b⊤Z=argmaxa∈Rq,b∈BpCova⊤Y,b⊤ZVara⊤Yvarb⊤Z.

When the dimensions p and q are high (as compared to the sample size), regularization is required to encourage sparsity and to obtain a unique solution to the optimization problem. We let ΣYZ=Cov(Y,Z), ΣY=Cov(Y,Y) and ΣZ=Cov(Z,Z). Define the weighted l1 norm based on a vector of non-negative weights w=(w1,…,wp) for a vector b as ‖b‖1,w=∑j=1pwj|bj|. The compositional sCCA problem can then be formulated as

maxa∈Rq,b∈Rpa⊤ΣYZbs.t.a⊤ΣYa≤1, ‖a‖1≤Ca, b⊤ΣZb≤1, ‖b‖1,w≤Cb, b∈Bp.(1)

Here Ca,Cb>0 are some positive tuning parameters that control the global shrinkage level. The weight wj allows different penalization strengths according to the data or prior structure information. We will elaborate it in Section 2.3.

It has been shown that in high dimensions, treating the covariance matrix as diagonal can yield good results. Following the same strategy adopted by many of the existing high-dimensional CCA algorithms (e.g., Witten et al., 2009), we substitute in the identity matrix for ΣY and ΣZ in the CCA formulation (Equation 1). Moreover, we write the (weighted) l1 constraints on a and b in the Lagrangian form. Given a set of n samples i=1n, let Σ̂YZ be the sample cross-covariance between Y and Z. We formulate the feasible CCA problem as

mina∈Rq,b∈Rp−a⊤Σ̂YZb+λa‖a‖1+λb‖b‖1,ws.t.‖a‖2≤1, ‖b‖2≤1, b∈Bp,

which can be solved by iteratively optimizing the objective function with respect to one parameter while fixing the other parameter. Specifically, we have the following two updating steps.

1. Update b: Fix a(t) and update b through

bt←argminb∈Rp−at⊤Σ̂YZb+λb‖b‖1,w  s.t.  ‖b‖2≤1, b∈Bp.(2)

2. Update a: Fix b(t) and update a through

at+1←argmina∈Rq−a⊤Σ̂YZbt+λa‖a‖1  s.t.  ‖a‖2≤1.(3)2.2 Algorithm

In this section, we discuss the updates in Equations 2, 3. Define the operator

gh,λ,w=argminb∈Bp12‖h−b‖22+λ‖b‖1,w.

By exploring the Karush-Kuhn-Tuchker conditions, we obtain the following result.

Proposition 2.1. Set b̆=argminb∈Bp−(a(t))⊤Σ̂YZb+λb‖b‖1,w. The solution to (2) is given by

bt=1,2,…,p  and  d∈1,2,…,D},

where D represents the number of groups and CU denotes an upper bound on the weights.

3.2 Structure-adaptive compositional sCCA

Following Pramanik and Zhang (2020), we impose a penalty term on the weights and propose an algorithm to jointly estimate weights and parameters. Specifically, we define

hwj;γ=wj1−1γ1−1γ,if 0<γ<1,wj,if γ=1.

We estimate (a,b) and w jointly by solving the following problem

mina∈Rq,b∈Rp,w∈M−a⊤Σ̂YZb+λa‖a‖1+λb∑j=1pwj|bj|−log⁡hwj,γs.t.‖a‖2≤1, ‖b‖2≤1, b∈Bp.

The design of the function h is to reduce our method to the classic (iterative) adaptive Lasso when there is no external information. The readers are referred to Pramanik and Zhang (2020) for more discussions on the motivation.

Next we introduce the algorithm to solve the above problem. We focus on the update for w as the updates for a and b remain the same as in Section 2.2. In particular, we update w through

wt+1←argminw∈M∑j=1pwj|bjt|−log⁡hwj,γ.

When M=MGroup, it is straightforward to verify that

wjt+1=wj|bjt|−log⁡hwj,γ.

4. Update a: Fix b(t) and update a through

at+1←SΣ̂YZbt,λaSΣ̂YZbt,λa2.(10)

5. Iterate Steps 2-4 until convergence.

4 Simulation studies

In this section, we evaluate the finite sample performance of the proposed methods through numerical simulations.

4.1 Compositional data versus non-compositional data

We first consider the CCA problem between compositional data (i.e., microbiome data) and non-compositional data (e.g., metabolomics data) following a similar setting considered in Chen et al. (2013). To capture the dependence between the two sets of high-dimensional data, we use a latent variable model to generate the compositional variables (log scale) and non-compositional variables (original scale), where the dependence between these two sets of variables is governed by a latent variable ν. Specifically, we assume that

logXi=νiωX+εX,i,Yi=νiωY+εY,i,

where νi∼N(0,σν2) and εX,i,εY,i follow N(0p,σε2Ip×p) and N(0q,σε2Iq×q), respectively. The coefficients ωX∈Rp and ωY∈Rq control the relative contributions of individual variables to the overall association. The ratio σν/σε determines the overall association strength between log(X) and Y, with a larger value indicating stronger association. For the dimensions, we set (p,q)=(100,100),(200,200). We consider two setups for ωX.

S1 ωX=0.8510×(1,1,1,1,1,1,1,1,1−9,0p−10)⊤;

S2 ωX=0.856×(1,1,1,0,0,1,1,−5,0,0,0p−10)⊤;

where 1p⊤ωX=0 for both setups. The constraints imply that the association between X and Y is mediated through the log ratios for X. We focus on the group structure (i.e., M=Mgroup) and assume that the p taxa form 20 groups, with the group size equal to 5 for p=100 and equal to 10 for p=200. For example, in Setup S2 with p=100, the grouping is defined as

ωX=0.856×(1,1,1,0,0⏟Group 1,1,1,−5,0,0⏟Group 2,0p−10⏟Groups 3-20)⊤

The first two groups contain both zero and nonzero entries reflecting the fact that the external structure information is imperfect and noisy. We set ωY=0.85×(0.08,0.084,0.089,…,0.12,0q−10⊤)⊤. Next, we fix σε=1 and vary σν within to control the strength of the canonical correlation. We report the true positive rate (TPR), false positive rate (FPR), Matthew’s correlation coefficient (MCC), and Precision to measure the performance of different methods. Here,

TPR=TPTP+FN,FPR=FPFP+TN,MCC=TP×TN−FP×FNTP+FPTP+FNTN+FPTN+FN,Precision=TPFP+TP,

where TP, FP, TN, and FN represent the true positives, false positives, true negatives, and false negatives, respectively. The TPR, FPR, MCC, and Precision are computed by averaging over 100 simulation replicates. Denote the estimated canonical coefficients by â and b̂. Their estimation targets are ωX/‖ωX‖2 and ωY/‖ωY‖2, respectively, where this normalization is to ensure comparability. The estimation accuracy is evaluated using the root mean square error (RMSE).

We compare the performance of the following four methods.

1. sCCA: sCCA without considering the compositional effect;

2. C-sCCA: compositional sCCA;

3. AC-sCCA: adaptive compositional sCCA, i.e., M=[0,CU]p.

4. SAC-sCCA: structure adaptive compositional sCCA, i.e., M=MGroup.

For AC-sCCA and SAC-sCCA, we also apply adaptive weights on a with M=[0,CU]p in implementation. The value of CU is set to 105.

Table 1 summarizes the results for the above four methods when fixing σν=4. For both Setups S1 and S2, C-sCCA outperforms sCCA in terms of all four measures, especially in reducing the false positive rates and increasing the precision in identifying relevant compositional components, demonstrating the advantage of taking into account the compositional constraint. Compared to the first two methods, AC-sCCA and SAC-sCCA further reduce the FPR and thus lead to higher MCC in estimating a and b. For identifying b in Setup S1, SAC-sCCA outperforms the other three methods by exhibiting higher TPR, nearly zero FPR, and thus higher MCC because of incorporating grouping information. Figure 1 is in general consistent with these findings. As association strength increases, the TPR, MCC, and Precision of C-sCCA, AC-sCCA, and SAC-sCCA increase, whereas their FPRs show a declining trend. When σν≥3, the precision and FPR in estimating b of sCCA becomes worse as association strength increases, which means sCCA identifies more true variables at the cost of including more false variables. Figure 2 presents the RMSE in estimating the canonical coefficients, which decreases as the association strength σν increases. By accounting for the compositional effect, the C-sCCA, AC-sCCA, and SAC-sCCA outperform sCCA, with SAC-sCCA providing the most accurate estimation. This demonstrates that methods accounting for the compositional nature yield more accurate estimations than those that do not, and considering structural information can further enhance performance.

www.frontiersin.org

Table 1. Performance of sCCA for the association between compositional data and non-compositional data (σν=4). Numbers in the parentheses represent the corresponding standard deviations.

www.frontiersin.org

Figure 1. TPR, FPR, MCC, and Precision of sCCA for the association between compositional data and non-compositional data across association strength. Here, the range of σν is . Line with solid blue squares: sCCA; line with open red squares: C-sCCA; line with green crosses: AC-sCCA; line with open purple triangles: SAC-sCCA.

留言 (0)

沒有登入
gif