Block coordinate descent algorithm improves variable selection and estimation in error‐in‐variables regression

1 INTRODUCTION

Modern medical research increasingly builds on the modeling of high-dimensional data. Sparse regression methods, such as the Lasso (Tibshirani, 1996), Generalized Lasso (Tibshirani et al., 2011), Grouped Lasso (Yuan & Lin, 2006), adaptive Lasso (Zou, 2006), and Elastic Net (Zou & Hastie, 2005), have been widely applied to perform estimation and variable selection simultaneously. However, high-dimensional data sets often contain less precise measurements of phenotypes than those that might be available in smaller studies. For example, large biobanks often use billing codes from electronic health care records as proxy measures for a physician-made diagnosis. It is well known that applying naïve regression methods to predictor variables that are measured with error can lead to attenuation of effect estimates (Chesher, 1991; Rosenbaum et al., 2010). Similarly, questionnaire data from large cohorts often contain many missing values (Obermeyer & Emanuel, 2016). Removing subjects who are missing at least one measurement can easily eliminate most subjects when data are high dimensional.

Many error-in-variables solutions have been proposed. In addition to simple complete case analysis and pairwise deletion, more rigorous methods, such as expectation-maximization algorithms (Dempster et al., 1977; Schafer, 1997), multiple imputation (Buuren, 2011), and full information maximum likelihood estimation (Enders, 2001; Friedman et al., 2010), have been developed, but these computationally expensive methods cannot be easily extended to high-dimensional settings. In contrast, Loh and Wainwright (2011) developed a penalized method for error-in-variables regression. Within a properly chosen constraint radius, their projected gradient descent algorithm converges to a small neighborhood of the set of all global minimizers, and is promising for variable selection in a high-dimensional setting (Loh & Wainwright, 2011). Nevertheless, a proper choice of this constraint radius depends on knowledge of the very parameters to be estimated (Datta & Zou, 2017). Hence, Datta and Zou (2017) developed the Convex Conditioned Lasso (CoCoLasso), which does not require prior knowledge of the unknown parameters. The CoCoLasso algorithm corrects for both additive measurement error and missing data, and showed a substantial increase in estimation accuracy and stability compared with the naïve Lasso.

However, when the data are only partially corrupted (i.e., some features are free of measurement error), the CoCoLasso still treats all features identically, limiting the applicability of the approach for large feature sets because of the intensive matrix computations required. Such partial corruption is common in genetic epidemiology studies based on large genotyped cohorts, where the genotypes are measured accurately by highly reliable high-throughput sequencing or microarrays, but lifestyle or clinical risk factors (except for age and sex) are measured with various types of error. For instance, in the UK Biobank, one of the largest health registries to date, hundreds of thousands of single nucleotide polymorphisms (SNPs) were measured accurately for each participant with little missing data, but most covariates based on questionnaires or health care records contained missing values (Bycroft et al., 2018). Samples with such corrupted covariates are usually discarded, potentially leading to underuse of the available information. Therefore, inspired by the CoCoLasso, we propose here a Block coordinate Descent Convex Conditioned Lasso (BDCoCoLasso) algorithm that makes higher-dimensional error-in-variables regression feasible by optimizing the parameter estimates for the uncorrupted and corrupted features separately, in an iterative manner. Our proposal requires a carefully calibrated cross-validation strategy. Furthermore, we build the smoothly clipped absolute deviation (SCAD) penalty (Fan & Li, 2001) into the new algorithm. In simulations, we confirm that our algorithm provides results equivalent to the CoCoLasso, and demonstrates better performance than the naïve Lasso, with increasing benefit as the dimension increases. Although this approach will still encounter computational limitations when there are many corrupted features, we substantially enlarge the scale of problems that can be analyzed with an error-in-variables approach. We demonstrate the potential practical utility of the BDCoCoLasso by deriving covariate-adjusted genetic risk scores to predict body mass index, bone mineral density, and lifespan in a subset of the UK Biobank (Bycroft et al., 2018).

The rest of the manuscript is organized as follows. In Section 2, we briefly review the CoCoLasso method, and then we describe our new version that allows blocks of features with different corruption states—BDCoCoLasso. We describe simulation settings and results in Section 3. Section 4 illustrates the performance of our algorithm on the UK Biobank data.

2 METHODS

In this section, we first review the principles of the CoCoLasso. We then seek to improve its computational efficiency and stability when the covariate matrix is partially corrupted or when different types of measurement error exist simultaneously, by implementing a block coordinate descent algorithm (Rosenbaum et al., 2013). We also implement a SCAD penalty (Fan & Li, 2001) to avoid overshrinkage when some features have strong effects.

2.1 The CoCoLasso

Suppose a true covariate matrix $X \in \mathbb{R}^{n \times p}$, with $n$ observations and $p$ features, is measured as a corrupted covariate matrix $Z$, where the measurement error can be:

1. Additive error: $Z = X + A$, where $A$ represents additive error;

2. Missing data: for each entry, either $Z_{ij} = X_{ij}$ or $Z_{ij} = \mathrm{NA}$, that is, each entry is observed correctly or is missing.

It has been shown that using a classical Lasso with an objective function taking the form
$$\hat{\beta} = \operatorname*{arg\,min}_{\beta}\; \frac{1}{2n}\lVert y - Z\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1 \tag{1}$$
can lead to biased estimation of $\beta$ (Datta & Zou, 2017; Loh & Wainwright, 2011), where $y \in \mathbb{R}^n$ is the continuous response. Alternatively, this objective function can be reformulated as
$$\hat{\beta} = \operatorname*{arg\,min}_{\beta}\; \frac{1}{2}\beta^{\top}\hat{\Sigma}\beta - \hat{\rho}^{\top}\beta + \lambda \lVert \beta \rVert_1, \tag{2}$$
where $\hat{\Sigma} = \frac{1}{n}Z^{\top}Z$ and $\hat{\rho} = \frac{1}{n}Z^{\top}y$. Loh and Wainwright (2011) proposed that $\hat{\Sigma}$ and $\hat{\rho}$ could be replaced by their unbiased estimators $\tilde{\Sigma}$ and $\tilde{\rho}$ such that $E[\tilde{\Sigma}] = \frac{1}{n}E[X^{\top}X]$ and $E[\tilde{\rho}] = \frac{1}{n}E[X^{\top}y]$. However, since the new covariance matrix $\tilde{\Sigma}$ can have negative eigenvalues, particularly when the covariate matrix is high dimensional ($p > n$), the new optimization problem with the objective function
$$\hat{\beta} = \operatorname*{arg\,min}_{\beta}\; \frac{1}{2}\beta^{\top}\tilde{\Sigma}\beta - \tilde{\rho}^{\top}\beta + \lambda \lVert \beta \rVert_1 \tag{3}$$
is not necessarily convex. Loh and Wainwright (2011) showed that by setting certain constraints on $\beta$, the problem could become convex, yet it is necessary to have prior knowledge of $\beta$ to find a suitable constraint. Datta and Zou (2017) therefore proposed the CoCoLasso, which adopts the adapted objective function but finds the nearest positive semidefinite matrix for $\tilde{\Sigma}$:
$$(\tilde{\Sigma})_+ = \operatorname*{arg\,min}_{\Sigma \succeq 0}\; \lVert \Sigma - \tilde{\Sigma} \rVert_{\max}, \tag{4}$$
where $\Sigma \succeq 0$ denotes positive semidefiniteness. Here, the elementwise maximum norm for a matrix $A$ is defined as $\lVert A \rVert_{\max} = \max_{i,j} \lvert A_{ij} \rvert$. This nearest positive semidefinite matrix can then be found with an alternating direction method of multipliers (ADMM) algorithm (Boyd et al., 2011).

2.2 Two-block coordinate descent for a partially corrupted covariate matrix

The CoCoLasso enables error-in-variables regression in general, but when the feature set is large, the required matrix calculations are demanding. Implementing a block coordinate descent can substantially improve computational efficiency when the covariate matrix is only partially corrupted. Specifically, projection of the covariance matrix onto the positive semidefinite cone, that is, computing $(\tilde{\Sigma})_+$, within the CoCoLasso requires multiple operations on matrices of dimension $p \times p$, which are of order $O(p^3)$. In contrast, our BDCoCoLasso requires these operations only on the corrupted subblocks of the covariance matrix, which are anticipated to be much smaller.
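To fix ideas, the following minimal numpy sketch builds the unbiased surrogates of Section 2.1 for the additive error setting and projects a surrogate (sub)covariance matrix onto the positive semidefinite cone. For brevity the projection here uses the Frobenius norm (eigenvalue truncation) rather than the elementwise maximum norm that the CoCoLasso solves with ADMM; the function names are illustrative, not part of any released package.

```python
import numpy as np

def unbiased_surrogates_additive(Z, y, Sigma_A):
    """Unbiased surrogates of (1/n) X'X and (1/n) X'y under additive error.

    Assumes the error A has mean zero, is independent of X and y, and has
    known covariance Sigma_A, so that E[Z'Z/n] = Sigma_X + Sigma_A while
    E[Z'y/n] = E[X'y/n]."""
    n = Z.shape[0]
    Sigma_tilde = Z.T @ Z / n - Sigma_A
    rho_tilde = Z.T @ y / n
    return Sigma_tilde, rho_tilde

def project_psd(S):
    """Project a symmetric matrix onto the PSD cone by eigenvalue truncation
    (Frobenius-norm projection). Note: the CoCoLasso instead computes the
    nearest PSD matrix in the max norm via ADMM; this simpler projection
    keeps the sketch short."""
    w, V = np.linalg.eigh((S + S.T) / 2)
    return (V * np.clip(w, 0.0, None)) @ V.T
```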
Suppose the true covariate matrix $X = [X_1, X_2]$ is now measured as $Z = [Z_1, Z_2]$, where $Z_1 = X_1 \in \mathbb{R}^{n \times p_1}$ is measured without error, and $Z_2 \in \mathbb{R}^{n \times p_2}$ is measured with error. We then need to estimate $\beta = (\beta_1^{\top}, \beta_2^{\top})^{\top}$, where $\beta_1$ is the coefficient vector for the noncorrupted covariates, and $\beta_2$ is the coefficient vector for the corrupted covariates. We derive the objective function as
$$\hat{\beta} = \operatorname*{arg\,min}_{\beta_1,\,\beta_2}\; \frac{1}{2} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}^{\!\top} \begin{pmatrix} \hat{\Sigma}_{11} & \tilde{\Sigma}_{12} \\ \tilde{\Sigma}_{21} & \tilde{\Sigma}_{22} \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} - \tilde{\rho}^{\top} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} + \lambda \bigl( \lVert \beta_1 \rVert_1 + \lVert \beta_2 \rVert_1 \bigr). \tag{5}$$
We conceive a two-step block coordinate descent algorithm based on (2)–(4):

1. We first consider $\beta_2$ fixed, and we solve
$$\hat{\beta}_1 = \operatorname*{arg\,min}_{\beta_1}\; \frac{1}{2}\beta_1^{\top}\hat{\Sigma}_{11}\beta_1 - \tilde{\rho}_1^{\top}\beta_1 + \lambda \lVert \beta_1 \rVert_1, \tag{6}$$
where $\hat{\Sigma}_{11} = \frac{1}{n}Z_1^{\top}Z_1$ and $\tilde{\rho}_1 = \frac{1}{n}Z_1^{\top}y - \tilde{\Sigma}_{12}\beta_2$. $\tilde{\Sigma}_{12}$ is defined as follows:

(a) in the additive error setting, $\tilde{\Sigma}_{12} = \frac{1}{n}Z_1^{\top}Z_2$;

(b) in the missing data setting, we define a ratio matrix $R$ indicating the presence or absence of data as
$$R_{ij} = \begin{cases} n_{ij}/n, & i \ne j, \\ n_i/n, & i = j, \end{cases}$$
where $n_{ij}$ is the number of samples for which both the $i$th and the $j$th features are measured and $n_i$ is the number of samples for which the $i$th feature is measured. Note that $R$ is used to correct for measurement error in the corrupted covariates, with missing entries of $Z$ imputed by zero. We then have $\tilde{\Sigma}_{12} = \bigl(\frac{1}{n}Z_1^{\top}Z_2\bigr) \oslash R_{12}$, that is, $(\tilde{\Sigma}_{12})_{ij} = \frac{1}{n R_{ij}} \sum_{k=1}^{n} Z_{ki} Z_{kj}$ for $1 \le i \le p_1$ and $p_1 < j \le p_1 + p_2$.

2. We next consider $\beta_1$ fixed, at the value optimized in the previous step, and we solve
$$\hat{\beta}_2 = \operatorname*{arg\,min}_{\beta_2}\; \frac{1}{2}\beta_2^{\top}(\tilde{\Sigma}_{22})_+\beta_2 - \tilde{\rho}_2^{\top}\beta_2 + \lambda \lVert \beta_2 \rVert_1, \tag{7}$$
where $\tilde{\rho}_2$ is an unbiased surrogate of $\frac{1}{n}X_2^{\top}(y - X_1\beta_1)$ and $(\tilde{\Sigma}_{22})_+$ is the nearest positive semidefinite matrix to $\tilde{\Sigma}_{22}$. For $\tilde{\Sigma}_{22}$ and $\tilde{\rho}_2$,

(a) in the additive error setting, $\tilde{\Sigma}_{22} = \frac{1}{n}Z_2^{\top}Z_2 - \Sigma_A$ and $\tilde{\rho}_2 = \frac{1}{n}Z_2^{\top}(y - Z_1\beta_1)$, where $\Sigma_A$ is a known variance–covariance matrix for the features measured with additive error;

(b) in the missing data setting, $\tilde{\Sigma}_{22} = \bigl(\frac{1}{n}Z_2^{\top}Z_2\bigr) \oslash R_{22}$ and $\tilde{\rho}_2 = \bigl(\frac{1}{n}Z_2^{\top}(y - Z_1\beta_1)\bigr) \oslash r_2$, where $r_2$ collects the per-feature observation rates $n_j/n$. Here, $\oslash$ represents elementwise division.
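The following sketch assembles these block surrogates for the missing-data setting, assuming missing entries of $Z_2$ are zero-imputed and observedness is tracked in a 0/1 mask. The formulation of the ratio matrices follows the description above; the names and interfaces are illustrative.

```python
import numpy as np

def block_surrogates_missing(Z1, Z2, M2, y, beta1, beta2):
    """Block surrogates in the missing-data setting (assumed formulation).

    Z1: (n, p1) error-free block; Z2: (n, p2) block with missing entries
    zero-imputed; M2: (n, p2) 0/1 mask of observed entries of Z2.
    Assumes every feature (pair) is observed in at least one sample."""
    n = Z1.shape[0]
    M1 = np.ones_like(Z1)                  # block 1 is fully observed
    R12 = (M1.T @ M2) / n                  # co-observation rates (p1 x p2)
    R22 = (M2.T @ M2) / n                  # co-observation rates (p2 x p2)
    r2 = M2.mean(axis=0)                   # per-feature observation rates

    Sigma12 = (Z1.T @ Z2 / n) / R12        # surrogate of (1/n) X1'X2
    rho1 = Z1.T @ y / n - Sigma12 @ beta2  # target for step 1, Eq. (6)

    Sigma22 = (Z2.T @ Z2 / n) / R22        # surrogate of (1/n) X2'X2
    rho2 = (Z2.T @ (y - Z1 @ beta1) / n) / r2  # target for step 2, Eq. (7)
    return Sigma12, rho1, Sigma22, rho2
```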

We then alternate between the two steps until convergence. Following arguments similar to those in Datta and Zou (2017), we can ensure that both marginal problems are equivalent to a Lasso problem. The complete optimization procedure is described in Algorithm 1.
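Because each marginal problem is a Lasso in disguise, the alternation can be sketched with a plain coordinate descent solver for the generic quadratic objective $\frac{1}{2}\beta^{\top}\Sigma\beta - \rho^{\top}\beta + \lambda\lVert\beta\rVert_1$. The sketch below reuses the `project_psd` and `block_surrogates_missing` helpers sketched above; convergence checks, warm starts, and tuning are omitted, and the function names are again illustrative.

```python
import numpy as np

def lasso_quadratic(Sigma, rho, lam, beta0, n_iter=200):
    """Coordinate descent for min_b 0.5 b'Sigma b - rho'b + lam * ||b||_1,
    valid when Sigma is PSD; the small floor guards a zero diagonal."""
    beta = beta0.copy()
    for _ in range(n_iter):
        for j in range(len(beta)):
            # partial residual excluding coordinate j
            r_j = rho[j] - Sigma[j] @ beta + Sigma[j, j] * beta[j]
            beta[j] = (np.sign(r_j) * max(abs(r_j) - lam, 0.0)
                       / max(Sigma[j, j], 1e-12))
    return beta

def bdcocolasso_missing(Z1, Z2, M2, y, lam, n_outer=20):
    """Sketch of the two-block alternation for the missing-data setting."""
    n, p1 = Z1.shape
    p2 = Z2.shape[1]
    beta1, beta2 = np.zeros(p1), np.zeros(p2)
    Sigma11 = Z1.T @ Z1 / n  # uncorrupted block: no correction needed
    for _ in range(n_outer):
        _, rho1, S22, rho2 = block_surrogates_missing(Z1, Z2, M2, y,
                                                      beta1, beta2)
        beta1 = lasso_quadratic(Sigma11, rho1, lam, beta1)           # Eq. (6)
        beta2 = lasso_quadratic(project_psd(S22), rho2, lam, beta2)  # Eq. (7)
    return beta1, beta2
```

Only the $p_2 \times p_2$ block is ever projected onto the positive semidefinite cone, which is the source of the computational savings described above.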

Of note, the estimation problem can be defined as finding the global solution for $\beta = (\beta_1^{\top}, \beta_2^{\top})^{\top}$, and our two-step procedure can be seen as equivalent to replacing $\tilde{\Sigma}_{22}$ by its nearest positive semidefinite matrix, $(\tilde{\Sigma}_{22})_+$, in (5). This substitution might not lead to a jointly convex problem. However, since both marginal problems (6) and (7) are convex, and both have suitable properties (i.e., both are strongly convex and smooth), our generalized alternating minimization algorithm can guarantee global minimization (Jain & Kar, 2017; Kelley, 1999).

Algorithm 1 Two-block coordinate descent

Input: $Z = [Z_1, Z_2]$, $y$, error type
Initialize: $\beta_2^{(0)} = 0$; $k = 0$
while not converged do
    if error = missing then $\tilde{\Sigma}_{12} = \bigl(\frac{1}{n}Z_1^{\top}Z_2\bigr) \oslash R_{12}$ end if
    if error = additive then $\tilde{\Sigma}_{12} = \frac{1}{n}Z_1^{\top}Z_2$ end if
    $\tilde{\rho}_1 = \frac{1}{n}Z_1^{\top}y - \tilde{\Sigma}_{12}\beta_2^{(k)}$
    $\beta_1^{(k+1)} = \operatorname{arg\,min}_{\beta_1} \frac{1}{2}\beta_1^{\top}\hat{\Sigma}_{11}\beta_1 - \tilde{\rho}_1^{\top}\beta_1 + \lambda\lVert\beta_1\rVert_1$
    if error = missing then $\tilde{\Sigma}_{22} = \bigl(\frac{1}{n}Z_2^{\top}Z_2\bigr) \oslash R_{22}$; $\tilde{\rho}_2 = \bigl(\frac{1}{n}Z_2^{\top}(y - Z_1\beta_1^{(k+1)})\bigr) \oslash r_2$ end if
    if error = additive then $\tilde{\Sigma}_{22} = \frac{1}{n}Z_2^{\top}Z_2 - \Sigma_A$; $\tilde{\rho}_2 = \frac{1}{n}Z_2^{\top}(y - Z_1\beta_1^{(k+1)})$ end if
    $\beta_2^{(k+1)} = \operatorname{arg\,min}_{\beta_2} \frac{1}{2}\beta_2^{\top}(\tilde{\Sigma}_{22})_+\beta_2 - \tilde{\rho}_2^{\top}\beta_2 + \lambda\lVert\beta_2\rVert_1$
    Update $k \leftarrow k + 1$
end while
Output: $\hat{\beta} = (\beta_1^{(k)\top}, \beta_2^{(k)\top})^{\top}$

Cross-validation to choose the penalization parameter, $\lambda$, must be appropriately implemented for the block implementation. Therefore, extending the concept in the CoCoLasso (Datta & Zou, 2017), a $K$-fold cross-validated $\hat{\lambda}$ can be obtained by minimizing the total cross-validation error while correcting for the two blocks separately,
$$\hat{\lambda} = \operatorname*{arg\,min}_{\lambda} \sum_{k=1}^{K} \left[ \frac{1}{2}\,\hat{\beta}^{(-k)}(\lambda)^{\top}\,\tilde{\Sigma}^{(k)}\,\hat{\beta}^{(-k)}(\lambda) - \tilde{\rho}^{(k)\top}\,\hat{\beta}^{(-k)}(\lambda) \right]. \tag{8}$$
Here, $\hat{\beta}_1^{(-k)}$ and $\hat{\beta}_2^{(-k)}$ are estimated as described above for $\beta_1$ and $\beta_2$ based on the data not in the $k$th fold; $\tilde{\Sigma}^{(k)}$ and $\tilde{\rho}^{(k)}$ are derived as described above for $\tilde{\Sigma}$ and $\tilde{\rho}$ based on the data in the $k$th fold. $\tilde{\rho}^{(k)}$ is an unbiased surrogate of $\frac{1}{n_k}X^{(k)\top}y^{(k)}$, where $X^{(k)}$ and $y^{(k)}$ are the covariates and response in the $k$th fold. More specifically,

1. in the additive error setting, where the additive error is centered to have zero mean, $\tilde{\rho}^{(k)} = \frac{1}{n_k}Z^{(k)\top}y^{(k)}$;

2. in the missing data setting, $\tilde{\rho}^{(k)} = \bigl(\frac{1}{n_k}Z^{(k)\top}y^{(k)}\bigr) \oslash r^{(k)}$, where $r^{(k)}_j = n^{(k)}_j / n_k$ is the observation rate of the $j$th feature within the $k$th fold.
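To make Eq. (8) concrete, one fold's corrected error can be sketched as below; `Sigma_k` and `rho_k` denote the fold-$k$ surrogates and `beta_minus_k` the estimate fitted without fold $k$ (all names illustrative).

```python
import numpy as np

def cv_error_fold(Sigma_k, rho_k, beta_minus_k):
    """One fold's contribution to Eq. (8): the corrected loss
    0.5 b' Sigma^(k) b - rho^(k)' b evaluated at beta^(-k)."""
    b = np.asarray(beta_minus_k)
    return 0.5 * b @ Sigma_k @ b - rho_k @ b

# Illustrative usage: choose the lambda minimizing the summed fold errors,
# with (Sigma_k, rho_k) built from fold k and beta fitted without fold k.
# cv[lam] = sum(cv_error_fold(S[k], r[k], betas[lam][k]) for k in range(K))
```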

Although either an additive error setting or a missing data setting can be approached in the aforementioned two-step manner, data often contain variables subject to both types of error. Therefore, we further propose a generalized algorithm that copes with a mixed error setting, described in the Supporting Information.

2.3 Implementation of a SCAD penalty

For potential applications in which the causal variables are few but have large effect sizes, the Lasso penalty may lead to overshrinkage (Fan & Li, 2001). To resolve this potential issue, we have also implemented the nonconcave SCAD penalty (Fan & Li, 2001). The SCAD penalty is given by
$$p_{\lambda}(\beta_j) = \begin{cases} \lambda\lvert\beta_j\rvert, & \lvert\beta_j\rvert \le \lambda, \\ \dfrac{2a\lambda\lvert\beta_j\rvert - \beta_j^2 - \lambda^2}{2(a-1)}, & \lambda < \lvert\beta_j\rvert \le a\lambda, \\ \dfrac{\lambda^2(a+1)}{2}, & \lvert\beta_j\rvert > a\lambda, \end{cases} \tag{9}$$
and its first derivative with respect to $\lvert\beta_j\rvert$ is given by
$$p'_{\lambda}(\lvert\beta_j\rvert) = \lambda \left\{ I(\lvert\beta_j\rvert \le \lambda) + \frac{(a\lambda - \lvert\beta_j\rvert)_+}{(a-1)\lambda}\, I(\lvert\beta_j\rvert > \lambda) \right\}. \tag{10}$$
Substituting the SCAD penalty for the regular $\ell_1$ penalty used in the Lasso retains large coefficients while shrinking smaller coefficients to zero. Thus, the SCAD penalty is able to produce a sparse solution and more accurate estimation for large coefficients. Following Zou and Li (2008), we implement a local linear approximation of the penalization function:
$$p_{\lambda}(\lvert\beta_j\rvert) \approx p_{\lambda}\bigl(\lvert\beta_j^{(k)}\rvert\bigr) + p'_{\lambda}\bigl(\lvert\beta_j^{(k)}\rvert\bigr)\bigl(\lvert\beta_j\rvert - \lvert\beta_j^{(k)}\rvert\bigr), \tag{11}$$
where $p_{\lambda}$ and $p'_{\lambda}$ are given by Equations (9) and (10), respectively, and $\beta_j^{(k)}$ is the estimate obtained from the previous iteration. Equivalently,
$$\hat{\beta}^{(k+1)} = \operatorname*{arg\,min}_{\beta}\; \frac{1}{2}\beta^{\top}\tilde{\Sigma}\beta - \tilde{\rho}^{\top}\beta + \sum_{j=1}^{p} w_j^{(k)}\lvert\beta_j\rvert, \qquad w_j^{(k)} = p'_{\lambda}\bigl(\lvert\beta_j^{(k)}\rvert\bigr), \tag{12}$$
where a weight $w_j^{(k)}$ specific to the $j$th feature is introduced into the regular $\ell_1$ penalty and is updated after each iteration. This implementation enables an adaptive BDCoCoLasso.
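A minimal vectorized sketch of the SCAD penalty, its derivative, and the resulting LLA weights (Eqs. (9), (10), and (12)) follows, with the default $a = 3.7$ discussed below; the function names are illustrative.

```python
import numpy as np

def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty of Fan and Li (2001), Eq. (9), applied elementwise."""
    b = np.abs(beta)
    small = lam * b
    mid = (2 * a * lam * b - b**2 - lam**2) / (2 * (a - 1))
    large = lam**2 * (a + 1) / 2
    return np.where(b <= lam, small, np.where(b <= a * lam, mid, large))

def scad_derivative(beta, lam, a=3.7):
    """First derivative of the SCAD penalty, Eq. (10); these values serve
    as the feature-specific weights w_j in the LLA of Eq. (12)."""
    b = np.abs(beta)
    return lam * ((b <= lam)
                  + np.maximum(a * lam - b, 0.0) / ((a - 1) * lam) * (b > lam))

# LLA update: w_j = p'_lambda(|beta_j_prev|), so the weighted l1 penalty
# sum_j w_j |beta_j| replaces lam * ||beta||_1 in the next iteration.
```

Large coefficients ($\lvert\beta_j\rvert > a\lambda$) receive zero weight and are therefore left unshrunk, which is precisely the overshrinkage protection described above.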

In principle, the hyperparameter $a$ in the SCAD penalty should be estimated through cross-validation. However, the resulting two-dimensional cross-validation over $(\lambda, a)$ would be computationally expensive. Fan and Li (2001) proposed that $a = 3.7$ should be suitable for many problems, and that the algorithm performance does not improve significantly with $a$ selected by data-driven approaches. We therefore set $a = 3.7$ in all simulations described below.

In addition to the SCAD penalty, other weighting schemes, such as the minimax concave penalty (Zhang, 2010), could be implemented in the future for improved generalizability.

3 SIMULATION STUDY

Simulations were designed to explore the performance of the BDCoCoLasso as a function of the number and proportion of corrupted features. Furthermore, we wanted to ensure that our results matched the CoCoLasso when both methods could be implemented, that is, for fairly modest $p$ and a single type of error.

3.1 Simulation design

We first simulated an uncorrupted covariate matrix $X$ from a multivariate normal distribution with $n$ observations, zero mean, and a predefined correlation structure across $p$ features. We explored a lower-dimensional setting and a higher-dimensional setting in combination with two common covariance matrix designs to introduce correlation between the features. We then generated the response as
$$y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_n). \tag{13}$$
To ensure a realistic signal-to-noise ratio, we fixed the error standard deviation $\sigma$. When assessing the performance of the CoCoLasso algorithm, Datta and Zou chose $\beta$ to generate strong signals from only a few features. Likewise, to start with a simulation similar to theirs, we set $\beta$ so that three of the features measured without error and three of the features measured with error were assigned to be causal with relatively large effect sizes.

Since we anticipate that this algorithm will be most useful in large cohorts where $n > p$, and anticipating multiple associated features with small effect sizes, we simulated additional scenarios with larger $n$ and $p$. We assigned different fractions of the features to be causal, and created higher dimensionality in the corrupted block, while sampling the causal effect sizes from a standardized normal distribution $N(0, 1)$.

Next, we introduced different types of error to the covariate matrix (a hedged sketch of the full simulation design follows this list):

1. For the additive error setting, the corrupted design matrix was generated as $Z_2 = X_2 + A$, where the entries of $A$ were drawn from a zero-mean normal distribution with standard deviation $\tau$. We explored different $\tau$ parameters in combination with different fractions of corrupted features.
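The sketch below illustrates this simulation design; the AR(1)-type correlation with autocorrelation 0.5 and the specific `tau` and `miss_rate` defaults are illustrative stand-ins, not the values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, p, p2, beta, sigma, tau=0.3, miss_rate=0.2):
    """Generate X and y per Eq. (13), then corrupt the last p2 features with
    either additive error (Z_add) or zero-imputed missingness (Z_miss, M)."""
    idx = np.arange(p)
    Sigma_X = 0.5 ** np.abs(idx[:, None] - idx[None, :])  # illustrative AR(1)
    X = rng.multivariate_normal(np.zeros(p), Sigma_X, size=n)
    y = X @ beta + sigma * rng.standard_normal(n)          # Eq. (13)
    Z_add = X.copy()
    Z_add[:, -p2:] += tau * rng.standard_normal((n, p2))   # additive error
    M = (rng.random((n, p2)) > miss_rate).astype(float)    # observed mask
    Z_miss = X.copy()
    Z_miss[:, -p2:] *= M                                   # zero-imputation
    return X, y, Z_add, Z_miss, M
```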
